AWS Public Sector Blog

Meeting mission goals by modernizing data architecture with AWS

A data-driven approach can help public sector agencies react quickly to unforeseen events, improve decision making, and provide the public with accurate information for their activities and well-being. For example, the Centers for Disease Control and Prevention (CDC) used a data-driven strategy when they launched the Data Modernization Initiative (DMI) to create a modern, integrated, and real-time public health data and surveillance system during health threats such as COVID-19. 

However, public sector agencies can face obstacles in creating a data-driven strategy. Data within an organization can be highly fragmented across many systems, and each system may have its own guardians and its own processes for data management. These issues compound when sharing data outside the organization, which requires addressing the many rules and regulations that govern the access and sharing of data.

In this blog post, learn key Amazon Web Services (AWS) concepts and services that can help agencies modernize their cloud and data architecture. First, learn two fundamental concepts that agencies need to examine regardless of their technical approach. Then, discover the AWS services that enable agencies to apply these concepts to meet mission needs.

Fundamental concepts in modernizing data

Decentralized data architecture

Over the last few decades, many agencies adopted centralized data architectures to enable access to and sharing of data across the organization. Examples of this centralized data architecture include data warehouses and data lakes.

Figure 1. A single node with multiple data sources (left) added to a mesh (right).

Some agencies use centralized enterprise data warehouses to bring operational data from various systems into a single platform for data analytics and reporting. However, sharing data across an organization can be challenging. Traditional data warehouses and data lakes can be difficult to scale with the growth of data, slow to adopt technology innovations, and unable to keep pace with the ever-changing needs of organizations. From a technology standpoint, a single physical or logical platform can be expensive and slow to change. From an organizational standpoint, staffing a central team to manage the different types of data across different business domains can be taxing.

Some agencies use centralized data lakes to store and analyze vast amounts of structured and unstructured data. While a centralized data lake can provide a single source of truth for the organization’s data, each business unit or data domain may have specific requirements; managing all of them centrally may not be appropriate for every business need.

Agencies can resolve these issues by adopting a decentralized data architecture. In this pattern, each business unit that produces data within the agency can have high autonomy and ownership of its data domain. These data producers can create meaningful data products that they can share across the organization using a decentralized governance framework. Under this framework, data owners choose what is shared within a node and what is shared across the data mesh. Compliance and governance are implemented at the node level so that specific controls can be managed by the teams who own the data.

Business units that need access to data from other units, the data consumers, can request access to the data products and seek approvals or changes directly from data owners using the decentralized governance framework. The role of the central team pivots from managing the movement of data to managing the decentralized governance framework. As a result, everyone gets faster access to relevant data, while operational bottlenecks and the strain on any one team are reduced.

Reduce manual data movement

Agencies can further improve upon a decentralized governance framework by reducing the manual movement of data. The right set of technologies can either automate data movement or remove the need to transfer data entirely.

Figure 2. A modern data strategy to catalog and govern data processes.

To automate data movement, each business unit implements methods to automate the transfer of data across data stores. For example, suppose a data-producing business unit creates a data store for access by downstream data consumers. The data producer can use technologies that automatically transfer data from the sources to the data store without writing expensive and time-consuming extract, transform, and load (ETL) jobs. Similarly, the data consumer can use technologies that automate data transfer from the data producer’s data stores.
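As a minimal sketch of this idea, the following Python (boto3) snippet sends a record to an Amazon Kinesis Data Firehose delivery stream, which can buffer and deliver records to a destination such as Amazon S3 without a hand-written ETL job. The stream name and record contents are hypothetical placeholders.

```python
import json
import boto3

# Assumes a Firehose delivery stream (hypothetical name below) is already
# configured to deliver records to a destination such as Amazon S3.
firehose = boto3.client("firehose")

record = {"unit": "public-health", "metric": "daily_cases", "value": 42}

# Firehose buffers and delivers records automatically; the producer does
# not write or operate any ETL job.
response = firehose.put_record(
    DeliveryStreamName="agency-ingest-stream",  # hypothetical stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
print("Record accepted:", response["RecordId"])
```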

To remove the need to transfer data, agencies can use technologies that enable querying data in place. These technologies typically use data source connectors that automatically translate queries and results across different data sources.
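For instance, Amazon Athena can query data where it already lives. The sketch below runs a SQL query directly against data in Amazon S3 without moving it, assuming an AWS Glue Data Catalog database and table already exist; the database, table, and output bucket names are hypothetical.

```python
import time
import boto3

athena = boto3.client("athena")

# Start a query against data in place; no data is copied out of Amazon S3.
# Database, table, and output location below are hypothetical.
query = athena.start_query_execution(
    QueryString="SELECT region, COUNT(*) AS cases FROM health_events GROUP BY region",
    QueryExecutionContext={"Database": "public_health"},
    ResultConfiguration={"OutputLocation": "s3://agency-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```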

AWS services to modernize data architectures

Several AWS services can help agencies implement these concepts. The services you choose depend on the use case, the complexity, and the intended goals.

Figure 3. AWS services that support a modern data strategy.

AWS services for decentralized data architecture

AWS Lake Formation can help agencies set up a decentralized data architecture while defining security, governance, and auditing policies in one place. The AWS Glue Data Catalog makes it simple for data consumers to discover and understand the data products created by data producers. An AWS Glue crawler can crawl multiple data stores in a single run and create or update one or more tables in the Data Catalog. ETL jobs defined in AWS Glue then use these Data Catalog tables as sources and targets, identifying data types, fields, and relevant metadata that can be published into the catalog.
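A minimal sketch of this pattern follows: a data producer catalogs its data with an AWS Glue crawler, then uses AWS Lake Formation to grant a consuming team read access to the resulting table. The IAM role, S3 path, database, table, and consumer role ARN are hypothetical placeholders, and the table name assumes what the crawler would create.

```python
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# 1. Create and start a crawler that catalogs the producer's S3 data.
#    Role, database, and path names are hypothetical placeholders.
glue.create_crawler(
    Name="health-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="public_health",
    Targets={"S3Targets": [{"Path": "s3://agency-health-data/events/"}]},
)
glue.start_crawler(Name="health-events-crawler")

# 2. Grant a data consumer SELECT access to the cataloged table through
#    Lake Formation, keeping governance with the data-owning team.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalyticsTeamRole"
    },
    Resource={"Table": {"DatabaseName": "public_health", "Name": "health_events"}},
    Permissions=["SELECT"],
)
```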

With Amazon EMR, organizations can use open-source big data frameworks like Apache Spark, Apache Hive, and Presto for petabyte-scale data processing, interactive analytics, and machine learning. Amazon DataZone simplifies data subscription and approval by providing a centralized platform where data consumers can search for data products and submit subscription requests, and data owners can approve them. In cases where domain experts do not want to share raw data with data consumers, teams can use AWS Clean Rooms, which helps customers and their partners more simply and securely collaborate on and analyze their collective datasets without sharing or copying one another’s underlying data. Using AWS Clean Rooms, domain experts can enforce strict access policies on the types of queries that can be run and limit the ability to manipulate or misuse the data, making sure the data is used only for its intended purposes. For data visualization, organizations can use Amazon QuickSight to unify business intelligence from the same source of truth through interactive dashboards and natural language queries.
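As one illustration of running Spark without managing clusters, the sketch below submits a Spark job to an existing EMR Serverless application with boto3. The application ID, execution role ARN, and script location are hypothetical.

```python
import boto3

emr_serverless = boto3.client("emr-serverless")

# Submit a Spark job to an existing EMR Serverless application.
# Application ID, execution role, and script location are hypothetical.
response = emr_serverless.start_job_run(
    applicationId="00example123456789",
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://agency-scripts/aggregate_health_events.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print("Job run started:", response["jobRunId"])
```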

AWS services for data movement

AWS offers agencies options to minimize or eliminate manual data movement. For example, zero-ETL integrations connect purpose-built databases like Amazon Aurora with data stores such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift, a high-performance data warehousing and analytics service. Amazon Redshift integration for Apache Spark makes it simple to use AWS analytics and machine learning (ML) services to build and run Apache Spark applications on data from Amazon Redshift. Amazon Aurora ML and Amazon Redshift ML enable agencies to use Amazon SageMaker for ML-powered use cases without moving data between services. Additionally, AWS provides seamless data ingestion from AWS streaming services, like Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK), into a wide range of AWS data stores, such as Amazon S3 and Amazon OpenSearch Service, so customers can analyze data as soon as it is available.
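For example, once a zero-ETL integration replicates Aurora data into Amazon Redshift, that data can be queried with the Redshift Data API without operating any pipeline. In this sketch, the workgroup, database, and table names are hypothetical, and the table is assumed to be kept in sync by such an integration.

```python
import time
import boto3

redshift_data = boto3.client("redshift-data")

# Query a table kept in sync by a zero-ETL integration; no ETL jobs or
# manual data movement are involved. Names below are hypothetical.
statement = redshift_data.execute_statement(
    WorkgroupName="agency-analytics",
    Database="dev",
    Sql="SELECT region, SUM(cases) FROM health_events GROUP BY region",
)

# Wait for the statement to finish, then read the result set.
while True:
    status = redshift_data.describe_statement(Id=statement["Id"])
    if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status["Status"] == "FINISHED":
    result = redshift_data.get_statement_result(Id=statement["Id"])
    for record in result["Records"]:
        print([field.get("stringValue", field.get("longValue")) for field in record])
```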

Conclusion

Modernizing data architectures can help public sector agencies improve their ability to meet mission goals by expanding the use of data. By examining two fundamental concepts—decentralizing data and reducing the manual movement of data—agencies can consider the appropriate technical approach for their needs.

AWS offers multiple services that can help agencies apply these concepts to meet mission needs. To learn more about these AWS services, or engage in a proof of concept, contact your AWS account team or reach out to the AWS Public Sector team for more information.

Subscribe to the AWS Public Sector Blog newsletter to get the latest in AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.

Please take a few minutes to share insights regarding your experience with the AWS Public Sector Blog in this survey, and we’ll use feedback from the survey to create more content aligned with the preferences of our readers.

Sanjeev Pulapaka

Sanjeev Pulapaka is a principal solutions architect in the US federal civilian team at Amazon Web Services (AWS). He works closely with customers in building and architecting mission critical solutions. Sanjeev has extensive experience in leading, architecting, and implementing high-impact technology solutions that address diverse business needs in multiple sectors including commercial, federal, and state and local governments. He has an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.

Jas Singh

Jas Singh is a senior solutions architect helping public sector customers achieve their business outcomes through architecting and implementing innovative and resilient solutions at scale. Jas has more than 20 years of experience in designing and implementing mission critical applications and holds a master's degree in computer science from Baylor University.

Ron Joslin

Ron Joslin is a principal solutions architect on the US federal civilian team at Amazon Web Services (AWS). He loves helping customers solve tough problems by architecting and building mission critical solutions. Ron has extensive experience across the federal landscape and is passionate about driving cloud adoption through modernization.

Venkata Kampana

Venkata Kampana is a senior solutions architect in the Amazon Web Services (AWS) Health and Human Services team and is based in Sacramento, CA. In this role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.