AWS Government, Education, & Nonprofits Blog

From data silos to data domains – Bringing common data together

Although more customers every day use AWS to build, run, and gain insights from data lakes, for many organizations the journey to a data lake is full of uncertainty and can seem too difficult to attempt.

In this post, we discuss the challenges organizations face with siloed data and how data from disparate systems can be brought together using data domains.

Data, data, everywhere – The value of data lakes

A common challenge we hear from customers is that their organization has data in multiple systems or locations, including data warehouses, spreadsheets, and databases. Not only is the variety of data expanding, but its volume is also growing exponentially. Add the complexity of mandatory data security and governance, user access, and the demands of the analytics and reporting teams, and an organization can easily become overwhelmed.

Here is where the value of a data lake becomes evident. A data lake is a central repository used to store data, regardless of its source or format. You can then use a variety of analytic tools to extract value quickly and inform key decision makers. Because of the growing variety and volume of data, data lakes are an emerging and powerful architectural approach, especially as organizations look to capture value from newer data sources.

Many public sector customers are looking to data lakes to help with a variety of use cases, such as:

  • Weather analysis: Weather services across the globe collect terabytes of data from sensors. These data sources are openly available. Scientists and government agencies can use this data to look for trends that help with disaster preparedness for floods and hurricanes.
  • Social services: Publicly stored data is used by service agencies to improve operating efficiencies and reduce fraud. For example, social services are finding new ways to apply public records and public assistance statistics to reduce operating costs and claim settlement times, while ensuring citizens receive their benefits.
  • Health services: Using data from hospitals, accident reports, disease center reports, and social services case files, government agencies can assess healthcare needs. Cross-tabulating medical and environmental data also points to potential environmental hazards, medical trends, or health risks related to regional conditions.

Whether the use cases relate to financial planning, resource allocation, or predictive analytics, public sector organizations continue to find ways to benefit from accessible data insights.

One approach that helps address these challenges begins by looking at multiple data sources in the context of data domains rather than by data source or application.

What is a data domain?

A data domain is a topic or subject area that a dataset relates to. For example, consider an organization with the following data sources:

  • A commercial database with tables including:
    • Building information (address, building manager)
    • Room information
    • Networking device registers
  • API data from a Cisco wireless system (CMX)
  • A spreadsheet containing site information including geospatial data

There are three data sources – a database, a file, and an API – each with its own authentication and storage location, and with no connection between them.

Figure 1 – Three separate, but relatable data sources

Using data domains, the siloed data sources can be grouped into two data domains, creating a conceptual relationship.

Figure 2 – Example data domains Wireless and Facility

Amazon S3 as your data domain environment

Using Amazon S3, we can represent data domains and dataset relationships as a hierarchy of folders and subfolders, with each level of the hierarchy standing for a domain, a relationship, or a dataset.

Figure 3 – Amazon S3 data domain structure
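
To make the folder hierarchy concrete, here is a minimal sketch of how object keys could be built for such a layout. Only the wireless -> cmx path comes from this post; the other dataset names, the assignment of datasets to the Facility domain, and the helper names are assumptions made for illustration.

```python
# Hypothetical layout: data domain -> datasets stored beneath it.
# Only "wireless/cmx" appears in the post; the rest is assumed.
DATA_DOMAINS = {
    "wireless": ["cmx", "network_devices"],
    "facility": ["buildings", "rooms", "sites"],
}


def dataset_prefix(domain: str, dataset: str) -> str:
    """Return the S3 key prefix (the "folder") for one dataset."""
    if dataset not in DATA_DOMAINS.get(domain, []):
        raise ValueError(f"unknown dataset {dataset!r} in domain {domain!r}")
    return f"{domain}/{dataset}/"


def object_key(domain: str, dataset: str, filename: str) -> str:
    """Return the full S3 object key for a file in a dataset."""
    return dataset_prefix(domain, dataset) + filename


def all_prefixes() -> list:
    """List every dataset prefix in the layout."""
    return [
        dataset_prefix(domain, dataset)
        for domain, datasets in DATA_DOMAINS.items()
        for dataset in datasets
    ]
```

Amazon S3 has no real directories, so a "folder" here is simply a shared key prefix such as wireless/cmx/. Keeping the layout in one mapping means every loader writes to the same, predictable location.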

Using Amazon S3 also gives you:

  • Security from Day 0: The moment data is stored in an Amazon S3 location, it can be automatically encrypted, access-controlled through AWS Identity and Access Management (IAM), covered by access logging, locked against deletion, and versioned. Learn more about security and Amazon S3.
  • Data governance: Metadata is recorded against each object stored in Amazon S3. Custom metadata can be added, including data classification, data owner, lineage, and checksums. Learn more about metadata and Amazon S3.
  • Queryability: Data in Amazon S3 can be queried in place, either directly or via the AWS Glue Data Catalog, using Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and Amazon QuickSight.
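
As a rough sketch of the security and governance points above, the following shows how an upload could request server-side encryption and attach custom metadata using the boto3 `put_object` call. The metadata keys, the example values, and the helper names are assumptions, not a prescribed scheme.

```python
def build_put_object_args(bucket, key, body, data_owner, classification):
    """Build keyword arguments for the boto3 S3 put_object call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Ask S3 to encrypt the object at rest with S3-managed keys.
        "ServerSideEncryption": "AES256",
        # User-defined metadata; the key names here are hypothetical.
        "Metadata": {
            "data-owner": data_owner,
            "data-classification": classification,
        },
    }


def upload(args):
    """Send the object to Amazon S3 (requires boto3 and AWS credentials)."""
    # Imported lazily so the helper above can be used without boto3.
    import boto3

    return boto3.client("s3").put_object(**args)
```

Amazon S3 stores user-defined metadata with the object under the x-amz-meta- prefix, so the owner and classification travel with the data rather than living in a separate register.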

Using Amazon S3 as your data domain environment, datasets are no longer siloed by a system or application, but are freed within an environment that scales in size and supports thousands of concurrent users and queries.

From data domain to data insights

Because the folder structure within a data domain partitions the data, a service such as Amazon Athena can read only the folders needed to answer a query instead of scanning the entire bucket.

For example, if a user wished to query wireless data only for CMX devices, the folder structure would be traversed so that only the data-domain-bucket -> wireless -> cmx path would be queried. Using SQL with Amazon Athena, the query could be:

SELECT *
FROM "data-domain-bucket"
WHERE partition_0 = 'wireless'
  AND partition_1 = 'cmx';
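
To illustrate why the partition filter limits how much data is read, the sketch below simulates pruning over a list of object keys. It is a simplified model rather than how Amazon Athena is implemented, and the example keys are made up; it assumes each folder level maps to a column named partition_0, partition_1, and so on, as in the query above.

```python
def partition_values(key):
    """Split 'wireless/cmx/file.json' into its folder levels."""
    return key.split("/")[:-1]


def prune(keys, **filters):
    """Keep keys whose folder levels match filters such as partition_0='wireless'."""
    selected = []
    for key in keys:
        levels = partition_values(key)
        matches = True
        for name, wanted in filters.items():
            # 'partition_1' refers to folder level 1 (the second level).
            index = int(name.split("_", 1)[1])
            if index >= len(levels) or levels[index] != wanted:
                matches = False
                break
        if matches:
            selected.append(key)
    return selected


# Hypothetical object keys in the data domain bucket.
keys = [
    "wireless/cmx/2019-01.json",
    "wireless/network_devices/devices.csv",
    "facility/rooms/rooms.csv",
]
scanned = prune(keys, partition_0="wireless", partition_1="cmx")
```

With both filters applied, only the object under wireless/cmx/ is selected; the other two keys are never read.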

 

Next steps

If you are ready to get started with a data lake, please download the project here. You can also join us at this year’s AWS Immersion Day, held across eight cities in Australia, to have your questions answered, engage with our team of solutions architects, and discover solutions to your technology challenges.


A post by Paul Macey, Specialist Solutions Architect, Big Data and Analytics, AWS