Architecting Persona-centric Data Platform with On-premises Data Sources

Many organizations are moving their data from silos and aggregating it in one location. Collecting this data in a data lake enables you to perform analytics and machine learning on that data. You can store your data in purpose-built data stores, like a data warehouse, to get quick results for complex queries on structured data.

In this post, we show how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi. We will also discuss Lake House architecture on AWS, which is the next evolution from data warehouse and data lake-based solutions.

Data movement services

AWS provides a wide variety of services to bring data into a data lake:

AWS Database Migration Service (AWS DMS) can connect to a variety of operational relational database management systems (RDBMS) and NoSQL databases, and ingests their data into Amazon Simple Storage Service (Amazon S3)
AWS Lake Formation provides blueprints to ingest data from AWS native or on-premises database sources into Amazon S3
Amazon Kinesis ingests streaming data into Amazon S3
AWS Identity and Access Management (IAM) and AWS Lake Formation can offer you flexible coarse-grained and fine-grained access controls on your data storage and the analytics solution that you need

You may want to bring on-premises data into the AWS Cloud to take advantage of AWS purpose-built analytics services, derive insights, and make timely business decisions. Apache NiFi is an open source tool that enables you to move and process data using a graphical user interface.

For this use case and solution architecture, we use Apache NiFi to ingest data into Amazon S3 and AWS purpose-built analytics services, based on user personas.
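When an ingestion tool writes files to Amazon S3, it helps to agree on an object key layout up front. The following is a minimal sketch of one possible layout, not the layout used in this architecture: the zone names come from the raw, transformed, and enriched storage layers mentioned later in this post, while the function name, the source and dataset names, and the date-partitioned prefix scheme are assumptions made for illustration.

```python
from datetime import date

# Zone names follow the raw/transformed/enriched layers described in this post.
# The prefix scheme itself is an assumption, not something the post prescribes.
ZONES = ("raw", "transformed", "enriched")


def object_key(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build a date-partitioned S3 object key for one ingested file."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    # Hive-style partitions (year=/month=/day=) let query engines prune by date.
    return (
        f"{zone}/{source}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/{filename}"
    )


# Hypothetical source system and dataset names.
key = object_key("raw", "onprem-erp", "orders", date(2021, 6, 15), "orders-0001.csv")
print(key)  # raw/onprem-erp/orders/year=2021/month=06/day=15/orders-0001.csv
```

A key built this way could be supplied as the object key when the ingestion flow writes to the bucket, so that each zone and each day of data lands under its own prefix.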

Building persona-centric data platform on AWS

When you are building a persona-centric data platform for analytics and machine learning, you must first identify your user personas. Who will be using your platform? Then choose the appropriate purpose-built analytics services. Envision a data platform analytics architecture as a stack of seven layers:

User personas: Identify your user personas for data engineering, analytics, and machine learning
Data ingestion layer: Bring the data into your data platform and maintain a data lineage view while ingesting data into your storage layer
Storage layer: Store your structured and unstructured data
Cataloging layer: Store your business and technical metadata about datasets from the storage layer
Processing layer: Create data processing pipelines
Consumption layer: Enable your user personas for purpose-built analytics
Security and Governance: Protect your data across the layers
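The seven-layer stack can be written down as a small data structure, which is sometimes handy for design reviews or documentation tooling. The sketch below is illustrative only: the layer names and purposes are taken from the list above, the persona-to-service pairs are the ones named in the reference architecture section of this post, and the class, variable, and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    purpose: str


# Ordered as listed in this post; the identifiers are illustrative.
STACK = (
    Layer("User personas", "Identify who will use the platform"),
    Layer("Data ingestion", "Bring data into the platform and track its lineage"),
    Layer("Storage", "Store structured and unstructured data"),
    Layer("Cataloging", "Store business and technical metadata about datasets"),
    Layer("Processing", "Create data processing pipelines"),
    Layer("Consumption", "Serve each persona with purpose-built analytics"),
    Layer("Security and governance", "Protect data across all layers"),
)

# Consumption-layer services per persona, as named in the reference architecture.
PERSONA_SERVICES = {
    "enterprise data service user": ("AWS Lambda", "Amazon API Gateway"),
    "business user": ("Amazon QuickSight",),
    "IT user": ("Amazon Athena",),
    "data scientist": ("AWS Glue DataBrew", "Amazon SageMaker"),
    "enterprise data warehouse user": ("Amazon Redshift",),
}


def services_for(persona: str) -> tuple:
    """Return the consumption services this post pairs with a persona."""
    try:
        return PERSONA_SERVICES[persona]
    except KeyError:
        raise ValueError(f"unknown persona: {persona!r}") from None


print(len(STACK), services_for("data scientist"))
```

Starting from the persona and looking up the service, rather than the other way around, mirrors the order of decisions this post recommends.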

Reference architecture

The following diagram illustrates how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi.

Figure 1. Example architecture for persona-centric data platform with on-premises data sources

Architecture flow:

1. Identify user personas: You must first identify user personas to derive insights from your data platform. Let’s start with identifying your users:
  - Enterprise data service users who would like to consume data from your data lake into their respective applications.
  - Business users who would like to create business intelligence dashboards by using your data lake datasets.
  - IT users who would like to query data from your data lake by using traditional SQL queries.
  - Data scientists who would like to run machine learning algorithms to derive recommendations.
  - Enterprise data warehouse users who would like to run complex SQL queries on your data warehouse datasets.
2. Data ingestion layer: Apache NiFi scans the on-premises data stores and ingests the data into your data lake (Amazon S3). Apache NiFi can also transform the data in transit. It supports both Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) data transformations. Apache NiFi also tracks data lineage while ingesting data into Amazon S3.
3. Storage layer: We recommend using Amazon S3 for your data lake storage. It is designed for 99.999999999% (11 nines) of durability and 99.99% availability. You can also create raw, transformed, and enriched storage layers depending upon your use case.
4. Cataloging layer: AWS Lake Formation provides a central catalog, backed by the AWS Glue Data Catalog, to store and manage metadata for all datasets hosted in the data lake. AWS services such as AWS Glue, Amazon EMR, and Amazon Athena natively integrate with Lake Formation. They automate discovering and registering dataset metadata into the Lake Formation catalog.
5. Processing layer: Amazon EMR processes your raw data and places it into a new S3 bucket. Use AWS Glue DataBrew and AWS Glue to process the data as needed.
6. Consumption layer or persona-centric analytics: Once data is transformed:
  - AWS Lambda and Amazon API Gateway will allow you to develop data services for enterprise data service users
  - You can develop user-friendly dashboards for your business users using Amazon QuickSight
  - Use Amazon Athena to query transformed data for your IT users
  - Your data scientists can utilize AWS Glue DataBrew to clean and normalize the data and Amazon SageMaker for machine learning models
  - Your enterprise data warehouse users can use Amazon Redshift to derive business intelligence
7. Security and governance layer: AWS IAM provides user-, group-, and role-level identity, in addition to the ability to configure coarse-grained access control for resources managed by AWS services in all layers. AWS Lake Formation provides fine-grained access controls, so you can grant or revoke permissions at the database, table, or column level.
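To make the database, table, and column levels from the security and governance layer concrete, the sketch below builds request arguments shaped for the `grant_permissions` operation of the boto3 Lake Formation client. It only constructs dictionaries and does not call AWS. The helper name, the role ARN, and the database, table, and column names are hypothetical, and which permissions are valid depends on the resource type, so treat the values as placeholders.

```python
def grant_request(principal_arn, permissions, database, table=None, columns=None):
    """Build keyword arguments shaped for Lake Formation's grant_permissions."""
    if columns and not table:
        raise ValueError("column-level grants need a table name")
    if columns:
        # Column-level grant: only the listed columns are covered.
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        }
    elif table:
        # Table-level grant.
        resource = {"Table": {"DatabaseName": database, "Name": table}}
    else:
        # Database-level grant.
        resource = {"Database": {"Name": database}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": list(permissions),
    }


# Hypothetical analyst role that may read two columns of one table.
analyst = "arn:aws:iam::111122223333:role/AnalystRole"
column_grant = grant_request(analyst, ["SELECT"], "sales", "orders", ["order_id", "amount"])
database_grant = grant_request(analyst, ["DESCRIBE"], "sales")
print(column_grant["Resource"])

# With AWS credentials configured, a request could then be sent along these lines:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**column_grant)
```

Keeping the request construction separate from the API call makes the intended grants easy to review and test before anything is applied to the catalog.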

Lake House architecture on AWS

The vast majority of data lakes are built on Amazon S3. At the same time, customers are leveraging purpose-built analytics stores that are optimized for specific use cases. Customers want the freedom to move data between their centralized data lakes and the surrounding purpose-built analytics stores. And they want to get insights with speed and agility in a seamless, secure, and compliant manner. We call this modern approach to analytics the Lake House architecture.

Figure 2. Lake House architecture on AWS
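As one illustration of moving data between the data lake and a purpose-built store, the sketch below assumes the store is Amazon Redshift and builds the text of an `UNLOAD` statement (warehouse to Amazon S3) and a `COPY` statement (Amazon S3 to warehouse), both using Parquet files. It only produces SQL strings. The function names, table name, bucket, and role ARN are hypothetical, and this is only one of several possible patterns.

```python
def unload_to_lake(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a Redshift UNLOAD statement that writes query results to S3 as Parquet."""
    # Inside UNLOAD, the query is itself a quoted string, so single quotes are doubled.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


def copy_from_lake(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads Parquet files from S3 into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical role, bucket, and table names. Plain string formatting is used for
# brevity; do not build SQL this way from untrusted input.
role = "arn:aws:iam::111122223333:role/RedshiftLakeRole"
unload_sql = unload_to_lake(
    "SELECT * FROM sales.orders WHERE region = 'EU'",
    "s3://example-data-lake/enriched/orders_eu/",
    role,
)
copy_sql = copy_from_lake("sales.orders_stage", "s3://example-data-lake/transformed/orders/", role)
print(unload_sql)
print(copy_sql)
```

Using an open columnar format such as Parquet on the S3 side keeps the unloaded data readable by the other analytics services that query the data lake.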

Refer to the whitepaper Derive Insights from AWS Lake House for various design patterns to derive persona-centric analytics by using the AWS Lake House approach. Check out the blog post Build a Lake House Architecture on AWS for a Lake House reference architecture on AWS.

Conclusion

In this post, we showed you how to build a persona-centric data platform on AWS with a seven-layered approach. The platform uses Apache NiFi as a data ingestion tool and AWS purpose-built analytics services for persona-centric analytics and machine learning. We also showed how to build persona-centric analytics by using the AWS Lake House approach.

With the information in this post, you can now build your own data platform on AWS to gain faster and deeper insights from your data. AWS provides you the broadest and deepest portfolio of purpose-built analytics and machine learning services to support your business needs.

Read more and get started on building a data platform on AWS:

AWS purpose-built analytics services
AWS Lake House
Amazon SageMaker and other artificial intelligence (AI) services for data prediction

AWS Architecture Blog
