Data warehouse and business intelligence technology consolidation using AWS

Organizations have been using data warehouse and business intelligence (DWBI) workloads to support business decision making for many years. These workloads are brought to the Amazon Web Services (AWS) platform to utilize the benefit of AWS cloud. However, these workloads are built using multiple vendor tools and technologies, and the customer faces the burden of administrative overhead.

This post provides architectural guidance to consolidate multiple DWBI technologies to AWS Managed Services to help reduce the administrative overhead, bring operational ease, and business efficiency. Two scenarios are explored:

Upstream transactional databases are already on AWS
Upstream transactional databases are present at on-premise datacenter

Challenges faced by an organization

Organizations are engaged in managing multiple DWBI technologies due to acquisitions, mergers, and the lift-and-shift of workloads. These workloads use extract, transform, and load (ETL) tools to read relational data from upstream transactional databases, process it, and store it in a data warehouse. Thereafter, these workloads use business intelligence tools to generate valuable insight and present it to users in form of reports and dashboards.

These DWBI technologies are generally installed and maintained on their own server. Figure 1 demonstrates the increased the administrative overhead for the organization but also creates challenges in maintaining the team’s overall knowledge.

Figure 1. DWBI workload with multiple tools

Therefore, organizations are looking to consolidate technology usage and continue supporting important business functions.

Scenario 1

As we know, three major functions of DWBI workstream are:

ETL data using a tool
Store/manage the data in a data warehouse
Generate information from the data using business intelligence

Each of these functions can be performed efficiently using an AWS service. For example, AWS Glue can be used for ETL, Amazon Redshift for data warehouse, and Amazon QuickSight for business intelligence.

With the use of mentioned AWS services, organizations will be able to consolidate their DWBI technology usage. Organizations also will be able to quickly adapt to these services, as their engineering team can more easily use their DWBI knowledge with these services. For example, using SQL knowledge in AWS Glue jobs with SprakSQL, in Amazon Redshift queries, and in Amazon QuickSight dashboards.

Figure 2 demonstrates the redesigned the architecture of Figure 1 using AWS services. In this architecture, ETL functions are consolidated in AWS Glue. An AWS Glue crawler is used to auto-catalogue the source and target table metadata; then, AWS Glue ETL jobs use these catalogues to read data from source and write to target (data warehouse). AWS Glue jobs also apply necessary transformations (such as join, filter, and aggregate) to the data before writing. Additionally, an AWS Glue trigger is used to schedule the job executions. Alternatively, AWS Managed Workflows for Apache Airflow can be used to schedule jobs.

Figure 2. Consolidated workload with source on AWS

Similarly, data warehousing function is consolidated with Amazon Redshift. Amazon Redshift is used to store and organize enriched data and also enforce appropriate data access control for both workloads and users.

Lastly, business intelligence functions are consolidated using Amazon QuickSight. It used to create necessary dashboards that source data from Amazon Redshift and apply complex business logic to produce necessary charts and graphs needed for business insights. It is also used to implement necessary access restrictions to dashboards and data.

Scenario 2

In situation where source databases are in on-premises datacenter, the overall solution will be similar to Scenario 1, with an additional step to move the data continually from on-premise database to an Amazon Simple Storage Service (Amazon S3) bucket. The data movement can be efficiently handled by AWS Database Migration Service (AWS DMS).

To make the source database accessible to AWS DMS, a connection needs to established between the AWS cloud and on-premise network. Based on performance and throughput needs, the organization can choose either AWS Direct Connect service or AWS Site-to-Site VPN service to securely move the data. For the purpose of this discussion, we are considering AWS Direct Connect.

In Figure 3, AWS DMS task is used to perform a full-load followed by change data capture to continuously move the data to an S3 bucket. In this scenario, AWS Glue is used to catalogue and read the data from S3 bucket. The remaining portion of the dataflow is the same as the one mentioned in Scenario 1.

Figure 3. Consolidated workload with source at datacenter

Scaling

Both of the updated architectures provide necessary scaling:

Auto scaling feature can be used to scale-up or -down AWS Glue ETL job resources
Concurrency scaling feature can be used to support virtually unlimited concurrent users and queries in Amazon Redshift
Amazon QuickSight resources (web server, Amazon QuickSight engine, and SPICE) are auto scaled by design

Security, monitoring, and auditing

Also, the updated architectures provide necessary security by using access control, data encryption at-rest and in transit, monitoring, and auditing.

AWS Key Management Service can be used to generate keys necessary for data encryption at rest.
AWS CloudTrail can be used for tracking user activity and API usage for auditing and troubleshooting.
Amazon CloudWatch can be used to monitor Amazon Redshift service and log generated by AWS Glue jobs.
Amazon Simple Notification Service can be used for sending notifications from AWS cloud. For example, AWS Glue jobs’ execution status, Amazon QuickSight SPICE data failure notification.
AWS Identity and Access Management is used for user and group access in an organization’s AWS account.

Additionally, both Amazon Redshift and Amazon QuickSight provides their own authentication and access controls. Therefore, a user can be a local user or a federated one. With the help of these authentications, an organization will be able to control access to data in Amazon Redshift and also access to the dashboard in Amazon QuickSight.

Conclusion

In this blog post, we discussed how AWS Glue, Amazon Redshift, and Amazon QuickSight can be used to consolidate DWBI technologies. We also have discussed how an architecture can help an organization build a scalable, secure workload with auto scaling, access control, log monitoring and activity auditing.

Ready to get started?

Learn how to author job in AWS Glue
Authorize connection from Amazon QuickSight to Amazon Redshift clusters
Discover a typical Amazon Redshift data processing flow
Get started by checking hands-on with the Amazon Redshift Analytics Workshop

AWS Architecture Blog