AWS Partner Network (APN) Blog

Unlocking the Power of Customer Data: How Caylent and AWS Modernized an Analytics Pipeline

By Israel Mendes, Data Engineering leader – Caylent
By Washim Nawaz, Analytics Specialist – AWS
By Isaac Owusu-Hemeng, Customer Solutions Manager – AWS
By Muz Syed, Sr. Partner Solutions Architect – AWS


In this post, we discuss how APN Premier Partner Caylent used a three-day Experience-Based Acceleration (EBA) workshop to finalize and validate the right data pipeline architecture with one of our shared customers. The workshop focused on Amazon Redshift, AWS Glue, Amazon S3, and AWS Database Migration Service (AWS DMS). The success of this Caylent-led EBA workshop, along with Caylent’s deep analytics experience, gave the customer the confidence it needed to modernize its existing analytics pipeline to meet the security and scalability needs of its growing global customer base.

Caylent partners with businesses ranging from startups to Fortune 500 enterprises, using AWS tools and technologies to drive innovation initiatives forward.

Current Architecture

A core need for the customer is to move and share data quickly with thousands of its global clients to support their decision-making. To achieve this, the customer needs a robust data production solution on AWS. Its existing infrastructure needed to be modernized to guarantee data consistency (idempotency) and quality across databases. Security and scalability are paramount, because the client base and environment will expand globally in the coming years.

The customer sought out Caylent’s expertise to revamp its AWS infrastructure, aiming for reduced maintenance and production failures. Goals include:

  • Prioritizing security, notably data isolation at scale.
  • Centralizing information into a data lake for improved decision-making and potential data science experimentation.
  • Addressing technical pain points like schema evolution and data quality.

Solution

After an assessment with Caylent, the following solution was proposed to solve the main problems and pain points from the current data environment:

Solution Diagram

The proposed solution involved transitioning to a Data Lakehouse architecture with a metadata-driven pipeline to handle the customer’s data from thousands of multi-Region tenants. Critical points of the solution included:

  • Lakehouse Architecture: Transitioning to a metadata-driven Lakehouse architecture for customer onboarding.
  • Data preparation: Landing DMS changes in S3 for further processing such as partitioning, merging, de-duplication, data cleansing, and data quality checks in the data lake.
  • Data Warehouse: Creating a local Redshift table from the AWS Glue Data Catalog table, setting Redshift distribution preferences, and implementing Row-Level Security (RLS) for tenant data segregation.
  • Transformation: Data cleansing and schema handling with AWS Glue Jobs.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift Serverless lets you access and analyze data without all of the configurations of a provisioned data warehouse. Resources are automatically provisioned and data warehouse capacity is intelligently scaled to deliver fast performance for even the most demanding and unpredictable workloads.

AWS Database Migration Service (AWS DMS) is a managed migration and replication service that helps move your database and analytics workloads to AWS quickly, securely, and with minimal downtime and zero data loss. AWS DMS supports migration between 20-plus database and analytics engines.

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can use Amazon S3 to store and protect any amount of data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

AWS Glue is a serverless data integration and extract, transform, and load (ETL) service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and data operations tooling for authoring, running jobs, and implementing business workflows.

Analytics Solution Design

Caylent was tasked with working with the customer, in an immersive, hands-on detailed architecture design workshop, to explore the best solutions for the specific requirements below.

  1. Extract, Transform, Load (ETL):
    • Transfer data from multiple databases in an Aurora MySQL cluster to S3.
    • Ensure consistency despite schema changes and handle both hard and soft deletes.
    • Clean and consolidate data from all tenants into a single table.
  2. Analytics and Security:
    • Set up an Amazon Redshift cluster on top of the data lake.
    • Address scalability and security concerns for multi-tenant tables.
    • Demonstrate row-level security policies for client data access.
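The consolidation requirement above (one table for all tenants, tolerant of schema differences) can be illustrated with a minimal, library-free sketch. It assumes each tenant's rows arrive as plain dictionaries. The function name `consolidate_tenants`, the `tenant_id` column, and the sample data are hypothetical, not taken from the customer's pipeline.

```python
from typing import Any

Row = dict[str, Any]


def consolidate_tenants(tenant_rows: dict[str, list[Row]]) -> list[Row]:
    """Merge per-tenant rows into one table tagged with a tenant_id column."""
    # Collect the union of column names across all tenants, in first-seen order.
    columns: list[str] = []
    for rows in tenant_rows.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)

    consolidated: list[Row] = []
    for tenant_id, rows in tenant_rows.items():
        for row in rows:
            # Columns that a tenant's schema does not have yet are filled with None.
            merged: Row = {"tenant_id": tenant_id}
            merged.update({name: row.get(name) for name in columns})
            consolidated.append(merged)
    return consolidated


# Hypothetical sample: tenant_b has already gained a "currency" column.
tenants = {
    "tenant_a": [{"order_id": 1, "amount": 10.0}],
    "tenant_b": [{"order_id": 7, "amount": 5.5, "currency": "EUR"}],
}
table = consolidate_tenants(tenants)
```

Carrying an explicit tenant column in the consolidated table is what makes a per-tenant row filter possible later in the warehouse.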

Overall, the workshop aimed to validate the AWS solution’s capabilities in managing diverse data sources, ensuring security, and effectively handling data integrity issues. Caylent determined that it could address these topics in a three-day Experience-Based Acceleration (EBA) workshop.

What is an Experience-based Acceleration (EBA)?

Experience-Based Acceleration (EBA) is a transformation methodology designed to help customers accelerate cloud adoption and create sustained momentum for migration and modernization initiatives. AWS mechanisms that apply this methodology are called EBA Parties. EBA Parties deliver a critical path item to cloud adoption, or scaling cloud across a client’s organization. As a result of the experience-based approach itself, customers change the way they work (legacy siloed teams to cross functional empowered teams; waterfall project-based focus to iterative agile-based execution; long-lead one-way door analysis to two-way door action). Customers are unblocked and enabled to deliver cloud-based outcomes at a faster speed and larger scale.

How EBAs Work

EBA utilizes hands-on, agile, immersive workshops that span three days. EBA Parties bring an alignment, mindset and integration that address the top blocker to enterprise transformation – the old way of working.

Experience Based Acceleration (EBA)

EBA Findings

During the EBA, Caylent determined that by leveraging row-level security (RLS) in Amazon Redshift, the customer can have granular access control over their customer data at scale. The customer can decide which users or roles can access specific records within schemas or tables, based on security policies that are defined at the database object level. In addition, column-level security can be implemented so that users are granted permissions on a subset of columns, with RLS policies further restricting access to particular rows of the visible columns.
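As a rough illustration of tenant isolation with RLS, the sketch below builds the three Redshift statements involved: `CREATE RLS POLICY`, `ATTACH RLS POLICY`, and `ALTER TABLE ... ROW LEVEL SECURITY ON`. The helper `rls_statements`, the table `analytics.orders`, the `tenant_id` column, and the role name are all hypothetical. A one-policy-per-tenant layout is only one possible design and is not necessarily the one the customer adopted.

```python
import re

# Conservative pattern for identifiers we are willing to interpolate into SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def rls_statements(table: str, tenant_id: int, role: str) -> list[str]:
    """Build Redshift SQL that limits `role` to the rows of a single tenant."""
    for identifier in (table, role):
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"unexpected identifier: {identifier!r}")
    tenant = int(tenant_id)  # rejects non-numeric input
    policy = f"policy_tenant_{tenant}"
    return [
        # The policy filters on a hypothetical tenant_id column.
        f"CREATE RLS POLICY {policy} WITH (tenant_id INT) USING (tenant_id = {tenant});",
        # The policy only takes effect for the role it is attached to.
        f"ATTACH RLS POLICY {policy} ON {table} TO ROLE {role};",
        # RLS must also be switched on for the table itself.
        f"ALTER TABLE {table} ROW LEVEL SECURITY ON;",
    ]


statements = rls_statements("analytics.orders", 42, "tenant_42_role")
```

The statements would then be run by a suitably privileged user through whichever SQL client or driver the team already uses.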

An AWS DMS task can capture ongoing changes from the source data store after the initial (full-load) migration to a supported target data store. This process of ongoing replication is referred to as change data capture (CDC). AWS DMS uses this process when replicating ongoing changes from a source data store. It works by collecting changes from the database logs using the database engine’s native API.
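To make the downstream merge step concrete, here is a minimal sketch of applying CDC records to a snapshot. It assumes each change carries an `Op` flag (`I`, `U`, or `D`), as AWS DMS writes for CDC files on an S3 target, plus a full row image. The `id` key, the `seq` ordering field, and the `is_deleted` flag are hypothetical names chosen for illustration.

```python
from typing import Any

Row = dict[str, Any]


def apply_cdc(snapshot: dict[Any, Row], changes: list[Row], key: str = "id",
              order_by: str = "seq", soft_delete: bool = False) -> dict[Any, Row]:
    """Apply insert/update/delete change records to a snapshot keyed by `key`."""
    result = {k: dict(row) for k, row in snapshot.items()}
    # Replay changes in source order so the last change per key wins.
    for change in sorted(changes, key=lambda c: c[order_by]):
        row = {name: value for name, value in change.items() if name != "Op"}
        if change["Op"] == "D" and not soft_delete:
            result.pop(change[key], None)  # hard delete: drop the row
            continue
        if soft_delete:
            # Soft delete: keep the row and flag it instead of removing it.
            row["is_deleted"] = change["Op"] == "D"
        result[change[key]] = row  # inserts and updates are both upserts
    return result


snapshot = {1: {"id": 1, "name": "Ann", "seq": 0}}
changes = [
    {"Op": "U", "id": 1, "name": "Anna", "seq": 2},
    {"Op": "I", "id": 2, "name": "Bob", "seq": 1},
    {"Op": "D", "id": 2, "name": "Bob", "seq": 3},
]
once = apply_cdc(snapshot, changes)
twice = apply_cdc(once, changes)  # replaying the same batch changes nothing
soft = apply_cdc(snapshot, changes, soft_delete=True)
```

Because inserts and updates are both treated as upserts, the final state depends only on the last change per key, so reprocessing a batch is idempotent in this sketch.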

The AWS Glue jobs will use Python 3 or PySpark with AWS Glue 4.0, which allows them to scale by adjusting the worker type. Each job will log directly to Amazon CloudWatch, and data partitions will be registered in the AWS Glue Data Catalog. The Drop Duplicates transform in AWS Glue removes duplicate rows from the data source and provides two options for doing so: remove rows that are completely identical, or select the fields to match on and remove only the rows that are duplicates on those fields.
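The two de-duplication modes can be sketched in plain Python. This is not the AWS Glue transform itself, only an illustration of the semantics; the helper name `drop_duplicates` and the sample rows are hypothetical. In a PySpark job, `DataFrame.dropDuplicates()` with or without a column subset plays a similar role.

```python
from typing import Any, Optional

Row = dict[str, Any]


def drop_duplicates(rows: list[Row], fields: Optional[list[str]] = None) -> list[Row]:
    """Keep the first row for each distinct signature.

    With fields=None the whole row must match; otherwise only `fields` are compared.
    Values must be hashable.
    """
    seen: set[tuple] = set()
    unique: list[Row] = []
    for row in rows:
        names = fields if fields is not None else sorted(row)
        signature = tuple((name, row.get(name)) for name in names)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "a@example.com"},  # exact duplicate of the first row
    {"id": 2, "email": "a@example.com"},  # duplicate only on "email"
]
full_match = drop_duplicates(rows)
by_email = drop_duplicates(rows, fields=["email"])
```

This sketch keeps the first occurrence of each duplicate; which row a managed transform retains is a separate question and should be checked against its documentation.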

Conclusion

By leveraging an Experience-Based Acceleration (EBA) workshop, AWS Partner Caylent was able to address the nuances of the data pipeline solution: an architecture that is secure, consistent, and scalable, built on AWS analytics services such as Amazon Redshift, AWS DMS, Amazon S3, and AWS Glue, along with AWS Organizations. The design simplifies the management of row-level data security and achieves fine-grained access control in Amazon Redshift. With the experience and scale needed to deliver numerous large-scale data modernization programs, Caylent is well positioned to help customers on their cloud journey.

“We were able to delve into key issues and explore potential solutions in real-time. This is exactly the kind of productive discussion we needed before starting the project. This has been very effective!” – Customer CTO