This Guidance demonstrates how to ingest third-party Software as a Service (SaaS) data into Amazon S3 to build a serverless business intelligence pipeline. SaaS applications remove the burden of having to build a solution from the ground up. The challenge with SaaS applications is that the data exists in external data stores, making it difficult to analyze the data that comes from other sources. Importing data into Amazon S3 removes data silos and centralizes the data to transform and enrich the datasets. By deploying this Guidance, you can gain better insights from your SaaS data, remove barriers when integrating the data, and leverage a serverless architecture that provides on-demand resources and a pay-as-you-go pricing model.
Amazon AppFlow is invoked to run either on-demand or on a schedule to pull data from SaaS applications such as SAP, Salesforce, and ServiceNow.
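As an illustration of the on-demand case, a flow can be triggered programmatically through the AppFlow `StartFlow` API. The sketch below is a minimal helper, not part of the Guidance itself; the flow name `salesforce-to-s3` is a hypothetical example, and in practice the client would be `boto3.client("appflow")`.

```python
def start_on_demand_flow(appflow_client, flow_name):
    """Trigger a single run of an on-demand Amazon AppFlow flow.

    `appflow_client` is any object exposing the AppFlow StartFlow API,
    such as boto3.client("appflow"). Returns the execution ID so the
    caller can poll DescribeFlowExecutionRecords for completion.
    """
    response = appflow_client.start_flow(flowName=flow_name)
    return response["executionId"]


# Hypothetical usage (requires AWS credentials and an existing flow):
#   import boto3
#   execution_id = start_on_demand_flow(boto3.client("appflow"),
#                                       "salesforce-to-s3")
```

Scheduled flows, by contrast, are configured on the flow definition itself and need no external trigger.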
Amazon AppFlow stores the raw data pulled from SaaS applications in Amazon Simple Storage Service (Amazon S3).
AWS Glue reads the raw data through the AWS Glue Data Catalog and writes the enriched and transformed data to a new table.
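In the Guidance, the enrich-and-transform step runs as an AWS Glue or DataBrew job. As a language-agnostic sketch of the shape of such a transformation, the function below cleans and enriches plain Python records; all field names (`id`, `amount`, `currency`) and the static rate table are hypothetical.

```python
def transform_records(raw_records):
    """Illustrative enrich/transform pass over raw SaaS records.

    Drops records missing an 'amount', normalizes the currency code,
    and derives an 'amount_usd' column from a hypothetical static rate
    table. In the Guidance this logic would live in an AWS Glue or
    DataBrew job rather than in application code.
    """
    rates_to_usd = {"USD": 1.0, "EUR": 1.1}  # hypothetical rates
    curated = []
    for rec in raw_records:
        if rec.get("amount") is None:
            continue  # skip incomplete rows instead of failing the job
        currency = rec.get("currency", "USD").upper()
        curated.append({
            "id": rec["id"],
            "amount": rec["amount"],
            "currency": currency,
            "amount_usd": round(rec["amount"] * rates_to_usd.get(currency, 1.0), 2),
        })
    return curated
```

Because the job writes its output to a separate curated table, the raw dataset in Amazon S3 is never modified.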
Transformed data is stored in an Amazon S3 bucket (Curated_Data) that can be analyzed further using business intelligence (BI) services or tools.
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
This Guidance focuses on deploying a decoupled architecture that is modular and can change based on business requirements. By using Amazon S3 as a data lake, you can integrate your SaaS data into BI tools like QuickSight. Future business requirements may call for machine learning, and by using a data lake, the data can be easily integrated into AWS artificial intelligence and machine learning (AI/ML) services, such as Amazon SageMaker. You can also use Amazon Redshift as a data warehouse to provide live data querying capabilities using Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC).
A number of design decisions were factored into securing people and machine access in this Guidance. First, credentials that Amazon AppFlow uses to authenticate with the SaaS applications are secured using AWS Secrets Manager. Credentials are encrypted and cannot be viewed by users or roles without explicit permissions in AWS Identity and Access Management (IAM). Data stored in Amazon S3 can be encrypted when it is written, and access to data is granted with IAM. Second, with a serverless architecture, AWS manages the network security of the underlying infrastructure. You only need to focus on granting permissions to the users of services using IAM. Third, all public access to Amazon S3 is blocked by default. Only users or roles with sufficient IAM permissions are able to perform actions in the bucket. Amazon S3 automatically applies server-side encryption for each new object, unless you specify a different encryption option. Data that moves from one service to another is encrypted in transit by default. Services that access Amazon S3, like AWS Glue, Athena, or QuickSight, all need explicit access in IAM to read or write data.
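To make the encryption-at-write choice concrete, the sketch below builds the keyword arguments for an S3 `PutObject` call with server-side encryption stated explicitly. The helper is an illustration, not part of the Guidance; the returned dict is intended for boto3's `s3_client.put_object(**kwargs)`.

```python
def s3_put_kwargs(bucket, key, body, kms_key_id=None):
    """Build PutObject arguments with server-side encryption explicit.

    Amazon S3 already applies SSE-S3 (AES256) to new objects by
    default; passing a KMS key ID switches the request to SSE-KMS,
    which also requires the caller to have kms:GenerateDataKey
    permission on that key via IAM.
    """
    kwargs = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        kwargs["ServerSideEncryption"] = "aws:kms"
        kwargs["SSEKMSKeyId"] = kms_key_id
    else:
        kwargs["ServerSideEncryption"] = "AES256"
    return kwargs
```

Stating the encryption mode in the request, rather than relying on the bucket default, makes the security posture visible in code review.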
This Guidance implements a highly available network topology by using a serverless architecture that is deployed in a single Region and runs across multiple Availability Zones (AZs). This removes the single point of failure that could come from a rare, but possible, AZ failure.
Additionally, application reliability is achieved by decoupling the application into individual components that focus on a single task. Having data from the SaaS application land in Amazon S3 allows you to ingest data without having to transform it in-transit. This makes it resilient to schema changes and other errors that may occur when ingesting or transforming the data in-transit. Once raw data is in Amazon S3, you can transform the dataset using AWS Glue or DataBrew without altering the source dataset, and have the transformed dataset land in an Amazon S3 curated data bucket. The curated dataset is used by Athena to load the data into the QuickSight SPICE cache.
Decoupling each service allows each component to work independently of the others. Each service is purpose-built to perform a specific task. Services like Amazon AppFlow, Amazon S3, and Athena can scale up and down to meet demand without having to anticipate changes in demand. AWS Glue and AWS Glue DataBrew data processing units (DPUs) can be increased to perform tasks faster or to process larger datasets.
Optimizations of the data to increase performance can be made based on how the data is being queried. Query access patterns determine whether to use columnar storage, such as Apache Parquet, or to partition the data based on query filters. These optimizations accelerate query performance in Athena.
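Partitioning typically means laying out S3 keys in Hive style (`year=/month=/day=`) so Athena can prune partitions when a query filters on date. The helper below sketches that layout; the prefix, table, and partition columns are hypothetical and should match the filters your queries actually use.

```python
from datetime import date


def partitioned_key(prefix, table, day, filename):
    """Build a Hive-style partitioned S3 key (year=/month=/day=).

    With this layout, an Athena query filtering on the partition
    columns scans only the matching prefixes instead of the whole
    table, reducing both latency and per-query cost.
    """
    return (
        f"{prefix}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )
```

Combining this layout with Parquet files gives Athena both partition pruning and columnar projection, so queries read only the rows and columns they need.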
This Guidance focuses on making a serverless BI pipeline. By using a serverless architecture, you only pay for what you use. Services like Amazon AppFlow, AWS Glue, and Athena only incur charges when those services are invoked and used. Services like Amazon S3 incur charges for the amount of data actually stored. And, using a serverless architecture allows the architecture to scale up and down to meet demand without having to over-provision resources.
Because deploying this Guidance results in a serverless BI pipeline, resources are on-demand and do not sit idle when not in use. This limits the overall hardware used in the AWS Cloud so you can maximize efficiency and minimize waste.
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.