Unlock third-party data with API-driven data pipelines on AWS

Public sector organizations often utilize third-party Software-as-a-Service (SaaS) to manage various business functions, such as marketing and communications, payment processing, workflow automation, donor management, and more. This common SaaS landscape can lead to data silos where data becomes isolated in disparate systems and difficult to centralize for business insights. If existing SaaS connectors are not available, public sector organizations can use Amazon Web Services (AWS) to build an API-driven data pipeline to consolidate data from SaaS platforms offering open APIs.

API-driven data pipelines consolidate business data, which enables public sector organizations to offer more personalized experiences for constituents, enrich research publications, improve operations to reduce the burden on internal teams, and more. Use cases for API data ingestion in the public sector include:

Joining clickstream data to donor profiles to enable more personalized donor experiences based on website navigation history.
Supplementing existing research with publicly available datasets to enrich findings prior to publication.
Pairing operational metrics from project management systems with human resource (HR) data to streamline internal workforce reporting.
Tying event registration with sales reporting to drive targeted marketing.

Commonly, each of these systems is independent and doesn’t communicate with the others. With API-driven data pipelines, public sector organizations can break down these data silos and start generating new business value.

In this post, learn how to build an API data pipeline on AWS. Explore best practices for reviewing SaaS API documentation, options for making API requests from AWS services, methods for handling payload data from API calls, and guidance for orchestrating and scaling an API data pipeline on AWS.

Understanding the API

Many SaaS platforms offer open APIs to registered users, allowing you to consume data via API requests. Although there are several API protocols, this post focuses on REST APIs as they are the most common open API offering.

As you review the SaaS API documentation, consider how to accommodate these common configurations on AWS:

Authentication: Leverage AWS Secrets Manager to store credentials like API Keys, ClientIDs, and ClientSecrets, and then reference them in your API call without exposing plain text.
Pagination: Develop custom connectors on Amazon AppFlow, which provides pagination support. Alternatively, write your own pagination logic using AWS Lambda or AWS Glue (Jump to Making API calls on AWS section).
Response format: Use services like AWS Glue to reformat and process the payload data (Jump to Handling API data section).
Rate limits: Note the limits associated with your API so that your pipeline design does not exhaust the endpoint. (Jump to Orchestration section).
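
The authentication and pagination items above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the secret is assumed to be a JSON string with an api_key field, the API is assumed to accept a bearer token, and fetch_page is a hypothetical callable that returns a body shaped like {"results": [...], "has_more": true}. Your SaaS API documentation determines the real header and pagination scheme. The only AWS call used is the Secrets Manager get_secret_value operation, and the client is passed in as a parameter.

```python
import json
from typing import Callable, Iterator


def build_auth_headers(secrets_client, secret_id: str) -> dict:
    """Read an API key from AWS Secrets Manager and build request headers.

    `secrets_client` is expected to be boto3.client("secretsmanager").
    Assumption: the secret is a JSON string with an "api_key" field.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    # Assumption: the API expects a bearer token in the Authorization header.
    return {"Authorization": f"Bearer {secret['api_key']}"}


def fetch_all_pages(
    fetch_page: Callable[[int], dict], max_pages: int = 1000
) -> Iterator[dict]:
    """Yield records page by page until the API reports no further pages.

    `fetch_page(page_number)` is a hypothetical callable that performs the
    request and returns the parsed JSON body. `max_pages` is a safety cap
    so a misbehaving endpoint cannot cause an endless loop.
    """
    for page in range(1, max_pages + 1):
        body = fetch_page(page)
        yield from body.get("results", [])
        if not body.get("has_more"):
            break
```

Passing the client and the page-fetching function in as arguments keeps the logic the same whether it runs in AWS Lambda or in an AWS Glue job.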

Making API calls on AWS

AWS Partner solutions and existing third-party integrations

Prior to constructing your own API data pipeline on AWS, review partner tooling available on the AWS Marketplace. AWS Marketplace offers a curated digital catalog where customers can purchase third-party software, including ETL tools like Matillion, Alteryx, Fivetran, and Boomi. These partners offer intuitive drag-and-drop interfaces and API query connectors that enable you to read, transform, and store data from APIs.

Although partner solutions offer simple integration, public sector organizations may prefer native AWS services for API ingestion pipelines to reduce risk exposure and avoid new licensing costs and contracts.

Amazon AppFlow provides bidirectional data integration between on-premises systems and applications, SaaS applications, and AWS services. At the time of publication, the Amazon AppFlow integrations list supports over 70 SaaS applications. Additionally, Amazon EventBridge supports over 40 partner event sources, letting you stream data from SaaS applications without having to write any code. Lastly, AWS Marketplace offers a variety of ready-to-launch SaaS connectors published by third parties. Learn more about AWS Marketplace.

If your desired API integration does not already exist in Amazon AppFlow, EventBridge, or the AWS Marketplace, consider the following AWS services for API requests.

Amazon AppFlow Custom Connector Software Development Kit (SDK)

For data sources that are reused across teams and require additional functionality such as write calls, consider building a custom connector using the Amazon AppFlow custom connector SDK. Although Amazon AppFlow provides a library of pre-built connectors, you can create custom connectors for APIs that are not currently integrated (see Figure 1).

The custom connector SDK provides authentication, pagination, throttling, error handling, deployment scripts, and a test framework out of the box. To learn more, refer to the Custom Connector SDK Developer Guide.

Figure 1. Amazon AppFlow Custom Connector SDK stores the API payload in Amazon S3.


AWS Lambda

For API payloads that may or may not require additional transformation, consider using AWS Lambda to make the API call. AWS Lambda provides a variety of runtimes, so your team has options for making the request, handling authentication, and managing pagination in the code.

With Lambda, you can write the raw API payload as an object to Amazon Simple Storage Service (Amazon S3). If the payload requires restructuring before storage, then you can add logic within the Lambda function to alter the payload before storing it in Amazon S3 (Figure 2). However, a Lambda function can run for a maximum of 15 minutes, so Lambda is not recommended for long-running operations.
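
A minimal sketch of such a Lambda function in Python is shown here. The event fields (api_url, source, headers), the RAW_BUCKET environment variable, the key layout, and the results field in the payload are all assumptions made for illustration. The S3 client and the fetch function are optional parameters so that the handler can be exercised locally; when Lambda invokes the handler, the defaults apply.

```python
import json
import os
import urllib.request
from datetime import datetime, timezone


def fetch_json(url: str, headers: dict) -> dict:
    """Make a GET request with the standard library and parse the JSON body."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())


def build_raw_key(source: str, now: datetime) -> str:
    """Build an object key partitioned by source and date (assumed layout)."""
    return f"raw/{source}/{now:%Y/%m/%d}/{now:%H%M%S}.json"


def lambda_handler(event, context, s3_client=None, fetch=fetch_json):
    """Call the API and store the unmodified payload in Amazon S3."""
    if s3_client is None:
        import boto3  # included in the Lambda Python runtimes

        s3_client = boto3.client("s3")

    payload = fetch(event["api_url"], event.get("headers", {}))
    key = build_raw_key(event["source"], datetime.now(timezone.utc))
    s3_client.put_object(
        # Assumption: the bucket name is supplied as an environment variable.
        Bucket=os.environ.get("RAW_BUCKET", "example-raw-bucket"),
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    # Assumption: the API wraps its records in a "results" list.
    return {"bucket_key": key, "record_count": len(payload.get("results", []))}
```

Any restructuring of the payload would go between the fetch call and put_object.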

Figure 2. Lambda calls the API and stores the transformed payload to Amazon S3.


AWS Glue

For API payloads that require longer running transformations, consider using AWS Glue to make the API call directly from an extract, transform, and load (ETL) job. AWS Glue ETL jobs can be written in Python or Scala. If you need a particular library for API requests, then you can pass in external libraries through the Glue job parameters.

You can couple the API call and the transformation of the data within the same ETL job, removing intermediate raw data storage and simplifying the pipeline (Figure 3). For an example of making API calls from an AWS Glue Job, review this post on creating a serverless workflow to process Microsoft data with AWS Glue.

Figure 3. AWS Glue calls the API and stores the transformed payload in Amazon S3.


Lambda with AWS Glue

It’s a best practice to separate your API call for data ingestion from your processing job for transformation. You can use Lambda to make the API call and write the payload to Amazon S3, then trigger an AWS Glue ETL job for processing and store the cleaned data back in Amazon S3 (Figure 4).

If the AWS Glue job fails, the raw payload remains available in Amazon S3, so you can retry the transformation without making a new API call that counts toward the API’s rate limit. Additionally, decoupling the API call from the data transformation simplifies code management because each component is responsible for a single task.
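
One way to hand off from the ingestion function to the processing job is the AWS Glue start_job_run operation, sketched below. The job name and the argument names (--raw_bucket, --raw_key) are hypothetical; AWS Glue expects custom job argument names to begin with two hyphens. The client is passed in as a parameter and is expected to be boto3.client("glue").

```python
def start_transform_job(glue_client, job_name: str, bucket: str, raw_key: str) -> str:
    """Start an AWS Glue job run that processes one raw payload in Amazon S3.

    The "--raw_bucket" and "--raw_key" argument names are assumptions; the
    Glue job script would need to read the same names.
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--raw_bucket": bucket, "--raw_key": raw_key},
    )
    # The run ID can be logged or returned for later status checks.
    return response["JobRunId"]
```

Because the job receives only the location of the raw object, a failed run can be restarted with the same arguments and no additional API request.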

Figure 4. Lambda stores the raw payload in Amazon S3. AWS Glue transforms the raw payload and stores it back in Amazon S3.


Handling API data

Store response data in a data lake

As response data is generated from your API calls, consider implementing a data lake architecture to store raw data, run processing jobs, and output structured data for analytics. A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale. For public sector organizations, examples of data stored in a data lake include sales reporting, expense logging, website analytics, donation history, marketing content, and membership rosters.

Organizations have chosen to build data lakes on top of Amazon S3 and process data with AWS Glue for many years. After response data is added to Amazon S3 for storage, AWS Glue can perform transformations on your payload, such as the relationalize transform to flatten nested JSON. These transformations make sure your data is in a suitable format for relational targets like databases or data warehouses.
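
To make the idea of flattening nested JSON concrete, here is a plain-Python illustration. It is not the AWS Glue relationalize transform and does not use the AWS Glue API; it only shows how nested objects can become flat, column-like keys. The dot separator and the sample record are assumptions, and arrays are left untouched in this sketch.

```python
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into a single level of dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects and carry the parent key along.
            flat.update(flatten(value, name, sep))
        else:
            # Scalars and lists are kept as-is in this sketch.
            flat[name] = value
    return flat


# Hypothetical API record with a nested donor object.
sample = {"id": 1, "donor": {"name": "A. Donor", "address": {"city": "Austin"}}}
flat_sample = flatten(sample)
# flat_sample has the keys "id", "donor.name", and "donor.address.city".
```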

For more information about developing data lakes on AWS, check out this reference architecture for a serverless analytics pipeline. Also, consider this post for building a data lake foundation with AWS Glue and Amazon S3.

Manage new data

In an API-driven data pipeline, API calls deliver fresh data on a schedule. Depending on your business requirements, you can elect to overwrite the previous data with the latest batch (truncate and reload) or add new data to your existing dataset (upserts and merges).

Truncate and reload: This approach removes old data and stores only the latest data while keeping the schema intact. You can achieve it either by deleting the existing data prior to a new API call or by overwriting the existing data, using additional logic in your API call or data processing jobs.
Upserts and merges: Adding new data requires either updating existing data or inserting new data. These operations are also referred to as upserts (updates and inserts) or merges. A best practice for handling merges is to use a temporary staging table for new data and to add logic that checks the new data against the existing data. For organizations that require real-time data merging for operational data, consider leveraging AWS Glue for upserts with the open-source Delta Lake.
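
The difference between the two approaches can be sketched with in-memory rows. This is a simplified illustration that assumes each record carries a unique key (here a hypothetical id field); in practice the same logic would run against tables in your data lake or data warehouse.

```python
from __future__ import annotations


def truncate_and_reload(existing: list[dict], latest: list[dict]) -> list[dict]:
    """Discard the existing rows and keep only the latest batch."""
    return list(latest)


def upsert(existing: list[dict], staged: list[dict], key: str) -> list[dict]:
    """Merge staged rows into existing rows using a unique key.

    Rows whose key already exists are updated field by field; rows with a
    new key are inserted. `staged` plays the role of the staging table.
    """
    merged = {row[key]: row for row in existing}
    for row in staged:
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


# Hypothetical donation rows: id 2 is updated and id 3 is inserted.
current = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
new_batch = [{"id": 2, "amount": 25}, {"id": 3, "amount": 5}]
merged_rows = upsert(current, new_batch, key="id")
```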

Orchestration

Make an API orchestration plan

After testing your API data pipeline, layer in automation to reduce operational overhead. Consider the following when planning the orchestration of your API data pipeline.

Define a schedule: Work backward from your business requirements to determine when you must make API requests for fresh data. Note any rate limits from the API subscription. Common schedules for API requests are hourly, daily, and weekly.
Evaluate dependencies: Processing jobs may rely on more than one dataset to be available before progressing the pipeline. Examine these dependencies to develop the optimal sequence for data ingestion. Look for opportunities to run API calls in parallel for a more efficient pipeline.
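
When comparing a candidate schedule against a rate limit, a quick estimate of daily request volume can help. The sketch below assumes a paginated API in which each page costs one request and a limit expressed as requests per day; the figures are invented for illustration, and your API may count requests differently.

```python
import math


def requests_per_day(records_per_run: int, page_size: int, runs_per_day: int) -> int:
    """Estimate daily API requests, assuming one request per page."""
    pages_per_run = math.ceil(records_per_run / page_size)
    return pages_per_run * runs_per_day


def fits_rate_limit(
    records_per_run: int, page_size: int, runs_per_day: int, daily_limit: int
) -> bool:
    """Return True if the schedule stays within the assumed daily limit."""
    return requests_per_day(records_per_run, page_size, runs_per_day) <= daily_limit


# Hypothetical example: 10,000 records per run, 100 records per page, hourly runs.
hourly_estimate = requests_per_day(10_000, 100, 24)
```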

Orchestration and scaling

With your automation plan in place, use AWS Step Functions and Amazon EventBridge to orchestrate API calls and processing jobs on a schedule. Step Functions can help you coordinate service executions using visual workflows on the AWS Management Console (Figure 5). In an API data pipeline, a Step Functions workflow can execute the API call and trigger the subsequent processing jobs. Then, you can use Amazon EventBridge to start the Step Functions workflow on a schedule.

Figure 5. The Step Functions Workflow Studio with sample API data pipeline orchestration. The workflow includes two separate API calls executed in parallel with Lambda and sent to AWS Glue for processing.


As the number of SaaS data sources grows, scaling elegantly can become a challenge. Instead of building independent pipelines for each SaaS platform, consider passing input into more generic, templatized functions and processing jobs to reduce the number of custom instances and simplify pipeline management. Step Functions supports scaling your pipeline because you can pass custom payloads into AWS Lambda or custom parameters into an AWS Glue job.
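
A templatized workflow of this kind could be generated as an Amazon States Language definition, as in the following sketch. The state names, the function and job names, and the source payload field are hypothetical. The resource ARNs are the Step Functions service integrations for invoking a Lambda function and for starting an AWS Glue job run and waiting for it to finish.

```python
import json


def ingest_branch(function_name: str, source: str) -> dict:
    """Build one parallel branch that invokes a shared Lambda function."""
    state_name = f"Call{source}Api"
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": function_name,
                    # Assumption: the function reads the source from its payload.
                    "Payload": {"source": source},
                },
                "End": True,
            }
        },
    }


def build_definition(function_name: str, glue_job_name: str, sources: list) -> dict:
    """Build a workflow: parallel API calls, then one AWS Glue job run."""
    return {
        "Comment": "Sketch of an API data pipeline workflow",
        "StartAt": "IngestInParallel",
        "States": {
            "IngestInParallel": {
                "Type": "Parallel",
                "Branches": [ingest_branch(function_name, s) for s in sources],
                "Next": "TransformWithGlue",
            },
            "TransformWithGlue": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }


# Hypothetical names; adding a SaaS source means adding one list entry.
definition = build_definition("ingest-saas-api", "transform-saas-data", ["Crm", "Events"])
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON could be supplied as the state machine definition, with an Amazon EventBridge schedule starting the workflow.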

Learn more about orchestrating data pipelines with AWS Step Functions. For additional examples, review the Amazon EventBridge service integration for AWS Step Functions.

Conclusion

As more public sector organizations rely on third-party SaaS, it is important to know your options for unlocking siloed data across your business landscape. An API-driven data pipeline on AWS can help centralize data from third-party SaaS platforms offering open REST APIs. Centralized data enables public sector organizations to offer personalization to their constituents, enrich existing datasets prior to publication, and improve operational efficiency to support staff.

Using AWS services like Amazon S3, Lambda, AWS Glue, and Amazon AppFlow, organizations can consolidate disparate data into a central repository as a source for business intelligence.

In part two of this series, we walk through a hands-on example of ingesting SaaS data from APIs, handling payload data from API calls, and orchestrating an API data pipeline on AWS.

For additional reading, review this whitepaper on patterns for ingesting SaaS data into AWS data lakes. To learn more about data lakes and business intelligence on AWS, explore the Modern Data Architecture on AWS overview.

If you would like to discuss this further with your AWS Account Team, complete the Public Sector Contact Us form for your organization.
