AWS Big Data Blog

Running Amazon Payments analytics on Amazon Redshift with 750TB of data

The Amazon Payments Data Engineering team is responsible for data ingestion, transformation, and the computation and storage of data. It makes these services available to more than 300 business customers around the globe, including product managers, marketing managers, program managers, data scientists, business analysts, and software development engineers. These customers use the data for scheduled and one-time queries to make the right business decisions. The data is also used to build the weekly, monthly, and quarterly business review metrics that the leadership team reviews.

We support various consumer payment business teams, including the following:

  • Amazon Payment Products (Credit cards, Shop with Points, Amazon Currency Converter, International Payment Products)
  • Gift Cards
  • Payment Acceptance Experience
  • Amazon Business Payments

We also feed data into the machine learning recommendation engine, which suggests the best payment product to a customer on Amazon’s payment checkout page.

Challenges with the old data warehouse

This section describes the challenges we faced with our previous data warehouse and analytical needs. As payment products launched and extended to new marketplaces, our data volume grew exponentially. Scaling our extract, transform, and load (ETL) processes then ran into severe challenges, resulting in delays and operational burden. Here are the specific challenges we faced with our data warehouse:

  • Upserts did not scale when we got updates of more than roughly 10 million rows per run. The consumer product catalog dataset has more than 6.5 billion records for the US marketplace, and the daily updates occasionally exceeded the 10 million mark. We saw a similar trend for the order attributes dataset.
  • Data aggregation either took too long or never finished when analyzing even six months of payments data. Business owners often wanted to aggregate the data by certain attributes, for example, the number of successful transactions and the monetary value by certain card types.
  • A shared cluster, and thus shared storage and compute, caused a resource crunch that impacted all of its users. Each team was given about 100 TB on the data warehouse and could bring its own tables to join with the central data warehouse tables. Any bad query on the cluster impacted all the other queries on the same cluster, and it was difficult to identify the owners of those bad queries.
  • With more than 30,000 production tables, it became almost impossible to host all of them on the same cluster.
  • When an index on a larger table became corrupted, rebuilding and backfilling the table was cumbersome.
  • We required a Database Administrator to apply patches and updates.

Using Amazon Redshift as the new payments data warehouse

We started exploring options that suited our analytical needs: fast, reliable, and able to scale with future data growth. Given all of the previously described issues, the central data warehouse team moved toward separating the compute and storage layers, and decided to be responsible for storage. They built a data lake on Amazon S3, which is encrypted so it can store even highly confidential, critical data. Each consumer team was given the guideline to bring its own compute capacity for its analytical needs. Our payments team started looking for the following advantages:

  • Expedient analytics.
  • Integrates with S3 and the other AWS services.
  • Affordable storage and compute rates.
  • Works for ETL processing.

We chose Amazon Redshift because it has the following features:

  • Bulk loads are fast: roughly 700 million records can be inserted in about 30 minutes.
  • Data upsert is exceptionally fast.
  • An aggregation query over a multi-million-row dataset that touches only a few columns returns in a few seconds, compared to a few minutes previously.
  • No need for DBA time to be allocated to maintain the database. Data engineers can easily perform backups, re-snapshot to a new cluster, set up alarms in case of cluster issues, and add new nodes.
  • The ability to keep data on S3. This data is accessible from multiple independent Amazon Redshift clusters through Spectrum and also allows users to join Spectrum tables with other tables created locally on Amazon Redshift. It offloads some processing to the Spectrum layer while storing the data on S3.
  • The ability to follow Amazon Redshift best practices when designing our tables with respect to distribution keys, sort keys, and compression (a sketch follows this list). As a result, query performance exceeded our SLA expectations.
  • Effective compression. Choosing the right compression encodings saves 40 to 50 percent of space, which enables faster queries and more efficient storage.
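
Building on the table-design point above, here is a minimal, illustrative sketch of how a distribution key, a sort key, and column compression might be declared and submitted through the Amazon Redshift Data API with boto3. The table, column, cluster, database, and user names are hypothetical placeholders, not our production DDL.

```python
# Minimal, illustrative sketch: declare a distribution key, sort key, and
# column compression, then submit the DDL through the Amazon Redshift Data API.
# All table, column, cluster, and user names are hypothetical.
import boto3

DDL = """
CREATE TABLE IF NOT EXISTS payments.transaction_fact (
    transaction_id   BIGINT        ENCODE az64,
    customer_id      BIGINT        ENCODE az64,
    payment_product  VARCHAR(64)   ENCODE zstd,
    txn_amount       DECIMAL(18,4) ENCODE az64,
    txn_status       VARCHAR(16)   ENCODE zstd,
    txn_date         DATE          ENCODE az64
)
DISTKEY (customer_id)   -- co-locates rows that are joined on customer_id
SORTKEY (txn_date);     -- lets Redshift skip blocks when filtering by date
"""

client = boto3.client("redshift-data")
client.execute_statement(
    ClusterIdentifier="payments-user-cluster",  # placeholder
    Database="analytics",                       # placeholder
    DbUser="platform_etl",                      # placeholder
    Sql=DDL,
)
```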

Sources of data and storage

We consume data from different sources, such as PostgreSQL, Amazon DynamoDB live streams, the central data warehouse data lake, and bank partners’ data delivered through secure protocols. Data from the PostgreSQL databases is relational, whereas DynamoDB stores key/value pairs. We translate the key/value data into relational format and store it in Amazon Redshift and S3. The most frequently accessed data is kept in Amazon Redshift. Less frequently accessed and larger datasets are stored in S3 and accessed through Amazon Redshift Spectrum.
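
As an illustration of the key/value-to-relational translation, here is a minimal, hypothetical sketch of flattening a DynamoDB item (with its type descriptors) into a flat row before it is written to S3 and loaded into Amazon Redshift. The attribute names are illustrative only.

```python
# Hypothetical sketch: flatten a DynamoDB item (key/value pairs with type
# descriptors) into a relational-style row. Attribute names are illustrative.
import json

def dynamodb_item_to_row(item: dict) -> dict:
    """Turn {'order_id': {'S': '111-222'}, 'amount': {'N': '42.5'}} into flat columns."""
    row = {}
    for column, typed_value in item.items():
        dtype, value = next(iter(typed_value.items()))
        if dtype == "N":                  # DynamoDB numbers arrive as strings
            row[column] = float(value)
        elif dtype in ("S", "BOOL"):      # strings and booleans pass through
            row[column] = value
        else:                             # lists/maps kept as JSON text columns
            row[column] = json.dumps(value)
    return row

print(dynamodb_item_to_row({"order_id": {"S": "111-222"}, "amount": {"N": "42.5"}}))
```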

The central data lake hosts more than 30,000 tables from different teams, such as Orders, Shipments, and Refunds, whereas we as a payments team need approximately 200 of these tables as sources. From them, we built a data mart specific to payment products, which feeds both scheduled and one-time data and reporting needs. All small and mid-size tables (smaller than 50 TB) are loaded directly into Amazon Redshift from the data lake, which physically stores the data. Tables larger than 50 TB are not stored locally on Amazon Redshift. Instead, we pull them from the data lake using Hive on Amazon EMR, convert the format from TSV to ORC or Parquet, and store the result on S3. We then create an Amazon Redshift Spectrum table on top of the S3 data. The format conversion lowers the runtime of each analytical aggregation query, and storing the data on S3 ensures we do not fill up the entire Amazon Redshift cluster with data but instead use it for efficient computing.
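
The format conversion step can be sketched as a small PySpark job on Amazon EMR that reads the TSV extract and writes Parquet partitions to S3 for Amazon Redshift Spectrum. The bucket names, paths, and the event_date column are assumptions for illustration; the real pipelines also handle ORC output, schema evolution, and the Spectrum DDL.

```python
# Minimal PySpark sketch (run on Amazon EMR) of the TSV-to-Parquet conversion.
# Bucket names, paths, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv-to-parquet").getOrCreate()

# Read the raw TSV extract pulled from the central data lake
clicks = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3://payments-staging-bucket/clickstream/tsv/dt=2019-03-01/")
)

# Write columnar Parquet, partitioned by day, for Amazon Redshift Spectrum
(
    clicks.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://payments-analytics-bucket/clickstream/parquet/")
)
```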

Data Architecture

Different components

  1. Central data warehouse data lake (Andes): Almost all systems in Amazon that want to share their data with other teams publish it to this data lake. It is encrypted storage built on top of Amazon S3, with metadata attached to the data files. Every dataset has a one-time dump followed by incremental delta files. Teams consume the data by:
  • Physically copying the data into their own Amazon Redshift cluster. This is efficient for small and mid-size tables that are accessed most frequently.
  • Using Amazon Redshift Spectrum to run analytical queries on datasets stored in the data lake. This helps with accessing cold, large data, generally larger than 50 TB, and avoids scaling up the Amazon Redshift cluster just because most of its space would be consumed by these larger data files.
  • Using the AWS Glue catalog to update the metadata in their team’s S3 bucket, and using Amazon EMR to pull the data, apply transformations, change the format, and store the final data in an S3 bucket, which can then be consumed with Amazon Redshift Spectrum. This is efficient when the dataset is large and needs transformation before being consumed.
  2. Amazon Redshift clusters: Amazon Redshift is well suited to being the single source of truth, but we manage three clusters, mostly to keep the SLAs of our reports consistent and to decouple the user query experience from the central data lake ingestion process, which is resource intensive. Here are the cluster-specific reasons why we run these as separate clusters:
  • Staging cluster:
    • Our data sources are dynamic and in a state of transition, moving from relational to non-relational sources; for example, from Oracle to PostgreSQL or to DynamoDB.
    • The mechanism for pulling data from the central data lake and storing it in Amazon Redshift is also evolving and is resource intensive in its current state.
    • Our data mart is payments-specific. Although the table names in our data mart look similar to those in the central data lake, the data itself is different from the central data lake datasets. We apply the necessary transformations and filters before bringing the data to the users’ Amazon Redshift cluster.
  • User cluster: Our internal business users wanted to create tables in the public schema for their analysis, and they needed direct connection access for ad hoc analysis. Most users know SQL and are aware of best practices, but some are new to SQL, and their queries are sometimes unoptimized and impact other running queries. We use Amazon Redshift workload management (WLM) settings to protect the cluster from these bad queries.
  • Prod ETL cluster: We have tight SLAs for making datasets available to data users. To minimize the impact of bad queries running on the system, we set up a replica of the user cluster. All production transforms run here, and the output data is copied to both the user and prod clusters. This ensures the SLAs we commit to business data users.
  3. Near-real-time data ingestion: Many applications, such as promotional data, card registration, and gift card issuance, need real-time data collection to detect fraud. Application data is stored in Amazon DynamoDB with DynamoDB Streams enabled. We consume the data from these streams through an AWS Lambda function and Amazon Kinesis Data Firehose. Kinesis Data Firehose delivers the data to S3 and submits the COPY command to load the data into Amazon Redshift. We use a micro-batch of 15 minutes, which ensures that these near-real-time applications do not consume all the database connections. (A sketch of this flow follows this list.)
  4. Alternate compute on Amazon EMR: We receive website click data through clickstream feeds, which can run into billions of records per day for each marketplace. These large datasets are critical but less frequently accessed on Amazon Redshift. We chose S3 as the storage option and applied the transformations using Amazon EMR. With this approach, we made sure we did not fill up the database with massive cold data, while still enabling access to the data on S3 using Amazon Redshift Spectrum, which provided similar query performance. Because Amazon Redshift is a columnar database, it is exceptionally fast for aggregations that select only a few dimension columns, and we wanted similar performance for the data we stored on S3. We achieved it using Amazon EMR and by changing the data format from TSV to ORC or Parquet. Every day, we create new partition data on S3 and refresh the Amazon Redshift Spectrum table definition to include the new partitions. Business users access these Spectrum tables for one-time analysis using any Amazon Redshift SQL client, or schedule ETL pipelines on top of them.
  5. Publishing data to the data warehouse data lake for non-payments users: We built payments-specific datasets, for example, decline transaction behavior and wallet penetration. Sometimes non-payments business users are also interested in consuming these datasets, so we publish them to the central data warehouse data lake. Additionally, payments application teams are the source of payment product application data. The data engineering team consumes these datasets, applies the needed transformations, and publishes them for both payments and non-payments users through Amazon Redshift and the data lake.
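
To illustrate the near-real-time path described in item 3, here is a minimal sketch, with hypothetical stream and field names, of an AWS Lambda function that forwards DynamoDB Streams records to a Kinesis Data Firehose delivery stream; Firehose then buffers the data to S3 and issues the Redshift COPY according to its delivery configuration.

```python
# Minimal sketch of the near-real-time ingestion path. The delivery stream name
# is a placeholder; error handling and the 500-records-per-batch limit are omitted.
import json
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "payments-near-real-time"   # placeholder name

def handler(event, context):
    records = []
    for rec in event.get("Records", []):
        if rec["eventName"] in ("INSERT", "MODIFY"):
            new_image = rec["dynamodb"]["NewImage"]   # DynamoDB typed attributes
            records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
    if records:
        # Firehose buffers to S3 and loads Redshift via its COPY configuration
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records,
        )
    return {"forwarded": len(records)}
```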

Schema management

We have a prod schema that stores all production tables; only the platform team has access to make changes to it. We also provide payment product-specific sandbox schemas, each accessible to that product’s team members. There is a generic public schema for all payments data users, in which they can create, load, and truncate or drop tables. A minimal sketch of this layout follows.
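
Here is a minimal sketch of that schema and permission layout, expressed as SQL submitted through the Amazon Redshift Data API. The schema and group names are hypothetical stand-ins for our actual objects, and the user groups are assumed to already exist.

```python
# Hypothetical sketch of the schema/permission layout. Schema, group, cluster,
# and user names are placeholders, not our production objects.
import boto3

STATEMENTS = [
    # Production schema: only the platform team can change objects in it
    "CREATE SCHEMA IF NOT EXISTS prod;",
    "GRANT ALL ON SCHEMA prod TO GROUP platform_team;",
    "GRANT USAGE ON SCHEMA prod TO GROUP payments_users;",
    # Product-specific sandbox, writable by that product's members only
    "CREATE SCHEMA IF NOT EXISTS giftcards_sandbox;",
    "GRANT ALL ON SCHEMA giftcards_sandbox TO GROUP giftcards_team;",
    # Shared public schema where any payments data user can create/drop tables
    "GRANT CREATE, USAGE ON SCHEMA public TO GROUP payments_users;",
]

client = boto3.client("redshift-data")
for sql in STATEMENTS:
    client.execute_statement(
        ClusterIdentifier="payments-user-cluster",  # placeholder
        Database="analytics",                       # placeholder
        DbUser="platform_admin",                    # placeholder
        Sql=sql,
    )
```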

Database and ETL lookup

Here are a few fun facts about our Amazon Redshift database objects.

  • Number of Databases: 4
    • Staging Database DB1: ds2.8xlarge x 20 nodes
      Memory: 244 GiB per node
      Storage: 16 TB HDD per node
    • User Database DB2: ds2.8xlarge x 24 nodes
      Memory: 244 GiB per node
      Storage: 16 TB HDD per node
    • Platform Database DB3: ds2.8xlarge x 24 nodes
      Memory: 244 GiB per node
      Storage: 16 TB HDD per node
    • Reporting Database DB4: ds2.8xlarge x 4 nodes
      Memory: 244 GiB per node
      Storage: 16 TB HDD per node
  • Size of Databases:
    • Total memory: 17 TB
    • Total storage: 1.15 Petabytes
  • Number of tables:
    • Analytics prod db: 6500
    • Analytics staging db: 390

User cluster

Here are some stats around the database that is exposed to users. It contains the core tables, and users are also allowed to create their own tables based on their needs. We have a mirror image of the same database that hosts all of the tables except the user-created ones. Another database is used to run the ETL platform’s production pipelines. Most of these tables carry their entire history, except for snapshot tables such as the clickstream datasets, which have been archived to S3.

Staging cluster

Here are some stats around the staging database. It is the landing zone for all data coming from other teams or from the central data warehouse data lake. Retention has been applied to all the tables, because most of the downstream ETL jobs look for the last updated date, pull just the incremental data, and store it in the user and replica databases.

Scheduled ETL and Query load on database

  • Number of daily extraction ETL jobs: 2943
  • Number of loading ETL jobs: 1655

  • Total daily load processed volume: 119 BN
  • Total daily loading runtime: 11,415 mins

  • Total daily data extraction volume: 166 BN
  • Total daily data extraction runtime: 25,585 mins

Both scheduled and one-time query loads on the database

  • Daily query load on database by different database users.

Best practices

  1. Design tables with the right sort keys and distribution keys: Query performance depends on how much data a query scans and whether its joins are co-located. Choosing the right sort key ensures we do not scan data we do not need, and selecting the right distribution key ensures the joining data is present on the same node, so there is less data movement over the network and better query performance. For more information, see Amazon Redshift Best Practices for Designing Tables.
  2. Refer to Amazon Redshift Best Practices for Designing Queries while writing queries.
  3. Change the loading strategy by splitting larger files into smaller files, and use bulk loads instead of serial inserts. For more information, see Amazon Redshift Best Practices for Loading Data.
  4. Configure the appropriate WLM settings to avoid system abuse by allocating the right runtimes, memory, and priority queues. For more information, see Tutorial: Configuring Workload Management (WLM) Queues to Improve Query Processing.
  5. Use Amazon Redshift Advisor to identify tables needing compression, tables with missing stats, and uncompressed data loads, and to further fine-tune your ETL pipelines. For more information, see Working with Recommendations from Amazon Redshift Advisor.
  6. Identify the tables with the most wasted space and vacuum them frequently. This releases the wasted space and, at the same time, improves query performance (see the sketch after this list for one way to find candidates). For more information, see Vacuuming Tables.
  7. Analyze the SQL submitted to the database and identify patterns in table usage and expensive joins. This helps data engineers build more denormalized tables by pre-joining them, so users can access a single table, which is fast and efficient.
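
As one way to apply best practice 6, here is a minimal sketch that surfaces vacuum candidates from SVV_TABLE_INFO through the Amazon Redshift Data API. The thresholds, cluster, database, and user names are illustrative, not our production values.

```python
# Illustrative sketch: find tables with the most unsorted rows or stale stats
# as vacuum/analyze candidates. Thresholds and identifiers are placeholders.
import boto3

SQL = """
SELECT "schema",
       "table",
       size           AS size_1mb_blocks,
       tbl_rows,
       unsorted       AS pct_unsorted,
       stats_off      AS pct_stats_off
FROM   svv_table_info
WHERE  unsorted > 10      -- more than 10% unsorted rows
   OR  stats_off > 10     -- statistics more than 10% stale
ORDER  BY unsorted DESC
LIMIT  20;
"""

client = boto3.client("redshift-data")
resp = client.execute_statement(
    ClusterIdentifier="payments-user-cluster",  # placeholder
    Database="analytics",                       # placeholder
    DbUser="platform_etl",                      # placeholder
    Sql=SQL,
)
print("Submitted statement:", resp["Id"])       # results fetched asynchronously
```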

Conclusion and Next Steps

Amazon Redshift clusters with a total capacity of 1.15 PB, approximately 6,500 tables, 4,500 scheduled ETL runs, and 13,000 ETL queries a day are meeting almost all the ETL needs of business users on the payments team. However, recent volume growth is filling up our databases faster than we expected. The next step could be to choose a cheaper storage option by building a data lake on S3 and accessing the data through Amazon Redshift Spectrum, avoiding scaling challenges while keeping a seamless user experience.

 


About the authors

Bishwabandhu Newton is a senior data engineer with the Amazon Consumer Payments team. He has over 12 years of data warehousing experience, with 9+ years at Amazon.com.

 

 

 

Matt Scaer is a Principal Data Warehousing Specialist Solution Architect, with over 20 years of data warehousing experience, with 11+ years at both AWS and Amazon.com.