AWS Big Data Blog

Bannerconnect uses Amazon Redshift to help clients improve digital marketing results

Bannerconnect uses programmatic marketing solutions that empower advertisers to win attention and customers by getting their ads seen by the right person at the right time and place. Data-driven insights help large advertisers, trade desks, and agencies boost brand awareness and maximize the results of their digital marketing. Timely analysis of log data is critical to respond to dynamic changes in customer behavior, optimize marketing campaigns quickly, and gain a competitive advantage.

By moving to AWS and Amazon Redshift, our clients can now get near real-time analytics at their fingertips. In this blog post, we describe the challenges that we faced with our legacy, on-premises data warehouse, and discuss the benefits we received by moving to Amazon Redshift. Now, we can ingest data faster, do more sophisticated analytics, and help our clients make faster, data-driven decisions to improve their digital marketing.

Legacy on-premises situation and challenges

Our on-premises, legacy infrastructure consisted of the IBM PureData System as our log level data warehouse. We used a MySQL database for storing all metadata and all analytics data. In this physical, nonvirtualized environment, we needed to plan capacity carefully, far in advance, to handle data growth. We needed a sizeable team to manage upgrades, maintenance, backups, and the day-to-day management of workloads and query performance.

We faced many challenges. We had only 1 gigabit of bandwidth available to load log-level data into the data warehouse. At peak loads, our extract, transform, load (ETL) server was saturated, and bandwidth became a bottleneck that delayed when the data was available for analytics. Software and firmware upgrades to the data warehouse had to be scheduled, and maintenance downtime sometimes took up to eight hours to complete. Our infrastructure was also fragile: we ran everything on one PureData System, and we didn’t have a separate development and test environment. Clients with direct access to our production environment could submit an incorrect SQL query and bring down the entire data warehouse.

From the log-level data, we created aggregates and stored them in MySQL. Indexes slowed down the loading process significantly. Several aggregations that we wanted to do were simply not possible. Running ad hoc (one-time) queries against 200 gigabytes of uncompressed, row-based data took ages to complete. Many dashboard queries took 10–15 minutes or longer, and were ultimately cancelled. Users were frustrated, so we knew that we had to evolve to a more responsive solution, end-to-end. We chose AWS and Amazon Redshift for our data warehouse.

Moving to Amazon Redshift

Because our legacy software was not designed to run in the cloud, we decided to rebuild our applications using all available AWS components. This saved us the hassle of any migration process, and we could design our applications to use the full potential of AWS.

Our new infrastructure uses Amazon Redshift as our log-level data warehouse. We use a 40-node ds2.xlarge cluster for our production processes. Here, we run log-level queries to aggregate data for the analytics cluster, and run thousands of queries each day to optimize our marketing campaigns.

We set up a separate 30-node ds2.xlarge Amazon Redshift cluster for client access. We replicate the log-level data to this cluster and allow our clients to run queries there without jeopardizing our production processes. Our clients perform data science queries against the data in this cluster.

We also created a 24-node dc2.large cluster for high-performance queries that would not be affected by large, complex queries running on our other clusters. We use this cluster for ad hoc analytics on aggregated data, made available through our API.

We use Amazon S3 as our main data store, giving us infinite storage. Amazon EC2 hosts our ETL processes, API, and several other applications.
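For illustration, here’s a minimal sketch of the kind of load our ETL processes run: a COPY statement that pulls compressed log files from Amazon S3 into a Redshift table over a psycopg2 connection. The endpoint, credentials, table, bucket, and IAM role below are placeholders, not our actual names.

```python
import psycopg2

# Connect to the production Amazon Redshift cluster
# (endpoint and credentials are placeholders).
conn = psycopg2.connect(
    host="prod-cluster.xxxxxxxx.us-east-2.redshift.amazonaws.com",
    port=5439,
    dbname="logs",
    user="adminuser",
    password="...",  # placeholder
)
conn.autocommit = True

# COPY loads the files from S3 in parallel across the cluster's slices;
# the table, bucket, and IAM role are hypothetical.
copy_sql = """
    COPY loglevel_events
    FROM 's3://bannerconnect-logs/2018/05/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    GZIP
    DELIMITER '\\t'
    TIMEFORMAT 'auto';
"""
with conn.cursor() as cur:
    cur.execute(copy_sql)
conn.close()
```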

Bannerconnect architecture. Amazon S3 is omitted from the flowchart for simplicity; it could be added to almost every arrow in the chart.

Our biggest gains

Creating our next-generation solution on AWS and Amazon Redshift provides many benefits for us. We simply follow the best practices provided in the Amazon Redshift documentation.

  • Managed service: We wanted to focus on what we are good at, developing software, and not managing and maintaining infrastructure. With Amazon Redshift, we no longer have to do software updates, firmware upgrades, or deal with broken hardware. We no longer need to plan capacity months in advance, or deal with integrating new servers into our environment. Amazon Redshift completely automates all of this for us, including scaling with our data growth, so we can focus on data and analytics to better serve our clients.
  • A full, separate development and test environment: We now have an isolated Amazon Redshift cluster (single node), where we can experiment, develop, and test without worrying about breaking the production environment. In our on-premises setup, we had a development database, but it always ran on the production infrastructure. Furthermore, we keep an exact copy of our entire infrastructure in our test environment (at a much smaller scale, with smaller instance types). This lets us run automated tests on each deployment to verify that all data flows work as expected.
  • Infinite scalability: We can scale instantly to any capacity we need. Amazon S3 gives us infinite storage, and we can scale Amazon Redshift compute capacity in just a few clicks.
  • Separate clusters: Clients now can’t jeopardize our production processes. Clients still write incorrect SQL queries. However, by using query monitoring rules in Amazon Redshift, we can identify these queries and have Amazon Redshift automatically stop them (see the sketch after this list). Bad queries might affect the client cluster momentarily, but they don’t affect our production processes at all.
  • Faster ad hoc analytics: Thanks to the massively parallel processing, data compression, and columnar storage in Amazon Redshift, we can create aggregates that were not possible in MySQL. In terms of performance, it’s hard to give numbers that compare well. Running a query against a smaller aggregate using an index on MySQL might be faster at times. However, the majority of our queries are significantly faster on Amazon Redshift. For example, our biggest aggregated table contains about 2 billion records and 500 GB of data (compressed). MySQL couldn’t handle this, but Amazon Redshift returns results within seconds. Large, complex queries that took a long time on MySQL complete in tens of seconds or less on Amazon Redshift.
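To make the query monitoring rules concrete, here’s a rough sketch of how such a rule can be set in the WLM configuration of the parameter group attached to the client-facing cluster. The parameter group name, queue settings, and 120-second threshold are assumptions for illustration, not our production values.

```python
import json
import boto3

redshift = boto3.client("redshift", region_name="us-east-2")

# A WLM configuration with a query monitoring rule that aborts any
# query running longer than 120 seconds (queue layout and threshold
# are illustrative, not our production values).
wlm_config = [
    {
        "query_concurrency": 5,
        "rules": [
            {
                "rule_name": "abort_long_running_queries",
                "predicate": [
                    {
                        "metric_name": "query_execution_time",
                        "operator": ">",
                        "value": 120,
                    }
                ],
                "action": "abort",
            }
        ],
    }
]

# Apply it to the parameter group attached to the client cluster
# (parameter group name is hypothetical). Note that WLM changes take
# effect only after the cluster picks up the new configuration, which
# may require a reboot.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="client-cluster-params",
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)
```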

Building the multicluster environment

This section explores an easy way to build a multicluster setup using AWS CloudFormation templates. With the templates, you can launch multiple Amazon Redshift clusters inside a VPC, in both private and public subnets, in different Availability Zones. The private subnet lets internal applications, such as the EC2 instances that run the ETL process, interact with the Amazon Redshift cluster to refresh data. You can use the public Amazon Redshift cluster for external client tools and scripts. Here is the architecture of this setup:

Let’s walk through the configuration. For demonstration purposes, we use just two Amazon Redshift clusters, one in a private subnet and one in a public subnet, but you can modify these steps to add more parallel clusters. The configuration is a two-step process: first create the network stack, and then launch the Amazon Redshift clusters into it. This process creates the following:

  • VPC and associated subnets, security groups, and routes
  • IAM roles to load data from S3 into Amazon Redshift (sketched after this list)
  • Amazon Redshift cluster or clusters
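To give a feel for the IAM role in the list above: a role that lets Amazon Redshift read from S3 boils down to a trust policy for redshift.amazonaws.com plus S3 read permissions. A hand-rolled equivalent might look like the following sketch; the role name is hypothetical, and the template’s own role may be scoped more tightly than the managed read-only policy used here.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets Amazon Redshift assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Role name is hypothetical; the CloudFormation template names its own.
iam.create_role(
    RoleName="RedshiftCopyRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Read-only S3 access is enough for COPY; the template may scope this
# down to specific buckets.
iam.attach_role_policy(
    RoleName="RedshiftCopyRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
```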

Directions

Step 1 – Create a network stack

  1. Sign in to the AWS Management Console and navigate to CloudFormation, then do the following:
    • Choose the AWS Region to launch the stack in, for example US East (Ohio).
    • Choose Create Stack.
    • Choose Specify an Amazon S3 template URL.
    • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/networkstack.json
  2. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • CIDR Prefix: Enter a class B CIDR prefix (for example, 192.168, 10.1, or 172.16).
    • Environment Tag: Name the environment anything convenient to tag the resources.
    • Key Name: The EC2 key pair to allow access to EC2 instances. If you don’t already have one, see Amazon EC2 Key Pairs in the EC2 documentation.
    • Use the default values for all other parameters and choose Create.
  3. The stack takes 10 minutes to launch, after which the network stack is ready.
  4. Review the outputs of the stack when the launch is complete to note the resource names that were created.
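If you’d rather script the launch than click through the console, a boto3 call along the following lines creates the same network stack. The parameter keys are assumptions based on the parameter labels above and may differ from the names in the actual template.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-2")

# Launch the network stack from the template URL used in the console
# walkthrough. Parameter keys are assumptions about the template.
cfn.create_stack(
    StackName="bannerconnect-network",
    TemplateURL=(
        "https://s3.amazonaws.com/salamander-us-east-1/"
        "Bannerconnect/networkstack.json"
    ),
    Parameters=[
        {"ParameterKey": "CIDRPrefix", "ParameterValue": "10.1"},
        {"ParameterKey": "EnvironmentTag", "ParameterValue": "dev"},
        {"ParameterKey": "KeyName", "ParameterValue": "my-ec2-keypair"},
    ],
    # Needed if the template creates IAM resources.
    Capabilities=["CAPABILITY_IAM"],
)

# Block until the stack finishes (roughly 10 minutes).
cfn.get_waiter("stack_create_complete").wait(
    StackName="bannerconnect-network"
)
```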

Step 2 – Create an Amazon Redshift cluster or clusters

  1. Navigate back to the CloudFormation console and do the following:
    • Choose Create Stack.
    • Choose Specify an Amazon S3 template URL.
    • Copy and paste this URL into the text box: https://s3.amazonaws.com/salamander-us-east-1/Bannerconnect/reshiftstack.json
  2. Choose Next and provide the following information:
    • Stack Name: Name the stack anything convenient.
    • Cluster Type: Choose a multi-node or single-node cluster.
    • Inbound Traffic: Allow inbound traffic to the cluster from this CIDR range.
    • Master Username: The user name that is associated with the master user account for the cluster that is being created. The default is adminuser.
    • Master User Password: The password that is associated with the master user account for the cluster that is being created.
    • Network Stack Name: The name of the active CloudFormation stack that was created in step 1, which contains the networking resources such as the subnet and security group.
    • Node Type: The node type to be provisioned for the Amazon Redshift cluster.
    • Number Of Nodes: The number of compute nodes in the Amazon Redshift cluster:
      • When the cluster type is specified as single-node, this value should be 1.
      • When the cluster type is specified as multi-node, this value should be greater than 1.
    • Port Number: The port number on which the cluster accepts incoming connections. The default is 5439.
    • Public Access: Public access to the Amazon Redshift cluster, either true or false. When this value is true, the cluster is launched in a public subnet. When this value is false, the cluster is launched in a private subnet.
    • Use the default values for all other parameters and choose Create.
  3. The stack takes 10 minutes to launch, after which the Amazon Redshift cluster is launched in the network stack.
  4. Review the outputs of the stack when the launch is complete to note the resource names that were created for the Amazon Redshift cluster.
  5. Repeat steps 1–4 of this section to add more Amazon Redshift clusters to this network stack.
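The cluster stack can be scripted the same way. Again, the parameter keys are assumptions based on the labels above, and the master password is a placeholder:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-2")

# Launch the Amazon Redshift stack into the network stack created in
# step 1. Parameter keys are assumptions about the template.
cfn.create_stack(
    StackName="bannerconnect-redshift-private",
    TemplateURL=(
        "https://s3.amazonaws.com/salamander-us-east-1/"
        "Bannerconnect/reshiftstack.json"
    ),
    Parameters=[
        {"ParameterKey": "ClusterType", "ParameterValue": "multi-node"},
        {"ParameterKey": "MasterUsername", "ParameterValue": "adminuser"},
        {"ParameterKey": "MasterUserPassword",
         "ParameterValue": "..."},  # placeholder
        {"ParameterKey": "NetworkStackName",
         "ParameterValue": "bannerconnect-network"},
        {"ParameterKey": "NodeType", "ParameterValue": "dc2.large"},
        {"ParameterKey": "NumberOfNodes", "ParameterValue": "2"},
        {"ParameterKey": "PublicAccess", "ParameterValue": "false"},
    ],
    Capabilities=["CAPABILITY_IAM"],
)
cfn.get_waiter("stack_create_complete").wait(
    StackName="bannerconnect-redshift-private"
)

# Print the stack outputs, which include the cluster resource names.
stack = cfn.describe_stacks(
    StackName="bannerconnect-redshift-private"
)["Stacks"][0]
for out in stack.get("Outputs", []):
    print(out["OutputKey"], "=", out["OutputValue"])
```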

With this easy deployment using the AWS CloudFormation templates, you can launch all the resources needed for a multicluster setup with a few clicks.

Conclusion

Migrating to Amazon Redshift and setting up the data warehouse on AWS enabled us to build highly scalable, decoupled applications and to use different clusters for different use cases. Operationally, we built robust, independently managed dev, test, and prod environments that make it easy to implement complex data workflows.

Recently, we started using Amazon Redshift Spectrum to query data directly from Amazon S3, without needing to load the data into Amazon Redshift. This saves us loading time and speeds up time to analytics, creating many new possibilities for us. Loading dynamic data with different formats and columns becomes a lot easier with Amazon Redshift Spectrum.
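As a minimal sketch of the Spectrum workflow (the schema, table, role, and bucket names are hypothetical), you register an external schema backed by the data catalog, define an external table over the files in S3, and query it like any other table:

```python
import psycopg2

# Connect to the cluster as before (endpoint and credentials are
# placeholders).
conn = psycopg2.connect(
    host="prod-cluster.xxxxxxxx.us-east-2.redshift.amazonaws.com",
    port=5439,
    dbname="logs",
    user="adminuser",
    password="...",  # placeholder
)
conn.autocommit = True

with conn.cursor() as cur:
    # Register an external schema backed by the data catalog
    # (role ARN and database name are hypothetical).
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'logdata'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)

    # Describe log files on S3 as an external table; no loading needed.
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum.impressions (
            event_time  TIMESTAMP,
            campaign_id BIGINT,
            cost        DOUBLE PRECISION
        )
        STORED AS PARQUET
        LOCATION 's3://bannerconnect-logs/impressions/';
    """)

    # Query the data directly in S3 from Amazon Redshift.
    cur.execute("""
        SELECT campaign_id, COUNT(*) AS impressions
        FROM spectrum.impressions
        GROUP BY campaign_id
        ORDER BY impressions DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```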

About the Authors

Danny Stommen has been working for Bannerconnect for 10 years, currently as a Solutions Architect, spending most of his time working on the CORE solution. Outside of work, he enjoys spending quality time with his family and actively playing soccer.

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Before working at AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.