AWS Big Data Blog

Build a serverless tracking pixel solution in AWS

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more.

Let’s describe a typical use case where a tracking pixel solution, also known as a web beacon, might help: analyzing web traffic is critical to understanding user behavior in order to improve the user experience. Consider a company, Example Company Hotels, that embeds a piece of HTML into a high-traffic, third-party website (example.HighTrafficWebsite.com) to gain more visibility into how users interact with its products. Assume that this fragment of code can’t make calls to the example.HighTrafficWebsite.com backend, because the third party doesn’t want to add delays to its website loading times. How could Example Company Hotels track as much information as possible about user behavior on example.HighTrafficWebsite.com in order to understand whether its offering reaches the right user profiles?

A tracking pixel uses a 1×1 pixel image to leverage the image loading call to send tracking information to a backend server (generally known as a beacon web server). Instead of a traditional JavaScript API call, the information is sent in the parameters of the image GET request and in the HTTP headers themselves, making it possible to include it in any component supporting HTML, like a webpage or even an email. See the following code:

<img src="https://example.examplecompanyHotels.com/trackingpixel?userid=aws_user&thirdpartyname=example.hightrafficwebsite.com">

You could implement a real-time analytics solution (see Real-Time Web Analytics with Kinesis Data Analytics), but the information involved in the tracking pixel use case typically doesn’t require real time, so you can take a simpler and more cost-effective approach.

This post shows how to build a serverless tracking pixel solution in AWS, avoiding the undifferentiated heavy lifting of traditional solutions based on self-managed beacon web servers.

In building any serverless tracking pixel solutions, we recommend that you confirm whether any privacy laws or regulations apply and ensure that any solutions are compliant with them.

Overview of solution

Any tracking pixel solution needs some key elements in its architecture:

  • A beacon web server to receive the tracking information. This web server should be able to cope with fluctuating traffic demand and take into consideration all the security concerns around having a public endpoint exposed to the internet.
  • A streaming engine to ingest the incoming information, also able to scale according to the traffic needs.
  • A storage layer to keep the information so it can be analyzed later.
  • A visualization tool to turn the information into business insights.

The following architecture diagram represents a fully serverless solution that eliminates the heavy lifting of managing servers. It scales automatically based on the incoming traffic, and its total cost of ownership (TCO) is tied directly to usage.

The architecture includes the following main components, presented in their order in the workflow:

  • AWS Shield – A managed protection service that automatically safeguards applications running on AWS against the most common network and transport layer DDoS attacks, at no additional cost.
  • Amazon API Gateway – API Gateway defines HTTP APIs that enable you to send requests to AWS Lambda functions, and adds useful functionality like monitoring, security, throttling, method filtering, and even the option to use a custom domain.
  • AWS Lambda – The Lambda function acts as a serverless version of the traditional beacon web server. It transforms the incoming web request into a structured format that facilitates analysis and publishes it to the streaming service (see the sketch after this list).
  • Amazon Kinesis Data Firehose – Kinesis Data Firehose is the easiest way to reliably ingest streaming data into data lakes, data stores, and analytics tools. The streaming information is buffered (in this case at the maximum value, corresponding to 15 minutes) to consolidate it into a smaller number of files to store in the data lake. That, in addition to the automatic conversion to columnar, compressed formats like Apache Parquet or ORC that the service offers, minimizes the costs of storage and future queries.
  • Amazon Simple Storage Service – Amazon S3 is a managed object storage service that acts as the foundation of the data lake because of its industry-leading scalability, data availability, security, and performance. It also allows you to configure lifecycle rules to move the old records to cold storage or even delete them.
  • AWS Glue – The AWS Glue Data Catalog acts as the central metadata repository, making it easy to discover and access data in the data lake. You can use AWS Glue crawlers to automatically create schema tables of the catalog.
  • Amazon Athena – This interactive query service makes it easy to analyze data in Amazon S3 using standard SQL, and is integrated out of the box with the AWS Glue Data Catalog.
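
The repository referenced later in this post contains the actual function code. Purely as orientation, a minimal sketch of this beacon pattern might look like the following, assuming a hypothetical DELIVERY_STREAM_NAME environment variable holds the delivery stream name:

import json
import os

import boto3

firehose = boto3.client("firehose")

# Base64 encoding of a 1x1 transparent GIF, returned so the <img> tag
# renders without errors in the embedding page or email
PIXEL_B64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"


def lambda_handler(event, context):
    # Collect the tracking data carried by the image GET request
    record = {
        "queryParams": event.get("queryStringParameters") or {},
        "headers": event.get("headers") or {},
    }

    # Publish one record to the delivery stream; the trailing newline
    # lets downstream tools split the buffered files back into records
    firehose.put_record(
        DeliveryStreamName=os.environ["DELIVERY_STREAM_NAME"],
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

    # Respond with the pixel so the browser receives a valid image
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "image/gif"},
        "body": PIXEL_B64,
        "isBase64Encoded": True,
    }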

To implement this solution, we complete the following high-level steps:

  1. Create a Kinesis Data Firehose delivery stream.
  2. Deploy a Lambda function with an API Gateway HTTP endpoint.
  3. Modify the Lambda function to extract the relevant data for your use case.
  4. Create the tables in the AWS Glue Data Catalog.
  5. Query the tracking pixel information with Athena.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Although no specific knowledge is needed, you should be able to navigate the AWS Management Console and have a basic awareness of AWS services
  • A development environment with the dependencies needed to deploy the solution with AWS SAM: the AWS Command Line Interface (AWS CLI), the AWS SAM CLI, and Python 3

If you don’t have the specified environment or don’t want to install anything on your computer, you can use AWS Cloud9, a cloud IDE that lets you write, run, and debug your code with just a browser.

Creating a Kinesis Data Firehose delivery stream

To create your delivery stream, complete the following steps:

  1. On the Kinesis Data Firehose console, create a new delivery stream.
  2. For Source, select Direct PUT or other sources.
  3. When requested, create a new S3 bucket and specify a prefix and an error prefix based on the year, month, day, and hour, compatible with the Hive partitioning style, to facilitate queries.

We could use the Kinesis Data Firehose partitioning mechanism instead, but then the partition names would have to be added to the catalog manually.

The following is the prefix code:

year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/

The following is the error prefix code:

fherroroutputbase/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/
  4. Set the buffer conditions at their minimum values for now (5 MiB or 60 seconds). The rest of the configurations can be left at their default values.
  5. Copy the delivery stream ARN and the delivery stream name; you need them in the next step.
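
If you prefer to script these console steps, the following boto3 sketch creates an equivalent delivery stream. The bucket ARN and IAM role ARN are placeholders that you must replace with your own values:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="TrackingPixelStream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        # Placeholder ARNs: use your bucket and a role that allows
        # Kinesis Data Firehose to write to it
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-tracking-pixel-bucket",
        # Hive-style partitioning prefixes from the previous steps
        "Prefix": "year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/",
        "ErrorOutputPrefix": "fherroroutputbase/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/",
        # Minimum buffer conditions for now; raise the interval to
        # 900 seconds (15 minutes) once the solution is validated
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
    },
)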

Deploying the Lambda function with an API Gateway HTTP endpoint

We use the AWS Serverless Application Model (AWS SAM), a framework extending AWS CloudFormation syntax, to easily define serverless components such as Lambda functions or Amazon DynamoDB tables.

  1. Clone or fork the following repository: aws-serverless-tracking-pixel.
  2. Deploy the solution by following the instructions of the README.md file.
  3. Get the URL of your new tracking pixel API in the TrackingPixelProcessingAPIURL output parameter.
  4. Check if everything is working as expected by accessing the API URL in your browser:
    https://<TrackingPixelProcessingAPIURL>?userid=aws_user&thirdpartyname=example.hightrafficwebsite.com

After 60 seconds, a folder structure with a file inside should appear in the S3 bucket on the Amazon S3 console.
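
You can also script this check. The following sketch assumes the placeholder endpoint is replaced with your actual TrackingPixelProcessingAPIURL value, and simply verifies the status code and content type of the response:

import urllib.request

# Placeholder endpoint: replace with your TrackingPixelProcessingAPIURL value
URL = ("https://abc123.execute-api.us-east-1.amazonaws.com/"
       "?userid=aws_user&thirdpartyname=example.hightrafficwebsite.com")

with urllib.request.urlopen(URL) as response:
    # A healthy endpoint returns HTTP 200 and a GIF payload
    print(response.status, response.headers.get("Content-Type"))
    assert response.status == 200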

Modifying the Lambda function to extract the relevant data for your use case

The provided Lambda function illustrates how to parse the HTTP request and publish it to the streaming resource. You should modify it so that it extracts the information relevant to your use case instead.

  1. Implement your own extraction logic in the Lambda function and redeploy it (for an idea of what this could look like, see the sketch after this list).

We recommend that you do this in your local copy of the repository and follow the same AWS SAM deployment process that you used to deploy the base solution. The function code is stored in the <local_repository>/trackingPixelProcessing/app.py file.

If you’re not sure about the request structure received by the function, you can print the event parameter and look for the trace information in Amazon CloudWatch Logs.

  2. Delete any previous content you might have in the target S3 bucket.
  3. Make a couple of invocations to the API endpoint, including all the fields that you want to extract, and verify that the content is properly created in the S3 bucket with the correct structure.
  4. When your custom processing logic is working and the S3 bucket only contains files with the final structure, go to the Kinesis Data Firehose delivery stream configuration and update its buffer conditions so that a single file containing all the records is generated every 15 minutes instead of every 60 seconds.
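
As an illustration only (this isn’t the repository’s code), the following sketch shows the kind of fields you might extract, assuming the API Gateway HTTP API payload format version 2.0:

from datetime import datetime, timezone


def extract_tracking_record(event):
    # Hypothetical example: adapt the selected fields to your use case
    params = event.get("queryStringParameters") or {}
    headers = event.get("headers") or {}
    http_ctx = event.get("requestContext", {}).get("http", {})

    return {
        # Ingestion timestamp, useful for ordering the records later
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Fields carried in the tracking pixel query string
        "userid": params.get("userid"),
        "thirdpartyname": params.get("thirdpartyname"),
        # Context implicitly provided by the browser and API Gateway
        "useragent": headers.get("user-agent"),
        "referer": headers.get("referer"),
        "sourceip": http_ctx.get("sourceIp"),
    }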

Cataloging the data

We use AWS Glue crawlers to automatically generate a table in your AWS Glue Data Catalog.

  1. On the AWS Glue console, create a new table using a crawler.
  2. For Crawler name, enter a name (for example, TrackingPixelInfo).

The crawler source type is a data store that points to the created S3 bucket. It’s important to use the root of the bucket so the appropriate partitions are automatically built based on the folder structure.

  3. For Choose a data store, choose S3.
  4. For Crawl data in, select Specified path in my account.
  5. For Include path, enter your S3 path.
  6. Create a new AWS Identity and Access Management (IAM) role so the crawler automatically has all the permissions needed to query the S3 bucket.
  7. Configure the crawler to run manually by specifying Run on demand as its frequency.
  8. For Database, add a new database.
  9. For Prefix, specify a prefix for the tracking pixel tables (for example, tp_).
  10. After you create the crawler, run it manually.

After a while, a new table appears in the catalog with the tracking pixel structure plus the partition fields that allow you to optimize queries. You can modify the schema, for example, to change column names or their data types. For this post, we change the data type of the partition columns to int.
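
These console steps can also be scripted. The following boto3 sketch creates and runs an equivalent crawler; the S3 path, database name, and IAM role are placeholders and are assumed to exist already:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="TrackingPixelInfo",
    # Placeholder role: must be able to read the S3 bucket
    Role="example-glue-crawler-role",
    # Placeholder database for the tracking pixel tables
    DatabaseName="examplecompanyhotelsdatalake",
    TablePrefix="tp_",
    # Point at the bucket root so partitions are derived from the folders
    Targets={"S3Targets": [{"Path": "s3://examplecompanyhotelsdatalake/"}]},
)

# The crawler has no schedule, so run it on demand
glue.start_crawler(Name="TrackingPixelInfo")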

Querying the tracking pixel information with Athena

We can use Athena to query the tracking pixel information using SQL syntax. Athena can also be used directly by other analytical services in AWS or even by third-party analytics tools with ODBC or JDBC drivers.

  1. Sign in to the Athena console.
  2. If it’s your first time using Athena on this account, you’re asked to specify an S3 bucket for storing the query results and metadata.
  3. Choose the table created by the AWS Glue crawler.
  4. Run the following query:
    SELECT * FROM tp_examplecompanyhotelsdatalake

The following screenshot shows our query results.

New partition values don’t appear automatically, because they have to be added to the catalog. You should configure a scheduled AWS Glue crawler that doesn’t modify the table structure but updates the partition values. For testing purposes, you can simply choose Load partitions on the table’s contextual menu.
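
Because the S3 prefixes follow the Hive naming style, you can also load new partitions programmatically with the MSCK REPAIR TABLE statement. The following sketch runs it through the Athena API; the database name and results bucket are placeholders:

import boto3

athena = boto3.client("athena")

# MSCK REPAIR TABLE discovers the Hive-style partitions
# (year=/month=/day=/hour=) added since the last catalog update
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE tp_examplecompanyhotelsdatalake",
    QueryExecutionContext={"Database": "examplecompanyhotelsdatalake"},
    # Placeholder bucket for the Athena query results
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)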

Cost considerations

The cost of the solution scales linearly with its usage. It varies depending on the message size, the number of requests, and the runtime of the Lambda function. As a reference, the solution shown in this step-by-step guide costs around $50 per month to process 1 million requests per day.

Improvements and next steps

The basic skeleton of the solution has been built. In this section, we show improvements that you can implement and how to extend or generalize the solution.

Using a columnar format for the data

The current implementation uses plain JSON as the file format stored in Amazon S3 by Kinesis Data Firehose. A columnar compressed format like Apache Parquet or Apache ORC reduces file size and improves Athena query performance. You can use Kinesis Data Firehose to convert the record format to Apache Parquet or Apache ORC.
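
Record format conversion needs a schema, which can be taken from the table that the AWS Glue crawler created. As a sketch, reusing the placeholder stream, role, database, and table names from earlier in this post, you could enable the conversion on the existing delivery stream like this (note that Kinesis Data Firehose requires a buffer size of at least 64 MiB when format conversion is enabled):

import boto3

firehose = boto3.client("firehose")

# Look up the identifiers that update_destination requires
desc = firehose.describe_delivery_stream(DeliveryStreamName="TrackingPixelStream")
stream = desc["DeliveryStreamDescription"]

firehose.update_destination(
    DeliveryStreamName="TrackingPixelStream",
    CurrentDeliveryStreamVersionId=stream["VersionId"],
    DestinationId=stream["Destinations"][0]["DestinationId"],
    ExtendedS3DestinationUpdate={
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Read the incoming JSON records...
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # ...and write them out as Apache Parquet
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # Schema taken from the crawler-generated table (placeholders)
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
                "DatabaseName": "examplecompanyhotelsdatalake",
                "TableName": "tp_examplecompanyhotelsdatalake",
            },
        }
    },
)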

Creating a custom domain for the endpoint

API Gateway automatically generates the default API endpoint. You might have some requirements to include a specific domain name, or your request may be blocked due to CSRF because the endpoint domain isn’t trusted. With API Gateway custom domain names, you can set up your API’s hostname and choose a base path to map the alternative URL to your API.

Increasing endpoint security

AWS Shield automatically protects the public endpoint against DDoS attacks. You can increase this security by using AWS WAF to protect your APIs. It protects the endpoint from common exploits like SQL injection and cross-site scripting attacks, and allows you to block specific IPs or apply any custom rule to avoid specific attacks.

Implementing additional analytics capabilities

The current solution uses Athena to run analytic queries to understand usage patterns. However, some insights may be better extracted with other tools; for example, you could use Amazon QuickSight, which integrates natively with Athena, to turn the information into dashboards and visualizations.

Cleaning up

To avoid incurring future charges, delete the resources generated if you don’t need the solution anymore:

  • Kinesis Data Firehose delivery stream
  • AWS CloudFormation stack deployed with AWS SAM
  • AWS Glue crawler
  • IAM role created for the AWS Glue crawler
  • AWS Glue Data Catalog database
  • S3 bucket created during the initial Athena configuration
  • Any resource created in the Improvements and next steps section

Conclusion

You can easily build the skeleton of a fully serverless solution to capture, store, and analyze user behavior by using a tracking pixel that could be embedded in websites and emails. This solution scales automatically, reduces operating costs because no servers need to be managed, and can be easily adapted to your specific needs, improving on the traditional web beacon server approach.


About the Authors

Albert Capdevila is an AWS Solutions Architect based in sunny Barcelona, helping customers build their AWS workloads according to best practices. After more than 15 years of working in projects around B2B, integration, and business process-oriented architectures, he’s now focused on machine learning and cloud areas, being a true believer of cloud adoption as the new normal for companies. Albert is currently trying to generate a forecasting model to know how much free time his sons will leave him to go climb mountains.