Capture clickstream data using AWS serverless services
Clickstream data refers to the collection of digital interactions that occur between a user and a website or mobile application. Capturing user data and turning it into usable insights in real time can be challenging. Amazon Web Services (AWS) serverless services can help by providing a scalable architecture to capture, process, and visualize clickstream data and load it into analytics platforms.
User interactions encompass a wide range of actions, including clicks on links or buttons, views of different pages, the duration of time spent on specific pages, submissions of forms, downloads of files, and many other activities that take place within the digital environment. To learn about why clickstream data is critical for organizations, please refer to Driving Business Outcomes with Clickstream Data.
In this blog post, we take a closer look at how AWS services make it easier to capture and process clickstream data without the need to provision and manage servers.
Architecture
The solution uses Amazon API Gateway, AWS Lambda and Amazon Kinesis Data Streams to ingest and process clickstream data, Amazon Kinesis Data Firehose to save the raw data in Amazon Simple Storage Service (Amazon S3), then Amazon Athena and Amazon QuickSight to analyze and visualize data in a user-friendly manner.
Why did we choose these services?
Clickstream data streams in continuously as a large volume of messages, at highly variable rates that depend on user traffic and behavior. When evaluating the performance of new application features, website layouts, or marketing campaigns, it is crucial to analyze this data in real time so that you can act on it promptly.
The AWS services selected for this architecture offer autoscaling capabilities and cost-efficient solutions for processing clickstream data. These services dynamically scale resources to accommodate the fluctuations in the incoming workload, ensuring near real-time processing and analysis. With a pay-as-you-go pricing model, you only pay for the resources consumed, eliminating the need for overprovisioning and minimizing costs.
Amazon API Gateway is a fully managed service that makes it straightforward for developers to create, publish, maintain, monitor, and secure APIs at any scale.
AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers. You can trigger Lambda from over 200 AWS services and software as a service (SaaS) applications—only paying for what you use.
Amazon Kinesis Data Streams is a serverless streaming data service that facilitates the capture, processing, and storage of data streams at any scale.
Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.
Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps.
Amazon Athena provides a straightforward, flexible way to analyze petabytes of data where it lives. With Athena, you can analyze data or build applications from an Amazon S3 data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python.
Amazon QuickSight powers data-driven organizations with unified business intelligence (BI) at hyperscale. With QuickSight, all users can meet varying analytic needs from the same source of truth through modern interactive dashboards, paginated reports, embedded analytics, and natural language queries.
Architecture diagram
Figure 1 presents the clickstream data flow architecture and shows how the clickstream payload progresses through a series of steps. The customer web portal in the diagram represents a digital platform, such as a website or mobile application, that users interact with. As users navigate the web portal and click on different links, the clickstream data passes through the following stages.
Figure 1 – Architecture
1. The client (customer web portal) sends the clickstream payload (record) to the API Gateway.
2. The API Gateway transmits the record to Lambda, where the data is standardized.
3. Lambda sends the record to Kinesis Data Streams for asynchronous processing.
4. Kinesis Data Streams delivers the record to Kinesis Data Firehose.
5. Kinesis Data Firehose buffers incoming records for up to 60 seconds and then uploads them to an S3 bucket.
6. Athena is used to query and analyze the data stored in the S3 bucket.
7. QuickSight is used to create dashboards and display the data visually.
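Steps 2 and 3 of the flow above can be sketched as a small Lambda handler. This is a minimal illustration, not the function deployed by the CloudFormation template: the stream name, the CSV field order, the key lower-casing, and the assumption that API Gateway uses the Lambda proxy integration (payload in event["body"]) are all hypothetical. Only the boto3 call put_record is an actual Kinesis Data Streams API.

```python
import csv
import io
import json

# Assumed column order, taken from the data points listed later in this post.
FIELDS = ("customerid", "deviceid", "productid",
          "productcategory", "productsubcategory", "activitytype")
STREAM_NAME = "clickstream-stream"  # hypothetical stream name


def normalize(payload):
    """Lower-case the keys and strip whitespace from the values."""
    return {str(key).lower(): str(value).strip() for key, value in payload.items()}


def standardize(payload):
    """Turn one payload into a single CSV line with a fixed column order."""
    normalized = normalize(payload)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([normalized.get(name, "") for name in FIELDS])
    return buffer.getvalue()


def handler(event, context, kinesis=None):
    """Standardize the record and hand it to Kinesis Data Streams."""
    if kinesis is None:
        import boto3  # imported lazily so the sketch can be exercised without AWS
        kinesis = boto3.client("kinesis")
    normalized = normalize(json.loads(event["body"]))
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=standardize(normalized).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        PartitionKey=normalized.get("customerid") or "unknown",
    )
    return {"statusCode": 200, "body": json.dumps({"status": "queued"})}
```

Passing the Kinesis client as an optional argument keeps the handler easy to test with a stub while still working unchanged when Lambda invokes it with only event and context.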
Prerequisites
To deploy this solution, you must have the following:
1. An AWS account.
2. AWS Identity and Access Management (IAM) permissions to run an AWS CloudFormation template.
Step 1: Solution Implementation
Complete the following steps to create AWS resources to build a clickstream pipeline as mentioned in the architecture. For this post, we use the AWS Region us-east-1.
1. Launch the CloudFormation Stack.
2. On the Create stack page, choose Next.
Figure 2 – Create Stack using CloudFormation
3. On the Specify stack details page, click Next.
Figure 3 – Stack details
4. On the Configure stack options page, keep default selections and click Next.
5. On the Review page, acknowledge that AWS CloudFormation might create IAM resources. For more information about IAM, see Resources to learn more about IAM.
6. Click Submit.
Figure 4 – IAM resources acknowledgment
7. Once the stack is completely deployed, stack status will change to CREATE_COMPLETE. You can check the Resources tab for the resources created by the CloudFormation stack.
Figure 5 – Stack status
Figure 6 – Provisioned resources
8. Before running queries in Athena, we need to select an S3 bucket to store the query results. Follow steps 9 through 15 to configure the S3 bucket path.
9. Navigate to the Amazon Athena console and select Query your data.
10. Click Launch query editor.
Figure 7 – Amazon Athena console
11. In the Query editor page, select the Settings tab.
12. In the Query result and encryption settings section, click Manage.
Figure 8 – Query editor settings
13. On the Manage settings page, click Browse S3.
14. On the Choose S3 data set page, select the S3 bucket beginning with clickstream- and click Choose.
Figure 9 – Query result location
15. On the Manage settings page, click Save.
Figure 10 – Confirm settings
16. Now follow steps 17 through 20 to create a table in Athena to store the clickstream data.
17. On Query editor, select the Saved queries tab.
Figure 11 – Athena Saved queries
18. In the Name field, find the name that begins with ClickstreamAthenaNamedQuery- and click the ID link associated with it.
19. Ensure that the query is populated in the Editor tab as shown in Figure 12. Click Run.
Figure 12 – Execute query
20. Once the query executes, a clickstream_data table will be created with the necessary columns shown in Figure 13.
Figure 13 – Table creation
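For orientation, the table definition could look roughly like the statement built below. This is a hypothetical approximation, not the saved query that ships with the stack: the column list is taken from the data points used later in this post, and the STRING column types, the comma delimiter, and the S3 location argument are assumptions.

```python
# Assumed columns; the saved query in your account is the authoritative definition.
COLUMNS = ("customerid", "deviceid", "productid",
           "productcategory", "productsubcategory", "activitytype")


def build_create_table(s3_location, table="clickstream_data"):
    """Build an Athena DDL statement for a comma-delimited external table."""
    columns = ",\n  ".join(f"{name} STRING" for name in COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {columns}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'"
    )


if __name__ == "__main__":
    # The bucket name here is a placeholder, not a real resource.
    print(build_create_table("s3://example-clickstream-bucket/"))
```

An external table only describes data that already lives in Amazon S3, so dropping the table later does not delete the underlying objects.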
Step 2: Testing the Solution
We will use a Lambda function that simulates clickstream data by emitting the following data points:
- customerid
- deviceid
- productid
- productcategory
- productsubcategory
- activitytype
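A payload with these data points could be generated along the following lines. This is an illustrative sketch only: the catalog, device IDs, activity types, and ID formats are made-up values, and the deployed Lambda function may generate its payloads differently.

```python
import random

# All values below are hypothetical sample data.
CATALOG = {
    "electronics": ["phones", "laptops"],
    "books": ["fiction", "nonfiction"],
}
DEVICE_IDS = ["mobile", "tablet", "desktop"]
ACTIVITY_TYPES = ["click", "view", "add_to_cart"]


def make_payload(rng=None):
    """Return one random clickstream payload as a dictionary."""
    rng = rng or random.Random()
    category = rng.choice(sorted(CATALOG))
    return {
        "customerid": f"cust-{rng.randint(1, 100):03d}",
        "deviceid": rng.choice(DEVICE_IDS),
        "productid": f"prod-{rng.randint(1, 500):04d}",
        "productcategory": category,
        # The subcategory is always drawn from the chosen category.
        "productsubcategory": rng.choice(CATALOG[category]),
        "activitytype": rng.choice(ACTIVITY_TYPES),
    }


if __name__ == "__main__":
    print(make_payload(random.Random(42)))
```

Accepting a seeded random.Random instance makes the generated traffic reproducible, which is convenient when comparing test runs.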
Step 2a: Ingest Data
1. Navigate to the AWS Lambda console.
2. Click on the Lambda function beginning with Clickstream-IngestDataLambda.
Figure 14 – Select Lambda
3. Select the Test tab. Enter the event name of your choice (for example, ClickstreamTest). Click Save, then click Test.
Figure 15 – Lambda test configuration
4. The Lambda function will start generating random clickstream payloads and pass them to the API Gateway, as shown in step 1 of the Figure 1 architecture diagram.
5. It may take up to a minute for the function to execute. Check the Log Output section to view clickstream payloads generated by the Lambda function.
Figure 16 – Lambda logs
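Sending one payload to the API Gateway endpoint amounts to an HTTP POST with a JSON body. The sketch below uses only the Python standard library. The endpoint URL is a placeholder, and the use of POST with a JSON content type is an assumption about how the API is configured, so check the deployed API before relying on it.

```python
import json
import urllib.request

# Placeholder endpoint; replace it with the invoke URL of the deployed API.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/clickstream"


def build_request(api_url, payload):
    """Build (but do not send) a JSON POST request for one clickstream payload."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(api_url, payload, timeout=10):
    """Send the payload and return the HTTP status code."""
    with urllib.request.urlopen(build_request(api_url, payload), timeout=timeout) as response:
        return response.status
```

Separating build_request from send lets you inspect or log the exact request without making a network call.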
Step 2b: Validate Data
1. Once the Lambda function is executed, it may take a few minutes for the data to show up in Amazon S3. Kinesis Data Firehose buffers incoming records before delivering them to your S3 bucket. To upload the clickstream data to Amazon S3 quickly, we have set the buffer interval to 60 seconds, which is the minimum allowed value for Kinesis Data Firehose.
2. Navigate to the Amazon S3 console.
3. Click on the S3 bucket starting with clickstream-clickstreams3bucket- and navigate through the underlying folder structure to locate the clickstream data file. Amazon Kinesis Data Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH before writing objects to Amazon S3. The prefix translates into an Amazon S3 folder structure, where each label, separated by a forward slash (/), becomes a sub-folder.
Figure 17 – Amazon S3 path
4. Click on the object link and click Open to open the CSV file.
Figure 18 – Open file
5. Examine the data. It should look similar to the data shown in Figure 19.
Figure 19 – Examine data
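The folder structure described in step 3 can be reproduced from a timestamp, which helps when you need to work out where a given hour of data should be. The helper below is a hypothetical illustration of the YYYY/MM/DD/HH layout, not code used by Kinesis Data Firehose. It expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def firehose_time_prefix(moment):
    """Return the UTC YYYY/MM/DD/HH/ prefix for a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y/%m/%d/%H/")


if __name__ == "__main__":
    # A record delivered at 20:30 in a UTC+5 time zone is stored under hour 15 (UTC).
    local = datetime(2023, 7, 4, 20, 30, tzinfo=timezone(timedelta(hours=5)))
    print(firehose_time_prefix(local))
```

Because the prefix is always in UTC, the hour folder may differ from the local time at which you ran the test.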
Step 2c: Analyze Data
1. Navigate to the Query Editor on the Amazon Athena console.
2. Ensure AwsDataCatalog is selected in the Data source field.
3. Select clickstreamDb from the Database field.
4. Expand Tables by clicking on the directional triangle to the left of Tables.
5. Locate the clickstream_data table and click on the three vertical dots to the extreme right of the clickstream_data table name. Select Preview Table from the drop-down menu.
Figure 20 – Preview table
6. You should see the data ingested through the IngestClickstream Lambda function, as shown in Figure 21.
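Beyond previewing the table, the same data can be queried programmatically. The sketch below builds a simple aggregation and submits it with the boto3 Athena client call start_query_execution. The query itself, the function names, and the database and output location arguments are illustrative assumptions. Pass the database name and query result location exactly as they appear in your Athena console.

```python
def build_category_query(table="clickstream_data"):
    """Return SQL that counts records per product category."""
    return (
        "SELECT productcategory, COUNT(*) AS clicks "
        f"FROM {table} "
        "GROUP BY productcategory "
        "ORDER BY clicks DESC"
    )


def start_category_query(athena, database, output_location):
    """Submit the query and return the Athena query execution ID."""
    response = athena.start_query_execution(
        QueryString=build_category_query(),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    print(build_category_query())
```

Here athena is expected to be boto3.client("athena"). Queries run asynchronously, so the returned ID is what you would later use to check status and fetch results.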
Step 3: Creating dashboards using Amazon QuickSight
1. Navigate to the Amazon QuickSight console.
a. If you have not created a QuickSight account before, Steps 2 through 6 take you through creating an account. If you already have a QuickSight account, go to Step 7.
2. Click on the Sign up for QuickSight button.
3. On the Create your QuickSight account page, keep the default selections and click Continue.
4. On the Get Paginated Report add-on, click No, Maybe Later.
5. On the Create your QuickSight account page:
a. Under Authentication method section, select Use IAM federated identities & QuickSight-managed users.
b. Select the US East (N. Virginia) region.
c. Enter a unique QuickSight account name.
d. Enter an email address to receive notifications.
e. Under IAM Role, select Use QuickSight-managed role (default).
f. Under Allow access and autodiscovery for these resources, check IAM, Amazon S3 and Amazon Athena checkboxes. Uncheck all other boxes as shown in Figure 22.
g. Under the Amazon S3 resource you just selected, click the Select S3 bucket link. In the pop-up window, select the S3 bucket starting with clickstream-clickstreams3bucket- and click Finish to close the window, as shown in Figure 23.
6. Click Finish to complete the creation of your Amazon QuickSight account.
Figure 22 – Create QuickSight account
Figure 23 – Select S3 bucket
7. Click Go to Amazon QuickSight. This will open QuickSight.
Figure 24 – Account created
8. Ensure Analyses is selected in the navigation menu on the left side.
9. Click on the New analysis button on the top right corner. On the next page click on the New dataset button.
Figure 25 – New analysis
10. On the Create a Dataset page, select Athena.
11. On the New Athena data source page enter the Data source name of your choice and click on Create data source.
Figure 26 – Create data source
12. On the Choose your table page, select:
a. awsDataCatalog under Catalog: contain sets of databases.
b. clickstream_db under Database: contain sets of tables.
c. clickstream_data under Tables: contain the data you can visualize.
13. Click on Select when done.
Figure 27 – Select table
14. On the Finish dataset creation page, keep default selections and click Visualize.
Figure 28 – Finish dataset creation
15. On the next page, make certain Interactive Sheet is selected. Keep the default selections and click Create.
Figure 29 – Finish dataset creation continued
We will create a visual to analyze the number of clicks in each product category.
16. Click Add in the top left corner and then click Add Visual.
17. Select the Pie chart icon from the Visual Types.
18. Drag and drop productcategory in the Group/Color field as shown in Figure 30. This will generate a pie chart showing the count of records by productcategory.
Figure 30 – Product categories
Let’s add one more visual to analyze the number of subcategories clicked under each category.
19. Click Add in the top left corner and then click Add Visual.
20. Select the Vertical Stacked bar chart icon from the Visual Types.
21. Drag and drop productcategory in the X axis field and productsubcategory in the Group/Color field as shown in Figure 31.
Figure 31 – Product categories / subcategories
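The two visuals correspond to simple group-and-count aggregations. The sketch below reproduces them on raw CSV text to make clear what each chart is counting. It assumes the delivered files have no header row and use the column order shown, both of which are assumptions about the file layout.

```python
import csv
import io
from collections import Counter

# Assumed column order of the delivered CSV records.
COLUMNS = ("customerid", "deviceid", "productid",
           "productcategory", "productsubcategory", "activitytype")


def count_clicks(csv_text):
    """Return (records per category, records per (category, subcategory))."""
    by_category = Counter()
    by_subcategory = Counter()
    for row in csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS):
        by_category[row["productcategory"]] += 1
        by_subcategory[(row["productcategory"], row["productsubcategory"])] += 1
    return by_category, by_subcategory


if __name__ == "__main__":
    sample = (
        "c1,d1,p1,books,fiction,click\n"
        "c2,d2,p2,books,history,view\n"
        "c3,d3,p3,toys,puzzles,click\n"
    )
    categories, subcategories = count_clicks(sample)
    print(categories)      # pie chart: one slice per category
    print(subcategories)   # stacked bar: one segment per subcategory
```

The first counter matches the pie chart, and the second matches the stacked bar chart, where each bar is a category and each segment is a subcategory.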
Step 4: Clean-up
To avoid incurring future charges, follow these steps to remove the resources:
1. Delete the QuickSight account.
a. Click on the profile icon in the top right corner.
b. Click on Account Settings.
c. Click on the Manage button.
Figure 32 – QuickSight account settings
d. Turn off Account termination protection by sliding the Account Termination toggle to the left.
e. Type confirm in the Type “confirm” to delete this account box.
f. Click Delete Account.
Figure 33 – QuickSight account termination
2. Empty the S3 bucket.
a. Navigate to the Amazon S3 console.
b. Select the bucket starting with clickstream-clickstreams3bucket, click Empty.
c. On the Empty bucket page, enter permanently delete in the confirm deletion field and click Empty.
Figure 34 – Select S3 bucket
Figure 35 – Empty S3 bucket
3. Delete the solution.
a. Go to the CloudFormation console and select the clickstream stack you created as part of this project. Click on the stack and click Delete.
Figure 36 – Delete CloudFormation stack
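Steps 2 and 3 of the clean-up can also be scripted. The order matters, because CloudFormation cannot delete an S3 bucket that still contains objects. The function below is a minimal sketch under that assumption. The function name and arguments are hypothetical, and it does not cover the QuickSight account or versioned buckets (for those, object versions would need to be deleted as well).

```python
def clean_up(s3, cloudformation, bucket_name, stack_name):
    """Empty the bucket, then delete the CloudFormation stack.

    s3 is expected to be boto3.resource("s3") and cloudformation
    boto3.client("cloudformation"); both are passed in by the caller.
    """
    # Step 2: remove every object so the stack can delete the bucket.
    s3.Bucket(bucket_name).objects.all().delete()
    # Step 3: delete the stack and the resources it created.
    cloudformation.delete_stack(StackName=stack_name)
```

A call would look like clean_up(boto3.resource("s3"), boto3.client("cloudformation"), bucket_name, stack_name), with the bucket and stack names copied from your account. Stack deletion is asynchronous, so the stack may still be listed for a short time afterwards.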
Conclusion
AWS serverless services provide a powerful and scalable solution for capturing clickstream data. By using Amazon API Gateway, AWS Lambda, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon Athena, and Amazon QuickSight, organizations can capture, process, and visualize clickstream data and load it into analytics platforms. Kinesis Data Firehose facilitates the ingestion process by automatically scaling and buffering incoming data, while Lambda enables the execution of custom code for data transformation and enrichment.
With the serverless architecture, businesses can efficiently handle varying data volumes, reduce operational costs, and quickly iterate on and learn from changes to their customer-facing digital properties. They can also rapidly extract insights from clickstream data, enabling data-driven decision-making and enhanced customer experiences.
Contact an AWS Representative to learn how we can help accelerate your business.