AWS Database Blog

Analyze user behavior using Amazon Elasticsearch Service, Amazon Kinesis Data Firehose and Kibana

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more.

Let’s assume that you work for an ecommerce company and you want to provide the best user experience to your customers. A customer could land on a product page by coming from a recommendation on another page in your application or from a search engine. Whatever the route, you want to ensure that your customers actually land on the page that they are looking for. However, not every customer takes the same route. The route can depend on how they access your application, from what location, and many other attributes. To analyze and determine these patterns, you’ll want to review logs, which hold lots of valuable data.

In this blog post, I discuss how you can tap into Apache Webserver logs to analyze user behavior and turn them into actionable insights.

I have used the following AWS services in this blog post:

  • Amazon EC2, hosting the Apache Webserver
  • Amazon Kinesis Data Firehose
  • AWS Lambda
  • Amazon Elasticsearch Service (Amazon ES)
  • Amazon Cognito
  • Amazon S3
  • AWS CloudFormation

Architecture Overview

The architecture diagram shown below provides an overview of the solution.

Here are the architectural components in detail.

  1. Apache Webserver on an EC2 instance – This is where our web application is hosted. Apache access logs contain valuable data such as:
    • The IP address of the client accessing the page
    • User-agent data that can be used to derive attributes such as the OS and browser being used by the user
    • The page being accessed
    • The page that led the user to the particular page (referer)

I use a Kinesis agent to tail the Apache access logs, convert each entry to JSON format, and ship it to a Kinesis Data Firehose delivery stream (a sample agent configuration appears after this list).

  2. Kinesis Data Firehose – A Kinesis Data Firehose delivery stream delivers the data to the Amazon ES domain. The delivery stream is configured to buffer data until 5 MB have accumulated or 60 seconds have elapsed, whichever comes first, before delivering it to the Amazon ES domain. It also backs up failed records to an S3 bucket.
  3. AWS Lambda – The data captured from the log files is processed and enriched before it can be analyzed. The Apache log files contain fields such as the IP address of the user, the user agent, and other important details that can be processed to determine user attributes. I use a geo-IP lookup on the IP address to determine the location of the user, and the user agent can be decoded to understand the platform (OS, browser) being used. I use an in-line Lambda function to process the log data and enrich it before it is ingested into the Amazon ES domain for analysis (a simplified sketch of such a transformation appears after this list). Download the Python 3.6 code package for the Lambda function in ZIP format from an S3 bucket. Inside the package, the actual Lambda function is contained in the Lambda.py file.
  4. Amazon ES – Amazon ES indexes the incoming data and makes it available for analysis. I use Kibana, which is integrated into Amazon ES, to analyze the data ingested into the Amazon ES domain.
  5. Amazon Cognito – I use Amazon Cognito to authenticate users on Kibana.
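
If you are configuring the Kinesis agent yourself rather than relying on the CloudFormation template, a minimal /etc/aws-kinesis/agent.json along the following lines tails the access log, converts each combined-format entry to JSON, and ships it to the delivery stream. This is a sketch only; the log path, Region endpoint, and delivery stream name are assumptions that must match your own setup.

{
    "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
    "flows": [
        {
            "filePattern": "/var/log/httpd/access_log",
            "deliveryStream": "apache-logs-delivery-stream",
            "dataProcessingOptions": [
                {
                    "optionName": "LOGTOJSON",
                    "logFormat": "COMBINEDAPACHELOG"
                }
            ]
        }
    ]
}

The in-line transformation function follows the standard Kinesis Data Firehose record-transformation contract: it receives base64-encoded records, enriches each one, and returns it with a result status. The following is a simplified sketch only, with hypothetical helpers geoip_lookup() and parse_user_agent() standing in for a real GeoIP database and user-agent parser; the packaged Lambda.py may differ in its details.

import base64
import json

def handler(event, context):
    output = []
    for record in event['records']:
        # Each incoming record is a base64-encoded JSON access log entry.
        payload = json.loads(base64.b64decode(record['data']))

        # Enrich with location details derived from the client IP address.
        # geoip_lookup() is a hypothetical placeholder for a GeoIP lookup.
        geo = geoip_lookup(payload.get('host', ''))
        payload['country'] = geo.get('country')
        payload['city'] = geo.get('city')
        payload['location'] = {'lat': geo.get('lat'), 'lon': geo.get('lon')}

        # Derive the OS and browser from the User-Agent string.
        # parse_user_agent() is a hypothetical placeholder for a UA parser.
        ua = parse_user_agent(payload.get('agent', ''))
        payload['os'] = ua.get('os')
        payload['browser'] = ua.get('browser')

        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(
                json.dumps(payload).encode('utf-8')).decode('utf-8')
        })
    return {'records': output}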

Deploying the stack

Use AWS CloudFormation to deploy the entire setup. You can deploy the stack in any AWS region that has the following services available.

  • Kinesis Data Firehose
  • Amazon Cognito
  • Amazon Elasticsearch Service
  • AWS Lambda

Create these resources before you deploy the CloudFormation stack:

  • A VPC with at least one public subnet
  • An EC2 SSH key pair
  • An S3 bucket to collect failed records from the Firehose delivery stream

To launch the CloudFormation stack, choose Launch Stack below.

Provide the appropriate parameter values described below.

Stack name – Enter a value for the CloudFormation stack name.

VPC for Webserver deployment – The VPC to be used for deploying the Webserver EC2 instance.

Public Subnet for Webserver deployment – The public subnet to be used for deploying the Webserver EC2 instance.

SSH Keypair for the Webserver – EC2 SSH key pair that you would like to associate with the Webserver EC2 instance.

Base AMI ID – Keep this value as is. The template picks up the latest Amazon Linux 2 AMI identifier based on your chosen region.

S3 bucket for failed records – The name of the S3 bucket where you would like the failed records to be written.

Kibana User Id – A user identifier to be used as a login credential for accessing Kibana.

Temporary Kibana Password – A temporary password for logging into Kibana. Make it at least eight characters long.

Follow through the CloudFormation stack creation wizard, leaving all default values unchanged. On the final page, ensure that you check the option I acknowledge that AWS CloudFormation might create IAM resources with custom names and then choose Create.

After the stack is created, note the following values from the stack output. These values are used later.

  1. KibanaHTTPURL
  2. WebserverHTTPURL

Next, log in to Kibana and create an index with a mapping to start ingesting data into the index. To do this, follow these steps:

  1. Access Kibana using the HTTP link that you noted from the CloudFormation stack output.
  2. Log in with the credentials (user identifier and password) that you provided as parameters to the CloudFormation stack.
  3. On the next screen, enter a new password. Make it at least eight characters long. You land on the Kibana home page.
  4. Create an index. To do this, choose Dev Tools and then choose Get to work. This screen is generally visible only the first time you access this functionality.
  5. In the console, run the following two commands:
    • This step deletes the apache_logs index if it has already been created.
      DELETE /apache_logs
    • This creates a new index called apache_logs with a mapping called access_logs. Our Kinesis Data Firehose delivery stream writes to this index. The mapping ensures that the latitude and longitude fields are read in geo_point format and the datetime field is read in the right datetime format.
PUT apache_logs
{
  "settings": {
    "index": {
      "number_of_shards": 10,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "access_logs": {
      "properties": {
        "agent":         { "type": "text" },
        "browser":       { "type": "keyword" },
        "bytes":         { "type": "text" },
        "city":          { "type": "keyword" },
        "country":       { "type": "keyword" },
        "datetime":      { "type": "date", "format": "dd/MMM/yyyy:HH:mm:ss Z" },
        "host":          { "type": "text" },
        "location":      { "type": "geo_point" },
        "referer":       { "type": "text" },
        "os":            { "type": "keyword" },
        "request":       { "type": "text" },
        "response":      { "type": "text" },
        "webpage":       { "type": "keyword" },
        "refering_page": { "type": "keyword" }
      }
    }
  }
}

Generate some traffic on the website so that you can visualize the data with Kibana. To do that, you need to understand how the website is set up. The website is made up of six placeholder pages. When you first access the site, you land on the main.php page. From there, you can go to either the search page or the recommendation page. Both of these pages are used as referrers to finally land on one of the product pages. There are three product pages for three products – echo, kindle, and firetvstick. You can navigate from both the search and recommendation pages to any of the product pages and back. These are placeholder pages, used only to generate traffic to our website so that we can then analyze the generated data with Kibana.

The diagram below shows the setup.

To start generating traffic on the website, access the link provided as the CloudFormation stack output WebserverHTTPURL. Opening the link takes you to the main.php page. From here, you can go to the search or recommendation pages, and from there you can navigate to any one of the product pages. You can also navigate back and forth between these pages. To simulate multiple users, use multiple browsers and devices and navigate through the pages several times so that you have enough data to start analyzing.
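
If you prefer to script some of this traffic rather than clicking through manually, a small Python sketch such as the following can request the product pages with varied User-Agent and Referer headers. The page file names other than main.php are assumptions based on the site layout described above, and keep in mind that scripted requests all originate from one client IP, so for geographic variety you still need visits from different locations.

import random
import urllib.request

# Replace with the WebserverHTTPURL value from the CloudFormation stack output.
BASE_URL = "http://<WebserverHTTPURL>"

PRODUCT_PAGES = ["echo.php", "kindle.php", "firetvstick.php"]   # assumed file names
REFERERS = ["search.php", "recommendation.php"]                 # assumed file names
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
    "Mozilla/5.0 (Linux; Android 11; Pixel 5) AppleWebKit/537.36 Chrome/91.0 Mobile Safari/537.36",
]

for _ in range(200):
    # Pick a random product page, browser identity, and referring page per request.
    page = random.choice(PRODUCT_PAGES)
    request = urllib.request.Request(
        f"{BASE_URL}/{page}",
        headers={
            "User-Agent": random.choice(USER_AGENTS),
            "Referer": f"{BASE_URL}/{random.choice(REFERERS)}",
        },
    )
    urllib.request.urlopen(request).read()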

With data ingested into the Amazon ES domain, create an index pattern in Kibana. Follow these steps:

  1. Go to the Kibana home page and choose Management.
  2. Enter “apache_logs” as the index pattern and choose Next step.
  3. Select the datetime field as the Time Filter field name on the next screen and choose Create index pattern.

Now it’s time to analyze this data and find patterns by using visualization. It might take a few minutes for data to land in the Amazon ES domain. It’s buffered through the Kinesis Data Firehose delivery stream and processed by the Lambda function.
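
If you want to confirm that documents are arriving before you build visualizations, you can run a quick count from the Kibana Dev Tools console (a simple check against the index created earlier; the count should grow as the delivery stream flushes its buffer):

GET apache_logs/_count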

Now you can see how people in various roles can use this data to identify patterns and generate actionable insights.

Sales Team – The sales team might need to know if people from a particular location are more interested in a particular product. Kibana can display the events from the log lines on an adjustable map, based on the value of the location field. Recall that in the transformation Lambda function we used GeoIP to add that information to each log line. The sales team can use the location field from the Apache web logs to determine the geographical regions that show the largest interest. By using the search features of Elasticsearch, they can even narrow the traffic displayed, based on the URL (details page) they’re looking at.
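
Under the hood, this kind of view boils down to a term filter on the page of interest combined with a geohash_grid aggregation on the location field. The following query, which you can also run from Dev Tools, is a rough sketch of what the Coordinate Map visualization does for you (the precision value is an arbitrary choice):

GET apache_logs/_search
{
  "size": 0,
  "query": { "term": { "webpage": "firetvstick" } },
  "aggs": {
    "views_by_location": {
      "geohash_grid": { "field": "location", "precision": 3 }
    }
  }
}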

To do this, create a map-based visualization with the following steps.

  1. Choose Visualize on the Kibana home page.
  2. Choose the plus sign to create a new visualization.
  3. Select the Coordinate Map visualization type under Maps.
  4. On the next screen, select the apache_logs index.
  5. On the top right corner of the screen, change the time interval to the time interval that you want to visualize the data for.
  6. Leave the Metrics Value set to Count.
  7. Under Buckets, choose Geo Coordinates. Select Aggregation as Geohash. Set Field to location.
  8. Add a filter on the top of the page to only look for webpage = whatever product you are interested in. In the screenshot below, I have chosen firetvstick.

  9. On the Options tab, under Base Layer Settings, set Layers to road_map.
  10. Choose the Apply changes button.

The visualization in the screenshot shows that more people from the United Kingdom seem to be viewing the firetvstick page compared to India. You might have different results based on your data.

Marketing – People in marketing might want to know if users accessing the application through a particular OS or browser are more likely to view a particular product. They might use this data to publish ads on other websites based on the criteria that are uncovered. Kibana can display a nested pie chart, where the innermost ring is the webpage, and each webpage is subdivided into OS and then browser. This lets the marketing team see which products are popular for which OS and browser.
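
The nested pie chart is driven by terms aggregations nested on the webpage, os, and browser keyword fields. A roughly equivalent query that you could run from Dev Tools looks like the following (a sketch, not the exact request Kibana generates):

GET apache_logs/_search
{
  "size": 0,
  "aggs": {
    "by_webpage": {
      "terms": { "field": "webpage" },
      "aggs": {
        "by_os": {
          "terms": { "field": "os" },
          "aggs": {
            "by_browser": { "terms": { "field": "browser" } }
          }
        }
      }
    }
  }
}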

  1. Choose Visualize on the Kibana home page.
  2. Choose the plus sign to create a new visualization.
  3. Select the Pie visualization type under Basic Charts.
  4. On the top right corner of the screen, change the time interval to the time interval that you want to visualize the data for.
  5. Leave the Metrics Slice Size set to Count.
  6. Under Buckets, select Aggregation as Terms and Field as webpage.
  7. Check the Group other values in separate bucket check box.
  8. Provide the Label for the other bucket as Other.

  9. Now, choose Add sub-buckets and repeat the same steps as above, but this time choose os as the Field.
  10. Choose Add sub-buckets one more time and repeat the same steps again, this time choosing browser as the Field.
  11. Add a filter on the top of the page to only look for webpage = whatever product(s) you are interested in. In the screenshot below, I have chosen all three – firetvstick, echo, and kindle.
  12. On the Options tab, uncheck the Donut and the Show Top Level Only check boxes.
  13. Choose the Apply changes button.

From the data in the visualization in the screenshot, it is evident that the Kindle product page is viewed primarily from the Windows OS and the Firefox browser. Echo and FireTVStick are visited mainly from Android and Windows. You might have different results based on your data.

Application Team – The application team would like to know if more people land on the page through a search or a recommendation so that either of the two functionalities can be tweaked if needed. They could use several of Kibana’s visualizations – a vertical bar chart, a line chart, or our choice here, a heat map.
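
The heat map amounts to a two-level terms aggregation that counts each combination of refering_page and webpage, roughly like the following sketch, which you could also run from Dev Tools:

GET apache_logs/_search
{
  "size": 0,
  "aggs": {
    "by_referer": {
      "terms": { "field": "refering_page" },
      "aggs": {
        "by_webpage": { "terms": { "field": "webpage" } }
      }
    }
  }
}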

  1. Choose Visualize on the Kibana home page.
  2. Choose the plus sign to create a new visualization.
  3. Select Heat Map under Basic Charts as the visualization type.
  4. Select apache_logs as the index.
  5. Leave the Metrics Value set to Count.
  6. Under Buckets, for X-Axis, select Aggregation as Terms and refering_page as the Field.
  7. Check the Group other values in separate bucket check box.
  8. Provide the Label for the other bucket as Other.

  9. Now, choose Add sub-buckets and choose buckets type – Y-Axis.
  10. Select Sub Aggregation as Terms and Field as webpage.
  11. Check the Group other values in separate bucket check box.
  12. Provide the Label for the other bucket as Other.
  13. On the Options tab, check Show Labels.
  14. Add a filter on the top of the page to only look for webpage = whatever product(s) you are interested in. In the screenshot below, I have chosen all three – firetvstick, echo, and kindle.
  15. Choose the Apply changes button.

From the visualization in the screenshot, it is evident that the kindle page is reached only through the search page, whereas the firetvstick page is reached mainly through the recommendation page. You might have different results based on your data.

Cleaning Up

When you have finished visualizing data, clean up all the AWS resources that you created using AWS CloudFormation. Use the following steps for clean-up:

  1. In the Amazon Cognito User Pool console in the selected region, select the aes_kibana_demo_userpool user pool and delete the domain associated with the pool.

  2. Navigate to the S3 bucket that you specified for storing failed records from the Kinesis Data Firehose delivery stream and delete all objects prefixed with aes-kibana-demo-failed.
  3. Next, delete the CloudFormation stack.

Conclusion

In this blog, you saw how you can use data in Apache access logs to create visualizations providing valuable insights into user behavior. The sales, marketing, and application teams can act on these insights to drive sales and improve user experience. You can create many other visualizations using the various visualization types provided by Kibana for different use cases.

You can also use a similar solution to analyze and process any type of logs, or, for that matter, any data that you want to ingest and analyze. To tweak the solution to analyze any other type of data, and generate sample streaming data, you can use the Amazon Kinesis Data Generator. For more information, be sure to check out Allan MacInnis’s blog post, Test your Streaming Data Solution with the New Amazon Kinesis Data Generator.

If you have any feedback about this blog post, please use the comment area on this page.


About the Author

Ninad Phatak is a Big Data Solutions Architect with Amazon Internet Services Private Limited. He helps customers build big data solutions on AWS to meet their data processing, analytics and business intelligence needs.