AWS Big Data Blog

Creating customized Vega visualizations in Amazon Elasticsearch Service

Computers can easily process vast amounts of data in their raw format, such as databases or binary files, but humans require visualizations to be able to derive facts from data. The plethora of tools and services such as Kibana (as part of Amazon ES) or Amazon Quicksight to design visualizations from a data source are a testimony to this need.

Such tools often provide out-of-the-box templates for designing simple graphs from appropriately pre-processed data, but applying these to production-grade, complex visualizations can be challenging for several reasons:

  • The raw data upon which a visualization is built may contain encoded attributes that aren’t understandable for the viewer. For example, the following layered bar chart could be raw data about people where the gender is encoded in numbers, but the visualization should still have human-readable attribute values.
  • Visualizations are often built on aggregated summaries of raw data. Storing and maintaining different aggregations for each visualization may be unfeasible. For instance, if you classify raw data items in multiple dimensions (for example, classifying cars by color or engine size) and build visualizations on these different dimensions, you need to store different aggregated materialized views of the same data.
  • Different data views used in a single visualization may require an ad hoc computation over the underlying data to generate an appropriate foundational data source.

This post shows how to implement Vega visualizations included in Kibana, which is part of Amazon Elasticsearch Service (Amazon ES), using a real-world clickstream data sample. Vega visualizations are an integrated scripting mechanism of Kibana to perform on-the-fly computations on raw data to generate D3.js visualizations. For this post, we use a fully automated setup using AWS CloudFormation to show how to build a customized histogram for a web analytics use case. This example implements an ad hoc map-reduce like aggregation of the underlying data for a histogram.

Use case

For this post, we use Online Shopping Store – Web Server Logs published by Harvard Dataverse. The 3.3 GB dataset contains 10,365,152 access logs of an online shopping store. For example, see the following data sample:

207.46.13.136 - - [22/Jan/2019:03:56:19 +0330] "GET /product/14926 HTTP/1.1" 404 33617 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"

Each entry contains a source IP address (for the preceding data sample, 207.46.13.136), the HTTP request, and return code (“GET /product/14926 HTTP/1.1” 404), the size of the response in bytes (33617) and additional metadata, such a timestamp. For this use case, we assume the role of a web server administrator who wants to visualize the response message size in bytes for the traffic over a specific time period. We want to generate a histogram of the response message sizes over a given period that looks like the following screenshot.

In the preceding screenshot, the majority of the response sizes are less than 1 MB. Based on this visualization, the administrator of the online shop could identify the following:

  • Requests that result in high bandwidth use
  • Distributed denial of service (DDoS) attacks caused by repeatedly requesting pages, which causes a high amount of response traffic

The built-in Vertical / Horizontal Bar visualization options in Kibana can’t produce this histogram over this Elasticsearch index without storing the raw data. In addition to this, these visualizations should be able to instruct complex transformations and aggregations over the data, so you can generate the preceding histogram. Storing the data in a histogram-friendly way in Amazon ES and building a visualization with Vertical / Horizontal Bar components requires ETL (Extract Transform Load) of the data to a different index. Such an ETL creates unnecessary storage and compute costs. To avoid these costs and complex workflows, Vega provides a flexible approach that can execute such transformations on the fly on the server log data.

We have built some of the necessary transformations outside of our Vega code to keep the code presented in this post concise. This was done to improve the readability of this post; the complete transformation could have also been done in Vega.

Overview of solution

To avoid unnecessary costs and focus on Vega visualization creation task in this post, we use an AWS Lambda function to stream the access logs out of an Amazon Simple Storage Service (Amazon S3) bucket instead of serving them from a web server (this Lambda function contains some of the transformations that have been removed to improve the readability of this post). In a production deployment, you replace the function and S3 bucket with a real web server and the Amazon Kinesis Agent, but the rest of the architecture remains unchanged. AWS CloudFormation deploys the following architecture into your AWS account.

The Lambda function reads and transforms the access logs out of an S3 bucket to stream them into Amazon Kinesis Data Firehose, which forwards the data to Amazon ES. Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools, and therefore it is suitable for this task. The available data will be automatically transferred to Amazon ES after the deployment.

Prerequisites and deploying your CloudFormation template

To follow the content of this post, including a step-by-step walkthrough to build a customized histogram with Vega visualizations on Amazon ES, you need an AWS account and access rights to deploy a CloudFormation template. The Elasticsearch cluster and other resources the template creates result in charges to your AWS account. During a test-run in eu-west-1, we measured costs of $0.50 per hour. Complete the following steps:

  1. Depending on which Region you want to use, sign in to the AWS Management Console and choose one of the following Regions to deploy the necessary resources in your account:
  1. Keep all the default parameters as is.
  2. Select the check boxes, I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  3. Choose Create Stack to start the deployment.

The deployment takes approximately 20–25 minutes and is finished when the status switches to CREATE_COMPLETE.

Connecting to the Kibana

After you deploy the stack, complete the following steps to log in to Kibana and build the visualization inside Kibana:

  1. On the Outputs tab of your CloudFormation stack, choose the URL for KibanaLoginURL.

This opens the Kibana login prompt.

  1. Unless you changed these at the start of the CloudFormation stack deployment process, the username and password are as follows:
    • Username – admin@example.com
    • Password – Amazon123

Amazon Cognito requires you to change the password, which Kibana prompts you for.

  1. Enter a new password and a name for yourself.

Remember this password; there is no recovery option in this setup if you need to re-authenticate.

Upon completing these steps, you should arrive at the Kibana home screen (see the following screenshot).

You’re ready to build a Vega visualization in Kibana. We guide through this process in the following sections.

Creating an index pattern and exploring the data

Kibana visualizations are based on data stored in your Elasticsearch cluster, and the data is stored in an Elasticsearch index called vega-visu-blog-index. To create an index pattern, complete the following steps:

  1. On the Kibana home screen, choose Discover.
  2. For Index Pattern, enter “vega*”.
  3. Choose Next step.

  1. For Time Filter field name, choose timestamp.
  2. Choose Create index pattern.

After a few seconds, a summary page with details about the created index pattern appears.

The index pattern is now created and you can use it for queries and visualizations. When you choose Discover, you see a page that shows your index pattern. By default, Kibana displays data of the last 15 minutes, but given that the data is historical, you have to properly select the time range. For this use case, the data spans January 1, 2019 to February 1, 2019. After you configure the time range, choose Update.

In addition to a graphical representation of the tuple count over time, Kibana shows the raw data in the following format:

size_in_bytes: { "size": 13000, "freq": 11 }, …

The following screenshot shows both the visualization and raw data.

As noted before, we executed a pre-aggregation step with the data, which counted the number of requests in the log file with a given size. The message sizes were truncated into steps of 100 bytes. Therefore, the preceding screenshot indicates that there were 11 requests with 13,000–13,099 bytes in size during the minute after January 26, 2019, on 13:59. The next section shows how to create the aggregated histogram from this data using a Vega visualization involving a series of data transformations implicitly and on-the-fly, which Amazon ES runs without changing the underlying data stored in the index.

Vega on-the-fly data transformation

To properly compute the desired histogram, you need to transform the data. Because the format and temporal granularity of the data stored in the Elasticsearch index doesn’t match the format required by the visualization interface. For this use case, we use the scripted metric aggregation pattern, which calls stored scripts written in Painless to implement the following four steps:

  1. Variable initialization
  2. Mapping documents to a key-value structure for the histogram
  3. Grouping values into arrays by keys
  4. Aggregation of array content into the histogram

To implement the first step for initializing the variables used by the other steps, choose Dev Tools on the left side of the screen. The development console allows you to define scripts that are later called by the Vega visualization. The following code for the first step is very simple because it initializes an empty array variable “state.test”:

#init_script
POST _scripts/initialize_variables
{
  "script": {
    "lang" : "painless",
    "source" : "state.test = [:]"
  }
}

Enter the preceding code into the left-hand input section of the Kibana Dev Tools page, select the code, and choose the green triangular button. Kibana acknowledges the loading of the script in the output (see the following screenshot).

To improve the readability of the rest of this section, we will show the result of each step based on the following initial input JSON:

{
  “timestamp”: “2019/01/26 12:58:00”, 
  “size_in_bytes”: [ 
    {“size”: 100, “freq”: 369},
    {“size”: 200, “freq”: 62},
    … 
]}
{
  “timestamp”: “2019/01/26 12:57:00”, 
  “size_in_bytes”: [ 
    {“size”: 100, “freq”: 386},
    {“size”: 200, “freq”: 60},
    … 
]}

The next step of the transformation maps the request size in each document of the index to their appropriate count of occurrences as key-value pairs, i.e., with the example data above, the first size_in_bytes field is transformed into “100”: 369. As in the previous step, implement this part of the transformation by entering the following code into the Kibana Dev Tools page and choosing the green button to load the script.

#map_script
POST _scripts/map_documents
{
  "script": {
    "lang": "painless",
    "source": """
      if (doc.containsKey(params.bucketField))
        for (int i=0; i<doc[params.bucketField].length; i++)
        {
          def key = doc[params.bucketField][i];
          def value = doc[params.countField][i];
          state.test[(key).toString()] = value;
        }
        """
  }
}

Using our example input data results for this step, the following output is computed:

{“100”: 369, “200”: 62, …}
{“100”: 386, “200”: 60, …}

As shown above, the script’s outputs are JSON documents. Because the desired histogram aggregates data over all documents, you need to combine or merge the data by key in the next step. Enter the following code:

#combine_script
POST _scripts/combine_documents
{
  "script": {
    "lang": "painless",
    "source": "return state.test"
  }
}

With the example intermediate output from the previous step, the following intermediate output is computed:

{
  “100”: [369, 386],
  “200”: [62, 60], …
}

Finally, you can aggregate the count value arrays returned by the combine_documents script into one scalar value by each messagesize as implemented by the aggregate_histogram_buckets script. See the following code:

#reduce_script
POST _scripts/aggregate_histogram_buckets
{
  "script": {
    "lang": "painless",
    "source": """
      Map result = [:];
      states.forEach(value ->
        value.forEach((k,v) ->
          {result.merge(k, v, (value1, value2) -> value1+value2)}
        )
      );
      return result.entrySet().stream().sorted(
        Comparator.comparingInt((entry) -> 
        Integer.parseInt(entry.key))
      ).map(
        entry -> [
          "messagesize": Integer.parseInt(entry.key),
          "messagecount": entry.value
        ]).collect(Collectors.toList())
      """
  }
}

The final output for our computation example has the following format:

{
  “messagesize”: 100,
  “messagecount”: 755,
}
{
  “messagesize”: 200,
  “messagesize”: 122
}

This concludes the implementation of the on-the-fly data transformation used for the Vega visualization of this post. The documents are transformed into buckets with two fields—messagesize and—messagecount – which contain the corresponding data for the histogram.

Creating a Vega-Lite visualization

The Vega visualization generates D3.js representations of the data using the on-the-fly transformation discussed earlier. You can create this histogram visualization using Vega or Vega-Lite visualization grammars, which are both supported by Kibana. Vega-Lite is sufficient to implement our use case. To make use of the transformation, create a new visualization in Kibana and choose Vega.

A Vega visualization is created by a JSON document that describes the content and transformations required to generate the visual output. In our use case, we use the vega-visu-blog-index index and the four transformation steps in a scripted metric aggregation operation to generate the data property which provides the main content suitable to visualize as a histogram. The second part of the JSON document specifies the graph type, axis labels, and binning to format the visualization as required for our use case. The full Vega-Lite visualization JSON is as follows:

{
  $schema: https://vega.github.io/schema/vega-lite/v2.json
  data: {
    name: table
    url: {
      index: vega-visu-blog-index
      %timefield%: timestamp
      %context%: true
      body: {
        aggs: {
          temporal_hist_agg: {
            scripted_metric: {
              init_script: {
                id: "initialize_variables"
              }
              map_script: {
                id: map_documents
                params: {
                  bucketField: size_in_bytes.size
                  countField: size_in_bytes.freq
                }
              }
              combine_script: {
                id: "combine_documents"
                }
              reduce_script: {
                id: "aggregate_histogram_buckets"
              }
            }
          }
        }
        size: 0
      }
    }
    format: {property: "aggregations.temporal_hist_agg.value"}
  }
  mark: bar
  title: {
    text: Temporal Message Size Histogram
    frame: bounds
  },
  encoding: {
    x: {
      field: messagesize
      type: ordinal
      axis: {
        title: "Message Size Bucket"
        format: "~s"
      }
      bin: {
        binned: true,
        step: 10000
      }
    }
    y: {
      field: messagecount
      type: quantitative
      axis: {
        title: "Count"
        }
    }
  }
}

Replace all text from the code pane of the Vega visualization designer with the preceding code and choose Apply changes. Kibana computes the visualization calling the four stored scripts (mentioned in the previous section) for on-the-fly data transformations and displays the desired histogram. Afterwards, you can use the visualization just like the other Kibana visualizations to create Kibana dashboards.

To use the visualization in dashboards, save it by choosing Save. Changing the parameters of the visualization automatically results in a re-computation, including the data transformation. You can test this by changing the time period.

Exploring and debugging scripted metric aggregation

The debugger included into most modern browsers allows you to view and test the transformation result of the scripted metric aggregation. To view the transformation result, open the developer tools in your browser, choose Console, and enter the following code:

VEGA_DEBUG.view.data('table')

The Vega Debugger shows the data as a tree, which you can easily explore.

The developer tools are useful when writing transformation scripts to test the functionality of the scripts and manually explore their output.

Cleaning up

To delete all resources and stop incurring costs to your AWS account, complete the following steps:

  1. On the AWS CloudFormation console, from the list of deployed stacks, choose vega-visu-blog.
  2. Choose Delete.

The process can take up to 15 minutes, and removes all the resources you deployed for following this post.

Conclusion

Although Kibana itself provides powerful built-in visualization methods, these approaches require the underlying data to have the right format. This creates issues when the same data is used for different visualizations or simply not available in an ideal format for your visualization. Because storing different aggregations or views of the same data isn’t a cost-effective approach, this post showed how to generate customized visualizations using Amazon ES, Kibana, and Vega visualizations with on-the-fly data transformations.

 


About the authors

Markus Bestehorn is a Principal Prototyping Engagement Manager at AWS. He is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C, as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.

 

 

Anil Sener is a Data Prototyping Architect at AWS. He builds prototypes on Big Data Analytics, Streaming, and Machine Learning, which accelerates the production journey on AWS for top EMEA customers. He has two Master Degrees in MIS and Data Science. He likes to read about history and philosophy in his free time.