Visualizing big data with AWS AppSync, Amazon Athena, and AWS Amplify

This article was written by Brice Pelle, Principal Technical Account Manager, AWS

Organizations use big data and analytics to extract actionable information from untapped datasets. It can be difficult for you to build an application with access to this trove of data. You want to build great applications quickly and need access to tools that allow you to interact with the data easily.

Presenting data is just as challenging. Tables of numbers and keywords can fail to convey the intended message and make it difficult to communicate insightful observations. Charts, graphs, and images tend to be better at conveying complex ideas and patterns.

This post demonstrates how to use Amazon Athena, AWS AppSync, and AWS Amplify to build an application that interacts with big data. The application is built using React, the AWS Amplify JavaScript library, and the D3.js JavaScript library to render custom visualizations.

The application code can be found in this GitHub repository. It uses Athena to query data hosted in a public Amazon S3 bucket by the Registry of Open Data on AWS. Specifically, it uses the High Resolution Population Density Maps + Demographic Estimates by CIESIN and Facebook.

This public dataset provides “population data for a selection of countries, allocated to 1 arcsecond blocks and provided in a combination of CSV and Cloud-optimized GeoTIFF files,” and is hosted in the S3 bucket s3://dataforgood-fb-data.

Architecture overview

The Amplify CLI sets up sign-in/sign-up with Amazon Cognito, stands up an AWS AppSync GraphQL API, and provisions content storage on S3 (the result bucket).

The Amplify CLI Storage Trigger feature provisions an AWS Lambda function (the announcer function) to respond to events in the result bucket. With the CLI, the announcer Lambda function’s permissions are set to allow GraphQL operations on the GraphQL API.

The Amplify CLI supports defining custom resources associated with the GraphQL API using the CustomResources.json AWS CloudFormation template located in the folder amplify/backend/api/YOUR-API-NAME/stacks/ of an Amplify Project. You can use this capability to define via CloudFormation an HTTP data source and AppSync resolvers to interface with Athena, and a None data source and local resolvers to trigger subscriptions in response to mutations from the announcer Lambda function.

Setting up multi-auth on the GraphQL API

AWS AppSync supports multiple modes of authorization that can be used simultaneously to interact with the API. This application’s GraphQL API is configured with the Amazon Cognito User Pool as its default authorization mode.

Users must authenticate with the User Pool before sending GraphQL operations. Upon sign-in, the user receives a JSON Web Token (JWT) that is attached to requests in an authorization header when sending GraphQL operations.
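The Amplify library attaches the token automatically. To make the mechanism concrete, here is a minimal sketch of the same request built by hand. The function name, endpoint, and token are hypothetical placeholders, not part of the sample application.

```javascript
// Sketch: build a fetch-style request for an AppSync GraphQL operation
// authorized by an Amazon Cognito User Pool JWT.
// The endpoint and token passed in are placeholders (assumptions).
function buildGraphqlRequest(endpoint, jwt, query, variables = {}) {
  if (!jwt) throw new Error('A JWT from the User Pool sign-in is required')
  return {
    url: endpoint,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // The token travels in the authorization header.
        Authorization: jwt
      },
      body: JSON.stringify({ query, variables })
    }
  }
}
```

The returned url and options could be passed to fetch(url, options). In the application itself, API.graphql does this work.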

IAM authorization is another available mode of authorization. The GraphQL API is configured with IAM as an additional authorization mode to recognize and authorize SigV4-signed requests from the announcer Lambda function. The configuration is done using a custom resource backed by a Lambda function. The custom resource is defined in the CloudFormation template with the AppSyncApiId as a property. When deployed, it uses the UpdateGraphqlApi action to add the additional authorization mode to the API:

"MultiAuthGraphQLAPI": {
  "Type": "Custom::MultiAuthGraphQLAPIResource",
  "Properties": {
    "ServiceToken": { "Fn::GetAtt": ["MultiAuthGraphQLAPILambda", "Arn"] },
    "AppSyncApiId": { "Ref": "AppSyncApiId" }
  },
  "DependsOn": "MultiAuthGraphQLAPILambda"
}
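The Lambda function behind the custom resource is not shown here. As a rough sketch of what it needs to compute, the following helper derives UpdateGraphqlApi parameters from the API's current settings. The helper name is hypothetical, and the input is assumed to be the graphqlApi object returned by the GetGraphqlApi action.

```javascript
// Sketch: derive UpdateGraphqlApi parameters that keep the API's current
// settings and add AWS_IAM as an additional authorization mode.
// `api` is assumed to be the `graphqlApi` object returned by GetGraphqlApi.
function withIamAuth(api) {
  const existing = api.additionalAuthenticationProviders || []
  const hasIam = existing.some(p => p.authenticationType === 'AWS_IAM')
  return {
    apiId: api.apiId,
    name: api.name,
    authenticationType: api.authenticationType, // default mode is unchanged
    userPoolConfig: api.userPoolConfig,
    // Add IAM only once so repeated deployments stay idempotent.
    additionalAuthenticationProviders: hasIam
      ? existing
      : [...existing, { authenticationType: 'AWS_IAM' }]
  }
}
```

Carrying over the existing settings matters because the update call would otherwise replace them rather than merge with them.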

The GraphQL schema must specify which types and fields are supported by the authorization modes (with Amazon Cognito User Pool being the default). The schema is configured with the needed authorization directives:

@aws_iam to specify if a field or type is IAM authorized.
@aws_cognito_user_pools to specify if a field or type is Amazon Cognito User Pool authorized.

The announcer Lambda function needs access to the announceQueryResult mutation and the types included in the response. The AthenaQueryResult type is returned by the startQuery query (called from the app), and by announceQueryResult. The type must support both authorization modes.

type AthenaQueryResult @aws_cognito_user_pools @aws_iam {
    QueryExecutionId: ID!
    file: S3Object
}
type S3Object @aws_iam {
    bucket: String!
    region: String!
    key: String!
}
type Query {
    startQuery(input: QueryInput): AthenaQueryResult
}
type Mutation {
    announceQueryResult(input: AnnounceInput!):
      AthenaQueryResult @aws_iam
}

Setting up a NONE data source (Local Resolver) to enable subscriptions

The announcer Lambda function is triggered in response to S3 events and sends a GraphQL mutation to the GraphQL API. The mutation in turn triggers a subscription and sends the mutation selection set to the subscribed app.
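The announcer function's code is not reproduced in this post. The sketch below shows one way to turn an S3 event into variables for the mutation. The function name and the field names of AnnounceInput are assumptions, because the schema excerpt does not show that input type. It relies on Athena naming each result file after its query execution ID.

```javascript
// Sketch: turn an S3 "object created" event into variables for the
// announceQueryResult mutation. The AnnounceInput field names below are
// assumptions; the schema excerpt does not show that input type.
function toAnnouncements(s3Event) {
  return s3Event.Records
    .map(record => ({
      region: record.awsRegion,
      bucket: record.s3.bucket.name,
      // Object keys arrive URL-encoded, with spaces encoded as "+".
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '))
    }))
    // Athena also writes a .metadata file next to each result; skip it.
    .filter(file => file.key.endsWith('.csv'))
    .map(file => ({
      input: {
        // Athena names the result file after the query execution ID.
        QueryExecutionId: file.key.split('/').pop().replace(/\.csv$/, ''),
        file
      }
    }))
}
```

Each returned object could then be sent as the variables of a SigV4-signed announceQueryResult mutation.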

The mutation data does not need to be saved. AWS AppSync only needs to forward the results to the application using the triggered subscription. To enable this, a NONE data source is configured and associated with the local resolver announceQueryResult. NONE data sources and local resolvers are very useful to allow publishing real-time subscriptions without triggering a data source call to modify or update data.

"DataSourceNone": {
  "Type": "AWS::AppSync::DataSource",
  "Properties": {
    "ApiId": { "Ref": "AppSyncApiId" },
    "Name": "None",
    "Description": "None",
    "Type": "NONE"
  }
},
"AnnounceQueryResultResolver": {
  "Type": "AWS::AppSync::Resolver",
  "Properties": {
    "ApiId": {"Ref": "AppSyncApiId"},
    "DataSourceName": { "Fn::GetAtt": ["DataSourceNone", "Name"] },
    "TypeName": "Mutation",
    "FieldName": "announceQueryResult",
  }
}
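The resolver excerpt above does not show its request and response handlers. As an illustration only, here is what a local resolver does, written as two plain JavaScript functions in the style of AppSync's JavaScript resolver runtime. This is an assumption about form, not the sample application's actual resolver code.

```javascript
// Sketch of a local (NONE data source) resolver for announceQueryResult.
// In an APPSYNC_JS resolver file these two functions would be exported.
function request(ctx) {
  // Nothing is stored: forward the mutation input as the payload.
  return { payload: ctx.args.input }
}

function response(ctx) {
  // The payload comes back as the result and becomes the mutation response,
  // which AppSync then publishes to matching subscribers.
  return ctx.result
}
```

Because no data source is called, the mutation response is simply the input that the announcer function sent.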

In the schema, the onAnnouncement subscription is associated with the mutation.

type Mutation {
    announceQueryResult(input: AnnounceInput!):
      AthenaQueryResult @aws_iam
}
type Subscription {
    onAnnouncement(QueryExecutionId: ID!): 
      AthenaQueryResult
        @aws_subscribe(mutations: ["announceQueryResult"])
}

Setting up Athena as a data source

AWS AppSync supports HTTP data sources and can be configured to interact securely with AWS service endpoints.

To configure Athena as a data source, the CustomResources.json template defines the role that AWS AppSync assumes to interact with the API: AppSyncAthenaRole.

The role is assigned the managed policy AmazonAthenaFullAccess. The policy provides read and write permissions to S3 buckets with names starting with aws-athena-query-results-. The application uses this format to name the S3 bucket in which the Athena query results are stored. The role is also assigned the AmazonS3ReadOnlyAccess policy to allow Athena to read from the source data bucket.

The resource DataSourceAthenaAPI defines the data source and specifies IAM as the authorization type along with the service role to be used.

"AppSyncAthenaRole": {
  "Type": "AWS::IAM::Role",
  "Properties": {
    "RoleName": {
      "Fn::Join": [
        "-",
        ["appSyncAthenaRole", { "Ref": "AppSyncApiId" }, { "Ref": "env" }]
      ]
    },
    "ManagedPolicyArns": [
      "arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
      "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
    ],
    "AssumeRolePolicyDocument": {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": { "Service": ["appsync.amazonaws.com"] },
          "Action": ["sts:AssumeRole"]
        }
      ]
    }
  }
},
"DataSourceAthenaAPI": {
  "Type": "AWS::AppSync::DataSource",
  "Properties": {
    "ApiId": { "Ref": "AppSyncApiId" },
    "Name": "AthenaAPI",
    "Description": "Athena API",
    "Type": "HTTP",
    "ServiceRoleArn": { "Fn::GetAtt": ["AppSyncAthenaRole", "Arn"] },
    "HttpConfig": {
      "Endpoint": {
        "Fn::Join": [
          ".",
          ["https://athena", { "Ref": "AWS::Region" }, "amazonaws.com/"]
        ]
      },
      "AuthorizationConfig": {
        "AuthorizationType": "AWS_IAM",
        "AwsIamConfig": {
          "SigningRegion": { "Ref": "AWS::Region" },
          "SigningServiceName": "athena"
        }
      }
    }
  }
}
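With the data source in place, a resolver only has to describe the HTTP call. AWS AppSync signs it with the service role. The sketch below builds such a request for the Athena StartQueryExecution action, using the method, resourcePath, and params fields of an AppSync HTTP resolver request. The function name is hypothetical, and the sample application's actual resolver may differ.

```javascript
// Sketch: the HTTP request a startQuery resolver could hand to the Athena
// HTTP data source. The workgroup name is a parameter (assumption: the
// caller passes the workgroup that holds the query result location).
function startQueryRequest(queryString, workGroup) {
  return {
    method: 'POST',
    resourcePath: '/',
    params: {
      headers: {
        // Athena uses the AWS JSON 1.1 protocol; the target names the action.
        'content-type': 'application/x-amz-json-1.1',
        'x-amz-target': 'AmazonAthena.StartQueryExecution'
      },
      body: {
        QueryString: queryString,
        WorkGroup: workGroup
      }
    }
  }
}
```

For this application, the call would look like startQueryRequest(sql, 'appsync'), matching the workgroup created in the Athena setup steps.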

Application Overview

A walkthrough and guide to setting up the application, configuring the subscription, and visualization follows.

Walk-through

Here is how the application works:

1. Users sign in to the app using Amazon Cognito User Pools. The JWT access token returned at sign-in is sent in an authorization header to AWS AppSync with every GraphQL operation.
2. A user selects a country from the drop-down list and chooses Query. This triggers a GraphQL query. When the app receives the QueryExecutionId in the response, it subscribes to mutations on that ID.
3. AWS AppSync makes a SigV4-signed request to the Athena API with the specified query.
4. Athena runs the query against the specified table. The query returns the sum of the population at recorded longitudes for the selected country along with a count of latitudes at each longitude.
```
SELECT longitude, count(latitude) as count, sum(population) as tot_pop
  FROM "default"."hrsl"
  WHERE country='${countryCode.trim()}'
  group by longitude
  order by longitude
```
5. The results of the query are stored in the result S3 bucket, under the /protected/athena/ prefix. Signed-in app users can access these results using their IAM credentials.
6. Putting the query result file in the bucket generates an S3 event and triggers the announcer Lambda function.
7. The announcer Lambda function sends an announceQueryResult mutation with the S3 bucket and object information.
8. The mutation triggers a subscription with the mutation’s selection set.
9. The client retrieves the result file from the S3 bucket and displays the custom visualization.
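The query in the walkthrough interpolates the selected country code into the SQL text. A minimal sketch of the sqlQuery helper that the app calls could look like the following. The letters-only check is an added assumption, not something the original code is known to do.

```javascript
// Sketch of the sqlQuery helper used to build the Athena query string.
// The letters-only check is an added assumption: it rejects input that
// could alter the statement, since the value is interpolated into SQL.
function sqlQuery(countryCode) {
  const code = countryCode.trim()
  if (!/^[A-Za-z]+$/.test(code)) {
    throw new Error(`Unexpected country code: ${code}`)
  }
  return `SELECT longitude, count(latitude) as count, sum(population) as tot_pop
  FROM "default"."hrsl"
  WHERE country='${code}'
  group by longitude
  order by longitude`
}
```

Validating the value before interpolation is a cheap safeguard even when the country comes from a fixed drop-down list, because the GraphQL API accepts any query string a signed-in client sends.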

Setting up the application

The application is a React app that uses the Amplify JavaScript library to interact with the Amplify-configured backend services. To get started, install the required libraries.

npm install aws-amplify aws-amplify-react

Then, in the main app file, import the necessary dependencies, including the ./aws-exports.js file containing the backend configuration information.

import React, { useEffect, useState, useCallback } from 'react'
import Amplify, { API, graphqlOperation, Storage } from 'aws-amplify'
import { withAuthenticator } from 'aws-amplify-react'
...
import awsconfig from './aws-exports'
...
Amplify.configure(awsconfig)

To get automatic sign-in, sign-up, and confirm functionality in the app, wrap the main component in the withAuthenticator higher-order component (HOC).

export default withAuthenticator(App, true)

Configuring the subscription

When a user chooses Query, the app calls the startQuery callback, which sends a GraphQL query. The query returns a QueryExecutionId, which is stored in the QueryExecutionId state variable.

const [isSending, setIsSending] = useState(false)
const [QueryExecutionId, setQueryExecutionId] = useState(null)

const startQuery = useCallback(async () => {
  if (isSending) return
  setIsSending(true)
  setFileKey(null)
  try {
    const result = await API.graphql(
      graphqlOperation(queries.startQuery, {
        input: { QueryString: sqlQuery(countryCode) }
      })
    )
    console.log(`Setting sub ID: ${result.data.startQuery.QueryExecutionId}`)
    setIsSending(false)
    setQueryExecutionId(result.data.startQuery.QueryExecutionId)
  } catch (error) {
    setIsSending(false)
    console.log('query failed ->', error)
  }
}, [countryCode, isSending])

Setting the state triggers the following useEffect hook, which creates the subscription. Any time QueryExecutionId changes (for example, when it is set to null), React calls the function returned by the hook, which unsubscribes the existing subscription.

const [countryCode, setCountryCode] = useState('')
const [fileKey, setFileKey] = useState(null)

useEffect(() => {
  if (!QueryExecutionId) return

  console.log(`Starting subscription with sub ID ${QueryExecutionId}`)
  const subscription = API.graphql(
    graphqlOperation(subscriptions.onAnnouncement, { QueryExecutionId })
  ).subscribe({
    next: result => {
      console.log('subscription:', result)
      const data = result.value.data.onAnnouncement
      console.log('subscription data:', data)
      setFileKey(data.file.key)
      setQueryExecutionId(null)
    }
  })

  return () => {
    console.log(`Unsubscribe with sub ID ${QueryExecutionId}`, subscription)
    subscription.unsubscribe()
  }
}, [QueryExecutionId])

Visualization

When triggered, the onAnnouncement subscription returns the following data specified in the mutation selection set. This tells the application where to fetch the result file. Signed-in users can read objects in the result bucket starting with the /protected/ prefix. Because Athena saves the results under the /protected/athena/ prefix, authenticated users can retrieve the result files.

QueryExecutionId
file {
    bucket
    region
    key
}

The key value is passed to the fileKey prop of a Visuals component. The application splits the key to extract the level (protected), the identity (athena), and the object key (*.csv). The Storage.get function generates a presigned URL with the current IAM credentials, which the d3.csv function then uses to retrieve the file.

The file is a CSV file with rows of longitude, count, and population. A callback maps the values to x and y (the graph coordinates), and a count property. The application uses the D3.js library along with the d3-hexbin plugin to create the visualization. The d3-hexbin plugin groups the data points in hexagonal-shaped bins based on a defined radius.

const [link, setLink] = useState(null)
useEffect(() => {
  // Nothing to fetch until the subscription delivers a result key
  if (!fileKey) return
  const go = async () => {
    const [level, identityId, _key] = fileKey.split('/')
    const link = await Storage.get(_key, { level, identityId })
    setLink(link)

    const data = Object.assign(
      await d3.csv(link, ({ longitude, tot_pop, count }) => ({
        x: parseFloat(longitude),
        y: parseFloat(tot_pop),
        count: parseInt(count, 10)
      })),
      { x: 'Longitude', y: 'Population', title: 'Pop bins by Longitude' }
    )
    drawChart(data)
  }
  go()
}, [fileKey])

Launching the application

Follow these steps to launch the application.

One-click launch

You can deploy the application directly to the Amplify Console from the public GitHub repository. Both the backend infrastructure and the frontend application are built and deployed. After the application is deployed, follow the remaining steps to configure your Athena database.

Clone and launch

Alternatively, you can clone the repository, deploy the backend with Amplify CLI, and build and serve the frontend locally.

First, install the Amplify CLI and step through the configuration.

$ npm install -g @aws-amplify/cli
$ amplify configure

Next, clone the repository and install the dependencies.

$ git clone https://github.com/aws-samples/aws-appsync-visualization-with-athena-app
$ cd aws-appsync-visualization-with-athena-app
$ yarn

Update the name of the storage bucket (bucketName) in the file ./amplify/backend/storage/sQueryResults/parameters.json, then initialize a new Amplify project and push the changes.

$ amplify init
$ amplify push

Finally, launch the application.

$ yarn start

Setting up Athena

The application uses data hosted in S3 by the Registry of Open Data on AWS. Specifically, you use the High Resolution Population Density Maps + Demographic Estimates by CIESIN and Facebook. You can find information on how to set up Athena to query this dataset in the Readme file.

From the Athena Console:

1. Create a database named `default`.

CREATE DATABASE IF NOT EXISTS default;

2. Create the table in the default database.

CREATE EXTERNAL TABLE IF NOT EXISTS default.hrsl (
  `latitude` double,
  `longitude` double,
  `population` double 
) PARTITIONED BY (
  month string,
  country string,
  type string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '\t',
  'field.delim' = '\t'
) LOCATION 's3://dataforgood-fb-data/csv/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');

3. Recover the partitions.

MSCK REPAIR TABLE hrsl;

When that completes, you should be able to preview the table and see the type of information shown in the following screenshot.

4. Finally, create a new workgroup. First, look up the name of your S3 content storage bucket:

If you deployed using the one-click launch, search for aws_user_files_s3_bucket in the backend build activity log available by clicking the Download button in the Build Activity section from the Amplify Console.
If you deployed using the “Clone and Launch” steps, find aws_user_files_s3_bucket in your aws-exports.js file in the src directory.

From the Athena console, choose Workgroup in the upper bar, then choose Create workgroup. Provide the workgroup name: appsync. Set Query result location to s3://BUCKET_NAME/protected/athena/. Choose Create workgroup.
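The partition recovery in step 3 works because MSCK REPAIR TABLE looks for Hive-style name=value path segments under the table location and registers each combination as a partition. The sketch below illustrates that convention by reading the partition values from an object key. The helper name and the example key are hypothetical, shaped after the table's month, country, and type partition columns.

```javascript
// Sketch: read Hive-style partition values (name=value path segments)
// from an S3 object key, the layout MSCK REPAIR TABLE scans for.
function partitionsOf(key) {
  const partitions = {}
  for (const segment of key.split('/')) {
    const eq = segment.indexOf('=')
    if (eq > 0) partitions[segment.slice(0, eq)] = segment.slice(eq + 1)
  }
  return partitions
}

// Hypothetical key, shaped after the table's partition columns.
console.log(partitionsOf('csv/month=2019-06/country=USA/type=total/data.csv'))
```

The WHERE country=... clause in the application's query filters on one of these partition columns, which lets Athena read only the matching part of the dataset.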

Conclusion

This post demonstrated how to use AWS AppSync to interact with the Amazon Athena API and securely render custom visualizations in your front-end application. By combining these services, you can easily create applications that interact directly with big data stored on S3, and render the data in different ways with graphs and charts.

Along with libraries from D3.js, you can develop innovative ways to interact with data and display information to users. In addition, you can get started quickly, implement core functionality, and deploy instantly using the AWS Amplify Framework.