AWS Machine Learning Blog

Build a social media dashboard using machine learning and BI services

In this blog post we’ll show you how you can use Amazon Translate, Amazon Comprehend, Amazon Kinesis, Amazon Athena, and Amazon QuickSight to build a natural-language-processing (NLP)-powered social media dashboard for tweets.

Social media interactions between organizations and customers deepen brand awareness. These conversations are a low-cost way to acquire leads, improve website traffic, develop customer relationships, and improve customer service.

In this blog post, we’ll build a serverless data processing and machine learning (ML) pipeline that provides a multi-lingual social media dashboard of tweets within Amazon QuickSight. We’ll leverage API-driven ML services that allow developers to easily add intelligence to any application, such as computer vision, speech, language analysis, and chatbot functionality simply by calling a highly available, scalable, and secure endpoint. These building blocks will be put together with very little code, by leveraging serverless offerings within AWS. For this blog post, we will be performing language translation and natural language processing on the tweets flowing through the system.

In addition to building a social media dashboard, we want to capture both the raw and enriched datasets and durably store them in a data lake. This allows data analysts to quickly and easily perform new types of analytics and machine learning on this data.

Throughout this blog post, we’ll show how you can do the following:

  • Leverage Amazon Kinesis Data Firehose to easily capture, prepare, and load real-time data streams into data stores, data warehouses, and data lakes. In this example, we’ll use Amazon S3.
  • Trigger AWS Lambda to analyze the tweets using Amazon Translate and Amazon Comprehend, two fully managed services from AWS. With only a few lines of code, these services will allow us to translate between languages and perform natural language processing (NLP) on the tweets.
  • Leverage separate Kinesis data delivery streams within Amazon Kinesis Data Firehose to write the analyzed data back to the data lake.
  • Leverage Amazon Athena to query the data stored in Amazon S3.
  • Build a set of dashboards using Amazon QuickSight.

The following diagram shows both the ingest (blue) and query (orange) flows.

Note: At the time of this blog post, Amazon Translate is still in preview. In production workloads, use the multilingual features of Amazon Comprehend until Amazon Translate becomes generally available (GA).
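
For example, here is a minimal Python sketch of calling Amazon Comprehend directly on non-English text; the sample sentence and the print statement are ours, purely for illustration:

import boto3

comprehend = boto3.client('comprehend')

# Comprehend can analyze some languages natively (Spanish here), so while
# Translate is in preview you can score such tweets without translating first.
result = comprehend.detect_sentiment(
    Text='Me encantan mis zapatos nuevos',
    LanguageCode='es')
print(result['Sentiment'], result['SentimentScore'])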

Build this architecture yourself

We’ve provided you with an AWS CloudFormation template that will create all the ingestion components shown in the previous diagram, except for the Amazon S3 notification for AWS Lambda (depicted as the dotted blue line).

In the AWS Management Console, launch the CloudFormation Template.

This will launch the CloudFormation stack automatically into the us-east-1 Region with the following settings:

  • Region: us-east-1
  • CFN Template: https://s3.amazonaws.com/serverless-analytics/SocialMediaAnalytics-blog/deploy.yaml
  • Stack Name: SocialMediaAnalyticsBlogPost

Specify these required parameters:

  • InstanceKeyName – KeyPair used to connect to the Twitter streaming instance
  • TwitterAuthAccessToken – Twitter account access token
  • TwitterAuthAccessTokenSecret – Twitter account access token secret
  • TwitterConsumerKey – Twitter account consumer key (API key)
  • TwitterConsumerKeySecret – Twitter account consumer secret (API secret)

You will need to create an app on Twitter: create a consumer key (API key), consumer secret (API secret), access token, and access token secret, and use them as parameters in the CloudFormation stack. You can create them using this link.

Additionally, you can modify which terms and languages will be pulled from the Twitter streaming API. This Lambda implementation calls Amazon Comprehend once per tweet. If you change the terms to something that may retrieve tens or hundreds of tweets a second, consider making batch calls or using AWS Glue with triggers to perform batch processing instead of stream processing, as sketched below.
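
As a rough illustration of the batch-call approach (the helper function below is our own, not part of the blog's code), Amazon Comprehend's batch APIs accept up to 25 documents per request:

import boto3

comprehend = boto3.client('comprehend')

def batch_sentiment(texts, lang='en'):
    # Score sentiment for a list of tweet texts, 25 per Comprehend call.
    results = []
    for i in range(0, len(texts), 25):
        chunk = texts[i:i + 25]
        response = comprehend.batch_detect_sentiment(TextList=chunk, LanguageCode=lang)
        # Each result carries an Index into the chunk plus Sentiment and SentimentScore.
        results.extend(response['ResultList'])
    return results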

Note: At the time of this blog post, Amazon Translate is in preview. If you do not have access to Amazon Translate, only include the en (English) value.

Note: The code in this blog assumes that the language codes used by Twitter are the same as those used by Amazon Translate and Amazon Comprehend. The code could easily be extended, but if you add new language values, confirm that this assumption still holds, or update the AWS Lambda code accordingly.
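
As a simple illustration, the processing code could guard against unexpected codes with a whitelist. The set below is an assumption based on the languages mentioned in this post, so adjust it to whatever you stream:

# Assumed whitelist of language codes accepted by both Translate and Comprehend.
SUPPORTED_LANGS = {'en', 'es', 'fr', 'de', 'pt', 'ar'}

def should_process(tweet):
    # Skip tweets whose Twitter language code we haven't verified against the ML services.
    return tweet.get('lang') in SUPPORTED_LANGS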

In the CloudFormation console, acknowledge by selecting the check boxes that allow AWS CloudFormation to create IAM resources and resources with custom names. The CloudFormation template uses serverless transforms, so choose Create Change Set to review the resources that the transforms add, and then choose Execute.

After the CloudFormation stack is launched, wait until it is complete.

When the launch is finished, you’ll see a set of outputs that we’ll use throughout this blog post:

Setting up the S3 notification – call Amazon Translate and Amazon Comprehend on new tweets

After the CloudFormation stack launch is complete, go to the Outputs tab for direct links and information, and then choose the LambdaFunctionConsoleURL link to open the Lambda function directly.

The Lambda function calls Amazon Translate and Amazon Comprehend to perform language translation and natural language processing (NLP) on tweets. The function then uses Amazon Kinesis Data Firehose to write the analyzed data back to Amazon S3.
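
The stack deploys the actual function for you; the following is only a minimal Python sketch of the same flow, assuming newline-delimited tweet JSON in each S3 object and hypothetical delivery stream names (SentimentStream and EntitiesStream), so treat the details as illustrative rather than the deployed code:

import json
import boto3

s3 = boto3.client('s3')
translate = boto3.client('translate')
comprehend = boto3.client('comprehend')
firehose = boto3.client('firehose')

# Hypothetical delivery stream names; the CloudFormation outputs list the real ones.
SENTIMENT_STREAM = 'SentimentStream'
ENTITIES_STREAM = 'EntitiesStream'

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')

        for line in filter(None, body.split('\n')):
            tweet = json.loads(line)
            original_text = tweet['text']
            lang = tweet.get('lang', 'en')

            # Translate non-English tweets to English before running Comprehend.
            text = original_text
            if lang != 'en':
                text = translate.translate_text(
                    Text=original_text,
                    SourceLanguageCode=lang,
                    TargetLanguageCode='en')['TranslatedText']

            # Sentiment analysis on the (possibly translated) text.
            sentiment = comprehend.detect_sentiment(Text=text, LanguageCode='en')
            scores = sentiment['SentimentScore']
            firehose.put_record(
                DeliveryStreamName=SENTIMENT_STREAM,
                Record={'Data': json.dumps({
                    'tweetid': tweet['id'],
                    'text': text,
                    'originalText': original_text,
                    'sentiment': sentiment['Sentiment'],
                    'sentimentPosScore': scores['Positive'],
                    'sentimentNegScore': scores['Negative'],
                    'sentimentNeuScore': scores['Neutral'],
                    'sentimentMixedScore': scores['Mixed']}) + '\n'})

            # Entity extraction, one Firehose record per entity found.
            for entity in comprehend.detect_entities(Text=text, LanguageCode='en')['Entities']:
                firehose.put_record(
                    DeliveryStreamName=ENTITIES_STREAM,
                    Record={'Data': json.dumps({
                        'tweetid': tweet['id'],
                        'entity': entity['Text'],
                        'type': entity['Type'],
                        'score': entity['Score']}) + '\n'})

The field names written here mirror the tweet_sentiments and tweet_entities tables that we create later with Athena.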

 

Most of this has been set up already by the CloudFormation stack, although we will have you add the S3 notification so that the Lambda function is invoked when new tweets are written to S3:

  1. Under Add Triggers, select the S3 trigger.
  2. Then configure the trigger to use the new S3 bucket that CloudFormation created, with a prefix of ‘raw/’. The event type should be Object Created (All). (A scripted alternative to these console steps is sketched after this list.)
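
If you'd rather script this step than click through the console, something like the following works; the bucket name and function ARN below are placeholders, so substitute the values from the CloudFormation outputs:

import boto3

# Placeholders: use the bucket and Lambda function ARN from the CloudFormation outputs.
BUCKET = 'your-tweets-bucket'
FUNCTION_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:YourTweetProcessor'

# Allow S3 to invoke the function for events from this bucket.
boto3.client('lambda').add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId='s3-invoke-tweet-processor',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::' + BUCKET)

# Invoke the function for every object created under the raw/ prefix.
boto3.client('s3').put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': FUNCTION_ARN,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'raw/'}]}}}]})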

Following least privilege patterns, the IAM role that the Lambda function has been assigned only has access to the S3 bucket that the CloudFormation template created.

The following diagram shows an example:

Take some time to examine the rest of the code. With a few lines of code, we can call Amazon Translate to convert between Arabic, Portuguese, Spanish, French, German, English, and many other languages.

The same is true for adding natural language processing into the application using Amazon Comprehend. Note how easily we were able to perform the sentiment analysis and entity extraction on the tweets within the Lambda function.

Start the Twitter stream producer

The only server used in this example sits outside the serverless ingestion flow that begins at Kinesis Data Firehose. It collects tweets from Twitter and pushes them into Kinesis Data Firehose. In a future post, we’ll show you how you can make this component serverless as well.
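
The provided producer is a Node.js app, but conceptually it simply writes each tweet's JSON to the raw delivery stream. A rough Python equivalent, with a placeholder stream name, would look like this:

import json
import boto3

firehose = boto3.client('firehose')

def push_tweet(tweet, stream_name='RawTweetsStream'):
    # The stream name is a placeholder; the CloudFormation stack creates the real one.
    # Firehose buffers these records and delivers them to the raw/ prefix in S3.
    firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={'Data': json.dumps(tweet) + '\n'})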

Use SSH to connect to the Amazon Linux EC2 instance that the CloudFormation stack created.

Part of the CloudFormation stack outputs includes an SSH-style command that can be used on many systems to connect to the instance.

Note: Refer to the Amazon EC2 documentation for details on how to connect either from a Windows or Mac/Linux machine.

Run the following command:

node twitter_stream_producer_app.js

This starts the flow of tweets. If you want to keep the flow running, run it as a background job. For simple testing, you can also just keep the SSH session open.

After a few minutes, you should be able to see the various datasets in the S3 bucket that the CloudFormation template created:

Note: If you don’t see any data, check that the Twitter reader is running correctly and not producing errors. If you see only the raw prefix and not the others, check that the S3 trigger is set up on the Lambda function.

Create the Athena tables

We are going to manually create the Amazon Athena tables. This is a great place to leverage AWS Glue crawling features in your data lake architectures. The crawlers will automatically discover the data format and data types of your different datasets that live in Amazon S3 (as well as relational databases and data warehouses). More details can be found in the documentation for Crawlers with AWS Glue.
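
For example, a crawler over the raw tweets prefix could be created and started with a couple of boto3 calls; the crawler name, IAM role, and S3 path below are placeholders:

import boto3

glue = boto3.client('glue')

# Placeholder name, role, and path; point these at your own data lake.
glue.create_crawler(
    Name='socialanalyticsblog-tweets-crawler',
    Role='AWSGlueServiceRole-SocialAnalytics',
    DatabaseName='socialanalyticsblog',
    Targets={'S3Targets': [{'Path': 's3://your-tweets-bucket/raw/'}]})

# The crawler infers the schema and registers a table in the Glue Data Catalog,
# which Athena can then query directly.
glue.start_crawler(Name='socialanalyticsblog-tweets-crawler')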

In Athena, run the following commands to create the Athena database and tables:

create database socialanalyticsblog;

This will create a new database in Athena.

Run the next statement.

IMPORTANT: Replace <TwitterRawLocation> with what is shown as an output of the CloudFormation script:

CREATE EXTERNAL TABLE socialanalyticsblog.tweets (
	coordinates STRUCT<
		type: STRING,
		coordinates: ARRAY<
			DOUBLE
		>
	>,
	retweeted BOOLEAN,
	source STRING,
	entities STRUCT<
		hashtags: ARRAY<
			STRUCT<
				text: STRING,
				indices: ARRAY<
					BIGINT
				>
			>
		>,
		urls: ARRAY<
			STRUCT<
				url: STRING,
				expanded_url: STRING,
				display_url: STRING,
				indices: ARRAY<
					BIGINT
				>
			>
		>
	>,
	reply_count BIGINT,
	favorite_count BIGINT,
	geo STRUCT<
		type: STRING,
		coordinates: ARRAY<
			DOUBLE
		>
	>,
	id_str STRING,
	timestamp_ms BIGINT,
	truncated BOOLEAN,
	text STRING,
	retweet_count BIGINT,
	id BIGINT,
	possibly_sensitive BOOLEAN,
	filter_level STRING,
	created_at STRING,
	place STRUCT<
		id: STRING,
		url: STRING,
		place_type: STRING,
		name: STRING,
		full_name: STRING,
		country_code: STRING,
		country: STRING,
		bounding_box: STRUCT<
			type: STRING,
			coordinates: ARRAY<
				ARRAY<
					ARRAY<
						FLOAT
					>
				>
			>
		>
	>,
	favorited BOOLEAN,
	lang STRING,
	in_reply_to_screen_name STRING,
	is_quote_status BOOLEAN,
	in_reply_to_user_id_str STRING,
	user STRUCT<
		id: BIGINT,
		id_str: STRING,
		name: STRING,
		screen_name: STRING,
		location: STRING,
		url: STRING,
		description: STRING,
		translator_type: STRING,
		protected: BOOLEAN,
		verified: BOOLEAN,
		followers_count: BIGINT,
		friends_count: BIGINT,
		listed_count: BIGINT,
		favourites_count: BIGINT,
		statuses_count: BIGINT,
		created_at: STRING,
		utc_offset: BIGINT,
		time_zone: STRING,
		geo_enabled: BOOLEAN,
		lang: STRING,
		contributors_enabled: BOOLEAN,
		is_translator: BOOLEAN,
		profile_background_color: STRING,
		profile_background_image_url: STRING,
		profile_background_image_url_https: STRING,
		profile_background_tile: BOOLEAN,
		profile_link_color: STRING,
		profile_sidebar_border_color: STRING,
		profile_sidebar_fill_color: STRING,
		profile_text_color: STRING,
		profile_use_background_image: BOOLEAN,
		profile_image_url: STRING,
		profile_image_url_https: STRING,
		profile_banner_url: STRING,
		default_profile: BOOLEAN,
		default_profile_image: BOOLEAN
	>,
	quote_count BIGINT
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '<TwitterRawLocation>';

This will create a tweets table. Next we’ll do the same and create the entities and sentiment tables. It is important to update both of these with the actual paths listed in your CloudFormation output.

First, run this command to create the entities table, replacing <TwitterEntitiesLocation> with the path listed in your CloudFormation output:

CREATE EXTERNAL TABLE socialanalyticsblog.tweet_entities (
	tweetid BIGINT,
	entity STRING,
	type STRING,
	score DOUBLE
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '<TwitterEntitiesLocation>';

And now run this command to create the sentiments table:

CREATE EXTERNAL TABLE socialanalyticsblog.tweet_sentiments (
	tweetid BIGINT,
	text STRING,
	originalText STRING,
	sentiment STRING,
	sentimentPosScore DOUBLE,
	sentimentNegScore DOUBLE,
	sentimentNeuScore DOUBLE,
	sentimentMixedScore DOUBLE
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '<TwitterSentimentLocation>';

After running these four statements and replacing the locations for the create table statements, you should be able to select the socialanalyticsblog database in the drop-down list and see the three tables:

You can run queries to investigate the data you are collecting. Let’s first look at the tables themselves.

We can look at a sample of 20 tweets:

select * from socialanalyticsblog.tweets limit 20;

Pull the top entity types:

select type, count(*) cnt from socialanalyticsblog.tweet_entities
group by type order by cnt desc

Now we can pull the top 20 commercial items:

select entity, type, count(*) cnt from socialanalyticsblog.tweet_entities
where type = 'COMMERCIAL_ITEM'
group by entity, type order by cnt desc limit 20;

Let’s now pull 20 positive tweets and see their scores from sentiment analysis:

select * from socialanalyticsblog.tweet_sentiments where sentiment = 'POSITIVE' limit 20;

You can also see how many tweets were collected in each language:

select lang, count(*) cnt from socialanalyticsblog.tweets group by lang order by cnt desc;

You can also start to query the translation details. Even if I don’t know the German word for shoe, I could easily do the following query:

select ts.text, ts.originaltext from socialanalyticsblog.tweet_sentiments ts
join socialanalyticsblog.tweets t on (ts.tweetid = t.id)
where lang = 'de' and ts.text like '%Shoe%'

The results show a tweet talking about shoes based on the translated text:

Let’s also look at the non-English tweets that have Kindle extracted through NLP:

select lang, ts.text, ts.originaltext from socialanalyticsblog.tweet_sentiments ts
join socialanalyticsblog.tweets t on (ts.tweetid = t.id)
where lang != 'en' and ts.tweetid in
(select distinct tweetid from socialanalyticsblog.tweet_entities
 where entity = 'Kindle')

Note: Technically, you don’t have to use fully qualified table names if the database is selected in Athena, but I did so to avoid problems for readers who haven’t selected the socialanalyticsblog database first.
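
These queries can also be run programmatically. A minimal boto3 sketch follows; the results bucket is a placeholder, and in practice you would add error handling:

import time
import boto3

athena = boto3.client('athena')

# Kick off a query; Athena writes results to the S3 location you specify.
execution_id = athena.start_query_execution(
    QueryString='select sentiment, count(*) cnt '
                'from socialanalyticsblog.tweet_sentiments group by sentiment',
    QueryExecutionContext={'Database': 'socialanalyticsblog'},
    ResultConfiguration={'OutputLocation': 's3://your-athena-results-bucket/'},
)['QueryExecutionId']

# Wait for the query to finish before reading results.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=execution_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(
        QueryExecutionId=execution_id)['ResultSet']['Rows']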

Building QuickSight dashboards

  1. Launch into QuickSight – https://us-east-1.quicksight.aws.amazon.com/sn/start.
  2. Choose Manage data from the top right.
  3. Choose New Data Set.
  4. Create a new Athena Data Source.
  5. Select the socialanalyticsblog database and the tweet_sentiments table.
  6. Then choose Edit/Preview Data.
  7. Under Table, choose Switch to custom SQL tool:
  8. Give the query a name (such as ‘SocialAnalyticsBlogQuery’).
  9. Put in this query:
    SELECT  s.*,
            e.entity,
            e.type,
            e.score,
             t.lang as language,
             coordinates.coordinates[1] AS lon,
             coordinates.coordinates[2] AS lat ,
             place.name,
             place.country,
             t.timestamp_ms / 1000 AS timestamp_in_seconds,
             regexp_replace(source,
             '\<.+?\>', '') AS src
    FROM socialanalyticsblog.tweets t
    JOIN socialanalyticsblog.tweet_sentiments s
        ON (s.tweetid = t.id)
    JOIN socialanalyticsblog.tweet_entities e
        ON (e.tweetid = t.id)
  10. Then choose Finish. This saves the query and lets you see sampled data.
  11. Switch the datatype for timestamp_in_seconds to be a date:
  12. Then choose Save and Visualize.

Now you can easily start to build some dashboards.

Note: Because the custom query joins each tweet to its entities, a tweet can appear in multiple rows, so you’ll want to use Count Distinct on tweetid as the value.

We’ll step you through creating a dashboard.

  1. Start by making the first visual in the top-left quadrant of the display.
  2. Select type and tweetid from the field list.
  3. Select the double-arrow drop-down next to Field Wells.
  4. Move tweetid to the value.
  5. Then set it to perform Count Distinct:
  6. Now switch it to a pie chart under the visualization types.

Now let’s add another visual.

  1. Choose Add (near the top left corner of the page), and then choose Add Visual.
  2. Resize it and move it next to your first pie chart.
  3. Now choose sentiment and timestamp_in_seconds.
  4. Under the field wells, or on the chart itself, you can zoom in and out of the time axis. Let’s zoom in to hours:
  5. Suppose we only want to see positive, negative, and mixed sentiments on the timeline. The Neutral line, at least for my Twitter terms, dwarfs the others and makes them hard to see.
  6. Click the Neutral line, and in the box that appears, choose Exclude Neutral.

Let’s step through adding one more visual to this analysis to show the translated tweets:

  1. Under Add, choose Add Visual.
  2. Resize it to be the bottom half of the space.
  3. Choose the Table View.
  4. Select:
    • language
    • text
    • originalText
  5. Then, on the left side, choose Filter.
  6. Create one on language.
  7. Then choose Custom filter, Does not equal, and enter a value of en.

 

Note: You might need to adjust the column widths in the Table view based on your screen resolution to see the last column.

Now you can resize and see the Entities, Sentiment over time, and translated tweets.

You can build multiple dashboards, zoom in and out of them, and see the data in different ways. For example, the following is a geospatial chart of the sentiment:

You can expand on this dashboard, and build analyses such as this one:

Shutting down

After you have created these resources, you can remove them by following these steps.

  1. Stop the Twitter stream reader (if you still have it running).
    1. CTRL-C or kill it if it’s in the background.
  2. Delete the S3 bucket that the CloudFormation template created.
  3. Delete the Athena tables and database (socialanalyticsblog):
    1. Drop table socialanalyticsblog.tweets.
    2. Drop table socialanalyticsblog.tweet_entities.
    3. Drop table socialanalyticsblog.tweet_sentiments.
    4. Drop database socialanalyticsblog.
  4. Delete the CloudFormation stack (ensure that the S3 bucket is empty prior to deleting the stack). A sketch of scripting steps 2 and 4 follows this list.
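
If you'd like to script the bucket cleanup and stack deletion, a minimal sketch looks like this; the bucket name and stack name below are placeholders, so substitute the ones from your deployment:

import boto3

BUCKET = 'your-tweets-bucket'
STACK_NAME = 'SocialMediaAnalyticsBlogPost'

# Empty the bucket first; CloudFormation cannot delete a non-empty bucket.
boto3.resource('s3').Bucket(BUCKET).objects.all().delete()

# Then delete the stack, which removes the remaining ingestion resources.
boto3.client('cloudformation').delete_stack(StackName=STACK_NAME)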

Conclusion

The entire processing, analytics, and machine learning pipeline was built without spinning up a single server: Amazon Kinesis ingests and delivers the data, Amazon Translate translates tweets between languages, Amazon Comprehend performs sentiment analysis and entity extraction, and Amazon QuickSight provides the dashboards.

We added advanced machine learning (ML) services to our flow, through some simple calls within AWS Lambda, and we built a multi-lingual analytics dashboard with Amazon QuickSight. We have also saved all the data to Amazon S3 so, if we want, we can do other analytics on the data using Amazon EMR, Amazon SageMaker, Amazon Elasticsearch Service, or other AWS services.

Instead of running the Amazon EC2 instance that reads the Twitter firehose, you could leverage AWS Fargate to deploy that code as a container. AWS Fargate is a technology for Amazon Elastic Container Service (ECS) and Amazon Elastic Container Service for Kubernetes (EKS) that allows you to run containers without having to manage servers or clusters: you no longer have to provision, configure, and scale clusters of virtual machines, choose server types, decide when to scale, or optimize cluster packing. With AWS Fargate, you can focus on designing and building your applications instead of managing the infrastructure that runs them.

 


Additional Reading

Learn how to detect sentiments in customer reviews with Amazon Comprehend.


 

About the Authors

Ben Snively is a Public Sector Specialist Solutions Architect. He works with government, non-profit, and education customers on big data and analytics projects, helping them build solutions using AWS. In his spare time, he adds IoT sensors throughout his house and runs analytics on the data.

 

 

Viral Desai is a Solutions Architect with AWS. He provides architectural guidance to help customers achieve success in the cloud. In his spare time, Viral enjoys playing tennis and spending time with family.