AWS Architecture Blog

Field Notes: Building a Scalable Real-Time Newsfeed Watchlist Using Amazon Comprehend

One of the challenges businesses have is to constantly monitor information via media outlets and be alerted when a key interest is picked up, such as individual, product, or company information. One way to do this is to scan media and news feeds against a company watchlist. The list may contain personal names, organizations or suborganizations of interest, and other type of entities (for example, company products). There are several reasons why a company might need to develop such a process: reputation risk mitigation, data leaks, competitor influence, and market change awareness.

In this post, I will share with you a prototype solution that combines several AWS Services: Amazon Comprehend, Amazon API Gateway, AWS Lambda, and Amazon Aurora Serverless. We will examine the solution architecture and I will provide you with the steps to create the solution in your AWS environment.

Overview of solution

Architecture showing how to build a Scalable Real-Time Newsfeed Watchlist Using Amazon Comprehend

Figure 1 – Architecture showing how to build a Scalable Real-Time Newsfeed Watchlist Using Amazon Comprehend

Walkthrough

The preceding architecture shows an Event-driven design. To interact with the solution we use API Lambda functions which are initiated upon user request. Following are the high-level steps.

  1. Create a watchlist. Use the “Refresh_watchlist” API to submit new data, or load existing data from a CSV file located in a bucket. More details in the following “Loading Data” section.
  2. Make sure the data is loaded properly and check a known keyword. Use the “check-keyword” API. More details in the following “Testing” section.
  3. Once the watchlist data is ready, submit a request to “query_newsfeed” with a given newsfeed configuration (url, document section qualifier) to submit a new job to scan the content against the watchlist. Review the example in the “Testing” section.
  4. If an entity or keyword matched, you will get a notification email with the match results.

Technical Walkthrough

  • When a new request to “query_newsfeed” is submitted. The Lambda handler extracts the content of the URL and creates a new message in the ‘Incoming Queue’.
  • Once there are available messages in the incoming queue, a subscribed Lambda function is invoked “evaluate content”. This takes the scraped content from the message and submits it to Amazon Comprehend to extract the desired elements (entities, key phrase, sentiment).
  • The result of Amazon Comprehend is passed through a matching logic component, which runs the results against the watchlist (Aurora Serverless Postgres DB), utilizing Fuzzy Name matching.
  • If a match occurs, a new message is generated for Amazon SNS which initiates a notification email.

To deploy and test the solution we follow four steps:

  1. Create infrastructure
  2. Create serverless API Layer
  3. Load Watchlist data
  4. Test the match

The code for Building a Scalable real-time newsfeed watchlist is available in this repository.

Prerequisites

You will need an AWS account and a Serverless Framework CLI Installed.

Security best practices

Before setting up your environment, review the following best practices, and if required change the source code to comply with your security standards.

Creating the infrastructure

I recommend reviewing these resources to get started with: Amazon Aurora Serverless, Lambda, Amazon Comprehend, and Amazon S3.

To begin the procedure:

  1. Clone the GitHub repository to your local drive.
  2. Navigate to “infrastructure” directory.
  3. Use CDK or AWS CLI to deploy the stack:
aws cloudformation deploy --template RealtimeNewsAnalysisStack.template.json --stack-name RealtimeNewsAnalysis --parameter-overrides notificationemailparam=yourEmailGoesHere@YourEmailDomain.com
cdk synth / deploy –-parameters notificationemailparam=abc@domain.com

4. Navigate back to the root directory and into the “serverless” directory.

5. Initiate the following serverless deployment commands:

sls plugin install -n serverless-python-requirements 
sls deploy

6. Load data to the watchlist using a standard web service call to the “refresh_watchlist” API.

7. Test the service by calling the web service “check-keyword”.

8. Use “query_newsfeed” Web service to scan newsfeed and articles against the watchlist.

9. Check your mailbox for match notifications.

10. For cleanup and removal of the solution, review the “clean up” section at the end of this post.

The following screenshot shows best practices for images.

Screenshot showing image best practices

Figure 2 – Screenshot showing image best practices

Loading the watchlist data

We can use the refresh watchlist API to recreate the list with the list provided in the message. Use a tool like Postman to send a POST web service call to the refresh_watchlist.

Set the message body to RAW – JSON:

{
"refresh_list_from_bucket": false,
    "watchlist": [
        {"entity":"Mateo Jackson", "entity_type": "person"},
        {"entity":"AnyCompany", "entity_type": "organization"},
        {"entity":"Example product", "entity_type": "product"},
        {"entity":"Alice", "entity_type": "person"},
        {"entity":"Li Juan", "entity_type": "person"}
    ] }

It is possible to use a CSV file to load the data into the watchlist. Locate your newsfeed bucket and upload a CSV file “watchlist.csv” (no header required) under a directory “watchlist” in the newsfeed bucket (create the directory).

CSV Example:

CSV example table

The following is a screenshot showing how Postman initiates the request.

Screenshot showing Postman initiate the request

Figure 3 – Screenshot showing Postman initiate the request

Testing

You can use the dedicated check keyword API to test against a list of keywords to see if the match works. This does not utilize Amazon Comprehend, but it can verify that the list is loaded properly and match against a given criterion.

can use the dedicated check keyword API to test against a list of keywords

Figure 4 – You can use the dedicated check keyword API to test against a list of keywords

Note: the spelling mistake for alise with an “s” instead of “c”, and, the pronunciation of Li is spelled as Lee. Both returned as a match.

Now, let’s test it with a related news article.

Screenshot showing test with a related news article

Figure 5 – Screenshot showing test with a related news article

Check your mailbox! You should receive an email with the match result.

Cleaning up

Use cloudformation/cdk for clean up. Also, use serverless clean up `sls remove`.

Conclusion

In this post, you learned how to create a scalable watchlist and use it to monitor newsfeed content. This is a practical demonstration for a typical customer problem. The algorithms Levenshtein distance and soundex, along with Amazon Comprehend built-in machine learning capabilities, provides a powerful method to process and analyze text. To support a high volume of queries, the solution uses Amazon SQS to process messages and Amazon Aurora Serverless to automatically scale the database as needed. It is possible to use the same queue for additional data source ingestion.

This solution can be modified for additional purposes such as financial institutions OFAC watchlist (Work in progress) or other monitoring applications. Feel free to provide feedback and tell us how this solution can be useful for you.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

References

Developer Guide: Amazon Comprehend

Amazon Aurora Serverless

Amazon Simple Queue Service

PostgreSQL Documentation

Get started with Serverless Framework Open Source & AWS

 

 

TAGS:
Zahi Ben Shabat

Zahi Ben Shabat

Zahi Ben Shabat is a Sr. Prototype Architect at AWS, helping customers to remove obstacles and accelerate their journey to the cloud by building prototypes. He has worked with financial customers for almost a decade. In his free time, Zahi enjoys the outdoors, playing music, and swimming.