AWS Architecture Blog
Field Notes: Building a Scalable Real-Time Newsfeed Watchlist Using Amazon Comprehend
One challenge businesses face is constantly monitoring media outlets and being alerted when a key interest is mentioned, such as an individual, product, or company. One way to do this is to scan media and news feeds against a company watchlist. The list may contain personal names, organizations or suborganizations of interest, and other types of entities (for example, company products). There are several reasons why a company might need such a process: mitigating reputation risk, detecting data leaks, tracking competitor influence, and staying aware of market changes.
In this post, I will share a prototype solution that combines several AWS services: Amazon Comprehend, Amazon API Gateway, AWS Lambda, and Amazon Aurora Serverless. We will examine the solution architecture, and I will provide the steps to create the solution in your AWS environment.
Overview of solution
Walkthrough
The preceding architecture shows an event-driven design. To interact with the solution, we use Lambda functions exposed through the API, which are invoked on user request. The high-level steps are as follows.
- Create a watchlist. Use the “refresh_watchlist” API to submit new data, or load existing data from a CSV file located in a bucket. See the “Loading the watchlist data” section for details.
- Verify that the data loaded properly by checking a known keyword with the “check-keyword” API. See the “Testing” section for details.
- Once the watchlist data is ready, submit a request to “query_newsfeed” with a newsfeed configuration (URL, document section qualifier). This creates a new job that scans the content against the watchlist. Review the example in the “Testing” section.
- If an entity or keyword matches, you will receive a notification email with the match results.
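The scan request in the third step includes a URL and a document section qualifier. As a rough illustration of how such a qualifier might narrow the scraped content, the following Python sketch keeps only the text inside a chosen HTML tag. Treating the qualifier as a tag name, and the names SectionText and extract_section, are assumptions made for this example rather than the repository's implementation.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collect the text found inside every element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while inside the selected section.
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_section(html, tag="p"):
    """Return the text of all <tag> elements, joined with single spaces."""
    parser = SectionText(tag)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling extract_section(page_html, "p") on a downloaded article would return only the paragraph text, which keeps navigation menus and footers out of the content that is later analyzed.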
Technical Walkthrough
- When a new request to “query_newsfeed” is submitted, the Lambda handler extracts the content of the URL and creates a new message in the ‘Incoming Queue’.
- Once messages are available in the incoming queue, a subscribed Lambda function, “evaluate content”, is invoked. It takes the scraped content from the message and submits it to Amazon Comprehend to extract the desired elements (entities, key phrases, sentiment).
- The Amazon Comprehend output is passed to a matching component, which compares it against the watchlist (stored in an Aurora Serverless PostgreSQL database) using fuzzy name matching.
- If a match occurs, a new message is generated for Amazon SNS which initiates a notification email.
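The evaluation step described above can be sketched in Python as follows. This is a minimal illustration, not the repository's code: the function names, the score threshold, and the message field "content" are hypothetical, while detect_entities and detect_key_phrases are existing boto3 Comprehend client calls. The client is passed in as an argument so the parsing logic can be exercised without AWS credentials.

```python
import json


def collect_terms(entities_response, key_phrases_response, min_score=0.8):
    """Flatten Comprehend responses into (text, type) pairs above a score threshold."""
    terms = []
    for entity in entities_response.get("Entities", []):
        if entity["Score"] >= min_score:
            terms.append((entity["Text"], entity["Type"]))
    for phrase in key_phrases_response.get("KeyPhrases", []):
        if phrase["Score"] >= min_score:
            terms.append((phrase["Text"], "KEY_PHRASE"))
    return terms


def evaluate_content(event, comprehend):
    """Handle an SQS-triggered Lambda event; `comprehend` is a boto3 Comprehend client."""
    terms = []
    for record in event["Records"]:
        # Assumption: the message body is JSON with the scraped text under "content".
        text = json.loads(record["body"])["content"]
        entities = comprehend.detect_entities(Text=text, LanguageCode="en")
        phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
        terms.extend(collect_terms(entities, phrases))
    return terms
```

In a deployed function, the client would typically be created once with boto3.client("comprehend") outside the handler, and the returned terms would then be passed to the matching step.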
To deploy and test the solution we follow four steps:
- Create infrastructure
- Create serverless API Layer
- Load Watchlist data
- Test the match
The code for Building a Scalable real-time newsfeed watchlist is available in this repository.
Prerequisites
You will need an AWS account and the Serverless Framework CLI installed.
Security best practices
Before setting up your environment, review the following best practices, and if required change the source code to comply with your security standards.
- Deploy the API and Lambda in your private VPC (learn more about deploying serverless in a VPC)
- Protect your API Gateway using a WAF
- Check AWS Managed Rules for AWS WAF
- Check serverless plugin for WAF association
- Review Security Best Practices for Amazon Simple Storage Service (Amazon S3)
- Review Security Best Practices for AWS Key Management Service
- Learn how to approve Amazon Comprehend at FSI Service Spotlight
Creating the infrastructure
I recommend reviewing these resources to get started with Amazon Aurora Serverless, Lambda, Amazon Comprehend, and Amazon S3.
To begin the procedure:
1. Clone the GitHub repository to your local drive.
2. Navigate to the “infrastructure” directory.
3. Use the AWS CLI or the AWS CDK to deploy the stack.
With the AWS CLI:
aws cloudformation deploy --template-file RealtimeNewsAnalysisStack.template.json --stack-name RealtimeNewsAnalysis --parameter-overrides notificationemailparam=yourEmailGoesHere@YourEmailDomain.com
With the AWS CDK:
cdk synth
cdk deploy --parameters notificationemailparam=abc@domain.com
4. Navigate back to the root directory and into the “serverless” directory.
5. Run the following Serverless Framework deployment commands:
sls plugin install -n serverless-python-requirements
sls deploy
6. Load data to the watchlist using a standard web service call to the “refresh_watchlist” API.
7. Test the service by calling the web service “check-keyword”.
8. Use the “query_newsfeed” web service to scan news feeds and articles against the watchlist.
9. Check your mailbox for match notifications.
10. To clean up and remove the solution, review the “Cleaning up” section at the end of this post.
Loading the watchlist data
We can use the refresh_watchlist API to replace the existing watchlist with the list provided in the message. Use a tool like Postman to send a POST request to the refresh_watchlist API.
Set the message body to RAW – JSON:
{
  "refresh_list_from_bucket": false,
  "watchlist": [
    {"entity": "Mateo Jackson", "entity_type": "person"},
    {"entity": "AnyCompany", "entity_type": "organization"},
    {"entity": "Example product", "entity_type": "product"},
    {"entity": "Alice", "entity_type": "person"},
    {"entity": "Li Juan", "entity_type": "person"}
  ]
}
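For readers who prefer a script to Postman, the same request can be prepared with Python's standard library. This is a sketch under assumptions: the endpoint URL is a placeholder that must be replaced with the one reported for your own deployment, and build_refresh_request is a hypothetical helper name. The call that actually sends the request is left commented out.

```python
import json
import urllib.request

# Placeholder: replace with the refresh_watchlist endpoint of your own deployment.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/dev/refresh_watchlist"


def build_refresh_request(entries, url=API_URL):
    """Build a POST request whose JSON body mirrors the message shown above."""
    payload = {
        "refresh_list_from_bucket": False,
        "watchlist": [
            {"entity": name, "entity_type": kind} for name, kind in entries
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_refresh_request(
    [("Mateo Jackson", "person"), ("AnyCompany", "organization")]
)
# Uncomment to send the request against a deployed API:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```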
You can also load the watchlist from a CSV file. Locate your newsfeed bucket, create a directory named “watchlist” in it, and upload a CSV file named “watchlist.csv” (no header required) to that directory.
CSV Example:
The following is a screenshot showing how Postman initiates the request.
Testing
You can use the dedicated check-keyword API to test a list of keywords and see whether the matching works. This does not use Amazon Comprehend, but it verifies that the list is loaded properly and that a given keyword matches against it.
Note the misspelling of Alice as “alise” (with an “s” instead of a “c”), and the phonetic spelling of Li as “Lee”. Both are returned as matches.
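To see why both misspellings can match, here is a minimal pure-Python sketch of the two techniques named in this post, Levenshtein distance and Soundex. It is illustrative only: the deployed solution runs its matching against the watchlist in the Aurora Serverless database, and the function names, the rule for combining the two checks, and the distance threshold of 1 are assumptions for this example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Map each consonant to its Soundex digit; vowels, H, W, and Y have no digit.
SOUNDEX_CODES = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items() for c in letters}


def soundex(word):
    """American Soundex code: first letter followed by three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    last = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "HW":  # H and W do not separate letters with the same code
            last = digit
    return (code + "000")[:4]


def is_match(candidate, entry, max_distance=1):
    """Match when the names are close in spelling or sound alike."""
    a, b = candidate.lower(), entry.lower()
    return levenshtein(a, b) <= max_distance or soundex(a) == soundex(b)
```

With these definitions, is_match("alise", "Alice") succeeds on edit distance (a single substitution), while is_match("Lee", "Li") succeeds on Soundex because both names encode to L000. Very short names share codes easily, so a production matcher would likely need stricter rules for them.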
Now, let’s test it with a related news article.
Check your mailbox! You should receive an email with the match result.
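As an illustration of how a match could turn into that email, the sketch below composes a summary and publishes it to an SNS topic. The message layout and the names build_notification and notify are hypothetical, and the topic ARN is supplied by the caller. The publish call with TopicArn, Subject, and Message is an existing boto3 SNS client method, and the client is passed in so the formatting can be tried without AWS access.

```python
def build_notification(url, matches):
    """Compose an email subject and body from (found_text, watchlist_entry) pairs."""
    subject = f"Watchlist match: {len(matches)} hit(s)"
    lines = [f"Source: {url}", ""]
    lines += [
        f"- '{found}' matched watchlist entry '{entry}'" for found, entry in matches
    ]
    return subject, "\n".join(lines)


def notify(sns, topic_arn, url, matches):
    """Publish the match summary; `sns` is a boto3 SNS client. Skips empty results."""
    if not matches:
        return None
    subject, message = build_notification(url, matches)
    return sns.publish(TopicArn=topic_arn, Subject=subject, Message=message)
```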
Cleaning up
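To clean up, delete the infrastructure stack with CloudFormation or the CDK (for example, `aws cloudformation delete-stack --stack-name RealtimeNewsAnalysis` or `cdk destroy`), and remove the serverless deployment with `sls remove`.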
Conclusion
In this post, you learned how to create a scalable watchlist and use it to monitor newsfeed content. This is a practical demonstration of a typical customer problem. The Levenshtein distance and Soundex algorithms, along with Amazon Comprehend's built-in machine learning capabilities, provide a powerful method to process and analyze text. To support a high volume of queries, the solution uses Amazon SQS to process messages and Amazon Aurora Serverless to automatically scale the database as needed. It is possible to use the same queue to ingest additional data sources.
This solution can be adapted for other purposes, such as screening against the OFAC watchlist for financial institutions (a work in progress), or other monitoring applications. Feel free to provide feedback and tell us how this solution can be useful for you.
Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.
References
Developer Guide: Amazon Comprehend
Get started with Serverless Framework Open Source & AWS