AWS Architecture Blog
Field Notes: Scaling Browser Automation with Puppeteer on AWS Lambda with Container Image Support
This post is contributed by Bill Kerr, SHI and Raj Seshadri, Global SA Lead, AWS.
Imagine you are launching a brand new website selling goods and services. You expect a surge of traffic because the product is seasonal, and you want to test 100K simultaneous connections to confirm that the site works properly. How would you go about doing that? One option is a headless browser automation tool such as Puppeteer. Puppeteer can now be packaged in a container image and run as a Lambda function to perform browser automation or web scraping.
Puppeteer is a Node.js library that lets you automate tasks in headless Chrome. Running Puppeteer in Lambda with container image support lets you scale browser automation horizontally. With a container image, Node.js packages are installed in the image itself instead of in Lambda layers. This post shows how to run Puppeteer and Chrome in a Lambda container function. In the example, multiple instances of Puppeteer simultaneously take screenshots of several popular news websites and store them in Amazon S3.
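To make the screenshot step concrete, here is a minimal sketch of how a single capture might look. It is not the code from the example application. The function name `captureScreenshot` and the injected `launchBrowser` parameter are hypothetical; the calls made on the browser and page objects (`newPage`, `goto`, `screenshot`, `close`) are standard Puppeteer methods. The Chrome launch options needed inside a Lambda container are deliberately left to the caller.

```javascript
// Minimal sketch, not the example application's code.
// captureScreenshot and launchBrowser are hypothetical names.
// launchBrowser is any function that resolves to a Puppeteer Browser,
// for example: () => require('puppeteer').launch({ headless: true })

async function captureScreenshot(launchBrowser, url) {
  const browser = await launchBrowser();
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled before capturing.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Resolves to the PNG image bytes for the full page.
    return await page.screenshot({ type: 'png', fullPage: true });
  } finally {
    // Always close Chrome so a failed page load does not leak the process.
    await browser.close();
  }
}
```

Passing the launcher in as a parameter keeps the capture logic independent of how Chrome is started, which also makes it possible to exercise the function with a stand-in browser object.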
Solution Overview
The overall solution architecture is shown in the preceding diagram. Two Lambda functions are used in this example.
- A Puppeteer function that requires a URL and bucket name as inputs. This uses Puppeteer to take a screenshot of the URL in headless Chrome and save the image in the S3 bucket.
- A fan-out function that requires a list of URLs as input and asynchronously invokes the Puppeteer function once for each URL in the list.
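The fan-out pattern described above can be sketched as follows. This is an illustration under assumptions, not the example application's code: the helper name `buildInvokeRequests`, the payload field names `url` and `bucket`, and the placeholder function and bucket names are all hypothetical. The request fields `FunctionName`, `InvocationType`, and `Payload` follow the Lambda Invoke API, where an `InvocationType` of `Event` requests asynchronous invocation.

```javascript
// Sketch only: buildInvokeRequests and the payload field names (url, bucket)
// are hypothetical and not taken from the example application.

// Build one Lambda Invoke request per URL.
// InvocationType 'Event' asks Lambda to queue the event and return
// immediately, which is what makes the fan-out asynchronous.
function buildInvokeRequests(urls, functionName, bucketName) {
  return urls.map((url) => ({
    FunctionName: functionName,
    InvocationType: 'Event',
    Payload: Buffer.from(JSON.stringify({ url, bucket: bucketName })),
  }));
}

const requests = buildInvokeRequests(
  ['https://example.com/', 'https://example.org/'],
  'puppeteer-function',   // placeholder function name
  'my-screenshot-bucket'  // placeholder bucket name
);
console.log(requests.length); // prints 2
```

Each request object could then be sent with the Lambda Invoke API, for example through `InvokeCommand` in the AWS SDK for JavaScript. Because the invocations are asynchronous, the fan-out function does not wait for the screenshots to finish.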
Lambda container Dockerfile for Puppeteer function
Here is a documented version of the Dockerfile that is used to create a container for use with Lambda.
Deploy the cloud infrastructure
Prerequisite
AWS CDK must be installed. Review the CDK installation instructions.
Download and install the dependencies and example CDK application
In a terminal, check out the code used in this article and install it.
Puppeteer Usage
The next steps are performed in the AWS Console.
1. In the AWS Console, open the Lambda function that was created by the CDK deployment above.
- Look for InvokeLambdaFunctionName in the Outputs section to get the name of the function to open.
- You can also find the function name in the Resources tab of the CloudFormation stack in the AWS Console.
2. In the function, click on the Test tab.
3. Create a new test event with JSON like the following, and run it. Have fun changing the URLs to what you want.
4. Click on the Invoke button to invoke the fan-out function.
5. Open the S3 bucket that was created by the CDK deployment.
6. Look for puppeteer.BucketName in the Outputs section to get the bucket name.
7. Within a minute of running the fan-out function, you should see a list of images in the screenshots folder in the bucket. They should slowly trickle in as you refresh the list of screenshots until all are done.
8. If any screenshots are missing, you can view CloudWatch Logs for the Puppeteer function.
9. Search all of the logs for "error" to find failures and determine where the code needs improved error handling.
10. You could modify the app to perform functional testing of a website, and save screenshots in S3 whenever errors occur.
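When many URLs are submitted, spotting a missing screenshot by eye becomes tedious. The following sketch compares the submitted URLs against the object keys found in the bucket and reports which ones have no screenshot. The helper names and the key layout `screenshots/<hostname>.png` are assumptions made for illustration; the example application may name its objects differently.

```javascript
// Sketch under assumptions: the key layout "screenshots/<hostname>.png"
// is hypothetical; the example application may name objects differently.

// Derive the object key expected for a given URL.
function screenshotKey(url) {
  const { hostname } = new URL(url);
  return `screenshots/${hostname}.png`;
}

// Return the URLs whose expected object key is absent from the bucket listing.
function findMissingScreenshots(urls, existingKeys) {
  const present = new Set(existingKeys);
  return urls.filter((url) => !present.has(screenshotKey(url)));
}

const missing = findMissingScreenshots(
  ['https://example.com/', 'https://example.org/news'],
  ['screenshots/example.com.png']
);
console.log(missing); // [ 'https://example.org/news' ]
```

The list of existing keys could come from an S3 object listing, and the URLs reported as missing are the ones worth looking up in the Puppeteer function's CloudWatch Logs.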
Clean up
In the AWS Console, manually empty the bucket that was created by the CDK. Look for puppeteer.BucketName in the Outputs section. You can also find the bucket name in the Resources tab of the CloudFormation stack. Then, run the following command after the bucket has been emptied.
Conclusion
In this post, we showed you how to use Lambda functions packaged as container images to perform browser automation and web scraping with Puppeteer. The same approach extends to other workloads, such as load testing or functional testing of a website, when using Lambda with container image support.
For more serverless learning resources, visit the Serverless Land website.