Skopenow: People Search Made 8x Faster with AWS Lambda
Guest post by Rob Douglas, CTO and Co-Founder of Skopenow.com
Skopenow.com is a people search platform. Our software is used by industries such as insurance, HR, and the government, but it can also be used by the public for people, email, and phone searches. Each search uses multiple APIs and our own search spider that collects information based on the given inputs and variables. The end product of each search is a beautiful, automated report that contains raw data, URLs, and screenshots. In this post I’ll share our backstory, our migration to AWS, and how we transformed our product using AWS Lambda.
Who is Skopenow?
The four co-founders of Skopenow come from varied backgrounds. As the company’s CTO, I have a background in software, design, and product development, which is my focus. Diane Krause and Mark Derrenberger have over 20 years of experience in the insurance industry; they manage partner relations and oversee operations. Patrick Linden brings expertise in business development and direct sales, and is heavily involved in streamlining our sales pipeline.
Where We Started and Where We’re Going
Skopenow began in 2013 as a manual process. Customers would come to us, ask for an insurance claimant’s information, and our team of analysts would manually search the web for any traces of the individual. It would take around three to four hours to assemble a report of URLs, screenshots, and any instances of fraud that were detected (usually related to injury and identity).
Fast forward to 2015: Skopenow is now a search engine. Our software takes a name and a location and generates an automated PDF report. Relatives, phone numbers, emails, websites, social media profiles, blogs, comments, usernames, nicknames, and more are generated almost instantly.
Searches are Resource Intensive
For privacy reasons, Skopenow does not maintain its own database of consumer data. This means that every search fetches new information as soon as the search begins. Our algorithm creates combinations of variables, collects new information, scavenges the web for data pertaining to the input, and generates images. As you can imagine, for this to happen in real time, serious power is needed.
In early 2015, our stack was hosted on Rackspace. Unfortunately, during load testing we discovered that producing 10 concurrent reports, each within a minute, would be impossible. We urgently needed a quicker and more cost-effective scaling model.
Rackspace vs. Amazon Web Services
As our search volume increased, our growing server requirements and cost to scale became problematic. Additionally, within Rackspace, the lackluster speed of the scaling environment inhibited growth. Adding a new Rackspace server would take ~300 seconds. While conducting side-by-side tests, we discovered that this same feat takes only 90 seconds in AWS.
The switch from Rackspace to AWS was seamless. Within a few days of testing, it was clear that AWS allowed for a more agile development environment. In particular, the scaling capacity of Amazon EC2 made growth seem much more feasible.
It all seemed to be going smoothly, until we started seeing more traffic.
Scaling Issues on EC2
In order to conduct a single search (with approximately 500 generated images) in under one minute, each concurrent search required 16 EC2 instances. The cost of on-demand scaling was not life-threatening, but the 90 seconds it took to initiate new instances created an unpleasant user experience. On top of the slow initialization, we were paying hourly for the added resources. If 10 people conducted a search at the same moment, our environment would require 160 EC2 instances, and those 160 instances would continue running for 60 minutes, even without any search demand. Ouch.
Another issue with the EC2 environment was the 250-instance limit applied to our account. Because each concurrent search required 16 instances, that allowed us a maximum of only 15 concurrent searches at any given moment. To avoid the initialization lag and a compromised user experience, we were forced to set our baseline to accommodate 10 concurrent searches. Running 160 instances would cost ~$3,500/month without considering bandwidth and storage. Even at that cost, our most pressing issue was scalability, which was capped by the 250-instance limit.
We also ran into problems when rendering PDFs greater than 50 pages long. Generating each PDF took up to two minutes, and requests were worked through a serial queue without parallel processing. If 100 people submitted a request for a PDF, the 100th request could take over three hours to process due to the serial queue. It became apparent that our software/server configuration had a defined ceiling, and if there is one thing entrepreneurs and venture capitalists hate, it is ceilings.
We needed a pay-per-use service that provided a cost-effective and limitless scaling environment. Enter AWS Lambda.
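To make the instance arithmetic above concrete, here is a minimal Python sketch. The figures (16 instances per search, a 250-instance account limit, a baseline of 10 concurrent searches, roughly $3,500/month) come from this post; the function and constant names are made up for illustration, and the per-instance cost is only what those figures imply, not a quoted AWS price.

```python
# Figures taken from the post; names are hypothetical.
INSTANCES_PER_SEARCH = 16
ACCOUNT_INSTANCE_LIMIT = 250
BASELINE_CONCURRENT_SEARCHES = 10
BASELINE_MONTHLY_COST_USD = 3500.0


def instances_needed(concurrent_searches: int) -> int:
    """EC2 instances required to serve the given number of concurrent searches."""
    return concurrent_searches * INSTANCES_PER_SEARCH


def max_concurrent_searches(instance_limit: int) -> int:
    """Largest number of whole searches that fits under the instance limit."""
    return instance_limit // INSTANCES_PER_SEARCH


baseline_instances = instances_needed(BASELINE_CONCURRENT_SEARCHES)
search_ceiling = max_concurrent_searches(ACCOUNT_INSTANCE_LIMIT)

# Implied by the post's numbers only; excludes bandwidth and storage.
implied_cost_per_instance_month = BASELINE_MONTHLY_COST_USD / baseline_instances

print(baseline_instances, search_ceiling, implied_cost_per_instance_month)
```

The integer division is the important part: 250 instances do not divide evenly by 16, so the leftover instances cannot serve another full search.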
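The queueing delay described above can be checked with a small Python sketch. It assumes the worst case from this post (two minutes per PDF) and models the queue very simply; the parallel variant and its `workers` parameter are hypothetical, added only to show how the wait shrinks when requests no longer line up behind one another.

```python
import math


def serial_wait_minutes(position: int, minutes_per_pdf: float) -> float:
    """Minutes until the request at `position` (1-based) finishes in a serial queue."""
    return position * minutes_per_pdf


def parallel_wait_minutes(position: int, minutes_per_pdf: float, workers: int) -> float:
    """Same wait when `workers` PDFs render at once (simplified batch model)."""
    batches = math.ceil(position / workers)
    return batches * minutes_per_pdf


worst_case = serial_wait_minutes(100, 2)  # the 100th request in the post's example
print(f"{worst_case} minutes, or {worst_case / 60:.1f} hours")
```

Under this model the 100th request waits 200 minutes, which matches the "over three hours" figure, while 100 parallel workers would bring the same request back to a single two-minute render.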
AWS Lambda Introduction
I expressed my server needs to Chris, our assigned AWS Account Manager, and to Vyom, an AWS Solutions Architect. Chris and Vyom shared my scaling environment concerns. Vyom mentioned a service that AWS offers that could potentially eliminate our need for instance scaling. He also mentioned that the service was billed on a pay-per-use model. Chris sounded excited, but reluctantly said, “…but your system needs to operate and run Node.js.” My eyes lit up! Most of Skopenow’s framework is built in Node.js. I asked what the service was called. Vyom replied, “Lambda.” As we would quickly learn, AWS Lambda would transform our business.
Transitioning to Lambda
Within a few days we converted all of our processes into Lambda functions. Having the functions on standby costs nothing, and you are billed only when the functions are executed. While a Lambda function takes a bit longer than an EC2 server to execute and return data (~7 seconds), we can run each function 1,000 times/sec without needing to scale. In our case, one image is equivalent to one function invocation, which means we can now generate approximately 1,000 images within 7 seconds. By contrast, image processing on EC2 requires an additional instance for every image processed in parallel.
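One way to picture the "one image, one function" model is a fan-out in which every image is handed to its own worker. The Python sketch below is an illustration only: a local thread pool stands in for parallel Lambda invocations, and `render_image` and `fan_out` are hypothetical names, not part of Skopenow's code or of any AWS SDK.

```python
from concurrent.futures import ThreadPoolExecutor


def render_image(url: str) -> str:
    """Hypothetical stand-in for one function invocation.

    A real function would take a screenshot; this one only derives the
    name the screenshot would be stored under.
    """
    return url.replace("https://", "").replace("/", "_") + ".png"


def fan_out(urls: list[str], max_parallel: int = 1000) -> list[str]:
    """Run one render per URL, up to `max_parallel` at a time, keeping input order."""
    workers = min(max_parallel, len(urls)) or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(render_image, urls))


if __name__ == "__main__":
    pages = [f"https://example.com/page/{i}" for i in range(5)]
    print(fan_out(pages))
```

Because each render is independent, total wall-clock time is governed by the slowest single render rather than by the number of images, which is the property the paragraph above relies on.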
It took about two days to build a proof of concept, and about one week to push the changes to our production server.
If you point Lambda functions at a SQL DB and plan to execute 1,000 functions per second, you are likely to exceed your DB’s connection limit. We ran into this issue when conducting concurrent searches because our SQL DB was bombarded with thousands of requests. To stop our DB from crashing, we swapped out our SQL server for Amazon DynamoDB, a NoSQL database on AWS.
Although there are some minor drawbacks to using Lambda (a limit of 1,000 connections per second and a 60-second maximum run time), Lambda cut the time and cost of rendering images and PDFs to a fraction of what they had been.
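The connection problem can be reproduced with a toy model. In the Python sketch below, `FakeSqlServer` is a made-up stand-in for a database with a hard connection cap, and the cap of 150 is an arbitrary assumption chosen for illustration; the post does not state the actual limit.

```python
class ConnectionLimitExceeded(Exception):
    """Raised when the toy server has no free connection slots."""


class FakeSqlServer:
    """Toy stand-in for a SQL server with a fixed connection cap (hypothetical)."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self) -> None:
        if self.open_connections >= self.max_connections:
            raise ConnectionLimitExceeded("too many connections")
        self.open_connections += 1


def simulate_burst(server: FakeSqlServer, invocations: int) -> int:
    """Each invocation opens one connection and holds it; returns how many were refused."""
    refused = 0
    for _ in range(invocations):
        try:
            server.connect()
        except ConnectionLimitExceeded:
            refused += 1
    return refused


if __name__ == "__main__":
    server = FakeSqlServer(max_connections=150)  # assumed cap, not from the post
    print(simulate_burst(server, 1000), "of 1000 invocations refused")
```

Whatever the real cap is, the shape of the failure is the same: once simultaneous invocations outnumber available connections, every additional invocation is turned away.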
Eliminating Server Load
We were using PhantomJS to fetch new URLs for generating images, but it was draining our resources. To improve this, we used a Lambda function to streamline the image processing lifecycle:
- Read from the queue
- Take a screenshot
- Upload it to Amazon S3
- Call Node.js to inform the client side that the image is ready
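The four steps above can be sketched as a single handler. This is a minimal Python illustration, not Skopenow's implementation (which, per this post, runs on Node.js): the queue is an in-memory `deque`, and `take_screenshot`, `upload`, and `notify` are hypothetical callables passed in by the caller, standing in for the screenshot tool, Amazon S3, and the Node.js notification call.

```python
from collections import deque
from typing import Callable, Optional


def process_next_image(
    queue: deque,
    take_screenshot: Callable[[str], bytes],
    upload: Callable[[str, bytes], None],
    notify: Callable[[str, str], None],
) -> Optional[str]:
    """Handle one queued image job; returns the storage key, or None if the queue is empty."""
    if not queue:
        return None
    job = queue.popleft()                           # 1. read from the queue
    image_bytes = take_screenshot(job["url"])       # 2. take a screenshot
    key = f"{job['search_id']}/{job['image_id']}.png"
    upload(key, image_bytes)                        # 3. upload it to storage
    notify(job["search_id"], key)                   # 4. tell the client it is ready
    return key


if __name__ == "__main__":
    stored: dict = {}
    jobs = deque([{"search_id": "s1", "image_id": "img7", "url": "https://example.com"}])
    key = process_next_image(
        jobs,
        take_screenshot=lambda url: b"PNG:" + url.encode(),  # fake screenshot bytes
        upload=stored.__setitem__,                           # fake object store
        notify=lambda search_id, k: print("ready:", search_id, k),
    )
    print(key, len(stored))
```

Passing the three side effects in as arguments keeps the handler itself free of service-specific code, so the same sequence can be exercised locally with fakes before being wired to real services.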
Processing 1,000 images in 7 seconds within EC2 would require ~100 running instances/search and cost $3.00 for a full hour of use. Using AWS Lambda, we can now process all 1,000 images within 7 seconds. Compared to our previous stack, where we were processing 500 images within 60 seconds, we now generate our images nearly 8x faster, and at over a 99% reduction in cost!
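For readers who want to check the speed figures, here is a short Python sketch using only the numbers quoted above (500 images in 60 seconds before, 1,000 images in 7 seconds after, and $3.00 per hour for ~100 EC2 instances). The variable names are made up for illustration. The post gives no Lambda price, so the cost reduction itself is not recomputed here.

```python
# Numbers quoted in the post; names are hypothetical.
OLD_IMAGES, OLD_SECONDS = 500, 60
NEW_IMAGES, NEW_SECONDS = 1000, 7
EC2_INSTANCES, EC2_HOURLY_TOTAL_USD = 100, 3.00

# How much sooner a batch of images is ready (wall-clock time per search).
wall_clock_speedup = OLD_SECONDS / NEW_SECONDS

# Images per second, which also reflects that the new run handles twice as many images.
old_throughput = OLD_IMAGES / OLD_SECONDS
new_throughput = NEW_IMAGES / NEW_SECONDS
throughput_gain = new_throughput / old_throughput

# Per-instance hourly cost implied by the EC2 comparison.
ec2_cost_per_instance_hour = EC2_HOURLY_TOTAL_USD / EC2_INSTANCES

print(f"wall-clock speedup: {wall_clock_speedup:.2f}x")
print(f"throughput gain:    {throughput_gain:.2f}x")
print(f"implied EC2 cost:   ${ec2_cost_per_instance_hour:.2f} per instance-hour")
```

The wall-clock ratio of 60 to 7 seconds is the basis for the roughly 8x figure; measured per image, the gain is larger because the 7-second run also covers twice as many images.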
What’s Next: Flaggin’ & Taggin’
Our users will soon be able to use activity-based tags (for example, running, jumping, and swimming) and behavior-based tags (for example, violence, radicalism, and profanity) when searching for an individual. Users will type in a person’s name and a tag such as “Physical Injury” or “Gun.” If a relevant image or post exists, our software will flag and tag the content and add it to the report. This level of automation lets our users quickly and accurately detect an individual’s behaviors and activities.
Closing Remarks: The Aftermath
Scaling is a large issue for many companies, but with Lambda you can alleviate many of the foreseeable headaches typical of growing a web-based business. Currently, Skopenow is converting its entire search process into Lambda functions. This means that we’ll be able to query thousands of websites and databases within seconds, without a need to scale. Not only will performance increase, but our cost-per-search will be dramatically lower. This methodology can be incredibly useful to companies that want to map data without having to constantly upscale and downscale their environment. Thanks to Lambda, Skopenow is growing faster and more efficiently than we ever imagined. We can’t wait to see what AWS comes up with next.