How Rapid7 built multi-tenant analytics with Amazon Redshift using near-real-time datasets
This is a guest post co-written by Rahul Monga, Principal Software Engineer at Rapid7.
Rapid7 InsightVM is a vulnerability assessment and management product that provides visibility into the risks present across an organization. It equips you with the reporting, automation, and integrations needed to prioritize and fix those vulnerabilities in a fast and efficient manner. InsightVM has more than 5,000 customers across the globe, runs exclusively on AWS, and is available for purchase on AWS Marketplace.
To provide near-real-time insights to InsightVM customers, Rapid7 has recently undertaken a project to enhance the dashboards in their multi-tenant software as a service (SaaS) portal with metrics, trends, and aggregated statistics on vulnerability information identified in their customer assets. They chose Amazon Redshift as the data warehouse to power these dashboards due to its ability to deliver fast query performance on gigabytes to petabytes of data.
In this post, we discuss the design options that Rapid7 evaluated to build a multi-tenant data warehouse and analytics platform for InsightVM. We will deep dive into the challenges and solutions related to ingesting near-real-time datasets and how to create a scalable reporting solution that can efficiently run queries across more than 3 trillion rows. This post also discusses an option to address the scenario where a particular customer outgrows the average data access needs.
This post uses the terms customers, tenants, and organizations interchangeably to represent Rapid7 InsightVM customers.
To collect data for InsightVM, customers can use scan engines or Rapid7’s Insight Agent. Scan engines allow you to collect vulnerability data on every asset connected to a network. This data is only collected when a scan is run. Alternatively, you can install the Insight Agent on individual assets to collect and send asset change information to InsightVM numerous times each day. The agent also ensures that asset data is sent to InsightVM regardless of whether or not the asset is connected to your network.
Data from scans and agents is sent in the form of packed documents, in micro-batches of hundreds of events. Around 500 documents per second are received across customers, and each document is around 2 MB in size. On a typical day, InsightVM processes 2–3 trillion rows of vulnerability data, which translates to around 56 GB of compressed data for a large customer. This data is normalized and processed by InsightVM’s vulnerability management engine and streamed to the data warehouse system for near-real-time availability of data for analytical insights to customers.
In this section, we discuss the overall architectural setup for the InsightVM system.
Scan engines and agents collect and send asset information to the InsightVM cloud. Asset data is pooled, normalized, and processed to identify vulnerabilities. This is stored in an Amazon ElastiCache for Redis cluster and also pushed to Amazon Kinesis Data Firehouse for use in near-real time by InsightVM’s analytics dashboards. Kinesis Data Firehose delivers raw asset data to an Amazon Simple Storage Service (Amazon S3) bucket. The data is transformed using a custom developed ingestor service and stored in a new S3 bucket. The transformed data is then loaded into the Redshift data warehouse. Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), and AWS Lambda are used to orchestrate this data flow. In addition, to identify the latest timestamp of vulnerability data for assets, an auxiliary table is maintained and updated periodically with the update logic in the Lambda function, which is triggered through an Amazon CloudWatch event rule. Custom-built middleware components interface between the web user interface (UI) and the Amazon Redshift cluster to fetch asset information for display in dashboards.
The following diagram shows the implementation architecture of InsightVM, including the data warehouse system:
The architecture has built-in tenant isolation because data access is abstracted through the API. The application uses a dimensional model to support low-latency queries and extensibility for future enhancements.
Amazon Redshift data warehouse design: Options evaluated and selection
Considering Rapid7’s need for near-real-time analytics at any scale, the InsightVM data warehouse system is designed to meet the following requirements:
- Ability to view asset vulnerability data at near-real time, within 5–10 minutes of ingest
- Less than 5 seconds’ latency when measured at 95 percentiles (p95) for reporting queries
- Ability to support 15 concurrent queries per second, with the option to support more in the future
- Simple and easy-to-manage data warehouse infrastructure
- Data isolation for each customer or tenant
Rapid7 evaluated Amazon Redshift RA3 instances to support these requirements. When designing the Amazon Redshift schema to support these goals, they evaluated the following strategies:
- Bridge model – Storage and access to data for each tenant is controlled at the individual schema level in the same database. In this approach, multiple schemas are set up, where each schema is associated with a tenant, with the same exact structure of the dimensional model.
- Pool model – Data is stored in a single database schema for all tenants, and a new column (
tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using API-level access to the tables. Tenants aren’t aware of the underlying implementation of the analytical system and can’t query them directly.
For more information about multi-tenant models, see Implementing multi-tenant patterns in Amazon Redshift using data sharing.
Initially when evaluating the bridge model, it provided an advantage for tenant-only data for queries, plus the ability to decouple a tenant to an independent cluster if they outgrow the resources that are available in the single cluster. Also, when the p95 metrics were evaluated in this setup, the query response times were less than 5 seconds, because each tenant data is isolated into smaller tables. However, the major concern with this approach was with the near-real-time data ingestion into over 50,000 tables (5,000 customer schemas x approximately 10 tables per schema) every 5 minutes. Having thousands of commits every minute into an online analytical processing (OLAP) system like Amazon Redshift can lead to most resources being exhausted in the ingestion process. As a result, the application suffers query latencies as data grows.
The pool model provides a simpler setup, but the concern was with query latencies when multiple tenants access the application from the same tables. Rapid7 hoped that these concerns would be addressed by using Amazon Redshift’s support for massively parallel processing (MPP) to enable fast execution of most complex queries operating on large amounts of data. With the right table design using the right sort and distribution keys, it’s possible to optimize the setup. Furthermore, with automatic table optimization, the Amazon Redshift cluster can automatically make these determinations without any manual input.
Rapid7 evaluated both the pool and bridge model designs, and decided to implement the pool model. This model provides simplified data ingestion and can support query latencies of under 5 seconds at p95 with the right table design. The following table summarizes the results of p95 tests conducted with the pool model setup.
|Large customer: Query with multiple joins, which list assets, their vulnerabilities, and all their related attributes, with aggregated metrics for each asset, and filters to scope assets by attributes like location, names, and addresses||Less than 4 seconds|
|Large customer: Query to return vulnerability content information given a list of vulnerability identifiers||Less than 4 seconds|
Tenet isolation and security
Tenant isolation is fundamental to the design and development of SaaS systems. It enables SaaS providers to reassure customers that, even in a multi-tenant environment, their resources can’t be accessed by other tenants.
With the Amazon Redshift table design using the pool model, Rapid7 built a separate data access layer in the middleware that templatized queries, augmented with runtime parameter substitution to uniquely filter specific tenant and organization data.
The following is a sample of templatized query:
The following is a Java interface snippet to populate the template:
Every query uses
organization_id and additional parameters to uniquely access tenant data. During runtime,
organization_id and other metadata are extracted from the secured JWT token that is passed to middleware components after the user is authenticated in the Rapid7 cloud platform.
Best practices and lessons learned
To fully realize the benefits of the Amazon Redshift architecture and design for the multiple tenants & near real-time ingestion, considerations on the table design allow you to take full advantage of the massively parallel processing and columnar data storage. In this section, we discuss the best practices and lessons learned from building this solution.
Sort key for effective data pruning
Sorting a table on an appropriate sort key can accelerate query performance, especially queries with range-restricted predicates, by requiring fewer table blocks to be read from disk. To have Amazon Redshift choose the appropriate sort order, the AUTO option was utilized. Automatic table optimization continuously observes how queries interact with tables and discovers the right sort key for the table. To effectively prune the data by the tenant,
organization_id is identified as the sort key to perform the restricted scans. Furthermore, because all queries are routed through the data access layer,
organization_id is automatically added in the predicate conditions to ensure effective use of the sort keys.
Micro-batches for data ingestion
Amazon Redshift is designed for large data ingestion, rather than transaction processing. The cost of commits is relatively high, and excessive use of commits can result in queries waiting for access to the commit queue. Data is micro-batched during ingestion as it arrives for multiple organizations. This results in fewer transactions and commits when ingesting the data.
Load data in bulk
If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load, and this type of load is much slower.
The Amazon Redshift manifest file is used to ingest the datasets that span multiple files in a single COPY command, which allows fast ingestion of data in each micro-batch.
RA3 instances for data sharing
Rapid 7 uses Amazon Redshift RA3 instances, which enable data sharing to allow you to securely and easily share live data across Amazon Redshift clusters for reads. In this multi-tenant architecture when a tenant outgrows the average data access needs, it can be isolated to a separate cluster easily and independently scaled using the data sharing. This is accomplished by monitoring the STL_SCAN table to identify different tenants and isolate them to allow for independent scalability as needed.
Concurrency scaling for consistently fast query performance
When concurrency scaling is enabled, Amazon Redshift automatically adds additional cluster capacity when you need it to process an increase in concurrent read queries. To meet the uptick in user requests, the concurrency scaling feature is enabled to dynamically bring up additional capacity to provide consistent p95 values that meet Rapid7’s defined requirements for the InsightVM application.
Results and benefits
Rapid7 saw the following results from this architecture:
- The new architecture has reduced the time required to make data accessible to customers to less than 5 minutes on average. The previous architecture had higher level of processing time variance, and could sometimes exceed 45 minutes
- Dashboards load faster and have enhanced drill-down functionality, improving the end-user experience
- With all data in a single warehouse, InsightVM has a single source of truth, compared to the previous solution where InsightVM had copies of data maintained in different databases and domains, which could occasionally get out of sync
- The new architecture lowers InsightVM’s reporting infrastructure cost by almost three times, as compared to the previous architecture
With Amazon Redshift, the Rapid7 team has been able to centralize asset and vulnerability information for InsightVM customers. The team has simultaneously met its performance and management objectives with the use of a multi-tenant pool model and optimized table design. In addition, data ingestion via Kinesis Data Firehose and custom-built microservices to load data into Amazon Redshift in near-real time enabled Rapid7 to deliver asset vulnerability information to customers more than nine times faster than before, improving the InsightVM customer experience.
About the Authors
Rahul Monga is a Principal Software Engineer at Rapid7, currently working on the next iteration of InsightVM. Rahul’s focus areas are highly distributed cloud architectures and big data processing. Originally from the Washington DC area, Rahul now resides in Austin, TX with his wife, daughter, and adopted pup.
Sujatha Kuppuraju is a Senior Solutions Architect at Amazon Web Services (AWS). She works with ISV customers to help design secured, scalable and well-architected solutions on the AWS Cloud. She is passionate about solving complex business problems with the ever-growing capabilities of technology.
Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.