Actuate uses AWS Fargate for ML-based, real-time video monitoring and threat detection
This post was written in collaboration with Scott Underwood, Jacob Weiss, Tatiana Hanazaki, and Mark Berbera from Actuate AI.
The goal at Actuate AI is to leverage technology to make the world a safer place. Our team uses cutting-edge computer vision to provide real-time, AI-verified threat detection that reduces the response time of emergency services and law enforcement. Our customers include schools, apartment complexes, and building management companies who depend on our AI-based video monitoring to detect threats such as guns and other weapons, theft, and assault. They rely on the real-time insights we provide to keep their physical premises safe, so it is critical that we maintain a robust, secure, and reliable technology stack.
Actuate’s flagship product is a cloud-based computer vision platform that ingests video feeds from remote, internet-accessible cameras to detect security threats in real time. A key component of the application connects to these cameras, decodes the video streams into JPEG images, and then sends those images to our AI models for instant analysis. With customers located in different parts of the world, we need to optimize resources without compromising the quality of service to our end customers. Data throughput varies drastically over the course of any given day, and different sites have vastly different data ingestion needs. Balancing the cost of resources against the need to continuously monitor each location is essential for Actuate to provide a reliable service while keeping costs under control.
In the first iteration of our product, we kept things as simple as possible by deploying our application on Amazon Elastic Compute Cloud (Amazon EC2) instances provisioned for each customer. These EC2 instances ran 24/7, and capacity was sized to match the peak resource requirements of each location or customer because we could not afford to degrade the quality of our service. For example, if a customer sent 1 TB of traffic across 1,000 cameras at night to an m5.24xlarge (the largest instance in the m5 family, with 96 vCPUs, 384 GiB of memory, and 25 Gbps of network bandwidth), that large EC2 instance would still be up and running the following day, even though cameras rarely stream data during the day. This led to significant cost overruns and waste, and to resource exhaustion when a process saw an unexpected surge in demand. For example, when deployed on the same virtual machine (VM), the process responsible for monitoring a school during a high-traffic hour might consume most or all of the VM's memory, starving the process responsible for monitoring a parking lot elsewhere. We didn’t want the unpredictability of our resource requirements to affect the quality of service we delivered to our customers. So, we looked for a solution that met the needs of our customers while also allowing us to optimize cost and avoid paying for cloud resources we were not using.
Overview of solution
As we evaluated our needs, we came up with the following requirements:
- One of our primary requirements was to be able to size the resources we consume for each site according to the data throughput we expected out of the site for a particular monitoring cycle. Our customers define a grouping of surveillance cameras as a single site, with a schedule per site to turn on the Actuate threat detection software and analyze their video feeds. Some sites are groupings of hundreds of cameras, while others have just one or two cameras. So, it was critical to be able to size the resources for each site independently.
- The solution must be able to scale both memory and CPU independently, along with the ability to deploy on different architectures (ARM/x86) depending upon the underlying video technology stacks of our clients.
- The solution must give us the opportunity to improve on our continuous integration and deployment pipeline and make development and delivery easier to manage.
- Finally, the solution should allow for various networking configuration options, as our deployments vary in their network architecture.
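To make the first two requirements concrete, here is a minimal sketch of per-site sizing. It assumes each camera needs a fixed amount of CPU and memory; the per-camera figures and the `size_for_site` function are hypothetical and do not come from Actuate's system. The table of CPU and memory combinations reflects the Fargate task sizes for Linux documented by AWS at the time of writing, and should be checked against the current Amazon ECS documentation.

```python
from typing import List, Tuple

GIB = 1024  # MiB per GiB


def fargate_sizes() -> List[Tuple[int, int]]:
    """Return (cpu_units, memory_mib) pairs for Fargate tasks, smallest first."""
    memory_by_cpu = {
        256: [512, 1 * GIB, 2 * GIB],
        512: [g * GIB for g in range(1, 5)],         # 1-4 GiB
        1024: [g * GIB for g in range(2, 9)],        # 2-8 GiB
        2048: [g * GIB for g in range(4, 17)],       # 4-16 GiB
        4096: [g * GIB for g in range(8, 31)],       # 8-30 GiB
        8192: [g * GIB for g in range(16, 61, 4)],   # 16-60 GiB, 4 GiB steps
        16384: [g * GIB for g in range(32, 121, 8)], # 32-120 GiB, 8 GiB steps
    }
    return sorted((cpu, mem) for cpu, mems in memory_by_cpu.items() for mem in mems)


def size_for_site(
    camera_count: int,
    cpu_units_per_camera: int = 64,    # hypothetical estimate
    memory_mib_per_camera: int = 128,  # hypothetical estimate
) -> Tuple[int, int]:
    """Pick the first (smallest CPU, then smallest memory) size that fits a site."""
    need_cpu = camera_count * cpu_units_per_camera
    need_mem = camera_count * memory_mib_per_camera
    for cpu, mem in fargate_sizes():
        if cpu >= need_cpu and mem >= need_mem:
            return cpu, mem
    raise ValueError("site exceeds the largest task size; split cameras across tasks")
```

With these assumed per-camera figures, a two-camera site lands on the smallest task size, while a site with hundreds of cameras would have to be split into several tasks.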
These requirements led us to explore Amazon ECS and AWS Fargate.
We chose ECS with Fargate because we liked the serverless operating model: we wanted to avoid managing EC2 instances unless absolutely necessary. Our initial evaluation showed that our application worked well with the Fargate model.
Flexibility and scale
AWS Fargate allows us to containerize subsets of cameras and, together with our internal scheduling logic, launch tasks at specific times during the day. No application re-architecture was needed, because we were able to use the same codebase we ran on EC2. The ability to launch containers either behind a NAT Gateway in a private subnet or in a public subnet with an Internet Gateway also helped, because our customers have varying network requirements. For instance, some of our customers use Site-to-Site VPN connections, while others allow an Elastic IP or a specific IP range on the access control lists in their firewall. Supporting both configurations gives us the flexibility to work with all of them.
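As an illustration of the two network layouts described above, the following sketch builds the arguments for the ECS `RunTask` API for either a private or a public subnet. The function name, cluster name, and resource IDs are hypothetical placeholders; only the parameter names (`launchType`, `networkConfiguration`, `awsvpcConfiguration`, `assignPublicIp`) come from the ECS API. It is a sketch of one possible approach, not Actuate's scheduling logic.

```python
from typing import Any, Dict, List


def build_run_task_params(
    cluster: str,
    task_definition: str,
    subnets: List[str],
    security_groups: List[str],
    public_subnet: bool,
) -> Dict[str, Any]:
    """Build keyword arguments for an ECS RunTask call on Fargate."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                # Tasks in a public subnet need a public IP to reach cameras
                # through the Internet Gateway; tasks in a private subnet go
                # out through a NAT Gateway or VPN instead.
                "assignPublicIp": "ENABLED" if public_subnet else "DISABLED",
            }
        },
    }


# Hypothetical identifiers, for illustration only.
private_site = build_run_task_params(
    cluster="camera-cluster",
    task_definition="site-worker",
    subnets=["subnet-private-example"],
    security_groups=["sg-example"],
    public_subnet=False,
)
# With boto3 installed and credentials configured, a scheduler could then call:
#   boto3.client("ecs").run_task(**private_site)
```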
Amazon ECS also works with Amazon Elastic Container Registry (Amazon ECR) for storing Docker images, and ECR integrates seamlessly with GitHub Actions, which we use to build out our CI/CD pipeline. The pipeline builds x86 and ARM64 Python images, logs in to ECR, uploads each image to its respective repository, and sends notifications on the status of these builds to our internal Slack. Because our customer sites always use the latest build on start-up, the pipeline ensures the latest build is in use.
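The sketch below shows how per-architecture images in ECR could map to Fargate task definitions. The repository names, account ID, role ARN, container name, and the use of a `latest` tag are all assumptions made for illustration; the dictionary keys follow the ECS `RegisterTaskDefinition` API, where `runtimePlatform.cpuArchitecture` selects `X86_64` or `ARM64`.

```python
from typing import Any, Dict


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Format an ECR image URI; the 'latest' tag is an assumed convention."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_task_definition(
    family: str,
    image: str,
    architecture: str,
    execution_role_arn: str,
    cpu: str = "256",
    memory: str = "512",
) -> Dict[str, Any]:
    """Build keyword arguments for an ECS RegisterTaskDefinition call."""
    if architecture not in ("X86_64", "ARM64"):
        raise ValueError(f"unsupported architecture: {architecture}")
    return {
        "family": family,
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # required for Fargate tasks
        "cpu": cpu,
        "memory": memory,
        "executionRoleArn": execution_role_arn,  # lets ECS pull the image from ECR
        "runtimePlatform": {
            "cpuArchitecture": architecture,
            "operatingSystemFamily": "LINUX",
        },
        "containerDefinitions": [
            {"name": "camera-worker", "image": image, "essential": True}
        ],
    }


# Hypothetical account, region, repository, and role, for illustration only.
arm_image = ecr_image_uri("123456789012", "us-east-1", "camera-worker-arm64")
arm_task = build_task_definition(
    family="site-worker-arm64",
    image=arm_image,
    architecture="ARM64",
    execution_role_arn="arn:aws:iam::123456789012:role/example-execution-role",
)
# With boto3 available: boto3.client("ecs").register_task_definition(**arm_task)
```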
Improved monitoring and troubleshooting
AWS Fargate gives us a more stable and cost-efficient way to connect to customer cameras. Because it manages the underlying cluster resources, the system-level issues we used to face have gone away. We also use monitoring tools such as Amazon CloudWatch Container Insights to quickly identify which sites are overusing resources and to fix connections that waste bandwidth.
There are a few areas for improvement in the context of our specific use case. Some of our customer cameras rarely have network traffic, which means some of our containers are underutilized even though we use the smallest task size available. We could resolve this by managing our own EC2 cluster with ECS and using soft and hard limits to put multiple customers on the same EC2 instance, but we would then lose the serverless infrastructure management that AWS Fargate offers. In the future, we hope that AWS Fargate adds the ability to put multiple tasks on the same machine. Alternatives like AWS Lambda will not work for our use case because we typically connect to cameras via the Real Time Streaming Protocol (RTSP), which requires a long-lived TCP connection initiated from our VPC to the customer.
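To show why a long-lived, outbound connection is needed, here is a minimal sketch of the first step of an RTSP session: opening a TCP socket to a camera and sending an `OPTIONS` request. The helper names and the example URL are hypothetical, and a real client would go on to issue `DESCRIBE`, `SETUP`, and `PLAY` and keep the socket open for as long as the stream runs.

```python
import socket


def build_rtsp_options_request(url: str, cseq: int = 1) -> bytes:
    """Build an RTSP/1.0 OPTIONS request for the given stream URL."""
    return f"OPTIONS {url} RTSP/1.0\r\nCSeq: {cseq}\r\n\r\n".encode("ascii")


def open_camera_connection(host: str, port: int = 554, timeout: float = 10.0) -> socket.socket:
    """Open an outbound TCP connection to a camera (554 is the default RTSP port)."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # Ask the OS to send keepalive probes, since the session may stay open for hours.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


# Hypothetical usage against a reachable camera:
#   sock = open_camera_connection("camera.example.com")
#   sock.sendall(build_rtsp_options_request("rtsp://camera.example.com/stream"))
#   reply = sock.recv(4096)  # the connection stays open for the rest of the session
```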
We also have customers who only allow access from a specific Elastic IP (EIP) in their firewall. One solution would be to put all tasks for that customer behind a NAT Gateway with the EIP, but NAT Gateways can be costly for large data transfers. The ability to launch multiple tasks with AWS Fargate on the same machine would allow us to allocate the EIP to that one machine in the public subnet, and sidestep the cost of the NAT Gateway.
In the future, we will add an internal service that finds overutilized or underutilized tasks and dynamically reallocates vCPU and memory. Our code base is ARM compatible, and we’ve started converting our containers on ECS from x86 to ARM. Based on initial experiments, we expect a cost reduction of more than 50% compared to our previous architecture.
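The first step of such a reallocation service could look like the sketch below, which labels tasks from their recent CPU utilization samples. The thresholds, task IDs, and function name are hypothetical; in practice the samples might come from Container Insights metrics, and the same check would be repeated for memory.

```python
from statistics import mean
from typing import Dict, List


def classify_tasks(
    cpu_utilization: Dict[str, List[float]],
    low: float = 20.0,   # hypothetical threshold, percent
    high: float = 80.0,  # hypothetical threshold, percent
) -> Dict[str, str]:
    """Label each task by its average CPU utilization over the sampled window."""
    labels: Dict[str, str] = {}
    for task_id, samples in cpu_utilization.items():
        if not samples:
            labels[task_id] = "unknown"  # no data, so make no sizing decision
            continue
        average = mean(samples)
        if average > high:
            labels[task_id] = "overutilized"
        elif average < low:
            labels[task_id] = "underutilized"
        else:
            labels[task_id] = "ok"
    return labels


# Hypothetical utilization samples, in percent.
labels = classify_tasks(
    {"site-a": [90.0, 95.0], "site-b": [5.0, 10.0], "site-c": [50.0], "site-d": []}
)
```

Tasks labeled overutilized or underutilized would then be candidates for relaunching at the next larger or smaller task size.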
We were looking for a more cost-effective, efficient, and reliable way to run the infrastructure behind our ML-based, real-time threat detection service. With AWS Fargate, we met our business requirements and our commitments to customers while reducing costs and improving operational efficiency.