AWS Storage Blog

How Visual Layer builds high-quality datasets on Amazon S3

Companies across industries use data to help their Artificial Intelligence (AI) and Machine Learning (ML) systems make intelligent decisions. For ML systems to work well, the massive datasets used for training ML models must be of the highest quality, with minimal noise that can degrade model performance. Processing internet-scale datasets to optimize them for ML workloads requires a data storage solution with tremendous scalability and high throughput. At these volumes, the costs associated with storing and processing data are also central to an efficient and scalable solution.

Visual Layer, a Tel Aviv-based startup, helps its customers gain valuable insights into their data while identifying and resolving data quality issues. To provide customers with clean, high-quality datasets, Visual Layer needs a data storage solution that is cost-efficient and scalable enough to store massive amounts of data for processing, with high throughput to accelerate processing and further optimize costs. By building on and integrating with Amazon S3, an object storage service that offers industry-leading scalability, data availability, security, and performance, Visual Layer gets virtually unlimited scalability and high throughput in a solution built to process massive volumes of data and deliver clean, high-quality datasets to its customers.

In this post, we discuss how Visual Layer uses Amazon S3 and several other AWS services, such as Amazon EC2, Amazon EKS, and AWS Step Functions, to process internet-scale datasets. Building on Amazon S3, Visual Layer delivers high-quality datasets to its customers quickly and cost-effectively at scale. The scalability and throughput of Amazon S3, along with its seamless integration with other AWS services, results in 50% lower costs for Visual Layer's customers, and the high-quality datasets produced help those customers speed up development of their computer vision pipelines by up to 5x.

ML-powered solution for efficient ML training

Effective ML needs clean, accessible data. To solve this, Visual Layer built fastdup, a tool that uses a purpose-built graph engine to automatically detect issues such as corrupted images, duplicates, wrong labels, and outliers in visual datasets. The solution uses unsupervised ML and works by indexing visual data into short feature vectors. It then constructs a nearest neighbor model to find similar pairs of images in the dataset, and uses graph analytics to gain insights from those relationships: community detection algorithms group similar images together into a graph structure, and connected clusters of images are identified, effectively organizing the dataset by similarity, as sketched below. Fastdup suggests correction steps and removes issues like duplicates and outliers, which results in a cleaned unstructured dataset. Ultimately, the cleaned dataset leads to more efficient training, more robust models, and lower computational costs.
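
The following is a minimal conceptual sketch of that pipeline (feature vectors, nearest neighbors, similarity graph, connected components), not Visual Layer's actual implementation; the function name, parameters, and thresholds are illustrative assumptions.

```python
# Illustrative sketch of the pipeline described above (not fastdup's actual
# code): embed images into short feature vectors, find nearest-neighbor
# pairs, build a similarity graph, and extract connected components as
# candidate clusters of near-duplicate images.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_similar_images(embeddings: np.ndarray, k: int = 5,
                           max_distance: float = 0.1) -> np.ndarray:
    """Group images whose feature vectors lie within max_distance (cosine)."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(embeddings)
    distances, indices = nn.kneighbors(embeddings)

    # Keep only sufficiently close pairs as edges of the similarity graph.
    rows, cols = [], []
    for i, (dists, nbrs) in enumerate(zip(distances, indices)):
        for d, j in zip(dists, nbrs):
            if i != j and d <= max_distance:
                rows.append(i)
                cols.append(j)

    n = len(embeddings)
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # Connected components of the graph are clusters of similar images;
    # singleton components are likely unique images.
    _, labels = connected_components(graph, directed=False)
    return labels

# Example: 1,000 random 64-dimensional "feature vectors".
labels = cluster_similar_images(np.random.rand(1000, 64))
```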

Visual Layer released a web platform, VL Profiler, that uses the same technology as fastdup and is built on AWS. Users import and process visual datasets from Amazon S3, as shown in Figures 1 and 2.

Figure 1: Example dataset with 32M images (8.5 TB on S3) imported to Visual Layer’s platform.

As the persistent storage layer for ML training data, Amazon S3 feeds Visual Layer’s platform, which processes the unstructured image and video data before model training.

Figure 2: Visual Layer improved the quality of a computer vision dataset by 40% after removing 12M duplicates, 8.3M outliers, and 6M mislabeled images.

Amazon S3 is critical for both Visual Layer and its customers, who have hundreds of terabytes of visual data coming in every day. With Amazon S3 as the staging point for unstructured visual datasets, Visual Layer can store virtually unlimited data and cost-effectively scale to support increased workloads without delays. Visual Layer also released a collection of high-quality computer vision datasets that it hosted and processed using Amazon S3.

With the curated training data already on Amazon S3, you can use a range of ML frameworks to seamlessly interact with this data (a minimal example of reading it from S3 follows the quote below). Amazon S3 makes it easy to work with powerful GPUs, such as the EC2 P3, P4, and P5 instances that are optimized for ML training, and with Amazon SageMaker for users who want a managed solution to build end-to-end ML pipelines on AWS. Lightricks, a company that develops video and image editing mobile apps, is using Visual Layer’s solutions along with Amazon S3 for its generative AI use cases:

“Visual Layer’s fastdup on top of Amazon S3 enables significant improvements in the quality of internet-scale datasets that are needed to train our Generative AI foundation models.”

– Yoav HaCohen, PhD, manager of the Core Generative AI Team at Lightricks
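
As one hedged sketch of how a framework can consume curated data directly from S3, the following wraps S3 objects in a PyTorch Dataset; the bucket and prefix names are hypothetical, and production pipelines would typically add batching, transforms, and retries.

```python
# A minimal sketch of streaming curated images from Amazon S3 into a
# PyTorch training workflow; bucket and prefix names are hypothetical.
import io
import boto3
from PIL import Image
from torch.utils.data import Dataset

class S3ImageDataset(Dataset):
    def __init__(self, bucket: str, prefix: str):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        # List object keys under the prefix once, up front.
        paginator = self.s3.get_paginator("list_objects_v2")
        self.keys = [obj["Key"]
                     for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
                     for obj in page.get("Contents", [])]

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # Fetch one object per sample; S3 sustains very high request rates.
        body = self.s3.get_object(Bucket=self.bucket, Key=self.keys[idx])["Body"]
        return Image.open(io.BytesIO(body.read())).convert("RGB")

dataset = S3ImageDataset("my-curated-datasets", "cleaned/train/")  # hypothetical
```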

Building and scaling with AWS

Visual Layer built its web platform, VL Profiler, on AWS. The tech stack includes Amazon Elastic Compute Cloud (Amazon EC2) for compute, Amazon Elastic Kubernetes Service (Amazon EKS) for orchestrating containerized workloads at scale, and Amazon S3 to store the visual training data. Visual Layer processes the image and video data from Amazon S3 using AWS Step Functions, with steps running as either AWS Lambda functions or Amazon EKS jobs. It shares data between processing modules over Amazon EFS, uses Amazon CloudWatch for observability and monitoring, and also uses Lambda to run periodic production sanity and health tests. Structured data needed to run the application is stored in Amazon RDS. Visual Layer serves in-product images from Amazon S3 through Amazon CloudFront and uses Amazon Route 53 as a DNS service, as shown in Figure 3.

Figure 3: An infrastructure overview of Visual Layer’s data-processing solution on AWS
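
To make the orchestration pattern concrete, here is a hedged sketch of starting such a dataset-processing run with AWS Step Functions via boto3; the state machine ARN, bucket, and input shape are hypothetical, not Visual Layer's actual interface.

```python
# Hypothetical sketch of kicking off a dataset-processing workflow with
# AWS Step Functions, where each step runs as a Lambda function or an
# Amazon EKS job that reads its input to locate the dataset in Amazon S3.
import json
import boto3

sfn = boto3.client("stepfunctions")

response = sfn.start_execution(
    stateMachineArn=(
        "arn:aws:states:us-east-1:123456789012:"
        "stateMachine:dataset-processing"  # hypothetical state machine
    ),
    input=json.dumps({
        "bucket": "my-visual-datasets",    # hypothetical bucket
        "prefix": "raw/images/",
        "outputPrefix": "processed/images/",
    }),
)
print("Started execution:", response["executionArn"])
```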

Amazon S3’s simple APIs allow data to be easily accessed and shared across AWS services, facilitating seamless integration between Amazon S3, AWS Lambda, Amazon EKS, and other AWS offerings. The flexibility of AWS has helped Visual Layer quickly experiment and iterate on new solutions and accelerate its innovation cycles. In just a few months, the team added visualization, search capabilities, and various dashboarding elements to the platform. This has unlocked new use cases for data analysts, who can gain interesting insights from the cleaned data.

Scalable, cost-efficient, and high-performance data preparation for ML training

Over the past year, Visual Layer has processed over 50 billion images. The flexibility of AWS allowed them to scale their storage and compute resources up and down based on demand. With Amazon S3 as the destination for these massive image and video datasets, users pay only for what they use, which means that they are charged according to the actual dataset size.

Recently, Visual Layer processed a dataset of 1 billion images that was stored on Amazon S3 and used for training generative models, while managing compute costs effectively. Using a compute-intensive EC2 instance, they built a complete model in 24 hours, surfacing quality issues such as duplications, corruptions, and outliers, at a total compute cost of just $200. Additionally, Visual Layer reduced its overall storage costs by 50% by offloading less frequently accessed research datasets and application logs from Amazon S3 Standard to the Amazon S3 Glacier Instant Retrieval storage class, as sketched below.
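
A transition like that is typically automated with an S3 Lifecycle rule. The following is a minimal sketch of such a rule; the bucket name, prefix, and 90-day threshold are assumptions for illustration, not Visual Layer's actual configuration.

```python
# A sketch of an S3 Lifecycle rule that moves less frequently accessed
# objects from S3 Standard to S3 Glacier Instant Retrieval; bucket name,
# prefix, and the 90-day threshold are hypothetical examples.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-visual-datasets",  # hypothetical
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-research-data",
                "Filter": {"Prefix": "research/"},  # one rule per prefix
                "Status": "Enabled",
                # Transition objects once they are no longer hot.
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
            },
        ]
    },
)
```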

The team has also benefited from more reliable infrastructure performance by deploying on Amazon EKS. Using Amazon EKS, they scaled to run more than 500 virtual CPUs concurrently on tens of EC2 instances across two clusters. With greater flexibility to configure containers, they can more easily use and benefit from other AWS services, and they can scale to thousands of transactions per second when reading and writing data to Amazon S3 (a sketch of fanning out parallel reads follows below). Additionally, Auto Scaling groups help them achieve elasticity in compute and meet peak compute demand.
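
One common way to drive request rates like that from a single worker is to fan S3 reads out across a thread pool, as in this minimal sketch; the bucket and key names are hypothetical.

```python
# A minimal sketch of issuing many parallel S3 GETs from one process;
# bucket and key names are hypothetical. S3 scales request throughput
# per prefix, so spreading keys across prefixes helps at high rates.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "my-visual-datasets"  # hypothetical

def fetch(key: str) -> bytes:
    # Each worker thread issues independent GET requests.
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

keys = [f"processed/images/{i:08d}.jpg" for i in range(1024)]  # hypothetical
with ThreadPoolExecutor(max_workers=64) as pool:
    payloads = list(pool.map(fetch, keys))
```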

Conclusion

High-quality, accessible data is essential for effective ML. Amazon S3’s virtually unlimited scale makes it ideal for storing and processing large unstructured datasets. Visual Layer uses AWS to build tools that analyze tens of millions of images and automatically find and correct issues (such as missing labels, outliers, duplicates, and test/train leaks) within these datasets. With the cleaned dataset already on Amazon S3, you can carry out efficient ML training and create robust models.

To learn more about ML with Amazon S3 and Visual Layer, check out the resources linked below.

Learn more about ML with Amazon S3

Learn about Visual Layer

Danny Bickson

Dr. Danny Bickson is CEO and co-founder of Visual Layer. Prior to that, he was co-founder and VP EMEA of Turi.

Amir Alush

Dr. Amir Alush is CTO and co-founder of Visual Layer.

Dhanika Sujan

Dhanika Sujan is a Senior Technical Product Manager on the Amazon S3 team.

Dima Frid

Dima Frid leads engineering at Visual Layer.