
Efficient image and model caching strategies for AI/ML and generative AI workloads on Amazon EKS

When organizations deploy generative AI and machine learning (ML) workloads on Amazon Elastic Kubernetes Service (Amazon EKS), implementing efficient caching strategies becomes crucial for both performance and cost optimization. Storage and caching play major roles throughout the lifecycle of any AI, ML, or generative AI workload on Amazon EKS. This includes strategies for container image caching, as well as storage and caching for AI/ML models during training and inference, including model checkpointing and tuning. Container image caching reduces image pull latency, which in turn reduces the time it takes for workloads to start processing data. The storage options used for model caching, checkpointing, and container image caching influence the cost and performance of AI/ML workloads.

Container image caching options include using data volumes on Bottlerocket AMIs and secondary Amazon Elastic Block Store (Amazon EBS) volumes on Amazon EKS optimized Amazon Linux 2023 (AL2023) AMIs. Both of these options deliver significant reductions in container image pull time. Bottlerocket provides a smaller resource footprint and shorter boot times than other Linux distributions, which helps to reduce costs by using less storage, compute, and networking resources. To optimize Amazon EBS performance for container workloads, use Amazon EBS-optimized instance types that provide dedicated bandwidth to Amazon EBS, so that gp3 volumes can deliver at least 90% of their provisioned IOPS performance 99% of the time (see the Amazon EBS documentation). This approach removes the complexity of custom AMI builds while providing predictable storage performance for organizations handling extensive AI/ML container images.

For AI/ML workloads, storage performance must keep pace with compute performance to avoid underusing GPU-based compute resources, as shown in the following figure. Throughput and performance bottlenecks lead to longer training times and increased costs. Several factors influence data loading performance, such as dataset size and file count, individual file sizes and types, and data ingestion and access patterns. Latency is particularly important to consider, because small objects and files can have a huge impact on performance. When implementing distributed training workloads, you must consider data access patterns and storage solutions. Amazon FSx for Lustre or Amazon S3 Express One Zone can be used for low-latency data access, while proper networking configurations can improve performance among training nodes. Techniques such as prompt caching and batch processing can significantly reduce operational costs while maintaining performance standards.

Figure 1: Various flexible storage options for AI/ML and generative AI workloads.

This post looks at various options for container image caching and for model training and inference workloads. It also discusses storage options such as Amazon Simple Storage Service (Amazon S3), Amazon FSx for Lustre, S3 Express One Zone, and the Amazon S3 Connector for PyTorch. Check our guidance on the AI workloads page for detailed Amazon EKS best practices, solving cold start challenges for AI/ML inference, dynamic resource allocation on Amazon EKS, networking for AI, and observability.

The role of storage in AI/ML

Storage plays several critical roles in supporting ML operations and pipelines. For ML workloads, storage systems need to meet three key requirements:

  1. Independence from pod lifecycle is a critical requirement where storage must persist beyond the lifespan of individual pods. This independence means that valuable ML training data, model checkpoints, and configurations remain intact regardless of pod creation or termination events. Teams can scale both compute and storage resources dynamically to meet changing needs, reschedule pods across nodes, and maintain data consistency throughout the development lifecycle. This architectural approach guarantees reliable data persistence and removes operational overhead in managing ML assets across dynamic containerized environments.
  2. Availability across all pods and nodes stands as a fundamental necessity for distributed ML operations. The storage system must provide universal data access: any pod running on any node must be able to seamlessly retrieve and process the necessary data. This ubiquitous storage availability provides consistent data accessibility across the distributed infrastructure and supports efficient parallel processing across the cluster. Because storage is reachable from any compute node, teams can fully realize the distributed nature of their ML workloads. This is particularly crucial for distributed training scenarios, where multiple nodes need simultaneous access to datasets and model parameters.
  3. High availability and durability are essential to maintaining continuous ML operations. Amazon Web Services (AWS) storage services are designed with built-in redundancy to provide consistent data access and protect against data loss, generating the reliability that production ML workloads demand. This requirement becomes particularly critical in production environments where service interruptions can significantly impact business operations. Consider, for example, a credit card fraud detection system that relies on real-time model inference to protect customers from fraudulent transactions. AWS storage services deliver the high availability and durability that ML teams need to maintain consistent performance and make sure that their critical applications remain operational.

These requirements collectively form the foundation for reliable ML infrastructure, supported by more considerations such as high-speed I/O operations, seamless scalability for growing datasets, intelligent caching mechanisms, and optimized storage patterns specifically designed for ML workloads. If organizations meet these requirements, then they can build robust, efficient, and scalable ML platforms that can handle demanding workloads while maintaining data durability and availability.

Building upon the key storage requirements for ML workloads, you must understand the quantified performance characteristics of available storage options and their measurable impact on ML pipeline efficiency. Different storage solutions serve distinct purposes within the ML lifecycle, with each one optimized for specific performance and cost requirements:

  • Amazon S3 provides cost-effective storage with proven scalability, supporting up to 5,500 GET requests per second per prefix and 100-200 ms latency for typical workloads.
  • For applications needing low-latency object storage, S3 Express One Zone delivers consistent, single-digit millisecond first-byte data access for your latency-sensitive applications. S3 Express One Zone also supports transactions per second (TPS) at rates up to ten times higher than Amazon S3 Standard.
  • When dealing with high-performance computing (HPC) requirements, FSx for Lustre becomes valuable. FSx for Lustre file systems scale to multiple terabytes per second of throughput and millions of IOPS, with the highest-performance PERSISTENT-1000 deployment type providing up to 1,000 MBps per TiB of storage capacity for disk throughput and 2,600 MBps per TiB of storage capacity for network throughput (for cached data). The FSx for Lustre Amazon S3 data repository associations enable faster data access for frequently used files, while using lazy loading to populate the persistent file system only when files are first requested. When a file is loaded from Amazon S3 into the FSx for Lustre cache on first access (lazy loading), subsequent reads are served directly from the high-performance file system cache, delivering sub-millisecond latency. This approach is beneficial for iterative ML processes that repeatedly access the same datasets.
  • For shared access patterns, Amazon Elastic File System (Amazon EFS) supports elastic throughput with read throughput that scales automatically up to 20-60 GiBps. This makes it suitable for distributed development environments and smaller model deployments.
  • The choice of storage solution significantly impacts the performance of ML training and inference operations. In Kubernetes environments, such as Amazon EKS, persistent storage makes sure of data availability and consistency across training cycles. This is crucial for maintaining the integrity of long-running ML experiments and generating reproducible results.
  • For model storage in inference scenarios, organizations can choose between multiple options such as Amazon S3, Amazon EFS, Amazon Elastic Container Registry (Amazon ECR), and FSx for Lustre. Each of these options presents different trade-offs between latency, cost, and maintenance requirements. FSx for Lustre can serve the purpose of efficient model storage, particularly when configured with Amazon S3 data repository associations. This Amazon S3-linked FSx for Lustre file system can be used as a high-performance cache layer for models and training data stored in S3 buckets, where data accessed through FSx for Lustre delivers the lowest latency of these storage options. This architecture is especially valuable for inference workloads that need rapid model loading and frequent access to the same model artifacts, because the first access loads the model from Amazon S3 into the FSx for Lustre file system, while subsequent accesses benefit from sub-millisecond latency directly from the high-performance file system. The optimal choice depends on factors such as model size, update frequency, inference latency requirements, and access patterns. FSx for Lustre is particularly well-suited for latency-sensitive inference applications that can benefit from intelligent caching of frequently accessed models while maintaining cost-effective long-term storage in Amazon S3.
  • Storage systems play a vital role in the data preparation and ingestion phases of the ML lifecycle. Choosing the optimal storage type necessitates consideration of factors such as cost, performance, and data structure requirements. For specialized use cases, such as video-based ML pipelines, services such as Amazon Kinesis Video Streams can durably store and time-index video data from multiple sources. This provides a foundation for building custom ML processing applications that can access the stored video data frame-by-frame for real-time or batch-oriented analytics.
  • A critical consideration in ML workloads is verifying that storage performance keeps pace with compute capabilities. GPUs, which represent the majority of the cost in ML training, can be challenging to saturate with training data. Underused compute resources caused by storage bottlenecks lead to increased costs and longer training times. Elastic Fabric Adapter (EFA)-enabled FSx for Lustre file systems support the NVIDIA GPUDirect Storage (GDS) capability, which increases throughput between the GPU and the file system.
  • This perspective shifts the role of storage in ML from traditional enterprise IT considerations to an HPC paradigm. Furthermore, the focus is not solely on minimizing storage costs, but also on optimizing the total cost of compute and storage together. Organizations can choose the right storage solutions and meet the high-throughput demands of ML workloads to minimize GPU idle time, reduce overall training costs, and accelerate time to results.

The integration of appropriate storage solutions with ML pipelines is crucial for organizations to efficiently manage and process their data while maintaining scalability and performance. As ML workloads continue to grow in complexity and scale, the importance of storage in the ML infrastructure stack only increases, making it a key consideration in the design and implementation of robust ML systems.

Data loading performance

These are some of the factors that influence data loading performance:

  1. Dataset size and file count play a crucial role in data loading performance for ML workloads. The total volume of your dataset affects not only initial loading times and memory requirements, but also storage costs and data transfer fees. Large datasets, often measured in terabytes, demand different storage strategies when compared to smaller ones. Equally important is the number of files in your dataset. A high file count can significantly impact metadata operations, leading to increased overhead in both storage and retrieval processes. Small files in particular can create disproportionate metadata overhead, potentially slowing down data access. To mitigate these issues, organizations often employ optimization strategies such as file compaction or conversion to more efficient data formats, striking a balance between accessibility and performance.
  2. Individual file sizes and types have a substantial impact on data loading efficiency. File size considerations are critical. Files that are too small create metadata overhead and increase I/O operations, while excessively large files can negatively affect random access performance and memory usage. The optimal file size depends on your specific use case and storage system capabilities. Moreover, file format is another key factor. Binary formats may offer performance benefits over raw formats, while the choice between compressed and uncompressed data involves trade-offs between storage space and processing overhead. Columnar formats can provide significant performance improvements for certain types of queries when compared to row-based formats. The impact of serialization and deserialization overhead should also be considered when choosing file types, because it can affect overall data loading speed.
  3. Data ingestion patterns significantly influence the performance of ML workloads. In batch processing scenarios, the size of batches affects both memory usage and training speed. There's a delicate balance to strike between processing efficiency and memory consumption: larger batches can improve GPU usage and training throughput, but they may need more memory. On the other hand, streaming requirements introduce different considerations. Real-time processing needs differ from batch processing, needing careful configuration of buffer sizes and prefetch settings (see the data loader sketch after this list). Network bandwidth becomes a critical factor in streaming scenarios, potentially limiting the rate at which data can be ingested. Furthermore, understanding and optimizing these ingestion patterns is crucial for maintaining efficient data flow throughout the ML pipeline.
  4. Access patterns are fundamental to optimizing data loading performance. The distinction between sequential and random access is particularly important. Sequential access—which is beneficial for large, continuous reads—aligns well with certain storage systems and can significantly enhance performance for some ML workloads. Random access—while more challenging for some storage solutions—is crucial for specific training algorithms and data augmentation techniques. The ratio of read to write operations also plays a role in performance optimization. Most ML workloads are read-heavy during training, but write patterns during checkpointing and logging must also be considered. Understanding these access patterns allows for informed decisions in storage system selection and configuration, which organizations can use to tailor their infrastructure to the specific needs of their ML workflows.
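To make the batch, prefetch, and worker settings discussed above concrete, the following is a minimal PyTorch sketch of a tuned data loader. The dataset is a synthetic stand-in and the parameter values are illustrative assumptions, not recommendations; the right values depend on your GPU memory, storage throughput, and file layout.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset; in practice this would read shards
# from a mounted file system or an object store.
dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,           # larger batches improve GPU usage but consume more memory
    num_workers=8,            # parallel workers hide per-sample I/O latency
    prefetch_factor=4,        # batches prefetched per worker to smooth bursty reads
    pin_memory=True,          # pinned host memory speeds up host-to-GPU copies
    persistent_workers=True,  # avoid worker startup cost between epochs
)

for features, labels in loader:
    pass  # the training step would run here
```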

Storage IO for checkpointing

Storage I/O performance is fundamental to maintaining training efficiency in modern ML workloads, particularly because model sizes have grown exponentially and training increasingly relies on distributed architectures. As discussed in this AWS Storage post, the statistical probability of system failures increases proportionally with cluster size. For example, a 4,000-accelerator cluster experiences component-level failures on a daily or even hourly basis, which means that robust checkpointing strategies are essential for maintaining acceptable training productivity.

To mitigate these risks, organizations implement checkpointing strategies that save the model’s state to persistent storage at regular intervals. For very large models, these checkpoints can reach terabytes in size, containing model parameters, optimizer states, and training metadata. The storage system must provide sufficient write performance to handle these massive checkpoint operations efficiently—any bottleneck in storage I/O can lead to extended checkpoint writing times and costly GPU idle time. This becomes particularly critical in distributed training scenarios where checkpoint frequency needs to balance between protecting against data loss and maintaining training efficiency.

The impact of inadequate checkpoint storage performance extends beyond just data protection—it directly affects training costs and efficiency. When GPUs wait for checkpoint operations to complete, it results in underused compute resources, extended training times, and increased operational costs. Therefore, choosing appropriate storage solutions with sufficient write bandwidth and implementing efficient checkpointing strategies is crucial for maintaining both the reliability and cost-effectiveness of ML training pipelines.
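As a concrete illustration of the interval-based checkpointing described above, the following is a minimal PyTorch sketch that writes checkpoints to a shared file system mount at a fixed step interval. The mount path, model, and interval are illustrative assumptions; a real training loop would also capture learning-rate schedulers and data loader state.

```python
import os
import torch
import torch.nn as nn

CHECKPOINT_DIR = "/mnt/fsx/checkpoints"   # hypothetical shared file system mount
CHECKPOINT_EVERY_N_STEPS = 500            # balance failure protection against GPU idle time

model = nn.Linear(512, 10)                # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def save_checkpoint(step: int) -> None:
    """Persist model parameters, optimizer state, and training metadata."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    tmp_path = os.path.join(CHECKPOINT_DIR, f"step-{step}.pt.tmp")
    final_path = os.path.join(CHECKPOINT_DIR, f"step-{step}.pt")
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        tmp_path,
    )
    os.rename(tmp_path, final_path)  # atomic rename avoids exposing partially written files

for step in range(1, 2_001):
    # ... forward pass, backward pass, and optimizer.step() would run here ...
    if step % CHECKPOINT_EVERY_N_STEPS == 0:
        save_checkpoint(step)
```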

Container image caching options

This section outlines three container image caching options: data volumes for Bottlerocket, secondary EBS volumes on AL2023, and NVMe with RAID0 for kubelet and containerd.

Data volumes for Bottlerocket

Bottlerocket’s unique dual-volume architecture provides an elegant solution for container image caching in ML workloads. The OS volume, dedicated to storing operating system data and boot images, provides system consistency and reliability by booting from identical OS images every time. This immutable approach significantly reduces system vulnerability and maintenance overhead, making it particularly valuable for ML environments where stability is crucial.

The data volume, Bottlerocket's second volume, serves as a dedicated space for container-related storage needs. This volume efficiently manages container metadata, images, and ephemeral storage, which is particularly beneficial for ML workloads with large container images. This architecture separates container data from the OS, thereby providing persistent caching of frequently used ML containers and dramatically reducing pull times for common frameworks such as PyTorch, TensorFlow, and their associated dependencies. Organizations running ML workloads on Amazon EKS can see dramatic reductions in container startup times, which significantly improves the efficiency of ML training and inference pipelines. If you would like to learn more about Bottlerocket, then you can explore our Bottlerocket documentation. You can also reference our post, Reduce container startup time on Amazon EKS with Bottlerocket data volume. Furthermore, you can reference our guidance on solving cold start challenges for AI/ML inference applications on Amazon EKS to learn more on this topic.

Secondary EBS volumes on AL2023

For organizations using AL2023-based Amazon EKS optimized AMIs, implementing secondary EBS volumes for container image caching presents a flexible and powerful solution. This approach allows for customized storage configurations tailored to specific ML workload requirements, such as adjusting volume size and type based on the size and frequency of container image usage. The secondary EBS volume can be optimized for I/O performance, which is crucial for quick access to large ML container images.

The implementation of this caching strategy through Amazon EKS Blueprints for Terraform provides a standardized and reproducible approach to container image caching. Moreover, this method not only improves container startup times, but also reduces network egress costs and improves cluster reliability by minimizing dependencies on external container registries. For ML workloads that need frequent scaling or pod rescheduling, this caching mechanism provides consistent and rapid access to container images, and it maintains optimal performance during peak training or inference periods.

Using NVMe with RAID0 for Kubelet and Containerd

RAID0 provides increased performance by striping data across multiple disks, but it doesn't offer redundancy. NVMe SSDs offer significantly lower latency and higher throughput when compared to traditional SSDs or HDDs, making them ideal for performance-critical workloads. Using NVMe with RAID0 can greatly improve kubelet and containerd performance and the overall responsiveness of your Kubernetes cluster. To use NVMe storage for kubelet and containerd with a RAID0 configuration in Kubernetes, configure the instance storage as a RAID array and mount it as the backing filesystem for the kubelet and containerd state directories. Karpenter supports the RAID0 configuration with the instanceStorePolicy field, enabling automated setup of high-performance storage while accepting the trade-off of reduced durability for maximum I/O performance.
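As a rough illustration of the Karpenter-based setup, the following sketch creates an EC2NodeClass with instanceStorePolicy set to RAID0 using the Kubernetes Python client. The role name, discovery tags, and AMI alias are placeholders, and the API version should be checked against the Karpenter release running in your cluster.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod

# Declarative equivalent of a Karpenter EC2NodeClass manifest; placeholder values
# (role, tags, cluster name) must be replaced with your own.
node_class = {
    "apiVersion": "karpenter.k8s.aws/v1",
    "kind": "EC2NodeClass",
    "metadata": {"name": "nvme-raid0"},
    "spec": {
        "amiSelectorTerms": [{"alias": "al2023@latest"}],  # AL2023 EKS optimized AMI
        "role": "KarpenterNodeRole-my-cluster",            # placeholder node IAM role
        "subnetSelectorTerms": [{"tags": {"karpenter.sh/discovery": "my-cluster"}}],
        "securityGroupSelectorTerms": [{"tags": {"karpenter.sh/discovery": "my-cluster"}}],
        # Stripe local NVMe instance store volumes into a RAID0 array and use it as
        # the backing storage for kubelet and containerd (ephemeral storage, image cache).
        "instanceStorePolicy": "RAID0",
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.k8s.aws",
    version="v1",
    plural="ec2nodeclasses",
    body=node_class,
)
```

NodePools that reference this EC2NodeClass then launch instances whose local NVMe drives back the container image cache, trading durability for I/O performance.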

Storage and caching options

This section outlines storage and caching options.

Amazon S3

Amazon S3 provides an object store designed for 11 nines of durability and high availability, which users can use to run big data analytics, AI, ML, and HPC applications to unlock data insights.

Benefits of Amazon S3 data lake storage for ML:

  • Fast, scalable batch data streaming to your compute cluster
  • Data ingest from publicly accessible endpoints
  • Cost-optimized to serve data for active training workloads
  • Archival storage options for long-term retention of training data

S3 Express One Zone

S3 Express One Zone is a high-performance storage class from Amazon S3 that delivers up to ten times faster performance than Amazon S3 Standard through its consistent single-digit millisecond first-byte latency. It introduces a new bucket type, known as a directory bucket, that can scale to support up to 2 million requests per second at single-digit millisecond latency without per-prefix limits—a dramatic improvement over the Amazon S3 Standard rate of 5,500 GET requests per second per prefix. This massive request capacity is complemented by a new session-based authorization model that is purpose-built for lower latency on every read or write to directory buckets, removing the authentication overhead that can add latency to traditional Amazon S3 operations. These combined capabilities make S3 Express One Zone particularly well-suited for AI/ML workloads needing ultra-high throughput and consistent low-latency access. These include real-time inference serving, high-frequency model parameter updates, and interactive ML development environments where millisecond-level response times are critical for maintaining system performance and user experience.
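The following is a minimal boto3 sketch of reading an object from a directory bucket. The bucket name (including its Availability Zone ID suffix), key, and Region are placeholders; the SDK handles the session-based authorization for directory buckets on your behalf.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

DIRECTORY_BUCKET = "ml-training-cache--use1-az4--x-s3"  # placeholder directory bucket name
KEY = "datasets/train/shard-00001.tar"                   # placeholder object key

# Standard S3 API calls work against directory buckets; only the bucket
# naming convention and authorization model differ.
response = s3.get_object(Bucket=DIRECTORY_BUCKET, Key=KEY)
shard_bytes = response["Body"].read()
print(f"Read {len(shard_bytes)} bytes from {DIRECTORY_BUCKET}/{KEY}")
```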

Optimizing code for Amazon S3 APIs

You can use the throughput-optimized Amazon S3 Connector for PyTorch. When you optimize data access between Amazon S3 and your PyTorch training job, you can keep your GPUs performing useful training work, which reduces overall training times and saves compute costs. Saving ML training model checkpoints is up to 40% faster with the S3 Connector for PyTorch than saving to Amazon Elastic Compute Cloud (Amazon EC2) instance storage. You can use this connector to save your model checkpoints directly into Amazon S3. This is an important path to optimize, because when you save checkpoints, all of your training nodes typically need to stop and write out their state before they can resume training. Checkpointing is also bursty—you write out large amounts of data over short bursts of time, then scale down to zero, which makes it a good fit for the elasticity and high throughput performance of Amazon S3.
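The following is a minimal sketch of checkpointing directly to Amazon S3 with the connector's checkpoint interface (the s3torchconnector package). The bucket, key, and Region are placeholders.

```python
import torch
import torch.nn as nn
from s3torchconnector import S3Checkpoint

model = nn.Linear(512, 10)  # stand-in for a real model
checkpoint = S3Checkpoint(region="us-east-1")
uri = "s3://my-training-bucket/checkpoints/step-500.pt"  # placeholder S3 URI

# Stream the checkpoint to S3 without staging it on local disk first.
with checkpoint.writer(uri) as writer:
    torch.save(model.state_dict(), writer)

# Later, or on another node, load the checkpoint back from S3.
with checkpoint.reader(uri) as reader:
    model.load_state_dict(torch.load(reader))
```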

Increasing per-client throughput

Mountpoint for Amazon S3 allows file-based applications to access S3 objects through familiar file system operations, automatically translating these operations into Amazon S3 API calls. However, Mountpoint for Amazon S3 isn't an alternative to a full-fledged file system service. It is ideal for read-heavy workloads that read large datasets (terabytes to petabytes in size) and necessitate the elasticity and high throughput of Amazon S3. Common use cases include ML training as well as reprocessing and validation in autonomous vehicle data processing. These workloads read large datasets over several compute instances and write sequentially to a file from a single process or thread. Using S3 Express One Zone with Mountpoint for Amazon S3 accelerates file-based applications that make random data access requests by up to six times when compared to Amazon S3 Standard. This speeds up ML training jobs, completing them faster and reducing compute costs. The C++ and Go SDKs now support multi-NIC functionality that optimizes P5 instance performance by using multiple network adapters to distribute workload traffic. This capability enables full GPU saturation by removing network bottlenecks through load distribution across available network interfaces. The key advantage is asynchronous checkpointing capability, which allows data transfers to occur in parallel with ongoing large-scale distributed training operations. This feature uses the Elastic Network Adapter (ENA) infrastructure available on P5x instances, providing continuous GPU usage without I/O-related performance degradation.
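To illustrate the read-heavy, sequential access pattern that Mountpoint for Amazon S3 serves well, the following is a minimal PyTorch sketch that streams training shards from a hypothetical Mountpoint mount path. The mount path, file layout, and chunk size are illustrative assumptions.

```python
import glob
import torch
from torch.utils.data import DataLoader, IterableDataset

class MountpointShardDataset(IterableDataset):
    """Streams shards sequentially from a Mountpoint for Amazon S3 mount,
    which maps well to Mountpoint's strength for large sequential reads."""

    def __init__(self, mount_path: str = "/mnt/s3-dataset/train"):
        self.shard_paths = sorted(glob.glob(f"{mount_path}/*.bin"))

    def __iter__(self):
        info = torch.utils.data.get_worker_info()
        # Split shards across DataLoader workers so each shard is read exactly once.
        paths = self.shard_paths if info is None else self.shard_paths[info.id::info.num_workers]
        for path in paths:
            with open(path, "rb") as f:
                while chunk := f.read(1 << 20):  # 1 MiB sequential reads
                    yield torch.frombuffer(bytearray(chunk), dtype=torch.uint8)

loader = DataLoader(MountpointShardDataset(), batch_size=None, num_workers=4)
```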

Reducing latencies for frequently read data

For multi-instance ML training workloads, S3 Express One Zone can be strategically deployed as a shared cache layer, creating significant performance advantages across distributed training scenarios. In this configuration, the first training instance populates the express cache during its initial epoch, while subsequent instances benefit from the pre-warmed cache, achieving up to seven times faster data access when compared to Amazon S3 Standard. This shared caching approach is particularly valuable for distributed training environments where multiple instances access the same training datasets, because each added instance can immediately use the cached data without experiencing cold start penalties. This architecture pattern transforms S3 Express One Zone from just a high-performance storage solution into an intelligent caching layer that accelerates the entire distributed training pipeline. In turn, this reduces both training time and data transfer costs across multi-node ML workloads.
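One way to build such a shared cache layer is a small warm-up job that copies frequently read objects from a standard S3 bucket into an S3 Express One Zone directory bucket before training starts. The following boto3 sketch illustrates the idea; the bucket names, prefix, and Region are placeholders, and large objects would normally use multipart or managed transfers rather than a single get/put.

```python
import boto3

REGION = "us-east-1"
SOURCE_BUCKET = "my-training-datasets"           # placeholder standard S3 bucket
CACHE_BUCKET = "training-cache--use1-az4--x-s3"  # placeholder directory bucket
PREFIX = "datasets/train/"

s3 = boto3.client("s3", region_name=REGION)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        # Copy each object into the low-latency cache bucket under the same key.
        body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read()
        s3.put_object(Bucket=CACHE_BUCKET, Key=obj["Key"], Body=body)
        print(f"cached {obj['Key']} ({obj['Size']} bytes)")
```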

FSx for Lustre

FSx for Lustre is a fully managed file system that is optimized for high-performance workloads such as ML, where it can scale to up to a terabyte per second of throughput and millions of IOPS at sub-millisecond latencies. FSx for Lustre provides multiple deployment options for cost optimization across Scratch-SSD, Persistent-SSD, and Intelligent-Tiering storage. FSx for Lustre natively integrates with Amazon S3, meaning that it can process datasets stored within an S3 bucket through a high-performance FSx file system, which is a common ML pattern. When an FSx for Lustre file system is linked to an S3 bucket, it transparently presents S3 objects as files and automatically updates the contents of the linked S3 bucket as files are added to, changed in, or deleted from the file system.

We recommend FSx for Lustre in scenarios where you have an environment with multiple EC2 GPU compute instances that need low-latency, high-bandwidth access to the same datasets, such as model caching or distributed training, and you need native access to models or training data that is already stored in an S3 bucket. FSx for Lustre provides native Amazon S3 integration for automatic import and export of data between the FSx file system and a linked S3 bucket, so that the FSx file system can be used as a high-performance model and data cache for your assets stored in an S3 bucket. Caching your models on an FSx for Lustre file system means that you can reduce latencies for frequently read data. You can deploy an FSx for Lustre file system linked to an S3 bucket, or as a standalone instance to host your model data.

Storage deployment options

  • Scratch-SSD storage: Recommended for short-lived workloads that are ephemeral (hours), with fixed throughput capacity per-TiB provisioned—ideal for temporary training jobs and development environments.
  • Persistent-SSD storage: Recommended for long-running workloads and mission-critical applications needing highest availability, such as HPC simulations, big data analytics, or ML training, with configurable storage and throughput capacity per-TiB.

Benefits of FSx for Lustre for ML workloads

  • Amazon EKS integration: Install the FSx for Lustre CSI driver to mount FSx filesystems on Amazon EKS as Persistent Volumes (PV) for seamless Kubernetes integration.
  • Flexible deployment models: Deploy as standalone high-performance cache or as an Amazon S3-linked file system acting as a high-performance cache for Amazon S3 data, providing fast I/O and high throughput across GPU compute instances.
  • GPU optimization with EFA: Persistent-SSD deployments support EFA for ultra-low latency networking, ideal for high-performance, throughput-based GPU workloads.
  • NVIDIA GPUDirect Storage (GDS): FSx for Lustre supports GDS technology, creating a direct data path between file system and GPU memory for faster data access, which removes CPU bottlenecks.
  • Data compression: Enable compression on the file system for compressible file types to increase performance by reducing data transfer between FSx for Lustre file servers and storage.
  • Administrative pod strategy: Configure administrative pods with the Lustre client to pre-warm file systems with ML training data or large language models (LLMs). This is especially critical for Spot-based EC2 GPU instances, where you can pre-load the desired data before launching compute instances (a minimal pre-warming sketch follows this list).
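As a rough illustration of the administrative pod approach, the following sketch pre-warms an S3-linked FSx for Lustre file system by reading every file under a model prefix once, which triggers the lazy load from the linked S3 bucket so later GPU pods hit a warm cache. The mount path and parallelism are illustrative assumptions; FSx for Lustre also supports bulk preloading with the lfs hsm_restore command.

```python
import os
from concurrent.futures import ThreadPoolExecutor

MOUNT_PATH = "/mnt/fsx/models"   # hypothetical Lustre mount inside the administrative pod
READ_CHUNK = 8 * 1024 * 1024     # 8 MiB reads

def warm(path: str) -> int:
    """Read a file end to end to pull its contents from S3 into the Lustre cache."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(READ_CHUNK):
            total += len(chunk)
    return total

files = [os.path.join(root, name)
         for root, _, names in os.walk(MOUNT_PATH)
         for name in names]

with ThreadPoolExecutor(max_workers=16) as pool:
    warmed_bytes = sum(pool.map(warm, files))

print(f"Pre-warmed {len(files)} files ({warmed_bytes / 1e9:.1f} GB)")
```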

Advanced data striping configuration

  • Parallel access optimization: Distribute file data across multiple Object Storage Targets (OSTs) within the Lustre file system to maximize parallel access and throughput, especially for large-scale ML training jobs.
  • Progressive File Layouts (PFL): Default 4-component Lustre striping configuration automatically created through PFL capability. In most scenarios, no manual adjustment is needed for optimal performance.
  • Amazon S3-linked file system optimization: Files imported through Amazon S3 integration (DRA) use the ImportedFileChunkSize parameter layout (default 1 GiB) instead of the default PFL. For large files, tune this parameter to higher values for optimal stripe count distribution.
  • Placement strategy: Deploy FSx file systems in the same AWS Availability Zone (AZ) as compute/GPU nodes for lowest latency access, avoiding cross-AZ patterns. For multi-AZ GPU deployments, deploy separate FSx systems in each Availability Zone.

Performance and cost benefits

  • Intuitive data access: Consistent, fast, and scalable random data access across files and within files for researchers and developers.
  • Cost-performance optimization: Different price-performance tiers optimized to serve data for active training workloads with various durability and performance requirements.
  • Scalable architecture: Seamlessly handles workloads from single-node development to large-scale distributed training across hundreds of GPU instances.

Conclusion

This post explored the critical aspects of image and model caching strategies for AI/ML workloads on Amazon EKS, highlighting the importance of efficient storage solutions in the ML infrastructure stack. We’ve examined multiple storage and caching options, each serving distinct purposes in the ML pipeline, from container image caching using Bottlerocket and AL2023 solutions, to high-performance data access using Amazon S3 Express One Zone and Amazon FSx for Lustre. The storage solutions discussed address key challenges in ML workloads: maintaining high-performance data access, managing large-scale checkpointing operations, and providing cost-effective resource usage. Bottlerocket’s dual-volume architecture and secondary Amazon EBS volumes on AL2023 demonstrate significant improvements in container startup times, while services such as S3 Express One Zone and FSx for Lustre provide the high-throughput, low-latency access crucial for ML training workloads.

As ML models continue to grow in size and complexity, optimized storage and caching strategies become even more critical. Organizations must carefully consider their specific workload requirements when choosing storage solutions, balancing factors such as data access patterns, performance needs, and cost considerations. Organizations can implement the appropriate combination of storage solutions and caching strategies discussed in this post to significantly improve their ML training efficiency, reduce operational costs, and maintain robust, scalable ML infrastructure on Amazon EKS. Looking ahead, we'll continue to see innovations in storage solutions specifically designed for ML workloads, as the demand for efficient handling of large-scale AI and ML operations continues to grow. Stay tuned for more best practices and architectural patterns for optimizing ML workloads on Amazon EKS.


About the authors

Elamaran (Ela) Shanmugam is a Sr. Container Specialist Solutions Architect at AWS with over 20 years of experience in architecting, building, and operating enterprise systems and infrastructure. Ela helps AWS customers and partners build products and services using container technologies to enable their business. Ela is a Container, App Modernization, Observability, and Machine Learning SME and helps AWS partners and customers design and build scalable, secure, and optimized container workloads on AWS. Ela contributes to open source, delivers public speaking engagements, mentors individuals, and publishes engaging technical content such as AWS Whitepapers, AWS Blogs, and internal articles. Ela is based out of Tampa, Florida. Connect with Ela on Twitter @IamElaShan and on GitHub.

Jayaprakash Alawala (JP) is a Principal Container Specialist Solutions Architect at AWS with over 20 years of experience in architecting and building enterprise systems and infrastructure. JP helps many large AWS customers build well-architected solution architectures on AWS. JP's areas of expertise span Application Modernization, Containers, AI/ML, generative AI, Security, Platform Engineering, SRE, DevSecOps, IaC, and Cost Optimization. JP contributes to open source, mentors individuals, and publishes engaging technical content such as AWS Workshops and AWS Blogs. JP is based out of Bangalore, India. Connect with him on LinkedIn.

Re Alvarez-Parmar is a Containers Specialist Solutions Architect at AWS. Re advises engineering teams on modernizing and building distributed services in the cloud. Prior to joining AWS, he spent over 15 years as an Enterprise and Software Architect. He is based out of Seattle. Connect with him on LinkedIn.