Category: Amazon EC2

Amazon Elastic File System – Production-Ready in Three Regions

The portfolio of AWS storage products has grown increasingly rich and diverse over time. Amazon S3 started out with a single storage class and has grown to include storage classes for regular, infrequently accessed, and archived objects. Similarly, Amazon Elastic Block Store (EBS) began with a single volume type and now offers a choice of four types of SAN-style block storage, each designed to be a great for a particular set of access patterns and data types.

With object storage and block storage capably addressed by S3 and EBS, we turned our attention to the file system. We announced the Amazon Elastic File System (EFS) last year in order to provide multiple EC2 instances with shared, low-latency access to a fully-managed file system.

I am happy to announce that EFS is now available for production use in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions.

We are launching today after an extended preview period that gave us insights into an extraordinarily wide range of customer use cases. The EFS preview was a great fit for large-scale, throughput-heavy processing workloads, along with many forms of content and web serving. During the preview we received a lot of positive feedback about the performance of EFS for these workloads, along with requests to provide equally good support for workloads that are sensitive to latency and/or make heavy use of file system metadata. We’ve been working to address this feedback and today’s launch is designed to handle a very wide range of use cases. Based on what I have heard so far, our customers are really excited about EFS and plan to put it to use right away.

Why We Built EFS
Many AWS customers have asked us for a way to more easily manage file storage on a scalable basis. Some of these customers run farms of web servers or content management systems that benefit from a common namespace and easy access to a corporate or departmental file hierarchy. Others run HPC and Big Data applications that create, process, and then delete many large files, resulting in storage utilization and throughput demands that vary wildly over time. Our customers also insisted on high availability, and durability, along with a strongly consistent model for access and modification.

Amazon Elastic File System
EFS lets you create POSIX-compliant file systems and attach them to one or more of your EC2 instances via NFS. The file system grows and shrinks as necessary (there’s no fixed upper limit and you can grow to petabyte scale) and you don’t pre-provision storage space or bandwidth. You pay only for the storage that you use.

EFS protects your data by storing copies of your files, directories, links, and metadata in multiple Availability Zones.

In order to provide the performance needed to support large file systems accessed by multiple clients simultaneously, Elastic File System performance scales with storage (I’ll say more about this later).

Each Elastic File System is accessible from a single VPC, and is accessed by way of mount targets that you create within the VPC. You have the option to create a mount target in any desired subnet of your VPC. Access to each mount target is controlled, as usual, via Security Groups.

EFS offers two distinct performance modes. The first mode, General Purpose, is the default. You should use this mode unless you expect to have tens, hundreds, or thousands of EC2 instances access the file system concurrently. The second mode, Max I/O, is optimized for higher levels of aggregate throughput and operations per second, but incurs slightly higher latencies for file operations. In most cases, you should start with general purpose mode and watch the relevant CloudWatch metric (PercentIOLimit). When you begin to push the I/O limit of General Purpose mode, you can create a new file system in Max I/O mode, migrate your files, and enjoy even higher throughput and operations per second.

Elastic File System in Action
It is very easy to create, mount, and access an Elastic File System. I used the AWS Management Console; I could have used the EFS API, the AWS Command Line Interface (CLI), or the AWS Tools for Windows PowerShell as well.

I opened the console and clicked on the Create file system button:

Then I selected one of my VPCs and created a mount target in my public subnet:

My security group (corp-vpc-mount-target) allows my EC2 instance to access the mount point on port 2049. Here’s the inbound rule; the outbound one is the same:

I added Name and Owner tags, and opted for the General Purpose performance mode:

Then I confirmed the information and clicked on Create File System:

My file system was ready right away (the mount targets took another minute or so):

I clicked on EC2 mount instructions to learn how to mount my file system on an EC2 instance:

I mounted my file system as /efs, and there it was:

I copied a bunch of files over, and spent some time watching the NFS stats:

The console reports on the amount of space consumed by my file systems (this information is collected every hour and is displayed 2-3 hours after it is collected):

CloudWatch Metrics
Each file system delivers the following metrics to CloudWatch:

  • BurstCreditBalance – The amount of data that can be transferred at the burst level of throughput.
  • ClientConnections – The number of clients that are connected to the file system.
  • DataReadIOBytes – The number of bytes read from the file system.
  • DataWriteIOBytes -The number of bytes written to the file system.
  • MetadataIOBytes – The number of bytes of metadata read and written.
  • TotalIOBytes -The sum of the preceding three metrics.
  • PermittedThroughput -The maximum allowed throughput, based on file system size.
  • PercentIOLimit – The percentage of the available I/O utilized in General Purpose mode.

You can see the metrics in the CloudWatch Console:

EFS Bursting, Workloads, and Performance
The throughput available to each of your EFS file systems will grow as the file system grows. Because file-based workloads are generally spiky, with demands for high levels of throughput for short amounts of time and low levels the rest of the time, EFS is designed to burst to high throughput levels on an as-needed basis.

All file systems can burst to 100 MB per second of throughput. Those over 1 TB can burst to an additional 100 MB per second for each TB stored. For example, a 2 TB file system can burst to 200 MB per second and a 10 TB file system can burst to 1,000 MB per second of throughput. File systems larger than 1 TB can always burst for 50% of the time if they are inactive for the other 50%.

EFS uses a credit system to determine when a file system can burst. Each one accumulates credits at a baseline rate (50 MB per TB of storage) that is determined by the size of the file system, and spends them whenever it reads or writes data. The accumulated credits give the file system the ability to drive throughput beyond the baseline rate.

Here are some examples to give you a better idea of what this means in practice:

  • A 100 GB file system can burst up to 100 MB per second for up to 72 minutes each day, or drive up to 5 MB per second continuously.
  • A 10 TB file system can burst up to 1 GB per second for 12 hours each day, or drive 500 MB per second continuously.

To learn more about how the credit system works, read about File System Performance in the EFS documentation.

In order to gain a better understanding of this feature, I spent a couple of days copying and concatenating files, ultimately ending up using well over 2 TB of space on my file system. I watched the PermittedThroughput metric grow in concert with my usage as soon as my file collection exceeed 1 TB. Here’s what I saw:

As is the case with any file system, the throughput you’ll see is dependent on the characteristics of your workload. The average I/O size, the number of simultaneous connections to EFS, the file access pattern (random or sequential), the request model (synchronous or asynchronous), the NFS client configuration, and the performance characteristics of the EC2 instances running the NFS clients each have an effect (positive or negative). Briefly:

  • Average I/O Size – The work associated with managing the metadata associated with small files via the NFS protocol, coupled with the work that EFS does to make your data highly durable and highly available, combine to create some per-operation overhead. In general, overall throughput will increase in concert with the average I/O size since the per-operation overhead is amortized over a larger amount of data. Also, reads will generally be faster than writes.
  • Simultaneous Connections – Each EFS file system can accommodate connections from thousands of clients. Environments that can drive highly parallel behavior (from multiple EC2 instances) will benefit from the ability that EFS has to support a multitude of concurrent operations.
  • Request Model – If you enable asynchronous writes to the file system by including the async option at mount time, pending writes will be buffered on the instance and then written to EFS asynchronously. Accessing a file system that has been mounted with the sync option or opening files using an option that bypasses the cache (e.g. O_DIRECT) will, in turn, issue synchronous requests to EFS.
  • NFS Client Configuration – Some NFS clients use laughably small (by today’s standards) values for the read and write buffers by default. Consider increasing it to 1 MiB (again, this is an option to the mount command). You can use an NFS 4.0 or 4.1 client with EFS; the latter will provide better performance.
  • EC2 Instances – Applications that perform large amounts of I/O sometimes require a large amount of memory and/or compute power as well. Be sure that you have plenty of both; choose an appropriate instance size and type. If you are performing asynchronous reads and writes, the kernel use additional memory for caching. As a side note, the performance characteristics of EFS file systems are not dependent on the use of EBS-optimized instances.

Benchmarking of file systems is a blend of art and science. Make sure that you use mature, reputable tools, run them more than once, and make sure that you examine your results in light of the considerations listed above. You can also find some detailed data regarding expected performance on the Amazon Elastic File System page.

Available Now
EFS is available now in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions and you can start using it today.  Pricing is based on the amount of data that you store, sampled several times per day and charged by the Gigabyte-month, pro-rated as usual, starting at $0.30 per GB per month in the US East (Northern Virginia) Region. There are no minimum fees and no setup costs (see the EFS Pricing page for more information). If you are eligible for the AWS Free Tier, you can use up to 5 GB of EFS storage per month at no charge.


Elastic Network Adapter – High Performance Network Interface for Amazon EC2

Many AWS customers create high-performance systems that run across multiple EC2 instances and make good use of all available network bandwidth. Over the years, we have been working to make EC2 an ever-better host for this use case. For example, the CC1 instances were the first to feature 10 Gbps networking. Later, the Enhanced Networking feature reduced latency, increased the packet rate, and reduced variability. Through the years, the goal has remained constant—ensure that the network is not the bottleneck for the vast majority of customer workloads.

We’ve been talking to lots of customers and watching technology trends closely. Along with a short-term goal of enabling higher throughput by providing more bandwidth, we established some longer-term goals for the next generation of EC2 networking. We wanted to be able to take advantage of the increased concurrency (more vCPUs) found in today’s processors, and we wanted to lay a solid foundation for driver support in order to allow our customers to take advantage of new developments as easily as possible.

I’m happy to be able to report that we are making great progress toward these goals with today’s launch of the new Elastic Network Adapter (ENA) to provide even better support for high performance networking. Available now for the new X1 instance type, ENA provides up to 20 Gbps of consistent, low-latency performance when used within a Placement Group, at no extra charge!

Per our longer-term goals, ENA will scale as network bandwidth grows and the vCPU count increases; this will allow you to take advantage of higher bandwidth options in the future without the need to install newer drivers or to make other changes to your configuration, as was required by earlier network interfaces.

ENA Advantages
We designed ENA to work well in conjunction with modern processors, such as those found on X1 instances. Because these processors feature a large number of virtual CPUs (128 in the case of  X1), efficient use of shared resources such as the network adapter is important. While delivering high throughput and great packet per second (PPS) performance, ENA minimizes the load on the host processor in a number of ways, and also does a better job of distributing the packet processing workload across multiple vCPUs. Here are some of the features that enable this improved performance:

  • Checksum Generation – ENA handles IPv4 header checksum generation and TCP / UDP partial checksum generation in hardware.
  • Multi-Queue Device Interface – ENA makes uses of multiple transmit and receive queues to reduce internal overhead and to increase scalability. The presence of multiple queues simplifies and accelerates the process of mapping incoming and outgoing packets to a particular vCPU.
  • Receive-Side Steering – ENA is able to direct incoming packets to the proper vCPU for processing. This reduces bottlenecks and increases cache efficacy.

All of these features are designed to keep as much of the workload off of the processor as possible and to create a short, efficient path between the network packets and the vCPU that is generating or processing them.

Using ENA
In order to make use of ENA,  you need to use our new driver and tag the AMI as having ENA support.

The new driver is available in the latest Amazon Linux AMIs and will soon be available in the Windows AMIs. The Open Source Linux driver is available in source form on GitHub for use in your own AMIs. Also, a driver for the Intel® Data Plane Developer Kit (Intel® DPDK) is available for developers that are building network processing applications such as load balancers or virtual routers.

If you are creating your own AMI, you also need to set the enaSupport attribute when you register it. Here’s how you do that from the command line (see the register-image documentation for a full list of parameters):

$ aws ec2 register-image --ena-support ...

You can still use the AMI on instances that do not support ENA.

Going Forward
As noted earlier, ENA is available today for X1 instances. We are also planning to make it available for future EC2 instance types.


EC2 Instance Console Screenshot

When our users move existing machine images to the cloud for use on Amazon EC2, they occasionally encounter issues with drivers, boot parameters, system configuration settings, and in-progress software updates. These issues can cause the instance to become unreachable via RDP (for Windows) or SSH (for Linux) and can be difficult to diagnose. On a traditional system, the physical console often contains log messages or other clues that can be used to identify and understand what’s going on.

In order to provide you with additional visibility into the state of your instances, we now offer the ability to generate and capture screenshots of the instance console. You can generate screenshots while the instance is running or after it has crashed.

Here’s how you generate a screenshot from the console (the instance must be using HVM virtualization):

And here’s the result:

It can also be used for Windows instances:

You can also create screenshots using the CLI (aws ec2 get-console-screenshot) or the EC2 API (GetConsoleScreenshot).

Available Now
This feature is available today in the US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (São Paulo) Regions. There are no costs associated with it.



X1 Instances for EC2 – Ready for Your Memory-Intensive Workloads

Many AWS customers are running memory-intensive big data, caching, and analytics workloads and have been asking us for EC2 instances with ever-increasing amounts of memory.

Last fall, I first told you about our plans for the new X1 instance type. Today, we are announcing availability of this instance type with the launch of the x1.32xlarge instance size. This instance has the following specifications:

  • Processor: 4 x Intel™ Xeon E7 8880 v3 (Haswell) running at 2.3 GHz – 64 cores / 128 vCPUs.
  • Memory: 1,952 GiB with Single Device Data Correction (SDDC+1).
  • Instance Storage: 2 x 1,920 GB SSD.
  • Network Bandwidth: 10 Gbps.
  • Dedicated EBS Bandwidth: 10 Gbps (EBS Optimized by default at no additional cost).

The Xeon E7 processor supports Turbo Boost 2.0 (up to 3.1 GHz), AVX 2.0AES-NI, and the very interesting (to me, anyway) TSX-NI instructions. AVX 2.0 (Advanced Vector Extensions) can improve performance on HPC, database, and video processing workloads; AES-NI improves the speed of applications that make use of AES encryption. The new TSX-NI instructions support something cool called transactional memory. The instructions allow highly concurrent, multithreaded applications to make very efficient use of shared memory by reducing the amount of low-level locking and unlocking that would otherwise be needed around each memory access.

If you are ready to start using the X1 instances in the US East (Northern Virginia), US West (Oregon), EU (Ireland), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Singapore), or Asia Pacific (Sydney) Regions, please request access and we’ll get you going as soon as possible. We have plans to make the X1 instances available in other Regions and in other sizes before too long.

3-year Partial Upfront Reserved Instance Pricing starts at $3.970 per hour in the US East (Northern Virginia) Region; see the EC2 Pricing page for more information. You can purchase Reserved Instances and Dedicated Host Reservations today; Spot bidding is on the near-term roadmap.

Here are some screen shots of an x1.32xlarge in action. lscpu shows that there are 128 vCPUs spread across 4 sockets:

On bootup, the kernel reports on the total accessible memory:

The top command shows a huge number of running processes and lots of memory:

Ready for Enterprise-Scale SAP Workloads
The X1 instances have been certified by SAP for production workloads. They meet the performance bar for SAP OLAP and OLTP workloads backed by SAP HANA.

You can migrate your on-premises deployments to AWS and you can also start fresh. Either way, you can run S/4HANA, SAP’s next-generation Business Suite, as well as earlier versions.

Many AWS customers are currently running HANA in scale-out fashion across multiple R3 instances. Many of these workloads can now be run on a single X1 instance. This configuration will be simpler to set up and less expensive to run. As I mention below, our updated SAP HANA Quick Start will provide you with more information on your configuration options.

Here’s what SAP HANA Studio looks like when run on an X1 instance:

You have several interesting options when it comes to disaster recovery (DR) and high availability (HA) when you run your SAP HANA workloads on an X1 instance. For example:

  • Auto Recovery – Depending on your RPO (Recovery Point Objective) and RTO (Recovery Time Objective), you may be able to use a single instance in concert with EC2 Auto Recovery.
  • Hot Standby – You can run X1 instances in 2 Availability Zones and use HANA System Replication to keep the spare instance in sync.
  • Warm Standby / Manual Failover – You can run a primary X1 instance and a smaller secondary instance configured to persist only to permanent storage.  In the event that a failover is necessary, you stop the secondary instance, modify the instance type to X1, and reboot. This unique, AWS-powered option will give you quick recovery while keeping costs low.

We have updated our HANA Quick Start as part of today’s launch. You can get SAP HANA running in a new or existing VPC within an hour using a well-tested configuration:

The Quick Start will help you to configure the instance and the associated storage, install the requisite operating system packages, and to install SAP HANA.

We have also released a SAP HANA Migration Guide. It will help you to migrate your existing on-premises or AWS-based SAP HANA workloads to AWS.


EC2 Run Command Update – Manage & Share Commands and More

The EC2 Run Command allows you to manage your EC2 instances in a convenient, scalable fashion (see my blog post, New EC2 Run Command – Remote Instance Management at Scale, for more information).

Today we are making several enhancements to this feature:

Document Management and Sharing -You can now create custom command documents and share them with other AWS accounts or publicly with all AWS users.

Additional Predefined Commands – You can use some new predefined commands to simplify your administration of Windows instances.

Open Sourced Agent – The Linux version of the on-instance agent is now available in open source form on GitHub.

Document Management and Sharing
You can now manage and share the command documents that you execute via Run Command. This will allow you to add additional rigor to your administrative procedures by reducing variability and removing a source of errors. You can also take advantage of command documents that are created and shared by other AWS users.

This feature was designed to support several scenarios that our customers have shared with us. Some customers wanted to create documents in one account and then share them with other accounts that are part of the same organization.  Others wanted to package up common tasks and share them with the broader community. AWS partners wanted to share documents that would encapsulate common setup and administrative tasks specific to the offerings.

Here’s how you can see your documents, public documents, and documents that have been shared with you:

You can click on a document to learn more about what it does:

And to find out what parameters it accepts:

You can also examine the document before you run it (this is a highly recommended best practice, especially for documents that have been shared with you):

You can create a new command (I used a simplified version of the built-in AWS-RunShellScript command):

Finally, you can  share a document that you have uploaded and tested. You can share it publicly or with specific AWS accounts:

Read about Creating Your Own Command to learn more about this feature.

Additional Predefined Commands
Many AWS customers use Run Command to maintain and administer EC2 instances that are running Microsoft Windows. We have added four new commands designed to simplify and streamline some common operations:

AWS-ListWindowsInventory – Collect on-instance inventory information (operating system, installed applications, and installed updates). Results can be directed to an S3 bucket.

AWS-FindWindowsUpdates – List missing Windows updates.

AWS-InstallMissingWindowsUpdates – Install missing Windows updates.

AWS-InstallSpecificWindowsUpdates – Install a specific set of Windows updates, identified by Knowledge Base (KB) IDs.

Open Sourced Agent
The Linux version of the on-instance Simple Systems Manager (SSM) agent is now available on GitHub at .

You are welcome to submit pull requests for this code (see for more info).

Available Now
The features described above are available now and you can start using them today in the US East (Northern Virginia), US West (Oregon), EU (Ireland), US West (Northern California), EU (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (São Paulo) Regions.

To learn more, read Managing Amazon EC2 Instances Remotely (Windows)  and Managing Amazon EC2 Instances Remotely (Linux).


They’re Here – Longer EBS and Storage Gateway Resource IDs Now Available

Last November I let you know that were were planning to increase the length of the resource IDs for EC2 instances, reservations, EBS volumes, and snapshots in 2016. Early this year I showed you how to opt in to the new format for EC2 instances and EC2 reservations.

Effective today you can now opt in to the new format for volumes and snapshots for EBS and Storage Gateway.

As I said earlier:

If you build libraries, tools, or applications that make direct calls to the AWS API, now is the time to opt in and to start your testing process! If you store the IDs in memory or in a database, take a close look at fixed-length fields, data structures, schema elements, string operations, and regular expressions. Resources that were created before you opt in will retain their existing short identifiers; be sure that your revised code can still handle them!

You can opt in to the new format using the AWS Management Console, the AWS Command Line Interface (CLI), the AWS Tools for Windows PowerShell, or by calling the ModifyIdFormat API function.

Opting In – Console
To opt in via the Console, simply log in, choose EC2, and click on Resource ID length management:

Then click on Use Longer IDs for the desired resource types:

Note that volume applies to EBS volumes and to Storage Gateway volumes and that snapshot applies to EBS snapshots (both direct and through Storage Gateway).

For information on using the AWS Command Line Interface (CLI) or the AWS Tools for Windows PowerShell, take a look at They’re Here – Longer EC2 Resource IDs Now Available.

Things to Know
Here are a couple of things to keep in mind as you transition to the new resource IDs:

  1. Some of the older versions of the AWS SDKs and CLIs are not compatible with the new format. Visit the Longer EC2 and EBS Resource IDs FAQ for more information on compatibility.
  2. New AWS Regions get longer instance, reservation, volume, and snapshot IDs by default. You can opt out for Regions that launch between now and December 2016.
  3. Starting on April 28, 2016, new accounts in all commercial regions except China (Beijing) and AWS GovCloud (US) will get longer instance and reservation IDs by default, again with the ability to opt out.




Experiment that Discovered the Higgs Boson Uses AWS to Probe Nature

My colleague Sanjay Padhi is part of the AWS Scientific Computing team. He wrote the guest post below to share the story of how AWS provided computational resources that aided in an important scientific discovery.


The Higgs boson (sometimes referred to as the God Particle), responsible for providing insight into the origin of mass, was discovered in 2012 by the world’s largest experiments, ATLAS and CMS, at the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland. The theorists behind this discovery were awarded the 2013 Nobel Prize in Physics.

Deep underground on the border between France and Switzerland, the LHC is the world’s largest (17 miles in circumference) and highest-energy particle accelerator. It explores nature on smaller scales than any human invention has ever explored before.

From Experiment to Raw Data
The high energy particle collisions turn mass in to energy, which then turns back in to mass, creating new particles that are observed in the CMS detector. This detector is 69 feet long, 49 feet wide and 49 feet high, and sits in a cavern 328 feet underground near the village of Cessy in France. The raw data from the CMS is recorded every 25 nanoseconds at a rate of approximately 1 petabyte per second.

After online and offline processing of the raw data at the CERN Tier 0 data center, the datasets are distributed to 7 large Tier 1 data centers across the world within 48 hours, ready for further processing and analysis by scientists (the CMS collaboration, one of the largest in the world, consists of more than 3,000 participating members from over 180 institutes and universities in 43 countries).

Processing at Fermilab
Fermilab is one of 16 National Laboratories operated by the United States Department of Energy. Located just outside Batavia Illinois, Fermilab serves as one of the Tier 1 data centers for Cern’s CMS experiment.

With the increase in LHC collision energy last year, the demand for data assimilation, event simulations, and large-scale computing increased as well. With this increase came a desire to maximize cost efficiency by dynamically provisioning resources on an as-needed basis.

In order to address this issue, the Fermilab Scientific Computing Division launched the HEP (High Energy Physics) Cloud project in June of 2015. They planned to develop a virtual facility that would provide a common interface to access a variety of computing resources including commercial clouds. Using AWS, the HEP Cloud project successfully demonstrated the ability to add 58,000 cores elastically to their on-premises facility for the CMS experiment.

The image below depicts one of the simulations that was run on AWS. It shows how the collision of two protons creates energy that then becomes new particles.

The additional 58,000 cores represents a 4x increase in Fermilab’s computational capacity, all of which is dedicated to the CMS experiment in order to generate and reconstruct Monte Carlo simulation events. More than 500 million events were fully simulated in 10 days using 2.9 million jobs. Without help from AWS, this job would have taken 6 weeks to complete using the on-premises compute resources at Fermilab.

This simulation was done in preparation for one of the major high energy physics international conferences, Recontres de Moriond. Physicists across the world will use these simulations to probe nature in detail and will share their findings with their international colleagues during the conference.

Saving Money with HEP Cloud
The HEP Cloud project aims to minimize the costs of computation. The R&D and demonstration effort was supported by an award from the AWS Cloud Credit for Research.

HEP Cloud’s decision engine, the brain of the facility, has several duties. It oversees EC2 Spot Market price fluctuations using  tools and techniques provided by Amazon’s Spot team,  initializes Amazon EC2 instances using HTCondor, tracks the DNS names of the instances using Amazon Route 53 , and makes use of AWS CloudFormation templates for infrastructure as a code.

While on the road to success, the project team had to overcome several challenges, ranging from fine-tuning configurations to optimizing their use of Amazon S3 and other resources. For example, they devised a strategy to distribute the auxiliary data across multiple AWS Regions in order to minimize storage costs and data-access latency.

Automatic Scaling into AWS
The figure below shows elastic, automatic expansion of Fermilab’s Computing Facility into the AWS Cloud using Spot instances for CMS workflows. Monitoring of the resources was done using open source software provided by Grafana with custom modifications provided by the HEP Cloud.

Panagiotis Spentzouris (head of the Scientific Computing Division at Fermilab), told me:

Modern HEP experiments require massive computing resources in irregular cycles, so it is imperative for the success of our program that our computing facilities can rapidly expand and contract resources to match demand.  Using commercial clouds is an important ingredient for achieving this goal, and our work with AWS on the CMS experiment’s workloads though HEPCloud was a great success in demonstrating the value of this approach.

I hope that you enjoyed this brief insight into the ways in which AWS is helping to explore the frontiers of physics!

Sanjay Padhi, Ph.D, AWS Scientific Computing

New – CloudWatch Metrics for Spot Fleets

You can launch an EC2 Spot fleet with a couple of clicks. Once launched, the fleet allows you to draw resources from multiple pools of capacity, giving you access to cost-effective compute power regardless of the fleet size (from one instance to many thousands). For more information about this important EC2 feature, read my posts: Amazon EC2 Spot Fleet API – Manage Thousands of Spot Instances with One Request and Spot Fleet Update – Console Support, Fleet Scaling, CloudFormation.

I like to think of each Spot fleet as a single, collective entity. After a fleet has been launched, it is an autonomous group of EC2 instances. The instances may come and go from time to time as Spot prices change (and your mix of instances is altered in order to deliver results as cost-effectively as possible) or if the fleet’s capacity is updated, but the fleet itself retains its identity and its properties.

New Spot Fleet Metrics
In order to make it even easier for you to manage, monitor, and scale your Spot fleets as collective entities, we are introducing a new set of Spot fleet CloudWatch metrics.

The metrics are reported across multiple dimensions: for each Spot fleet, for each Availability Zone utilized by each Spot fleet, for each EC2 instance type within the fleet, and for each Availability Zone / instance type combination.

The following metrics are reported for each Spot fleet (you will need to enable EC2 Detailed Monitoring in order to ensure that they are all published):

  • AvailableInstancePoolsCount
  • BidsSubmittedForCapacity
  • CPUUtilization
  • DiskReadBytes
  • DiskReadOps
  • DiskWriteBytes
  • DiskWriteOps
  • EligibleInstancePoolCount
  • FulfilledCapacity
  • MaxPercentCapacityAllocation
  • NetworkIn
  • NetworkOut
  • PendingCapacity
  • StatusCheckFailed
  • StatusCheckFailed_Instance
  • StatusCheckFailed_System
  • TargetCapacity
  • TerminatingCapacity

Some of the metrics will give you some insights into the operation of the Spot fleet bidding process. For example:

  • AvailableInstancePoolsCount – Indicates the number of instance pools included in the Spot fleet request.
  • BidsSubmittedForCapacity – Indicates the number of bids that have been made for Spot fleet capacity.
  • EligibleInstancePoolsCount – Indicates the number of instance pools that are eligible for Spot instance requests. A pool is ineligible when either (1) The Spot price is higher than the On-Demand price or (2) the bid price is lower than the Spot price.
  • FulfilledCapacity – Indicates the amount of capacity that has been fulfilled for the fleet.
  • PercentCapacityAllocation – Indicates the percent of capacity allocated for the given dimension. You can use this in conjunction with the instance type dimension to determine the percent of capacity allocated to a given instance type.
  • PendingCapacity – The difference between TargetCapacity and FulfilledCapacity.
  • TargetCapacity – The currently requested target capacity for the Spot fleet.
  • TerminatingCapacity – The fleet capacity for instances that have received Spot instance termination notices.

These metrics will allow you to determine the overall status and performance of each of your Spot fleets. As you can see from the names of the metrics, you can easily observe the disk, CPU, and network resources consumed by the fleet. You can also get a sense for the work that is happening behind the scenes as bids are placed on your behalf for Spot capacity.

You can further inspect the following metrics across the Availability Zone and/or instance type dimensions:

  • CPUUtilization
  • DiskReadBytes
  • DiskReadOps
  • DiskWriteBytes
  • FulfilledCapacity
  • NetworkIn
  • NetworkOut
  • StatusCheckFailed
  • StatusCheckFailed_Instance
  • StatusCheckFailed_System

These metrics will allow you to see if you have an acceptable distribution of load across Availability Zones and/or instance types.

You can aggregate these metrics using Max, Min, or Avg in order to observe the overall utilization of your fleet. However, be aware that using Avg does not always make sense when used across a fleet comprised of two or more types of instances!

Available Now
The new metrics are available now.


New CloudWatch Events – Track and Respond to Changes to Your AWS Resources

When you pull the curtain back on an AWS-powered application, you’ll find that a lot is happening behind the scenes. EC2 instances are launched and terminated by Auto Scaling policies in response to changes in system load, Amazon DynamoDB tables, Amazon SNS topics and Amazon SQS queues are created and deleted, and attributes of existing resources are changed from the AWS Management Console, the AWS APIs, or the AWS Command Line Interface (CLI).

Many of our customers build their own high-level tools to track, monitor, and control the overall state of their AWS environments. Up until now, these tools have worked in a polling fashion. In other words, they periodically call AWS functions such as DescribeInstances, DescribeVolumes, and ListQueues to list the AWS resources of various types (EC2 instances, EBS volumes, and SQS queues here) and to track their state. Once they have these lists, they need to call other APIs to get additional state information for each resources, compare it against historical data to detect changes, and then take action as they see fit. As their systems grow larger and more complex, all of this polling and state tracking can become onerous.

New CloudWatch Events
In order to allow you to track changes to your AWS resources with less overhead and greater efficiency, we are introducing CloudWatch Events today.

CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules that you can set up in a couple of minutes, you can easily route each type of event to one or more targets: AWS Lambda functions, Amazon Kinesis streams, Amazon SNS topics, and built-in targets.

You can think of CloudWatch Events as the central nervous system for your AWS environment. It is wired in to every nook and cranny of the supported services, and becomes aware of operational changes as they happen. Then, driven by your rules,  it activates functions and sends messages (activating muscles, if you will) to respond to the environment, making changes, capturing state information, or taking corrective action.

We are launching CloudWatch Events with an initial set of AWS services and events today, and plan to support many more over the next year or so.

Diving in to CloudWatch Events
The three main components that you need to know about are events, rules, and targets.

Events (represented as small blobs of JSON) are generated in four ways. First, they arise from within AWS when resources change state. For example, an event is generated when the state of an EC2 instance changes from pending to running or when Auto Scaling launches an instance. Second, events are generated by API calls and console sign-ins that are delivered to Amazon CloudWatch Events via CloudTrail. Third, your own code can generate application-level events and publish them to Amazon CloudWatch Events for processing. Fourth, they can be issued on a scheduled basis, with options for periodic or Cron-style scheduling.

Rules match incoming events and route them to one or more targets for processing. Rules are not processed in any particular order; all of the rules that match an event will be processed (this allows disparate parts of a single organization to independently look for and process events that are of interest).

Targets process events and are specified within rules. There are four initial target types: built-in, Lambda functions, Kinesis streams, and SNS topics, with more types on the drawing board. A single rule can specify multiple targets. Each event is passed to each target in JSON form. Each rule has the opportunity to customize the JSON that flows to the target. They can elect to pass the event as-is, pass only certain keys (and the associated values) to the target, or to pass a constant (literal) string.

CloudWatch Events in Action
Let’s go ahead and set up a rule or two! I’ll use a simple Lambda function called SomethingHappened. It will simply log the contents of the event:

Next, I switch to the new CloudWatch Events Console, click on Create rule and choose an event source (here’s the menu with all of the choices):

Just a quick note before going forward. Some of the AWS services fire events directly. Others are fired based on the events logged to CloudTrail; you’ll need to enable CloudTrail for the desired service(s) in order to receive them.

I want to keep tabs on my EC2 instances, so I choose EC2 from the menu. I can choose to create a rule that fires on any state transition, or on a transition to one or more states that are of interest:

I want to know about newly launched instances, so I’ll choose Running. I can make the rule respond to any of my instances in the region, or to specific instances. I’ll go with the first option; here’s my pattern:

Now I need to make something happen. I do this by picking a target. Again, here are my choices:

I simply choose Lambda and pick my function:

I’m almost there! I just need to name and describe my rule, and then click on Create rule:

I click on Create Rule and the rule is all set to go:

Now I can test it by launching an EC2 instance. In fact, I’ll launch 5 of them just to exercise my code! After waiting a minute or so for the instances to launch and to initialize, I can check my Lambda metrics to verify that my function was invoked:

This looks good (the earlier invocations were for testing). Then I can visit the CloudWatch logs to view the output from my function:

As you can see, the event contains essential information about the newly launched instance. Your code can call AWS functions in order to learn more about what’s going on. For example, you could call DescribeInstances to access more information about newly launched instances.

Clearly, a “real” function would do something a lot more interesting. It could add some mandatory tags to the instance, update a dynamic visualization, or send me a text message via SNS. If you want to do any (or all of these things), you would need to have a more permissive IAM role for the function, of course. I could make the rule more general (or create another one) if  I wanted to capture some of the other state transitions.

Scheduled Execution of Rules
I can also set up a rule that fires periodically or according to a pattern described in a Cron expression. Here’s how I would do that:

You might find it interesting to know that this is the underlying mechanism used to set up scheduled Lambda jobs, as announced at AWS re:Invent.

API Access
Like most AWS services, you can access CloudWatch Events through an API. Here are some of the principal functions:

  • PutRule to create a new rule.
  • PutTargets and RemoveTargets to connect targets to rules, and to disconnect them.
  • ListRules, ListTargetsByRule, and DescribeRule to find out more about existing rules.
  • PutEvents to submit a set of events to CloudWatch events. You can use this function (or the CLI equivalent) to submit application-level events.

Metrics for Events
CloudWatch Events reports a number of metrics to CloudWatch, all within the AWS/Events namespace. You can use these metrics to verify that your rules are firing as expected, and to track the overall activity level of your rule collection.

The following metrics are reported for the service as a whole:

  • Invocations – The number of times that target have been invoked.
  • FailedInvocations – The number of times that an invocation of a target failed.
  • MatchedEvents – The number of events that matched one or more rules.
  • TriggeredRules – The number of rules that have been triggered.

The following metrics are reported for each rule:

  • Invocations – The number of times that the rule’s targets have been invoked.
  • TriggeredRules – The number of times that the rule has been triggered.

In the Works
Like many emerging AWS services, we are launching CloudWatch Events with an initial set of features (and a lot of infrastructure behind the scenes) and some really big plans, including AWS CloudFormation support. We’ll adjust our plans based on your feedback, but you can expect coverage of many more AWS services and access to additional targets over time. I’ll do my best to keep you informed.

Getting Started
We are launching CloudWatch Events in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo) regions. It is available now and you can start using it today!


They’re Here – Longer EC2 Resource IDs Now Available

Last November I gave you a heads-up that we planned to increase the length of the resource IDs for EC2 instances, reservations, volumes, and snapshots in early 2016. We are now entering a transition period that will last until early December (2016). During this period, you can opt in to the new format (a resource identifier followed by a 17-character string) on a region-by-region, user-by-user, type-by-type basis.  Once you opt in for a given region and IAM user (or for the root account), newly created resources will receive the longer IDs.

Effective today, you can opt in to the new format for EC2 instances and EC2 reservations. Each reservation represents the results of a single instance launch request, and can be associated with multiple instances.

If you build libraries, tools, or applications that make direct calls to the AWS API, now is the time to opt in and to start your testing process! If you store the IDs in memory or in a database, take a close look at fixed-length fields, data structures, schema elements, string operations, and regular expressions. Resources that were created before you opt in will retain their existing short identifiers; be sure that your revised code can still handle them!

You can opt in to the new format using the AWS Management Console, the AWS Command Line Interface (CLI), the AWS Tools for Windows PowerShell, or though an API function.You can also opt out if your testing uncovers issues that you need to address.

Opting In – Console
To opt in via the Console, simply log in, choose EC2, and click on Resource ID length management:

Then click on Use Longer IDs for the desired resource types:

Opting In – AWS CLI
To opt in using the AWS CLI, use the modify-id-format command and specify the desired resource type:

$ aws ec2 modify-id-format --resource instance --use-long-ids

You can opt out like this:

$ aws ec2 modify-id-format --resource instance --no-use-long-ids

Opting In – AWS Tools for Windows PowerShell
To opt in using the AWS Tools for Windows PowerShell, use the Edit-EC2IdFormat cmdlet and specify the desired resource type:

PS C:\> Edit-EC2IdFormat -Resource instance -UseLongId True

You can opt out like this:

PS C:\> Edit-EC2IdFormat -Resource instance -UseLongId False

Opting In – AWS SDKs
To opt in from your own tools or application code, call the ModifyIdFormat function, specify the resource type (instance or reservation), and set the UseLongIds parameter to True.

Things to Know
You can opt in at the IAM user or account level. After you have opted in, newly created resources of the specified types will receive the new, longer identifiers as soon as the setting takes effect (typically a few minutes). The identifiers will be visible in API results, the command line, and the console.

If a particular IAM user opts in and then creates some resources, the longer identifiers will be visible to other users, even if they have not opted in. You may want to create a separate AWS account for testing purposes in order to avoid confusion. You can also choose to do your testing in an AWS region that you don’t use on a regular basis (see the AWS Global Infrastructure page for a list of current and planned regions).

This blog post applies to all AWS regions and accounts. AWS accounts created on or after March 7, 2016 will default to longer instance and reservation IDs, and can be opted out if necessary.

After you have performed your testing and are satisfied that your tools, code, and applications work to your satisfaction, we recommend that you opt in for all of your AWS accounts before December 2016. This will give you confidence that no surprises will arise when the new identifiers become mandatory.

To learn more about this important update to AWS, please read the Longer EC2 and EBS Resource IDs FAQ.



PS – We’ll be adding support for EBS volumes and snapshots in April; stay tuned for more information.