Category: Amazon S3

AWS Online Tech Talks – September 2017

As a school supply aficionado, the month of September has always held a special place in my heart. Nothing sets the tone for success like getting a killer deal on pens and a crisp college ruled notebook. Even if back to school shopping trips have secured a seat in your distant memory, this is still a perfect time of year to stock up on office supplies and set aside some time for flexing those learning muscles. A great way to get started: scan through our September Tech Talks and check out the ones that pique your interest. This month we are covering re:Invent, AI, and much more.

September 2017 – Schedule

Noted below are the upcoming scheduled live, online technical sessions being held during the month of September. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.

Webinars featured this month are:

Monday, September 11


9:00 – 9:40 AM PDT: What’s New with Amazon DynamoDB


10:30 – 11:10 AM PDT: Local Testing and Deployment Best Practices for Serverless Applications


12:00 – 12:40 PM PDT: Managing Secrets for Containers with Amazon ECS


Tuesday, September 12


9:00 – 9:40 AM PDT: Get Ready for re:Invent 2017 Content Overview


10:30 – 11:10 AM PDT: Deep Dive on User Sign-up and Sign-in with Amazon Cognito

Management Tools

12:00 – 12:40 PM PDT: Using CloudTrail to Enhance Compliance and Governance of S3


Wednesday, September 13

Big Data

9:00 – 9:40 AM PDT: Best Practices for Processing Managed Hadoop Workloads


10:30 – 11:10 AM PDT: Migrating Your Oracle Database to PostgreSQL


12:00 – 12:40 PM PDT: Configuration Management in the Cloud


Thursday, September 14

Big Data

9:00 – 9:40 AM PDT: Tackle Your Dark Data Challenge with AWS Glue


10:30 – 11:10 AM PDT: Deep Dive on MySQL Databases on AWS


12:00 – 12:40 PM PDT: Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Workflows


Tuesday, September 26


9:00 – 9:40 AM PDT: An Overview of AI on the AWS Platform

10:30 – 11:10 AM PDT: Introduction to Generative Adversarial Networks (GAN) with Apache MXNet


12:00 – 12:40 PM PDT: Revolutionizing Backup & Recovery Using Amazon S3


2:00 – 2:40 PM PDT: Securing Your Desktops with Amazon WorkSpaces


Wednesday, September 27

Security & Identity

9:00 – 9:40 AM PDT: Advanced DNS Traffic Management using Amazon Route 53


10:30 – 11:10 AM PDT: Deep Dive on Amazon EFS (with Encryption)

Hands on Lab

12:30 – 2:00 PM PDT: Hands on Lab: Windows Workloads


Thursday, September 28

Security & Identity

9:00 – 9:40 AM PDT: How to use AWS WAF to Mitigate OWASP Top 10 attacks


10:30 – 11:10 AM PDT: AWS Greengrass Technical Deep Dive with Demo

Hands on Lab

1:00 – 1:40 PM PDT: Design, Deploy, and Optimize SQL Server on AWS


The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations & customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.

– Sara

New – S3 Sync capability for EC2 Systems Manager: Query & Visualize Instance Software Inventory

It is now essential, with the fast paced lives we all seem to lead, to find tools to make it easier to manage our time, our home, and our work. With the pace of technology, the need for technologists to find management tools to easily manage their systems is just as important. With the introduction of Amazon EC2 Systems Manager service during re:Invent 2016, we hoped to provide assistance with the management of your systems and software.

If you are not yet familiar with Amazon EC2 Systems Manager, let me introduce it. EC2 Systems Manager is a management service that helps you create system images, collect software inventory, configure both Windows and Linux operating systems, and apply operating system patches. This collection of capabilities allows remote and secure administration of managed EC2 instances or hybrid environments with on-premises machines configured for Systems Manager. With this EC2 service capability, you can additionally record and regulate the software configuration of these instances using AWS Config.

Recently we have added another feature to the inventory capability of EC2 Systems Manager to aid you in the capture of metadata about your application deployments, OS and system configurations, Resource Data Sync aka S3 Sync. S3 Sync for EC2 Systems Manager allows you to aggregate captured inventory data automatically from instances in different regions and multiple accounts and store this information in Amazon S3. With the data in S3, you can run queries against the instance inventory using Amazon Athena, and if you choose, use Amazon QuickSight to visualize the software inventory of your instances.

Let’s look at how we can utilize this Resource Data Sync aka S3 Sync feature with Amazon Athena and Amazon QuickSight to query and visualize the software inventory of instances. First things first, I will make sure that I have completed the Amazon EC2 Systems Manager prerequisites: configuring the roles and permissions in AWS Identity and Access Management (IAM), and installing the SSM Agent on my managed instances. I’ll quickly launch a new EC2 instance for this Systems Manager example.

Now that my instance has launched, I will need to install the SSM Agent onto my aws-blog-demo-instance. One thing I should mention is that it is essential that your IAM user account has administrator access in the VPC in which your instance was launched. You can create a separate IAM user account for instances with EC2 Systems Manager by following the instructions in the EC2 Systems Manager documentation. Since I am using an account with administrative access, I won’t need to create an IAM user to continue installing the SSM Agent on my instance.

To install the SSM Agent, I will SSH into my instance, create a temporary directory, and pull down and install the necessary SSM Agent software for my Amazon Linux EC2 instance. An EC2 instance based upon a Windows AMI already includes the SSM Agent so I would not need to install the agent for Windows instances.

To complete the aforementioned tasks, I will issue the following commands:

mkdir /tmp/ssm

cd /tmp/ssm

sudo yum install -y

You can find the instructions to install the SSM Agent based upon the type of operating system of your EC2 instance in the Installing SSM Agent section of the EC2 Systems Manager user guide.

Now that I have the Systems Manager agent running on my instance, I’ll need an S3 bucket to capture the inventory data. I’ll create an S3 bucket, aws-blog-tew-posts-ec2, to capture the inventory data from my instance. I will also need to add a bucket policy to ensure that EC2 Systems Manager has permission to write to my bucket. Adding the bucket policy is simple: I select the Permissions tab in the S3 Console and then click the Bucket Policy button. Then I specify a bucket policy that gives Systems Manager the ability to check bucket permissions and add objects to the bucket. With the policy in place, my S3 bucket is now ready to receive the instance inventory data.
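For illustration, a policy with the two abilities described above (checking bucket permissions and adding objects) might be assembled as in the following Python sketch. It is not the exact policy from this walkthrough: the statement IDs, the `inventory` prefix, and the account ID `123456789012` are placeholders, and the shape follows the pattern documented for Resource Data Sync as I understand it.

```python
import json


def build_sync_bucket_policy(bucket, prefix, account_id):
    """Return a bucket policy letting Systems Manager write inventory data.

    All argument values are placeholders supplied by the caller.
    """
    service = {"Service": "ssm.amazonaws.com"}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the service check the bucket's permissions.
                "Sid": "SSMBucketPermissionsCheck",
                "Effect": "Allow",
                "Principal": service,
                "Action": "s3:GetBucketAcl",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Lets the service add inventory objects under the prefix.
                "Sid": "SSMBucketDelivery",
                "Effect": "Allow",
                "Principal": service,
                "Action": "s3:PutObject",
                "Resource": (
                    f"arn:aws:s3:::{bucket}/{prefix}/*/accountid={account_id}/*"
                ),
                "Condition": {
                    "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
                },
            },
        ],
    }


policy = build_sync_bucket_policy(
    "aws-blog-tew-posts-ec2", "inventory", "123456789012"
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the Bucket Policy editor after replacing the placeholder prefix and account ID with real values.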

To configure the inventory collection using this bucket, I will head back over to the EC2 console and select Managed Resources under Systems Manager Shared Resources section, then click the Setup Inventory button.

In the Targets section, I’ll manually select the EC2 instance I created earlier from which I want to capture the inventory data. You should note that you can select multiple instances for which to capture inventory data if desired.

Scrolling down to the Schedule section, I will choose 30 minutes for the time interval of how often I wish for inventory metadata to be gathered from my instance. Since I’m keeping the default Enabled value for all of the options in the Parameters section, and I am not going to write the association logs to S3 at this time, I only need to click the Setup Inventory button. When the confirmation dialog comes up noting that the Inventory has been set up successfully, I will click the Close button to go back to the main EC2 console.
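The same inventory setup can presumably be scripted. The sketch below only builds the request parameters; it assumes that the console's Setup Inventory flow corresponds to an association with the `AWS-GatherSoftwareInventory` document, and the instance ID is a placeholder.

```python
def build_inventory_association(instance_ids, minutes=30):
    """Build keyword arguments for an inventory-gathering association."""
    if not instance_ids:
        raise ValueError("at least one instance ID is required")
    return {
        # Assumption: the document the console uses for inventory collection.
        "Name": "AWS-GatherSoftwareInventory",
        "Targets": [{"Key": "InstanceIds", "Values": list(instance_ids)}],
        # Matches the 30 minute interval chosen in the Schedule section.
        "ScheduleExpression": f"rate({minutes} minutes)",
    }


params = build_inventory_association(["i-0123456789abcdef0"])  # placeholder ID
print(params)
# With boto3 installed and credentials configured, these could be passed on:
#   boto3.client("ssm").create_association(**params)
```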

Back in the EC2 console, I will set up my Resource Data Sync using my aws-blog-tew-posts-ec2 S3 bucket for my Managed Instance by selecting the Resource Data Syncs button.

To set up my Resource Data Sync, I will enter the Sync Name, Bucket Name, Bucket Prefix, and the Bucket Region in which my bucket is located. You should also be aware that the Resource Data Sync and the sync S3 target bucket can be located in different regions. Another thing to note is that the CLI command for completing this step is displayed, in case I opt to utilize the AWS CLI for creating the Resource Data Sync. I click the Create button and my Resource Data Sync setup is complete.
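As an alternative to the console or the CLI, the sync could also be created from code. The following is a minimal sketch, assuming the boto3 SSM client's `create_resource_data_sync` call; the sync name, prefix, and region are hypothetical, and a small stand-in client is used so the example runs without AWS credentials.

```python
def create_inventory_sync(ssm_client, sync_name, bucket, prefix, bucket_region):
    """Create a Resource Data Sync that writes inventory data to S3."""
    destination = {
        "BucketName": bucket,
        "Prefix": prefix,
        "SyncFormat": "JsonSerDe",
        "Region": bucket_region,  # region of the bucket, not of the instances
    }
    ssm_client.create_resource_data_sync(
        SyncName=sync_name, S3Destination=destination
    )
    return destination


class RecordingClient:
    """Stand-in for boto3.client("ssm"); records the call instead of sending it."""

    def __init__(self):
        self.calls = []

    def create_resource_data_sync(self, **kwargs):
        self.calls.append(kwargs)
        return {}


client = RecordingClient()
create_inventory_sync(
    client, "tew-inventory-sync", "aws-blog-tew-posts-ec2", "inventory", "us-east-1"
)
print(client.calls)
```

Swapping `RecordingClient()` for a real `boto3.client("ssm")` would send the request.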

After a few minutes, I can go to my S3 bucket and see that my instance inventory data is syncing to my S3 bucket successfully.

With this data syncing directly into S3, I can take advantage of the querying capabilities of the Amazon Athena service to view and query my instance inventory data. I create a folder, athenaresults, within my aws-blog-tew-posts-ec2 S3 bucket, and now off to the Athena console I go!

In the Athena console, I will change the Settings option to point to my athenaresults folder in my bucket by entering: s3://aws-blog-tew-posts-ec2/athenaresults. Now I can create a database named tewec2ssminventorydata for capturing and querying the data sent from SSM to my bucket, by entering in a CREATE DATABASE SQL statement in the Athena editor and clicking the Run Query button.

With my database created, I’ll switch to my tewec2ssminventorydata database and create a table to grab the inventory application data from the S3 bucket synced from the Systems Manager Resource Data Sync.

As the query success message notes, I’ll run the MSCK REPAIR TABLE tew_awsapplication command to partition the newly created table. Now I can run queries against the inventory data being synced from the EC2 Systems Manager to my Amazon S3 buckets. You can learn more about querying data with Amazon Athena on the product page and you can review my blog post on querying and encrypting data with Amazon Athena.
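The Athena statements used in these steps might look roughly like the ones generated below. This is a sketch under stated assumptions: the column list, the `inventory` prefix, and the partition layout (`accountid`, `region`, `resourcetype`) are my best understanding of how Resource Data Sync organizes application inventory, not a copy of the statements run here. The database and table names come from the walkthrough.

```python
DATABASE = "tewec2ssminventorydata"
TABLE = "tew_awsapplication"


def athena_statements(bucket, prefix):
    """Return the statements to run, in order, in the Athena query editor."""
    location = f"s3://{bucket}/{prefix}/AWS:Application/"
    create_db = f"CREATE DATABASE {DATABASE}"
    # Assumed columns; run this after switching to the new database.
    create_table = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {TABLE} (\n"
        "  name string,\n"
        "  applicationtype string,\n"
        "  publisher string,\n"
        "  version string,\n"
        "  resourceid string\n"
        ")\n"
        "PARTITIONED BY (accountid string, region string, resourcetype string)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )
    # Loads the partitions that already exist under the table location.
    repair = f"MSCK REPAIR TABLE {TABLE}"
    # Example query: number of installed applications per application type.
    count_by_type = (
        f"SELECT applicationtype, COUNT(*) AS installs FROM {TABLE} "
        "GROUP BY applicationtype ORDER BY installs DESC"
    )
    return [create_db, create_table, repair, count_by_type]


statements = athena_statements("aws-blog-tew-posts-ec2", "inventory")
for statement in statements:
    print(statement, end="\n\n")
```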


Now that I have query capability of this data it also means I can use Amazon QuickSight to visualize my data.

If you haven’t created an Amazon QuickSight account, you can quickly follow the getting started instructions to set up your QuickSight account. Since I already have a QuickSight account, I’ll go to the QuickSight dashboard and select the Manage Data button. On my Your Data Sets screen, I’ll select the New data set button.
Now I can create a dataset from my Athena table holding the Systems Manager Inventory Data by selecting Athena as my data source.

This takes me through a series of steps to create my data source from the Athena tewec2ssminventorydata database and the tew_awsapplication table.

After choosing Visualize to create my data set and analyze the data in the Athena table, I am now taken to the QuickSight dashboard where I can build graphs and visualizations for my EC2 Systems Manager inventory data.

Adding the applicationtype field to my graph allows me to build a visualization using this data.



With the new Amazon EC2 Systems Manager Resource Data Sync capability to send inventory data to Amazon S3 buckets, you can now create robust data queries using Amazon Athena and build visualizations of this data with Amazon QuickSight. You no longer have to create custom scripts to aggregate your instance inventory data to an Amazon S3 bucket. This data can now be automatically synced and stored in Amazon S3, allowing you to keep your data even after your instance has been terminated. This new EC2 Systems Manager capability also allows you to send inventory data to S3 from multiple accounts and different regions.

To learn more about Amazon EC2 Systems Manager and EC2 Systems Manager Inventory, take a look at the product pages for the service. You can also build your own query and visualization solution for the EC2 instance inventory data captured in S3 by checking out the EC2 Systems Manager user guide on Using Resource Data Sync to Aggregate Inventory Data.

In the words of my favorite Vulcan, “Live long, query and visualize and prosper” with EC2 Systems Manager.




Hightail — Empowering Creative Collaboration in the Cloud

Hightail – formerly YouSendIt – streamlines how creative work is reviewed, improved, and approved by helping more than 50 million professionals around the world get great content in front of their audiences faster. Since its debut in 2004 as a file sharing company, Hightail has shifted its strategic direction to focus on delivering value-added creative collaboration services and boasts a strong lineup of name-brand customers.

In today’s guest post, Hightail’s SVP of Technology Shiva Paranandi tells the company’s migration story, moving petabytes of data from on-premises to the cloud. He highlights their cloud vendor evaluation process and reasons for going all-in on AWS.

Hightail started as a way to help people easily share and store large files, but has since evolved into a creative collaboration tool. We became a place where users could not only control and share their digital assets, but also assemble their creative teams, connect with clients, develop creative workflows, and manage projects from start to finish. We now power collaboration services for major brands such as Lionsgate and Jimmy Kimmel Live!. With a growing list of domestic and international clients, we required more internal focus on product development and serving the users. We found that running our own data centers consumed more time, money, and manpower than we were willing to devote.

We needed an approach that would help us iterate more rapidly to meet customer needs and dramatically improve our time to market. We wanted to reduce data center costs and have the flexibility to scale up quickly in any given region around the globe. Setting up a data center in a new location took so long that it was limiting the pace of growth that we could achieve. In addition, we were tired of buying ahead of our needs, which meant we had storage capacity that we did not even use. We required a storage solution that was both tiered and highly scalable to reduce costs by allowing us to keep infrequently used data in inactive storage while also allowing us to resurface it quickly at the customer’s request. Our main drivers were agility and innovation, and the cloud enables these in a significant way. Given that, we decided to adopt a cloud-first policy that would enable us to spend time and money on initiatives that differentiate our business, instead of putting resources into managing our storage and computing infrastructure.

Comparing AWS Against Cloud Competitors

To kick off the migration, we did our due diligence by evaluating a variety of cloud vendors, including AWS, Google, IBM, and Microsoft. AWS stuck out as the clear winner for us. At one point, we considered combining services from multiple cloud providers to meet our needs, but decided the best route was to use AWS exclusively. When we factored in training, synchronization, support, and system availability along with other migration and management elements, it was just not practical to take a multi-cloud approach. With the best cost savings and an unmatched ecosystem of partner solutions, we did not need anyone else and chose to go all-in on AWS.

By migrating to AWS, we were able to secure the lowest cost-per-gigabyte pricing, gain access to a rich ecosystem, quickly develop in-house talent, and maintain SOC II compliance. The ecosystem was particularly important to us and set AWS apart from its competitors with its expansive list of partners. In fact, all the vendors we depend on for services such as previewing images, encoding videos, and serving up presentations were already a part of the network, so we were easily able to leverage our existing investments and expertise. If we had gone with a different provider, it would have meant moving away from a platform that was already working so well for us, which was not the desired outcome. Also, the amount of talent we were able to build up in house on AWS technologies was astounding. Training our internal team to work with AWS was a simple process using available resources such as AWS conferences, training materials, and support.

Migrating Petabytes of Data

Going with AWS made things easier. In many instances, it gave us better functionality than what we were using in house. We moved multiple petabytes of data from on-premises storage to AWS with ease. AWS gave us great speeds with Direct Connect, so we were able to push all the data in a little more than three months with no user impact. We employed AWS Key Management Service to keep our data secure, which eased our minds through the move. We performed extensive QA testing before flipping users over to ensure low customer impact, using methods such as checksums between our data center and the data that got pushed to AWS.
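The post does not describe the tooling behind the checksum comparison. As a general illustration of the technique only, the sketch below hashes two byte streams in chunks and compares the digests; the choice of SHA-256, the chunk size, and the sample data are assumptions made for the example.

```python
import hashlib
import io


def sha256_of(stream, chunk_size=1024 * 1024):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_stream, migrated_stream):
    """Return True when both streams contain identical bytes."""
    return sha256_of(source_stream) == sha256_of(migrated_stream)


# In-memory stand-ins for a file in the data center and its migrated copy.
original = io.BytesIO(b"creative asset bytes")
migrated = io.BytesIO(b"creative asset bytes")
print(copies_match(original, migrated))
```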

Our new platform on AWS has greatly improved our user experience. We have seen huge improvement in reliability, performance, and uptime—all critical in our line of business. We are now able to achieve upload and download speeds up to 17 times faster than our previous data centers, and uptime has increased by orders of magnitude. Also, the time it takes us to deploy services to a new region has been cut by more than 90%. It used to take us at least six months to get a new region online, and now we can get a region up and running in less than three weeks. On AWS, we can even replicate data at the bucket level across regions for disaster recovery purposes.

To cut costs, we were successfully able to divide our storage infrastructure into frequently and infrequently accessed data. Tiered storage in Amazon S3 has been a huge advantage, allowing us to optimize our storage costs so we have more to invest in product development. We can now move data from inactive to active tiers instantly to meet customer needs and eliminated the need to overprovision our storage infrastructure. It is refreshing to see services automatically scale up or down during peak load times, and know that we are only paying for what we need.

Overall, we achieved our key strategic goal of focusing more on development and less on infrastructure. Our migration felt seamless, and the progress we were able to share is a true testament to how easy it has been for us to run our workloads on AWS. We attribute part of our successful migration to the dedicated support provided by the AWS team. They were pretty awesome. We had a couple of their technicians available 24/7 via chat, which proved to be essential during this large-scale migration.

-Shiva Paranandi, SVP of Technology at Hightail

Learning More

Learn more about cost-effective tiered data storage with Amazon S3, or dive deeper into our AWS Partner Ecosystem to see which solutions could best serve the needs of your company.

AWS Knowledge Center Video: Preparing to Send a Snowball Back to AWS

Do you know about the AWS Support Knowledge Center? It contains answers to some of the most frequently asked questions and other requests asked of our support team. Many of the answers even include a short video that serves to illustrate the process or to provide additional info on the topic.

For example, I recently stepped in to our studio and created a new video called Preparing to Send a Snowball Back to AWS. In 90 action-packed seconds, this video shows you how to power down the Snowball, stow the cables, lock the back panel, and verify that the proper return address is on the built-in display:

Visit the Knowledge Center to see other videos and to find answers to other questions that you might have about AWS.



File Interface to AWS Storage Gateway

I should probably have a blog category for “catching up from AWS re:Invent!” Last November we made a really important addition to the AWS Storage Gateway that I was too busy to research and write about at the time.

As a reminder, the Storage Gateway is a multi-protocol storage appliance that fits in between your existing applications and the AWS Cloud. Depending on the configuration, your applications and your client operating systems see the gateway as a file server, a local disk volume, or a virtual tape library (VTL). Behind the scenes, the gateway uses Amazon Simple Storage Service (S3) for cost-effective, durable, and secure storage. Storage Gateway caches data locally and uses bandwidth management to optimize data transfers.

Storage Gateway is delivered as a self-contained virtual appliance that is easy to install, configure, and run (read the Storage Gateway User Guide to learn more). It allows you to take advantage of the scale, durability, and cost benefits of cloud storage from your existing environment. It reduces the process of moving existing files and directories into S3 to a simple drag and drop (or a CLI-based copy).

As is the case with many AWS services, the Storage Gateway has gained many features since we first launched it in 2012 (The AWS Storage Gateway – Integrate Your Existing On-Premises Applications with AWS Cloud Storage). At launch, the Storage Gateway allowed you to create storage volumes and to attach them as iSCSI devices, with options to store either the entire volume or a cache of the most frequently accessed data in the gateway, all backed by S3. Later, we added support for Virtual Tape Libraries (Create a Virtual Tape Library Using the AWS Storage Gateway). Earlier this year we added read-only file shares, user permission squashing, and scanning for added and removed objects.

New File Interface
At AWS re:Invent we launched a third option, and that’s what I’d like to tell you about today. You can now use the Storage Gateway as a virtual file server that you can mount on your on-premises servers and desktops. After you set it up in your data center or in the cloud, your configured buckets will be available as NFS mount points. Your application simply reads and writes files and directories over NFS; behind the scenes, the gateway turns these operations into object-level requests on your S3 buckets, where they are accessible natively (one S3 object per file). To create a file gateway, you simply visit the Storage Gateway Console, click on Get started, and choose File gateway:

Then choose your host platform: VMware ESXi or Amazon EC2:

I expect many of our customers to host the Storage Gateway on premises and to use it as a permanent or temporary bridge to the cloud. Use cases for this option include simplified backups, migration, archiving, analytics, storage tiering, and compute-intensive cloud-based processing. Once the data is in the cloud, you can take advantage of many features of S3 including multiple storage tiers (Infrequent Access and Glacier are great for archiving), storage analytics, tagging, and the like.

I don’t have much data on-premises so I’m going to run the Storage Gateway on an EC2 instance for this post. I launched the instance and set it up per the instructions on the screen, taking care to create the proper inbound security group rules (port 80 for HTTP access and port 2049 for NFS). I added 150 GiB of General Purpose SSD storage to be used as a cache:
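The inbound rules mentioned above (port 80 for HTTP and port 2049 for NFS) could be expressed in the `IpPermissions` shape that the EC2 API accepts, as in this sketch. The CIDR range is a documentation placeholder; in practice it should be as narrow as possible.

```python
def gateway_ingress_rules(client_cidr):
    """Build inbound rules for gateway activation (HTTP) and NFS clients."""

    def tcp(port, description):
        return {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": client_cidr, "Description": description}],
        }

    return [tcp(80, "gateway activation over HTTP"), tcp(2049, "NFS")]


rules = gateway_ingress_rules("203.0.113.0/24")  # placeholder address range
print(rules)
# With boto3, the rules could be applied to an existing security group:
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId=your_group_id, IpPermissions=rules)
```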

After the instance launched I captured its public IP address and used it to connect to my newly launched gateway:

I set the time zone and assigned a name to my gateway and clicked on Activate gateway:

Then I configured the local storage as a cache, and clicked on Save and continue:

My gateway was up and running, and I could see it in the console:

Next, I clicked on Create file share to create an NFS share and associate it with an S3 bucket:

As you can see, I had the opportunity to choose my storage class (Standard or Standard – Infrequent Access in accord with my needs and my use case). The gateway needs to be able to upload files into my bucket; clicking on Create a new IAM role will create a role and a policy (read Granting Access to an Amazon S3 Destination to learn more).

I review my settings and click on Create file share:

By the way, Root squash is a feature of the AWS Storage Gateway, not a vegetable. When enabled (as it is by default) files that arrive as owned by root (user id 0) are mapped to user id 65534 (traditionally known as nobody). I can also set up default permissions for new files and new directories.
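The mapping that root squash performs is simple enough to state as code. This is only a model of the behavior described above, not the gateway's implementation:

```python
NOBODY_UID = 65534  # traditionally the "nobody" user


def squash_owner(uid, root_squash=True):
    """Return the owner ID recorded for a file arriving with the given uid."""
    if root_squash and uid == 0:
        return NOBODY_UID
    return uid


for uid in (0, 1000):
    print(uid, squash_owner(uid))
```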

My new share is visible in the console, and available for use within seconds:

The console displays the appropriate mount commands for Linux, Microsoft Windows, and macOS. Those commands use the private IP address of the instance; in many cases you will want to use the public address instead (needless to say, you should exercise extreme care when you create a public NFS share, and maintain close control over the IP addresses that are allowed to connect).

I flipped over to the S3 console and inspected the bucket (jbarr-gw-1), finding it empty, as expected:

Then I turned to my EC2 instance, mounted the share, and copied some files to it:

I returned to the console and found a new folder (jeff_code) in my bucket, as expected. I ventured inside and found the files that I had copied to the share:

As you can see, my files are copied directly into S3 and are simply regular S3 objects. This means that I can use my existing S3 tools, code, and analytics to process them. For example:

  • Analytics – The new S3 metrics and analytics can be used to analyze the entire bucket or any directory tree within it:
  • Code – AWS Lambda and Amazon Rekognition can be used to process uploaded images; see Serverless Photo Recognition for some ideas and some code. I could also use Amazon Elasticsearch Service to index some or all of the files or Amazon EMR to process massive amounts of data.
  • Tools – I can process the existing objects in the bucket and I can also create new ones using the S3 APIs. Any code or script that creates or removes objects should call the RefreshCache function to synchronize the contents of any gateways attached to the bucket (I can create a multi-site data distribution workflow by pointing multiple read-only gateways at the same bucket). I can also make use of existing, file-centric backup tools by using the share as the destination for my backups.
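To make the RefreshCache advice concrete, here is a minimal sketch of a script that removes an object through the S3 API and then refreshes each file share attached to the bucket. It assumes the boto3 method names `delete_object` and `refresh_cache`; the object key and the file share ARN are hypothetical, and a recording stand-in replaces the real clients so the example runs anywhere.

```python
def delete_and_refresh(s3_client, gateway_client, bucket, key, file_share_arns):
    """Delete an object directly in S3, then tell each gateway to re-scan."""
    s3_client.delete_object(Bucket=bucket, Key=key)
    for arn in file_share_arns:
        gateway_client.refresh_cache(FileShareARN=arn)


class Recorder:
    """Stand-in for a boto3 client; records method names and arguments."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def method(**kwargs):
            self.calls.append((name, kwargs))
            return {}

        return method


s3 = Recorder()
gateway = Recorder()
share_arn = "arn:aws:storagegateway:us-east-1:123456789012:share/share-EXAMPLE"
delete_and_refresh(s3, gateway, "jbarr-gw-1", "jeff_code/old.c", [share_arn])
print(s3.calls, gateway.calls)
```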

The gateway stores all of the file metadata (owner, group, permissions, and so forth) as S3 metadata:

Storage Gateway Resources
Here are some resources that will help you to learn more about the Storage Gateway:

Presentation – Deep Dive on the AWS Storage Gateway:

White Paper – File Gateway for Hybrid Architectures – Overview and Best Practices:

Recent Videos:

Available Now
This cool AWS feature has been available since last November!



Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data

Now that we can launch cloud-based compute and storage resources with a couple of clicks, the challenge is to use these resources to go from raw data to actionable results as quickly and efficiently as possible.

Amazon Redshift allows AWS customers to build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat. Once the data is loaded, our customers can make use of a plethora of enterprise reporting and business intelligence tools provided by the Redshift Partners.

One of the most challenging aspects of running a data warehouse involves loading data that is continuously changing and/or arriving at a rapid pace. In order to provide great query performance, loading data into a data warehouse includes compression, normalization, and optimization steps. While these steps can be automated and scaled, the loading process introduces overhead and complexity, and also gets in the way of those all-important actionable results.

Data formats present another interesting challenge. Some applications will process the data in its original form, outside of the data warehouse. Others will issue queries to the data warehouse. This model leads to storage inefficiencies because the data must be stored twice, and can also mean that results from one form of processing may not align with those from another due to delays introduced by the loading process.

Amazon Redshift Spectrum
In order to allow you to process your data as-is, where-is, while taking advantage of the power and flexibility of Amazon Redshift, we are launching Amazon Redshift Spectrum. You can use Spectrum to run complex queries on data stored in Amazon Simple Storage Service (S3), with no need for loading or other data prep.

You simply create a data source and issue your queries to your Redshift cluster as usual. Behind the scenes, Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your data set grows up to and beyond an exabyte! Being able to query data stored in S3 means that you can scale your compute and your storage independently, with the full power of the Redshift query model and all of the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Redshift tables and in S3.

When you issue a query, Redshift rips it apart and generates a query plan that minimizes the amount of S3 data that will be read, taking advantage of both column-oriented formats and data that is partitioned by date or another key. Then Redshift requests Spectrum workers from a large, shared pool and directs them to project, filter, and aggregate the S3 data. The final processing is performed within the Redshift cluster and the results are returned to you.
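To see why partitioning reduces the amount of S3 data read, consider this toy model of partition pruning. It is not how Redshift is implemented; the key layout (`dt=YYYY-MM-DD` path segments) and the file names are assumptions chosen for the example.

```python
def prune_partitions(object_keys, wanted_dates):
    """Keep only objects whose dt=YYYY-MM-DD path segment is wanted."""
    wanted = set(wanted_dates)
    kept = []
    for key in object_keys:
        # Collect name=value path segments, e.g. {"dt": "2017-04-02"}.
        segments = dict(
            part.split("=", 1) for part in key.split("/") if "=" in part
        )
        if segments.get("dt") in wanted:
            kept.append(key)
    return kept


keys = [
    "sales/dt=2017-04-01/part-0000.parquet",
    "sales/dt=2017-04-02/part-0000.parquet",
    "sales/dt=2017-04-03/part-0000.parquet",
]
to_scan = prune_partitions(keys, ["2017-04-02"])
print(to_scan)
```

A query that filters on a single date only needs to read one of the three objects here. Column-oriented formats reduce the work further by letting a reader skip columns the query does not reference.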

Because Spectrum operates on data that is stored in S3, you can process the data using other AWS services such as Amazon EMR and Amazon Athena. You can also implement hybrid models where frequently queried data is kept in Redshift local storage and the rest is in S3, or where dimension tables are in Redshift along with the recent portions of the fact tables, with older data in S3. In order to drive even higher levels of concurrency, you can point multiple Redshift clusters at the same stored data.

Spectrum supports open, common data formats including CSV/TSV, Parquet, SequenceFile, and RCFile. Files can be compressed using GZip or Snappy, with other data formats and compression methods in the works.

Spectrum in Action
In order to get some first-hand experience with Spectrum I loaded up a sample data set and ran some queries!

I started by creating an external schema and database:

Then I created an external table within the database:
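For readers without the screenshots, statements of this kind might look like the ones generated below. This is a sketch, not the statements actually run here: the schema, database, table, columns, role ARN, and bucket are all hypothetical, and the syntax follows the Redshift documentation for external schemas and tables as I understand it.

```python
def spectrum_ddl(schema, database, table, iam_role_arn, location):
    """Return (create_schema_sql, create_table_sql) for an external table."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema} "
        f"FROM DATA CATALOG DATABASE '{database}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
    # Hypothetical columns; the data stays in S3 at the given location.
    create_table = (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "  event_date date,\n"
        "  item_id bigint,\n"
        "  amount decimal(12,2)\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}';"
    )
    return create_schema, create_table


schema_sql, table_sql = spectrum_ddl(
    "spectrum",
    "spectrumdb",
    "sales",
    "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    "s3://example-bucket/sales/",
)
print(schema_sql)
print(table_sql)
```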

I ran a simple query to get a feel for the size of the data set (6.1 billion rows):

And then I ran a query that processed all of the rows:

As you can see, Spectrum was able to churn through all 6 billion rows in about 15 seconds. I checked my cluster’s performance metrics and it looked like I had enough CPU power to run many such queries simultaneously:

Available Now
Amazon Redshift Spectrum is available now and you can start using it today!

Spectrum pricing is based on the amount of data pulled from S3 during query processing and is charged at the rate of $5 per terabyte (you can save money by compressing your data and/or storing it in column-oriented form). You pay the usual charges to run your Redshift cluster and to store your data in S3, but there are no Spectrum charges when you are not running queries.
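A quick calculation shows how compression affects the bill. The sketch below applies the $5 per terabyte rate to the bytes scanned. It assumes a binary terabyte and ignores any per-query minimums or rounding, and the 4:1 compression ratio is made up for the example.

```python
PRICE_PER_TB = 5.00
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def spectrum_query_cost(bytes_scanned):
    """Estimated Spectrum charge in dollars for one query."""
    return PRICE_PER_TB * bytes_scanned / BYTES_PER_TB


raw_cost = spectrum_query_cost(2 * BYTES_PER_TB)             # 2 TB, uncompressed
compressed_cost = spectrum_query_cost(2 * BYTES_PER_TB / 4)  # same data at 4:1
print(f"uncompressed: ${raw_cost:.2f}, compressed: ${compressed_cost:.2f}")
```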


PS – Several people have asked about the relationship between Spectrum and Athena, and the applicability of both tools to different workloads. Fortunately, the newly updated Redshift FAQ addresses this question; see When should I use Amazon Athena vs. Redshift Spectrum? for more info.


S3 Storage Management Update – Analytics, Object Tagging, Inventory, and Metrics

Today I would like to tell you about four S3 features that will give you detailed insights into your storage and your access patterns. You can see what and how much you are storing, how it is being used, and you can make informed decisions about the use of S3 storage classes as a result. These features will be of value to everyone who uses S3, whether they have tens, thousands, millions, or billions of objects in their buckets. Here’s an overview:

S3 Analytics – You can analyze the storage and retrieval patterns for your objects and use the results to choose the most appropriate storage class. You can inspect the results of the analysis from within the S3 Console, or you can load them into your favorite BI tool and dive deep. Either way, you now have the means to gain a deep understanding of your storage patterns and to see how they relate to usage and growth.

S3 Object Tagging – You can associate multiple key-value pairs (tags) with each of your S3 objects, with the ability to change them at any time. The tags can be used to manage and control access, set up S3 Lifecycle policies, customize the S3 Analytics, and filter the CloudWatch metrics. You can think of the bucket as a data lake, and use tags to create a taxonomy of the objects within the lake. This is more flexible than using the bucket and a prefix, and allows you to make semantic-style changes without renaming, moving, or copying objects.

S3 Inventory – You can speed up your business workflows and your big data jobs using S3 Inventory. This feature provides you with a CSV-formatted flat-file representation of the contents of all or part (as identified by a prefix) of a bucket, on a daily or weekly basis.

S3 CloudWatch Metrics – You can improve the performance of your S3-powered applications by monitoring and alarming on 13 new Amazon CloudWatch metrics.

Let’s dive in…

S3 Analytics
As an S3 user, you have your choice of three storage classes (Standard, Standard – Infrequent Access, and Glacier) and the ability to use S3’s Object Lifecycle Management feature to indicate when objects should be either expired or transitioned to Standard – Infrequent Access or Glacier.

The S3 Analytics feature gives you the information that you need to have in order to choose between Standard and Standard – Infrequent Access for your objects. Because many customers store several different types or categories of information in a single bucket for ease of processing, you have the ability to run the analytics on subsets (defined by a common prefix or tag value) of the objects in a bucket. Each subset is defined by a filter; each bucket can have up to 1000 filters. Here are the filters on my jbarr-public bucket:

Here’s how I define a simple filter that analyzes objects that are prefixed with www:

I could choose to filter by tag name and value instead. Here’s how I would analyze objects that have a tag named type with the value page:

Each filter can have at most one prefix, along with any desired number of tags. I can also choose to have the analytics exported to a file each day:
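For those who prefer the API to the Console, the filter rule above (at most one prefix, any number of tags) can be sketched as a small Python helper. The dictionary shape follows my understanding of the Filter element accepted by the S3 analytics configuration API (for example, boto3's put_bucket_analytics_configuration); treat the helper name and the exact shape as assumptions and check them against the current documentation.

```python
from typing import Dict, Optional


def analytics_filter(prefix: Optional[str] = None,
                     tags: Optional[Dict[str, str]] = None) -> dict:
    """Build a filter with at most one prefix and any number of tags."""
    tag_list = [{"Key": k, "Value": v} for k, v in (tags or {}).items()]
    if prefix is None and not tag_list:
        raise ValueError("supply a prefix, at least one tag, or both")
    if prefix is not None and not tag_list:
        return {"Prefix": prefix}          # prefix only
    if prefix is None and len(tag_list) == 1:
        return {"Tag": tag_list[0]}        # single tag only
    combined = {"Tags": tag_list}          # several predicates: combine them
    if prefix is not None:
        combined["Prefix"] = prefix
    return {"And": combined}


www_filter = analytics_filter(prefix="www")
page_filter = analytics_filter(tags={"type": "page"})
both = analytics_filter(prefix="www", tags={"type": "page"})
print(www_filter, page_filter, both, sep="\n")
```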

Once one or more filters are in place, the analytics are run on a daily basis and I can view them by simply clicking on the filter. For example, here’s what I see when I click on my Archival filter:

There’s a lot of good information in this analysis! I can see that:

  • Looking back 127 days, most of my objects that are older than 30 days are accessed infrequently. I can save money by creating a Lifecycle rule that will transition these objects to Standard – Infrequent Access storage.
  • At this point, the majority (6.39 PB) of my storage is in Standard storage; just a tiny fraction (3.24 GB) is in Standard – Infrequent Access storage (I don’t actually have that much data in my bucket! The S3 team was kind enough to load up my account with some test metrics that were, shall we say, very generous).
  • Over the 127 day observation period, between 16% and 30% of the Standard storage was retrieved on a per-day basis.
  • Objects between 30-45 days old, 45-60 days old, 60-90 days old, 90-180 days old, and over 180 days old are all infrequently accessed and good candidates for Standard – Infrequent Access storage.

You can set up storage class analysis from the Console, the CLI, or through the S3 APIs.

To learn more, read S3 Storage Class Analysis.

S3 Object Tagging
Tags supplement the existing location-based S3 object management model (buckets and prefixes) with the ability to manage storage based on what the object represents. For example, you can tag objects with the name of a department and then construct IAM policies that grant access based on the tag. This gives you a lot of flexibility, including the ability to effect changes by simply altering tags.

You can create, update, or delete tags during any part of the object’s life cycle. Tags can be referenced in IAM policies, S3 Lifecycle policies, and in the Storage Analytics filters that I just showed you. Each object can have up to ten tags and each version of an object has a distinct set of tags. You can use tags to manage the expiration of objects via a lifecycle policy and you can set up CloudWatch metrics that reflect the activity around a particular tag.

The Properties page for each object displays the current set of tags and allows me to edit them:

You can also set up and access tags from the CLI or through the S3 APIs. When you use either of these two methods, you must always work in terms of the full tag set. For example, if an object has four tags and you want to add a fifth one, you must read the current set, add the new one, and then update the entire set.
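The read-modify-write cycle described above can be sketched in a few lines of Python. The helper below only manipulates the tag set in memory; in practice the current set would come from a read call such as boto3's get_object_tagging and the result would be written back with put_object_tagging. The helper name and the sample tags are hypothetical.

```python
MAX_TAGS_PER_OBJECT = 10  # limit stated in the text


def with_tag(current_tag_set, key, value):
    """Return a new, complete tag set with key=value added or updated."""
    merged = {tag["Key"]: tag["Value"] for tag in current_tag_set}
    merged[key] = value
    if len(merged) > MAX_TAGS_PER_OBJECT:
        raise ValueError(f"an object can have at most {MAX_TAGS_PER_OBJECT} tags")
    return [{"Key": k, "Value": v} for k, v in merged.items()]


# Hypothetical object that already has four tags.
current = [
    {"Key": "type", "Value": "page"},
    {"Key": "department", "Value": "marketing"},
    {"Key": "project", "Value": "launch"},
    {"Key": "owner", "Value": "jbarr"},
]

# Adding a fifth tag means sending all five back, not just the new one.
updated = with_tag(current, "status", "reviewed")
print(len(updated), "tags in the set to write back")
```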

Tags can be replicated across regions via Cross-Region Replication, but your IAM policy on the source side must enable the s3:GetObjectVersionTagging and s3:ReplicateTags actions.

Tags cost $0.01 per 10,000 tags per month. Requests that write or read tags (PUT and GET, respectively) are charged at the usual rates.

To learn more, read about S3 Object Tagging.

S3 Inventory
S3’s LIST operation is synchronous, and returns up to 1000 objects at a time, along with a key that can be used to initiate a second LIST that picks up where the first one leaves off. While this works well for small to medium-sized buckets and single-threaded programs, it can be challenging to use in conjunction with huge buckets or multiple threads.
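To see why this gets awkward at scale, here is a minimal Python sketch of the pagination loop. It runs against a small in-memory stand-in rather than S3 itself: list_page and make_fake_bucket are hypothetical helpers, and the only behavior taken from the text is the 1000-object page size and the continuation key.

```python
def list_all_keys(list_page):
    """Follow continuation keys until the listing is exhausted.

    list_page(marker) is a hypothetical callable that returns
    (keys_on_this_page, next_marker_or_None).
    """
    keys, marker = [], None
    while True:
        page, marker = list_page(marker)
        keys.extend(page)
        if marker is None:
            return keys


def make_fake_bucket(all_keys, page_size=1000):
    """In-memory stand-in for a bucket that lists page_size keys per call."""
    ordered = sorted(all_keys)

    def list_page(marker):
        start = 0 if marker is None else ordered.index(marker) + 1
        page = ordered[start:start + page_size]
        more = start + page_size < len(ordered)
        return page, (page[-1] if more else None)

    return list_page


bucket = make_fake_bucket(f"www/page-{i:04d}.html" for i in range(2500))
keys = list_all_keys(bucket)
print(len(keys), "keys retrieved")
```

Each call depends on the key returned by the previous one, so the loop is inherently sequential. That is the property that makes it hard to spread across threads for very large buckets.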

The S3 team told me that many of our customers store tens or hundreds of millions of objects in a single bucket, and that buckets with a billion or more objects are not unusual. These buckets are often processed daily or weekly as part of a larger workflow and can benefit from a more direct way to gain access to the list of objects in a bucket.

With S3 Inventory, you can now arrange to receive daily or weekly inventory reports for any of your buckets. You can use a prefix to filter the report and you can choose to include optional fields such as size, storage class, and replication status. Reports can be sent to an S3 bucket in your account or (with proper permission settings) in another account.

Here’s how I request a daily inventory report called WebStuff, for all objects that start with www:

I also need to give S3 permission to write to the destination bucket (jbarr-s3-inventory). Here’s the policy that I used (see Granting Permission for Amazon S3 Inventory and Amazon S3 Analytics to learn more):

The inventory mechanism creates three files: a manifest, a checksum for the manifest, and a data file. The manifest contains the location of the data file along with a checksum for it:

   "fileSchema":"Bucket, Key, Size, LastModifiedDate, ETag, StorageClass",

Here’s what the data file looks like after it has been unzipped:

If you are using the inventory reports to power your daily or weekly processing workflow, don’t forget about S3 Notifications! The checksum file is written after the other two, so you can safely use it to get things moving. Also, don’t forget to use a lifecycle policy to manage your collection of inventory files since they will accumulate over time.
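If a notification on the checksum file kicks off your workflow, a sensible first step is to confirm that the manifest matches it. The sketch below assumes that the checksum file holds the hexadecimal MD5 digest of the manifest, which is my understanding of the inventory format; the sample manifest is a stand-in built from the fileSchema line shown earlier, not a real report.

```python
import hashlib


def manifest_is_intact(manifest_bytes: bytes, checksum_text: str) -> bool:
    """Compare the manifest's MD5 digest with the contents of the checksum file."""
    return hashlib.md5(manifest_bytes).hexdigest() == checksum_text.strip().lower()


# Stand-in manifest and checksum, for illustration only.
manifest = b'{"fileSchema":"Bucket, Key, Size, LastModifiedDate, ETag, StorageClass"}'
checksum = hashlib.md5(manifest).hexdigest() + "\n"

print(manifest_is_intact(manifest, checksum))
print(manifest_is_intact(manifest + b" ", checksum))
```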

As an added bonus, using daily or weekly reports can save you up to 50%, when compared to multiple LIST operations. To learn more about this feature, read about S3 Storage Inventory.

S3 CloudWatch Metrics
S3 can now publish storage, request, and data transfer metrics to CloudWatch. The storage metrics are reported daily and are available at no extra cost. The request and data transfer metrics are available at one minute intervals (after some processing latency) and are billed at the standard CloudWatch rate. As is the case for the S3 Analytics, the CloudWatch metrics can be collected and reported for an entire bucket or for a subset selected via prefix or tags.

Here is the full set of metrics:

Storage
  • Bucket Size Bytes
  • Number Of Objects

Requests
  • GET
  • LIST
  • PUT
  • POST
  • ALL
  • HEAD
  • 4xx Errors
  • 5xx Errors
  • Total Request Latency
  • First Byte Latency

Data Transfer
  • Bytes Uploaded
  • Bytes Downloaded

The metrics are available within the S3 and CloudWatch consoles. Here’s what the Total Request Latency looks like in the S3 Console for my bucket:

I can also click on View in CloudWatch and set CloudWatch alarms for any desired metric. Perhaps I want to be notified if my bucket grows beyond 100 MB:
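The same alarm can be described in code. The sketch below only builds the parameters; they are shaped for CloudWatch's PutMetricAlarm call (put_metric_alarm in boto3) as I understand it. The alarm name, the choice of the StandardStorage storage type, and reading 100 MB as 100 × 1024 × 1024 bytes are assumptions.

```python
def bucket_size_alarm(bucket: str, threshold_bytes: int) -> dict:
    """Parameters for an alarm that fires when the bucket exceeds threshold_bytes."""
    return {
        "AlarmName": f"{bucket}-size-over-{threshold_bytes}-bytes",  # hypothetical name
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        "Statistic": "Average",
        "Period": 86400,  # storage metrics are reported daily
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarm = bucket_size_alarm("jbarr-public", 100 * 1024 * 1024)
print(alarm["AlarmName"], alarm["Threshold"])
```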

To learn more, read How Do I Configure Request Metrics for an S3 Bucket?

Available Now
You have had access to these features since late last year! They are accessible through the updated S3 Console, which also includes many other new features (watch Introduction to the New Amazon S3 Console to learn more).



AWS Storage Update – S3 & Glacier Price Reductions + Additional Retrieval Options for Glacier

Back in 2006, we launched S3 with a revolutionary pay-as-you-go pricing model, with an initial price of 15 cents per GB per month. Over the intervening decade, we reduced the price per GB by 80%, launched S3 in every AWS Region, and enhanced the original one-size-fits-all model with user-driven features such as web site hosting, VPC integration, and IPv6 support, while adding new storage options including S3 Infrequent Access.

Because many AWS customers archive important data for legal, compliance, or other reasons and reference it only infrequently, we launched Glacier in 2012, and then gave you the ability to transition data between S3, S3 Infrequent Access, and Glacier by using lifecycle rules.

Today I have two big pieces of news for you: we are reducing the prices for S3 Standard Storage and for Glacier storage. We are also introducing additional retrieval options for Glacier.

S3 & Glacier Price Reduction
As long-time AWS customers already know, we work relentlessly to reduce our own costs, and to pass the resulting savings along in the form of a steady stream of AWS Price Reductions.

We are reducing the per-GB price for S3 Standard Storage in most AWS regions, effective December 1, 2016. The bill for your December usage will automatically reflect the new, lower prices. Here are the new prices for Standard Storage:

| Regions | 0-50 TB ($ / GB / Month) | 51 – 500 TB ($ / GB / Month) | 500+ TB ($ / GB / Month) |
|---|---|---|---|
| US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland) (reductions range from 23.33% to 23.64%) | $0.0230 | $0.0220 | $0.0210 |
| US West (Northern California) (reductions range from 20.53% to 21.21%) | $0.0260 | $0.0250 | $0.0240 |
| EU (Frankfurt) (reductions range from 24.24% to 24.38%) | $0.0245 | $0.0235 | $0.0225 |
| Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia Pacific (Seoul), Asia Pacific (Mumbai) (reductions range from 16.36% to 28.13%) | $0.0250 | $0.0240 | $0.0230 |

As you can see from the table above, we are also simplifying the pricing model by consolidating six pricing tiers into three new tiers.
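Here is a rough Python sketch of how a monthly bill works out under the three new tiers, using the US East (Northern Virginia) prices from the table. It assumes that the tiers are graduated (each rate applies only to the portion of storage that falls inside its tier) and that 1 TB is 1024 GB; the 100 TB workload is hypothetical.

```python
GB_PER_TB = 1024  # assumption: binary units

# (tier size in GB, price per GB-month); None marks the open-ended top tier.
TIERS_US_EAST = [
    (50 * GB_PER_TB, 0.0230),   # first 50 TB
    (450 * GB_PER_TB, 0.0220),  # next 450 TB, up to 500 TB
    (None, 0.0210),             # everything beyond 500 TB
]


def monthly_storage_cost(gb_stored: float, tiers=TIERS_US_EAST) -> float:
    """Walk the tiers, charging each portion of the data at its tier's rate."""
    remaining, cost = gb_stored, 0.0
    for size_gb, price in tiers:
        portion = remaining if size_gb is None else min(remaining, size_gb)
        cost += portion * price
        remaining -= portion
        if remaining <= 0:
            break
    return cost


print(f"100 TB for one month: ${monthly_storage_cost(100 * GB_PER_TB):,.2f}")
```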

We are also reducing the price of Glacier storage in most AWS Regions. For example, you can now store 1 GB for 1 month in the US East (Northern Virginia), US West (Oregon), or EU (Ireland) Regions for just $0.004 (less than half a cent) per month, a 43% decrease. For reference purposes, this amount of storage cost $0.010 when we launched Glacier in 2012, and $0.007 after our last Glacier price reduction (a 30% decrease).

The lower pricing is a direct result of the scale that comes about when our customers trust us with trillions of objects, but it is just one of the benefits. Based on the feedback that I get when we add new features, the real value of a cloud storage platform is the rapid, steady evolution. Our customers often tell me that they love the fact that we anticipate their needs and respond with new features accordingly.

New Glacier Retrieval Options
Many AWS customers use Amazon Glacier as the archival component of their tiered storage architecture. Glacier allows them to meet compliance requirements (either organizational or regulatory) while allowing them to use any desired amount of cloud-based compute power to process and extract value from the data.

Today we are enhancing Glacier with two new retrieval options for your Glacier data. You can now pay a little bit more to expedite your data retrieval. Alternatively, you can indicate that speed is not of the essence and pay a lower price for retrieval.

We launched Glacier with a pricing model for data retrieval that was based on the amount of data that you had stored in Glacier and the rate at which you retrieved it. While this was an accurate reflection of our own costs to provide the service, it was somewhat difficult to explain. Today we are replacing the rate-based retrieval fees with simpler per-GB pricing.

Our customers in the Media and Entertainment industry archive their TV footage to Glacier. When an emergent situation calls for them to retrieve a specific piece of footage, minutes count and they want fast, cost-effective access to the footage. Healthcare customers are looking for rapid, “while you wait” access to archived medical imagery and genome data; photo archives and companies selling satellite data turn out to have similar requirements. On the other hand, some customers have the ability to plan their retrievals ahead of time, and are perfectly happy to get their data in 5 to 12 hours.

Taking all of this into account, you can now select one of the following options for retrieving your data from Glacier (the original rate-based retrieval model is no longer applicable):

Standard retrieval is the new name for what Glacier already provides, and is the default for all API-driven retrieval requests. You get your data back in a matter of hours (typically 3 to 5), and pay $0.01 per GB along with $0.05 for every 1,000 requests.

Expedited retrieval addresses the need for “while you wait” access. You can get your data back quickly, with retrieval typically taking 1 to 5 minutes. If you store (or plan to store) more than 100 TB of data in Glacier and need to make infrequent, yet urgent requests for subsets of your data, this is a great model for you (if you have less data, S3’s Infrequent Access storage class can be a better value). Retrievals cost $0.03 per GB and $0.01 per request.

Retrieval generally takes between 1 and 5 minutes, depending on overall demand. If you need to get your data back in this time frame even in rare situations where demand is exceptionally high, you can provision retrieval capacity. Once you have done this, all Expedited retrievals will automatically be served via your Provisioned capacity. Each unit of Provisioned capacity costs $100 per month and ensures that you can perform at least 3 Expedited Retrievals every 5 minutes, with up to 150 MB/second of retrieval throughput.

Bulk retrieval is a great fit for planned or non-urgent use cases, with retrieval typically taking 5 to 12 hours at a cost of $0.0025 per GB (75% less than for Standard Retrieval) along with $0.025 for every 1,000 requests. Bulk retrievals are perfect when you need to retrieve large amounts of data within a day, and are willing to wait a few extra hours in exchange for a very significant discount.

If you do not specify a retrieval option when you call InitiateJob to retrieve an archive, a Standard Retrieval will be initiated. Your existing jobs will continue to work as expected, and will be charged at the new rate.
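To compare the three options side by side, here is a small Python sketch that applies the per-GB and per-request prices quoted above. The 500 GB, 1,000-request workload is hypothetical, and the optional Provisioned capacity charge is left out.

```python
# Prices quoted in the text, normalized to per-GB and per-request.
RETRIEVAL_OPTIONS = {
    "Expedited": {"per_gb": 0.03, "per_request": 0.01},
    "Standard": {"per_gb": 0.01, "per_request": 0.05 / 1000},
    "Bulk": {"per_gb": 0.0025, "per_request": 0.025 / 1000},
}


def retrieval_cost(option: str, gb: float, requests: int) -> float:
    """Retrieval charge in USD, excluding any Provisioned capacity."""
    rates = RETRIEVAL_OPTIONS[option]
    return gb * rates["per_gb"] + requests * rates["per_request"]


# Hypothetical workload: 500 GB spread across 1,000 retrieval requests.
for option in RETRIEVAL_OPTIONS:
    print(f"{option:<10} ${retrieval_cost(option, 500, 1000):.3f}")
```

Note how the per-request fee dominates the difference for Expedited retrievals made up of many small requests, while for Standard and Bulk the per-GB rate is what matters.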

To learn more, read about Data Retrieval in the Glacier FAQ.

As always, I am thrilled to be able to share this news with you, and I hope that you are equally excited!

If you want to learn more, we have a webinar coming up on December 12th. Register here.



CloudTrail Update – Capture and Process Amazon S3 Object-Level API Activity

I would like to show you how several different AWS services can be used together to address a challenge faced by many of our customers. Along the way I will introduce you to a new AWS CloudTrail feature that launches today and show you how you can use it in conjunction with CloudWatch Events.

The Challenge
Our customers store many different types of mission-critical data in Amazon Simple Storage Service (S3) and want to be able to track object-level activity on their data. While some of this activity is captured and stored in the S3 access logs, the level of detail is limited and log delivery can take several hours. Customers, particularly in financial services and other regulated industries, are asking for additional detail, delivered on a more timely basis. For example, they would like to be able to know when a particular IAM user accesses sensitive information stored in a specific part of an S3 bucket.

In order to meet the needs of these customers, we are now giving CloudTrail the power to capture object-level API activity on S3 objects, which we call Data events (the original CloudTrail events are now called Management events). Data events include “read” operations such as GET, HEAD, and Get Object ACL as well as “write” operations such as PUT and POST. The level of detail captured for these operations is intended to provide support for many types of security, auditing, governance, and compliance use cases. For example, it can be used to scan newly uploaded data for Personally Identifiable Information (PII), audit attempts to access data in a protected bucket, or to verify that the desired access policies are in effect.

Processing Object-Level API Activity
Putting this all together, we can easily set up a Lambda function that will take a custom action whenever an S3 operation takes place on any object within a selected bucket or a selected folder within a bucket.

Before starting on this post, I created a new CloudTrail trail called jbarr-s3-trail:

I want to use this trail to log object-level activity on one of my S3 buckets (jbarr-s3-trail-demo). In order to do this I need to add an event selector to the trail. The selector is specific to S3, and allows me to focus on logging the events that are of interest to me. Event selectors are a new CloudTrail feature and are being introduced as part of today’s launch, in case you were wondering.

I indicate that I want to log both read and write events, and specify the bucket of interest. I can limit the events to part of the bucket by specifying a prefix, and I can also specify multiple buckets. I can also control the logging of Management events:

CloudTrail supports up to 5 event selectors per trail. Each event selector can specify up to 50 S3 buckets and optional bucket prefixes.
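For reference, here is a minimal Python sketch that builds an event selector like the one I set up in the Console and enforces the two limits above. The dictionary is shaped for CloudTrail's PutEventSelectors call (put_event_selectors in boto3) as I understand it; the helper names are hypothetical.

```python
MAX_SELECTORS_PER_TRAIL = 5    # limit stated in the text
MAX_BUCKETS_PER_SELECTOR = 50  # limit stated in the text


def s3_event_selector(targets, read_write="All", management=True) -> dict:
    """targets is a list of (bucket, prefix) pairs; use "" to cover the whole bucket."""
    if not 1 <= len(targets) <= MAX_BUCKETS_PER_SELECTOR:
        raise ValueError(f"a selector takes 1 to {MAX_BUCKETS_PER_SELECTOR} buckets")
    return {
        "ReadWriteType": read_write,            # "ReadOnly", "WriteOnly", or "All"
        "IncludeManagementEvents": management,
        "DataResources": [{
            "Type": "AWS::S3::Object",
            "Values": [f"arn:aws:s3:::{bucket}/{prefix}" for bucket, prefix in targets],
        }],
    }


def selectors_for_trail(selectors) -> list:
    """Check the per-trail limit before the list is sent to CloudTrail."""
    if len(selectors) > MAX_SELECTORS_PER_TRAIL:
        raise ValueError(f"a trail takes at most {MAX_SELECTORS_PER_TRAIL} selectors")
    return list(selectors)


selector = s3_event_selector([("jbarr-s3-trail-demo", "")])
print(selectors_for_trail([selector]))
```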

I set this up, opened my bucket in the S3 Console, uploaded a file, and took a look at one of the entries in the trail. Here’s what it looked like:

{
  "eventVersion": "1.05",
  "userIdentity": {
    "type": "Root",
    "principalId": "99999999999",
    "arn": "arn:aws:iam::99999999999:root",
    "accountId": "99999999999",
    "username": "jbarr",
    "sessionContext": {
      "attributes": {
        "creationDate": "2016-11-15T17:55:17Z",
        "mfaAuthenticated": "false"
      }
    }
  },
  "eventTime": "2016-11-15T23:02:12Z",
  "eventSource": "",
  "eventName": "PutObject",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "",
  "userAgent": "[S3Console/0.4]",
  "requestParameters": {
    "X-Amz-Date": "20161115T230211Z",
    "bucketName": "jbarr-s3-trail-demo",
    "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
    "storageClass": "STANDARD",
    "cannedAcl": "private",
    "X-Amz-SignedHeaders": "Content-Type;Host;x-amz-acl;x-amz-storage-class",
    "X-Amz-Expires": "300",
    "key": "ie_sb_device_4.png"
  }
}

Then I create a simple Lambda function:
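The function itself is not reproduced in this text, so what follows is only a rough Python sketch of what such a function could look like, not the original code. It assumes that CloudWatch Events delivers the CloudTrail record under the event's detail key, and it reads the same fields that appear in the trail entry above; the sample event is hand-built for illustration.

```python
import json


def lambda_handler(event, context):
    """Log which object was touched and by which S3 operation."""
    detail = event.get("detail", {})
    params = detail.get("requestParameters") or {}
    summary = {
        "eventName": detail.get("eventName"),
        "bucket": params.get("bucketName"),
        "key": params.get("key"),
    }
    print(json.dumps(summary))  # print output ends up in the function's log
    return summary


# Local smoke test with a hand-built event.
sample_event = {
    "detail": {
        "eventName": "PutObject",
        "requestParameters": {
            "bucketName": "jbarr-s3-trail-demo",
            "key": "ie_sb_device_4.png",
        },
    }
}
result = lambda_handler(sample_event, None)
```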

Next, I create a CloudWatch Events rule that matches the event name of interest (PutObject) and invokes my Lambda function (S3Watcher):
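Expressed as an event pattern, a rule along these lines might look like the following Python sketch. The field names follow my understanding of how CloudWatch Events represents API calls recorded by CloudTrail; narrowing the match to a single bucket is an optional extra, and the helper name is hypothetical.

```python
import json


def put_object_pattern(bucket: str) -> dict:
    """Event pattern matching PutObject calls on one bucket."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["PutObject"],
            "requestParameters": {"bucketName": [bucket]},
        },
    }


pattern = put_object_pattern("jbarr-s3-trail-demo")
print(json.dumps(pattern, indent=2))
```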

Now I upload some files to my bucket and check to see that my Lambda function has been invoked as expected:

I can also find the CloudWatch entry that contains the output from my Lambda function:

Pricing and Availability
Data events are recorded only for the S3 buckets that you specify, and are charged at the rate of $0.10 per 100,000 events. This feature is available in all commercial AWS Regions.


Human Longevity, Inc. – Changing Medicine Through Genomics Research

Human Longevity, Inc. (HLI) is at the forefront of genomics research and wants to build the world’s largest database of human genomes along with related phenotype and clinical data, all in support of preventive healthcare. In today’s guest post, Yaron Turpaz, Bryan Coon, and Ashley Van Zeeland talk about how they are using AWS to store the massive amount of data that is being generated as part of this effort to revolutionize medicine.


When Human Longevity, Inc. launched in 2013, our founders recognized the challenges ahead. A genome contains all the information needed to build and maintain an organism; in humans, a copy of the entire genome, which contains more than three billion DNA base pairs, is contained in all cells that have a nucleus. Our goal is to sequence one million genomes and deliver that information—along with integrated health records and disease-risk models—to researchers and physicians. They, in turn, can interpret the data to provide targeted, personalized health plans and identify the optimal treatment for cancer and other serious health risks far earlier than has been possible in the past. The intent is to transform medicine by fostering preventive healthcare and risk prevention in place of the traditional “sick care” model, when people wind up seeing their doctors only after symptoms manifest.

Our work in developing and applying large-scale computing and machine learning to genomics research entails the collection, analysis, and storage of immense amounts of data from DNA-sequencing technology provided by companies like Illumina. Raw data from a single genome consumes about 100 gigabytes; that number increases as we align the genomic information with annotation and phenotype sources and analyze it for health insights.

From the beginning, we knew our choice of compute and storage technology would have a direct impact on the success of the company. Using the cloud was clearly the best option. We’re experts in genomics, and don’t want to spend resources building and maintaining an IT infrastructure. We chose to go all in on AWS for the breadth of the platform, the critical scalability we need, and the expertise AWS has developed in big data. We also saw that the pace of innovation at AWS—and its deliberate strategy of keeping costs as low as possible for customers—would be critical in enabling our vision.

Leveraging the Range of AWS Services

Spectral karyotype analysis / Image courtesy of Human Longevity, Inc.

Today, we’re using a broad range of AWS services for all kinds of compute and storage tasks. For example, the HLI Knowledgebase leverages a distributed system infrastructure comprised of Amazon S3 storage and a large number of Amazon EC2 nodes. This helps us achieve resource isolation, scalability, speed of provisioning, and near real-time response time for our petabyte-scale database queries and dynamic cohort builder. The flexibility of AWS services makes it possible for our customized Amazon Machine Images and pre-built, BTRFS-partitioned Amazon EBS volumes to achieve turn-up time in seconds instead of minutes. We use Amazon EMR for executing Spark queries against our data lake at the scale we need. AWS Lambda is a fantastic tool for hooking into Amazon S3 events and communicating with apps, allowing us to simply drop in code with the business logic already taken care of. We use Auto Scaling based on demand, and AWS OpsWorks for managing a Docker pipeline.

We also leverage the cost controls provided by Amazon EC2 Spot and Reserved Instance types. When we first started, we used on-demand instances, but the costs started to grow significantly. With Spot and Reserved Instances, we can allocate compute resources based on specific needs and workflows. The flexibility of AWS services enables us to make extensive use of dockerized containers through the resource-management services provided by Apache Mesos. Hundreds of dynamic Amazon EC2 nodes in both our persistent and spot abstraction layers are dynamically adjusted to scale up or down based on usage demand and the latest AWS pricing information. We achieve substantial savings by sharing this dynamically scaled compute cluster with our Knowledgebase service and the internal genomic and oncology computation pipelines. This flexibility gives us the compute power we need while keeping costs down. We estimate these choices have helped us reduce our compute costs by up to 50 percent from the on-demand model.

We’ve also worked with AWS Professional Services to address a particularly hard data-storage challenge. We have genomics data in hundreds of Amazon S3 buckets, many of them in the petabyte range and containing billions of objects. Within these collections are millions of objects that are unused, or used once or twice and never to be used again. It can be overwhelming to sift through these billions of objects in search of one in particular. It presents an additional challenge when trying to identify what files or file types are candidates for the Amazon S3-Infrequent Access storage class. Professional Services helped us with a solution for indexing Amazon S3 objects that saves us time and money.

Moving Faster at Lower Cost
Our decision to use AWS came at the right time, occurring at the inflection point of two significant technologies: gene sequencing and cloud computing. Not long ago, it took a full year and cost about $100 million to sequence a single genome. Today we can sequence a genome in about three days for a few thousand dollars. This dramatic improvement in speed and lower cost, along with rapidly advancing visualization and analytics tools, allows us to collect and analyze vast amounts of data in close to real time. Users can take that data and test a hypothesis on a disease in a matter of days or hours, compared to months or years. That ultimately benefits patients.

Our business includes HLI Health Nucleus, a genomics-powered clinical research program that uses whole-genome sequence analysis, advanced clinical imaging, machine learning, and curated personal health information to deliver the most complete picture of individual health. We believe this will dramatically enhance the practice of medicine as physicians identify, treat, and prevent diseases, allowing their patients to live longer, healthier lives.

— Yaron Turpaz (Chief Information Officer), Bryan Coon (Head of Enterprise Services), and Ashley Van Zeeland (Chief Technology Officer).

Learn More
Learn more about how AWS supports genomics in the cloud, and see how genomics innovator Illumina uses AWS for accelerated, cost-effective gene sequencing.