Amazon Web Services Blog

  • Multi-AZ Support / Auto Failover for Amazon ElastiCache for Redis

    24 Oct 2014 in Amazon ElastiCache | permalink

    Like every AWS offering, Amazon ElastiCache started out simple and then grew in breadth and depth over time. Here's a brief recap of the most important milestones:

    • August 2011 - Initial launch with support for the Memcached caching engine in one AWS Region.
    • December 2011 - Expansion to four additional Regions.
    • March 2012 - The first of several price reductions.
    • April 2012 - Introduction of Reserved Cluster Nodes.
    • November 2012 - Introduction of four additional types of Cache Nodes.
    • September 2013 - Initial support for the Redis caching engine including Replication Groups with replicas for increased read throughput.
    • March 2014 - Another price reduction.
    • April 2014 - Backup and restore of Redis Clusters.
    • July 2014 - Support for M3 and R3 Cache Nodes.
    • July 2014 - Node placement across more than one Availability Zone in a Region.
    • September 2014 - Support for T2 Cache Nodes.

    When you start to use any of the AWS services, you should always anticipate a steady stream of enhancements. Some of them, as you can see from the list above, will give you additional flexibility with regard to architecture, scalability, or location. Others will improve your cost structure by reducing prices or adding opportunities to purchase Reserved Instances. Another class of enhancements simplifies the task of building applications that are resilient and fault-tolerant.

    Multi-AZ Support for Redis
    Today's launch is designed to help you add resilience and fault tolerance to your Redis Cache Clusters. You can now create a Replication Group that spans multiple Availability Zones with automatic failure detection and failover.

    After you have created a Multi-AZ Replication Group, ElastiCache will monitor the health and connectivity of the nodes. If the primary node fails, ElastiCache will select the read replica that has the lowest replication lag (in other words, the one that is the most current) and make it the primary node. It will then propagate a DNS change, create another read replica, and wire everything back together, with no administrative work on your side.

    This new level of automated fault detection and recovery will enhance the overall availability of your Redis Cache Clusters. The following situations will initiate the failover process:

    1. Loss of availability in the primary's Availability Zone.
    2. Loss of network connectivity to the primary.
    3. Failure of the primary.

    Creating a Multi-AZ Replication Group
    You can create a Multi-AZ Cache Replication Group by checking the Multi-AZ checkbox after selecting Create Cache Cluster:

    A diverse set of Availability Zones will be assigned by default. You can easily override them in order to better reflect the needs of your application:

    Multi-AZ for Existing Cache Clusters
    You can also modify your existing Cache Cluster to add Multi-AZ residency and automatic failover with a couple of clicks.
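
    If you prefer to script these steps, here is a minimal sketch using the boto3 SDK (which postdates this post); the group IDs, node type, and Availability Zones are placeholders:

    import boto3

    elasticache = boto3.client("elasticache", region_name="us-east-1")

    # Create a Redis Replication Group with Multi-AZ automatic failover enabled.
    elasticache.create_replication_group(
        ReplicationGroupId="my-redis-group",
        ReplicationGroupDescription="Multi-AZ Redis replication group",
        Engine="redis",
        EngineVersion="2.8.6",              # Multi-AZ requires Redis 2.8.6 or later
        CacheNodeType="cache.m3.medium",
        NumCacheClusters=3,                 # one primary plus two read replicas
        AutomaticFailoverEnabled=True,
        PreferredCacheClusterAZs=["us-east-1a", "us-east-1b", "us-east-1c"],
    )

    # Add automatic failover to an existing Replication Group.
    elasticache.modify_replication_group(
        ReplicationGroupId="my-existing-group",
        AutomaticFailoverEnabled=True,
        ApplyImmediately=True,
    )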

    Things to Know
    The Multi-AZ support in ElastiCache for Redis currently makes use of the asynchronous replication that is built into newer versions (2.8.6 and beyond) of the Redis engine. As such, it is subject to its strengths and weaknesses. In particular, when a read replica connects to a primary for the first time or when the primary changes, the replica will perform a full synchronization with the primary. This ensures that the cached information is as current as possible, but it will impose an additional load on the primary and the read replica(s).

    The entire failover process, from detection to the resumption of normal caching behavior, will take several minutes. Your application's caching tier should have a strategy (and some code!) to deal with a cache that is momentarily unavailable.
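
    One possible approach, sketched below with the open source redis-py client, is to treat the cache as optional: fall back to your origin data store when the cache is unreachable and repopulate it on a best-effort basis. The endpoint, key, and load_from_database() function are hypothetical:

    import redis

    # The Replication Group's primary endpoint; its DNS record is updated during failover.
    cache = redis.StrictRedis(host="my-group.abc123.ng.0001.use1.cache.amazonaws.com",
                              port=6379, socket_timeout=1)

    def get_item(key):
        try:
            value = cache.get(key)
            if value is not None:
                return value
        except redis.RedisError:
            pass                              # cache unreachable (e.g. mid-failover); fall through
        value = load_from_database(key)       # hypothetical origin lookup
        try:
            cache.setex(key, 300, value)      # best-effort repopulation with a short TTL
        except redis.RedisError:
            pass
        return value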

    Available Now
    This new feature is available now in all public AWS Regions and you can start using it today. The feature is offered at no extra charge to all ElastiCache users.

    -- Jeff;

  • OpenID Connect Support for Amazon Cognito

    23 Oct 2014 in Amazon Cognito | permalink

    This past summer, we launched Cognito to simplify the task of authenticating users and storing, managing, and syncing their data across multiple devices. Cognito already supports a variety of identities — public provider identities (Facebook, Google, and Amazon), guest user identities, and recently announced developer authenticated identities.

    Today we are making Amazon Cognito even more flexible by enabling app developers to use identities from any provider that supports OpenID Connect (OIDC). For example, you can write AWS-powered apps that allow users to sign in using their user name and password from Salesforce or Ping Federate. OIDC is an open standard that enables developers to leverage additional identity providers for authentication. This way they can focus on developing their app rather than dealing with user names and passwords.

    Today's launch adds OIDC provider identities to the list. Cognito takes the ID token that you obtain from the OIDC identity provider and uses it to manufacture unique Cognito IDs for each person who uses your app. You can use this identifier to save and synchronize user data across devices and to retrieve temporary, limited-privilege AWS credentials through the AWS Security Token Service.
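
    Here is a rough sketch of that flow using the boto3 SDK; the identity pool ID is a placeholder, and the Logins key must match the OIDC provider name configured for your identity pool:

    import boto3

    cognito = boto3.client("cognito-identity", region_name="us-east-1")
    oidc_id_token = "<ID token returned by your OIDC provider>"

    # Exchange the OIDC ID token for a unique Cognito identity ID.
    identity = cognito.get_id(
        IdentityPoolId="us-east-1:11111111-2222-3333-4444-555555555555",
        Logins={"oidc.example.com": oidc_id_token},
    )

    # Retrieve temporary, limited-privilege AWS credentials for that identity.
    creds = cognito.get_credentials_for_identity(
        IdentityId=identity["IdentityId"],
        Logins={"oidc.example.com": oidc_id_token},
    )["Credentials"]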

    Building upon the support for SAML (Security Assertion Markup Language) that we launched last year, we hope that today's addition of support for OIDC demonstrates our commitment to open standards. To learn more and to see some sample code, see our new post, Building an App using Amazon Cognito and an OpenID Connect Identity Provider on the AWS Security Blog. If you are planning to attend Internet Identity Workshop next week, come meet the members of the team that added this support!

    -- Jeff;

  • Now Open - AWS Germany (Frankfurt) Region - EC2, DynamoDB, S3, and Much More

    23 Oct 2014 in Amazon EC2, Europe | permalink

    It is time to expand the AWS footprint once again, this time with a new Region in Frankfurt, Germany. AWS customers in Europe can now use the new EU (Frankfurt) Region along with the existing EU (Ireland) Region for fast, low-latency access to the suite of AWS infrastructure services. You can now build multi-Region applications with the assurance that your content will stay within the EU.

    New Region
    The new Frankfurt Region supports Amazon Elastic Compute Cloud (EC2) and related services including Amazon Elastic Block Store (EBS), Amazon Virtual Private Cloud, Auto Scaling, and Elastic Load Balancing.

    It also supports AWS Elastic Beanstalk, AWS CloudFormation, Amazon CloudFront, Amazon CloudSearch, AWS CloudTrail, Amazon CloudWatch, AWS Direct Connect, Amazon DynamoDB, Amazon Elastic MapReduce, AWS Storage Gateway, Amazon Glacier, AWS CloudHSM, AWS Identity and Access Management (IAM), Amazon Kinesis, AWS OpsWorks, Amazon Route 53, Amazon Relational Database Service (RDS), Amazon Redshift, Amazon Simple Storage Service (S3), Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), and Amazon Simple Workflow Service (SWF).

    The Region supports all sizes of T2, M3, C3, R3, and I2 instances. All EC2 instances must be launched within a Virtual Private Cloud in this Region (see my blog post, Virtual Private Clouds for Everyone for more information).

    There are also three edge locations in Frankfurt for Amazon Route 53 and Amazon CloudFront.

    This is our eleventh Region (see the AWS Global Infrastructure map for more information). As usual, you can see the full list in the Region menu of the AWS Management Console:

    Rigorous Compliance
    Every AWS Region is designed and built to meet rigorous compliance standards, including ISO 27001, SOC 1, and PCI DSS Level 1 (see the AWS Compliance page for more info). AWS is fully compliant with all applicable EU Data Protection laws. For customers who wish to use AWS to store personal data, AWS provides a data processing agreement. More information on how customers can use AWS to meet EU Data Protection requirements can be found at AWS Data Protection.

    Customers
    Many organizations in Europe are already making use of AWS. Here's a very small sample:

    mytaxi (Slideshare presentation) is a very popular (10 million users and 45,000 taxis) taxi booking application. They use AWS to help them to service their global customer base in real time. They plan to use the new Region to provide even better service to their customers in Germany.


    Wunderlist (case study) was first attracted to AWS by, as they say, the "fantastic technology stack." Empowered by AWS, they have developed an agile deployment model that allows them to deploy new code several times per day. They can experiment more often (with very little risk) and can launch new products more quickly. They believe that the new AWS Region will benefit their customers in Germany and will also inspire the local startup scene.

    AWS Partner Network
    Members of the AWS Partner Network (APN) have been preparing for the launch of the new Region. Here's a sampling (send me email with launch day updates).

    Software AG is using AWS as a global host for ARIS Cloud, a Business Process Analysis-as-a-Service (BPAaaS) product. AWS allows Software AG to focus on their core competency, the development of great software, and gives them the power to roll out new cloud products globally within days.

    Trend Micro is bringing their security solutions to the new region. Trend Micro Deep Security helps customers secure their AWS deployments and instances against the latest threats, including Shellshock and Heartbleed.

    Here are a few late-breaking (post-launch) additions:

    1. BitNami - Support for the new Amazon Cloud Region in Germany.
    2. Appian - Appian Cloud Adds Local Hosting in Germany

    Here are some of the latest and greatest third party operating system AMIs in the new Region:

    1. Canonical - Ubuntu Server 14.04 LTS
    2. SUSE - SUSE Linux Enterprise Server 11 SP3

    For Developers - Signature Version 4 Support
    This new Region supports only Signature Version 4. If you have built applications with the AWS SDKs or the AWS Command Line Interface (CLI) and your API calls are being rejected, you should update to the newest SDK and CLI. To learn more, visit Using the AWS SDKs and Explorers.
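
    Once you are on a current SDK, using the new Region is just a matter of naming it. Here is a quick boto3 sketch; the Region code for Frankfurt is eu-central-1, and newer SDKs sign requests with Signature Version 4 automatically:

    import boto3

    # Point the client at the new EU (Frankfurt) Region.
    ec2 = boto3.client("ec2", region_name="eu-central-1")

    # List the Availability Zones in the Region.
    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], zone["State"])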

    AWS Offices in Europe
    In order to support enterprises, government agencies, academic institutions, small-to-mid size companies, startups, and developers, there are AWS offices in Germany (Berlin, Munich), the UK (London), Ireland (Dublin), France (Paris), Luxembourg (Luxembourg City), Spain (Madrid), Sweden (Stockholm), and Italy (Milan).

    Use it Now
    This new Region is open for business now and you can start using it today!

    -- Jeff;

    PS - Like our US West (Oregon) and AWS GovCloud (US) Regions, this region uses carbon-free power!

  • MLB.com Statcast Debuts at the World Series - Powered by AWS

    23 Oct 2014 | permalink

    Yesterday, the team at MLB Advanced Media (MLBAM) launched MLB.com Statcast for the 2014 World Series. This cool new video experience, powered by AWS, demonstrates for fans how high-resolution cameras and radar equipment precisely track the position of the ball and all of the players on the field during a baseball game. The equipment captures 20,000 position metrics for the ball every second. It also captures 30 position metrics for each player every second.

    The data is used to create a newly introduced video overlay experience — MLB.com Statcast powered by AWS — to display the computed performance metrics that measure the performance of each player. This data, and the renderings that it creates, help to provide today's baseball fans with the detailed and engaging online content that they crave.

    Here are a couple of examples that will show you more about the data collected and displayed through Statcast, using a diving catch from Game 6 of the ALCS. First, the pitch:

    The reaction in center field:

    And the catch:

    Watch the complete video to see and hear the action!

    -- Jeff;

  • AWS Ad Tech Conference - This Friday in San Francisco!

    22 Oct 2014 in AWS Loft | permalink

    The advertising space is going through a rapid, technology-enabled, data-driven transformation!

    Many of the companies driving this change are using AWS services like Amazon Elastic MapReduce, Amazon Redshift, Amazon DynamoDB, Amazon Kinesis, and Amazon CloudFront to serve, ingest, process, store, analyze, track, and optimize their online advertising campaigns.

    If you work for an ad tech company in the San Francisco area you should consider attending a free one-day event for developers and architects this coming Friday (October 24th) in San Francisco.

    Attend, Learn, Meet
    If you attend the event you will get to learn about AWS in a series of five technical deep-dive sessions that are laser-focused on the key AWS technologies that I mentioned above. You will also get to hear AWS customers such as AdRoll (ad retargeting), Blinkx (video discovery and sharing), Bloomreach (big data marketing), Krux Digital (cross-screen data management), SET Media (digital video classification), Tune, and Viglink (automated monetization) share their real-life use cases, architectures, and the lessons they learned on their journey to the cloud. The day will end with a networking reception at 5:00 PM.

    This event is designed for developers and architects who are already familiar with AWS and are looking to increase their knowledge of key ad tech enabling services and learn directly from their industry peers. This is not an introductory or business-level event.

    Register Now
    The event runs from 10:00 AM to 6:00 PM this coming Friday. It will be held in the AWS Pop-up Loft at 925 Market Street in San Francisco. Registration is mandatory, space is limited, and there's no charge to attend. To register:

    1. Go to the AWS Pop-up Loft site and click Register to attend the AWS Loft. If this is your first time registering for an event at the AWS Pop-Up Loft, you'll need to create a new account first. Otherwise, just log in to the site first.
    2. Under Evening Events/Sessions, go to Friday, 10/24/14, check the box for Advertising Technology Day, and continue through the registration process.

    Agenda
    Here is the agenda for the day:

    Time Session
    9:30 AM Arrive and Register
    10:00 AM Customer Presentation (Viglink)
    10:30 AM Customer Presentation (Krux)
    11:00 AM Amazon EMR Best Practices
    11:30 AM Customer Presentation (Bloomreach)
    12:00 PM Lunch and Informal Q&A
    12:30 PM Amazon Redshift Best Practices
    1:00 PM Customer Presentation (Tune)
    1:30 PM Amazon Kinesis Best Practices
    2:00 PM Customer Presentation (SET Media)
    2:30 PM Amazon CloudFront Best Practices
    3:00 PM Customer Presentation (Blinkx)
    3:30 PM Amazon DynamoDB Best Practices
    4:00 PM Customer Presentation (AdRoll)
    4:30 PM Q&A
    5:00 PM Happy Hour Networking Reception

    -- Jeff;

  • New AWS Directory Service

    Virtually every organization uses a directory service such as Active Directory to allow computers to join domains, to list and authenticate users, and to locate and connect to printers and other network services, including SQL Server databases. A centralized directory reduces the amount of administrative work that must be done when an employee joins the organization, changes roles, or leaves.

    With the advent of cloud-based services, an interesting challenge has arisen. By design, the directory is intended to be a central source of truth with regard to user identity. Administrators should not have to maintain one directory service for on-premises users and services, and a separate, parallel one for the cloud. Ideally, on-premises and cloud-based services could share and make use of a single, unified directory service.

    Perhaps you want to run Microsoft Windows on EC2 or centrally control access to AWS applications such as Amazon WorkSpaces or Amazon Zocalo. Setting up and then running a directory can be a fairly ambitious undertaking once you take into account the need to procure and run hardware, install, configure, and patch the operating system and the directory software, and so forth. This might be overkill if you have a user base of modest size and just want to use the AWS applications and exercise centralized control over users and permissions.

    The New AWS Directory Service
    Today we are introducing the AWS Directory Service to address these challenges! This managed service provides two types of directories. You can connect to an existing on-premises directory or you can set up and run a new, Samba-based directory in the Cloud.

    If your organization already has a directory, you can now make use of it from within the cloud using the AD Connector directory type. This is a gateway technology that serves as a cloud proxy to your existing directory, without the need for complex synchronization technology or federated sign-on. All communication between the AWS Cloud and your on-premises directory takes place over AWS Direct Connect or a secure VPN connection within an Amazon Virtual Private Cloud. The AD Connector is easy to set up (just a few parameters) and needs very little in the way of operational care and feeding. Once configured, your users can use their existing credentials (user name and password, with optional RADIUS authentication) to log in to WorkSpaces, Zocalo, EC2 instances running Microsoft Windows, and the AWS Management Console. The AD Connector is available in two sizes: Small (up to 10,000 users, computers, groups, and other directory objects) and Large (up to 100,000 users, computers, groups, and other directory objects).

    If you don't currently have a directory and don't want to be bothered with all of the care and feeding that's traditionally been required, you can quickly and easily provision and run a Samba-based directory in the cloud using the Simple AD directory type. This directory supports most of the common Active Directory features including joins to Windows domains, management of Group Policies, and single sign-on to directory-powered apps. EC2 instances that run Windows can join domains and can be administered en masse using Group Policies for consistency. Amazon WorkSpaces and Amazon Zocalo can make use of the directory. Developers and system administrators can use their directory credentials to sign in to the AWS Management Console in order to manage AWS resources such as EC2 instances or S3 buckets.

    Getting Started
    Regardless of the directory type that you choose, getting started is quick and easy. Keep in mind, of course, that you are setting up an important piece of infrastructure, so choose your names and passwords accordingly. Let's walk through the process of setting up each type of directory.

    I can create an AD Connector as a cloud-based proxy to an existing Active Directory running within my organization. I'll have to create a VPN connection from my Virtual Private Cloud to my on-premises network, making use of AWS Direct Connect if necessary. Then I will need to create an account with sufficient privileges to allow it to handle lookup, authentication, and domain join requests. I'll also need the DNS name of the existing directory. With that information in hand, creating the AD Connector is a simple matter of filling in a form:

    I also have to provide it with information about my VPC, including the subnets where I'd like the directory servers to be hosted:

    The AD Connector will be up & running and ready to use within minutes!
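
    If you'd rather script this step, here is a minimal sketch using the boto3 Directory Service client; every value shown is a placeholder for your own network and directory details:

    import boto3

    ds = boto3.client("ds", region_name="us-east-1")

    # Create an AD Connector that proxies requests to an existing on-premises directory.
    response = ds.connect_directory(
        Name="corp.example.com",                  # DNS name of the existing directory
        ShortName="CORP",
        Password="<password for the proxy account>",
        Size="Small",                             # or "Large"
        ConnectSettings={
            "VpcId": "vpc-11111111",
            "SubnetIds": ["subnet-aaaaaaaa", "subnet-bbbbbbbb"],
            "CustomerDnsIps": ["10.0.0.10", "10.0.1.10"],
            "CustomerUserName": "connector-account",
        },
    )
    print(response["DirectoryId"])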

    Creating a Simple AD in the cloud is also very simple and straightforward. Again, I need to choose one of my VPCs and then pick a pair of subnets within it for my directory servers:

    Again, the Simple AD will be up, running, and ready for use within minutes.
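
    The equivalent call for a Simple AD is create_directory; again, a sketch with placeholder values:

    import boto3

    ds = boto3.client("ds", region_name="us-east-1")

    # Create a Samba-based Simple AD spanning two subnets in my VPC.
    response = ds.create_directory(
        Name="corp.example.com",
        ShortName="CORP",
        Password="<password for the directory Administrator account>",
        Size="Small",
        VpcSettings={
            "VpcId": "vpc-11111111",
            "SubnetIds": ["subnet-aaaaaaaa", "subnet-bbbbbbbb"],
        },
    )
    print(response["DirectoryId"])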

    Managing Directories
    Let's take a look at the management features that are available for the AD Connector and Simple AD. The Console shows me a list of all of my directories:

    I can dive into the details with a click. As you can see at the bottom of this screen, I can also create a public endpoint for my directory. This will allow it to be used for sign-in to AWS applications such as Zocalo and WorkSpaces, and to the AWS Management Console:

    I can also configure the AWS applications and the Console to use the directory:

    I can also create, restore, and manage snapshot backups of my Simple AD (backups are done automatically every 24 hours; I can also initiate a manual backup at any desired time):

    Get Started Today
    Both types of directory are available now and you can start creating and using them today in the US East (Northern Virginia), US West (Oregon), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Europe (Ireland) Regions. Prices start at $0.05 per hour for Small directories of either type and $0.15 per hour for Large directories of either type in the US East (Northern Virginia) Region. See the AWS Directory Service page for pricing information in the other AWS Regions.

    -- Jeff;

  • CloudFront Update - Trends, Metrics, Charts, More Timely Logs

    21 Oct 2014 in Amazon CloudFront | permalink

    The Amazon CloudFront team has added a slew of analytics and reporting features this year. I would like to recap a pair of recent releases and then introduce you to the features that we are releasing today. As you probably know, CloudFront is a content delivery web service that integrates with the other parts of AWS for easy and efficient low-latency delivery of content to end users.

    CloudFront Usage Charts
    We launched a set of CloudFront Usage Charts back in March. The charts let you track trends in data transfer and requests (both HTTP and HTTPS) for each of your active CloudFront web distributions. Data is shown with daily or hourly granularity. These charts are available to you at no extra charge. You don't have to make any changes to your distribution in order to collect the data or to view the charts. Here is a month's worth of data for one of my distributions:

    You can easily choose the distribution of interest, the desired time period and the reporting granularity:

    You can also narrow down the reports by billing region:

    Operational Metrics
    Earlier this month CloudFront began to publish a set of Operational Metrics to Amazon CloudWatch. These metrics are published every minute and reflect activity that's just a few minutes old, giving you information that is almost real-time in nature. As is the case with any CloudWatch metric, you can display and alarm on any of the items. The following metrics are available for each of your distributions:

    • Requests - Number of requests for all HTTP methods and for both HTTP and HTTPS requests.
    • BytesDownloaded - Number of bytes downloaded by viewers for GET, HEAD, and OPTIONS requests.
    • BytesUploaded - Number of bytes uploaded to the origin with CloudFront using POST and PUT requests.
    • TotalErrorRate - Percentage of all requests for which the HTTP status code is 4xx or 5xx.
    • 4xxErrorRate - Percentage of all requests for which the HTTP status code is 4xx.
    • 5xxErrorRate - Percentage of all requests for which the HTTP status code is 5xx.

    The first three metrics are absolute values and make the most sense when you view the Sum statistic. For example, here is the hourly request rate for my distribution:

    The other three metrics are percentages and the Average statistic is appropriate. Here is the error rate for my distribution (I had no idea that it was so high and need to spend some time investigating):

    Once I track this down (a task that will have to wait until after AWS re:Invent), I will set an Alarm as follows:

    The metrics are always delivered to the US East (Northern Virginia) Region; you'll want to make sure that it is selected in the Console's drop-down menu. Metrics are not emitted for a distribution that has no traffic, so a metric may not appear in CloudWatch until the distribution receives requests.
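
    Here is a sketch of what such an alarm could look like in code, using boto3; the distribution ID, threshold, and SNS topic are placeholders, and the metrics are published under the AWS/CloudFront namespace with DistributionId and Region dimensions:

    import boto3

    # CloudFront metrics are delivered to CloudWatch in the US East (Northern Virginia) Region.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="cloudfront-high-error-rate",
        Namespace="AWS/CloudFront",
        MetricName="TotalErrorRate",
        Dimensions=[
            {"Name": "DistributionId", "Value": "E1234567890ABC"},   # placeholder
            {"Name": "Region", "Value": "Global"},
        ],
        Statistic="Average",               # the error-rate metrics are percentages
        Period=300,
        EvaluationPeriods=2,
        Threshold=5.0,                     # alarm when more than 5% of requests fail
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts"],  # placeholder
    )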

    New - More Timely Logs
    Today we are improving the timeliness of the CloudFront logs. There are two aspects to this change. First, we are increasing the frequency with which CloudFront delivers log files to your Amazon Simple Storage Service (S3) bucket. Second, we are reducing the delay between data collection and data delivery. With these changes, the newest log files in your bucket will reflect events that have happened as recently as an hour ago.

    We have also improved the batching model as part of this release. As a result, many applications will see fewer files now than they did in the past, despite the increased delivery frequency.

    New - Cache Statistics & Popular Objects Report
    We are also launching a set of new Cache Statistics reports today. These reports are based on the entries in your log files and are available on a per-distribution and all-distribution basis, with day-level granularity for any time frame within a 60-day period and hour-level granularity for any 14-day interval within the same 60-day period. These reports allow filtering by viewer location. You can, for example, filter by continent in order to gain a better understanding of traffic characteristics that are dependent on the geographic location of your viewers.

    The following reports are available:

    • Total Requests - This report shows the total number of requests for all HTTP status codes and all methods.
    • Percentage of Viewer Requests by Result Type - This report shows cache hits, misses, and errors as percentages of total viewer requests.
    • Bytes Transferred to Viewers - This report shows the total number of bytes that CloudFront served to viewers in response to all requests for all HTTP methods. It also shows the number of bytes served to viewers for objects that were not in the edge cache (CloudFront node) at the time of the request. This is a good approximation for the number of bytes transferred from the origin.
    • HTTP Status Codes - This report shows the number of viewer requests by HTTP status code (2xx, 3xx, 4xx, and 5xx).
    • Unfinished GET Requests - This report shows GET requests that didn't finish downloading the requested object, as a percentage of total requests.

    Here are the reports:

    The new Popular Objects report shows request count, cache hit and cache miss counts, as well as error rates for the 50 most popular objects during the specified period. This helps you understand which content is most popular among your viewers, or identify any issues (such as high error rates) with your most requested objects. Here's a sample report from one of my distributions:

    Available Now
    The new reports and the more timely logs are available now. Data is collected in all public AWS Regions.

    -- Jeff;

    If you want to learn even more about these cool new features, please join us at 10:00 AM (PT) on November 20th for our Introduction to CloudFront Reporting Features webinar.

  • CloudWatch Update - Enhanced Support for Windows Log Files

    Earlier this year, we launched a log storage and monitoring feature for Amazon CloudWatch. As a quick recap, this feature allows you to upload log files from your Amazon Elastic Compute Cloud (EC2) instances to CloudWatch, where they are stored durably and easily monitored for specific symbols or messages.

    The EC2Config service runs on Microsoft Windows instances on EC2 and takes on a number of important tasks. For example, it is responsible for uploading log files to CloudWatch. Today we are enhancing this service with support for Windows Performance Counter data and ETW (Event Tracing for Windows) logs. We are also adding support for custom log files.

    In order to use this feature, you must enable CloudWatch logs integration and then tell it which files to upload. You can do this from the instance by running EC2Config and checking Enable CloudWatch Logs integration:

    The file %PROGRAMFILES%\Amazon\Ec2ConfigService\Settings\AWS.EC2.Windows.CloudWatch.json specifies the files to be uploaded.

    To learn more about how this feature works and how to configure it, head on over to the AWS Application Management Blog and read about Using CloudWatch Logs with Amazon EC2 Running Microsoft Windows Server.

    -- Jeff;

  • Speak to Amazon Kinesis in Python

    My colleague Rahul Patil sent me a nice guest post. In the post Rahul shows you how to use the new Kinesis Client Library (KCL) for Python developers.

    -- Jeff;


    The Amazon Kinesis team is excited to release the Kinesis Client Library (KCL) for Python developers! Developers can use the KCL to build distributed applications that process streaming data reliably at scale. The KCL takes care of many of the complex tasks associated with distributed computing, such as load-balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to changes in stream volume.

    You can download the KCL for Python from GitHub or PyPI.

    Getting Started
    Once you are familiar with key concepts of Kinesis and KCL, you are ready to write your first application. Your code has the following duties:

    1. Set up application configuration parameters.
    2. Implement a record processor.

    The application configuration parameters are specified by adding a properties file. For example:

    # The python executable script 
    executableName = sample_kclpy_app.py
    
    # The name of an Amazon Kinesis stream to process.
    streamName = words
    
    # Unique KCL application name
    applicationName = PythonKCLSample
    
    # Read from the beginning of the stream
    initialPositionInStream = TRIM_HORIZON
    

    The above example configures KCL to process a Kinesis stream called "words" using the record processor supplied in sample_kclpy_app.py. The unique application name is used to coordinate amongst workers running on multiple instances.

    Developers have to implement the following three methods in their record processor:

    initialize(self, shard_id)
    process_records(self, records, checkpointer)
    shutdown(self, checkpointer, reason)
    

    initialize() and shutdown() are self-explanatory; they are called once in the lifecycle of the record processor to initialize and clean up the record processor respectively. If the shutdown reason is TERMINATE (because the shard has ended due to split/merge operations), then you must also take care to checkpoint all of the processed records.

    You implement the record processing logic inside the process_records() method. The code should loop through the batch of records and checkpoint at the end of the call. When you checkpoint, the KCL assumes that all of the records delivered so far have been processed. In the event the worker fails, the checkpointing information is used by KCL to restart the processing of the shard at the last checkpointed record.

    # Process records and checkpoint at the end of the batch.
    # This method belongs to your record processor class; it requires
    # 'import base64' at the top of the module.
    def process_records(self, records, checkpointer):
        for record in records:
            # record data is base64 encoded
            data = base64.b64decode(record.get('data'))

            ####################################
            # Insert your processing logic here #
            ####################################

        # checkpoint after you are done processing the batch
        checkpointer.checkpoint()

    The KCL connects to the stream, enumerates shards, and instantiates a record processor for each shard. It pulls data records from the stream and pushes them into the corresponding record processor. The record processor is also responsible for checkpointing processed records.

    Since each record processor is associated with a unique shard, multiple record processors can run in parallel. To take advantage of multiple CPUs on the machine, each Python record processor runs in a separate process. If you run the same KCL application on multiple machines, the record processors will be load-balanced across these machines. This way, KCL enables you to seamlessly change machine types or alter the size of the fleet.

    Running the Sample
    The release also comes with a sample word counting application. Navigate to the amazon_kclpy directory and install the package.

    $ python setup.py download_jars
    $ python setup.py install
    

    A sample putter is provided to create a Kinesis stream called "words" and put random words into that stream. To start the sample putter, run:

    $ sample_kinesis_wordputter.py --stream words -p 1 -w cat -w dog -w bird
    

    You can now run the sample Python application that processes records from the stream we just created:

    $ amazon_kclpy_helper.py --print_command --java <path-to-java> --properties samples/sample.properties
    

    Before running the samples, you'll want to make sure that your environment is configured to allow the samples to use your AWS credentials via the default AWS Credentials Provider Chain.

    Under the Hood - What You Should Know
    KCL for Python uses KCL for Java. We have implemented a Java-based daemon, called the MultiLangDaemon, that does all the heavy lifting. Our approach has the daemon spawn a sub-process, which in turn runs the record processor, which can be written in any language. The MultiLangDaemon process and the record processor sub-process communicate with each other over STDIN and STDOUT using a defined protocol. There is a one-to-one correspondence among record processors, child processes, and shards. For Python developers specifically, we have abstracted these implementation details away and expose an interface that enables you to focus on writing record processing logic in Python. This approach enables KCL to be language agnostic, while providing identical features and a similar parallel processing model across all languages.

    Join the Kinesis Team
    The Amazon Kinesis team is looking for talented Web Developers and Software Development Engineers to push the boundaries of stream data processing! Here are some of our open positions:

    -- Rahul Patil

  • Next Generation Genomics With AWS

    21 Oct 2014 in Big Data, Genomics | permalink

    My colleague Matt Wood wrote a great guest post to announce new support for one of our genomics partners.

    -- Jeff;


    I am happy to announce that AWS will be supporting the work of our partner, Seven Bridges Genomics, who has been selected as one of the National Cancer Institute (NCI) Cancer Genomics Cloud Pilots. The cloud has become the new normal for genomics workloads, and AWS has been actively involved since the earliest days, from being the first cloud vendor to host the 1000 Genomes Project, to newer projects like designing synthetic microbes and developing novel genomics algorithms that work at population scale. The NCI Cancer Genomics Cloud Pilots are focused on how the cloud has the potential to be a game changer in terms of scientific discovery and innovation in the diagnosis and treatment of cancer.

    The NCI Cancer Genomics Cloud Pilots will help address a problem in cancer genomics that is all too familiar to the wider genomics community: data portability. Today's typical research workflow involves downloading large data sets (such as the previously mentioned 1000 Genomes Project or The Cancer Genome Atlas (TCGA)) to on-premises hardware, and running the analysis locally. Genomic datasets are growing at an exponential rate and becoming more complex as phenotype-genotype discoveries are made, making the current workflow slow and cumbersome for researchers. This data is difficult to maintain locally and share between organizations. As a result, genomic research and collaborations have become limited by the available IT infrastructure at any given institution.

    The NCI Cancer Genomics Cloud Pilots will take the natural next step to solve this problem by bringing the computation to where the data is, rather than the other way around. The goal of the NCI Cancer Genomics Cloud Pilots is to create cloud-hosted repositories for cancer genome data that reside alongside the tools, algorithms, and data analysis pipelines needed to make use of the data. These Pilots will provide ways to provision computational resources within the cloud so that researchers can analyze the data in place. By co-locating data in the cloud with the necessary interface, algorithms, and self-provisioned resources, these Pilots will remove barriers to entry, allowing researchers to more easily participate in cancer research and accelerating the pace of discovery. This means more life-saving discoveries, such as better ways to diagnose stomach cancer, or the identification of novel mutations in lung cancer that allow for new drug targets.

    The Pilots will also allow cancer researchers to provision compute clusters that change as their research needs change. They will have the necessary infrastructure to support their research when they need it, rather than make a guess at the resources that they will need in the future every time grant writing season starts. They will also be able to ask many more novel questions of the data, now that they are no longer constrained by a static set of computational resources.

    Finally, the NCI Cancer Genomics Pilots will help researchers collaborate. When data sets are publicly shared, it becomes simple to exchange and share all the tools necessary to reproduce and expand upon another lab's work. Other researchers will then be able to leverage that software within the community, or perhaps even in an unrelated field of study, resulting in even more ideas being generated.

    Since 2009, Seven Bridges Genomics has developed a platform to allow biomedical researchers to leverage AWS's cloud infrastructure to focus on their science rather than managing computational resources for storage and execution. Additionally, Seven Bridges has developed security measures to ensure compliance with the Health Insurance Portability and Accountability Act (HIPAA) for all data stored in the cloud. For the NCI Cancer Genomics Cloud Pilots, the team will adapt the platform to meet the specific needs of the cancer research community as those needs develop over the course of the Pilot. If you are interested in following the work being done by Seven Bridges Genomics or giving feedback as their work on the NCI Cancer Genomics Cloud Pilots progresses, you can do so here.

    We look forward to the journey ahead with Seven Bridges Genomics. You can learn more about AWS and Genomics here.

    -- Matt Wood, General Manager, Data Science