Tag: AWS Lambda

How Eliza Corporation Moved Healthcare Data to the Cloud

by NorthBay Solutions

This is a guest post by Laxmikanth Malladi, Chief Architect at NorthBay. NorthBay is an AWS Advanced Consulting Partner and an AWS Big Data Competency Partner.

“Pay-for-performance” in healthcare pays providers more to keep the people under their care healthier. This is a departure from fee-for-service where payments are for each service used. Pay-for-performance arrangements provide financial incentives to hospitals, physicians, and other healthcare providers to carry out improvements and achieve optimal outcomes for patients.

Eliza Corporation, a company that focuses on health engagement management, acts on behalf of healthcare organizations such as hospitals, clinics, pharmacies, and insurance companies. This allows them to engage people at the right time, with the right message, and in the right medium. By meeting them where they are in life, Eliza can capture relevant metrics and analyze the overall value provided by healthcare.

Eliza analyzes more than 200 million outreaches per year, primarily through outbound phone calls with interactive voice responses (IVR) and other channels. For Eliza, outreach results are the questions and responses that form a decision tree, with each question and response captured as a pair:

<question, response>: <“Did you visit your physician in the last 30 days?” , “Yes”>

This type of data is characteristic of and distinctive to Eliza, and it poses challenges in processing and analysis. For example, you can’t use a table with fixed columns to store the data.

The majority of data at Eliza takes the form of outreach results captured as a set of <attribute> and <attribute value> pairs. Other datasets at Eliza include structured data identifying the members to target for outreach. This data is received from various systems, including customer systems, claims data, pharmacy data, electronic medical records (EMR/EHR) data, and enrichment data. The data Eliza deals with to keep the business running involves considerable variety and quality considerations.

NorthBay was chosen as the big data partner to architect and implement a data infrastructure to improve the overall performance of Eliza’s process. NorthBay architected a data lake on AWS for Eliza’s use case and implemented the majority of the data lake components by following the best practice recommendations from the AWS white paper “Building a Data Lake on AWS.”

In this post, I discuss some of the practical challenges faced during the implementation of the data lake for Eliza and the corresponding details of the ways we solved these issues with AWS. The challenges we faced involved the variety of data and a need for a common view of the data.

Data transformation

This section highlights some of the transformations done to overcome the challenges related to data obfuscation, cleansing, and mapping.

The following architecture depicts the flow for each of these processes.


  • The Amazon S3 manifest file or time-based event triggers an AWS Lambda function.
  • The Lambda function launches an AWS Data Pipeline orchestration process passing the relevant parameters.
  • The Data Pipeline process creates a transient Amazon EMR resource and submits the appropriate Hadoop job.
  • The Hadoop job is configured to read the relevant metadata tables from Amazon DynamoDB and to use AWS KMS for encrypt/decrypt operations.
  • Using the metadata, the Hadoop job transforms the input data to put results in the appropriate S3 location.
  • When the Hadoop job is complete, an Amazon SNS topic is notified for further processing.
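A minimal sketch of the first two steps, assuming the pipeline ID is supplied through an environment variable; the parameter ID and bucket layout here are illustrative, not taken from Eliza’s implementation:

```python
import os

def build_parameter_values(bucket, key):
    """Build the parameterValues payload for activate_pipeline. The
    parameter ID 'myInputS3Loc' is illustrative; use the IDs declared
    in your own pipeline definition."""
    return [{"id": "myInputS3Loc",
             "stringValue": "s3://{}/{}".format(bucket, key)}]

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    datapipeline = boto3.client("datapipeline")
    # Each S3 record in the event identifies a newly arrived manifest file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        datapipeline.activate_pipeline(
            pipelineId=os.environ["PIPELINE_ID"],
            parameterValues=build_parameter_values(bucket, key))
```

The same handler serves time-based triggers by falling back to a default input location when the event carries no S3 records.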


Building Event-Driven Batch Analytics on AWS

by Karthik Sonti

Karthik Sonti is a Senior Big Data Architect with AWS Professional Services

Modern businesses typically collect data from internal and external sources at various frequencies throughout the day. These data sources could be franchise stores, subsidiaries, or new systems integrated as a result of merger and acquisitions.

For example, a retail chain might collect point-of-sale (POS) data from all franchise stores three times a day to get insights into sales as well as to identify the right staffing level at a given time in any given store. As each franchise functions as an independent business, the format and structure of the data might not be consistent across the board. Depending on the geographical region, each franchise might provide data at a different frequency, and the analysis of these datasets must wait until all the required data has arrived from the individual franchises (making the process event-driven). In most cases, the individual data volumes received from each franchise are small, but the velocity of the data being generated and the collective volume can be challenging to manage.

In this post, I walk you through an architectural approach as well as a sample implementation on how to collect, process, and analyze data for event-driven applications in AWS.


The architecture diagram below depicts the components and the data flow needed for an event-driven batch analytics system. At a high level, this architecture approach leverages Amazon S3 for storing source, intermediate, and final output data; AWS Lambda for intermediate file level ETL and state management; Amazon RDS as the state persistent store; Amazon EMR for aggregated ETL (heavy lifting, consolidated transformation, and loading engine); and Amazon Redshift as the data warehouse hosting data needed for reporting.

In this architecture, each location on S3 stores data at a certain state of transformation. When new data is placed at a specific location, an S3 event is raised that triggers a Lambda function responsible for the next transformation in the chain. You can use this event-driven approach to create sophisticated ETL processes, and to syndicate data availability at a given point in the chain.
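A sketch of one link in that chain, assuming hypothetical stage prefixes (`raw/`, `validated/`, `conformed/`) rather than any layout from the post:

```python
import urllib.parse

# Map each stage prefix to the next stage in the chain. These prefix
# names are illustrative, not taken from the post.
NEXT_STAGE = {"raw/": "validated/", "validated/": "conformed/"}

def next_stage_key(key):
    """Return the S3 key for the next stage of transformation, or None
    if the object is already at the final stage."""
    for prefix, nxt in NEXT_STAGE.items():
        if key.startswith(prefix):
            return nxt + key[len(prefix):]
    return None

def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        target = next_stage_key(key)
        if target is None:
            continue
        # Placeholder transformation: copy the object to the next stage
        # prefix. A real function would validate or convert the data and
        # record its state in the RDS store before writing.
        s3.copy_object(Bucket=bucket, Key=target,
                       CopySource={"Bucket": bucket, "Key": key})
```

Writing to the next prefix raises the next S3 event, so each Lambda function only has to know its own step, not the whole chain.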



Processing VPC Flow Logs with Amazon EMR

by Michael Wallman

Michael Wallman is a senior consultant with AWS ProServ

It’s easy to understand network patterns in small AWS deployments where software stacks are well defined and managed. But as teams and usage grow, it gets harder to understand which systems communicate with each other, and on what ports. This often results in overly permissive security groups.

In this post, I show you how to gain valuable insight into your network by using Amazon EMR and Amazon VPC Flow Logs. The walkthrough implements a pattern often found in network equipment called ‘Top Talkers’, an ordered list of the heaviest network users, but the model can also be used for many other types of network analysis. Customers have successfully used this process to lock down security groups, analyze traffic patterns, and create network graphs.

VPC Flow Logs

VPC Flow Logs enables the capture of IP information flowing to and from network interfaces within a VPC. Each Flow Log record captures the classic 5-tuple (source address and port, destination address and port, and protocol) for an Internet Protocol (IP) flow, along with additional metadata:

version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status

VPC Flow Logs can be enabled on a single network interface, on a subnet, or on an entire VPC. When enabled on a VPC, log collection begins on all network interfaces within that VPC. In large deployments of tens of thousands of instances, Flow Logs can easily generate terabytes of compressed log data per hour!
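In production the aggregation runs as a Hadoop job on EMR, but the Top Talkers logic itself is small. A pure-Python sketch over the record format shown above (the sample values in the test are made up):

```python
from collections import Counter

# Field names follow the Flow Log record format, in order.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

def parse_record(line):
    """Split one whitespace-delimited Flow Log record into a dict."""
    return dict(zip(FIELDS, line.split()))

def top_talkers(lines, n=10):
    """Return the n (srcaddr, dstaddr) pairs that transferred the most
    bytes, heaviest first."""
    totals = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec.get("log_status") != "OK":  # skip NODATA / SKIPDATA records
            continue
        totals[(rec["srcaddr"], rec["dstaddr"])] += int(rec["bytes"])
    return totals.most_common(n)
```

On EMR, the same grouping and summing would be expressed as a Hive query or a map-reduce job over the log files in S3.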

To process this data at scale, this post takes you through the steps in the following graphic.


Data Lake Ingestion: Automatically Partition Hive External Tables with AWS

by Songzhi Liu

Songzhi Liu is a Professional Services Consultant with AWS

The data lake concept has become more and more popular among enterprise customers because it collects data from different sources and stores it where it can be easily combined, governed, and accessed.

On the AWS cloud, Amazon S3 is a good candidate for a data lake implementation, providing durable, large-scale data storage. Amazon EMR provides transparent scalability and seamless compatibility with many big data applications on Hadoop. However, no matter what kind of storage or processing is used, data must be defined.

In this post, I introduce a simple data ingestion and preparation framework based on AWS Lambda, Amazon DynamoDB, and Apache Hive on EMR for data from different sources landing in S3. Hive cannot detect new partitions on its own as data lands, so this solution registers them automatically, letting Hive pick up new partitions as data is loaded into S3.

Apache Hive

Hive is a great choice because it serves as a general data-interfacing language, thanks to its well-designed Metastore and related projects such as HCatalog. Many other Hadoop applications, such as Pig, Spark, and Presto, can leverage the schemas defined in Hive.

Moreover, external tables make Hive a great data definition language to define the data coming from different sources on S3, such as streaming data from Amazon Kinesis, log files from Amazon CloudWatch and AWS CloudTrail, or data ingested using other Hadoop applications like Sqoop or Flume.

To maximize the efficiency of data organization in Hive, you should leverage external tables and partitioning. By properly partitioning the data, you can greatly reduce the amount of data that needs to be retrieved and improve efficiency during ETL and other types of analysis.

Solving the problem with AWS services

For many of the aforementioned services or applications, data is loaded periodically, as in one batch every 15 minutes. Because Hive external tables don’t pick up new partitions automatically, you need to update and add new partitions manually; this is difficult to manage at scale. A framework based on Lambda, DynamoDB, and S3 can assist with this challenge.
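The core of such a framework is mapping a new S3 key to the partition it belongs to. A sketch under an assumed key layout of `<table>/<yyyy>/<mm>/<dd>/<hh>/<file>` (the layout, table names, and partition columns are illustrative):

```python
import re

# Assumed S3 key layout: <table>/<yyyy>/<mm>/<dd>/<hh>/<file>.
# Adjust the pattern to match your own ingestion paths.
KEY_PATTERN = re.compile(
    r"^(?P<table>[^/]+)/(?P<year>\d{4})/(?P<month>\d{2})"
    r"/(?P<day>\d{2})/(?P<hour>\d{2})/")

def partition_ddl(bucket, key):
    """Build the Hive statement that registers the partition for a newly
    landed S3 object, or return None if the key doesn't match the layout."""
    m = KEY_PATTERN.match(key)
    if not m:
        return None
    t, y, mo, d, h = m.group("table", "year", "month", "day", "hour")
    location = "s3://{}/{}/{}/{}/{}/{}".format(bucket, t, y, mo, d, h)
    return ("ALTER TABLE {t} ADD IF NOT EXISTS PARTITION "
            "(year='{y}', month='{m}', day='{d}', hour='{h}') "
            "LOCATION '{l}'").format(t=t, y=y, m=mo, d=d, h=h, l=location)
```

In the framework, a Lambda function triggered by the S3 event would record the partition in DynamoDB for de-duplication and then run the generated statement on the EMR cluster.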

Architectural diagram

As data is ingested from different sources to S3, new partitions are added by this framework and become available in the predefined Hive external tables.


Simplify Management of Amazon Redshift Snapshots using AWS Lambda

by Ian Meyers

Ian Meyers is a Solutions Architecture Senior Manager with AWS

Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. A cluster is automatically backed up to Amazon S3 by default, and three automatic snapshots of the cluster are retained for 24 hours. You can also convert these automatic snapshots to ‘manual’ snapshots, which are retained until you delete them. Snapshots are incremental, storing only the changes made since the last snapshot was taken, which makes them very space efficient.

You can restore manual snapshots into new clusters at any time, or you can use them to do table restores, without having to use any third-party backup/recovery software. (For an overview of how to build systems that use disaster recovery best practices, see the AWS white paper Using AWS for Disaster Recovery.)

When creating cluster backups for a production system, you must carefully consider two dimensions:

  • RTO: Recovery Time Objective. How long does it take to recover from a disaster scenario?
  • RPO: Recovery Point Objective. When you have recovered, to what point in time will the system be consistent?

Recovery Time Objective

When using Amazon Redshift, your RTO is determined by the node type you are using, how many of those nodes you have, and the size of the data they store. It is vital that you practice restoring from snapshots created on the cluster to determine your recovery time objective accurately. It is also important to re-test restore performance any time you resize the cluster or your data volume changes significantly.

Recovery Point Objective

Automated backups are triggered based on a threshold of blocks changed or after a certain amount of time. For a cluster with minimal changes to data, a backup is taken approximately every 8 hours. For a cluster that churns a massive amount of data, backups can be taken several times per hour. If your data churn rate isn’t triggering automated backups at a frequency that satisfies your RPO, you can use this utility to supplement the automated backups with additional manual snapshots and guarantee the targeted RPO.
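A minimal sketch of such a supplemental snapshot, assuming a scheduled (e.g. hourly CloudWatch Events) trigger that passes the cluster identifier in the event; the naming convention is illustrative:

```python
import datetime

def snapshot_identifier(cluster_id, now=None):
    """Snapshot names must be unique, so suffix the cluster name with a
    UTC timestamp. The '-rpo-' convention is illustrative."""
    now = now or datetime.datetime.utcnow()
    return "{}-rpo-{}".format(cluster_id, now.strftime("%Y%m%d-%H%M%S"))

def lambda_handler(event, context):
    """Invoked on a schedule to supplement Redshift's automated snapshots."""
    import boto3  # available in the Lambda runtime
    redshift = boto3.client("redshift")
    cluster_id = event["ClusterIdentifier"]
    redshift.create_cluster_snapshot(
        SnapshotIdentifier=snapshot_identifier(cluster_id),
        ClusterIdentifier=cluster_id)
```

A production version would also prune old manual snapshots past a retention window, since manual snapshots are retained (and billed) until deleted.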

What’s New?


Real-time in-memory OLTP and Analytics with Apache Ignite on AWS

by Babu Elumalai

Babu Elumalai is a Solutions Architect with AWS

Organizations are generating tremendous amounts of data, and they increasingly need tools and systems that help them use this data to make decisions. The data has both immediate value (for example, trying to understand how a new promotion is performing in real time) and historic value (trying to understand the month-over-month revenue of launched offers on a specific product).

The Lambda architecture (not AWS Lambda) helps you gain insight into immediate and historic data by having a speed layer and a batch layer. You can use the speed layer for real-time insights and the batch layer for historical analysis.

In this post, we’ll walk through how to:

  1. Build a Lambda architecture using Apache Ignite
  2. Use Apache Ignite to perform ANSI SQL on real-time data
  3. Use Apache Ignite as a cache for online transaction processing (OLTP) reads

To illustrate these approaches, we’ll discuss a simple order-processing application. We will extend the architecture to implement analytics pipelines and then look at how to use Apache Ignite for real-time analytics.

A classic online application

Let’s assume that you’ve built a system to handle the order-processing pipeline for your organization. You have an immutable stream of order documents that are persisted in the OLTP data store. You use Amazon DynamoDB to store the order documents coming from the application.

Below is an example order payload for this system:

{'BillAddress': '5719 Hence Falls New Jovannitown  NJ 31939', 'BillCity': 'NJ', 'ShipMethod': '1-day', 'UnitPrice': 14, 'BillPostalCode': 31939, 'OrderQty': 1, 'OrderDate': 20160314050030, 'ProductCategory': 'Healthcare'}

{'BillAddress': '89460 Johanna Cape Suite 704 New Fionamouth  NV 71586-3118', 'BillCity': 'NV', 'ShipMethod': '1-hour', 'UnitPrice': 3, 'BillPostalCode': 71586, 'OrderQty': 1, 'OrderDate': 20160314050030, 'ProductCategory': 'Electronics'}

Here is example code that I used to generate sample order data like the preceding and write the sample orders into DynamoDB.
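A generator of that shape might look like the following sketch; the value pools, placeholder address, and the `OrderId` partition key (which the sample payloads don’t show) are assumptions:

```python
import random

# Value pools are illustrative; the address fields are placeholders
# rather than realistic generated addresses.
CATEGORIES = ["Healthcare", "Electronics", "Books", "Apparel"]
SHIP_METHODS = ["1-hour", "1-day", "2-day"]

def make_order(rng, order_date):
    """Generate one order document shaped like the sample payloads above."""
    return {
        # Hypothetical partition key, not shown in the sample payloads.
        "OrderId": "order-{:08d}".format(rng.randrange(10**8)),
        "BillAddress": "123 Example St Anytown NV 71586",
        "BillCity": "NV",
        "ShipMethod": rng.choice(SHIP_METHODS),
        "UnitPrice": rng.randint(1, 100),
        "BillPostalCode": 71586,
        "OrderQty": rng.randint(1, 5),
        "OrderDate": order_date,
        "ProductCategory": rng.choice(CATEGORIES),
    }

def write_orders(table_name, count, order_date=20160314050030):
    """Batch-write sample orders into an existing DynamoDB table."""
    import boto3  # assumes credentials and region are configured
    table = boto3.resource("dynamodb").Table(table_name)
    rng = random.Random()
    with table.batch_writer() as batch:
        for _ in range(count):
            batch.put_item(Item=make_order(rng, order_date))
```

`batch_writer` handles batching and retrying of unprocessed items, which matters when loading large volumes of sample data.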

The illustration following shows the current architecture for this example.


From SQL to Microservices: Integrating AWS Lambda with Relational Databases

by Bob Strahan

Bob Strahan is a Senior Consultant with AWS Professional Services

AWS Lambda has emerged as an excellent compute platform for modern microservices architectures, driving dramatic advancements in flexibility, resilience, scale, and cost-effectiveness. Many customers can take advantage of this transformational technology from within their existing relational database applications. In this post, we explore how to integrate your Amazon EC2-hosted Oracle or PostgreSQL database with AWS Lambda, allowing your database application to use a microservices architecture.

Here are a few reasons why you might find this capability useful:

  • Instrumentation: Use database triggers to call a Lambda function when important data is changed in the database. Your Lambda function can easily integrate with Amazon CloudWatch, allowing you to create custom metrics, dashboards and alarms based on changes to your data.
  • Outbound streaming: Again, use triggers to call Lambda when key data is modified. Your Lambda function can post messages to other AWS services such as Amazon SQS, Amazon SNS, Amazon SES, or Amazon Kinesis Firehose, to send notifications, trigger external workflows, or to push events and data to downstream systems, such as an Amazon Redshift data warehouse.
  • Access external data sources: Call Lambda functions from within your SQL code to retrieve data from external web services, read messages from Amazon Kinesis streams, query data from other databases, and more.
  • Incremental modernization: Improve agility, scalability, and reliability, and eliminate database vendor lock-in by evolving in steps from an existing monolithic database design to a well-architected, modern microservices approach. You can use a microservices architecture to migrate business logic embodied in database procedures into database-agnostic Lambda functions while preserving compatibility with remaining SQL packages.

I’ll revisit these scenarios in Part 2, but first you need to establish the interface that enables SQL code to invoke Lambda functions.
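Whatever language hosts the database-side trigger, the interface reduces to serializing the SQL-side arguments to JSON and calling Lambda’s Invoke API. A sketch of that helper in Python (the function name and arguments in the example are hypothetical):

```python
import json

def build_invoke_args(function_name, **kwargs):
    """Serialize SQL-side arguments into the keyword arguments for the
    Lambda Invoke API (boto3's lambda client.invoke)."""
    return {
        "FunctionName": function_name,
        "InvocationType": "Event",  # async, so the SQL transaction isn't blocked
        "Payload": json.dumps(kwargs),
    }

def invoke_lambda(function_name, **kwargs):
    """Invoke a Lambda function from database-hosted code, e.g. from the
    body of a PL/Python trigger function."""
    import boto3  # assumes the database host can reach AWS APIs
    client = boto3.client("lambda")
    return client.invoke(**build_invoke_args(function_name, **kwargs))
```

The `Event` invocation type keeps the database trigger fast: the call returns as soon as Lambda accepts the request, without waiting for the function to finish.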


Analyze a Time Series in Real Time with AWS Lambda, Amazon Kinesis and Amazon DynamoDB Streams

by JustGiving

This is a guest post by Richard Freeman, Ph.D., a solutions architect and data scientist at JustGiving. JustGiving in their own words: “We are one of the world’s largest social platforms for giving that’s helped 26.1 million registered users in 196 countries raise $3.8 billion for over 27,000 good causes.”


As more devices, sensors and web servers continuously collect real-time streaming data, there is a growing need to analyze, understand and react to events as they occur, rather than waiting for a report that is generated the next day. For example, your support staff could be immediately notified of sudden peaks in traffic, abnormal events, or suspicious activities, so they can quickly take the appropriate corrective actions to minimize service downtime, data leaks or financial loss.

Traditionally, this would have gone through a data warehouse or a NoSQL database, and the data pipeline code could be custom built or based on third-party software. These models resulted in analysis that had a long propagation delay: the time between a checkout occurring and the event being available for analysis would typically be several hours. Using a streaming analytics architecture, we can provide analysis of events typically within one minute or less.

Amazon Kinesis Streams is a service that can continuously capture and store terabytes of data from hundreds or thousands of sources. This might include website clickstreams, financial transactions, social media feeds, application logs, and location-tracking events. A variety of software platforms can be used to build an Amazon Kinesis consumer application, including the Kinesis Client Library (KCL), Apache Spark Streaming, or Elastic MapReduce via Hive.

Using Lambda and DynamoDB gives you a truly serverless architecture, where all the infrastructure including security and scalability is managed by AWS. Lambda supports function creation in Java, Node.js, and Python; at JustGiving, we use Python to give us expressiveness and flexibility in building this type of analysis.

This post explains how to perform time-series analysis on a stream of Amazon Kinesis records, without the need for any servers or clusters, using AWS Lambda, Amazon Kinesis Streams, Amazon DynamoDB and Amazon CloudWatch. We demonstrate how to do time-series analysis on live web analytics events stored in Amazon Kinesis Streams and present the results in near real-time for use cases like live key performance indicators, ad-hoc analytics, and quality assurance, as used in our AWS-based data science and analytics RAVEN (Reporting, Analytics, Visualization, Experimental, Networks) platform at JustGiving.


Building a Near Real-Time Discovery Platform with AWS

by Assaf Mentzer

Assaf Mentzer is a Senior Consultant for AWS Professional Services

In the spirit of the U.S. presidential election of 2016, in this post I use Twitter public streams to analyze the candidates’ performance, both Republican and Democratic, in a near real-time fashion. I show you how to integrate AWS managed services—Amazon Kinesis Firehose, AWS Lambda (Python function), and Amazon Elasticsearch Service—to create an end-to-end, near real-time discovery platform.
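The Lambda step in such a pipeline amounts to reshaping each raw tweet into the document Amazon ES indexes. A sketch under standard Twitter API field names; the candidate keyword list is illustrative:

```python
CANDIDATE_KEYWORDS = ("trump", "clinton", "sanders", "cruz")  # illustrative list

def to_es_document(tweet):
    """Reshape a raw Twitter status object into the document to index in
    Amazon ES. Only standard Twitter API fields are assumed."""
    text = tweet.get("text", "")
    doc = {
        "text": text,
        "timestamp": tweet.get("created_at"),
        "user": tweet.get("user", {}).get("screen_name"),
        "candidates": [k for k in CANDIDATE_KEYWORDS if k in text.lower()],
    }
    coords = tweet.get("coordinates")
    if coords:  # geo-tagged tweets carry GeoJSON [longitude, latitude]
        doc["location"] = {"lon": coords["coordinates"][0],
                           "lat": coords["coordinates"][1]}
    return doc
```

Mapping the `location` field as a `geo_point` in the Elasticsearch index is what makes the Kibana map visualization of geo-tagged tweets possible.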

The following screenshot is an example of a Kibana dashboard on top of geo-tagged tweet data. This screenshot was taken during the fourth republican presidential debate (November 10th, 2015).

Kibana dashboard on top of geotagged tweet data


Using AWS Lambda for Event-driven Data Processing Pipelines

by Vadim Astakhov

Vadim Astakhov is a Solutions Architect with AWS

Some big data customers want to analyze new data in response to a specific event, and they might already have well-defined pipelines to perform batch processing, orchestrated by AWS Data Pipeline. One example of event-triggered pipelines is when data analysts must analyze data as soon as it arrives, so that they can immediately respond to partners. Scheduling is not an optimal solution in this situation. The main question is how to schedule data processing at an arbitrary time using Data Pipeline, which relies on schedulers.

Here’s a solution. First, create a simple pipeline and test it with data from Amazon S3, then add an Amazon SNS topic to notify the customer when the pipeline is finished so data analysts can review the result. Lastly, create an AWS Lambda function to activate Data Pipeline when new data is successfully committed into an S3 bucket—without managing any scheduling activity. This post will show you how.

Solution that activates Data Pipeline when new data is committed to S3

When Data Pipeline activity can be scheduled, customers can define preconditions that check whether data exists on S3 and then allocate resources. However, Lambda is a good mechanism when Data Pipeline needs to be activated at an arbitrary time.

Cloning pipelines for future use

In this scenario, the customer’s pipeline has been activated through some scheduled activity but the customer wants to be able to invoke the same pipeline in response to an ad-hoc event such as a new data commit to an S3 bucket. The customer has already developed a “template” pipeline that has reached the Finished state.

One way to re-initiate the pipeline is to keep the JSON file with the pipeline definition on S3 and use it to create a new pipeline. Some customers have multiple versions of the same pipeline stored on S3 but want to clone and reuse only the most recently executed version. A lightweight way to accommodate such a request is to get the pipeline definition from the finished pipeline and create a clone. This approach relies on the most recently executed pipeline, so the customer does not need to keep a registry of pipeline versions on S3 or track which version was executed most recently.
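The cloning step can be sketched as follows; the `@pipelineState` and `@latestRunTime` field keys follow the Data Pipeline API’s `describe_pipelines` output, and the selection helper is an assumption about how "most recently executed" is determined:

```python
def _field(desc, key):
    """Look up a key in a pipeline description's 'fields' list of
    {'key': ..., 'stringValue': ...} pairs."""
    for f in desc["fields"]:
        if f["key"] == key:
            return f.get("stringValue")
    return None

def most_recent_finished(pipeline_descriptions):
    """Pick the most recently run FINISHED pipeline from the output of
    describe_pipelines()."""
    finished = [d for d in pipeline_descriptions
                if _field(d, "@pipelineState") == "FINISHED"]
    return max(finished, key=lambda d: _field(d, "@latestRunTime") or "")

def clone_pipeline(pipeline_id, name):
    """Copy a finished pipeline's definition into a fresh pipeline and
    activate it."""
    import boto3  # available in the Lambda runtime
    dp = boto3.client("datapipeline")
    definition = dp.get_pipeline_definition(pipelineId=pipeline_id)
    new = dp.create_pipeline(name=name, uniqueId=name)
    dp.put_pipeline_definition(
        pipelineId=new["pipelineId"],
        pipelineObjects=definition["pipelineObjects"],
        parameterObjects=definition.get("parameterObjects", []),
        parameterValues=definition.get("parameterValues", []))
    dp.activate_pipeline(pipelineId=new["pipelineId"])
    return new["pipelineId"]
```

An S3-triggered Lambda function would call `most_recent_finished` on the described pipelines and pass the winner’s ID to `clone_pipeline`, so each new data commit re-runs the latest proven version.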