AWS Database Blog

EK Is the New ELK: Simplify Log Analytics by Transforming Data Natively in Amazon Elasticsearch Service

by huajames | in Elasticsearch

James Huang is a solutions architect at Amazon Web Services.

AWS recently announced support for Elasticsearch 5.1. The announcement mentioned a few important improvements in the open source software and in Amazon Elasticsearch Service (Amazon ES), the managed service. Elasticsearch 5.1 includes three notable changes: a new scripting language called Painless, significant improvements to indexing throughput, and a new ingest model, which is described in this blog post.

Many operations teams use Logstash or homegrown pipelines of data transformation scripts and tools to modify log data before it’s stored and indexed in Elasticsearch. The new ingest model makes use of pipelines and processors offered in Elasticsearch 5.1 and Amazon ES to simplify your log analytics architecture and centralize administration of your data transformation logic.

If you’re using the ELK stack, this method allows you to remove Logstash and to instead use Beats to ship logs directly to Elasticsearch for processing and indexing.

Ingestion and pipelines
In Elasticsearch 5.1, you can think of a pipeline as a factory line of processors, where each processor has a single task to perform on the data (for example, applying a grok pattern).

The following are some of the well-known processors built into Elasticsearch 5.1:

  • Grok
  • JSON
  • Date
  • Convert
  • Trim
  • Sort

For more information about processors, see the Elasticsearch Reference 5.1.

Because Amazon ES is a managed service, it does not support custom processor plugins.  However, you can use the script processor with an inline script (where your code is not referenced through a file or id) to customize logic.

Scenario
Say we’re working on some access log data where a single row of data looks like this:

10.0.0.3 GET /index.jsp 1024 0.134

In this scenario, we have an IP address, HTTP method, name of the document to be retrieved, size (in bytes), and the latency of the request.

It’s common to transform this into separate searchable fields by using a grok pattern. Before Elasticsearch 5.1, that meant using Logstash or scripting tools that modified the data into separate JSON attributes before it was sent to Elasticsearch. Now, however, you can set up a dedicated pipeline for this web access log format. The pipeline can use a grok processor.

First, you’ll need a cluster. For instructions, see Getting Started with Amazon Elasticsearch Service Domains.

When you create your domain, be sure to choose 5.1 from the Elasticsearch version drop-down box, as shown here:

CreateESDomain

After the cluster is installed, copy the DNS endpoint from the dashboard.

Set up the client and access proxy
To protect a production Elasticsearch cluster, secure it as described in the Amazon ES documentation and in the How to Control Access to Your Amazon Elasticsearch Service Domain blog post.

For testing purposes, assuming there is no sensitive data being used, you might be able to use a simple access proxy like this one.

Note: This signing proxy was developed by a third party, not AWS.  AWS is not responsible for the functioning or suitability of external content. This proxy can be used for development and testing, but is not suitable for production workloads.

To make things easier, we’ll use a Chrome plugin called Sense as our REST client.

To install the plugin, open a Chrome browser session, navigate to https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig, and then click the button to add the plugin to your browser.

Or you can navigate to the Chrome web store, search for Sense, and then add the plugin, as shown here:

Extension

Any REST client will do, but Sense understands Elasticsearch APIs and will auto-complete parts of your request for you.  Plus, it’s got fancy colors.

In the Sense client, the left side is for making requests. The right side (with a black background) displays the results of those requests.

If you’ve already installed the access proxy, you can point your Sense client to your local machine where the proxy is running.  For example, instead of https://foo.bar.amazonaws.com:9200, you would point it to http://localhost:9200, as shown here:

ServerLocalhost

Pipeline and processor

To create the pipeline, in Sense, type (or copy and paste) the following into the left side of the request window:

PUT /_ingest/pipeline/my51pipeline
{
  "description" : "my sample pipeline to show grok",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPV4:myclientip} %{WORD:httpmethod} %{URIPATHPARAM:uri_requested} %{NUMBER:sizebytes} %{NUMBER:duration}"]
      }
    }
  ]
}

Here’s what we are doing:

  1. Defining a pipeline named my51pipeline.
  2. Adding a single processor into the pipeline (in this case, the built-in processor, grok).
  3. Setting up the pattern for grok (which field values to assign to which attribute names).

For out-of-the-box grok patterns that are useful for Elastic Load Balancing and Amazon S3 logs, see https://github.com/elastic/elasticsearch/tree/master/modules/ingest-common/src/main/resources/patterns.

The format of the grok pattern is a series of %{SYNTAX:SEMANTIC} pairs.

%{IPV4:myclientip} indicates that you want grok to pull out the first column of information, which should match the regex behind the IPV4 grok pattern, and that the value should be assigned to myclientip.

If all goes well, then 10.0.0.3 will go into the attribute named myclientip upon ingest only when this pipeline and processor are used.

10.0.0.3 GET /index.jsp 1024 0.134
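
If you want to check the pattern before indexing anything, Elasticsearch also provides a simulate API for ingest pipelines. A request like the following (same Sense style, same pattern as above) should return the fields extracted from the sample line without storing anything:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{IPV4:myclientip} %{WORD:httpmethod} %{URIPATHPARAM:uri_requested} %{NUMBER:sizebytes} %{NUMBER:duration}"]
        }
      }
    ]
  },
  "docs": [
    { "_source": { "message": "10.0.0.3 GET /index.jsp 1024 0.134" } }
  ]
}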

So let’s try it.

PUT /_ingest/pipeline/my51pipeline
{
  "description" : "my sample pipeline to show grok",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPV4:myclientip} %{WORD:httpmethod} %{URIPATHPARAM:uri_requested} %{NUMBER:sizebytes} %{NUMBER:duration}"]
      }
    }
  ]
}

Copy and paste the pipeline into the left side of Sense, and then click the green triangle on the right of the entry.

On the right side of Sense, you should see:

{
  "acknowledged": true
}

So far, so good.  Let’s see the pipeline definition.

Type this into Sense:

GET /_ingest/pipeline

You should now see the pipeline you just created:

Pipeline

Now let’s feed in some data and inspect it. This first time we’re not going to use the pipeline:

PUT /logindex/accesslog/notpipelined1
{
  "message": "10.0.0.3 GET /index.jsp 1024 0.134"
}

We're creating an index in Elasticsearch called logindex and putting in some data of type accesslog. This document's ID is "notpipelined1".

Click the green triangle and get the result:

Result

Now let’s view that entry.

On the left side of Sense, type the following and then click the green triangle.

GET /logindex/accesslog/notpipelined1

Sense

The good news is that we have data in the cluster. The bad news is that the data is not split according to each field, so it is not as usable for searching and filtering.

Let’s try it again, this time with the pipeline:

PUT /logindex/accesslog/ispipelined2?pipeline=my51pipeline
{
  "message": "10.0.0.3 GET /index.jsp 1024 0.134"
}

Note: If you were using Filebeat, you could specify the pipeline name in your filebeat.yml file as follows:

output.elasticsearch:
  hosts: ["myamazonelasticsearch.amazonaws.com:9200"]
  pipeline: my51pipeline

In Sense, you’ll get an acknowledgement of the success.

Let’s observe the result. Type the following into Sense:

GET /logindex/accesslog/ispipelined2

LogIndex

You have now separated the fields, thanks to your pipeline and grok processor.

In addition to ingesting and transforming the data, you might also want to set up an index template and mappings to ensure that your data is typed properly. For more information, see Index Templates.
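
As a minimal sketch (reusing the index, type, and field names from this example; adjust them to your own), such a template might look like the following in Sense:

PUT /_template/accesslog_template
{
  "template": "logindex*",
  "mappings": {
    "accesslog": {
      "properties": {
        "myclientip": { "type": "ip" },
        "httpmethod": { "type": "keyword" },
        "uri_requested": { "type": "keyword" },
        "sizebytes": { "type": "integer" },
        "duration": { "type": "float" }
      }
    }
  }
}

Because templates are applied only when an index is created, reindex into a new index (or delete and recreate logindex) to see the typed mappings take effect.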

This new ingest model allows businesses to transform data in a centralized, native manner in Amazon Elasticsearch Service.

Log shipping and transformation alternatives
An alternative option for getting log data into Amazon ES is to send log data to Amazon CloudWatch Logs through a downloadable agent. The log data can then be streamed into Amazon ES through a Lambda function. For more information, see Streaming CloudWatch Logs Data to Amazon Elasticsearch Service.

This method is compatible with the new ingest model because you have access to the subscription Lambda function.

Log data can also be ingested through a downloadable Java agent, which can buffer that log data to an Amazon Kinesis Firehose stream. If you want to transform or enrich log data, Amazon Kinesis Firehose lets you call a Lambda function for custom processing. For more information, see Amazon Kinesis Firehose Data Transformation. The advantage of using a Lambda function for transformation is that it can call other services, such as Amazon DynamoDB, to enrich your data. The disadvantage of this method of ETL is that the transformations that Elasticsearch processors would otherwise handle must be written in your transformation Lambda function.

Happy pipelining!

Implement Linked Servers with Amazon RDS for Microsoft SQL Server

by Richard Waymire | in EC2, RDS SQL Server

Richard Waymire is a principal database specialist solutions architect for Amazon Web Services.

Linked servers allow Microsoft SQL Server to run SQL Server statements on other instances of database servers. In this blog post, we will focus on connectivity either to other instances of SQL Server in Amazon RDS or to SQL Server instances hosted in Amazon EC2. We will examine three connectivity scenarios:

  • EC2 SQL Server to RDS SQL Server
  • RDS SQL Server to EC2 SQL Server
  • RDS SQL Server to RDS SQL Server

For each of these scenarios to succeed, we assume that all network traffic is happening within a single VPC. Also, for the RDS SQL Server to RDS SQL Server scenario, we assume that neither instance is publicly available, because communications between them need to use the private IP addresses within the VPC. AWS recommends that RDS SQL Server installations generally be only privately accessible (that is, not directly exposed to the Internet).

SQL Server Management Studio (SSMS) requires system admin rights to create a linked server using the graphical interface, and those rights aren't available in an RDS environment. Unfortunately, this prevents users from even launching the dialog box that they could otherwise use to generate the needed Transact-SQL script. However, calling the stored procedure sp_addlinkedserver and then the stored procedure sp_addlinkedsrvlogin directly from a query window lets an administrator add linked servers to the RDS installation of SQL Server.

For each of the following examples, you must allow inbound network traffic on the appropriate TCP port in the security group of the target instance of SQL Server. In other words, if you're connecting EC2 SQL Server to RDS SQL Server, you must allow traffic from the IP address of the EC2 instance in the RDS instance's security group.
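
For example, to allow the EC2 instance at the private IP address used later in this post (10.0.0.135) to reach an RDS SQL Server instance on the default port (1433), you could add an inbound rule with the AWS CLI like this (the security group ID is a placeholder):

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 1433 \
    --cidr 10.0.0.135/32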

EC2 SQL Server to RDS SQL Server

For this scenario, an EC2 instance of SQL Server is connecting to an instance of RDS SQL Server. This scenario is the simplest one, because you can either use the SSMS graphical interface or just submit the Transact-SQL statements to create the linked server. The example code here is a connection to an RDS server in the cloud. Variables you should replace are shown in angle brackets. Note that in this example we are impersonating a single RDS standard user account.

EXEC master.dbo.sp_addlinkedserver @server = N'LinkedtoRDS', @srvproduct=N'', @provider=N'SQLNCLI', @datasrc=N'<myserver>.CB2XKFSFFMY7.US-WEST-2.RDS.AMAZONAWS.COM';

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'LinkedtoRDS',@useself=N'False',@locallogin=NULL,@rmtuser=N'<username>',@rmtpassword='<password>';

If you are using a common active directory (the AWS Directory Service), then you can change the linked server security to something like this:

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'LinkedtoRDS',@useself=N'True';

When the linked server is in place, you then use standard four-part naming to reference a table, view, and so on, on the remote server:

SELECT * FROM LinkedtoRDS.TestDB.dbo.t1;

If the RDS instance of SQL Server is a private instance (that is, not publicly available), then the EC2 instance needs to be in the same VPC as the RDS SQL Server instance for connectivity. If the RDS SQL Server instance is publicly available, then this scenario also works with an on-premises SQL Server installation.

RDS SQL Server to EC2 SQL Server

For this scenario, both the RDS instance of SQL Server and the EC2 SQL Server should be available within the same VPC. However, this time the EC2 instance doesn't need to be publicly accessible. Even if the EC2 instance is publicly accessible, you need to refer to its private IP address within the VPC.

InstanceIP

The code now looks like this:

EXEC master.dbo.sp_addlinkedserver @server = N'REPLTest2', @srvproduct=N'', @provider=N'SQLNCLI', @datasrc=N'10.0.0.135';

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'REPLTest2',@useself=N'False',@locallogin=NULL,@rmtuser=N'<username>',@rmtpassword='<password>';

GO

This change is required due to the current network setup of the RDS SQL Server installation. Be aware that this setup might change in the future.

RDS SQL Server to RDS SQL Server

For this scenario, both instances of RDS SQL Server must be private instances (not publicly accessible) within the same VPC. Even if the RDS SQL Server instances were publicly accessible, you would still need to refer to their private IP addresses when creating the linked servers. Because you don't have direct access to the private IP addresses of the servers, you need to use an EC2 instance in the same VPC to run nslookup on the private instances of RDS. Do this for each RDS instance name.

> NSlookup privatesql.cb2xkfsffmy7.us-west-2.rds.amazonaws.com

Address: 10.0.4.236

From there, the code is very much like the RDS to EC2 code, except now you use the private IP address of the remote RDS instance.

EXEC master.dbo.sp_addlinkedserver @server = N'RDSPrivate', @srvproduct=N'', @provider=N'SQLNCLI', @datasrc=N'10.0.4.236';

EXEC master.dbo.sp_addlinkedsrvlogin @rmtsrvname=N'RDSPrivate',@useself=N'False',@locallogin=NULL,@rmtuser=N'<username>',@rmtpassword='<password>';

GO

Just as we did preceding, use a four-part name to reference the remote RDS instance:

SELECT * from RDSPrivate.TestDB.dbo.t1;

Note one limitation for all RDS instances: If the physical server supporting your RDS SQL Server instance changes (such as when upgrading to a new version of SQL Server or changing the instance type), the private IP address of the RDS instance of SQL Server might change. This change can happen with no alerts or notifications to your administrative team. Thus, if the linked server connection fails, your first troubleshooting steps should include verifying that the private IP address of the RDS instance has not changed. If this risk is unacceptable in your environment, then AWS doesn’t recommend using the linked server feature with RDS SQL Server.

Note: RDS SQL Server currently doesn’t replicate linked servers to the mirrored database server in a Multi-AZ deployment. If the linked servers are added before the configuration is changed to add mirroring, then the linked servers are copied, but only once. Alternatively, you can create the linked servers on the primary instance, fail over to the mirrored server instance, and then create the linked servers again so that they are on both instances of RDS SQL Server.

Although the documentation states that in general linked servers are not supported, as shown in this post there are in fact several supported scenarios with linked servers for RDS SQL Server. However, these are the only supported scenarios at this time.

ElastiCache for Redis Backup and Restore with Cluster Resizing

by Darin Briskman | in ElastiCache, Redis

Clusters provide a lot of capabilities in Amazon ElastiCache for Redis. Using autosharding to distribute data across the shards of a cluster enables high performance (over 4.5 million writes/sec and 20 million reads/sec, with each operation typically under 500 µs) and high capacity (up to 3.5 TiB of data) with high reliability and with recovery of failed nodes usually within 30 seconds.

When you create an ElastiCache cluster, you set the number of shards in the cluster. Each shard has one primary node (for reads and writes) and from zero to five replica nodes (for reads and failover protection). A cluster can be as small as a single shard with zero replicas (1 node) and as large as 15 shards each with 5 replicas (90 total nodes). You also choose the instance type that will be used by all the nodes in your cluster, which range from the tiny cache.t2.micro (0.555 GiB of memory) through the m3/m4 family, to the huge cache.r3.8xlarge (237 GiB).

But what happens if you need to change these choices? How can you change to a larger or smaller number of shards or to a different instance type?

Fortunately, ElastiCache makes this easy. All you need to do is take a backup.

Backups, which consist of point-in-time snapshots of individual shards, are the backup and restore tool for ElastiCache. You can create a backup of your cluster at any time by using the AWS Management Console or the AWS CLI.

Once the backup is created, you can restore it to any ElastiCache for Redis cluster, as long as the version of the cluster is at least as current as that of the backup and the total memory in the cluster is large enough to contain the Redis keyspace stored within the backup.

ShardCluster

So, for example, if you started with a 3-shard cluster and want to expand it to a 5-shard cluster, just:

  1. Back up the 3-shard cluster.
  2. Restore the backup into a new 5-shard cluster.
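
With the AWS CLI, that flow might look roughly like the following (names, node type, and other values are placeholders; check the ElastiCache CLI reference for the full set of options required by your configuration):

aws elasticache create-snapshot \
    --replication-group-id my-3-shard-cluster \
    --snapshot-name my-cluster-backup

aws elasticache create-replication-group \
    --replication-group-id my-5-shard-cluster \
    --replication-group-description "Restored from backup with 5 shards" \
    --engine redis \
    --cache-node-type cache.r3.large \
    --num-node-groups 5 \
    --replicas-per-node-group 1 \
    --snapshot-name my-cluster-backup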

You can use this technique to easily test different cluster configurations to figure out the best combination of shards and instance types for your needs.

You can also restore snapshots taken from non-ElastiCache Redis clusters, for example if you're running Redis outside of AWS or managing it yourself on EC2. Just copy your snapshots to Amazon S3, and then use them with ElastiCache Import. ElastiCache will parse the key space within the specified snapshots and restore it into an ElastiCache for Redis cluster with the number of shards and Redis slot ranges of your choice.

ImportClusterData

Try it out!

Amazon ElastiCache: Utilizing Redis Geospatial Capabilities

by Josh Kahn | in ElastiCache, Redis

Josh Kahn is a Solutions Architect at Amazon Web Services.

Amazon ElastiCache makes it easy to deploy and manage a highly available and scalable in-memory data store in the cloud. Among the open source in-memory engines available to you for use with ElastiCache is Redis, which added powerful geospatial capabilities in version 3.2. This post explores using these capabilities to build an app that locates nearby bike share stations in Chicago. You can easily extend the app to other cities or to use other geo datasets.

Benefits of Amazon ElastiCache
You can use Amazon ElastiCache to accelerate high-volume application workloads by caching your data in memory, providing sub-millisecond data retrieval performance. When used in conjunction with any database, including Amazon RDS or Amazon DynamoDB, ElastiCache can alleviate the pressure associated with heavy request loads, increase overall application performance, and reduce the costs associated with scaling for throughput on other databases.

As an AWS managed service, Amazon ElastiCache also removes the undifferentiated heavy lifting involved with administrating Redis. There are a variety of application patterns that can be used to leverage ElastiCache with your data, some of which will be discussed in this post.

Our Bike Share Application
Bike sharing programs are becoming increasingly popular in major cities throughout the world. Our app will enable users to search for nearby Divvy bike share locations in Chicago. In this post, we will explore the associated backend services, as shown in the following image.

BikeShareApp

We’ll use the AWS Serverless Application Model (SAM) to manage and deploy our AWS resources, including supporting storage (DynamoDB and ElastiCache). SAM extends AWS CloudFormation to simplify deployment of serverless functionality, including AWS Lambda and Amazon API Gateway. SAM also eases deployment using specialized tasks available in the AWS CLI.

Reviewing the architecture above, our SAM template will create two public-facing API actions, one to prime our data store (SetupStationFunction) and one to find nearby stations (ListStationsFunctions). It will also create an ElastiCache cluster, DynamoDB table (our system of record), and DynamoDB Stream as well as a Lambda function to process that stream (StreamProcessorFunction).

Complete sample code for our example is available at https://github.com/awslabs/elasticache-geospatial-public-bikes. The README provides detailed instruction on how to deploy the application in your own environment, making use of AWS CloudFormation.

Integrating with ElastiCache
When working with ElastiCache, you can use familiar libraries to connect to and interact with your Redis and Memcached clusters. Here, we will demonstrate an integration with Redis using a popular open source library for Node.js (redis). The version of the redis library used in this sample (version 2.6.5) doesn't yet directly support Redis GEO commands, so here we make use of the generic send_command function.

Loading Bike Share Stations
Like many other bike share programs, Divvy exposes a complete listing and current availability by using a public API. We’ll use this API to load our station data into DynamoDB and, in turn, ElastiCache by using a DynamoDB Stream. If you have deployed our sample code, it includes a setup function that you can call to add station data to DynamoDB:

$ curl -L 'https://<YOUR_API_DOMAIN>/Prod/stations/setup'

As stations are added to the DynamoDB table, we use a write-through pattern in which an AWS Lambda function is triggered by a DynamoDB Stream to add the station’s geo data to ElastiCache. To store geospatial information, we use the Redis GEOADD command, which can be called in our Node.js example as follows:

const redis = require('redis');

// first, we connect to our Redis cluster
var client = redis.createClient({
  host: ELASTICACHE_HOST,
  port: ELASTICACHE_PORT
});

// next, we store the station's location
// (note that GEOADD expects longitude first, then latitude)
client.send_command('GEOADD',
  [ 'public-bikes:stations',
  longitude,
  latitude,
  stationId ], (error, reply) => {
  client.quit();
  if (error) { ... }
  else { ... }
});

Querying for Nearby Stations
The major benefit of our app is that you can use it to search for nearby bike share stations. With the station data already stored in ElastiCache, we will make use of the powerful GEORADIUS Redis command. In our Node.js sample, this is as follows:

const redis = require('redis');

// first, we'll set up a few search options
var searchOptions = {
  radius: 2,
  unit: 'mi',
  count: 10
};

// then, we connect to our Redis cluster
var client = redis.createClient({
  host: ELASTICACHE_HOST,
  port: ELASTICACHE_PORT
});

// finally, we can query for nearby stations
client.send_command('GEORADIUS',
  ['public-bikes:stations',
  longitude,
  latitude,
  searchOptions.radius,
  searchOptions.unit, // m | mi | km | ft
  'WITHDIST',
  'WITHCOORD',
  'COUNT',
  searchOptions.count ], (error, reply) => {
  client.quit();
  if (error) { ... }
  else { ... }
});

In the preceding sample, Redis will return up to 10 stations within a 2 mile radius of the current user. The response will also include the distance (in miles) and the coordinates of the station. For our example, station name, coordinates, and distance will be the extent of the data that we return to the user. You may wish to append additional data in some manner, for example, via another query against ElastiCache or DynamoDB.

If you have deployed our sample code, you can query for nearby stations utilizing the following (note the query string parameters already include valid coordinates for downtown Chicago):

$ curl -L 'https://<YOUR_API_DOMAIN>/Prod/stations?latitude=41.8802596&longitude=-87.6346818'

The resulting output will look like the following (note that we have massaged the response from Redis):

[{
  "name": "Wacker Dr & Washington St-Chicago",
  "distance": "0.2484 mi",
  "coordinates": {
    "latitude": 41.88327238502640881,
    "longitude": -87.63731449842453003
  }
}, {
  "name": "State St & Harrison St-Chicago",
  "distance": "0.5589 mi",
  "coordinates": {
    "latitude": 41.87405360416989453,
    "longitude": -87.62771755456924438
  }
},
...
]

We’ve now demonstrated how to add geospatial data to Amazon ElastiCache for Redis and query that data to identify nearby bike share stations. Now you can easily use this data in a mobile app or website.

Removing Old Stations
Finally, we will also want to add the capability to remove stations from the cache if, for example, no bikes are available or a station is moved. To remove an item from the Geo index, we can use the ZREM Redis command as shown following:

const redis = require('redis');

// first, we connect to our Redis cluster
var client = redis.createClient({
  host: ELASTICACHE_HOST,
  port: ELASTICACHE_PORT
});

// we can now remove the station from the geo set
client.send_command('ZREM',
  [ 'public-bikes:stations',
  stationId ], (error, reply) => {
  client.quit()
  if (error) { ... }
  else { ... }
});

In examining the sample code, you will notice that this code is part of our DynamoDB stream processor function. To remove a station, we remove it from the source data (in this case, DynamoDB) and the processor propagates that change to ElastiCache. Notice that our stream handler contains logic that examines the type of operation (record.eventName) and then performs the necessary action as described in the preceding sections.

Powering a Mobile App
With our API complete, we can use it to power a mobile application such as the React Native sample included in our sample code. The mobile application can pass the user's current location to our backend API, which returns a listing of nearby bike stations. The app then parses and displays these stations, along with each station's distance from the user's current location.

App

Extending Our Example
Although our goal in this post is to describe a general approach for managing and querying geospatial data in Amazon ElastiCache, you can also extend the sample code in several meaningful ways. First, you can load the full set of Divvy bike locations (note that DynamoDB's batchWrite function supports a maximum of 25 records at a time). You can also load bike stations in other cities, such as New York, San Francisco, or Melbourne, Australia. Or you can modify the sample further to build a restaurant finder or plot local points of interest.

Amazon ElastiCache for Redis can enable advanced geospatial capabilities by using the Redis 3.2 engine for numerous use cases that require low latency and high performance. Redis Geospatial capabilities are available in ElastiCache today!

Get Started with Amazon Elasticsearch Service: How Many Shards Do I Need?

by handler | in Elasticsearch

Dr. Jon Handler (@_searchgeek) is an AWS solutions architect specializing in search technologies.

Welcome to this introductory series on Elasticsearch and Amazon Elasticsearch Service (Amazon ES). In this and future blog posts, we provide the basic information that you need to get started with Elasticsearch on AWS.

How many shards?
Elasticsearch can take in large amounts of data, split it into smaller units, called shards, and distribute those shards across a dynamically changing set of instances. When you create an Elasticsearch index, you set the shard count for that index. Because you can’t change the shard count of an existing index, you have to make the decision on shard count before sending your first document. To begin, set the shard count based on your calculated index size, using 30 GB as a target size for each shard.

Number of Shards = Index Size / 30GB
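
For example, if your calculated index size is 600 GB, start with 600 / 30 = 20 shards. If your calculated index size is under 30 GB, a single shard is enough, as discussed below.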

To learn how to calculate your index size, see the blog post Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need?

As you send data and queries to the cluster, continuously evaluate the resource usage and adjust the shard count based on the performance of the cluster.

What is a shard?
WhatIsShard

A search engine has two jobs: create an index from a set of documents, and search that index to compute the best matching documents. If your index is small enough, a single data structure on a single machine can easily store that index. For larger document sets, in cases where a single machine is not large enough to hold the index, or in cases where a single machine can't compute your search results, the index is split into pieces. These pieces are called shards in Elasticsearch. Each document is routed to a shard that is calculated, by default, by using a hash of that document's ID.

A shard is both a unit of storage and a unit of computation. Elasticsearch deploys shards independently to the instances in the cluster to parallelize the storage and processing for the index. And it does this elastically (hence the “elastic” in the name “Elasticsearch”). If you add more instances to a cluster, Amazon Elasticsearch Service automatically rebalances the shards of the cluster, moving them between instances.

As storage, shards are distinct from one another. The document set in one shard doesn’t overlap the document set in any other shard. This approach makes shards independent for storage.

As computational units, shards are also distinct from one another. Each shard is an instance of an Apache Lucene index that computes results on the documents it holds. Because all of the shards comprise the index, they must function together to process each query and update request for that index. To process a query, Elasticsearch routes the query to all shards in the index. Each shard computes its response locally and then these responses are aggregated for the final response. To process a write request (a document addition or an update to an existing document), Elasticsearch routes the request to the appropriate shard.

Elasticsearch has two different kinds of shards
There are two kinds of shard in Elasticsearch—primary shards and replica shards. The primary shard receives all writes first. It passes new documents off to all replicas for indexing. By default, it then waits for the replicas to acknowledge the write before returning success to the caller. The primary and a replica shard are redundant storage for the data, hardening the cluster to the loss of an instance.

ESCluster1

In the example shown, the Elasticsearch cluster has three data instances. There are two indices, green and blue, each of which has three shards. The primary for each shard is outlined in red. Each shard also has a single replica, shown with no outline. Elasticsearch maps shards to instances based on a number of rules. The most basic rule is that primary and replica shards are never put onto the same instance.

Focus on the storage first
We’ve described two kinds of workloads that our customers run: single index and rolling index. Single index workloads use an external “source of truth” repository that holds all the content. Rolling index workloads receive data continuously. That data is put into a changing set of indices, based on a timestamp and an indexing period—usually one 24-hour day.

With each of these workloads, the place to start calculating sharding is the storage size required for the index. Treat each shard as a unit of storage first, and you can find a baseline for how many shards you need. For single-index workloads, divide the total storage by 30 GB to get the initial shard count. For rolling index workloads, divide a single time period’s index size by 30 GB to get the initial shard count.

Don’t be afraid of using a single shard!
If you have less than 30 GB of data in your index, you should use a single shard for your index. Some people have a gut feeling that “more is better.” Resist the temptation! Shards are both computational and storage entities. Each shard you add to an index distributes the processing of requests for that index across an additional CPU. Performance decreases because you’ll be using more processors than needed, requiring extra computation to manage and combine the results. You will also add network overhead for the scatter-gather of the query and responses.

Set the shard count
You set the shard count when you create each index, with the Elasticsearch create index API action. With Amazon Elasticsearch Service, an API call might be:

>>> curl -XPUT https://search-tweets2-ldij2zmbn6c5oijkrljEXAMPLE.us-west-2.es.amazonaws.com/tweet -d '{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}'

If you have a single index workload, you only have to do this once, when you create your index for the first time. If you have a rolling index workload, you create a new index regularly. Use the _template API to automate applying settings to all new indices whose name matches the template.

>>> curl -XPUT https://search-tweets2-ldij2zmbn6c5oijkrljEXAMPLE.us-west-2.es.amazonaws.com/_template/template1 -d '{
  "template": "logs*",
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}'

Be sure to replace the endpoint and template name in these examples with your endpoint and template name.

Any new index that you create whose name has “logs” as a prefix will have two shards and one replica.
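
To confirm that a new index picked up the template settings, you can fetch its settings (the index name here is just an example that matches the logs* pattern):

>>> curl -XGET https://search-tweets2-ldij2zmbn6c5oijkrljEXAMPLE.us-west-2.es.amazonaws.com/logs-2017-02-01/_settings?pretty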

Adjust according to workload
What we’ve covered so far is the simplest layer of the sharding question. In a future post, we’ll cover the next level down: adjusting shard count based on usage. If you’re just getting started, use index size divided by 30 GB to start when selecting a shard count. Make sure to set the shard count on your index before sending any data to it.

Let me know about your adventures in sharding!

For more information, take a look at Amazon Elasticsearch Service’s details page.

Reduce Resource Consumption by Consolidating Your Sharded System into Aurora

by Ed Murray | in Aurora, DMS

Ed Murray is a manager at Amazon Web Services.

When faced with the prospect of scaling your relational workload, you have generally had two options available: scale up or scale out. Scaling up is easy—just buy a bigger database host. Scaling out is more difficult. It requires you to shard your database into several physically independent pieces that can each run on its own host.

Despite the difficulty, it has been a recent trend for people to scale out. Availability of commodity hardware coupled with the increased need for system resources has made the effort required to shard a workload worthwhile. One drawback of a sharded system is the added administration costs: If you have four shards, you have four databases to manage. Even so, in many cases scaling out was more cost-effective than scaling up, especially with the advent of managed database services such as Amazon RDS that virtually eliminate the administrative requirements of a relational database.

But wait—what's with the word was? With the advent of Amazon Aurora, scaling up might be back on the table. Amazon Aurora is an incredibly scalable, MySQL-compatible, managed database service. Aurora offers instance sizes from 2 vCPUs and 4 GiB of memory all the way up to 32 vCPUs and 244 GiB of memory. Amazon Aurora automatically grows storage as needed, from 10 GB up to 64 TB. You can also add up to 15 low-latency read replicas across three Availability Zones to further scale read capacity. The best part is that the storage is shared with your read replicas!

You might, in fact, save bunches of money by consolidating your sharded system into a single Aurora instance or fewer shards running on Aurora. That is exactly what this blog post is all about.

Getting initial data into Aurora for sharding
Most sharded systems follow a standard pattern: Data is divided based on some key, such as customer ID, and distributed across shards by using some mapping function. There are variations—for example, some systems keep reference data in a separate system or in a single shard, and some replicate the reference data to all shards. No matter how data is divided, the complexity in sharding typically lies in the application layer, making the consolidation of shards a relatively easy exercise.

Now you're probably thinking, "Okay sure, but exactly how do I get my data into Aurora?" For discussion purposes, let's assume you are running on MySQL and have a system that consists of four clean shards. By clean, I mean no shard contains data that technically belongs in another shard. Let's also assume that one shard contains nonsharded reference data. Your sharded system looks something like the following diagram—the "map" shown represents the application mapping traffic to each individual shard.

ShardMap

Because you’re running on MySQL, you can use one of the methods listed in the Aurora documentation to establish a beachhead in Aurora. For simplicity, it’s probably easier to migrate instance 1 first because it contains the additional reference data. However, it doesn’t really matter all that much which shard you choose to migrate first. Once the migration is complete, your mapping function will point to your Aurora instance instead of instance 1 and your system will look something like the following diagram.

ShardMap1

Migrating the rest of your data
At this point, you should evaluate how well Aurora is handling your particular workload and make any required adjustments. Once you’re satisfied with your configuration, you’re ready to migrate another shard, say shard2, into your Aurora instance… but how?

Enter AWS Database Migration Service (AWS DMS)! AWS DMS was built for just these situations. You can use DMS to copy the data from shard2 into your new Aurora instance. Even better, you can do so while continuing to take transactions in shard2. DMS will collect these transactions and apply them to your Aurora instance after the bulk data has been completely loaded. DMS will continue moving transactions from shard2 to your Aurora instance, keeping the two in sync, until you are ready to use your Aurora instance instead of shard2. When you are ready to “flip” between shard2 and your Aurora instance, you simply stop taking transactions in shard2, let DMS apply the final few transactions to your Aurora instance, and update your map to direct traffic previously meant for shard2 to your Aurora instance. When you’ve done so, your system will look something like the following.
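
If you script your migrations rather than using the console, creating such a DMS task might look roughly like the following with the AWS CLI (the endpoint and replication instance ARNs, task name, and table-mappings file are placeholders that you would have created beforehand; this is a sketch, not a complete walkthrough):

aws dms create-replication-task \
    --replication-task-identifier shard2-to-aurora \
    --source-endpoint-arn arn:aws:dms:us-west-2:123456789012:endpoint:SHARD2EXAMPLE \
    --target-endpoint-arn arn:aws:dms:us-west-2:123456789012:endpoint:AURORAEXAMPLE \
    --replication-instance-arn arn:aws:dms:us-west-2:123456789012:rep:REPEXAMPLE \
    --migration-type full-load-and-cdc \
    --table-mappings file://table-mappings.json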

ShardMap2

From this point, you can use the Database Migration Service to migrate your final two shards into your Aurora instance. The resulting system will look like the following.

ShardMap3

Dealing with shards that aren’t clean
Great! You’ve now migrated your sharded system into a single Aurora instance and saved a bunch of money! But we made a couple of assumptions. We assumed your shards were clean and your source database was MySQL. What if those assumptions aren’t true?

Let’s take a look at the case where your shards aren’t clean. For example, suppose your system initially consisted of two shards and at some point you split those two shards into four shards. As part of the resharding process, you simply made copies of shard 1 and 2 to create shards 3 and 4 and updated your mapping function. The result is something like the following.

AppTier

The diagram looks more complicated than the situation is. Ideally, when resharding like this you purge the data that is no longer pertinent to the shard—the grayed-out data. However, it’s not always required and sometimes can be difficult, so people tend to leave it alone. This approach results in “dirty” shards—shards that hold redundant data that is dormant and not used. When you try to consolidate these shards, you run into issues where the rows of active data collide with those duplicate stagnant rows that should have been deleted.

What to do? You can purge those stagnant rows prior to performing your shard consolidation. However, it might be difficult to identify the rows that are stagnant, especially if your mapping function consists of a hash of an ID value (a fairly common practice).

Don’t despair! There might be another option! If the data within each of your shards is contained within a single database, you can use DMS to consolidate each shard of your system into a single MySQL database within your Aurora instance. You can then use your existing mapping scheme to direct transactions to the appropriate database within your Aurora instance. Using our example, the resultant Aurora instance looks something like the following.

ShardMap4

Using databases other than MySQL
One of the other assumptions we made was that your sharded system uses MySQL for a database engine. What if it doesn’t? What if you use Oracle or Microsoft SQL Server? Fear not! The AWS Schema Conversion Tool can help!

As stated in the documentation, “The AWS Schema Conversion Tool makes heterogeneous database migrations easy by automatically converting the source database schema and a majority of the custom code to a format compatible with the target database.” Often, sharded systems don’t include a significant amount of business logic embedded in the database by using stored procedures and triggers. This business logic has usually already been moved to the application. If you’re running a sharded system on a database engine that is supported as a source by the AWS Schema Conversion Tool, it might be worth your while to investigate whether a conversion and consolidation to Aurora is feasible. Not only will you benefit from the consolidation, you’ll also benefit from the migration to an open source platform.

Digging deeper
Interested in digging a little deeper? Excellent! We’ve put together an exercise to illustrate how you can use AWS Database Migration Service to help consolidate your sharded database into one or more Aurora instances. For our example, we use a sharded version of the MySQL sample database provided by the DMS team. We’ve made this sharded version available to you as RDS snapshots so you can follow along and provided full instructions. The required snapshots are publicly available and are called mysql-sampledb-master-shard, mysql-sampledb-shard1, and mysql-sampledb-shard2, as described in the instructions.

Set Access Control for Amazon Elasticsearch Service

by handler | in Elasticsearch

Dr. Jon Handler (@_searchgeek) is an AWS solutions architect specializing in search technologies.

Securing your Amazon Elasticsearch Service (Amazon ES) domain helps ensure your data cannot be accessed or altered by unauthorized users. Most customers want the security of IP address- or identity-based access policies, but choose open access out of convenience. Because a domain with open access will accept requests to create, view, modify, and delete data from the Amazon ES domain from any party on the Internet, this option is not appropriate for most customers.

In an earlier blog post, How to Control Access to Your Amazon Elasticsearch Service Domain, we explored access control in depth. In this blog post, we’ll share some easy ways to get started with IAM policies for your domain. Although it takes some work to set up an AWS Identity and Access Management (IAM) policy, this “ounce of prevention” can prevent a ton of work later.

Key access control concepts in Amazon Elasticsearch Service

KeyAccessControl

The domain Amazon ES creates for you includes the nodes in the Elasticsearch cluster and resources from several AWS services. When Amazon ES creates your domain, it launches instances into a service-controlled VPC. Those instances are fronted by Elastic Load Balancing (ELB), and the endpoint for the load balancer is published through Route 53. Requests to the domain pass through the ELB load balancer, which routes them to the domain's EC2 instances. No matter where the request goes, the instance contacts IAM to determine whether the request is authorized. Unauthorized requests are blocked and dropped.

The key to understanding how IAM policies are applied and resolved rests on the following:

  • Policy location: IAM policies can be attached to your domain or to individual users or roles. If a policy is attached to your domain, it’s called a resource-based policy. If it’s attached to a user or role, it’s called a user-based policy.
  • Policy resolution: IAM collects all user-based and resource-based policies that apply to a request in order to determine whether the request is authorized. For more information, see the How to Control Access to Your Amazon Elasticsearch Service Domain blog post.

Whether you create a resource-based policy, a user-based policy, or a mix of the two, IAM will respect all policies in place for a given request.

If you use the wizard in the Amazon ES console to create your domain, Amazon Elasticsearch Service provides several template IAM policies for different kinds of access.

SetupAccessPolocy

  • If you select Allow or deny access to one or more AWS accounts or IAM users: You specify which IAM users or roles should have access to your domain. All requests to the domain must be signed with AWS Signature Version 4 signing. When a request reaches the domain, it is forwarded to IAM for signature verification and access control.
  • If you select Allow access to the domain from specific IP(s): You specify an IP or CIDR block. Anonymous (unsigned) requests from that IP address range are allowed.
  • If you select Deny access to the domain: No request, signed or unsigned, is allowed.
  • If you select Allow open access to the domain: All requests from all users are allowed.  If you select this template, you will receive a pop-up warning from Amazon ES.

Simple IP address-based setup
When you’re just getting started with Amazon Elasticsearch Service, you want to load some data quickly, run a few queries (either from the command line or using Kibana), and do some deeper-dive inspection and monitoring from the command line. An open-access policy is the fastest way to get started because it allows tools like native Elasticsearch clients, curl, and your web browser to interact with the cluster.

By its nature, open access is not secure. We do not recommend an open-access policy. With IP address-based access control, you can secure your domain. Unauthorized visitors or port scanners will be rejected with messages like:

{"Message": "User: anonymous is not authorized to perform: es:ESHttpGet on resource:<domain ARN>"}

If you are doing your development from an EC2 instance, you can set up your VPC to assign it a public IP address or an elastic IP address (EIP).

VPCIP

With this simple setup, you can choose the Allow access to the domain from specific IPs option to generate a policy like this one:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "111.222.333.444/0"
          ]
        }
      },
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    }
  ]

Be sure to replace the IP address, account ID, and domain name with your own values.

This setup supports the basic activities you need to do for development and testing. You can send commands to the cluster directly with tools like curl. You can run Kibana on that EC2 instance, and your IP address-based access policy will allow full access. All other requests to the domain, whether signed or anonymous, from a different IP address will be denied by IAM.

Setup for use with Kinesis Firehose, Amazon CloudWatch Logs, or AWS IoT
Sending data from another AWS service is an easy way to get started with Amazon Elasticsearch Service. To create a Kinesis Firehose stream or to stream CloudWatch Logs data to Amazon ES, you’ll need to create a role to allow these services to write to a domain. IAM’s policy resolution will allow access from the other service to write data to your domain. An IP address-based policy will allow access to your EC2 instance for your commands and Kibana. It will deny access to all other requests.

For information about setting up secure access for AWS IoT, see the Analyze Device-Generated Data with AWS IoT and Amazon Elasticsearch Service blog post, which discusses how to use an IP address-based policy.

Setup for when you don’t know the IP address
In many cases, you don’t have a static IP address for the source of your requests. You could be running Kibana on a set of nodes in your data center, supporting a mobile application, or sending requests from an auto-scaled set of web servers or Logstash instances.

Arch3

In these cases, you can use a reverse proxy at a known IP address in your VPC to forward requests to your Amazon Elasticsearch Service domain from Kibana and use Signature Version 4 signing with user-based authentication to send requests from your application servers.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/webserver "
      },
      "Action": ["es:ESHttpGet"],
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "111.222.333.444/0",   <-- proxy IP address
            "112.223.334.445/0"    <-- IP address of your instance
          ]
        }
      },
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    }

  ]
}

Substitute the IP address of the proxy server into the IP address-based policy to allow your requests from the proxy. You can also add another IP address to the policy to open access for your command line and Kibana.

In the preceding example policy, we have restricted the Action allowed for the web server role to HTTP GET calls. The IAM policies that you write allow a range of commands and HTTP methods, so you can set up access controls for a range of actors. For more information, see the How to Control Access to Your Amazon Elasticsearch Service Domain blog post.

Use a proxy to simplify request signing
Creating a proxy allows you to control access to your domain by using the IP address of the proxy as a source of identity. You control access by emitting only authorized requests from your proxy. You can also use Signature Version 4 request signing to provide the identity behind the request. Amazon ES uses IAM to authenticate these requests and allow or deny them.

To implement request signing, you need to write code, and that can be an additional hurdle for development and for command-line or Kibana access. There are open-source projects that provide a small application that accepts requests, signs them with Signature Version 4, and then forwards them to AWS.

You can find one such signing proxy here: https://github.com/abutaha/aws-es-proxy. This small application listens on port 9200 and forwards signed requests to Amazon Elasticsearch Service.

Note: This signing proxy was developed by a third party, not AWS. It is suitable for development and test, but not for production workloads. AWS is not responsible for the functioning or suitability of external content.

With this signing proxy, you can use the Allow or deny access to one or more AWS accounts or IAM users template to set the policy on your domain to the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/susan"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    }
  ]
}

Replace the user ARN and domain ARN with those from the generated policy. Run the proxy, and have it listen on 127.0.0.1:9200. Then you can use curl to send Elasticsearch API calls to http://127.0.0.1:9200/ and they will be forwarded to your domain. If you want to run Kibana locally, on your development machine, point it at http://127.0.0.1:9200/ in kibana.yml.
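
For example, a quick cluster health check through the local proxy looks like this:

curl -XGET 'http://127.0.0.1:9200/_cluster/health?pretty'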

Use CloudTrail to monitor for changes to access policies
Amazon CloudTrail provides logs of the requests you send to AWS when you interact with various services. Amazon Elasticsearch Service sends CloudTrail events for all administrative actions, like creating domains, updating domain configuration, adding tags, and so on. We recommend that you monitor CloudTrail events for the CreateElasticsearchDomain and UpdateElasticsearchDomainConfig API calls to validate access policies as people in your organization create or modify domains. You can use these logs to review all access policies and ensure they conform to the practices we’ve discussed.

Conclusion
We hope we’ve shown you that it’s easy to set up access policies that meet your needs during test and development. If you have questions or comments, please leave your feedback in the comments.

Choosing the Right DynamoDB Partition Key

by Gowri Balasubramanian | in DynamoDB

Gowri Balasubramanian is a solutions architect at Amazon Web Services.

This blog post will cover important considerations and strategies for choosing the right partition key while migrating from a relational database to DynamoDB. This is an important step in designing and building scalable and reliable applications on top of DynamoDB.

What is a partition key?

DynamoDB supports two types of primary keys:

  • Partition key: Also known as a hash key, the partition key is composed of a single attribute. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems.
  • Partition key and sort key: Referred to as a composite primary key or hash-range key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key. Here is an example:

PartitionKey

Figure 1 A DynamoDB table with a composite primary key
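
For example, you can create a table like the InvoiceTransactions table discussed later in this post, with InvoiceNumber as the partition key and ClientTransactionid as the sort key, by using the AWS CLI (the throughput values are placeholders):

aws dynamodb create-table \
    --table-name InvoiceTransactions \
    --attribute-definitions \
        AttributeName=InvoiceNumber,AttributeType=S \
        AttributeName=ClientTransactionid,AttributeType=S \
    --key-schema \
        AttributeName=InvoiceNumber,KeyType=HASH \
        AttributeName=ClientTransactionid,KeyType=RANGE \
    --provisioned-throughput ReadCapacityUnits=5,WriteCapacityUnits=5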

Why do I need a partition key?

DynamoDB stores data as groups of attributes, known as items. Items are similar to rows or records in other database systems. DynamoDB stores and retrieves each item based on the primary key value, which must be unique. Items are distributed across 10 GB storage units, called partitions (physical storage internal to DynamoDB). Each table has one or more partitions, as shown in Figure 2. For more information, see Understand Partition Behavior in the DynamoDB Developer Guide.

DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition in which the item will be stored. Each item’s location is determined by the hash value of its partition key.

All items with the same partition key are stored together and, for composite primary keys, are ordered by the sort key value. DynamoDB will split partitions by sort key if the item collection size grows bigger than 10 GB.

Partition

Figure 2 DynamoDB tables and partitions

Partition keys and request throttling

DynamoDB evenly distributes provisioned throughput (read capacity units and write capacity units) between partitions. Therefore, it limits the throughput per partition based on your provisioned capacity. If your read or write throughput exceeds this level for a single partition, your requests may be throttled with a ProvisionedThroughputExceededException error.

Reading/writing above the limit can be caused by:

  • Uneven distribution of data due to the wrong choice of partition key.
  • Frequent accessing of the same key in a partition (the most popular item, also known as a hot key).
  • A request rate greater than the provisioned throughput.

To avoid request throttling, design your DynamoDB table with the right partition key to meet your access requirements and provide even distribution of data.

Recommendations for partition keys

Use high-cardinality attributes. These are attributes that have distinct values for each item, like emailid, employee_no, customerid, sessionid, orderid, and so on.

Use composite attributes. Try to combine more than one attribute to form a unique key, if that meets your access pattern. For example, consider an orders table with customerid+productid+countrycode as the partition key and order_date as the sort key.

Cache the popular items when there is a high volume of read traffic. The cache acts as a low-pass filter, preventing reads of unusually popular items from swamping partitions. For example, consider a table that has deals information for products. Some deals are expected to be more popular than others during major sale events like Black Friday or Cyber Monday.

Add random numbers or digits from a predetermined range for write-heavy use cases. If you expect a large volume of writes for a partition key, use an additional prefix or suffix (a fixed number from a predetermined range, say 1-10) and add it to the partition key. For example, consider a table of invoice transactions. A single invoice can contain thousands of transactions per client. How do we enforce uniqueness and retain the ability to query and update the invoice details for high-volume clients?

Here is the recommended table layout for this scenario:

  • Partition key: Add a random suffix (1-10 or 1-100) to the InvoiceNumber, depending on the number of items per InvoiceNumber.
  • Sort key: ClientTransactionid.
    Partition Key              | Sort Key            | Attribute1
    InvoiceNumber+Randomsuffix | ClientTransactionid | Invoice_Date
    121212-1                   | Client1_trans1      | 2016-05-17 01.36.45
    121212-1                   | Client1_trans2      | 2016-05-18 01.36.30
    121212-2                   | Client2_trans1      | 2016-06-15 01.36.20
    121212-2                   | Client2_trans2      | 2016-07-01 01.36.15
  • This combination will give us a good spread through the partitions. The sort key can be used to filter for a specific client (for example, where InvoiceNumber=121212-1 and ClientTransactionid begins with Client1).
  • Because we have a random number appended to our partition key (1-10), we need to query the table 10 times for a given InvoiceNumber. Our partition key could be 121212-[1-10], so we need to query where the partition key is 121212-1 and ClientTransactionid begins with Client1. We need to repeat this for 121212-2, and so on up to 121212-10, and then merge the results (see the sketch following this list).
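
Here’s a minimal sketch of that scatter-gather read, assuming a table named InvoiceTransactions whose partition key attribute is called InvoiceNumber_Suffix (both names are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("InvoiceTransactions")  # illustrative table name

def get_client_transactions(invoice_number, client_prefix, suffix_range=10):
    """Query every suffixed partition key for an invoice and merge the results."""
    items = []
    for suffix in range(1, suffix_range + 1):
        response = table.query(
            KeyConditionExpression=(
                Key("InvoiceNumber_Suffix").eq(f"{invoice_number}-{suffix}")
                & Key("ClientTransactionid").begins_with(client_prefix)
            )
        )
        # Large result sets would also need LastEvaluatedKey paging (see later in this post).
        items.extend(response["Items"])
    return items

# Example: all transactions for Client1 on invoice 121212.
# transactions = get_client_transactions("121212", "Client1")
```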

Here’s an alternative to the random number (1-10) option: If you will always use Invoice_Date to filter the query results, you can append a portion of Invoice_Date (such as day, month, and year, DDMMYYYY) to the InvoiceNumber to spread the writes. If you need to identify the list of invoices for Client1 for Sep 2016, then you need to issue the query 30 times, where InvoiceNumber ranges from 121212-01SEP2016 to 121212-30SEP2016 and ClientTransactionid starts with Client1. On the other hand, if your access pattern typically fetches invoices for a particular date, you can use the partition key to query directly.

Note:

After the suffix range is decided, there is no easy way to further spread the data, because changing the suffix scheme also requires application-level changes. Therefore, consider how hot each partition key could get and choose a large enough suffix range (with some buffer) to accommodate the anticipated future growth.

This option also adds read latency, because each query must issue one read request per suffix value and then merge the results.

Segregate the tables for large vs. small use cases. Using a random suffix with a partition key works well if there is a large number of items for each partition key. What if some partition keys have many items and others have only a few? For example, what if one customer or product has tens of thousands of items per partition key and another customer has only a few items per partition key? In this scenario, using a random suffix will be ineffective for the small customers.

Consider the same InvoiceTransactions table. Assume that you know your client profiles in advance and that your clients can be classified as small or big: small invoices have about 10 transactions, and big invoices have roughly 10,000 transactions.

Here is the recommended table layout for this scenario:

Table Name                | Partition Key              | Sort Key
InvoiceTransactions_big   | InvoiceNumber+Randomsuffix | ClientTransactionid
InvoiceTransactions_small | InvoiceNumber              | ClientTransactionid

In this case, the random suffix logic is applied only to the big table, which is expected to store tens of thousands of invoice transactions per client. The queries against the small table do not require multiple API calls.

Anti-patterns for partition keys

Use of sequences or unique IDs generated by the DB engine as the partition key. It’s common to use sequences (schema.sequence.NEXTVAL) as the primary key to enforce uniqueness in Oracle tables, but sequences are not usually used for accessing the data.

The following is an example schema layout for an order table that has been migrated from Oracle to DynamoDB. The main table partition key (TransactionID) is populated by a UID. A global secondary index (GSI) is created on OrderID and Order_Date for query purposes.

Partition Key | Attribute1  | Attribute2
TransactionID | OrderID     | Order_Date
1111111       | Customer1-1 | 2016-05-17 01.36.45
1111112       | Customer1-2 | 2016-05-18 01.36.30
1111113       | Customer2-1 | 2016-05-18 01.36.30

Here are the potential issues with this approach:

  • You can’t use TransactionID for any query purposes, so you lose the ability to use the partition key to perform a fast lookup of data.
  • GSIs support eventual consistency only, with additional costs for reads and writes.

Note:  You can use the conditional writes feature instead of sequences to enforce uniqueness and prevent the overwriting of an item.
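
For example, here’s a hedged sketch of a conditional write that enforces uniqueness on the partition key (the table and attribute names are illustrative):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("OrderTransactions")  # illustrative table with partition key "TransactionID"

def insert_once(item):
    """Write the item only if no item with the same TransactionID already exists."""
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(TransactionID)",
        )
        return True
    except ClientError as error:
        if error.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # the key already exists, so nothing was overwritten
        raise
```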

Use of low-cardinality attributes like product_code as the partition key and order_date as the sort key.

  • This design greatly increases the likelihood of hot partition issues. For example, if one product is more popular, then the reads and writes for that key will be high, resulting in throttling issues.
  • Except for scan, DynamoDB API operations require an equal operator (EQ) on the partition key for tables and GSIs. As a result, the partition key must be something that is easily queried by your application with a simple lookup (for example, using key=value, which returns either a unique item or a small set of items). In addition, a single Query operation can return at most 1 MB of data, so for a low-cardinality key you will need to paginate using LastEvaluatedKey, which is not optimal (see the sketch following this list).
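
Here’s a minimal sketch of that pagination loop (the table and attribute names are illustrative):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # illustrative table with partition key "product_code"

def query_all_orders(product_code):
    """Page through every 1 MB chunk returned for a single (possibly hot) partition key."""
    items, start_key = [], None
    while True:
        kwargs = {"KeyConditionExpression": Key("product_code").eq(product_code)}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        response = table.query(**kwargs)
        items.extend(response["Items"])
        start_key = response.get("LastEvaluatedKey")
        if not start_key:  # no more pages
            return items
```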

Do not build your own hash on top of a key like customer_id before storing it in DynamoDB. DynamoDB already hashes your partition key and distributes items across partitions; double hashing only adds complexity to your design.

Do not lift and shift primary keys from the source database without analyzing the data model and access patterns of the target DynamoDB table.

Conclusion

When it comes to DynamoDB partition key strategies, no single solution will fit all use cases. You should evaluate various approaches based on your data ingestion and access patterns, then choose the key that is least likely to run into throttling issues.

Reference

Design for Uniform Data Access Across Items in Your Tables in the DynamoDB Developer Guide.

How to Migrate Your Oracle Database to Amazon Aurora

by Ed Murray | on | in Aurora, DMS, Schema Conversion Tool (SCT) | | Comments

Ed Murray is a manager at Amazon Web Services.

This blog post gives you a quick overview of how you can use the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS) to facilitate and simplify migrating your commercial database to Amazon Aurora. In this case, we focus on migrating from Oracle to the MySQL-compatible Amazon Aurora.

Changing database engines can be daunting. However, the value proposition of a highly scalable, cost-effective fully managed service such as Amazon Aurora can make the challenge worth it—especially when there are tools to simplify the process. When migrating your database from one engine to another, there are two major things to consider: the conversion of the schema and code objects, and the migration and conversion of the data itself. Fortunately, AWS has tools to facilitate both the conversion and migration of your database.

The AWS Schema Conversion Tool helps simplify heterogeneous database migrations by automatically converting your source database schema and a majority of the custom code to a format compatible with your new target database. The custom code that the tool converts includes views, stored procedures, and functions. Any code that the tool cannot convert automatically is clearly marked so that you can convert it yourself. The AWS Database Migration Service helps you migrate your data easily and securely with minimal downtime.

Great! So where do we begin?

Working with AWS SCT
Usually, the first step in every migration is an assessment of feasibility and effort. You can use AWS SCT to generate a high-level overview of the effort required to convert your Oracle database to Aurora. The SCT runs on a number of operating systems. For the purposes of this blog, we’ll run the tool on Windows. To download a copy of the SCT, see Installing and Updating the AWS Schema Conversion Tool in the documentation. To find overall documentation for SCT, start with What Is the AWS Schema Conversion Tool?

Although this blog post won’t cover the installation and configuration of the SCT, it’s important to point out that you need to install drivers for Oracle and MySQL to connect the SCT to your source and target databases. After connecting to your Oracle source database, you can right-click on any given schema and generate an assessment report. The assessment report tells you at a very high level how much of the schema can be converted automatically from Oracle to Aurora and the work that is left after conversion. You can find an example report following.

AssessmentReport

In addition to the assessment report, the SCT tells you exactly how each object was converted. If an object can’t be converted, the SCT tells you why and gives you hints on how you can remedy the situation.

AssessmentReport1

In the likely event that 100% of your schema doesn’t convert from Oracle to Aurora, you can remedy the situation in a few ways:

  • Modify objects on your source Oracle database so the SCT can better convert them to Aurora.
  • Convert the schema as-is and modify the scripts generated by the SCT before applying them to your Aurora database.
  • Leave the inconvertible objects out of the conversion and replace or ignore them on the target. For example, suppose you have a function that calls the sys.dbms_random package in Oracle. This package doesn’t exist in Aurora. To remedy this, you can do the following:
    • Push the generation of the random value into the application code and pass it into the function as a parameter. You can choose to make this modification on your source database before the conversion, or on your target after the conversion.
    • Modify the code generated by the SCT to use the RAND() function, which is available in MySQL, and apply the new code to your Aurora database.

As another example, suppose you are using sequences in Oracle to populate some of your unique identifiers. Aurora doesn’t support sequences, so to remedy this you can do the following:

  • Use the auto-increment feature of Aurora to populate your unique identifiers automatically. If you go this route, you’ll likely want to create a script to modify your target tables after you create the schema in your Aurora database (a minimal example script follows this list).
  • Create an alternative method for generating unique identifiers (using a function or something similar), and replace your references to the sequences with your new function. You can do this on your Oracle source before the conversion or in your Aurora database after the conversion.
  • You might need to use both techniques.
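
For the first option, a post-conversion script might look roughly like the following sketch, which uses the PyMySQL driver. The connection details, table names, and column names are placeholders, not output from the SCT.

```python
import pymysql

# Placeholder connection details for the Aurora MySQL-compatible target.
connection = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin", password="secret", database="sales",
)

# Identifier columns that Oracle populated from sequences (names are illustrative).
sequence_columns = [("orders", "order_id"), ("customers", "customer_id")]

with connection.cursor() as cursor:
    for table_name, column_name in sequence_columns:
        # Switch the identifier column to Aurora's auto-increment feature.
        cursor.execute(
            f"ALTER TABLE {table_name} MODIFY {column_name} BIGINT NOT NULL AUTO_INCREMENT"
        )
connection.commit()
connection.close()
```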

In general, a good approach to using the SCT as part of your migration should include the following:

  • Generate the SCT assessment report and use it to develop a plan to close any gaps in the conversion. If you have multiple systems that are candidates for migration, use the SCT assessment report to help determine which system you should tackle first.
  • Review the action items and determine an appropriate remedy for each item that fails in the conversion.
  • You will likely want to iterate on this process, using it in conjunction with the AWS Database Migration Service to load data into your new schema while you test your application against your new Aurora database.

Which brings us to AWS DMS!

Working with AWS DMS
AWS DMS can be used to load data from your Oracle source database into your new Aurora target database. The great thing about DMS is that in addition to loading your bulk data, it captures and applies ongoing transactions. It brings and keeps your Oracle source and Aurora target databases in sync until you’re ready to cut over. This approach can significantly reduce the outage time required to complete your migration. Any DMS migration includes the following elements: a source endpoint (Oracle), a target endpoint (Aurora), a replication instance, and a task.

When migrating from Oracle to Aurora, you’ll want to configure your task to migrate existing data and replicate ongoing changes. Doing this directs DMS to capture transactions while it migrates the bulk data. Once the bulk data has been loaded, DMS begins applying captured transactions, bringing the Oracle and Aurora databases in sync. When you are ready to cut over to Aurora, you simply stop the application, let DMS apply the few remaining transactions, and start the application, pointing it to your new Aurora database.

There are a few things to consider when using DMS to migrate from Oracle to Aurora:

Supplemental logging. For DMS to capture changes from your Oracle source, you need to enable supplemental logging. Detailed instructions can be found in the DMS documentation.

The three phases of DMS. DMS goes through three phases when migrating data and replicating ongoing changes:

  • Bulk load: During the bulk load phase of a migration, DMS loads your tables individually, n tables at a time; by default, n = 8. You can configure this number by using the DMS Management Console or the AWS CLI (a boto3 sketch follows this list).
  • Application of cached transactions: During the bulk load phase, DMS captures changes to your source database. Once the bulk load is completed for a table, DMS applies any cached changes to that table as soon as possible, as if they were part of the bulk load.
  • Transactional apply: Once the bulk load is complete for all tables, DMS will begin applying captured changes as transactions, rather than as single table updates.
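
As an illustration, here’s a rough boto3 sketch of creating a full-load-and-CDC task and tuning the number of tables loaded in parallel. The ARNs, identifiers, and schema name are placeholders, and only the relevant task setting is shown.

```python
import json
import boto3

dms = boto3.client("dms")

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # migrate existing data and replicate ongoing changes
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection", "rule-id": "1", "rule-name": "1",
            "object-locator": {"schema-name": "MYSCHEMA", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
    # n tables loaded in parallel during the bulk load phase.
    ReplicationTaskSettings=json.dumps({"FullLoadSettings": {"MaxFullLoadSubTasks": 8}}),
)
```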

Secondary indexes. In some situations, you might want to remove secondary indexes during the bulk load phase of DMS for performance reasons. If you do choose to remove some or all of your secondary indexes during the bulk phase, you probably should pause the migration and add them back during the transactional apply phase. You can safely pause your migration after the full load is complete for all tables.

Foreign keys, triggers, and so on. Because bulk load is done table by table, foreign keys in your Aurora target might be violated during the bulk load and cached transaction phases of your migration. You can disable foreign key checks by adding the following as an extra connection attribute of your target Aurora endpoint definition: initstmt=SET FOREIGN_KEY_CHECKS=0. In general, you should develop a strategy for dealing with anything that might be disrupted or negatively affected by bulk loading of data. For example, to avoid issues, you may want to postpone the installation of triggers to the cut-over phase of the migration.
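
For example, here’s a hedged boto3 sketch of defining the Aurora target endpoint with that extra connection attribute; all connection values are placeholders.

```python
import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="aurora-target",
    EndpointType="target",
    EngineName="aurora",  # MySQL-compatible Amazon Aurora
    ServerName="my-aurora-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",
    Port=3306,
    Username="admin",
    Password="secret",
    # Disable foreign key checks while DMS bulk loads data and applies cached changes.
    ExtraConnectionAttributes="initstmt=SET FOREIGN_KEY_CHECKS=0",
)
```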

Data types. When migrating to a new database engine, it’s critical to understand which data types are supported and how the source data types are transformed into the new target data types. For this example, you should check Oracle source datatypes and Aurora target datatypes in the DMS documentation.

Performance. The overall performance of your migration can vary depending on the amount, type, and distribution of data in your Oracle source. The Database Migration Services Best Practices whitepaper has some good recommendations for optimizing the performance of your migration.

To recap the process:

  1. Use the SCT assessment report to get an overview of the task at hand. If you have multiple candidates for migration to Aurora, this report can help you decide which one to tackle first.
  2. Practice generating your target schema and loading it using DMS to flush out any migration steps that might be required before and after processing.
  3. Test your application on your target system to ensure it will work as expected in your new environment. Try to test your application in a configuration that resembles your production configuration, including load, network configuration, and so on.
  4. Practice the actual migration including generating your schema, loading your data, applying postprocessing steps, bringing the target system in sync with the source, and any cut-over steps you might require.
  5. Keep in mind that neither the SCT nor DMS requires you to migrate your entire system in one fell swoop. You can use these tools to effectively migrate and rearchitect your system piecemeal if you want to.

Before you begin your actual migration, we recommend you thoroughly read the documentation for both the SCT and DMS. We also recommend you read the step-by-step walkthrough and the Database Migration Services Best Practices whitepaper.

If you’d like to use our sample database to get a feel for using the tools, you can find it in the AWS GitHub repository.

Although this blog post is not intended to outline all possible steps or considerations that might be required for your particular situation, it should give you a good idea how you can use the SCT and DMS to relieve you of the shackles of your proprietary Oracle database. Good luck and happy migrating!

 

Performing Parallel Queries and Phrase Searching with Amazon RDS for PostgreSQL 9.6.1

by Jeff Levine | on | in RDS PostgreSQL | | Comments

Jeff Levine is a solutions architect for Amazon Web Services.

As a solutions architect for Amazon Web Services, one of my responsibilities is to help our customers to learn about new features and capabilities of our services. Amazon RDS makes it possible to set up, operate, and scale relational databases in the cloud without having to focus on time-consuming administrative tasks. Amazon RDS recently announced support for PostgreSQL 9.6.1. In this blog post, I will discuss two of the features included in this release: the ability to perform parallel queries and phrase searches.

Parallel Queries
Parallel queries make it possible to use multiple CPU cores to speed up the execution of queries by creating multiple worker processes. Each worker process performs the query on a subset of data. The results are then combined to produce the final query result.  The number of worker processes is determined by the value of the parameter max_parallel_workers_per_gather. The default value of this parameter is 0, which disables parallelism.

In the examples that follow, I’ve created a VPC with a PostgreSQL RDS database on a db.m4.large instance class with 2 vCPUs and 8 GB RAM. I’m going to use pgcli, a command line tool for PostgreSQL, to interact with the database named pg961db to create a table.

CLi
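
If you’d rather work from code than from the screenshot, here’s a rough equivalent using the psycopg2 driver. The connection details, table name, and column names are illustrative, not the exact statements shown above.

```python
import psycopg2

# Placeholder connection details for the RDS PostgreSQL 9.6.1 instance.
connection = psycopg2.connect(
    host="pg961db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="pg961db", user="master", password="secret",
)
connection.autocommit = True

with connection.cursor() as cursor:
    # 100 million rows: a sequential identifier plus a random digit from 1 to 9.
    cursor.execute("""
        CREATE TABLE test_table AS
        SELECT id, floor(random() * 9 + 1)::int AS category
        FROM generate_series(1, 100000000) AS id
    """)
    cursor.execute("SELECT * FROM test_table LIMIT 3")
    print(cursor.fetchall())  # show the first three rows
```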

 

The create table statement uses the generate_series function to create a set of integers from 1 to 100 million that serve as identifiers for the rows of the table. I am also using the random() and floor() functions to generate a column with a random number from 1 to 9. I then use the select statement to show the first three rows of the table. Now I will run a standard query:

CLi2

 

In the preceding screenshot, max_parallel_workers_per_gather is set to 0 to disable parallel queries before a select is performed. Notice that the query plan calls for a sequential scan (as denoted by Seq Scan). The total query time is approximately 44 seconds. Now let’s see what happens when parallel queries are enabled:

CLi3

In this example, max_parallel_workers_per_gather is set to 2 to enable parallel queries. The output from explain shows that two workers have been launched and that the total processing time was reduced to approximately 29 seconds, roughly a 34% reduction compared with the nonparallel run of the same query.
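
Here’s a rough, self-contained approximation of the two runs shown in the screenshots, reusing the illustrative test_table from the earlier sketch (connection details are placeholders):

```python
import psycopg2

connection = psycopg2.connect(
    host="pg961db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="pg961db", user="master", password="secret",
)

with connection.cursor() as cursor:
    # Baseline: parallel queries disabled (the default).
    cursor.execute("SET max_parallel_workers_per_gather = 0")
    cursor.execute("EXPLAIN ANALYZE SELECT avg(category) FROM test_table")
    print("\n".join(row[0] for row in cursor.fetchall()))  # plan shows a Seq Scan

    # Allow up to two worker processes for the same aggregate.
    cursor.execute("SET max_parallel_workers_per_gather = 2")
    cursor.execute("EXPLAIN ANALYZE SELECT avg(category) FROM test_table")
    print("\n".join(row[0] for row in cursor.fetchall()))  # plan shows a Gather node with two workers
```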

This example shows how parallel queries can reduce processing time by using multiple workers in a divide-and-conquer manner. This is especially useful when processing aggregate functions such as avg() in the preceding example. The resulting increase in efficiency will depend on a number of factors, including the data itself, the specific query, and the database instance class.

Phrase Searches
PostgreSQL 9.6.1 also introduced the ability to perform phrase searches, in which a search query consists of an ordered set of lexemes rather than, as with standard searches, an unordered set. Consider these standard search examples:

CLi4

In the first example, the search for “Amazon Web Services” in “Amazon Web Services has over 90 services” yields the expected result of True. However, the search for “Amazon Web Services” in “The Web Travel Services Agency offers tours of the Amazon” also yields a result of True even though it is likely not what we wish to see. To understand why this happens, see how the plainto_tsquery() function parses query strings:

CLi5

 

The parser first applies the English language processing rules to the string “Amazon Web Services” by normalizing the case of each word and extracting the root of each token before converting it to a lexeme (hence, the change of “services” to “servic”). The resulting query string consists of the lexemes separated by the intersection operator “&.” A match occurs if all of the lexemes are found, regardless of order. PostgreSQL 9.6.1 includes support for phrase search query strings using the phraseto_tsquery() function, as shown here:

CLi6

The query string contains the same three lexemes, but with a new ordering operator <->, which means “is followed by.” For a match to occur, the three lexemes must appear in the correct order. This capability makes it much easier to refine searches and produce the most desirable results.
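
Here’s a hedged approximation of the comparison shown in the screenshots, again with placeholder connection details:

```python
import psycopg2

connection = psycopg2.connect(
    host="pg961db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    dbname="pg961db", user="master", password="secret",
)

with connection.cursor() as cursor:
    document = "The Web Travel Services Agency offers tours of the Amazon"

    # Standard search: the lexemes may appear in any order, so this matches.
    cursor.execute(
        "SELECT to_tsvector('english', %s) @@ plainto_tsquery('english', 'Amazon Web Services')",
        (document,),
    )
    print(cursor.fetchone()[0])  # True

    # Phrase search: the lexemes must appear in order, so this does not match.
    cursor.execute(
        "SELECT to_tsvector('english', %s) @@ phraseto_tsquery('english', 'Amazon Web Services')",
        (document,),
    )
    print(cursor.fetchone()[0])  # False
```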

Amazon RDS for PostgreSQL version 9.6.1 brings the latest capabilities of PostgreSQL to the cloud as a managed service. The parallel query feature can increase the performance of searches by using additional CPU capacity to run multiple worker processes. The phrase search capability provides for a more tailored search experience where the specific order of terms is required. It’s easy to get started. Use our Getting Started Guide to create your PostgreSQL database in just a few clicks.