AWS Database Blog

Set Access Control for Amazon Elasticsearch Service

by Jon Handler | in Elasticsearch

Dr. Jon Handler (@_searchgeek) is an AWS solutions architect specializing in search technologies.

Securing your Amazon Elasticsearch Service (Amazon ES) domain helps ensure your data cannot be accessed or altered by unauthorized users. Most customers want the security of IP address- or identity-based access policies, but choose open access out of convenience. Because a domain with open access will accept requests to create, view, modify, and delete data from the Amazon ES domain from any party on the Internet, this option is not appropriate for most customers.

In an earlier blog post, How to Control Access to Your Amazon Elasticsearch Service Domain, we explored access control in depth. In this blog post, we’ll share some easy ways to get started with IAM policies for your domain. Although it takes some work to set up an AWS Identity and Access Management (IAM) policy, this “ounce of prevention” can prevent a ton of work later.

Key access control concepts in Amazon Elasticsearch Service

KeyAccessControl

The domain Amazon ES creates for you includes the nodes in the Elasticsearch cluster and resources from several AWS services. When Amazon ES creates your domain, it launches instances into a service-controlled VPC. Those instances are fronted by Elastic Load Balancing (ELB), and the endpoint for the load balancer is published through Route 53. Requests to the domain pass through the ELB load balancer, which routes them to the domain’s EC2 instances. No matter which instance receives a request, that instance contacts IAM to determine whether the request is authorized. Unauthorized requests are blocked and dropped.

The key to understanding how IAM policies are applied and resolved rests on the following:

  • Policy location: IAM policies can be attached to your domain or to individual users or roles. If a policy is attached to your domain, it’s called a resource-based policy. If it’s attached to a user or role, it’s called a user-based policy.
  • Policy resolution: IAM collects all user-based and resource-based policies that apply to a request in order to determine whether the request is authorized. For more information, see the How to Control Access to Your Amazon Elasticsearch Service Domain blog post.

Whether you create a resource-based policy, a user-based policy, or a mix of the two, IAM will respect all policies in place for a given request.

If you use the wizard in the Amazon ES console to create your domain, Amazon Elasticsearch Service provides several template IAM policies for different kinds of access.

SetupAccessPolicy

  • If you select Allow or deny access to one or more AWS accounts or IAM users: You specify which IAM users or roles should have access to your domain. All requests to the domain must be signed with AWS Signature Version 4 signing. When a request reaches the domain, it is forwarded to IAM for signature verification and access control.
  • If you select Allow access to the domain from specific IP(s): You specify an IP or CIDR block. Anonymous (unsigned) requests from that IP address range are allowed.
  • If you select Deny access to the domain: No request, signed or unsigned, is allowed.
  • If you select Allow open access to the domain: All requests from all users are allowed.  If you select this template, you will receive a pop-up warning from Amazon ES.

Simple IP address-based setup
When you’re just getting started with Amazon Elasticsearch Service, you want to load some data quickly, run a few queries (either from the command line or using Kibana), and do some deeper-dive inspection and monitoring from the command line. An open-access policy is the fastest way to get started because it allows tools like native Elasticsearch clients, curl, and your web browser to interact with the cluster.

By its nature, open access is not secure. We do not recommend an open-access policy. With IP address-based access control, you can secure your domain. Unauthorized visitors or port scanners will be rejected with messages like:

{"Message": "User: anonymous is not authorized to perform: es:ESHttpGet on resource: <domain ARN>"}

If you are doing your development from an EC2 instance, you can set up your VPC to assign it a public IP address or an elastic IP address (EIP).

VPCIP

With this simple setup, you can choose the Allow access to the domain from specific IPs option to generate a policy like this one:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "111.222.333.444/0"
          ]
        }
      },
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    }
  ]
}

Be sure to replace the IP address, account ID, and domain name with your own values.

This setup supports the basic activities you need to do for development and testing. You can send commands to the cluster directly with tools like curl. You can run Kibana on that EC2 instance, and your IP address-based access policy will allow full access. All other requests to the domain, whether signed or anonymous, from a different IP address will be denied by IAM.
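
For example, once this IP address-based policy is in place, an anonymous request from the allowed instance can be made with any HTTP client. Here is a minimal Python sketch using the requests library; the domain endpoint, index, and document values are placeholders rather than values from this post:

import requests

# Endpoint of your Amazon ES domain (placeholder) -- copy it from the console.
ENDPOINT = "https://search-mydomain-abc123.us-west-2.es.amazonaws.com"

# Anonymous cluster health check; IAM allows it only because the caller's
# public IP matches the aws:SourceIp condition in the access policy.
response = requests.get(ENDPOINT + "/_cluster/health")
print(response.status_code, response.json())

# Index a sample document anonymously from the same whitelisted address.
doc = {"title": "hello", "body": "testing IP address-based access"}
print(requests.put(ENDPOINT + "/test-index/test-type/1", json=doc).json())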

Setup for use with Kinesis Firehose, Amazon CloudWatch Logs, or AWS IoT
Sending data from another AWS service is an easy way to get started with Amazon Elasticsearch Service. To create a Kinesis Firehose stream or to stream CloudWatch Logs data to Amazon ES, you’ll need to create a role that allows these services to write to a domain. IAM’s policy resolution will allow access from the other service to write data to your domain. An IP address-based policy will allow access from your EC2 instance for your commands and Kibana, and deny all other requests.

For information about setting up secure access for AWS IoT, see the Analyze Device-Generated Data with AWS IoT and Amazon Elasticsearch Service blog post, which discusses how to use an IP address-based policy.

Setup for when you don’t know the IP address
In many cases, you don’t have a static IP address for the source of your requests. You could be running Kibana on a set of nodes in your data center, supporting a mobile application, or sending requests from an auto-scaled set of web servers or Logstash instances.

Arch3

In these cases, you can use a reverse proxy at a known IP address in your VPC to forward requests to your Amazon Elasticsearch Service domain from Kibana and use Signature Version 4 signing with user-based authentication to send requests from your application servers.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/webserver "
      },
      "Action": ["es:ESHttpGet"],
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "111.222.333.444/0",   <-- proxy IP address
            "112.223.334.445/0"    <-- IP address of your instance
          ]
        }
      },
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    }
  ]
}

Substitute the IP address of the proxy server into the IP address-based policy to allow your requests from the proxy. You can also add another IP address to the policy to open access for your command line and Kibana.

In the preceding example policy, we have restricted the Action allowed for the web server role to HTTP GET calls. The IAM policies that you write allow a range of commands and HTTP methods, so you can set up access controls for a range of actors. For more information, see the How to Control Access to Your Amazon Elasticsearch Service Domain blog post.

Use a proxy to simplify request signing
Creating a proxy allows you to control access to your domain by using the IP address of the proxy as a source of identity. You control access by emitting only authorized requests from your proxy. You can also use Signature Version 4 request signing to provide the identity behind the request. Amazon ES uses IAM to authenticate these requests and allow or deny them.

Implementing request signing requires writing code, which can be an additional hurdle for development, especially for command-line or Kibana access. There are open-source projects that provide a small application that accepts requests, signs them with Signature Version 4, and then forwards them to AWS.
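
If you would rather embed Signature Version 4 signing in your own scripts than run a separate proxy, the following is a minimal sketch using botocore's signer. The endpoint and region are placeholders, and this only illustrates the signing flow, not the implementation of any particular proxy:

import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.session import Session

REGION = "us-west-2"                                                     # placeholder
ENDPOINT = "https://search-mydomain-abc123.us-west-2.es.amazonaws.com"   # placeholder

def signed_get(path):
    credentials = Session().get_credentials()       # picks up your IAM user or role keys
    request = AWSRequest(method="GET", url=ENDPOINT + path)
    SigV4Auth(credentials, "es", REGION).add_auth(request)  # "es" is the service to sign for
    return requests.get(request.url, headers=dict(request.headers))

print(signed_get("/_cluster/health").json())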

You can find one such signing proxy here: https://github.com/abutaha/aws-es-proxy. This small application listens on port 9200 and forwards signed requests to Amazon Elasticsearch Service.

Note: This signing proxy was developed by a third party, not AWS. It is suitable for development and test, but not for production workloads. AWS is not responsible for the functioning or suitability of external content.

With this signing proxy, you can use the Allow or deny access to one or more AWS accounts or IAM users template to set the policy on your domain to the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/susan"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-west-2:123456789012:domain/mydomain/*"
    }
  ]
}

Replace the user ARN and domain ARN with those from the generated policy. Run the proxy, and have it listen on 127.0.0.1:9200. Then you can use curl to send Elasticsearch API calls to http://127.0.0.1:9200/ and they will be forwarded to your domain. If you want to run Kibana locally, on your development machine, point it at http://127.0.0.1:9200/ in kibana.yml.

Use CloudTrail to monitor for changes to access policies
Amazon CloudTrail provides logs of the requests you send to AWS when you interact with various services. Amazon Elasticsearch Service sends CloudTrail events for all administrative actions, like creating domains, updating domain configuration, adding tags, and so on. We recommend that you monitor CloudTrail events for the CreateElasticsearchDomain and UpdateElasticsearchDomainConfig API calls to validate access policies as people in your organization create or modify domains. You can use these logs to review all access policies and ensure they conform to the practices we’ve discussed.
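
As a sketch of what such monitoring could look like, the following uses the CloudTrail LookupEvents API through boto3 to list recent calls to those two operations; the region and seven-day window are arbitrary choices:

from datetime import datetime, timedelta
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-west-2")

for event_name in ("CreateElasticsearchDomain", "UpdateElasticsearchDomainConfig"):
    events = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=datetime.utcnow() - timedelta(days=7),
        EndTime=datetime.utcnow(),
    )
    for event in events.get("Events", []):
        # The full request, including any new access policy, is JSON in CloudTrailEvent.
        print(event["EventTime"], event.get("Username"), event_name)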

Conclusion
We hope we’ve shown you that it’s easy to set up access policies that meet your needs during test and development. If you have questions or comments, please leave your feedback in the comments.

Choosing the Right DynamoDB Partition Key

by Gowri Balasubramanian | in DynamoDB

Gowri Balasubramanian is a solutions architect at Amazon Web Services

This blog post covers important considerations and strategies for choosing the right partition key when migrating from a relational database to DynamoDB. Choosing the right partition key is an important step in designing and building scalable and reliable applications on top of DynamoDB.

What is a partition key?

DynamoDB supports two types of primary keys:

  • Partition key: Also known as a hash key, the partition key is composed of a single attribute. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems.
  • Partition key and sort key: Referred to as a composite primary key or hash-range key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key. Here is an example:

PartitionKey

Figure 1: A DynamoDB table with a composite primary key

Why do I need a partition key?

DynamoDB stores data as groups of attributes, known as items. Items are similar to rows or records in other database systems. DynamoDB stores and retrieves each item based on the primary key value, which must be unique. Items are distributed across 10 GB storage units, called partitions (physical storage internal to DynamoDB). Each table has one or more partitions, as shown in Figure 2. For more information, see Understand Partition Behavior in the DynamoDB Developer Guide.

DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition in which the item will be stored. Each item’s location is determined by the hash value of its partition key.

All items with the same partition key are stored together and, for tables with a composite primary key, are ordered by the sort key value. DynamoDB will split partitions by sort key if the collection size grows bigger than 10 GB.

Partition

Figure 2: DynamoDB tables and partitions

Partition keys and request throttling

DynamoDB evenly distributes provisioned throughput (read capacity units and write capacity units) across partitions. Therefore, the throughput available to a single partition is limited by your provisioned capacity. If your read or write throughput exceeds this level for a single partition, your requests may be throttled with a ProvisionedThroughputExceededException error.

Reading/writing above the limit can be caused by:

  • Uneven distribution of data due to the wrong choice of partition key.
  • Frequent accessing of the same key in a partition (the most popular item, also known as a hot key).
  • A request rate greater than the provisioned throughput.

To avoid request throttling, design your DynamoDB table with the right partition key to meet your access requirements and provide even distribution of data.

Recommendations for partition keys

Use high-cardinality attributes. These are attributes that have distinct values for each item, like emailid, employee_no, customerid, sessionid, orderid, and so on.

Use composite attributes. Try to combine more than one attribute to form a unique key, if that meets your access pattern. For example, consider an orders table with customerid+productid+countrycode as the partition key and order_date as the sort key.

Cache the popular items when there is a high volume of read traffic. The cache acts as a low-pass filter, preventing reads of unusually popular items from swamping partitions. For example, consider a table that has deals information for products. Some deals are expected to be more popular than others during major sale events like Black Friday or Cyber Monday.

Add random numbers or digits from a predetermined range for write-heavy use cases. If you expect a large volume of writes for a partition key, use an additional prefix or suffix (a fixed number from a predetermined range, say 1-10) and add it to the partition key. For example, consider a table of invoice transactions. A single invoice can contain thousands of transactions per client. How do we enforce uniqueness and the ability to query and update invoice details for high-volume clients?

Here is the recommended table layout for this scenario:

  • Partition key: Add a random suffix (1-10 or 1-100) to the InvoiceNumber, depending on the number of items per InvoiceNumber.
  • Sort key: ClientTransactionid.

    Partition Key               Sort Key             Attribute1
    InvoiceNumber+Randomsuffix  ClientTransactionid  Invoice_Date
    121212-1                    Client1_trans1       2016-05-17 01.36.45
    121212-1                    Client1_trans2       2016-05-18 01.36.30
    121212-2                    Client2_trans1       2016-06-15 01.36.20
    121212-2                    Client2_trans2       2016-07-01 01.36.15
  • This combination will give us a good spread through the partitions. The sort key can be used to filter for a specific client (for example, where InvoiceNumber=121212-1 and ClientTransactionid begins with Client1).
  • Because we have a random number appended to our partition key (1-10), we need to query the table 10 times for a given InvoiceNumber. Our partition key could be 121212-[1-10], so we query where the partition key is 121212-1 and ClientTransactionid begins with Client1, repeat for 121212-2 through 121212-10, and then merge the results (see the sketch following this list).
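
Here is that sketch: a minimal boto3 illustration of the write-sharding pattern for the InvoiceTransactions example. The table name, the attribute names (PartitionKey, ClientTransactionid, Invoice_Date), and the suffix range of 1-10 are assumptions taken from the layout above:

import random
import boto3
from boto3.dynamodb.conditions import Key

SUFFIXES = range(1, 11)  # 1-10, matching the layout above
table = boto3.resource("dynamodb").Table("InvoiceTransactions")

def write_transaction(invoice_number, client_transaction_id, invoice_date):
    # Spread writes for a hot invoice across suffixed partition keys.
    suffix = random.choice(list(SUFFIXES))
    table.put_item(Item={
        "PartitionKey": "{}-{}".format(invoice_number, suffix),
        "ClientTransactionid": client_transaction_id,
        "Invoice_Date": invoice_date,
    })

def read_invoice_for_client(invoice_number, client_prefix):
    # One Query per suffix, then merge the results client-side.
    items = []
    for suffix in SUFFIXES:
        response = table.query(
            KeyConditionExpression=Key("PartitionKey").eq("{}-{}".format(invoice_number, suffix))
            & Key("ClientTransactionid").begins_with(client_prefix)
        )
        items.extend(response["Items"])
    return items

write_transaction("121212", "Client1_trans1", "2016-05-17 01.36.45")
print(read_invoice_for_client("121212", "Client1"))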

Here’s an alternative to using the random number (1-10) option: If you will always use Invoice_Date to filter the query results, you can append a portion of Invoice_Date (such as day, month, and year, DDMMYYYY) to the InvoiceNumber to spread the writes. If you need to identify the list of invoices for Client1 for Sep 2016, you need to issue the query 30 times, where InvoiceNumber ranges from 121212-01SEP2016 to 121212-30SEP2016 and ClientTransactionid starts with Client1. On the other hand, if your access pattern typically fetches invoices for a particular date, you can use the partition key to query directly.

Note:

After the prefix range is decided, there is no easy way to further spread the data because prefix modifications also require application-level changes. Therefore, consider how hot each partition key could get and add enough random prefixes (with buffer) to accommodate the anticipated future growth.

This option induces additional latency for reads because each logical query fans out into multiple read requests (one per suffix).

Segregate the tables for large vs. small use cases. Using a random number with a partition key works well if there is a large number of items for each partition key. What if you have a mix of small and large items for a partition key? For example, what if you have one customer/product with tens of thousands of items per partition key and another customer with only a few items per partition key? In this scenario, using a random prefix will be ineffective for small customers.

Consider the same InvoiceTransactions table. Assume that you know your client profile in advance and that your clients can be classified as small or big. Small invoices have about 10 transactions, and big invoices have roughly 10,000 transactions.

Here is the recommended table layout for this scenario:

Table Name                 Partition Key               Sort Key
InvoiceTransactions_big    InvoiceNumber+Randomsuffix  ClientTransactionid
InvoiceTransactions_small  InvoiceNumber               ClientTransactionid

In this case, the random prefix logic is only applied to the big table where it is expected to store tens of thousands of invoice transactions per client. The queries against the small table do not require multiple API calls.

Anti-patterns for partition keys

Use of sequences or unique IDs generated by the DB engine as the partition key. It’s common to use sequences (schema.sequence.NEXTVAL) as the primary key to enforce uniqueness in Oracle tables. Sequences are not usually used for accessing the data.

The following is an example schema layout for an order table that has been migrated from Oracle to DynamoDB. The main table partition key (TransactionID) is populated by a UID. A global secondary index (GSI) is created on OrderID and Order_Date for query purposes.

Partition Key  Attribute1   Attribute2
TransactionID  OrderID      Order_Date
1111111        Customer1-1  2016-05-17 01.36.45
1111112        Customer1-2  2016-05-18 01.36.30
1111113        Customer2-1  2016-05-18 01.36.30

Here are the potential issues with this approach:

  • You can’t use TransactionID for any query purposes, so you lose the ability to use the partition key to perform a fast lookup of data.
  • GSIs support eventual consistency only, with additional costs for reads and writes.

Note:  You can use the conditional writes feature instead of sequences to enforce uniqueness and prevent the overwriting of an item.
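
For example, a conditional put like the following sketch rejects a write when an item with the same key already exists; the table and attribute names are illustrative:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Orders")

def create_order(order_id, order_date):
    try:
        table.put_item(
            Item={"OrderID": order_id, "Order_Date": order_date},
            # Fail the write if an item with this partition key already exists.
            ConditionExpression="attribute_not_exists(OrderID)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            print("Order {} already exists".format(order_id))
        else:
            raise

create_order("Customer1-1", "2016-05-17 01.36.45")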

Low-cardinality attributes like product_code as partition key and order_date as sort key.

  • This design greatly increases the likelihood of hot partition issues. For example, if one product is more popular, then the reads and writes for that key will be high, resulting in throttling issues.
  • Except for Scan, DynamoDB API operations require an equality condition (EQ) on the partition key for tables and GSIs. As a result, the partition key must be something that is easily queried by your application with a simple lookup (for example, using key=value, which returns either a unique item or a small set of items). There is also a 1 MB limit on the data returned by a single Query operation, which means you may need to paginate using LastEvaluatedKey, which is not optimal.

Do not build your own hash on top of a key like customer_id before storing it in DynamoDB. DynamoDB already hashes your partition key and distributes the data across partitions; double hashing only adds complexity to your design.

Do not lift and shift primary keys from the source database without analyzing the data model and access patterns of the target DynamoDB table.

Conclusion

When it comes to DynamoDB partition key strategies, no single solution fits all use cases. You should evaluate various approaches based on your data ingestion and access patterns, and then choose the most appropriate key, the one least likely to cause throttling issues.

Reference

Design for Uniform Data Access Across Items in Your Tables in the DynamoDB Developer Guide.

How to Migrate Your Oracle Database to Amazon Aurora

by Ed Murray | in Aurora, DMS, Schema Conversion Tool (SCT)

Ed Murray is a manager at Amazon Web Services.

This blog post gives you a quick overview of how you can use the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS) to facilitate and simplify migrating your commercial database to Amazon Aurora. In this case, we focus on migrating from Oracle to the MySQL-compatible Amazon Aurora.

Changing database engines can be daunting. However, the value proposition of a highly scalable, cost-effective fully managed service such as Amazon Aurora can make the challenge worth it—especially when there are tools to simplify the process. When migrating your database from one engine to another, there are two major things to consider: the conversion of the schema and code objects, and the migration and conversion of the data itself. Fortunately, AWS has tools to facilitate both the conversion and migration of your database.

The AWS Schema Conversion Tool helps simplify heterogeneous database migrations by automatically converting your source database schema and a majority of the custom code to a format compatible with your new target database. The custom code that the tool converts includes views, stored procedures, and functions. Any code that the tool cannot convert automatically is clearly marked so that you can convert it yourself. The AWS Database Migration Service helps you migrate your data easily and securely with minimal downtime.

Great! So where do we begin?

Working with AWS SCT
Usually, the first step in every migration is an assessment of feasibility and effort. You can use AWS SCT to generate a high-level overview of the effort required to convert your Oracle database to Aurora. The SCT runs on a number of operating systems. For the purposes of this blog, we’ll run the tool on Windows. To download a copy of the SCT, see the Installing and Updating the AWS Schema Conversion Tool documentation. To find overall documentation for SCT, start with What Is the AWS Schema Conversion Tool?

Although this blog post won’t cover the installation and configuration of the SCT, it’s important to point out that you need to install drivers for Oracle and MySQL to connect the SCT to your source and target databases. After connecting to your Oracle source database, you can right-click on any given schema and generate an assessment report. The assessment report tells you at a very high level how much of the schema can be converted automatically from Oracle to Aurora and the work that is left after conversion. You can find an example report following.

AssessmentReport

In addition to the assessment report, the SCT also tells you exactly how each object was converted. If an object can’t be converted, the SCT tells you why and gives you hints on how you can remedy the situation.

AssessmentReport1

In the likely event that 100% of your schema doesn’t convert from Oracle to Aurora, you can remedy the situation in a few ways:

  • Modify objects on your source Oracle database so the SCT can better convert them to Aurora.
  • Convert the schema as-is and modify the scripts generated by the SCT before applying them to your Aurora database.
  • Ignore the inconvertible objects, and replace or work around them on the target. For example, suppose you have a function that calls the sys.dbms_random package in Oracle. This package doesn’t exist in Aurora. To remedy this, you can do the following:
    • Push the generation of the random value into the application code and pass it into the function as a parameter. You can choose to make this modification on your source database before the conversion, or on your target after the conversion.
    • Modify the code generated by the SCT to use the RAND() function, which is available in MySQL, and apply the new code to your Aurora database.

As another example, suppose you are using sequences in Oracle to populate some of your unique identifiers. Aurora doesn’t support sequences, so to remedy this you can do the following:

  • Use the auto-increment feature of Aurora to populate your unique identifiers automatically. If you go this route, you’ll likely want to create a script to modify your target tables after you create the schema in your Aurora database.
  • Create an alternative method for generating unique identifiers (using a function or something similar), and replace your references to the sequences with your new function. You can do this on your Oracle source before the conversion or in your Aurora database after the conversion.
  • You might need to use both techniques.

In general, a good approach to using the SCT as part of your migration should include the following:

  • Generate the SCT assessment report and use it to develop a plan to close any gaps in the conversion. If you have multiple systems that are candidates for migration, use the SCT assessment report to help determine which system you should tackle first.
  • Review the action items and determine an appropriate remedy for each item that fails in the conversion.
  • You will likely want to iterate on this process, using it in conjunction with the AWS Database Migration Service to load data into your new schema while you test your application against your new Aurora database.

Which brings us to AWS DMS!

Working with AWS DMS
AWS DMS can be used to load data from your Oracle source database into your new Aurora target database. The great thing about DMS is that in addition to loading your bulk data, it captures and applies ongoing transactions. It brings and keeps your Oracle source and Aurora target databases in sync until you’re ready to cut over. This approach can significantly reduce the outage time required to complete your migration. Any DMS migration includes the following elements: a source endpoint, Oracle; a target endpoint, Aurora; a replication server; and a task.

When migrating from Oracle to Aurora, you’ll want to configure your task to migrate existing data and replicate ongoing changes. Doing this directs DMS to capture transactions while it migrates the bulk data. Once the bulk data has been loaded, DMS begins applying the captured transactions, bringing the Oracle and Aurora databases into sync. When you are ready to cut over to Aurora, you simply stop the application, let DMS apply the few remaining transactions, and start the application pointing to your new Aurora database.

There are a few things to consider when using DMS to migrate from Oracle to Aurora:

Supplemental logging. For DMS to capture changes from your Oracle source, you need to enable supplemental logging. Detailed instructions can be found in the DMS documentation.

The three phases of DMS. DMS goes through three phases when migrating data and replicating ongoing changes:

  • Bulk load: During the bulk load phase of a migration, DMS loads your tables individually, n tables at a time; by default, n = 8. You can configure this number by using the DMS Management Console or the AWS CLI (a hedged boto3 sketch follows this list).
  • Application of cached transactions: During the bulk load phase, DMS captures changes to your source database. Once the bulk load is completed for a table, DMS applies any cached changes to that table as soon as possible, as if they were part of the bulk load.
  • Transactional apply: Once the bulk load is complete for all tables, DMS will begin applying captured changes as transactions, rather than as single table updates.
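
Here is the hedged boto3 sketch mentioned above for adjusting the parallel-load setting through the task settings JSON rather than the console. The task ARN is a placeholder, and MaxFullLoadSubTasks is the setting that controls how many tables load in parallel:

import json
import boto3

dms = boto3.client("dms", region_name="us-west-2")
TASK_ARN = "arn:aws:dms:us-west-2:123456789012:task:EXAMPLETASK"  # placeholder

# Fetch the current task settings, change only the parallel-load value,
# and write the settings back. The task must be stopped before modifying it.
task = dms.describe_replication_tasks(
    Filters=[{"Name": "replication-task-arn", "Values": [TASK_ARN]}]
)["ReplicationTasks"][0]

settings = json.loads(task["ReplicationTaskSettings"])
settings["FullLoadSettings"]["MaxFullLoadSubTasks"] = 4  # default is 8

dms.modify_replication_task(ReplicationTaskArn=TASK_ARN,
                            ReplicationTaskSettings=json.dumps(settings))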

Secondary indexes. In some situations, you might want to remove secondary indexes during the bulk load phase of DMS for performance reasons. If you do choose to remove some or all of your secondary indexes during the bulk phase, you probably should pause the migration and add them back during the transactional apply phase. You can safely pause your migration after the full load is complete for all tables.

Foreign keys, triggers, and so on. Because bulk load is done table by table, foreign keys in your Aurora target might be violated during the bulk load and cached transaction phases of your migration. You can disable foreign key checks by adding the following as an extra connection attribute of your target Aurora endpoint definition: initstmt=SET FOREIGN_KEY_CHECKS=0. In general, you should develop a strategy for dealing with anything that might be disrupted or negatively affected by bulk loading of data. For example, to avoid issues, you may want to postpone the installation of triggers to the cut-over phase of the migration.

Data types. When migrating to a new database engine, it’s critical to understand which data types are supported and how the source data types are transformed into the new target data types. For this example, you should check the Oracle source data types and Aurora target data types in the DMS documentation.

Performance. The overall performance of your migration can vary depending on the amount, type, and distribution of data in your Oracle source. The Database Migration Service Best Practices whitepaper has some good recommendations for optimizing the performance of your migration.

To recap the process:

  1. Use the SCT assessment report to get an overview of the task at hand. If you have multiple candidates for migration to Aurora, this report can help you decide which one to tackle first.
  2. Practice generating your target schema and loading it using DMS to flush out any migration steps that might be required before and after processing.
  3. Test your application on your target system to ensure it will work as expected in your new environment. Try to test your application in a configuration that resembles your production configuration, including load, network configuration, and so on.
  4. Practice the actual migration including generating your schema, loading your data, applying postprocessing steps, bringing the target system in sync with the source, and any cut-over steps you might require.
  5. Keep in mind that neither the SCT nor DMS require you to migrate your entire system in one fell swoop. You can use these tools to effectively migrate and rearchitect your system piecemeal if you want to.

Before you begin your actual migration, we recommend you thoroughly read the documentation for both the SCT and DMS. We also recommend you read the step-by-step walkthrough and the Database Migration Service Best Practices whitepaper.

If you’d like to use our sample database to get a feel for using the tools, you can find it in the AWS GitHub repository.

Although this blog post is not intended to outline all possible steps or considerations that might be required for your particular situation, it should give you a good idea how you can use the SCT and DMS to relieve you of the shackles of your proprietary Oracle database. Good luck and happy migrating!

 

Performing Parallel Queries and Phrase Searching with Amazon RDS for PostgreSQL 9.6.1

by Jeff Levine | in RDS PostgreSQL

Jeff Levine is a solutions architect for Amazon Web Services.

As a solutions architect for Amazon Web Services, one of my responsibilities is to help our customers learn about new features and capabilities of our services. Amazon RDS makes it possible to set up, operate, and scale relational databases in the cloud without having to focus on time-consuming administrative tasks. Amazon RDS recently announced support for PostgreSQL 9.6.1. In this blog post, I will discuss two of the features included in this release: the ability to perform parallel queries and phrase searches.

Parallel Queries
Parallel queries make it possible to use multiple CPU cores to speed up the execution of queries by creating multiple worker processes. Each worker process performs the query on a subset of data. The results are then combined to produce the final query result.  The number of worker processes is determined by the value of the parameter max_parallel_workers_per_gather. The default value of this parameter is 0, which disables parallelism.

In the examples that follow, I’ve created a VPC with a PostgreSQL RDS database on a db.m4.large instance class with 2 vCPUs and 8 GB RAM. I’m going to use pgcli, a command line tool for PostgreSQL, to interact with a database named pg961db and create a table.

CLi

 

The create table statement uses the generate_series function to create a set of integers from 1 to 100 million that serve as identifiers for the rows of the table. I am also using the random() and floor() functions to generate a column with a random number from 1 to 9. I then use the select statement to show the first three rows of the table. Now I will run a standard query:

CLi2

 

In the preceding screenshot, max_parallel_workers_per_gather is set to 0 to disable parallel queries before the select is performed. Notice that the query plan calls for a sequential scan (as denoted by Seq Scan). The total query time is approximately 44 seconds. Now let’s see what happens when parallel queries are enabled:

CLi3

In this example, max_parallel_workers_per_gather is set to 2 to enable parallel queries. The output from explain shows that two workers have been launched and that the total processing time was reduced to approximately 29 seconds, a 34% reduction compared to running the same query without parallel workers.

This example shows how parallel queries can reduce the processing time of queries using multiple workers in a divide and conquer manner. This is especially useful when processing aggregate functions such as avg() in the preceding example. The resulting increase in efficiency will depend on a number of factors, including the data itself, the specific query, and the database instance class.
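
If you want to reproduce this experiment without pgcli, the following psycopg2 sketch approximates the statements behind the screenshots. The connection details, table definition, and query are my reconstruction from the description above, not the exact statements used in the post:

import psycopg2

# Connection parameters are placeholders.
conn = psycopg2.connect(host="pg961db.example.us-west-2.rds.amazonaws.com",
                        dbname="pg961db", user="master", password="<PASSWORD>")
conn.autocommit = True
cur = conn.cursor()

# 100 million rows: an id column plus a random digit from 1 to 9.
cur.execute("DROP TABLE IF EXISTS test_table")
cur.execute("""
    CREATE TABLE test_table AS
    SELECT id, floor(random() * 9 + 1)::int AS val
    FROM generate_series(1, 100000000) AS id
""")

for workers in (0, 2):   # 0 disables parallelism; 2 allows two parallel workers
    cur.execute("SET max_parallel_workers_per_gather = {}".format(workers))
    cur.execute("EXPLAIN ANALYZE SELECT avg(val) FROM test_table")
    print("\n".join(row[0] for row in cur.fetchall()))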

Phrase Searches
PostgreSQL 9.6.1 also introduced the ability to perform phrase searches, in which a search query consists of an ordered set of lexemes rather than, as with standard searches, an unordered set. Consider these standard search examples:

CLi4

In the first example, the search for “Amazon Web Services” in “Amazon Web Services has over 90 services” yields the expected result of True. However, the search for “Amazon Web Services” in “The Web Travel Services Agency offers tours of the Amazon” also yields a result of True even though it is likely not what we wish to see. To understand why this happens, see how the plainto_tsquery() function parses query strings:

CLi5

 

The parser first applies the English language processing rules to the string “Amazon Web Services” by normalizing the case of each word and extracting the root of each token before converting it to a lexeme (hence, the change of “services” to “servic”). The resulting query string consists of the lexemes separated by the intersection operator “&.” A match occurs if all of the lexemes are found, regardless of order. PostgreSQL 9.6.1 includes support for phrase search query strings using the phraseto_tsquery() function, as shown here:

CLi6

The query string contains the same three lexemes, but with a new ordering operator <->, which means “is followed by.” For a match to occur, the three lexemes must appear in the correct order. This capability makes it much easier to refine searches and produce the most desirable results.
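
The following psycopg2 sketch reproduces both kinds of searches against the two example sentences; the connection details are placeholders:

import psycopg2

conn = psycopg2.connect(host="pg961db.example.us-west-2.rds.amazonaws.com",
                        dbname="pg961db", user="master", password="<PASSWORD>")
cur = conn.cursor()

docs = ["Amazon Web Services has over 90 services",
        "The Web Travel Services Agency offers tours of the Amazon"]

for doc in docs:
    # Compare the unordered (plainto_tsquery) and ordered (phraseto_tsquery) matches.
    cur.execute("""
        SELECT to_tsvector('english', %s) @@ plainto_tsquery('english', 'Amazon Web Services'),
               to_tsvector('english', %s) @@ phraseto_tsquery('english', 'Amazon Web Services')
    """, (doc, doc))
    plain_match, phrase_match = cur.fetchone()
    print(doc, "| plain:", plain_match, "| phrase:", phrase_match)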

Amazon RDS for PostgreSQL version 9.6.1 brings the latest capabilities of PostgreSQL to the cloud as a managed service. The parallel query feature can increase the performance of searches by using additional CPU capacity to run multiple worker processes. The phrase search capability provides for a more tailored search experience where the specific order of terms is required. It’s easy to get started. Use our Getting Started Guide to create your PostgreSQL database in just a few clicks.

Streaming Changes in a Database with Amazon Kinesis

by Emmanuel Espina | in Kinesis, RDS MySQL

Emmanuel Espina is a software development engineer at Amazon Web Services.

In this blog post, I will discuss how to integrate a central relational database with other systems by streaming its modifications through Amazon Kinesis.

The following diagram shows a common architectural design in distributed systems. It includes a central storage referred to as a “single source of truth” and several derived “satellite” systems that consume this central storage.

SingleSourceOfTruth

You could use this architecture with a relational database as the central data store, taking advantage of the transactional capabilities of this system for maintaining the integrity of the data. A derived system in this context could be a full-text search system that observes this single source of truth for changes, transforms and filters those modifications, and finally updates its internal indexes. Another example could be a columnar store more appropriate for OLAP queries. In general, any system that requires taking action upon modification of individual rows of the central relational system is a good candidate to become a derived data store.

A naive implementation for these kinds of architectures will have the derived systems issuing queries periodically to retrieve modified rows, essentially polling the central database with a SELECT-based query.

A better implementation for this architecture is one that uses an asynchronous stream of updates. Because databases usually have a transaction log where all of the changes in rows are stored, if this stream of changes is exposed to external observer systems, those systems could attach to these streams and start processing and filtering row modifications.  I will show a basic implementation of this schema using MySQL as the central database and Amazon Kinesis as the message bus.

Normally, the MySQL binlog is exposed to read replicas that read all of the changes on the master and then apply them locally. In this post, I am going to create a generalized read replica that will publish changes to an Amazon Kinesis stream instead of applying the modifications to a local database.

Fetch

One important detail of this method is that the consumers won’t receive SQL queries. Those can be exposed too, but in general observers won’t be very interested in SQL unless they maintain a SQL-compatible replica of the data themselves. Instead, they will receive modified entities (rows) one by one. The benefits of this approach are that consumers do not need to understand SQL and the single source of truth does not need to know who will be consuming its changes. That means that different teams can work without coordinating among themselves on the required data format. Even better, given the capabilities of Amazon Kinesis clients to read from a specific point in time, each consumer will process messages at its own pace. This is why a message bus is one of the least coupled ways to integrate your systems.

In the example used in this post, the rows fetcher is a regular Python process that will attach to the central database, simulating a read replica.

The database can be either Amazon RDS or any installation of MySQL. In the case of RDS, the fetcher process must be installed on a different host (for example, EC2) because it is not possible to install custom software on RDS instance hosts. For external installations, the fetcher process can be installed on the same host as the database.

Prepare the master MySQL instance

The MySQL master (the single source of truth) must be configured as if it were a master for regular replication. Binlogs must be enabled and working in ROW format to receive individual modified rows. (Otherwise, you would end up with SQL queries only.) For information, see The Binary Log on the MySQL site.

To enable the binlog, add these two lines to your my.cnf configuration file:
log_bin=<path to binlog>
binlog_format=ROW

It is possible to get row-based logging by setting the transaction isolation level to READ-COMMITTED at the global or session level for all connections (for example, using init_connect or a database API like JDBC).

If you are using RDS (MySQL 5.6+), things are easy! You can create the required configuration by enabling periodic backups (binlogs are disabled if backups are not enabled) and updating the parameter group variable binlog_format to ROW. (You can do this from the RDS Dashboard under Parameter Groups.)

ParameterGroup

Add permissions

If you are using the default user created by RDS, you might already have these permissions. If not, you’ll need to create a user with REPLICATION SLAVE permissions. For information, see Creating a User for Replication.

mysql> CREATE USER 'repl'@'%.mydomain.com' IDENTIFIED BY 'slavepass';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%.mydomain.com';

Create an Amazon Kinesis stream

You need an Amazon Kinesis stream and boto3 client credentials. For information about client credentials, see the Boto 3 documentation.

Open the Amazon Kinesis console and choose Create Stream.

WelcomeToKinesis

Enter the name of your stream and the number of shards. In this example, there is a single shard.

Stream

After a few minutes, your stream will be ready to accept row modifications!

CreateStream

Assign permissions to your CLI user

You can use AWS Identity and Access Management (IAM) to give permissions to the CLI user that will be accessing this stream.

ServiceList

In this example, that user is KinesisRDSIntegration. You can create a user or use an existing one, but you need to add permissions for writing to the Amazon Kinesis stream.

AttachPolicy

You can create a policy specific for your stream. This example uses a standard policy that gives complete access to Amazon Kinesis.

AttachPolicy1

Connecting to the master and publishing changes

To install libraries required by the Python publisher, run the following command:

pip install mysql-replication boto3

For more detailed instructions, see:

https://github.com/noplay/python-mysql-replication

https://boto3.readthedocs.io/en/latest/guide/quickstart.html

Here is the Python script that performs the magic. Remember to replace the <HOST>, <PORT>, <USER>, <PASSWORD>, and <STREAM_NAME> placeholders with the values for your configuration.

import json
import boto3

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
  DeleteRowsEvent,
  UpdateRowsEvent,
  WriteRowsEvent,
)

def main():
  kinesis = boto3.client("kinesis")

  stream = BinLogStreamReader(
    connection_settings= {
      "host": "<HOST>",
      "port": <PORT>,
      "user": "<USER>",
      "passwd": "<PASSWORD>"},
    server_id=100,
    blocking=True,
    resume_stream=True,
    only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent])

  for binlogevent in stream:
    for row in binlogevent.rows:
      event = {
        "schema": binlogevent.schema,
        "table": binlogevent.table,
        "type": type(binlogevent).__name__,
        "row": row,
      }

      kinesis.put_record(StreamName="<STREAM_NAME>", Data=json.dumps(event), PartitionKey="default")
      print(json.dumps(event))

if __name__ == "__main__":
  main()

This script will publish each modified row as an Amazon Kinesis record, serialized in JSON format.

Consuming the messages

Now you are ready to consume the modified records. Any consumer code would work; a minimal boto3 polling sketch follows the sample messages below. If you use the code in this post, you will get messages in this format:

  • {"table": "Users", "row": {"values": {"Name": "Foo User", "idUsers": 123}}, "type": "WriteRowsEvent", "schema": "kinesistest"}
  • {"table": "Users", "row": {"values": {"Name": "Bar user", "idUsers": 124}}, "type": "WriteRowsEvent", "schema": "kinesistest"}
  • {"table": "Users", "row": {"before_values": {"Name": "Foo User", "idUsers": 123}, "after_values": {"Name": "Bar User", "idUsers": 123}}, "type": "UpdateRowsEvent", "schema": "kinesistest"}

Summary

In this blog post, I have shown how to expose the stream of changes to a database’s records by using a fake read replica and Amazon Kinesis. Many data-oriented companies are using architectures similar to this. The example provided in this post, while not ready for a real production environment, can be used to experiment with this integration style and improve the scaling capabilities of your enterprise architecture. The most complex part is probably what is already solved behind the scenes by Amazon Kinesis. You only need to provide the glue!

Additional resources

What every software engineer should know about real-time data’s unifying abstraction

All aboard the Databus: LinkedIn’s scalable consistent change data capture platform

Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need?

by Jon Handler | in Elasticsearch

Jon Handler (@_searchgeek) is a principal solutions architect at Amazon Web Services.

Welcome to the first in a series of blog posts about Elasticsearch and Amazon Elasticsearch Service, where we will provide the information you need to get started with Elasticsearch on AWS.

How many instances will you need?
When you create an Amazon Elasticsearch Service domain, this is one of the first questions to answer.

ElasticSearch

To determine the number of data nodes to deploy in your Elasticsearch cluster, you’ll need to test and iterate. Start by setting the instance count based on the storage required to hold your indices, with a minimum of two instances to provide redundancy.

Storage Needed = Source Data x Source:Index Ratio x (Replicas + 1)

First, figure out how much source data you will hold in your indices. Then, apply a source-data to index-size ratio to determine base index size. Finally, multiply by the number of replicas you are going to store plus one (replica count is 0-based) to get the total storage required. As soon as you know the storage required, you can pick a storage option for the data nodes that dictates how much storage you will have per node. To get the node count, divide the total storage required by the storage per node.

Instances Needed = Storage Needed / Storage per data node

As you send data and queries to the cluster, continuously evaluate the resource usage and adjust the node count based on the performance of the cluster. If you run out of storage space, add data nodes or increase your Amazon Elastic Block Store (Amazon EBS) volume size. If you need more compute, increase the instance type, or add more data nodes. With Amazon Elasticsearch Service, you can make these changes dynamically, with no down time.

Determine how much source data you have
To figure out how much storage you need for your indices, start by figuring out how much source data you will be storing in the cluster. In the world of search engines, the collection of source data is called the corpus. Broadly speaking, there are two kinds of workloads AWS customers run:

  • Single index workloads use an external “source of truth” repository that holds all of the content. You write scripts to put the content into the single index for search, and that index is updated incrementally as the source of truth changes. These are commonly full-text workloads like website, document, and e-commerce search.
  • Rolling index workloads receive data continuously. The data is put into a changing set of indices, based on a timestamp and an indexing period (usually one day). Documents in these indices are not usually updated. New indices are created each day and the oldest index is removed after some retention period. These are commonly for analytics use cases like log analytics, time-series processing, and clickstream analytics.

Index

If you have a single index workload, you already know how much data you have. Simply check your source of truth for how much data you’re storing, and use that figure. If you are collecting data from multiple sources (such as documents and metadata), sum up the size of all data sources to get the total.

If you have a rolling index workload, you’ll need to calculate how much data you will be storing, based on a single time period and a retention length. A very common case is to store the logs generated every 24 hours (the time period) for two weeks (the retention period). If you don’t already know how much log data you’re generating daily, you can get a rough estimate based on 256 bytes per log line times the number of log lines you’re generating daily. Multiply your daily source data size by the number of days in the retention period to determine the total source data size.

How much index space?
IndexSpace

The amount of storage space you’ll use for your index depends on a number of factors. As you send your documents to Elasticsearch, they are processed to create the index structures to support searching them. The on-disk size of these index structures depends on your data and the schema you set up. In practice, and using the default settings, the ratio of source data to index size is usually approximately 1:1.1.

For all practical purposes, and remembering to leave 10% overhead, you can use the source data size as the required index storage size.

Replicas increase the index size
Elasticsearch allows you to set (and change dynamically) a number of replicas for your index. The most important reason to use a replica is to create redundancy in the cluster. For production workloads and for all cases where you cannot tolerate data loss, we recommend using a single replica for redundancy. You might need more replicas to increase query processing capacity. We’ll cover that in a future post. You can have node-level redundancy only if you have more than one node. A single node, even with a replica, will not provide high availability.

Each replica is a full copy of the index, at the shard level. As such, it uses the same storage as the primary copy of the index. If you are using one replica, double the amount of storage for the index.

What’s the storage per instance?
When you configure your Amazon Elasticsearch Service domain, you choose your storage option: instance (ephemeral) storage or EBS storage. If you choose instance storage, then the storage per data node is already set based on your instance type selection. If you choose EBS storage, you can configure the amount of storage per instance, up to the Amazon Elasticsearch Service EBS storage limit for that instance type.

For example, if you choose to use m3.medium.elasticsearch instances and choose instance store as your storage option, each node will have 4 GB of SSD storage.  If you choose to use EBS as your storage, you can attach up to 100 GB to each m3.medium.elasticsearch instance.

The amount of usable storage per instance is less than the total storage available. The service files and operating system files take 3% of the storage on an m3.medium (less on larger instances). The service also reserves 20% of the disk, up to a maximum of 20 GB. What this means, especially for smaller instance types, is that if your computed storage is close to the boundary, opt for an additional data node.

Putting it all together
Let’s work through two examples.

The first example is a single index workload that represents a product catalog for an e-commerce website. The company has a catalog of 100,000 products that take up 1 GB of storage in its database. We multiply this 1 GB by the compression ratio (1.0) to get 1 GB of index size. The company will have one replica, so the total storage required is 2 GB. Because they have 4 GB available for each m3.medium.elasticsearch instance, the company could use just one node. However, they would not have anywhere to deploy a redundant replica, so they choose two m3.medium instances.

The second example is a dynamic index workload. The same company ingests data from the Twitter firehose to do brand sentiment analysis and improve their rank function for their product search. They download 100 GB of Twitter data each day and retain it for seven days. We multiply this 100 GB by the compression ratio (1.0) to get 100 GB of index daily. The company will have one replica of this data, yielding 200 GB of daily index, which they will retain for seven days. Multiplying 200 GB by seven days, the company will need 1,400 GB of storage. They choose m3.large.elasticsearch instances, to which they will attach 512 GB, General Purpose SSD (gp2) EBS volumes. At a minimum, they need three of these volumes, but decide on four m3.large.elasticsearch instances to provide additional storage.
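
The arithmetic in these examples can be captured in a few lines of Python. The source-to-index ratio and the usable-storage fraction below are assumptions you should revisit for your own workload:

import math

def storage_and_nodes(source_gb, replicas=1, source_to_index_ratio=1.0,
                      storage_per_node_gb=512, usable_fraction=0.8):
    # Storage Needed = Source Data x Source:Index Ratio x (Replicas + 1)
    storage_needed = source_gb * source_to_index_ratio * (replicas + 1)
    usable_per_node = storage_per_node_gb * usable_fraction  # rough allowance for reserved space
    nodes = max(2, math.ceil(storage_needed / usable_per_node))  # at least two for redundancy
    return storage_needed, nodes

# Second example: 100 GB/day of Twitter data, 7-day retention, one replica,
# 512 GB General Purpose SSD per m3.large.elasticsearch instance.
storage, nodes = storage_and_nodes(source_gb=100 * 7)
print(storage, "GB of index;", nodes, "data nodes")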

How many instances?
It’s easy to get started with Elasticsearch using Amazon Elasticsearch Service, but there are some choices to make up front. One of your first decisions is to decide how many nodes you’ll need. You can calculate the number of nodes by determining how much data you want to store, multiplying by a compression ratio (usually 1), and then multiplying by the number of replicas. This yields the total on-disk size of the index or indices. Divide that by the per-node storage amount to get the total number of nodes required.

Introducing Cost Allocation Tags for Amazon DynamoDB

by Nitin Sagar | in DynamoDB

Nitin Sagar is a senior product manager for DynamoDB.

You can now add tags to your Amazon DynamoDB tables. Tags are simple, user-customizable key and value pairs that are supported by many AWS services. Tagging for DynamoDB provides fine-grained visibility into your DynamoDB bill. You can assign tags to your tables and see cost breakouts per tag.

Let’s consider a real-life scenario in which you have multiple DynamoDB tables for your different development environments—development, staging, and production. You can assign tags to your DynamoDB tables, representing your various environments by adding a tag and setting the tag key to Environment, and the tag value to Development, Staging, or Production.

Let’s look at how to do this using the DynamoDB console. Before you begin, ensure you have the necessary permissions for the following API operations – ListTagsOfResource and TagResource.

  1. Sign in to the AWS Management Console and open the DynamoDB console at https://console.aws.amazon.com/dynamodb/.
  2. Choose Tables, and then choose the table that you want to modify.
  3. In the Settings tab, choose Tags from the navigation menu.
  4. In the Add Tags section, type Environment for Key and Development for Value. Choose Apply Changes.

 

Settings

 

By default, a newly added tag key is inactive for the billing pipeline. You can activate cost allocation tags in the Billing console by following these steps:

  1. Sign in to the AWS Management Console and open the Billing console at https://console.aws.amazon.com/billing/.
  2. Choose Cost Allocation Tags from the navigation menu.
  3. In the User-Defined Cost Allocation Tags section, select the check box next to the Environment tag key and choose Activate.

 

CostAllocationTags

 

When you have activated your cost allocation tags, the AWS Cost Explorer provides an easy way to see a breakdown of your AWS costs by tagged AWS resources:

  1. Sign in to the AWS Management Console and open the Billing console at https://console.aws.amazon.com/billing/.
  2. Choose Cost Explorer from the navigation menu. Then choose Launch Cost Explorer.
  3. From the menu at top left, choose Monthly costs by service. Select the appropriate time range in the Time range section from the right menu.
  4. In the Filtering section, choose Tag from Filter by.
  5. Choose Environment in the tag key autocomplete field and choose Development in the tag value autocomplete field. Choose Apply.

 

Filtering

 

Your costs are now filtered by the selected tag (Environment=Development). Costs are shown starting from when you applied your tag to your AWS resources (January 15 onward in our example).

 

SaveReport

 

You can add up to 50 tags to your DynamoDB tables by using the DynamoDB Management Console, AWS CLI, or AWS SDK. Global secondary indexes (GSIs) and local secondary indexes (LSIs) associated with base DynamoDB tables are automatically tagged with the same tags that are used for the corresponding DynamoDB base tables. Tagging support for DynamoDB is available in all AWS regions.
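
If you prefer the SDK to the console, the following boto3 sketch uses the TagResource and ListTagsOfResource operations mentioned earlier; the table ARN and region are placeholders:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable"  # placeholder

# Apply the Environment tag used in the walkthrough above.
dynamodb.tag_resource(
    ResourceArn=TABLE_ARN,
    Tags=[{"Key": "Environment", "Value": "Development"}],
)

# Read the tags back.
response = dynamodb.list_tags_of_resource(ResourceArn=TABLE_ARN)
for tag in response["TagList"]:
    print(tag["Key"], "=", tag["Value"])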

You can read more about implementing a consistent and effective tagging strategy for your AWS resources by visiting the AWS Tagging Strategies page.

Introducing AWS Schema Conversion Tool Version 1.0.502

by Eran Schitzer | on | in Schema Conversion Tool (SCT) | | Comments

Eran Schitzer is a product manager in Amazon Web Services.

We are pleased to announce the recent release of version 1.0.502 of the AWS Schema Conversion Tool (AWS SCT). Two key capabilities included with this release are the use of Amazon Aurora with PostgreSQL compatibility as a target and integration with AWS Database Migration Service (AWS DMS).

Amazon Aurora with PostgreSQL compatibility as a target
Following the announcement of Amazon Aurora with PostgreSQL compatibility preview at re:Invent, you can now use the AWS SCT to convert your schema to use with a PostgreSQL-compatible Aurora DB cluster.

The PostgreSQL-compatible edition of Amazon Aurora delivers up to two to three times the performance of PostgreSQL, along with other PostgreSQL capabilities valuable to customers—user-defined functions and data types, complex SQL support, NoSQL and JSON support, and broad application development language support. At the same time, it takes advantage of the scalability, durability, and security capabilities of Amazon Aurora, all for one-tenth the cost of commercial-grade databases such as Oracle Database and Microsoft SQL Server. To sign up for the preview, see the preview webpage.

For more information about converting database schemas, see Converting Database Schema to Amazon RDS by Using the AWS Schema Conversion Tool.

Integration with DMS
After converting a schema with SCT, you can have the associated endpoints and tasks created automatically on a DMS instance. You can run and monitor tasks from either the DMS console or SCT.

The following procedure shows how to create a DMS task from SCT. Once the schema conversion is complete, the DMS task created through the SCT will migrate the data to the target database engine.

To create and access DMS tasks from SCT, you’ll need to define at least one service profile. You can define one or more service profiles to let SCT access your resources running on AWS, including DMS, RDS and AWS Lambda.

  1. Open SCT, choose Settings, and then choose Global settings.
  2. On the AWS Service Profiles tab, type a name for Profile name, enter your AWS access key and AWS secret key, and choose the region where the profile will be created.
  3. Choose Save to save the profile.
  4. Choose Current project settings, and then choose the Project Environment tab. For AWS Service Profile, choose the default profile for the current project, and then choose OK. Once you have defined an AWS service profile, you can define, access, and execute DMS tasks from SCT.
  5. Choose the converted schema you’d like to migrate, open the context (right-click) menu for it, and choose Create DMS Task.
  6. In the Create DMS task window, enter the task details—task name, replication instance, and source and target endpoints (you can also create new endpoints). You can change the defaults for the rest of the parameters, if needed.
  7. Choose Create.
    At this point, a DMS task is being created in AWS, as shown by this screenshot from the DMS console.

    You can view, start, stop, resume, delete, refresh, and show logs of the new DMS task in the DMS tasks list in both SCT and DMS.

    DMSOverview

For more information about version 1.0.502, see the AWS Schema Conversion Tool User Guide. You can download SCT from here.

Auditing an Amazon Aurora Cluster

by Sirish Chandrasekaran | on | in Aurora | | Comments

Sirish Chandrasekaran is a product manager at Amazon Web Services.

At re:Invent, we announced a number of new features for Amazon Aurora, the MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.

Today we are happy to announce that our advanced auditing feature is now available to all customers.

What is advanced auditing, and why is it important?

Auditing refers to logging a specified set of events and publishing these logs in a way that can be consumed either manually or by another application. These logs can form the bedrock of your governance and regulatory compliance programs. Typical uses include log analysis, auditing of user actions (both past events and near real-time threat detection), and configurable alarms set on security-related events. Aurora advanced auditing is designed to provide this functionality with minimal impact on database performance.

Getting started with advanced auditing
To get started, enable advanced auditing and then take a look at the audit logs.

Enabling advanced auditing

You can enable and configure advanced auditing by setting the parameters listed following in the DB cluster parameter group. Modifying these parameters doesn’t require a DB cluster restart, and the behaviors are the same as for Aurora DB instance parameters.

You use the server_audit_logging parameter to enable or disable auditing, and the server_audit_events parameter to specify what events to log.

You use the server_audit_excl_users and server_audit_incl_users parameters to specify who gets audited:

  • If server_audit_excl_users and server_audit_incl_users are empty (the default), all users are audited.
  • If you add users to server_audit_incl_users and leave server_audit_excl_users empty, then only those users added to server_audit_incl_users are audited.
  • If you add users to server_audit_excl_users and leave server_audit_incl_users empty, then only those users added to server_audit_excl_users are not audited, and all other users are.
  • If you add the same users to both server_audit_excl_users and server_audit_incl_users, then those users are audited because server_audit_incl_users is given higher priority.
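
As a minimal sketch, here is one way you might set these parameters with boto3; the parameter group name is a placeholder, and the exact value accepted for server_audit_logging (1 versus ON) depends on how the parameter is defined in your parameter group.

import boto3

rds = boto3.client('rds')

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName='my-aurora-cluster-params',   # hypothetical name
    Parameters=[
        {'ParameterName': 'server_audit_logging',
         'ParameterValue': '1',                            # enable auditing (shown as ON in the console)
         'ApplyMethod': 'immediate'},
        {'ParameterName': 'server_audit_events',
         'ParameterValue': 'CONNECT,QUERY_DDL,QUERY_DML',  # no white space between list elements
         'ApplyMethod': 'immediate'},
    ]
)

Because these parameters are dynamic, the change takes effect without a DB cluster restart.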

The following list describes each of the advanced auditing parameters in more detail.

server_audit_logging enables or disables audit logging. It defaults to OFF; set it to ON to enable logging.

  • Scope: Global
  • Dynamic: Yes
  • Data type: Boolean
  • Default value: OFF (disabled)

server_audit_events contains the comma-delimited list of events to log. There should be no white space between the list elements.

  • Scope: Global
  • Dynamic: Yes
  • Data type: String
  • Default value: Empty string
  • Valid values: You can log any combination of the following events:
    • CONNECT — Logs successful connections, failed connections, and disconnections. This value includes the user information.
    • QUERY — Logs all query text and query results in plain text, including queries that fail due to syntax or permission errors.
    • QUERY_DCL — Similar to Query, but returns only DCL-type queries (GRANT, REVOKE, and so on).
    • QUERY_DDL — Similar to Query, but returns only DDL-type queries (CREATE, ALTER, and so on).
    • QUERY_DML — Similar to Query, but returns only DML-type queries (INSERT, UPDATE, and so on).
    • TABLE — Logs the tables that were affected by query execution.

server_audit_excl_users contains the comma-delimited list of users whose activity isn’t logged. There should be no white space between the list elements. Connect and disconnect events aren’t affected by this variable; they are always logged if logging for them is specified. Even if a user is listed in server_audit_excl_users, that user’s actions are logged if that user is also specified in server_audit_incl_users, because that setting has higher priority than server_audit_excl_users.

  • Scope: Global
  • Dynamic: Yes
  • Data type: String
  • Default value: Empty string

server_audit_incl_users contains the comma-delimited list of users whose activity is logged. There should be no white space between the list elements. Connect and disconnect events aren’t affected by this variable; they are always logged if logging for them is specified. If a user is listed in server_audit_incl_users, that user is logged even if that user is also specified in server_audit_excl_users, because server_audit_incl_users has higher priority.

  • Scope: Global
  • Dynamic: Yes
  • Data type: String
  • Default value: Empty string

Viewing audit logs

You can view the audit logs by using the AWS Management Console. On the Instances page, select and expand the DB cluster, then choose Logs.

ResourceGroup

If you are familiar with the MariaDB Audit Plugin, you will notice a few differences in Aurora’s approach to auditing.

First, logs are presented somewhat differently:

  • Aurora advanced auditing records time stamps in Unix time format.
  • Log events are written to multiple files, and the log records aren’t in sequential order. Customers can concatenate and sort the files using the timestamp and query_id fields to produce a single file with ordered events. You can do this in Unix as follows: cat audit.log.* | sort -t"," -k1,1 -k6,6
  • The number of files might vary based on DB instance size.
  • File rotation is fixed at 100 MB and isn’t configurable.

Also, enabling Aurora advanced auditing after migration from MySQL works differently. Audit log configuration is set using the parameter group for the DB cluster.

How does Aurora implement advanced auditing?

Although auditing capability is commonly available in both commercial databases and some open source databases, auditing in these databases typically has a significant impact on performance, especially when the load is high. One of the main goals of our Aurora implementation is to provide users with a rich set of information without sacrificing performance.

Maintaining performance
To understand how we achieve our performance objective, let’s compare our implementation of advanced auditing to that of the MariaDB Audit Plugin. We use this as our comparison point because MySQL Community Edition doesn’t have a native audit log, and the MariaDB Audit Plugin has emerged as the most popular option to fill this void in the open source community.

MariaDB Audit Plugin uses a single thread with a single mutex for processing and writing each event. Although the design strictly preserves the order of events, it can lead to a performance drop due to the bottleneck in writing logs. If we had used a similar approach for Aurora, the performance impact would have been even more significant due to the much higher scalability and expected throughput of the engine.

To maintain our high-performance advantage, we redesigned the event processing and event writing logic. On the input side, we use a latch-free queue to store audit events without blocking other threads. On the output side, we use multiple threads that write events from the latch-free queue to multiple files. The files can then be post-processed to present a complete audit log with events in order.
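
The sketch below is purely conceptual: Aurora implements this inside the database engine with a latch-free queue, and Python's Queue is lock-based, so this only illustrates the pattern of several writer threads draining a shared event queue into separate files that are merged later.

import Queue      # named queue in Python 3
import threading

audit_queue = Queue.Queue()
NUM_WRITERS = 4   # Aurora spreads writes across four log files

def writer(file_index):
    # Each writer drains events into its own file, so no single file is a bottleneck
    with open('audit.log.{0}'.format(file_index), 'a') as f:
        while True:
            event = audit_queue.get()
            if event is None:          # sentinel: stop this writer
                break
            f.write(event + '\n')

threads = [threading.Thread(target=writer, args=(i,)) for i in range(NUM_WRITERS)]
for t in threads:
    t.start()

# Producers enqueue events without waiting on file I/O
for i in range(100):
    audit_queue.put('event {0}'.format(i))

for _ in threads:
    audit_queue.put(None)              # one sentinel per writer
for t in threads:
    t.join()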

advanced-auditing-how

Log format
The audit log is stored separately on each instance’s local (ephemeral) storage. Each Aurora instance spreads log writes across four log files at a time:

  • Encoding: UTF-8
  • File name pattern: audit.log.[0-3].[%Y-%m-%d-%H-%M][-rotation]
  • Location: /rdsdbdata/log/audit/ (on each host)
  • Rotation: Maximum size 100 MB per log file, currently not customer-configurable. When the largest of the four log files reaches 100 MB, the system rotates to a new set of log files.
  • Clean-up: The system cleans up older audit log files to free space once a certain total size or age is exceeded.
  • Log format: [timestamp],[serverhost],[username],[host],[connectionid],[queryid],[operation],[database],[object],[retcode]
The fields in each record are as follows:

  • timestamp: The Unix time stamp for the logged event, with second precision.
  • serverhost: The name of the instance that the event is logged for.
  • username: The connected user.
  • host: The host that the user connected from.
  • connectionid: The connection ID number for the logged operation.
  • queryid: The query ID number, which can be used for finding the relational table events and related queries. For TABLE events, multiple lines are added.
  • operation: The recorded action type. Possible values are CONNECT, QUERY, READ, WRITE, CREATE, ALTER, RENAME, and DROP.
  • database: The active database, as set by the USE command.
  • object: For QUERY events, this is the executed query. For TABLE events, this is the table name.
  • retcode: The return code of the logged operation.

How does our approach compare to the alternatives?

As mentioned above, many databases offer audit log capability but suffer from poor performance when auditing is enabled. We compared the performance of Aurora against MySQL 5.7 with the MariaDB Audit Plugin for a select-only workload on an 8xlarge instance. As the results following show, MySQL performance degrades significantly when auditing is turned on, whereas Aurora sees only a modest reduction in performance. There’s a 65% reduction in throughput for MySQL 5.7 as opposed to just 15% for Aurora. In effect, Aurora’s performance edge over MySQL 5.7 more than doubles to over an order of magnitude when auditing is turned on.

advanced-auditing-benchmark1

Advanced auditing is available now, and you can start using it today! You can learn more about this feature in the advanced auditing documentation.

Indexing Metadata in Amazon Elasticsearch Service Using AWS Lambda and Python

by Amit Sharma | on | in Elasticsearch, Lambda | | Comments

Amit Sharma (@amitksh44) is a solutions architect at Amazon Web Services.

You can use Amazon S3 to implement a data lake architecture as the single source of truth for all your data. Taking this approach not only allows you to reliably store massive amounts of data but also enables you to ingest the data at a very high speed and do further analytics on it. Ease of analytics is important because as the number of objects you store increases, it becomes difficult to find a particular object—one needle in a haystack of billions.

Objects in S3 contain metadata that identifies those objects along with their properties. When the number of objects is large, this metadata can be the magnet that allows you to find what you’re looking for. Although you can’t search this metadata directly, you can employ Amazon Elasticsearch Service to store and search all of your S3 metadata. This blog post gives step-by-step instructions about how to store the metadata in Amazon Elasticsearch Service (Amazon ES) using Python and AWS Lambda.

Services

Using S3 event notifications and Lambda triggers
In this post, we use S3 event notifications and Lambda triggers to maintain metadata for S3 objects in Amazon ES. S3 event notifications enable you to receive notifications when certain events happen in your bucket. These events can be for any action in an S3 bucket, such as PUT, COPY, POST, and DELETE. More details about S3 event notifications are available in the AWS documentation.

S3 event notifications integrate with Lambda using triggers. Using this integration, you can write Lambda functions that process Amazon S3 events. To do this, in Amazon S3 you add a bucket notification configuration that identifies the type of event that you want Amazon S3 to publish and the Lambda function that you want to invoke.
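
For reference, the same kind of notification configuration can also be created with boto3 instead of the console; the bucket name and function ARN below are placeholders, and the Lambda function must separately grant Amazon S3 permission to invoke it (the console trigger setup described later does this for you).

import boto3

s3 = boto3.client('s3')

s3.put_bucket_notification_configuration(
    Bucket='my-data-lake-bucket',   # hypothetical bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:indexS3Metadata',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {
                'Key': {
                    'FilterRules': [
                        {'Name': 'prefix', 'Value': 'appTier/appServer1/'},
                        {'Name': 'suffix', 'Value': '.log'},
                    ]
                }
            }
        }]
    }
)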

AWSAccount

High-level flow between S3 and Lambda

Putting it together
To put all these parts together, you can take the following steps.
Configuring AWS Lambda with Amazon S3
To configure Lambda with S3, start by choosing AWS Lambda on the console.

Lambda1

If this is the first time you’ve created a Lambda function, choose Get Started Now.

GetStartedLambda

Choose Configure triggers.

ConfigureTriggers

On the next page, you should be able to select the triggers you want to work with.

ConfigureTriggersNext

Choose the S3 bucket and the type of event that you want to capture. You can leave the Prefix and Suffix fields blank or, based on your use case, fill them in.

For example, if you expect all files to come in a folder called /appTier/appServer1, you can use that path as the Prefix value. Similarly, if you expect the files to arrive with a certain suffix like .log, .jpg, .avi, and so on, you can use that in the Suffix field. Events are triggered for an object only if both the Prefix and Suffix fields are matched. Also, select the Enable Trigger check box.

EnableTrigger

Next, provide a name and description and choose Python 2.7 as the run-time environment. Because we are going to upload the code separately, choose Upload a .ZIP file for Code entry. Leave the handler information as the default: lambda_function.lambda_handler.

ConfigureFunction1

Now, let’s create the AWS Identity and Access Management (IAM) roles and related permissions so that our Lambda function can access the AWS resources we need. To do this, choose Create a new role from template(s), give a name to this new role, and for Policy templates, choose S3 object read-only-permission.

ChooseTemplate

ChooseTemplate1

In Advanced settings, leave the Memory, Timeout, and VPC settings at their defaults. Choosing Next creates the Lambda function and also associates the right permissions in S3 so that S3 can invoke this Lambda function. You can verify this in the S3 console: go to the properties of the S3 bucket you specified earlier and check the Events section, as shown following:

Events

Choose the modify icon to see the details and verify the name of the Lambda function.

Creating the Amazon ES domain
Now, let’s create the Amazon ES domain. Go to Services, and choose Elasticsearch Service in Analytics:

ES

Choose the Get Started button on the front page and type a name for your domain (I chose my-es-cluster):

ESDomain

As shown following, choose an instance type and an instance count (both can be changed later if necessary). We recommend choosing m3.medium or larger if you are planning to put this feature into production. Alternatively, t2.micro is a good choice if you are creating a development environment or a small proof of concept.

For storage, you can choose between instance-based storage and various types of Amazon EBS volumes (General Purpose, Provisioned IOPS, and Magnetic). Start with a General Purpose EBS volume, and monitor the overall performance with the FreeStorageSpace, JVMMemoryPressure, and CPUUtilization metrics, along with metrics about query response times, before changing the storage type. For a good reference to handling errors and mitigations, see the AWS documentation.

An important question is: How much storage do you need? For example, if every object uploaded to S3 has metadata sized 1 KB and you expect 10 million objects, you should provision a total of at least 20 GB: 10 GB for the primary instance and an additional 10 GB for the replica. For a more detailed discussion on scaling and capacity planning for Elasticsearch, see the Elasticsearch documentation.

ConfigureCluster

Next, set the access policy. I chose to make mine wide open in order to simplify testing, but don’t do this for your cluster. I could also have used one of the IP-based or user-based templates in the wizard to create a more restrictive policy. For more details on controlling access to your cluster, see this blog post.

SetupAccess

Finally, review the settings and choose Confirm and create. That’s it! The cluster will be created in a few minutes.

ESCluster

Creating the Lambda function
Now comes the main code, which pushes the metadata to Amazon ES for every object creation event that triggers the function. Remember that Lambda has been configured with an execution role that has read-only permissions to read from S3. At a high level, the Python code does the following:

  • Reads the metadata from S3 event
  • Connects to the Amazon ES domain endpoint
  • Creates an index if one has not already been created
  • Writes the metadata into Amazon ES

To connect to Amazon ES, the Python code uses a few specific libraries such as Elasticsearch, RequestsHttpConnection, and urllib. Because we are going to upload the code to the Lambda function as a .ZIP file, download these packages into a specific project folder by using the following commands. But first, make sure pip is installed—find steps to do this on the pip website. Note that the sample code available for download includes all the required libraries, so this step is optional and given here mainly for your understanding:

pip install requests -t /path/to/project-dir
pip install elasticsearch -t /path/to/project-dir
pip install urllib3 -t /path/to/project-dir

Make sure these libraries are now available in the current directory. Now we are ready to look at the code.

The following function connects to Amazon ES:

from elasticsearch import Elasticsearch, RequestsHttpConnection

def connectES(esEndPoint):
    # Connect to the Amazon ES domain endpoint over HTTPS (port 443)
    print('Connecting to the ES Endpoint {0}'.format(esEndPoint))
    try:
        esClient = Elasticsearch(
            hosts=[{'host': esEndPoint, 'port': 443}],
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection)
        return esClient
    except Exception as E:
        print("Unable to connect to {0}".format(esEndPoint))
        print(E)
        exit(3)

This function takes the domain endpoint as an argument and returns the Elasticsearch client instance. Be sure to use your domain’s endpoint to declare esClient:
esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")
The following function creates an Amazon ES index:

def createIndex(esClient):
    # Create the metadata index only if it doesn't already exist
    try:
        res = esClient.indices.exists('metadata-store')
        print("Index Exists ... {}".format(res))
        if res is False:
            esClient.indices.create('metadata-store', body=indexDoc)
            return 1
    except Exception as E:
        print("Unable to Create Index {0}".format("metadata-store"))
        print(E)
        exit(4)

Note that this function takes esClient as an instance of the Elasticsearch client returned by the connectES function. Also note that ‘metadata-store’ and ‘indexDoc’ are the name and mapping of the index we are trying to create. The ‘indexDoc’ mapping is defined following:

indexDoc = {
    "dataRecord": {
        "properties": {
            "createdDate": {
                "type": "date",
                "format": "dateOptionalTime"
            },
            "objectKey": {
                "type": "string"   # "format" applies only to date fields, so it is omitted here
            },
            "content_type": {
                "type": "string"
            },
            "content_length": {
                "type": "long"
            },
            "metadata": {
                "type": "string"
            }
        }
    },
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

We are storing five fields:

  • createdDate
  • objectKey
  • content_type
  • content_length
  • metadata

As part of this, there are a couple of important points to consider.

First, it’s important to plan your shards. The best number of primary and replica shards depends on multiple factors, such as instance sizes, the amount of data, the frequency of new data being generated and old data being purged, query types, and so on. To give an example, for time-series data (for example, log files) you can maintain different indexes per hour, per day, or per week depending on the speed of data being generated—we recommend daily indexes in most cases. Because older logs are less likely to be queried, you can re-index those to lower primary shard numbers or drop the entire index. A more detailed discussion is provided in the Elasticsearch documentation.
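
For example, a time-series workload might simply derive the index name from the current date; this is illustrative only, because the metadata use case in this post keeps a single index.

from datetime import datetime

# One index per day for time-series data, for example 'logs-2016-10-27'
daily_index = 'logs-' + datetime.utcnow().strftime('%Y-%m-%d')
# esClient.index(index=daily_index, doc_type='logline', body=document)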

Also, consider using bulk indexing. The preceding code sample works fine for many use cases with low to moderate traffic—for example, up to 100 PUTs per second on S3 with 1 KB of metadata. However, for higher traffic volumes we recommend using larger instances and, instead of indexing every document individually, using the _bulk index API call to efficiently load the data into an Elasticsearch cluster. In a follow-up blog, we will give architectural patterns and recommendations on how to do _bulk indexing efficiently and cost-effectively.
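
As a rough sketch of what bulk indexing could look like with the elasticsearch-py bulk helper, assuming esClient is the client returned by connectES and the document contents are placeholders:

from elasticsearch import helpers

# Build a list of bulk actions instead of calling esClient.index() once per document
actions = [
    {
        '_index': 'metadata-store',
        '_type': 'images',
        '_source': {
            'objectKey': 'file-{0}.jpg'.format(i),   # placeholder values
            'content_length': 1024,
        },
    }
    for i in range(1000)
]

# Sends the documents in a handful of _bulk requests rather than 1,000 separate calls
helpers.bulk(esClient, actions)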

For a detailed explanation about shard settings as part of the cluster planning, refer to the Elasticsearch documentation.

Following is the function that actually writes metadata into Elasticsearch:

def indexDocElement(esClient, key, response):
    # Index one document containing the S3 object's metadata
    try:
        indexObjectKey = key
        indexcreatedDate = response['LastModified']
        indexcontent_length = response['ContentLength']
        indexcontent_type = response['ContentType']
        indexmetadata = json.dumps(response['Metadata'])
        retval = esClient.index(index='metadata-store', doc_type='images', body={
            'createdDate': indexcreatedDate,
            'objectKey': indexObjectKey,
            'content_type': indexcontent_type,
            'content_length': indexcontent_length,
            'metadata': indexmetadata
        })
    except Exception as E:
        print("Doc not indexed")
        print("Error: ", E)
        exit(5)

This function takes esClient, an S3 object key, and the complete response of the S3.get_object function. This response contains the actual metadata. The elements in response are indexed by calling esClient.index. The document ID is autogenerated by Elasticsearch. You can see all the index options in the Elasticsearch documentation.

Finally, following is the main Lambda handler code that calls all these functions when the function is invoked by a trigger:

import json
import urllib
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")
    createIndex(esClient)

    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print(response)
        print("KEY: " + key)
        print("CONTENT TYPE: " + response['ContentType'])
        print("Metadata : " + json.dumps(response['Metadata']))
        print("Custom 1: " + response['ResponseMetadata']['HTTPHeaders']['x-amz-meta-custom1'])
        print("Custom 2: " + response['ResponseMetadata']['HTTPHeaders']['x-amz-meta-custom2'])
        indexDocElement(esClient, key, response)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

You can also download the entire handler code from here for the index creation.

To verify that the metadata has been entered into Elasticsearch, you can use Kibana and search using the standard Elasticsearch API calls and queries. For example, you can query by object name as shown following:

bash# curl -XGET https://search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com/metadata-store/images/_search?pretty\&q=objectKey:YOURFILENAME

  {
    "took" : 1,
    "timed_out" : false,
    "_shards" : {
      "total" : 1,
      "successful" : 1,
      "failed" : 0
   },
   "hits" : {
     "total" : 1,
     "max_score" : 9.516893,
     "hits" : [ {
       "_index" : "metadata-store",
       "_type" : "images",
       "_id" : "AVgGSFxdQ43eQcLduwj9",
       "_score" : 9.516893,
       "_source" : {
         "content_length" : 61194,
         "objectKey" : "YOURFILENAME",
         "metadata" : "{\"custom1\": \"banana\", \"custom2\": \"shake\"}",
         "content_type" : "application/octet-stream",
         "createdDate" : "2016-10-27T13:15:54+00:00"
       }
    } ]
  }
}

Following is a screenshot of Kibana after indexing a few documents:

Kibana
Deleting metadata when an S3 object is deleted
To delete the related metadata when you delete an S3 object, follow the same steps as listed preceding—except that at event type selection, choose Object Removed Event as shown following:

ObjectREmoved

The rest of the steps remain the same. Create an additional trigger for object removal for a total of two triggers and two Lambda functions for two different types of events—object PUT, COPY, or POST and object DELETE.

Following is the main handler code:

import urllib

def lambda_handler(event, context):
    esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")

    # Get the object key from the event and remove its metadata document
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        clearMetaData(esClient, key)
        return 'Removed metadata for ' + key
    except Exception as e:
        print(e)
        print('Error removing object metadata from Elasticsearch domain.')
        raise e

The clearMetaData function is defined as follows:

def clearMetaData(esClient, key):
    # Find every metadata document for this S3 key and delete it by document ID
    try:
        retval = esClient.search(index='metadata-store', doc_type='images', q='objectKey:' + key, fielddata_fields='_id')
        total = retval['hits']['total']
        count = 0
        while (count < total):
            docId = retval['hits']['hits'][count]['_id']
            print("Deleting: " + docId)
            removeDocElement(esClient, docId)
            count = count + 1
        return 1
    except Exception as E:
        print("Removing metadata failed")
        print("Error: ", E)
        exit(5)

This function searches the domain for the given S3 object name and calls another function, removeDocElement, passing the document ID, which is unique within the domain. The removeDocElement function is defined as follows:

def removeDocElement(esClient, docId):
    # Delete the metadata document with the given ID from the index
    try:
        retval = esClient.delete(index='metadata-store', doc_type='images', id=docId)
        print("Deleted: " + docId)
        return 1
    except Exception as E:
        print("DocId delete command failed at Elasticsearch.")
        print("Error: ", E)
        exit(5)

This code deletes all the references to that S3 key by using the unique document ID. You can confirm the deletion from the Elasticsearch index by using the following command:

bash# curl -XGET https://search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com/metadata-store/images/_search?pretty\&q=objectKey:train.csv
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Following is the Amazon CloudWatch monitoring snapshot for the Elasticsearch cluster—you can see a number of metrics, such as searchable documents, free storage space, cluster health, and so on. These metrics can help you decide how to scale the cluster from both a compute and a storage perspective. For example, by monitoring FreeStorageSpace or CPUUtilization, you can decide whether to scale out or scale up the Elasticsearch cluster nodes.

Monitoring
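
If you'd rather pull these metrics programmatically than read them off the console, a minimal boto3 sketch might look like the following; the domain name and account ID (the ClientId dimension) are placeholders.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/ES',
    MetricName='FreeStorageSpace',
    Dimensions=[
        {'Name': 'DomainName', 'Value': 'my-es-cluster'},   # your domain name
        {'Name': 'ClientId', 'Value': '123456789012'},      # your AWS account ID
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Minimum'],
)
print(stats['Datapoints'])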

You can also download the entire handler code from here for the index deletion.

Thanks for exploring these technologies with me. For deeper information, take a look at Amazon Elasticsearch Service and AWS Lambda. Let me know in the comments below how this post works for you!