AWS Database Blog
Hemant Borole is a big data consultant at Amazon Web Services.
AWS Database Migration Service (AWS DMS) helps you migrate databases to AWS easily and securely. AWS DMS supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora.
For customers who use multiple AWS accounts, the source and target databases may exist on different accounts. For AWS DMS to work with source and target databases in different accounts, it must be able to communicate across accounts.
The following are some of the options you can use to enable cross-account AWS DMS communication:
- Making database endpoints accessible over the Internet
- Virtual Private Cloud (VPC) peering
- Whitelisting the AWS DMS public IP in the security group
Exposing database endpoints over the Internet is not a secure option because it makes the databases vulnerable to security attacks.
VPC peering is a commonly adopted practice. A VPC is a virtual network (block of CIDR) that is allocated to an AWS account. VPC peering allows two VPCs to connect as if they are on the same network. However, this means that both VPCs must have CIDR blocks that do not overlap. Customers might have compliance restrictions that prevent the VPCs from peering.
The use case discussed in this blog post requires the customer to continuously replicate data changes on Amazon Aurora to Amazon Redshift. We use AWS DMS to achieve this replication. AWS DMS uses replication endpoints to migrate data from a source endpoint to a target endpoint. The source and target endpoints can reside in the same Amazon account or in different Amazon accounts. (The setup of the databases is not in the scope of this post.)
AWS DMS uses a replication task to manage replication between the source and the target endpoint. You can configure the replication to be a one-time migration or an ongoing replication. The task also can restrict the replication to certain schemas and tables from the source database. With the AWS DMS task settings, you can configure schemas and tables to be included or excluded from replication.
The AWS DMS task uses rules mapping, which is a JSON definition that allows you to configure this mapping. Here is an example of the rules mapping.
Zack Slayton is a software development engineer at Amazon
Using Z-order indexing, you can efficiently run range queries on any combination of fields in your schema. Although Amazon DynamoDB doesn’t natively support Z-order indexing, you can implement the functionality entirely from the client side. A single Z-order index can outperform and even replace entire collections of secondary indexes, saving you money and improving your throughput.
When DynamoDB launched in 2012, its indexing options were limited. There were no secondary indexes. You could only order your data on disk by specifying a sort key during table setup, and a sort key could only be a single attribute. The following year, DynamoDB introduced the ability to declare a variety of indexes for an individual table, greatly increasing the database’s flexibility. However, the core design in which each index orders data by a single attribute is still in effect. This post explores an alternative approach to indexing, using which you can execute efficient queries across any number of attributes.
A practical example
Imagine that you run a website that aggregates weather data from a variety of universities and government organizations. For each of these sources, you periodically collect and persist updates that include the temperature that was recorded by a sensor at a particular time and location. You create a table in DynamoDB using the following schema:
For illustrative purposes, we have randomly generated a set of 300,000 such temperature reports. Each report has been attributed to a single fictional weather organization called Under the Weather (UTW), which has a source ID of 1. The data spans three months and covers the continental United States. We will use this dataset to demonstrate the performance of various queries and indexes.
Traditional indexing in DynamoDB
In DynamoDB, composite indexes can be made up of at most two attributes: a partition key and a sort key. In our example, the
sourceId field of each entry (shown in blue) is our partition key; all data with the same
sourceId value will reside in the same partition of our table. Because many of our queries for weather data will be limited to a given time frame (for example, a specified day, week, or month), you might be tempted to simply use the
timestamp field (shown in green) as our sort key . However, in doing so you quickly hit a roadblock, as we discuss following.
Choosing an effective sort key
To select our sort key, we must give careful consideration to the full range of queries that we hope to run against our data. We must also work within the following constraints:
- DynamoDB doesn’t permit two rows to share the same combination of partition key and sort key values.
- When querying for a range of items within a partition, DynamoDB can only constrain its search by the value of the sort key. We can specify constraints on other fields, but the server will only use them to filter the set of items that has already been retrieved from the table using the sort key. Constraints imposed on fields other than the sort key will reduce neither the amount of read capacity consumed by the query nor the amount of time that it takes the query to run.
The uniqueness constraint
Given our use case, the uniqueness constraint complicates things quite a bit. We cannot use
timestamp alone as our sort key because we expect to receive multiple updates from sensors around the world at more or less the same time. If no two items in a DynamoDB table can have the same
timestamp combination, we’ll end up occasionally losing data. In our example table, preceding, the first and second rows cannot coexist.
We can address this problem by appending some additional information to the sort key: the
longitude of each item. We do this by setting the type of our table’s sort key to String instead of Number and then concatenating the other values onto the end of the timestamp. Now our table looks like this:
Because the newly generated
timestamp_lat_long values are all prefixed with the
timestamp value, we can still query for all items within a given timeframe as we could before. But, because no single source of weather data will have two sensors sending reports from the same place at the same time, we no longer risk having primary key collisions in our data. Thus, we’ve satisfied the uniqueness constraint. However, another constraint is still looming.
Wendy Neu is a big data architect at Amazon Web Services.
Many customers are undertaking the journey to open source relational database management system (RDBMS) databases. With the latest release of the AWS Schema Conversion Tool (AWS SCT), it’s easy to change your database engine. AWS SCT makes heterogeneous database migrations easier by automatically converting procedural code from the source SQL dialect into the target SQL dialect. Any code that the tool can’t convert automatically (due to incompatible constructs between SQL dialects) is clearly marked so that you can convert it yourself, often with suggestions.
A few weeks ago, we featured a blog post Introducing AWS Schema Conversion Tool Version 1.0.502 by Eran Schitzer that can help you get started. You can also get started with SCT by following the steps in the AWS documentation to create a project, connect to your source, and run an assessment report.
For a complete end-to-end migration, you need to create and prepare a few items in your target database. Items like users, roles, and grants that have to be re-created or reproduced on the target can be a challenge if you don’t use a repository to store that information. This post outlines a method for extracting users, grants, and roles from an Amazon EC2 Oracle instance and applying them to an Amazon RDS for PostgreSQL instance. You can download the code used in this blog post from this repository.
Create a target database
In preparing for migration, you first create a target RDS PostgreSQL database and establish security groups for connectivity.
Create an Amazon RDS target
Create a target RDS PostgreSQL database in your VPC. You can find a complete set of instructions in the Amazon RDS User Guide. Be sure to attach security groups with adequate database ingress and egress rules for your client machine to the new RDS instance. My EC2 Oracle instance is on the default port 1521, and my RDS PostgreSQL instance is on the default port 5432. I’m using the same security group for both instances. I restricted traffic inbound to those ports from my client but allowed all local traffic within the security group itself.
Eran Schitzer is a product manager at Amazon Web Services.
We are pleased to announce the recent release of version 1.0.602 of the AWS Schema Conversion Tool (AWS SCT). The new version includes a new key capability—data migration agents that extract data from Oracle, MS SQL Server, Teradata, IBM Netezza, Greenplum and HPE Vertica data warehouses, prepare, optimize and upload it to Amazon Redshift.
The new data migration agents are locally installed agents designed to extract data from data warehouses based on Oracle version 11, MS SQL Server version 2008 and later, Teradata version 14.10 and later, IBM Netezza version 7.0.3 and up, Greenplum version 4.3 and up and HPE Vertica version 7.2.2 and up. When data is extracted, the AWS Schema Conversion Tool optimizes it for Amazon Redshift and saves it in local files. AWS can automatically upload these files to Amazon Redshift or you can transfer these files to an Amazon Simple Storage Service (Amazon S3) bucket, where they can be uploaded to Amazon Redshift manually. For large datasets, you can use AWS Snowball to ship the data to AWS.
The migration agents are designed to extract data in parallel and independently. For example, with a large Oracle data warehouse SCT will automatically distribute work between all available migration agents. SCT will automatically manage all the available agents to extract the data from the different partitions and tables in the schema in the most optimized way for Amazon Redshift, consolidate the data, and save it in local files. The migration agents work completely independent from SCT, and you can replace them, if needed, without any work lost. This unique independent parallel execution capability not only accelerates the extraction of the data, but can also withstand failure of one of the agents.
The new migration agents are supported on RedHat and Debian Linux distros, and the SCT client can run on Windows, Mac, Fedora, or Ubuntu. For more information on installation, security settings, and configuration, see Installing Migration Agents in the AWS SCT User Guide.
When the migration agents are installed and configured, you manage them by using the AWS Schema Conversion Tool. The migration agents act as listeners. When they receive instructions from the AWS Schema Conversion Tool, they extract data from your data warehouse. You can find instructions on how to register the migration agents with your AWS Schema Conversion Tool project in Managing Migration Agents in the AWS SCT User Guide.
After your migration extraction tasks complete, your data is ready. The migration agents save your data as files that are compressed, but not encrypted. The migration agents can copy the data files to your Amazon S3 bucket over an Internet connection. Alternatively, you can copy the files to an Amazon S3 bucket using AWS Snowball. Copy the data from the output folder, under the working folder that you specified for your agents.
After uploading your data files to Amazon S3, the data is ready to be uploaded to Amazon Redshift. The extraction task can automatically upload the data to Amazon Redshift or you can use the Amazon Redshift COPY command to load your data to Amazon Redshift. For more information, see Loading Data from Amazon S3 in the Amazon Redshift documentation.
For more information on migration agents, see Using Data Migration Agents in the AWS SCT User Guide.
Sam Dengler is a Solutions Architect at Amazon Web Services
In this blog post, we review concepts and architectural patterns relevant to a chat application. We also discuss implementation details for a chat client and server, and instructions to deploy a sample chat application into your AWS account.
Building a chat application requires a communication channel over which a client can send messages that are redistributed to other participants in the chat room. This communication is popularly implemented using the publish-subscribe pattern (PubSub), where a message is sent to a centralized topic channel. Interested parties can subscribe to this channel to be notified of updates. This pattern decouples the publisher and subscribers, so that the set of subscribers can grow or shrink without the knowledge of the publisher.
PubSub is implemented on a backend server, to which clients communicate using WebSockets. WebSockets is a persistent TCP connection that provides a channel for data to be streamed bidirectionally between the client and server. With a single-server architecture, one PubSub application can manage the state of publishers and subscribers, and also the message redistribution to clients over WebSockets. The diagram following illustrates the path that messages travel over WebSockets between two clients on a single-server PubSub architecture.
Single-Server PubSub Architecture
Ballu Singh and Pubali Sen are solutions architects at Amazon Web Services.
AWS Database Migration Service (AWS DMS) helps you migrate databases to AWS easily and securely. The AWS Database Migration Service can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle. It also supports heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.
Data replication with AWS Database Migration Service integrates tightly with the AWS Schema Conversion Tool (AWS SCT), simplifying heterogeneous database migration projects. You can use AWS SCT for heterogeneous migrations. You can use the schema export tools native to the source engine for homogenous migrations.
In this blog post, we focus on migrating the data from Oracle Data Warehouse to Amazon Redshift.
In the past, AWS SCT couldn’t convert custom code, such as views and functions, from Oracle Data Warehouse to a format compatible with the Amazon Redshift. To migrate views and functions, you had to first convert the Oracle Data Warehouse schema to PostgreSQL. Then you’d apply a script to extract views and functions that are compatible with Amazon Redshift.
After an update based on customer feedback, we’re happy to let you know that with AWS SCT and AWS DMS, you can now migrate Oracle Data Warehouse to Amazon Redshift along with views and functions.
The following diagram illustrates the migration process.
How to Stream Data from Amazon DynamoDB to Amazon Aurora using AWS Lambda and Amazon Kinesis Firehose
Aravind Kodandaramaiah is a partner solutions architect with the AWS Partner Program
We find that customers running AWS workloads often use both Amazon DynamoDB and Amazon Aurora. Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. Its flexible data model and reliable performance make it a great fit for mobile, web, gaming, ad tech, IoT, and many other applications.
Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora provides up to five times better performance than MySQL with the security, availability, and reliability of a commercial database at one-tenth the cost.
To put these together, imagine you have built a custom web analytics engine, with millions of web clicks registered within DynamoDB every second. Amazon DynamoDB operates at this scale and can ingest high-velocity data. Now imagine needing to replicate this clickstream data into a relational database management system (RDBMS), such as Amazon Aurora. Suppose that you want to slice and dice this data, project it in various ways, or use it for other transactional purposes using the power of SQL within stored procedures or functions.
To effectively replicate data from DynamoDB to Aurora, a reliable, scalable data replication (ETL) process needs to be built. In this post, I show you how to build such a process using a serverless architecture with AWS Lambda and Amazon Kinesis Firehose.
The following diagram shows the solution architecture. The motivations behind this architecture are the following:
- Serverless – By offloading infrastructure management to AWS, you achieve zero-maintenance infrastructure. You also simplify security management for the solution, because there is no need to use keys or passwords, and you optimize cost. In addition, you automate scaling with concurrent Lambda function executions based on shard iterators in DynamoDB Streams.
- Ability to retry failures – Because the data movement process needs to be highly reliable, the process needs to handle failures in each step and provide an ability to retry. This architecture does that.
- Optimization of concurrent database connections – By buffering records based on interval or buffer size, you can reduce the number of concurrent connections to Amazon Aurora. This approach helps avoid connection timeouts.
- Separation of concerns – Using AWS Lambda, you can separate each concern of the data replication process. For example, you can separate the extract phase as processing DynamoDB streams, the transform phase as Firehose-Lambda transformation, and the load phase as bulk insert into Aurora.
Eran Schitzer is a product manager at Amazon Web Services.
We’ve recently added a new feature in AWS Database Migration Service (AWS DMS)—the ability to receive DMS events notifications, such as email messages, text messages, or calls to HTTP endpoints, through Amazon Simple Notification Service (Amazon SNS).
You now can subscribe and receive notifications for two types of events—events related to DMS instances and events related to replication tasks. Events related to DMS instances include those for availability, configuration change, creation, deletion, and maintenance. For example, when a DMS instance goes down for maintenance, a notification is triggered.
Events related to replication tasks include those such as start, pause, finish, Full Load completed, CDC started, and many more. For example, when a migration task finishes to migrate the entire data, it will trigger a “Full Load completed” notification. If the task is configured to follow Full Load mode with CDC mode (that is, replicate the changes in the data since the Full Load began), a “CDC started” notification is triggered next.
In addition, AWS DMS groups events into categories that you can subscribe to using the AWS DMS console or the AWS DMS API. This subscription means you can be notified when an event occurs in the category you subscribed to. For example, if you subscribe to the creation category for a given replication instance, you are notified whenever a creation-related event occurs that affects your replication instance, such as a replication instance is being created.
The following list represents the possible categories for subscription for the DMS replication instance at this time:
- Configuration change—a replication instance configuration is being changed
- Creation—a replication instance is being created
- Deletion—a replication instance is being deleted
- Maintenance—offline maintenance of replication instance is taking place
- Low storage—the free storage for the replication instance
- Failover—failover for a Multi-AZ instance, when enabled, has begun or finished
- Failure—the replication instance has gone into storage failure or has failed due to incompatible network
The following list represents the possible categories for subscription for the DMS replication task at this time:
- State change—the replication task has started or stopped
- Creation—the replication task has being created
- Deletion—the replication task has been deleted
- Failure—the replication task has failed
For a list of the events and event categories provided by AWS DMS, see AWS DMS Event Categories and Event Messages in the documentation.
To subscribe to AWS DMS events, do the following:
- Create an Amazon SNS topic. In the topic, you specify what type of notification you want to receive and what address or number the notification will go to.
- Create an AWS DMS event notification subscription using the AWS Management Console, AWS CLI, or AWS DMS API.
- When you receive an AWS DMS approval email or SMS message to the address you submitted with your subscription, click the link in the approval email or SMS message to confirm your subscription.
When you have confirmed the subscription, the status of your subscription is updated in the AWS DMS console’s Event Subscriptions section.
You then begin to receive event notifications.
For more information about table mapping using the console, see the DMS documentation.
For more information about AWS Database Migration Service in general, see our website.
Jon Handler (@_searchgeek) is an AWS solutions architect specializing in search technologies.
We have many customers who own and operate Elasticsearch, Logstash, and Kibana (ELK) stacks to load and visualize Apache web logs, among other log types. Amazon Elasticsearch Service provides Elasticsearch and Kibana in the AWS Cloud in a way that’s easy to set up and operate. Amazon Kinesis Firehose provides reliable, serverless delivery of Apache web logs (or other log data) to Amazon Elasticsearch Service.
With Firehose, you can add an automatic call to an AWS Lambda function to transform records within Firehose. With these two technologies, you have an effective, easy-to-manage replacement for your existing ELK stack.
In this post, we show you first how to set up an Amazon Elasticsearch Service domain. Then we show how to create and connect a Firehose stream that employs a prebuilt Lambda function to parse Apache web logs. Finally, we show how to load data with Amazon Kinesis Agent and visualize with Kibana.
Gisung Lim is a security solutions architect at Amazon Web Services Korea and Sangpill Kim is an enterprise solutions architect at Amazon Web Services Korea.
This blog post explains how to enhance the privacy and security of data transfers between Amazon DynamoDB and your corporate network using the new Amazon VPC Endpoints for DynamoDB (currently in public preview). With VPC Endpoints for DynamoDB, you can access your DynamoDB tables using private connection endpoints from within your VPC. We also explore how to prevent access to your data from unauthorized locations by using VPC endpoints (for example, preventing use of the AWS Management Console to access your DynamoDB tables from outside of your company network). Although we don’t talk about this in this blog post, you can also use VPC Endpoints for DynamoDB to help resolve regulatory issues regarding authorization and auditability of DynamoDB for confidential user data.
VPC Endpoints for DynamoDB enables Amazon EC2 instances in your VPC to access DynamoDB using their private IP addresses, without any exposure to the public Internet. This new DynamoDB feature ensures that traffic between your VPC and DynamoDB doesn’t leave the Amazon network. In this configuration, your EC2 instances don’t require public IP addresses, and you don’t need an Internet gateway, a NAT device, or a virtual private gateway in your VPC. Furthermore, you can use endpoint policies to control access to VPC endpoints.
Let’s assume that developers’ and administrators’ PCs in your private corporate network don’t have public Internet connectivity, and that you are using either a virtual private network (VPN) connection or AWS Direct Connect to connect between your corporate network and your VPC. This setup helps mitigate the risk of losing or disclosing personally identifiable information (PII) from the PCs. You can use a VPN to route all DynamoDB network traffic through your corporate network infrastructure to help address concerns about the privacy and security of data transfers. However, using a VPN can introduce bandwidth and availability challenges.
To resolve those challenges, we propose a new architecture based on VPC Endpoints for DynamoDB. First, we define the following objectives for our new design:
- All access to DynamoDB should be internal private communications, not using the public Internet.
- Access to DynamoDB using the AWS Management Console should be prohibited.
- All access to DynamoDB should be restricted to the permitted locations or devices, and should be logged.
To satisfy our objectives, we define four control factors, shown in the following architecture diagram.