How to Choose the Right Database
This article is part of a technical content series crafted by AWS Startup Solutions Architects to help guide early stage startups in setting the foundations needed to start building quickly and easily. The series offers a high-level overview of the technical decisions startup founders need to make when getting off the ground, along with which AWS services are best suited to address those decisions.
Picking a database is a relatively long-term commitment for a startup’s technical decision maker. In a distributed system, every change an application makes is eventually captured in some sort of database. This often makes database migration the most complex part of a workload migration, and doing it with zero downtime is harder still. Taking the time to make an informed choice of database technology upfront can be a valuable early decision for your startup. In this article, we walk through the factors you should consider.
The Right Tool for the Right Job
In order to make an informed decision, we start by understanding various database types. Specifically, we look at databases through two lenses: access characteristics and the pattern of the data being stored.
Data can be structured, like a SQL schema, or semi-structured, like a JSON object, where each object can have a different shape. It could also have no defined structure, like textual data for a full text search or a key-value pair, which is not very different from a file name to file content relationship.
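As a rough illustration of these three shapes of data, here is a minimal Python sketch that uses only the standard library. The table, field names, and file name are made up for the example, and the in-memory SQLite database and plain dictionary are stand-ins, not recommendations for any particular database product.

```python
import json
import sqlite3

# Structured: a fixed schema that every row must follow.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    (1, "Ana", "ana@example.com"),
)
structured_row = conn.execute(
    "SELECT id, name, email FROM users WHERE id = 1"
).fetchone()

# Semi-structured: JSON documents whose shapes differ from one another.
documents = [
    json.loads('{"id": 1, "name": "Ana", "tags": ["admin"]}'),
    json.loads('{"id": 2, "name": "Raj", "address": {"city": "Pune"}}'),
]
shapes = [sorted(doc.keys()) for doc in documents]

# Key-value: an opaque value looked up by its key,
# much like a file name maps to file content.
key_value_store = {"reports/2020-q1.txt": b"quarterly numbers..."}
blob = key_value_store["reports/2020-q1.txt"]
```

The structured row always has the same three columns, the two documents carry different fields, and the key-value store knows nothing about what is inside the value it returns.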
Data can also be segregated by size, that is, the volume of data read from or written to the database. Payment gateways are one use case in which the speed of reads and writes is the primary concern.
You may also consider the speed at which data is produced or consumed when selecting a database. For example, stock market data may be small in volume, but a stock back-testing application may need derived values to be calculated in less than 10 milliseconds.
And finally, data is also segregated by scale: the throughput, or the rate at which data is created or ingested concurrently. We distinguish between transactional (OLTP) and analytical (OLAP) databases. OLAP databases are typically larger databases used for warehousing and data archiving. They generally have looser speed requirements but are expected to process high volumes of data. Typically, early-stage startups don’t have an immediate requirement for OLAP systems, so we’ll focus on OLTP systems only.
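To make the OLTP and OLAP distinction concrete, the sketch below runs both styles of query against the same small table. It is only an illustration: the `payments` table and its values are hypothetical, and an in-memory SQLite database stands in for what would be two very different systems in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " id INTEGER PRIMARY KEY,"
    " account TEXT NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO payments (account, amount_cents) VALUES (?, ?)",
    [("acct-1", 500), ("acct-2", 1200), ("acct-1", 300), ("acct-3", 700)],
)

# OLTP-style access: touch one row by its primary key inside a short transaction.
with conn:
    conn.execute(
        "UPDATE payments SET amount_cents = amount_cents + 100 WHERE id = ?", (2,)
    )
single_payment = conn.execute(
    "SELECT account, amount_cents FROM payments WHERE id = ?", (2,)
).fetchone()

# OLAP-style access: scan every row and aggregate.
totals_by_account = conn.execute(
    "SELECT account, SUM(amount_cents) FROM payments GROUP BY account ORDER BY account"
).fetchall()
```

The first query pattern needs low latency on individual rows; the second needs to get through a large volume of rows, which is why the two workloads tend to end up on different kinds of databases as data grows.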
For a long time, relational databases dominated the database landscape. The days of a single database type are now in the past, but relational databases remain the most widely used type today. A relational database is self-describing because it enables developers to define the database's schema as well as the relations and constraints between rows and tables in the database. Developers rely on the functionality of the relational database, and not on application code, to enforce the schema and preserve the referential integrity of the data. Typical use cases for a relational database include web and mobile applications, enterprise applications, and online gaming. Startups use various flavors of Amazon RDS and Amazon Aurora for high-performance, scalable applications on AWS. Both RDS and Aurora are fully managed, scalable systems.
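The sketch below shows what it means for the database, rather than the application, to enforce a schema and referential integrity. It assumes SQLite (from Python's standard library) as a local stand-in for a relational engine; the `customers` and `orders` tables and the `try_insert_order` helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " total_cents INTEGER NOT NULL CHECK (total_cents >= 0))"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders (customer_id, total_cents) VALUES (1, 2500)")


def try_insert_order(customer_id, total_cents):
    """Return True if the database accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
            (customer_id, total_cents),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# An order for a customer that does not exist violates the foreign key.
orphan_accepted = try_insert_order(customer_id=999, total_cents=100)
# A negative total violates the CHECK constraint.
negative_accepted = try_insert_order(customer_id=1, total_cents=-5)
```

Both invalid inserts are rejected by the database itself, so no application code path can leave an order pointing at a missing customer.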
Key-value and document data
As your system grows, a large share of your data often takes the form of key-value data, where each item is looked up by a single primary key. Key-value databases are highly partitionable and allow horizontal scaling at levels that other types of databases cannot achieve. Use cases such as gaming, ad tech, and IoT lend themselves particularly well to the key-value data model, where the access patterns require low-latency Gets/Puts for known key values.
Amazon DynamoDB is a managed key-value and document database that delivers single-digit millisecond performance at any scale.
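One way to see why key-value data partitions so easily is that the key alone decides where an item lives. The following toy sketch hashes each key to pick a partition. It illustrates the general idea only and is not how DynamoDB or any specific service is implemented; the class name, the fixed partition count, and the `player#` key format are assumptions made for the example.

```python
import hashlib


class PartitionedKeyValueStore:
    """Toy key-value store that spreads keys across a fixed number of partitions."""

    def __init__(self, partition_count):
        # Each partition is a plain dict; in a real system each would be a separate node.
        self.partitions = [{} for _ in range(partition_count)]

    def _partition_for(self, key):
        # A stable hash of the key picks the partition, so Gets and Puts
        # for a known key touch exactly one partition.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.partitions)
        return self.partitions[index]

    def put(self, key, value):
        self._partition_for(key)[key] = value

    def get(self, key, default=None):
        return self._partition_for(key).get(key, default)


store = PartitionedKeyValueStore(partition_count=4)
for player_id in range(1000):
    store.put(f"player#{player_id}", {"score": player_id * 10})

partition_sizes = [len(partition) for partition in store.partitions]
```

Because no operation needs to look at more than one partition, adding capacity is largely a matter of adding partitions. A production system would also need to rebalance keys when the partition count changes, which this sketch leaves out.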
Another relevant database type is a document database. Document databases are intuitive for developers to use, because the data in the application tier is typically represented as a JSON document. Developers can persist data using the same document model format that they use in their application code and use the flexible schema model of Amazon DocumentDB to achieve developer efficiency.
Next, we have graph databases. A graph database's purpose is to make it easy to build and run applications that work with highly connected data sets. Typical use cases for a graph database include social networking, recommendation engines, fraud detection, and knowledge graphs. Amazon Neptune is a fully managed graph database service. Neptune supports both the Property Graph model and the Resource Description Framework (RDF), giving you the choice of two graph APIs: TinkerPop and RDF/SPARQL. Startups use Amazon Neptune to build knowledge graphs, make in-game offer recommendations, and detect fraud.
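To give a feel for the kind of question highly connected data invites, here is a small friend-of-friend recommendation sketch over an in-memory adjacency map. It is plain Python, not Gremlin or SPARQL, and it does not use Neptune; the `follows` data and the `recommend` function are hypothetical.

```python
from collections import Counter

# Each person maps to the set of people they are connected to.
follows = {
    "ana": {"raj", "li"},
    "raj": {"ana", "li", "sam"},
    "li": {"ana", "raj", "sam", "kim"},
    "sam": {"raj", "li"},
    "kim": {"li"},
}


def recommend(graph, person):
    """Rank people two hops away by how many direct connections lead to them."""
    direct = graph.get(person, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the person themselves and anyone they already know.
            if candidate != person and candidate not in direct:
                counts[candidate] += 1
    # Most shared connections first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


suggestions = recommend(follows, "ana")
```

A graph database is designed to run this style of multi-hop traversal directly, where a relational schema would typically need a self-join for every hop.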
Then there are in-memory databases. Financial services, ecommerce, web, and mobile applications have use cases such as leaderboards, session stores, and real-time analytics that require microsecond response times and can have large spikes in traffic coming at any time. We built Amazon ElastiCache, offering Memcached and Redis, to serve low latency, high throughput workloads that cannot be served with disk-based data stores. Amazon DynamoDB Accelerator (DAX) is another example of a purpose-built data store. DAX was built to make DynamoDB reads an order of magnitude faster, from milliseconds to microseconds, even at millions of requests per second.
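The pattern behind an in-memory layer in front of a slower store can be sketched as a small read-through cache with a time-to-live. This is a simplified illustration rather than the behavior of ElastiCache or DAX; the `ReadThroughCache` class, the `load_session` function, and the 60-second TTL are assumptions for the example.

```python
import time


class ReadThroughCache:
    """Serve reads from memory and fall back to a slower loader on a miss."""

    def __init__(self, load_from_store, ttl_seconds, clock=time.monotonic):
        self._load = load_from_store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: read from the backing store and remember the result.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value


slow_store_calls = []


def load_session(session_id):
    # Stand-in for a read against a disk-based database.
    slow_store_calls.append(session_id)
    return {"session_id": session_id, "user": "ana"}


cache = ReadThroughCache(load_session, ttl_seconds=60)
first = cache.get("s-1")   # miss: goes to the backing store
second = cache.get("s-1")  # hit: served from memory
```

Only the first read reaches the backing store. The trade-off is staleness: until an entry expires, the cache can return a value that has since changed in the underlying database.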
Finally, there are search databases. Many applications output logs to help developers troubleshoot issues. Amazon Elasticsearch Service, or Amazon ES, is purpose-built for providing near real-time visualizations and analytics of machine-generated data by indexing, aggregating, and searching semi-structured logs and metrics. Amazon ES is also a powerful, high-performance search engine for full-text search use cases. Startups store billions of documents for a variety of mission-critical use cases, ranging from operational monitoring and troubleshooting to distributed application stack tracing and pricing optimization.
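Full-text search over logs generally rests on an inverted index, which maps each term to the documents that contain it. The sketch below builds a very small one in Python. It is a toy illustration of the concept under simplifying assumptions (lowercased alphanumeric tokens, AND-only queries, no ranking), not a description of how Amazon ES works internally; the log lines and function names are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


logs = {
    1: "ERROR payment service timeout after 30s",
    2: "INFO payment accepted for order 1042",
    3: "ERROR inventory service timeout",
}
index = build_index(logs)
timeouts = search(index, "error timeout")
```

Looking up a term is a dictionary access instead of a scan over every log line, which is what keeps search responsive as the number of documents grows.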
Having gone through the landscape of database choices, we now discuss how to minimize the risk associated with choosing a database for your startup. Availability of mature tooling is the single biggest factor for developers. The PHP-MySQL stack, better known as LAMP, is a good example: broad and deep support for MySQL contributed to the success of PHP, and vice versa. In general, you will find RDS, DynamoDB, and DocumentDB to be good initial choices with wide support for tooling, languages, and flexible data usage patterns.
In this article, we discussed a variety of databases: relational, document, key-value, graph, in-memory, and search. It is always important to capture diverse opinions when making a database decision at your startup. By giving your team ownership of this important decision, you may find that the right choice is not just one database, but maybe two or three. Pick the best database to solve a specific problem or a group of problems.
Have fun, and build on!