Architecture III: Picking the Right Data Store for Your Workload
The Tyranny of Choice
Why does AWS have so many data storage options? Which one is right for me? In this blog series, I provide some clarity for those types of questions. In part one of the series, I discussed the basics of high availability and how redundancy is a common way to achieve it. I also mentioned that bringing redundancy to the data tier introduces new challenges. In part two, I discussed some of those challenges, along with the common tradeoffs that you need to take into consideration while overcoming them. In this third post of the series, I build on the information in part one and two by discussing which aspects of your data-centric workload will influence the tradeoffs that you are willing to make. In future posts, I will discuss specific AWS data storage options and the workloads each storage option is optimized for. After you read this blog series, you will come to appreciate the richness of AWS data storage offerings and learn how to select the right option for the right workload.
As I discussed in part two, the Internet era brought about new challenges for data storage and processing, which prompted the creation of new technologies. The latest generation of data stores are no longer jack-of-all-trades, single-box systems, but complex distributed systems optimized for a particular kind of task at a particular level of scale. Because no single data store is ideal for all workloads, the old habit of choosing a data store for the entire system will not serve us well in this brave new world. Rather, we need to consider each individual workload/component within the system and choose a data store that is right for it.
While we all crave the freedom of choice, when we are faced with too many choices and unclear differentiators among them it is easy to get overwhelmed and either enter a state known as “analysis paralysis” or take a shortcut and go with what is most familiar, rather than what is best.
When you review the various AWS data storage options, the situation might seem similar unless you know how to compare the available options. AWS provides a dozen services that can be classified as “data storage services.” Additionally, you have the option to host whatever you want on your EC2 instances. In my investigation of these data storage options, I use several dimensions to help clarify which service is best suited for a particular task. It is important to keep in mind that these dimensions are just convenient shortcuts, a way of creating a common set of terminology that allows us to reason about data workloads more consistently and succinctly.
Velocity, Variety, and Volume
The first dimensions to consider are the venerable three Vs of big data: Velocity, Variety, and Volume. While most of us are probably familiar with these concepts, how we define them might differ slightly depending on the context. In the following sections, I describe how I use the terms throughout this blog series.
In the world of data stores, Velocity is the speed at which data is being written or read, measured in reads per second (RPS) or writes per second (WPS). In the world of big data, Velocity is defined as the speed at which data is analyzed, ranging from batch, to periodic, to near-real-time, to real-time.
Velocity affects the choice of a data store in several ways. If the rate of writes is high, a single disk or network card can easily become a bottleneck, calling for multiple storage nodes. However, as I mentioned in part two of this series, this forces us to consider CAP tradeoffs, which are easier solved by share-nothing partitioned/sharded architectures. If the rate of reads is high, the solutions include adding read replicas and caches and, again, CAP tradeoffs. When it comes to big data, the higher the rate of analysis (for example, near-real-time vs. batch), the more storage and processing will move away from disk and into memory and away from batch-oriented frameworks (such as Map/Reduce) to streaming-oriented frameworks (such as Apache Spark).
The typical metrics that are used to measure the effectiveness of a data store from the Velocity point of view are writes/reads per second, write/read latency, and the time to analyze a certain amount of data (for example, one day’s worth of data).
Variety determines how structured the data is as well as how many different structures exist in the data. This can range from highly structured, to loosely structured, to unstructured, to BLOB data.
Highly structured data has predefined schema, where each entity of the same type has the same number and type of attributes, and the domain of allowed values for an attribute can be further constrained. The great advantage of highly structured data is its self-described nature. This makes it very effective for data exchange across systems. The self-described nature of highly structured data also makes it easy to reason about it, which means we can build generic tools for storing, processing, and displaying this data, such as relational database management systems and BI/reporting tools.
Loosely structured data has entities which do have attributes/fields, but aside from the field uniquely identifying an entity, the attributes don’t have to be the same in every entity. This data is more difficult to analyze and process in an automated fashion, putting more burden of reasoning about the data on the consumer or application.
Unstructured data, as the name implies, does not have any sense of structure: it specifically has no entities or attributes. This data does contain useful information that can be extracted, but it is up to the consumer or app to figure out how to do it — the data itself will not provide any help.
BLOB data is useful as a whole, but there is usually little benefit in trying to extract value from a piece or attribute of a BLOB. Therefore, the systems that store this data typically treat it as a black box and only need to be able to store and retrieve a BLOB as a whole.
Volume is the total size of the dataset. There are two main reasons we care about data: to get valuable insight from it or store it for later use. When getting valuable insights from data, more data usually beats better models. When keeping data for later use, be it digital assets or backups, the more data we can store, the less we need to guess what data to keep and what to throw away. These two reasons prompt us to collect as much data as we can store, process, and afford to keep.
Typical metrics that measure the ability of a data store to support Volume are maximum storage capacity and cost (such as $/GB).
Although we would like to extract useful information from all data we collect, not all data is equally important to us. Some data has to be preserved at all costs, and other data can be easily regenerated as needed or even lost without significant impact on the business. Depending on the value of data, we are more or less willing to invest in additional durability.
Transient data is usually short-lived and represents drops in the bucket. The loss of some subset of transient data, does not have significant impact on the system as a whole. Examples include clickstream or Twitter data. We usually do not need high durability of this data because we expect it to be quickly consumed and transformed further, yielding higher-value data. If we lose a tweet or a few clicks, this is unlikely to affect our sentiment analysis or user behavior analysis.
However, not all streaming data is transient. For example, for an intrusion detection system (IDS), every record representing network communication can be valuable, as every log record can be valuable for a monitoring/alarming system.
Reproducible data contains a copy of useful information that is often created to improve performance or simplify consumption, such as adding more structure or altering a structure to match consumption patterns. Although the loss of some or all of this data may affect a system’s performance or availability, this will not result in data loss because the data can be reproduced from other data sources. Examples include data warehouse data, read replicas of OLTP systems, and all sorts of cache. For this data, we may invest a bit in durability to reduce the impact on system’s performance and availability, but only to a point.
Authoritative data is the source of truth. Losing this data will have significant business impact because it will be very difficult, or even impossible, to restore or replace it. For this data, we are willing to invest in additional durability. The greater the value of this data, the more durability we will want.
Critical/Regulated data is data that a business must retain at almost any expense. This data tends to be stored for long periods of time and needs to be protected from accidental and malicious changes, not just data loss or corruption. Therefore, in addition to durability, cost and security are equally important factors.
Data temperature is a useful way of looking at data as well. It helps us understand how “lively” the data is: how much is being written/read and how soon it needs to be available.
Hot data is being actively worked on. It is being actively contributed to via new ingests, updates, and transformations. Both reads and writes tend to be single-item. Items tend to be small (up to 100s of KB). Speed of access is essential. Hot data tends to be high-Velocity and low-Volume.
Warm data is still being actively accessed, but less frequently than Hot data. Often, items can be as small as in Hot workloads, but updated and read in sets. Speed of access, while important, is not as crucial as with Hot data. Warm data is more balanced across Velocity and Volume dimensions.
Cold data still needs to be accessed occasionally, but updates to this data are rare and reads can tolerate higher latency. Items tend to be large (tens of hundreds of MB or GB). Items are usually written and read individually. High durability and low cost are essential. Cold data tends to be high-Volume and low-Velocity.
Frozen data needs to be preserved for business continuity, archival, or regulatory reasons, but is not being actively worked on. While new data is regularly added to this data store, existing data is never updated. Reads are extremely infrequent (known as “write once, read never”) and can tolerate very high latency. Frozen data tends to be extremely high-Volume and extremely low-Velocity.
The same data can start as Hot and gradually “cool down.” As it does, the tolerance of read latency increases as does the total size of the data set.
In the next part of this blog series, I will explore individual AWS services and discuss which services are optimized for the dimensions I’ve discussed thus far.
- Segment your system into workloads.
- Think in terms of a data storage mechanism that is most suitable for a particular workload, not a single data store for the entire system.
- To further optimize cost and/or performance, segment data within each workload by Value and Temperature, and consider different data storage options for different segments.