AWS Cloud Enterprise Strategy Blog
Analytics Part 1: Why Every Company Has a Big Data Problem
The weekly Excel report lands in your inbox. As you review it, you see an anomaly in the financial data that you don’t understand, despite the pivot table in the report that lets you drill down to at least some level of detail. You ask your operations analyst what’s going on. “I’m not sure,” your analyst responds. “Let me find out.”
The next day the analyst tells you the anomaly is due to productivity being way down at the manufacturing plant.
“That doesn’t make sense,” you say. “Can you ask HR if sick days are impacting the productivity numbers? Or could it be there was an issue with the time capture application at the plant?”
“It will take a week to get at that data and merge it with the financial data,” your analyst says.
“Can’t you just send me a dump of the data from the ERP and time application, and I’ll work with it myself?”
The analyst responds, “I don’t have access to the data, and it will take a few days to submit the right tickets to get access to it.”
“Well, surely we can get access to the Centers for Disease Control dataset with location specific data to see if flu outbreaks in the local area are impacting productivity numbers at the plant?” you respond sarcastically, your frustration rising. Your analyst looks at you like you have two heads, to which you say, “Uh, never mind.”
If this scenario seems all too familiar, your organization has a big data problem. Your first reaction may be that this is a business intelligence process and tools challenge that has plagued organizations since time began, not a true big data problem. But without getting into a religious debate about what counts as analytics vs. reporting vs. business intelligence, stay with me as I show why every organization has a big data problem. And with artificial intelligence and machine learning capabilities starting to come to fruition, it’s even more important for enterprises to get a better handle on the data they have.
This post is the first in a multi-part series covering the issues and approaches, both organizational and technical, to creating a data-driven organization. In this series, I hope to help you see why every organization should revisit its data and analytics strategy to take better advantage of what big data can bring.
Most of us think of big data as a volume problem, and we usually associate it with use cases like the Internet of Things (IoT) or the storage of large objects like images. But the truth is every organization has a big data issue that has been masked in a number of different ways. I’m going to use the 4 V’s model to describe big data and hopefully convince you that you do indeed have a big data problem (and don’t worry, there are solutions).
Volume
An architect once asked me why we need something like Hadoop when the average database size in their environment was around 50 GB. In many cases, an application’s entire database fit in memory within the database engine. However, what most organizations don’t realize is that a lot of interesting data is thrown away or is simply not accessible.
For example, what about user activity in the application? Is that information readily available? Or the telemetry of the infrastructure hosting the application (including load balancers and switches)? What about where users are interacting with the application, or how they are using it in relation to other applications? What about older versions of data that are no longer compatible with the current table schemas? Yes, there are application and end-user monitoring tools, but that data is rarely analyzed in the context of the business processes and activities it supports.
The other volume problem is that data is siloed across many different applications and data warehouses. While no single application may be “big,” the entirety of all applications in an enterprise is big. When businesses focus on outcomes that span functions or business units, the need to analyze data from many sources becomes very challenging to meet. Data warehouse technologies can do this to a certain extent, but most are constrained and cannot house it all. In my own past, I’ve owned shared reporting platforms with hundreds of data warehouses, data marts, and operational data stores.
These silos of data pose another problem: access. Each place data is stored has its own access roles, rules, and ceremonies to adhere to before you can get at the data. This becomes really pronounced when you run your first data science experiment only to see it stall because you can’t get access. (By the way, if you keep archives of data because of performance issues, that’s another data silo. And once you expand the data types to include object or unstructured data, you have big data.)
Velocity
Velocity is the speed at which data moves, but I would argue it is also the speed at which data changes. Apache Spark, Amazon Kinesis, and other streaming technologies can seem, again, to have little applicability outside of IoT-type use cases and little relevance to enterprise applications. But if you buy my argument that data about what is happening with your infrastructure is a business and business-application concern, then the ability to store and process this information is really important. Tools like Splunk or Sumo Logic are great, but how many times have you wished you had metadata beyond some cryptic server name? You can use Amazon Kinesis to enrich log data with things like application name, business criticality, and owner, and then send it on to your log analytics tool of choice.
How about if I were to ask you how much time it takes to change an interface at your organization? Or how long it takes for data to propagate from your ERP to your downstream systems? How many of your business users would love to be notified and alerted on key events or algorithm outputs?
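To make that concrete, here’s a minimal sketch of the enrichment step: a Lambda function subscribed to a Kinesis stream that tags each log record with business metadata before forwarding it on. The stream wiring, the SERVER_METADATA lookup table, and the field names are hypothetical placeholders, not a prescribed design.

```python
import base64
import json

# Hypothetical lookup table mapping cryptic server names to business metadata.
SERVER_METADATA = {
    "srv-0042": {"application": "order-entry", "criticality": "high", "owner": "ops-finance"},
}


def handler(event, context):
    """Invoked by a Kinesis stream: enrich each log record with application
    name, business criticality, and owner before forwarding it downstream."""
    enriched = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload.update(SERVER_METADATA.get(payload.get("host"), {}))
        enriched.append(payload)
    # Forwarding the enriched records to the log analytics tool of choice
    # (e.g., via its HTTP collector endpoint) is omitted here.
    return {"records_processed": len(enriched)}
```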
By the way, service-oriented architectures (SOAs) and application programming interfaces (APIs) are not sufficient to solve these issues. I recall working with architects in my newly inherited integration platform organization who were declaring that all new data integrations must be API-based. Then someone asked how they were going to deliver 100 GB reference data dumps when a data set needs to be refreshed after a master data change is posted in SAP. Silence.
Variety
Many enterprises realize there are troves of data that don’t fit neatly into traditional database storage technologies (e.g., images, sensor data, etc.). However, most enterprises don’t realize how easily and quickly this information can be acquired and stored in fit-for-purpose solutions like Amazon Simple Storage Service (Amazon S3), which is great for object storage, or how easy it can be to store relationships like social networks in a graph database such as Amazon Neptune.
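As a small illustration of how low the barrier can be, here’s a sketch that drops an image and a batch of sensor readings into S3 with a couple of boto3 calls; the bucket name, key layout, and record fields are hypothetical.

```python
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "analytics-raw-data"  # hypothetical bucket name

# Store an image exactly as the source system produced it -- no schema required.
with open("inspection-photo.jpg", "rb") as image:
    s3.put_object(Bucket=BUCKET, Key="images/plant-7/inspection-photo.jpg", Body=image)

# Store a batch of sensor readings as a JSON object alongside it.
readings = [{"sensor_id": "line-3-temp", "ts": "2018-06-01T12:00:00Z", "value": 71.4}]
s3.put_object(
    Bucket=BUCKET,
    Key="sensors/line-3/2018-06-01.json",
    Body=json.dumps(readings).encode("utf-8"),
)
```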
But variety doesn’t stop with types of data. There is also variety in how you analyze and consume insights from the data. When I launched an analytics initiative at a prior company, we had a tenet to deliver interactive insights to users where they are. To meet that objective, we quickly realized that no single reporting or visualization solution could fulfill all the requirements; you can only ask Excel to do so much. We delivered algorithm-driven insights through APIs, within applications using custom visualization widgets built on JavaScript frameworks like D3.js, and through business intelligence portals leveraging Tableau and other visualization solutions.
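For the API-delivery path in particular, the idea is simply to put algorithm output behind a lightweight endpoint that any front end, whether a D3.js widget or a BI portal, can call. A rough sketch, assuming a hypothetical Flask service and precomputed productivity scores:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical precomputed output of a productivity algorithm, keyed by plant.
PRODUCTIVITY_SCORES = {
    "plant-7": {"week": "2018-W22", "score": 0.82, "trend": "down"},
}


@app.route("/insights/productivity/<plant_id>")
def productivity(plant_id):
    """Serve algorithm output to whatever front end the user already lives in."""
    insight = PRODUCTIVITY_SCORES.get(plant_id)
    if insight is None:
        return jsonify({"error": "unknown plant"}), 404
    return jsonify(insight)


if __name__ == "__main__":
    app.run(port=8080)
```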
Veracity
Data veracity speaks to the noise, abnormality, accuracy, or usefulness of data. When you start tapping into unstructured or object-based data, there’s going to be noise. Just as with electronic noise, there are filtering, enhancement, and amplification mechanisms you can use to get at the data you want.
One use case many enterprises are concerned with is the spiraling cost of sending data into proprietary log aggregation, security, or monitoring tools. In most cases, however, a vast portion of the log data can be filtered out because it is not useful. One pattern I’ve seen implemented is to land the data in a data lake architecture first, rather than sending it directly to specialized log analytics tools, and then filter it using tools like Apache Spark, siphoning off only the useful data in near real time.
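Here’s a simplified sketch of that filtering step, written as a PySpark batch job over a hypothetical S3 prefix and log schema; a real-time variant would apply the same filter with Spark Structured Streaming.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-filter").getOrCreate()

# Read raw logs landed in the data lake (hypothetical S3 prefix and fields).
logs = spark.read.json("s3://analytics-raw-data/logs/2018/06/01/")

# Keep only records worth paying to index: errors, warnings, and security events.
useful = logs.filter(
    (F.col("level").isin("ERROR", "WARN")) | (F.col("event_type") == "auth_failure")
)

# Write the filtered slice back to the lake for the downstream analytics tool.
useful.write.mode("overwrite").parquet("s3://analytics-curated/logs/2018/06/01/")
```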
If you’ve struggled with any of the points above, or if they resonate with you, it’s time for your organization to revisit its analytics approach and architecture. I actually don’t like the term “big data” because it obscures where big data architectures actually apply and the opportunity they provide. What it really boils down to is that every enterprise has the opportunity to deploy fit-for-purpose analytics solutions (storage, processing, querying, analysis, presentation, etc.) to meet its existing business and IT challenges.
As I continue this multi-part series, I’ll cover more details on the future of analytics solutions and some ideas for how best to organize to make your company a data-driven organization.
Never stop innovating,
Joe
chung@amazon.com
@chunjx
http://aws.amazon.com/enterprise/