AWS Cloud Enterprise Strategy Blog
What to Expect from Your Analytics Platform
In my last post I covered why every organization has a big data problem. In summary, the shortcomings of legacy database technologies (and the associated infrastructure costs) lead companies to fragment their data and throw a lot of it away because it's not perfect, it's too big, it's too old, etc. In this post I'll cover what businesses should expect from a modern analytics platform, especially one backed by the capabilities of the AWS cloud, and the key reasons it will bring more value to your business. And to ensure your technology partners don't think you're crazy, I'll also provide some technical reasons why it's possible and how it's being done today.
Access to Any Data I Want
In order to make data-driven decisions, access to many disparate types of information is essential. A pilot relies on the gauges in the cockpit to understand information critical to flight, such as altitude, airspeed, and fuel consumption. But imagine if the pilot didn't have those gauges all in one place. Perhaps they would have to walk to the back cabin or radio in for the information, or, worse yet, ask permission for the data. Unfortunately, this is a daily reality in today's enterprise environment. Data is stored in multiple locations and often requires approval before the employees who need it can retrieve it to do their jobs. For a long time this has simply been accepted as the status quo, but businesses do not need to tolerate it.
Forward-leaning organizations have flipped this standard on its head by pulling data out of the systems in which it exists and storing it in one place (i.e., a data lake). While there are many instances of companies storing large amounts of one type of data, more and more companies are creating enterprise-wide data lakes.
Internet-scale companies like Amazon, Yahoo, and Facebook began to see in the early 2000s that relational database technologies were hitting roadblocks in scalability and performance. Amazon responded with a technology called Dynamo, a highly available and scalable key-value store (i.e., a NoSQL/non-relational technology). Amazon then evolved Dynamo and applied it to public services like Amazon S3 and Amazon DynamoDB. Amazon S3 is attractive to enterprises creating data lakes because of its ability to store many different data types and its low storage cost. There are, of course, other technical solutions, like Hadoop, but an important characteristic of all data lake solutions is their ability to store all types of data at petabyte scale and at low cost.
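To make that idea concrete, here is a minimal sketch of landing disparate data types in an S3-based data lake using the AWS SDK for Python (boto3). The bucket name and key prefixes are hypothetical placeholders, not a prescribed layout.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "acme-enterprise-data-lake"  # hypothetical bucket name

# Land raw data as-is, organized by source and date rather than by a rigid schema.
s3.put_object(
    Bucket=BUCKET,
    Key="raw/crm/2023-10-01/accounts.json",
    Body=json.dumps({"account_id": 42, "region": "EMEA"}),
)
s3.put_object(
    Bucket=BUCKET,
    Key="raw/sensors/2023-10-01/telemetry.csv",
    Body="device_id,temp_c\n17,21.4\n",
)
```

Structured, semi-structured, and binary objects all land in the same store at the same low cost, which is what makes a single enterprise-wide lake practical.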
Responsive to Change
Business systems and data change all the time, but the systems which report or share that information often end up being the last to change. How many times have you been told it will take six months or more for data to be remediated in data warehouses and reports? Or that data changes from source systems have not yet flowed to reporting systems, and that it takes several days for those changes to make their way down because of batch processing? The speed at which data is available dictates the speed at which decisions can be made. Therefore, modern analytics systems should be expected to process and report data in near real time and to be responsive to changes in upstream data sources.
The first key enabler is the way data is stored in big data technologies like Amazon S3 or Hadoop. One of the big inhibitors to changing a relational database is modifying the schema, or definition of how data should be stored. Until the schema is modified, data can't land in the database without choking it. File- or object-based technologies like Amazon S3 don't care how data is structured; data can come as it is, versus the "you need to fit my structure" approach. The other challenge is that only one schema is active at any given time; I'm sure we've all seen database tables named "2015," "2016," etc., and it's not ideal. Big data technologies take a schema-on-read approach, meaning the structure of the data is applied when you read it rather than enforced when it is written. What this means for businesses is that data changes from source systems aren't a big deal.
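As a hedged illustration of schema-on-read, the sketch below uses Apache Spark to apply a structure at query time to raw JSON files already sitting in S3. The bucket path and field names are assumptions for the example only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# The structure is declared here, at read time. The files in S3 were written
# as-is, without having to fit a predefined database schema first.
order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ordered_at", TimestampType()),
])

orders = (
    spark.read
    .schema(order_schema)  # applied on read, not enforced on write
    .json("s3://acme-enterprise-data-lake/raw/orders/")  # hypothetical path
)

# If the source system adds new fields tomorrow, the raw files still land;
# only this read-time schema needs to evolve.
orders.groupBy("customer_id").sum("amount").show()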
The second enabler is streaming technologies like Amazon Kinesis and Apache Spark. Most enterprises move data around in big batches that typically run perhaps once a day. Streaming technologies allow data to be ingested in smaller pieces at massive scale. For example, FINRA ingests 75 billion data events per day, and it not only ingests and stores those events but also processes them in near real time to monitor for anomalies in capital markets. You should never have to wait for the daily batch to complete to understand where your business stands.
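Here is a minimal sketch of the producer side of streaming ingestion with boto3 and Amazon Kinesis Data Streams. The stream name and event shape are hypothetical, and a production producer would batch records and handle retries.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "trade-events"  # hypothetical stream name

def publish_event(event: dict) -> None:
    """Push a single business event onto the stream as it happens,
    instead of waiting for a nightly batch file."""
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["symbol"]),  # spreads records across shards
    )

publish_event({"symbol": "AMZN", "price": 178.32, "ts": time.time()})
```

Downstream consumers (Kinesis Data Analytics, Spark Streaming, Lambda, and so on) can then process these events continuously rather than once a day.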
Interactive Insights Where and How I Want Them
Today’s enterprise users have to jump through lots of hoops to understand the information being presented to them. Maybe it’s digging through their inbox to find the report that was attached. Or logging into the reporting system to download a PDF, only to find they have to copy and paste the data into Excel to make sense of what is there. We need to stop making users slog through terrible experiences to get the data and insights they need. The rallying cry for users should be: bring the data in the right form, with the right tools, at the right time.
Software like Tableau, Amazon QuickSight, and others has made things better by considering the user experience of interacting with data. However, I have found that most enterprises require many tools to meet users’ needs, from Amazon QuickSight embedded in a business intelligence portal to a Tableau workbook sent in an email. AWS provides a diversity of data storage and business intelligence tools delivered through a pay-as-you-go model, which allows organizations to experiment with many different business intelligence tools without making large investments in infrastructure and licensing.
Intelligence Embedded in the Business
Artificial intelligence and machine learning are all the rage these days, and rightly so. Advances in machine learning frameworks, coupled with the use of specialized servers built around graphics processing units (GPUs), are enabling all kinds of new capabilities, like autonomous driving. Of course, training machine learning models requires vast amounts of data (thus the data lake points I discussed above). Organizations are already starting to take advantage of these AI/ML capabilities to drive outcomes never before possible, such as better predicting health outcomes from retinal imaging or predicting outages and hardware breakdowns in the field. Businesses should be strengthening their organizational AI/ML muscles (letting AWS take care of the heavy lifting), as this is not the stuff of science fiction but technology operating in production today.
But one final point I would like to emphasize is that the best algorithm in the world is worthless unless it can be integrated with business processes. Oftentimes getting the insight or the data science model created is the easy part; getting it integrated into your insurance policy engine or retail platform is the harder work, as these systems are not typically engineered to integrate outside data sources or application programming interfaces (APIs). This is a great opportunity to consider moving these systems to the cloud to take advantage of all of the services available to help modernize or re-architect them.
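As one hedged example of what that integration can look like on AWS, the sketch below calls a model hosted behind an Amazon SageMaker endpoint from application code. The endpoint name, payload fields, and response shape are illustrative assumptions; your policy engine or retail platform would make an equivalent API call at the point of decision.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def score_policy_application(application: dict) -> float:
    """Call a deployed model over a simple API so the insight lands
    inside the business process, not in a standalone report."""
    response = runtime.invoke_endpoint(
        EndpointName="policy-risk-model",   # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps(application),
    )
    result = json.loads(response["Body"].read())
    return result["risk_score"]             # assumed response shape

score = score_policy_application({"age": 42, "vehicle": "sedan", "prior_claims": 0})
```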
There are a few more common themes I have heard from customers, such as self-service analytics, but hopefully what I have already provided will help create a set of principles, or a rallying cry, for change in your organization. In Part 3 of this blog series, I will cover models for organizing analytics and the associated processes.
Never stop innovating,
Joe
chung@amazon.com
@chunjx
http://aws.amazon.com/enterprise/