GPU-Powered Big Data Analytics with OmniSci Helps Change Data into Information
By Shashi Raina, Partner Solution Architect at AWS
By Veda Shankar, Senior Developer Advocate at OmniSci
We generate quintillions of bytes of data every day, but raw data is not necessarily useful. To become useful, data must be processed into information.
In this post, we’ll talk about how OmniSci, an AWS Partner Network (APN) Standard Technology Partner, provides a platform that makes this process of changing data into information faster, cheaper, and more interactive.
With the rapid growth of data comes the challenge of getting meaningful information from it.
Traditionally, this problem of extracting meaningful information was solved using CPU-based data analytics tools. These tools are powerful when it comes to performing complex backend calculations, but they don’t provide a good user experience for big data, interactive querying, or visualization.
Ongoing advances in harnessing the power of graphics processing units (GPUs) for complex algorithms are making it possible to analyze large volumes of data interactively, with near real-time latency.
OmniSci (formerly named MapD) is using these advances to provide a GPU-based SQL store and analytical tools.
OmniSci Puts the Power of Data Into Your Hands
Whereas traditional tools didn’t really empower analysts, OmniSci’s GPU-based platform excels at putting the power in your hands. You can slice and dice, drill down, and investigate patterns and trends based on your imagination.
Try filtering New York City taxi data from the last 20 years, for example, to figure out which months brought drivers the most tips. Then, see whether those months line up with other variables like festivals and events. Was there a correlation between weather and the number of rides?
Traditional analytical tools didn’t allow users to query billions of rows and get real-time, interactive results like this. The OmniSci platform helps users ask questions that are not limited by technology or the size of data.
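The tips question above comes down to a single aggregate query. Here is a minimal sketch of that query, using Python's built-in sqlite3 module as a stand-in for a real OmniSci connection. The trips table, its column names, and the sample rows are hypothetical, and the date function is SQLite's, so the syntax would likely differ in another SQL dialect.

```python
import sqlite3


def tips_by_month(rows):
    """Return (month, total_tips) pairs, largest total first.

    rows: iterable of (pickup_date as 'YYYY-MM-DD', tip_amount) tuples.
    """
    con = sqlite3.connect(":memory:")  # stand-in for an OmniSci connection
    con.execute("CREATE TABLE trips (pickup_date TEXT, tip_amount REAL)")
    con.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    # Group every trip by calendar month (across all years) and sum the tips.
    query = """
        SELECT strftime('%m', pickup_date) AS pickup_month,
               SUM(tip_amount)             AS total_tips
        FROM trips
        GROUP BY pickup_month
        ORDER BY total_tips DESC
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


# Made-up rows for illustration only, not real taxi data.
sample = [
    ("2015-12-24", 4.0),
    ("2016-12-31", 6.5),
    ("2016-07-04", 3.0),
    ("2017-07-05", 2.5),
    ("2017-02-14", 1.0),
]
print(tips_by_month(sample))
```

The first row of the result is the month with the highest total tips. Joining that result against a calendar of festivals or a weather table would be the next step toward the follow-up questions.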
Real-World Problem Solving Using OmniSci
Geospatial information drives every phase of upstream exploration and production (E&P) in the oil and gas industry. The mapping, leasing, site preparation, and drilling phases all have the potential to generate terabytes of data that geologists and project managers can use to reduce costs and improve yields.
Yet the wealth of data flowing from sensor-enabled machinery, geologic models, and seismic software overwhelms the standard tools designed for an earlier era when all data was less abundant and location data was particularly difficult to integrate into operations.
A GPU-driven tool set like OmniSci gives geoscientists in the oil and gas industry a way to visualize and interact with the massive volumes of data being generated by their upstream exploration and production processes. It’s a very fast and seamless user experience with billions of geolocated time-series records, giving energy companies the ability to find new reserves and optimize the production of existing wells.
The OmniSci Core database and Immerse visual analytics platform allow you to query and visualize billions of rows in milliseconds using Amazon Web Services’ GPU instances, delivering lower latency than CPU solutions in the area of interactive visualization.
The OmniSci platform comes with the following core components:
- OmniSci Core: Core is an open source SQL-based query engine. It’s able to process up to billions of rows in milliseconds and is capable of unprecedented ingest speeds, making it ideal for high-velocity data.
- OmniSci Rendering Engine: Render uses the GPU’s visual rendering prowess to make big data analytics interactive and visually rich through use of point maps, heat maps, choropleths, scatterplots, and other visualizations.
- OmniSci Visualization Layer: Immerse, a browser-based visualization client, works seamlessly with the OmniSci platform. Its instant cross-filtering enables real-time data querying across multiple chart types.
Figure 1 – OmniSci platform architecture.
Platform in Action
OmniSci offers Amazon Machine Images (AMIs) in both Community and Enterprise versions. This lets you place the GPU instance inside your Amazon Virtual Private Cloud (VPC) deployment and put the same security controls and permissions around it as you would for other workloads.
A representational architecture is shown in Figure 2 and consists of a public and private subnet, as well as a NAT (Network Address Translation) gateway that’s part of the public subnet.
The public subnet also contains a Windows bastion host that acts as the gateway to the OmniSci instance, which is hosted in a private subnet. This instance is attached to an AWS Identity and Access Management (IAM) role that has access to an Amazon Simple Storage Service (Amazon S3) bucket.
This bucket contains CSV files that will be uploaded into the SQL datastore for our demo. Access to the private subnet is open only to the security groups of the bastion host.
Figure 2 – Our representational architecture for the demo.
Besides using Amazon S3 as the staging area for data, there are other ways to load data into the datastore. OmniSci provides utilities such as JDBC and Sqoop for integrating with services like Amazon Elastic MapReduce (Amazon EMR) and Amazon Relational Database Service (Amazon RDS), as well as streaming libraries for streaming data with Kafka.
For our demo, we downloaded the community AMI and used OmniSci Immerse, which is a web client. You can browse to the client from the bastion host using https://<PrivateIpOmniSciInstance>:8443. The username is mapd, and the password is the same as the instance ID. The password can be customized via configuration.
The Data Manager tab lets you use either your local drive or an Amazon S3 bucket as a data source.
Once data is loaded into the SQL store, you can create a dashboard and add different chart types. The AMI comes preloaded with sample datasets and corresponding dashboards.
Figure 3 – OmniSci preloaded datasets.
For our demo, I loaded a public California Oil & Gas Production dataset from Enigma Public’s website. I then created a dashboard using the different chart types to get insights into the oil well production data. Figure 4 shows the California Oil & Gas Production dashboard.
Figure 4 – Collection of different charts from California Oil & Gas Wells.
Monthly production information for California Oil & Gas Wells for the years 2008 through early 2018 includes operating data, well pressure, water produced, American Petroleum Institute gravity scale, and the number of days producing.
We joined the production dataset with the California Wells dataset on the American Petroleum Institute (API) number, and then extracted information like well location (county, latitude, and longitude), field, and area. The resulting dataset has 10 million rows with 44 columns, and I created a dashboard with a small subset of the features.
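A join like this can be sketched in a few lines of SQL. The example below uses Python's built-in sqlite3 module rather than OmniSci itself, and the table names, column names, and sample values are all hypothetical. The post does not say which join type was used. A left join is assumed here so that production records without a matching well keep their rows and simply have no location.

```python
import sqlite3


def join_production_with_wells(production, wells):
    """Attach well location and field to each monthly production record.

    production: (api_number, report_month, oil_produced, days_producing)
    wells:      (api_number, county, latitude, longitude, field)
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE production (api_number TEXT, report_month TEXT, "
        "oil_produced REAL, days_producing INTEGER)"
    )
    con.execute(
        "CREATE TABLE wells (api_number TEXT PRIMARY KEY, county TEXT, "
        "latitude REAL, longitude REAL, field TEXT)"
    )
    con.executemany("INSERT INTO production VALUES (?, ?, ?, ?)", production)
    con.executemany("INSERT INTO wells VALUES (?, ?, ?, ?, ?)", wells)
    # LEFT JOIN (an assumption) keeps production rows that have no well record.
    query = """
        SELECT p.api_number, p.report_month, p.oil_produced,
               w.county, w.latitude, w.longitude, w.field
        FROM production AS p
        LEFT JOIN wells AS w ON w.api_number = p.api_number
        ORDER BY p.api_number, p.report_month
    """
    rows = con.execute(query).fetchall()
    con.close()
    return rows


# Made-up identifiers and coordinates, for illustration only.
production = [
    ("API-001", "2017-01", 120.0, 31),
    ("API-001", "2017-02", 95.0, 28),
    ("API-002", "2017-01", 40.0, 3),
    ("API-999", "2017-01", 10.0, 1),  # no matching well record
]
wells = [
    ("API-001", "Kern", 35.3, -119.5, "Field X"),
    ("API-002", "Kern", 35.1, -119.3, "Field Y"),
]
for row in join_production_with_wells(production, wells):
    print(row)
```

Rows whose API number has no match come back with empty location columns, which is one way a dataset can end up with wells that have no geo coordinates.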
This dashboard shows a point map based on the longitude and latitude of the wells, colored by the name of the operator. You can see that all the wells with geo coordinates are located in Kern County in Southern California.
Figure 5 – Point map for California Oil & Gas Wells.
Horizontal Bar Charts are used to display values for multiple dimensions, with two measures displayed as the width and color of the bar for each dimension group. The bar chart in Figure 6 shows the most active well operator in California during the indicated time period using the number of records to represent the width of each bar.
You can create additional bar charts showing the well operators by the amount of oil produced and assign colors for the different methods of operation.
Figure 6 – Activity by well operator.
Histograms are used to understand the distribution of data, and to see areas of unusually high or low density, which would be masked by an aggregate such as average. The Histogram displays the distribution of data across a continuous variable, by aggregating the data into bins of a fixed size. Vertical bars show the count of data within each bin.
The Histogram in Figure 7 shows the distribution of the number of days that the well was in production during the monthly reporting period. It’s surprising to see there are two peaks, where the wells are either operational throughout the month or for just a few days in a month.
Figure 7 – Well production efficiency distribution.
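The fixed-size binning behind a histogram is simple to sketch. The following is a generic illustration in plain Python, not OmniSci's implementation, and the days_producing values are made up to mimic the two-peak shape described above.

```python
from collections import Counter


def histogram(values, bin_size):
    """Count values per fixed-size bin.

    Returns {bin_start: count}, ordered by bin_start. Empty bins are omitted.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    # Floor each value to the start of its bin, then count per bin.
    counts = Counter((value // bin_size) * bin_size for value in values)
    return dict(sorted(counts.items()))


# Made-up "days producing per month" values, for illustration only.
days_producing = [30, 31, 29, 31, 2, 1, 3, 30, 15, 31, 0, 28]
bin_size = 5

for start, count in histogram(days_producing, bin_size).items():
    label = f"{start}-{start + bin_size - 1}"
    print(f"{label:>5} {'#' * count}")
```

An average over these values would land somewhere in the middle of the month and hide both peaks, which is why the distribution is worth plotting.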
Interactive Querying Via Crossfilters
Now that I have all my charts in the dashboard, as shown in Figure 4 above, I can click around on some of the chart elements, zoom in or out on the point map, and brush across a time range. I can also watch the other charts on the screen update instantaneously.
This feature of OmniSci Immerse is called crossfilter, and it allows a filter applied to one chart to be applied simultaneously to the rest of the charts on a dashboard. This is possible even with very large datasets because OmniSci is not pre-indexing or aggregating any of the data, giving you a completely granular view of the data in real time.
The map selection generates a query in the backend, and the power of this platform lies in the way output is sent back to the user. Instead of sending result sets over the wire to be rendered by the browser client, OmniSci’s GPU-based platform sends back a compressed image that is then displayed in the browser.
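Conceptually, a crossfilter means every chart on the dashboard is a query over the same table, and a selection in one chart becomes a shared filter on all of them. The sketch below illustrates only that idea, using Python's built-in sqlite3 module. The chart definitions, column names, and sample rows are hypothetical, and it returns plain result rows rather than the rendered image that OmniSci sends back for maps.

```python
import sqlite3

# Hypothetical columns; only these may be used as filter keys.
FILTERABLE = {"operator", "field", "report_year"}

# One aggregate query per chart; {where} is filled with the shared filter.
CHARTS = {
    "records_by_operator": "SELECT operator, COUNT(*) FROM production "
                           "{where} GROUP BY operator ORDER BY operator",
    "oil_by_field": "SELECT field, SUM(oil_produced) FROM production "
                    "{where} GROUP BY field ORDER BY field",
    "oil_by_year": "SELECT report_year, SUM(oil_produced) FROM production "
                   "{where} GROUP BY report_year ORDER BY report_year",
}


def build_where(filters):
    """Turn {column: value} into a parameterized WHERE clause."""
    unknown = set(filters) - FILTERABLE
    if unknown:
        raise ValueError(f"unsupported filter columns: {sorted(unknown)}")
    if not filters:
        return "", []
    clause = " AND ".join(f"{column} = ?" for column in filters)
    return f"WHERE {clause}", list(filters.values())


def refresh_dashboard(con, filters):
    """Re-run every chart query with the same filter applied."""
    where, params = build_where(filters)
    return {
        name: con.execute(sql.format(where=where), params).fetchall()
        for name, sql in CHARTS.items()
    }


def make_demo_db():
    """Build a tiny in-memory table of made-up production records."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE production (operator TEXT, field TEXT, "
        "report_year INTEGER, oil_produced REAL)"
    )
    con.executemany(
        "INSERT INTO production VALUES (?, ?, ?, ?)",
        [
            ("Operator A", "Field X", 2009, 100.0),
            ("Operator A", "Field Y", 2009, 50.0),
            ("Operator A", "Field X", 2012, 120.0),
            ("Operator B", "Field Y", 2012, 80.0),
        ],
    )
    return con


con = make_demo_db()
print(refresh_dashboard(con, {}))                          # no selection
print(refresh_dashboard(con, {"operator": "Operator A"}))  # click on a bar
```

Because nothing is pre-aggregated, each refresh runs the queries against the raw rows. In this sketch that is trivial; the point of a GPU-backed engine is to keep that approach responsive at billions of rows.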
You can see the crossfilter in action in Figure 8 when I click on the operator Occidental of Elk Hills in the bar chart. The select statement is applied to all the charts, and I immediately see that Occidental has a steady ongoing production in the Elk Hills oil field. Simultaneously, I can also see that their production in the Buena Vista oil field ended in 2010.
Figure 8 – Crossfilter results for California Oil & Gas.
In this test, we analyzed the oil and gas production in California over a 10-year period and used OmniSci’s Immerse visualization platform to interactively find information and patterns from the raw data.
The analysis does not require any pre-indexing and allows for ingesting extremely large datasets. Check out Enigma’s website to find the public dataset that interests you and spin up an OmniSci AMI from AWS Marketplace to interactively analyze it.
OmniSci – APN Partner Spotlight
OmniSci is an APN Standard Technology Partner. They redefine the limits of speed and scale in big data analytics by combining the fastest analytics software with the fastest hardware, the GPU.