AWS for Industries

A light in the dark—illuminating dark data with the OSDU Data Platform

With the advent of digitalization initiatives and cloud storage, there may be a tendency to forget the “other side”: cases in which specific data or assets still have no digital equivalents. For oil and gas operations in particular, access to physical assets such as paper documents, tapes, or even core samples is important for gaining a complete view of the subsurface in a given area of interest. In the energy industry context, dark data generally refers to unstructured or untapped data that organizations generate and collect during their operations but do not use for analysis, decision-making, or other purposes. Dark data is often overlooked, and its potential value remains hidden because it is not processed, analyzed, or integrated into existing data systems.

Dark data in energy industry operations

Energy operations generate vast amounts of data from various sources, and large amounts of historical data are still available only in analog form. The sorts of data that are, in many cases, awaiting conversion to digital form include paper reports, well logs in sepia or paper print, seismic sections on rolled paper, and reference books. Even when some kind of digital copy does exist, data from these sources is frequently not digitized in its entirety, or customers may still want access to a hard copy because the quality of the digital version is not optimal. In some cases, customers may ask for physical media (cartridges, floppy disks, 9-track tape) to be transcribed to create digital data. As illustrated in Figure 1, the estimated amount of data maintained in archive storage and not readily available for consumption is staggering.

Figure 1. Iron Mountain estimates of the energy data still maintained in physical formats

Oil and gas operators often do not know the full extent of their data. Different applications are designed to handle different types of physical or digital data, and data is often not centralized and not readily available for consumption across an organization. Moreover, as a result of the economic downturn in the oil and gas industry related to the COVID-19 pandemic, operators have tended to cut back on resources for in-house recordkeeping and data management, in many cases outsourcing these functions. These developments have only exacerbated the dark data challenge.

Data consumers typically do not have access to the complete catalog of physical data, and neither do records managers. When working in a geographical area of interest, operators want to identify the information and resources that a project team would ideally want to have in digital form. For example, a team might want access to a well log for carbon storage evaluation, but the original log might have been produced years ago and therefore be available only in print, not as digital data. Even relatively light metadata can be a good starting point for guiding discovery and informing a decision on what to digitize. In some cases, it can even help determine whether a candidate area is worth continuing to assess as a prospect for drilling or carbon storage.

Digitization projects are often tough to scope and plan, making budgeting difficult and perpetuating a lack of access to critical data. Improved visibility into physical information assets can, therefore, be a game changer. In one recent experience, a supermajor was under pressure to maximize asset profitability and efficiency and knew that it had hundreds of thousands of exploration records in storage with minimal insight into them. By taking a closer look at the catalog of records stored in Iron Mountain warehouses, the relevant team was able to scope a large digitization initiative. The team was able to define phases, timelines, budgets, and specific scope inclusions and exclusions. With the digitization of dark data from thousands of boxes and tapes, and the consolidation of the resulting digital data on a single platform, the operator empowered its geoscientists and stakeholders around the globe to make faster and more informed decisions and generated tens of millions of dollars in value—all by making use of decades-old dark data.

Illuminating dark data with the OSDU Data Platform

The digitization of a physical asset, along with the capture of all relevant metadata during the process, is an early and vital step in illuminating dark data. However, to make the resulting data readily available for use in various applications and workflows across an organization, it should be stored in an enterprise system of record such as the OSDU Data Platform. The OSDU Data Platform is an open-source data platform for the energy industry, designed to manage and share subsurface and other data, and it can be extended to accommodate dark data as well.

Iron Mountain, Katalyst Data Management, 47Lining, and Amazon Web Services (AWS) have collaborated to build a workflow that takes historical data stored in Iron Mountain physical storage, digitizes it, ingests it into the OSDU Data Platform on AWS, makes it discoverable via the Iron Mountain eSearch solution, further visualizes it with the Katalyst Data Management iGlass application, and interrogates it with the generative AI assistant that sits on top of the OSDU Data Platform. The overall workflow consists of four key stages:

Stage 1: Mapping the physical storage metadata to the OSDU Data Platform schema. In this stage, two custom schema entities were built to accommodate scanned hard-copy documents and well logs.

Stage 2: Migrating sample data into the OSDU Data Platform. The 47Lining team created the IngestionClient tool to bulk-load metadata from scanned documents and associate it with specific wellbores. This step includes the migration of data and metadata to cloud storage.

Stage 3: Enhancing data discovery workflows via applications compatible with the OSDU Data Platform. The Iron Mountain eSearch tool is used to manage inventories of physical assets in warehouses. The Katalyst Data Management iGlass application has been enhanced to support the newly developed schemas, so that the metadata of physical assets (such as warehoused archives) can be visualized on the same map as other digital data on the OSDU Data Platform. This makes possible a comprehensive view of digital and dark data assets in any application that relies on a corporate single-source-of-truth repository.

Stage 4: Applying generative AI capabilities. To ease data querying across a broad variety of data sources, 47Lining, a Hitachi Digital Services company, built a generative AI assistant that sits atop the OSDU Data Platform.
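The mapping and bulk-load steps in Stages 1 and 2 can be pictured with a minimal Python sketch. The article does not publish the custom schemas or the IngestionClient interface, so the names here are assumptions made for illustration: the `WarehouseItem` fields, the kind name, the ACL groups, the legal tag, and the batching helper are all hypothetical. Only the record envelope (`kind`, `acl`, `legal`, `data`) follows the general shape of an OSDU storage record.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List

# Hypothetical custom kind; the schema names used in the project are not given.
SCANNED_DOC_KIND = "example:physical-archive:work-product-component--ScannedDocument:1.0.0"


@dataclass
class WarehouseItem:
    """One row of a physical-storage inventory (field names are assumptions)."""
    barcode: str
    box_id: str
    media_type: str       # for example "paper log" or "9-track tape"
    description: str
    wellbore_name: str


def to_osdu_record(item: WarehouseItem, partition: str,
                   wellbore_ids: Dict[str, str]) -> dict:
    """Map an inventory row to a dict shaped like an OSDU storage record."""
    data = {
        "Name": item.description,
        "MediaType": item.media_type,
        "PhysicalLocation": {"BoxID": item.box_id, "Barcode": item.barcode},
    }
    # Link the item to a wellbore only when the name resolves to a known ID.
    wellbore_id = wellbore_ids.get(item.wellbore_name)
    if wellbore_id is not None:
        data["WellboreID"] = wellbore_id
    return {
        "kind": SCANNED_DOC_KIND,
        "acl": {  # placeholder groups
            "viewers": [f"data.default.viewers@{partition}.example.com"],
            "owners": [f"data.default.owners@{partition}.example.com"],
        },
        "legal": {  # placeholder legal tag
            "legaltags": [f"{partition}-example-legal-tag"],
            "otherRelevantDataCountries": ["US"],
        },
        "data": data,
    }


def batches(records: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size chunks so records can be bulk-loaded in groups."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Made-up sample inventory rows and wellbore lookup.
inventory = [
    WarehouseItem("BC001", "BOX-17", "paper log", "Gamma ray log print", "Example-1"),
    WarehouseItem("BC002", "BOX-17", "9-track tape", "Field tape", "Unknown-9"),
]
known_wellbores = {"Example-1": "opendes:master-data--Wellbore:example-1:"}
records = [to_osdu_record(item, "opendes", known_wellbores) for item in inventory]
```

In this sketch an item whose wellbore name cannot be resolved is still cataloged, just without a `WellboreID` link. That choice is one possible design, not something the article states; a real pipeline might instead route such items to a review queue.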

A generative AI assistant for the OSDU Data Platform

The generative AI assistant by 47Lining facilitates interactive conversations about documents managed within the OSDU Data Platform. Instead of manually sifting through lengthy documents, users can engage in natural conversation with the assistant and quickly find answers to their questions. A significant advantage of this solution is that it runs in the AWS Cloud and is accessed through a simple web interface, so it is available from virtually anywhere. Whether you’re in the office, out in the field, or working from a remote location, the assistant is within reach.

The generative AI assistant for the OSDU Data Platform was built using Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API. Amazon Bedrock also provides a broad set of capabilities for building generative AI applications, simplifying development while maintaining privacy and security. Using Amazon Bedrock and OpenSearch, the assistant generates a latent semantic embedding for each document. These embeddings serve as the foundation for understanding the content and context of a document. When a user asks a question, the assistant embeds the question and uses that embedding to find the most relevant document embeddings. As Figure 2 shows, a query can be as simple as asking for the latitude of a well or as complex as seeking insights into hydrocarbon occurrences.

Figure 2. The generative AI assistant designed for the OSDU Data Platform on AWS
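The embedding-based retrieval that the assistant relies on can be illustrated with a small, self-contained Python sketch. It is not the 47Lining implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory dictionary stands in for a vector index, cosine similarity is an assumed scoring choice, and the sample documents are made up. It only shows the general idea of matching a question embedding against document embeddings.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy stand-in for an embedding model: a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two unit vectors stored as sparse dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def top_k(question: str, index: Dict[str, Vector], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k document IDs whose embeddings are closest to the question."""
    q = embed(question)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up sample documents, embedded once and kept in an in-memory index.
documents = {
    "well-header": "Well A-1 surface location latitude 29.75 longitude -95.36",
    "core-report": "Core sample description shows hydrocarbon staining in sandstone",
    "drilling-log": "Daily drilling report mud weight and bit depth",
}
index = {doc_id: embed(text) for doc_id, text in documents.items()}

best = top_k("What is the latitude of the well?", index, k=1)
print(best[0][0])  # prints "well-header"
```

In a retrieval-augmented setup, the text of the top-ranked documents would then be passed to a foundation model together with the question so that the answer is grounded in the retrieved content. That final step is omitted here.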


The solution presented above shows one way in which dark data can be illuminated in a modern industry data workflow. Knowing what data you have across all data sources remains a challenge in many industries, including the energy industry. Large amounts of historical data and evolving energy needs continue to challenge data management and industry professionals. The solution demonstrated here uses cloud technologies to process complex and ever-growing amounts of data, which not only helps accelerate enterprise decision-making but also makes data more accessible and user friendly. This workflow is, therefore, a step forward in enhancing productivity and harnessing the power of data management and AI in the world of energy and beyond.

Yuriy Gubanov

Yuriy Gubanov is a Senior Partner Solutions Architect at Amazon Web Services specializing in Energy Data Platforms, including the OSDU Data Platform. Yuriy has worked in the energy industry for nearly two decades architecting, implementing, and delivering innovative IT solutions for the engineering, geoscience, and data management communities. He is an avid cloud computing enthusiast and is always looking for new ways to design and influence the energy systems of the future.

Christine Rhodes

Christine Rhodes is a Senior Partner Development Specialist for Energy Data Platforms, OSDU. Christine has 16 years' experience in the energy industry working in Petrotechnical Data Management and Project/Program Management, and currently spends her days at AWS tackling industry challenges by diving deep and bringing forward innovative solutions from our growing ISV community. She is passionate about change management and how data and relationships can support the energy transition across the industry.

Debasis Chatterjee

Debasis Chatterjee is the director for research and analytics at Katalyst Data Management. He has more than 40 years of experience in the field of oil and gas data management. Prior to joining Katalyst, Debasis worked at Schlumberger (now called SLB) and Noah Consulting (now part of Infosys Consulting). During his long career, Debasis has worked in many countries, on many different client projects, and in various capacities, including technical and business roles. Debasis has been the project management committee (PMC) vice-lead at the Open Group OSDU Forum for more than two years.

Lorena Pelegrin

Lorena Pelegrin is global product lead of energy digital solutions at Iron Mountain. Lorena currently focuses on solutions that make use of previously untapped legacy data to improve strategic decision-making, optimize operations, and reduce risk. Before joining Iron Mountain, Lorena led the technical safety and risk management practice at a global engineering firm, serving the oil and energy industries for over a decade. She is passionate about augmenting human capabilities with better data and automation in the ongoing energy transition.

Priya Choudhari

Priya Choudhari is a principal solution architect at 47Lining, Hitachi Digital Services, with nearly two decades of experience in architecting digital solutions for the oil and gas, healthcare, and aerospace industries. Priya specializes in the OSDU Data Platform and has supported multiple customers in application integration. Prior to joining 47Lining, Priya was a technical lead at Baker Hughes, a GE company, with a focus on delivering software-as-a-service applications for asset performance management. She is deeply passionate about technology innovation.