AWS for Industries
A light in the dark—illuminating dark data with the OSDU Data Platform
With the advent of digitalization initiatives and cloud storage, there may be a tendency to forget the “other side”: cases in which specific data or assets still have no digital equivalents. For oil and gas operations in particular, access to physical assets such as paper documents, tapes, or even core samples is important for gaining a complete view of the subsurface in a given area of interest. In the energy industry context, dark data tends to refer to unstructured or untapped data that organizations generate and collect during their operations but do not use for analysis, decision-making, or other purposes. Dark data is often overlooked, and its potential value remains hidden because it is not processed, analyzed, or integrated into existing data systems.
Dark data in energy industry operations
Energy operations generate vast amounts of data from various sources, and large amounts of historical data are still available only in analog form. The data that is, in many cases, still awaiting conversion to digital form includes paper reports, well logs on sepia or paper prints, seismic sections on rolled paper, and reference books. Even when some kind of digital copy does exist, data from these sources is frequently not digitized in its entirety, or customers may still want access to a hard copy because the quality of the digital version is not optimal. In some cases, customers may ask for physical media (cartridges, floppy disks, 9-track tape) to be transcribed to create digital data. As illustrated in Figure 1, the estimated amount of data maintained in archive storage and not readily available for consumption is staggering.
Figure 1. Iron Mountain estimates of the energy data still maintained in physical formats
Oil and gas operators often do not know the full extent of their data. Different applications are designed to handle different types of physical or digital data, and data is often not centralized and not readily available for consumption across an organization. Moreover, as a result of the economic downturn in the oil and gas industry related to the COVID-19 pandemic, operators have tended to cut back on resources for in-house recordkeeping and data management, in many cases outsourcing these functions. These developments have only exacerbated the dark data challenge.
Data consumers typically do not have access to physical data (that is, to the complete catalog), and neither do records managers. When working in a geographical area of interest, operators want to identify the information and resources that a project team would ideally want to have in digital form. For example, a team might want access to a well log for carbon storage evaluation, but the original log might have been produced years ago and so be available only in print, not as digital data. Even relatively light metadata can be a good starting point for guiding discovery and informing a decision on what to digitize. In some cases, it can even help determine whether a candidate area is worth continuing to assess as a prospect for drilling or carbon storage.
Digitization projects are often tough to scope and plan, making budgeting difficult and perpetuating a lack of access to critical data. Improved visibility into physical information assets can, therefore, be a game changer. In one recent experience, a supermajor was under pressure to maximize asset profitability and efficiency and knew that it had hundreds of thousands of exploration records in storage with minimal insight into them. By taking a closer look at the catalog of records stored in Iron Mountain warehouses, the relevant team was able to scope a large digitization initiative. The team was able to define phases, timelines, budgets, and specific scope inclusions and exclusions. With the digitization of dark data from thousands of boxes and tapes, and the consolidation of the resulting digital data on a single platform, the operator empowered its geoscientists and stakeholders around the globe to make faster and more informed decisions and generated tens of millions of dollars in value—all by making use of decades-old dark data.
Illuminating dark data with the OSDU Data Platform
The digitization of a physical asset, along with the capture of all relevant metadata during the process, is an early and vital step in illuminating dark data. However, to make the resulting data readily available for use in various applications and workflows across an organization, it should be stored in an enterprise system of record such as the OSDU Data Platform. The OSDU Data Platform is an open-source data platform for the energy industry. It is designed to manage and share subsurface and other data, and it can be extended to accommodate dark data as well.
Iron Mountain, Katalyst Data Management, 47Lining, and Amazon Web Services (AWS) have collaborated to build a workflow that takes historical data stored in Iron Mountain physical storage, digitizes it, ingests it into the OSDU Data Platform on AWS, makes it discoverable via the Iron Mountain eSearch solution, further visualizes it with the Katalyst Data Management iGlass application, and interrogates it with the generative AI assistant that sits on top of the OSDU Data Platform. The overall workflow consists of four key stages:
Stage 1: Mapping the physical storage metadata to the OSDU Data Platform schema. In this stage, two custom schema entities were built to accommodate scanned hard-copy documents and well logs.
Stage 2: Migrating sample data into the OSDU Data Platform. The 47Lining team created the IngestionClient tool to bulk-load metadata from scanned documents and associate it with specific wellbores. This step includes the migration of data and metadata to cloud storage.
Stage 3: Enhancing data discovery workflows via applications compatible with the OSDU Data Platform. The Iron Mountain eSearch tool is used to manage inventories of physical assets in warehouses. The Katalyst Data Management iGlass application has been enhanced to support newly developed schemas for visualizing the metadata of physical assets (such as warehoused archives) on the same map as other digital data on the OSDU Data Platform. This makes possible a comprehensive view of digital and dark data assets in any application that relies on a corporate single-source-of-truth repository.
Stage 4: Applying generative AI capabilities. To ease data querying across a broad variety of data sources, 47Lining, a Hitachi Digital Services company, built a generative AI assistant that sits atop the OSDU Data Platform.
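Stages 1 and 2 above can be pictured with a small sketch: each row of a physical-storage catalog is mapped to a record shaped like an OSDU-style entity, and the resulting records are grouped by the wellbore they belong to. This is only an illustration under stated assumptions. The kind string, the field names, and the helper functions are hypothetical, and the sketch does not reproduce the custom schemas or the IngestionClient tool described in this post.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical kind for a custom "scanned document" entity (assumption:
# the actual custom schema names are not given in this post).
SCANNED_DOC_KIND = "example:custom:ScannedDocument:1.0.0"


@dataclass
class CatalogRow:
    """One row of a physical-storage catalog (all field names are assumptions)."""
    box_barcode: str
    description: str
    media_type: str
    wellbore_id: str
    scan_uri: str


def to_record(row: CatalogRow) -> dict:
    """Map a catalog row to an OSDU-style record dictionary."""
    return {
        "kind": SCANNED_DOC_KIND,
        "data": {
            "Name": row.description.strip(),       # tidy free-text descriptions
            "MediaType": row.media_type.lower(),   # normalize e.g. "PAPER" -> "paper"
            "BoxBarcode": row.box_barcode,         # link back to the physical box
            "WellboreID": row.wellbore_id,         # association used for discovery
            "DatasetURI": row.scan_uri,            # where the scanned file lives
        },
    }


def group_by_wellbore(records: list[dict]) -> dict[str, list[dict]]:
    """Group mapped records by the wellbore they are associated with."""
    grouped: dict[str, list[dict]] = {}
    for record in records:
        grouped.setdefault(record["data"]["WellboreID"], []).append(record)
    return grouped
```

Keeping the box barcode on each record is a deliberate choice in this sketch: it preserves the link between the digital record and the physical asset it came from, so a user who finds the metadata can still request the original.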
A generative AI assistant for the OSDU Data Platform
The generative AI assistant by 47Lining facilitates interactive conversations about documents managed within the OSDU Data Platform. Instead of manually sifting through lengthy documents, users can engage in natural conversation with the AI assistant and quickly find answers to their questions. A significant advantage of this solution is that it is based in the AWS Cloud and exposed through a simple-to-use web interface, so it can be accessed from virtually anywhere. Whether you’re in the office, out in the field, or working from a remote location, the AI assistant is within reach.
The generative AI assistant for the OSDU Data Platform was built using Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API. Amazon Bedrock also provides a broad set of capabilities needed to build generative AI applications, simplifying development while maintaining privacy and security. Using Amazon Bedrock and OpenSearch, the assistant generates a latent semantic embedding for each document. These embeddings serve as the foundation for understanding the content and context of a document. To answer a question, the AI assistant embeds the question and uses that embedding to find the most relevant document embeddings. As Figure 2 shows, a query can be as simple as asking for the latitude of a well or as complex as seeking insights into hydrocarbon occurrences.
Figure 2. The generative AI assistant designed for the OSDU Data Platform on AWS
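The retrieval step described above, matching a question embedding against document embeddings, can be illustrated with a minimal sketch. It is not the assistant's implementation. The embed function here is a toy bag-of-words stand-in for a real embedding model (an assumption made so the example runs without any cloud service), and the document identifiers and texts are invented for illustration. Only the ranking idea, cosine similarity between a question vector and each document vector, carries over.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question: str, documents: dict, k: int = 1) -> list:
    """Return the ids of the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(
        documents,
        key=lambda doc_id: cosine(q, embed(documents[doc_id])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical document texts, invented for this example.
docs = {
    "well_header": "well header with surface latitude and longitude of the well",
    "core_report": "core description report noting hydrocarbon shows in sandstone",
}
best_for_latitude = top_k("what is the latitude of the well", docs)
best_for_hydrocarbons = top_k("hydrocarbon shows", docs)
```

In a production setting, the document vectors would be computed once and stored in a vector index rather than recomputed for every question, as they are in this sketch.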
Conclusion
The solution presented above shows one way in which dark data can be illuminated in a modern industry data workflow. Knowing what data you have across all data sources remains a challenge in many industries, including the energy industry. Large amounts of historical data and evolving energy needs continue to challenge data management and industry professionals. The solution demonstrated here uses transformative cloud technologies to process complex and ever-growing amounts of data, which not only helps accelerate enterprise decision-making for professionals but also makes data more accessible and user friendly. This workflow is, therefore, a step forward in enhancing productivity and harnessing the power of data management and AI in the world of energy and beyond.