AWS for Industries
TC Energy innovates using AWS to improve document consistency and asset management
TC Energy operates one of North America’s largest energy infrastructure portfolios, delivering the energy that millions of people rely on to power their lives in a sustainable way.
TC Energy’s portfolio includes three complementary energy infrastructure businesses:
- A 93,300 km network of natural gas pipelines that supplies more than 25 percent of the daily clean-burning natural gas demand across North America
- A 4,900 km liquids pipeline system that connects growing continental oil supplies to key markets and refineries
- Power generation facilities with combined capacity of approximately 4,300 MW—enough to power more than four million homes
TC Energy’s core values of safety, innovation, responsibility, collaboration, and integrity demand a disciplined, consistent, proactive, and systematic approach to risk management. Protecting people and over $100 billion of assets is a part of risk management that involves tens of thousands of documents, including contracts, engineering standards, and operating procedures. The integrity and accuracy of these documents are managed by TC Energy’s Engineering Governance team. To avoid conflict and duplication across these documents, the team was using a time-consuming, manual process of comparing all the clauses within these documents each time a document was created or revised. This repetitive manual process was also prone to error.
To improve this manual, error-prone process and increase the accuracy of detecting duplicates and conflicts, TC Energy turned to Amazon Web Services (AWS) and decided to use natural-language processing (NLP). A custom tool was developed using AWS along with an open-source, pretrained bidirectional encoder representations from transformers (BERT) language model. The tool, dubbed “Document Clause Analyzer,” detects references, parses clauses, computes sentence encodings, and calculates sentence similarity scores. The results are compiled into a report used by the Engineering Governance team in its review process.
Getting started by validating the concept
The first step Engineering Governance took was to collaborate with TC Energy’s Machine Learning Lab team to prove the value of NLP to generate reports that would assist with document review. The team was able to do this using a prebuilt machine learning (ML) model to create a clause-similarity report that aided in identifying clauses that are duplicates or conflicting. The team used Amazon SageMaker, a fully managed service for building, training, and deploying ML models and workflows. Using Jupyter Notebooks from Amazon SageMaker, the Machine Learning Lab team completed a proof of concept (PoC) in just 3 months. After feasibility was shown, Engineering Governance engaged the Information Management Product Delivery team to build a production-ready solution. To achieve this, the Product Delivery team added features that would help authors compare the clauses of a draft document with those in all referenced documents, as well as any other selected documents, along with a secure, functional user interface.
Overview of the Document Clause Analyzer
The following figure shows the architecture of the solution that the Product Delivery team designed, built, and released during an 8-week product-development initiative.
The solution consists of the following key components:
- Integration to the corporate content-management system
- Reference extractor
- Clause parser
- User interface
- Clause similarity model
- Reporting solution
Integration into the corporate content-management system
To facilitate the comparison of the clauses of a draft document with any other published corporate engineering standard, the team implemented a process to synchronize the solution with the corporate content-management system. The published documents were uploaded into the solution using a function of AWS Lambda, a serverless, event-driven compute service. Each newly published document was processed by the reference extractor and clause parser.
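A minimal sketch of such an ingestion Lambda function is shown below. The handler name, bucket layout, and downstream wiring are assumptions for illustration; the article does not publish TC Energy's actual code.

```python
import json

def handler(event, context):
    """Hypothetical Lambda handler triggered when a published document
    lands in an intake Amazon S3 bucket (standard S3 event shape)."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # In the real pipeline, the document would be handed off to the
        # reference extractor and clause parser from here.
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps({"processed": processed})}
```

An S3 event notification carries one record per object, so the same handler works for single uploads and batched synchronization runs.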
Reference extractor and clause parser
The solution used Amazon Textract, an ML service that automatically extracts text, handwriting, and data from documents. The extracted text was then used to identify the references and clauses within an engineering standard.
TC Energy’s custom-built reference extractor, which had been built for a previous solution, was slightly modified to support draft documents for the Document Clause Analyzer. The module extracted references to any other published corporate document. The reference mappings were stored in Amazon Neptune, a fully managed graph database service, while clauses were sent for sentence embedding.
Sentence embedding is an NLP technique where sentences are mapped to vectors of real numbers. These vectors are called sentence encodings. The sentence encodings for all the clauses in a published document are stored as a file in Amazon Simple Storage Service (Amazon S3), a fully managed object storage service. After extraction, the list of references and files is saved to Amazon DynamoDB, a fully managed NoSQL database service.
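The store-and-retrieve pattern for encodings can be sketched as follows. This is a simplified local illustration with hypothetical function names; the production solution would upload the file to Amazon S3 (for example, with `boto3`, shown here only as a comment because it requires AWS credentials).

```python
import json
import os
import tempfile

def save_encodings(doc_id, encodings):
    """Write clause encodings (clause id -> vector of floats) to a file.
    The real solution would then push this file to Amazon S3, e.g.:
    boto3.client("s3").upload_file(path, bucket, f"{doc_id}-encodings.json")"""
    path = os.path.join(tempfile.gettempdir(), f"{doc_id}-encodings.json")
    with open(path, "w") as f:
        json.dump(encodings, f)
    return path

def load_encodings(path):
    """Read a previously saved encodings file back into a dict."""
    with open(path) as f:
        return json.load(f)
```

Storing the encodings per document means a published standard is embedded once, then reused for every comparison that references it.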
The custom-built clause parser extracted and processed clauses within an engineering standard. The algorithm used to parse clauses was based on the standardized numbering format of the engineering standards. This workflow was initiated every time documents were uploaded to TC Energy’s content-management system or a draft document was uploaded into the solution for review.
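A numbering-based parser of this kind can be approximated with a regular expression over the extracted text. The clause format below ("1", "1.1", "4.2.1", and so on) is an assumption; TC Energy's actual numbering conventions may differ.

```python
import re

# Matches a leading clause number such as "4.2.1" followed by clause text.
CLAUSE_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.*)$")

def parse_clauses(text):
    """Split extracted document text into (clause_number, clause_text) pairs,
    skipping lines that do not start with a clause number."""
    clauses = []
    for line in text.splitlines():
        match = CLAUSE_RE.match(line.strip())
        if match:
            clauses.append((match.group(1), match.group(2)))
    return clauses
```

Because the numbering format is standardized across the engineering standards, a single pattern can serve every document in the corpus.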
Clause similarity model
After users uploaded a draft document and defined a list of documents to compare to, they generated a clause similarity report. The sentence encodings for each document listed were retrieved from the files in Amazon S3 and passed to the BERT-based clause similarity model. The ML model was used to calculate a similarity score for all combinations of clauses within the draft and all other documents. The results were returned and forwarded to another AWS Lambda function to generate a report.
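Scoring "all combinations of clauses" typically means a pairwise cosine similarity between sentence encodings. The sketch below uses plain Python over precomputed vectors; the production model computes these vectors with BERT, which is omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sentence encodings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_scores(draft, published):
    """Score every draft clause against every published clause.
    draft/published: dicts mapping clause id -> sentence encoding."""
    return [
        (d_id, p_id, cosine(d_vec, p_vec))
        for d_id, d_vec in draft.items()
        for p_id, p_vec in published.items()
    ]
```

For a draft with m clauses compared against a document with n clauses, this produces m × n scores, which is why the later filtering and scoping optimizations mattered.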
Interactive user interface
The user interface helped users to upload valid engineering standards (see figure 1), define a list of engineering standards to be compared, initiate the clause similarity analysis, view a list of reports containing the results, and download a report (see figure 2).
The interface was developed in React and hosted in AWS Amplify, a full-stack service that can host front-end web apps and create the backend environment. To secure the application, the team used Amazon Cognito, a fully managed authentication service. It was integrated with the corporate directory store, so that corporate employees could securely and seamlessly access the app.
Figure 1: DCA UI – Document selection
Figure 2: DCA UI – Report generation
Reporting
The similarity scores returned from the clause-similarity calculator were filtered, formatted, and saved as an Excel file, which was then stored in Amazon S3. Users could then view and download the report from the web application. The results were used by the document contributor to review their draft for duplicated and conflicting requirements. To make the report more user friendly, Engineering Governance built a Power BI template that was able to ingest the downloaded Excel reports (see figure 3).
Figure 3: DCA – Power BI report
Road to production
Engineering Governance users wanted the ability to compare hundreds of documents within a limited time window. The linear approach used in the PoC running on Amazon SageMaker Jupyter Notebooks could take hours to generate a report. To make this product viable, the team needed reports to be produced in a matter of minutes. The approaches described below allowed the tool to move to production and produce reports in under 5 minutes.
Preprocessing documents and calculating encodings in advance
A significant amount of time was saved by having the encodings for all published documents calculated in advance and stored in Amazon S3 (alongside the document itself), so when a published document was selected for comparison, its encodings could be quickly retrieved. When a draft document was uploaded, its references were extracted and returned to the user. While a user was reviewing the references and completing the comparison list, the clauses of the draft document were extracted, and the encodings for each were calculated. By the time the user was ready to generate a report, much of the heavy processing had already been completed in the background.
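The compute-once, retrieve-later pattern can be illustrated with a simple cache. Here `encode` is only a placeholder for the BERT encoding step, and the cache stands in for the precomputed files in Amazon S3.

```python
# Encodings for published documents are computed once and cached; only a
# newly uploaded draft pays the encoding cost at report time.
_cache = {}

def encode(clauses):
    """Placeholder for the BERT sentence-encoding step."""
    return [[float(len(c))] for c in clauses]

def get_encodings(doc_id, clauses):
    """Return cached encodings for a document, computing them on a miss."""
    if doc_id not in _cache:
        _cache[doc_id] = encode(clauses)
    return _cache[doc_id]
```

On a cache hit the (expensive) encoding step is skipped entirely, which is the effect the precomputation above achieved at document scale.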
Parallelizing the processing of each pair of documents
Because documents could potentially contain hundreds of references, processing sequentially could take upward of 1 hour to generate a report. To reduce the time to minutes, each pair of documents was processed concurrently. This was facilitated by AWS Lambda running TC Energy’s custom Docker image containing the BERT model. Another optimization was using the same container image for both calculating sentence encodings and producing similarity scores, allowing the AWS Lambda function runtime to almost always be in a warm state, saving a few minutes every time. After the last pair of documents was processed, a final AWS Lambda function compiled the data and generated an Excel file.
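The fan-out over document pairs can be sketched with Python's `concurrent.futures`; in the actual solution each pair was handled by a separate AWS Lambda invocation running the BERT container, and `compare_pair` below is only a stand-in for that step.

```python
from concurrent.futures import ThreadPoolExecutor

def compare_pair(pair):
    """Placeholder for the containerized BERT similarity step run per pair."""
    draft_doc, ref_doc = pair
    return (draft_doc, ref_doc, 0.0)  # placeholder score payload

def compare_all(draft_doc, ref_docs, max_workers=8):
    """Process every (draft, reference) document pair concurrently."""
    pairs = [(draft_doc, ref) for ref in ref_docs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compare_pair, pairs))
```

Because the pairs are independent, total wall-clock time approaches the cost of the slowest single pair rather than the sum of all pairs.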
Reducing the number of encodings being compared
In the initial approach, all clauses were being compared with each other. This resulted in many redundant and unnecessary comparisons, which increased the compute time. By focusing only on the comparisons that were required by the business, the team reduced the number of comparisons and thus reduced the problem space. This change also simplified the report for the business, resulting in a better user experience.
Filtering comparison results
A separate process to calculate the similarity scores was spun up for each pair of documents, and the results from each merged to produce a single report. Much of the delay in creating the report was caused by the large number of results being merged. To speed up the merge, a similarity-score filter was added to remove irrelevant comparisons based on a threshold determined by the business. This reduced the size of the report by over 90 percent and significantly cut the time to produce it.
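Such a threshold filter is a one-line pass over the merged results. The threshold value below is an assumed example; the article states only that the actual value was determined by the business.

```python
THRESHOLD = 0.85  # assumed example value; the real threshold was business-defined

def filter_results(results, threshold=THRESHOLD):
    """Drop comparisons below the similarity threshold before merging
    results into the report. results: (draft_clause, other_clause, score)."""
    return [r for r in results if r[2] >= threshold]
```

Applying the filter before the merge keeps both the merge step and the resulting Excel report small.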
Conclusion
To avoid conflict and duplication across thousands of engineering standards, TC Energy looked to NLP to assist in the review process. TC Energy’s Product Delivery Team used multiple AWS services to build a solution that could generate a clause-similarity report of over 200 documents in under 5 minutes. Sharing components between solutions, preprocessing, parallel computing, efficient logic, and data filtering were essential to the success of this initiative. The solution was also built with future use cases in mind. The team is looking to expand this product to cover more areas of the business.