Microsoft Workloads on AWS

How Thermo Fisher Scientific enriches its AWS data pipeline by pulling business data from Microsoft SharePoint

In a previous blog post, we showed how Thermo Fisher Scientific enabled the self-service provisioning of data lake resources on Amazon Web Services (AWS) for their business team by using the functionality of their Microsoft Power Apps environment. In this blog post, we will discuss how they enrich their AWS data pipeline by automatically pulling data in from their Microsoft SharePoint environment.

Ever-growing need for actionable insights

Thermo Fisher has grown through acquisitions and has multiple business groups and divisions within their organizational structure. They have a data-driven culture, and their business teams want to derive actionable insights from the data that the organization generates and collects. Thermo Fisher’s Enterprise Data Services (EDS) team within the Enterprise Technology Organization (ETO) has built a data lake, the Enterprise Data Platform (EDP), that supports different business groups’ analytics, data warehouse (DW), and business intelligence (BI) needs.

The EDS team has invested in AWS services to provide most of the infrastructure for EDP. The EDS team has used Amazon Simple Storage Service (Amazon S3) for the data lake solution because of its high availability, scalability, durability, and cost efficiency. Business users are familiar with Microsoft Office 365 tools and use Excel, SharePoint, and Power Apps for their day-to-day requirements and collaborations. Thermo Fisher has created processes that allow the business users to create and store data in the Microsoft Office 365 ecosystem. The data formats include:

  • Data stored and updated in SharePoint Lists
  • Comma-separated value (CSV) files uploaded to a SharePoint document library or Microsoft Teams collaboration file share
  • Excel spreadsheets stored in Microsoft OneDrive

The business team requires the EDS data engineering team to add this business data to the data pipeline regularly so that it is reflected in the final output.

Existing manual process

Business users have developed templates in Microsoft Power Apps so that sales representatives and field engineers can manually enter sales notes, manual orders, or product information with user-friendly names. Every day, sales representatives and field engineers enter data into the app, which stores it in a SharePoint list.

The EDS team had created a data pipeline to pull inventory and financial data from the Enterprise Resource Planning (ERP) systems into AWS and build sales and inventory BI reports. However, the financial and inventory data in the ERP systems does not contain the inputs from the sales representatives and field engineers. Therefore, on a daily or weekly basis before the data pipeline started, business analysts had to log in to Microsoft Office 365, export the data from the SharePoint list to a CSV file, log in to the AWS Management Console, and upload the file to an Amazon S3 bucket. This manual process was time-consuming and error-prone.

There was a need to automate integration between Microsoft Office 365 and AWS to reduce manual effort and increase user productivity. However, there were several challenges the team faced:

  1. There is no out-of-the-box mechanism to export data directly from SharePoint lists to Amazon S3.
  2. The authentication methods differ between the two systems. Microsoft Office 365 uses Azure Active Directory, while AWS uses AWS Identity and Access Management (IAM).
  3. The EDS team uses Apache Spark-based compute engines with Python to develop the pipelines. The Microsoft-based solutions are based on C# or JavaScript.
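To bridge the second challenge, code running on AWS must first obtain an app-only token from Azure Active Directory before it can read SharePoint data. The post does not detail the exact flow Thermo Fisher's SDK uses; the sketch below only illustrates how a token request can be assembled with the standard OAuth 2.0 client-credentials grant that Azure Active Directory exposes. The tenant, client ID, and secret are placeholders.

```python
from urllib.parse import urlencode

def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Assemble the POST URL and form body for an app-only Azure AD token
    via the OAuth 2.0 client-credentials grant (all arguments are placeholders)."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already granted
        # to the Azure app registration (e.g. SharePoint/Graph read access).
        "scope": "https://graph.microsoft.com/.default",
    })
    return url, body
```

The resulting access token is then attached as a bearer header on SharePoint REST calls, while access to Amazon S3 is governed separately by the Lambda function's IAM execution role.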

Automating the integration between Microsoft Office 365 and AWS

The EDS team at Thermo Fisher designed an architecture (Figure 1) to remove the manual work and automate the integration between Microsoft Office 365 and AWS. We will dive deep into the details of the architecture.

Figure 1: High-Level Architecture

  1. Business users use Microsoft Power Apps to maintain product information, inserting data such as new manual sales orders, user-friendly names, or product descriptions. Business users are familiar with Power Apps and like the simple interface. Power Apps has existing Microsoft Office 365 authentication and controls set up to validate data format and maintain data quality.
  2. Power Apps adds the data entered by business users as new items in the Microsoft SharePoint list using the built-in Power Apps connectors.
  3. Because business users require the daily updates to be reflected in the reports, the EDS team scheduled an Amazon CloudWatch Events rule to trigger the data pipeline every day.
  4. The Amazon CloudWatch Events rule invokes an AWS Lambda function, which retrieves product data from the SharePoint list. AWS Lambda is a serverless, event-driven compute service that lets developers run code for virtually any type of application or backend service without having to provision or manage servers.
    The EDS team has developed a Thermo Fisher-specific software development kit (SDK) for Python using a graph client object from the Microsoft Office 365 REST client. The Lambda function uses this SDK, which has modules to authenticate against Microsoft Azure with an Azure application API key and to download product information from the SharePoint list. The Lambda function pulls the data from SharePoint, generates a CSV file on the local storage of the host AWS Lambda environment, and immediately processes the file.
  5. The Lambda function copies the CSV file containing product information and overwrites any existing files in a predefined Amazon S3 bucket. Once the Lambda function has copied the file to the Amazon S3 bucket for further downstream pipeline processing, it deletes the file from its temporary local storage.
  6. Other data pipelines, such as the Oracle Golden Gate replication process, import SAP and other ERP data into Amazon S3 for processing.
  7. An AWS partner product, Databricks, processes and aggregates the data. The Databricks compute clusters run on Amazon Elastic Compute Cloud (Amazon EC2). As part of the aggregation process, the unprocessed ERP data is enriched with the product information pulled from the SharePoint list in the previous step. This adds business-friendly names, comments, descriptions, and notes to the final aggregated report and triggers certain logic depending on user input. The final consumable data is written to an Amazon S3 bucket.
  8. Then, the consumable data is catalogued for easy discovery, query, and reporting. An AWS Glue crawler is used to crawl the data in the Amazon S3 prefix daily to extract metadata, create table definitions, and load the result into the AWS Glue Data Catalog schema and tables. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
  9. Business users and data analysts use Amazon Athena to query the data from Amazon S3, by taking advantage of the structure provided by the AWS Glue Data Catalog. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and the users pay only for the queries that they run.
  10. Business intelligence developers use tools such as Microsoft Power BI to create reports by querying Amazon Athena. They use the Amazon Athena connector for Power BI, which uses an ODBC driver configuration. Business users and leaders within the organization then use these reports and the insights they provide.
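Steps 4 and 5 above can be sketched in Python roughly as follows. This is a minimal illustration, not Thermo Fisher's internal SDK: the `sharepoint.get_list_items()` call is a hypothetical stand-in for the SDK's download module, and the `s3` argument is assumed to behave like a boto3 S3 client, whose `upload_file` call overwrites any existing object at the same key.

```python
import csv
import os
import tempfile

def items_to_csv(items, path):
    """Write SharePoint list items (a list of dicts) to a CSV file."""
    if not items:
        raise ValueError("no items to export")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        writer.writeheader()
        writer.writerows(items)

def export_list_to_s3(sharepoint, s3, bucket, key):
    """Pull a SharePoint list, stage it as a CSV in local temporary
    storage, upload it to Amazon S3, and delete the temporary file."""
    items = sharepoint.get_list_items()          # hypothetical SDK call
    fd, path = tempfile.mkstemp(suffix=".csv")   # Lambda would stage in /tmp
    os.close(fd)
    try:
        items_to_csv(items, path)
        s3.upload_file(path, bucket, key)        # overwrites any existing object
    finally:
        os.remove(path)                          # free the function's local storage
```

Injecting the two clients as arguments keeps the function testable without live SharePoint or AWS credentials.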
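The enrichment in step 7 actually runs on Databricks' Spark clusters; the plain-Python sketch below only illustrates the join idea, with hypothetical field names. Each raw ERP row is matched against the SharePoint product list on a shared key, and the business-friendly fields are carried onto the output row.

```python
def enrich_orders(erp_rows, product_info, key="product_id"):
    """Left-join raw ERP rows with SharePoint product info on a shared key,
    carrying business-friendly names and notes onto each output row."""
    lookup = {p[key]: p for p in product_info}
    enriched = []
    for row in erp_rows:
        extra = lookup.get(row[key], {})
        merged = dict(row)
        # Fall back to the raw key when no friendly name was entered.
        merged["friendly_name"] = extra.get("friendly_name", row[key])
        merged["notes"] = extra.get("notes", "")
        enriched.append(merged)
    return enriched
```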
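A query such as those in step 9 could also be submitted programmatically. In this sketch the database, table, and column names are hypothetical, while the call shape follows the boto3 Athena client's `start_query_execution` API; passing the client in as an argument keeps it testable without AWS credentials.

```python
def run_report_query(athena, database, output_s3):
    """Submit a SQL query to Amazon Athena against a Glue Data Catalog
    database and return the query execution ID for later polling."""
    sql = (
        "SELECT friendly_name, SUM(order_qty) AS total_orders "
        "FROM sales_enriched GROUP BY friendly_name"  # hypothetical table
    )
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```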

Future work

With this solution, the EDS team at Thermo Fisher developed a bridge between the Microsoft ecosystem and AWS that is reusable throughout the entire organization. Some future use cases that they are planning to implement are:

  • Direct Import – Configuring the data pipeline to pull data directly from Microsoft Office 365.
  • Automatic Trigger – Whenever there is an update on the Microsoft Office 365 side, Power Automate can trigger an API call that pushes the latest copy of the data to Amazon S3.
  • Future Integration – Integrate with other Microsoft services, such as Microsoft Teams, Microsoft OneDrive for Business, or Microsoft Forms, which use the SharePoint document library architecture as their backend.

Conclusion

By using this automated solution, the EDS team eliminated the manual effort of moving data from the Microsoft Office 365 platform to Amazon Web Services. It saved the team 5 hours of data export and validation work per week and eliminated errors related to manual processing. It also improved the service level agreement (SLA) for report generation by 15%. Nothing changed for the sales representatives and field engineers; they continue to enter sales notes, manual orders, and product information in the same familiar way with Microsoft Power Apps. This solution improved the overall efficiency of the business process.

The EDS team believes this will inspire further possibilities and better collaboration between business and IT across domains, fostering innovation and driving more automation and process improvements within Thermo Fisher.

About Thermo Fisher Scientific

Thermo Fisher Scientific Inc. is the world leader in serving science, with an annual revenue of approximately $45 billion. Our Mission is to enable our customers to make the world healthier, cleaner and safer. Whether our customers are accelerating life sciences research, solving complex analytical challenges, increasing productivity in their laboratories, improving patient health through diagnostics or the development and manufacture of life-changing therapies, we are here to support them. Our global team delivers an unrivaled combination of innovative technologies, purchasing convenience and pharmaceutical services through our industry-leading brands, including Thermo Scientific, Applied Biosystems, Invitrogen, Fisher Scientific, Unity Lab Services, Patheon and PPD.


AWS can help you assess how your company can get the most out of cloud. Join the millions of AWS customers that trust us to migrate and modernize their most important applications in the cloud. To learn more about modernizing Windows Server or SQL Server, visit Windows on AWS. Contact us to start your modernization journey today.

Debaprasun Chakraborty

Debaprasun Chakraborty is an AWS Solutions Architect, specializing in the analytics domain. He has around 20 years of software development and architecture experience. He is passionate about helping customers in cloud adoption, migration and strategy.

David Tishkoff Chidester

David is a Software Developer at Thermo Fisher Scientific. He works on automating cloud tasks and infrastructure using AWS Lambda and other services and tools. He is AWS certified and has particular experience with Git and Linux. Outside of work, David enjoys cooking, programming, and playing with his dog, Shoko.

Matthew Yu

Matthew is an IT Director at Thermo Fisher Scientific Inc. and has achieved four AWS certifications. Matthew leads the Platform Engineering team for the Enterprise Data and Data Science Platform, driving its scalability, cost optimization, automation, and security framework, and works closely with internal data user groups and external partners.