AWS in Switzerland and Austria (Alps)
From Silos to Synergy: Philip Morris International’s Distributed Data Network
This post was written by Parviz Shariff, Giorgio Sorbara, Yhawmini Balakrishnan and Jacques Detroyat from Philip Morris International, with the support of Jordi Porta and Samuel Riedo from AWS.
Introduction
Philip Morris International (PMI), a multinational corporation operating across four continents, is undergoing a significant digital transformation to support our multi-category portfolio of smoke-free products. This self-disruptive journey began around 10 years ago, and at the heart of the transformation lies the challenge of managing complex, heterogeneous data and analytics needs across the global enterprise.
This blog post describes PMI’s approach to addressing this transformation through the implementation of a Distributed Data Network (DDN). PMI’s DDN initiative has streamlined data processes across more than 30 global locations so far, significantly improving efficiency and data accessibility. The data network enables both autonomy and automation, opening up new possibilities for data utilization and allowing PMI to enrich its analytics and support critical business processes with a comprehensive view of its operations.
At its core, this change was necessitated by a fundamental shift in the PMI business model from pure B2B to B2C and B2B2C operating models, which requires us to cater for the complex and specific data and analytics needs of each function, market, and product category. In essence, we have had to move from a monolithic, centralized, one-size-fits-all analytics operating model to a federated approach that provides both top-line and bottom-line benefits for the organization.
In the following sections, we’ll delve into the architecture, implementation, and benefits of the DDN, exploring how it fits into PMI’s overall data strategy. We’ll also discuss the crucial role that PMI’s organizational culture and appetite for innovation have played in making this initiative possible.
Key Principles
To address the challenges of its digital transformation, PMI has designed and implemented an architectural approach called the Distributed Data Network (DDN). This approach to organizing and sharing data across the organization is based on principles from network theory, domain-driven design, and data mesh. It comprises a set of self-organized, interoperable, and semi-autonomous “nodes” that can support diverse data and technology requirements while still maintaining federated governance. This design has provided PMI with the capabilities needed to succeed in its complex, global environment.
Key goals of the DDN include:
- Ambidexterity: Operating effectively across diverse business contexts while maintaining speed and business agility.
- Composable Architecture: Promoting modularity, interoperability, and seamless scalability through containerized services and insights.
- Integrative Growth: Digitalizing business processes to drive innovation, cost leadership, and value across top and bottom lines.
Implementing a DDN presents two main challenges:
- Data Literacy: Many organizations struggle with varying levels of data literacy and comfort among their employees. Part of the challenge is to provide effective training, resources, and support to help elevate data skills and understanding across the different functional teams.
- Data Governance: Organizations often need to balance the need for centralized data governance with the desire to provide broader access to data across the markets and business units. The goal is to empower teams with the data they need, while still maintaining appropriate controls, security, and data quality across the organization.
Success in implementing a DDN depends on addressing both challenges: building data skills organization-wide and creating a framework for secure, effective data sharing that allows for both autonomy and coherence.
This approach has allowed PMI to address both the common and uncommon needs across its global operations. The common needs, such as optimization of infrastructure, technology and tooling were addressed through reusable, interoperable design principles, while the uncommon needs, such as domain and geography-specific requirements, were handled through a federated model that enables autonomy and localization.
DDN Architecture
Core Principles and Design Patterns
The overall architecture blueprint of the DDN is based on a distributed network topology that operates within a federated governance model. Given the size, scale, and complexity of the organization, the emphasis is on providing clear logical boundaries for each stakeholder group to operate, whether they are defined by functional domains or geographies.
The conceptual view of the DDN architecture depicts three primary node types:
- Functional Domain Nodes: These nodes represent data domains that belong to specific lines of business.
- Geographical Nodes: These nodes correspond to data domains that belong to specific markets or regions, each with their own local data sources and unique requirements.
- Core Node: This node is reserved for cross-domain data and analytics with global consumption patterns, as well as for the choreography and orchestration of central data services.
The DDN architecture manifests through a comprehensive set of reusable capabilities that support the entire data lifecycle – from ingestion and storage to metadata management, data quality, transformation, and governance.
Data Ingestion and Landing Zones: DDN’s data ingestion process leverages a variety of mechanisms, including SFTP, Amazon Kinesis, Apache Kafka, and REST APIs. Depending on the source and nature of the data, it is ingested and initially landed in either Amazon S3 or Snowflake, the two primary data storage solutions within the DDN.
Data Storage and Federation: DDN supports a diverse set of storage solutions across its nodes, including Amazon Neptune, Altair Graph Studio, MongoDB, Amazon RDS and Amazon Redshift. Structured data is primarily stored in Snowflake, while unstructured data is predominantly kept in Amazon S3, with same-region and multi-region replication for resilience and availability.
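The split between the two primary landing targets can be illustrated with a small routing sketch. This is a simplified assumption of how such a decision might look: the format lists and zone names are invented for the example and are not PMI's actual rules.

```python
# Hypothetical routing of an incoming dataset to its landing zone,
# mirroring the DDN split between Snowflake (structured) and
# Amazon S3 (unstructured). Formats and zone names are illustrative.

STRUCTURED_FORMATS = {"csv", "parquet", "avro"}        # land in Snowflake
UNSTRUCTURED_FORMATS = {"pdf", "jpeg", "txt", "log"}   # land in Amazon S3

def choose_landing_zone(dataset_format: str) -> str:
    """Return the landing zone for a dataset of the given format."""
    fmt = dataset_format.lower()
    if fmt in STRUCTURED_FORMATS:
        return "snowflake_stage"
    if fmt in UNSTRUCTURED_FORMATS:
        return "s3_landing_bucket"
    raise ValueError(f"Unknown format: {dataset_format}")
```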
Architecture
From an architectural perspective, each node in the DDN is implemented as a separate AWS account, representing a logically segregated domain or market. Within each node, there’s flexibility to further segregate into domain subject areas or market clusters based on business needs and legal requirements. This approach aligns with domain-driven design principles throughout the entire architecture.
- Data Sources: Both external and internal data from various domains and markets. This includes internal data from large CRM and ERP providers, and external sources such as market research firms and business partners. Examples of data transfer mechanisms include SFTP, APIs, and database connectors.
- Federated Governance and Data Quality: After identifying the data sources, the data is classified and labeled in both Atlan (business metadata catalog) and PMI’s enterprise architecture management software. Once a data steward approves this, the data is ready to be imported.
- Data Ingestion & Extraction: We use a domain-driven approach, routing data into specific nodes based on its domain or category. This means that data from different areas (e.g., customers, orders, products) is stored separately. There are two ways we get data into the system:
- Matillion Connectors: We use Matillion, a cloud-native data integration and transformation platform, to directly import data from various sources into the staging area of Snowflake.
- APIs and Amazon Kinesis: For certain sources, we use custom-developed APIs and Amazon Kinesis to stream data into Amazon S3.
- Data Landing: The data landing zone serves as the initial storage area for ingested data and consists of both Snowflake and Amazon S3 object storage. After data from external sources is ingested into Amazon S3, AWS Lambda functions monitor it and perform quality assurance checks. Data that passes these checks is then loaded into a Snowflake staging area, where Snowflake’s computing resources are used to transform and harmonize it. This process involves cleaning, standardizing, and preparing the data for analysis and integration into the system.
- Data Catalog: Data is cataloged in Atlan with business metadata and classification tags to capture its full lineage.
- Data Sharing: Data is not duplicated; instead, it is virtualized for all functions and markets using Snowflake’s data sharing capabilities. When data needs to be copied to other storage systems or across different cloud platforms, we use replication within the same geographic region or across regions as needed. Access is managed through Microsoft Entra ID (formerly Azure AD) and SIGA (PMI’s internal access management system) based on the principle of least privilege, using personas, domains, and tags as parameters and inputs.
- Data Persistency:
- Polyglot data persistency: The distributed data network supports a polyglot persistence landscape; the architecture aims to use fit-for-purpose technology for ingestion, distribution, and storage, because one size does not fit all data needs. A combination of database persistence technologies is used to create specific data products, making the system polyglot. For example, relational databases serve traditional Business Intelligence (BI) reporting, notebooks power Machine Learning (ML) predictive forecasting reports, and Altair Graph Studio (a graph database technology) handles bill-of-materials explosion reports for complex products.
- Data products: Data is integrated, harmonized, and transformed. Based on consumer requests, data products are created as views and shared with the respective domains or markets, with appropriate approvals and access controls in place. Domain-specific data products are exposed on the functional domain nodes, while data products dedicated to market-specific analytics needs, called market data products, are exposed on the geographical nodes of the distributed data network.
- Interoperability: Data produced by domains or markets also needs to be combined and consumed in dashboards built with tools such as Power BI. Additionally, Key Performance Indicators (KPIs) generated by domains must be integrated with web analytics KPIs from prominent MarTech and Customer Intelligence platforms to create comprehensive, holistic reports. This interoperability is achieved through custom interface protocols that support data use across multiple cloud platforms.
- Data consumption: Data products and insights are consumed by multiple audience intelligence platforms within PMI, for advanced analytics including Generative AI. This promotes reusability and reduces data duplication and redundancy.
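The landing-zone flow described above (Lambda-driven quality checks on data landed in Amazon S3, followed by promotion to a Snowflake staging area) can be sketched as follows. This is a minimal, hypothetical illustration: the column names, validation rules, and event shape are invented for the example, and a real function would fetch the object from S3 via boto3 and trigger the Snowflake load rather than operating on an in-memory string.

```python
import csv
import io

# Assumed schema for the example; not PMI's actual data contract.
REQUIRED_COLUMNS = {"record_id", "market", "updated_at"}

def passes_qa(csv_text: str) -> bool:
    """Return True if the file has the required columns and no empty IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if not REQUIRED_COLUMNS.issubset(reader.fieldnames or []):
        return False
    return all(row["record_id"].strip() for row in reader)

def handler(event, context=None):
    """Lambda-shaped entry point; for brevity, the event carries the object
    body directly instead of an S3 bucket/key reference."""
    ok = passes_qa(event["body"])
    # On success, a real implementation would kick off the Snowflake load;
    # on failure, it would route the file to a quarantine prefix.
    return {"status": "loaded" if ok else "quarantined"}
```

Keeping the validation logic in a pure function like `passes_qa` makes the quality gate easy to unit-test independently of the Lambda runtime.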
The provisioning of DDN nodes is automated via Terraform, an infrastructure-as-code framework. This ensures a scalable and reliable deployment of the DDN’s components across the global enterprise.
The sheer scale and complexity of the challenge are compounded by the platform’s organic evolution and legacy footprint. As a result, PMI maintains a diverse technology landscape that provides best-of-breed capabilities and services to our stakeholder base.
Semantic and Metadata Layer
An emerging component of the DDN architecture is the semantic layer, which bridges the gap between the isolated data storage and management at the node level, and the need for a cohesive, interconnected view of information for diverse personas ranging from technical roles to business stakeholders and compliance teams.
At the core of the semantic layer are two architectural building blocks:
- Metadata Catalog: DDN leverages Atlan, an AWS Marketplace Partner, to manage business-level metadata across the enterprise. In addition to storing relevant business metadata, Atlan also crawls and ingests technical metadata from sources such as Snowflake and Amazon S3. The catalog also exposes technical lineage from data ingestion to presentation, and business lineage from business terms to physical assets.
- Knowledge Graph: A knowledge graph is being developed to create interconnected relationships between various data elements across the enterprise. This graph connects unstructured data, data products, domain ontologies, and business processes, providing a comprehensive view of data relationships and dependencies. By leveraging the Altair Graph Studio triple store1 and Amazon Neptune, DDN offers graph-based data products that enable better data discovery and improved context for decision-making.
While the technical and physical metadata are automatically collected during DDN object creation, the Knowledge Graph is expected to grow organically based on specific business requirements and opportunities. This strategy aims to maximize business value extraction while minimizing technical debt.
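To make the triple-store model concrete (see the footnote at the end of this post), here is a toy in-memory sketch of subject-predicate-object triples with wildcard queries. The entities and relations are invented examples, not PMI's actual ontology, and a production graph would live in a dedicated triple store rather than a Python set.

```python
from typing import Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

class TripleStore:
    """Toy in-memory triple store supporting wildcard pattern queries."""

    def __init__(self) -> None:
        self._triples: set = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def query(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for ts, tp, to in self._triples:
            if (s is None or s == ts) and (p is None or p == tp) \
                    and (o is None or o == to):
                yield (ts, tp, to)

store = TripleStore()
# Invented example facts linking a data product to its domain and source.
store.add("DataProduct:SalesKPI", "belongsToDomain", "Domain:Commercial")
store.add("DataProduct:SalesKPI", "derivedFrom", "Source:ERP")
store.add("Domain:Commercial", "ownedBy", "Steward:JaneDoe")

# All facts about the SalesKPI data product:
facts = list(store.query(s="DataProduct:SalesKPI"))
```

Pattern queries of this kind are what let the knowledge graph answer discovery questions such as "which data products derive from the ERP?" without prior knowledge of the schema.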
Federated Governance
To manage access and permissions, the DDN leverages Microsoft Entra ID as the identity and access management service. Extending the capabilities of Entra ID, PMI has implemented a bespoke set of internal services to manage access control requirements across the DDN, facilitating the request and approval process for data products. Each domain has a designated data owner responsible for approving access to the data, while the product owner approves access to data products.
This federated governance model ensures that the appropriate controls and security measures are in place, while still empowering teams with the data they need to drive business value.
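A least-privilege, persona-and-tag-based decision of the kind described above could be sketched as follows. The personas, classification tags, and rules here are invented for illustration; PMI's actual services (SIGA, the Entra ID integration) implement far richer logic.

```python
# Hypothetical approval matrix: which (persona, data classification)
# combinations can be auto-granted, and which need data-owner review.
APPROVAL_MATRIX = {
    ("analyst", "internal"): True,
    ("analyst", "confidential"): False,      # requires data-owner review
    ("data_engineer", "internal"): True,
    ("data_engineer", "confidential"): True,
}

def request_access(persona: str, domain: str, classification: str) -> dict:
    """Decide whether a request is auto-granted or routed for review.
    Unknown combinations default to review, per least privilege."""
    auto = APPROVAL_MATRIX.get((persona, classification), False)
    return {
        "domain": domain,
        "decision": "granted" if auto else "pending_data_owner_review",
    }
```

Defaulting unknown combinations to review, rather than denial or silent grant, keeps the model safe while leaving the final call with the accountable data owner.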
Insights, Recommendations and Looking Ahead
The “ideation-to-implementation” journey of the DDN has been both demanding and rewarding. PMI has successfully expanded the data network to over 30 geographical nodes, replacing siloed data warehouses across different markets and countries. We have replaced redundant data pipelines from central solutions such as CRM, ERP, and CDP systems with a single flow and data sharing mechanism. This reduces the time and cost of processing on source systems, typically by days or weeks depending on the use case. Now, instead of months of development for a new pipeline, activating data sharing requires only hours of configuration.
We have democratized data access, enabling new use cases that were previously unachievable. We are now enriching dashboards with financial data, market insights, and consumer engagement information to support new product launches.
Organizational culture and appetite for innovation have been key catalysts in making such a change possible.
Here are a few key insights based on our journey, and looking ahead, the next steps for us to harness the value of the DDN:
- Secure Executive Buy-In Early: Identify opportunities for value-add that align with the company’s strategic objectives, and build your narrative around these. These could be linked to top-line (profitability) or bottom-line (efficiencies) targets.
- Identify Critical Competencies: Recognize and acquire the specialized skills needed for success. For instance, considering the rapid expansion of Generative AI applications, evaluate whether your semantic layer requires specialists such as Ontologists or Architects with specific skill sets. Assess the need for experts like Computational Linguists for Natural Language Processing tasks, or NoSQL Architects for database management.
- Establish a Federated Governance Model: Agree early on a federated governance model for the design and operating principles. This is crucial because such a huge change requires considerable acceptance and patience from stakeholders accustomed to different ways of working. It has to be supported by a strong change management initiative, as it substantially impacts roles and responsibilities throughout the organization.
Looking ahead, PMI will focus on the following key challenges and opportunities:
- Driving Awareness and Literacy: We aim to increase understanding of this architectural paradigm across our organization, allowing us to accelerate and scale more quickly to meet growing demand. Our goal is to onboard all our markets onto the DDN by 2027.
- Adapting to Technological Shifts: We will remain aligned with and invested in the rapidly evolving landscape of technology innovations (particularly AI), regulations, and security protocols. This will enable us to adapt effectively while maintaining full compliance.
- Designing the ‘Information Fabric’: We plan to create a metadata-driven layer that stitches process metadata and lineage with domain metadata through associations, taxonomies, and ontologies. This will require close collaboration between Business, Semantic, and Data Architecture practices. The Information Fabric will serve as a gateway for contextualizing PMI data, enabling its use in Generative AI (Large Language Models) and other AI-related applications (such as Graph Neural Networks, Graph RAGs, and Content-centric Knowledge Graphs).
Conclusion
Data and analytics are a key enabler of the PMI “smoke-free” vision, allowing us to harness the power of diverse datasets that are well organized, appropriately modeled, and governed for both exploration and exploitation. This has been made possible thanks to the trust and support of our colleagues and leaders across the organization. However, PMI still has major milestones on the horizon as it progresses rapidly towards a “smoke-free” future. This will involve continued focus and investment in innovation, learning, and applying our collective wisdom to the strategic objectives of the company.
The Distributed Data Network architectural paradigm has many advantages and benefits that could resonate with other companies that are grappling with challenges of a similar size and scale. Like any methodology, it does have its own challenges, primarily centered around organizational maturity, integration and interoperability. In a rapidly shifting data and analytics landscape, however, it offers automation and autonomy at scale within a federated governance setup.
1. A triple store is a type of database that stores data in the form of subject-predicate-object triples, useful for managing and querying graph data.