Implementing a Snowflake-Centric Data Mesh on AWS for Scalability and Autonomy
By Vinayak Datar, Tony Trinh, Deepmala Agarwal, Ed Chen, and Bosco Albuquerque – AWS
By Andries Engelbrecht – Snowflake
A data mesh architecture is a relatively new approach to managing data in large organizations, aimed at improving scalability, agility, and autonomy of data teams.
Many organizations are trying to form data teams based on domains or business units. They have various personas like business users, data scientists, business intelligence (BI) engineers, and data engineers who want to make data-driven decisions. However, each persona has different needs to produce or consume data.
There’s a need for an architecture that removes the complexity and friction of provisioning data and managing its lifecycle. This post outlines an approach to implementing a data mesh with Snowflake as the data platform, supported by several Amazon Web Services (AWS) services across all pillars of the data mesh architecture.
We assume the reader has a good understanding of Snowflake, the AWS ecosystem, and data mesh architecture concepts.
Snowflake is an AWS Data and Analytics Competency Partner and AWS Marketplace Seller that delivers on-demand elasticity, scalability, and flexibility that makes it possible to bring together all data, workloads, and users in one system.
Let’s look at how to implement the four main pillars of a data mesh using Snowflake and AWS: 1) domain ownership; 2) federated governance; 3) data as a product; 4) self-service infrastructure.
Figure 1 – Proposed Snowflake-centric data mesh architecture.
Domain-Driven Ownership
The domain-driven ownership pillar of a data mesh defines how a monolithic data lake can be broken up by domains. It designates the ownership of the technical architecture as well as the organizational ownership such as lines of business, or functional areas like sales, marketing, and finance.
In an organization, a domain is a group of people having specific and relevant knowledge, and who are organized towards achieving specific business outcomes. In a data mesh approach, a domain owns and is responsible for the data products it produces and maintains. These data products are consumed by one or more consumers belonging to other domains.
In the context of using Snowflake to architect a data mesh, each domain may have one or more Snowflake accounts in the same or different cloud regions. Each Snowflake account can own multiple databases, for which compute and storage resources can be deployed and scaled separately.
Teams belonging to different domains can work independently, using independent compute power called virtual warehouses, or in separate databases or accounts while continuing to use the Snowflake platform. Snowflake provides the traditional relational database customers may already be familiar with, as well as other features related to data engineering, data lakes, data warehousing, data sharing, and artificial intelligence (AI).
In this Snowflake-centric data mesh, we’ll look at Snowflake features that facilitate domain-driven ownership. Data producers in Snowflake can share data, data services, and/or applications with data consumers in other Snowflake accounts by publishing metadata “listings.”
Data producers can share privately with other accounts or groups of accounts, or share publicly via Snowflake Marketplace. They can also specify service-level agreements (SLAs) or service-level objectives (SLOs) for the data shared.
Teams in other domains can search to discover data assets of interest which have been made available to them, and obtain access or request access. These data consumers can gain live access to the shared data, which continues to remain under the control of the producer who can customize access policies, or revoke access at any time. This access to shared data is zero copy, and no extract, transform, load (ETL) or data movement is required.
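To make the producer-side flow concrete, here is a minimal sketch that generates the SQL a producer might run to share one table and later revoke access. It uses Snowflake's direct share statements (`CREATE SHARE`, `GRANT ... TO SHARE`, `ALTER SHARE`) rather than listings, which is an assumption made for brevity. The share, database, and account names are hypothetical, and the sketch only builds strings; it does not connect to Snowflake.

```python
import re

# Plain unquoted identifiers only, so the generated SQL cannot be injected into.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def _check(*names):
    """Raise ValueError for anything that is not a simple identifier."""
    for name in names:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")


def share_statements(share, database, schema, table, consumer_accounts):
    """Return the statements that expose one table to the consumer accounts."""
    _check(share, database, schema, table)
    for account in consumer_accounts:
        # Consumer accounts are written as ORG_NAME.ACCOUNT_NAME.
        _check(*account.split("."))
    accounts = ", ".join(consumer_accounts)
    fq_schema = f"{database}.{schema}"
    return [
        f"CREATE SHARE {share}",
        f"GRANT USAGE ON DATABASE {database} TO SHARE {share}",
        f"GRANT USAGE ON SCHEMA {fq_schema} TO SHARE {share}",
        f"GRANT SELECT ON TABLE {fq_schema}.{table} TO SHARE {share}",
        f"ALTER SHARE {share} ADD ACCOUNTS = {accounts}",
    ]


def revoke_statement(share, consumer_account):
    """Return the statement that removes one consumer from the share."""
    _check(share, *consumer_account.split("."))
    return f"ALTER SHARE {share} REMOVE ACCOUNTS = {consumer_account}"


if __name__ == "__main__":
    for stmt in share_statements(
        "MKT_SHARE", "MARKETING", "ANALYTICS", "CAMPAIGN_METRICS", ["MYORG.FINANCE"]
    ):
        print(stmt + ";")
```

Because the consumer queries the producer's tables in place, removing an account from the share is enough to cut off access; there is no copied data to clean up.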
Data producers can publish and share external tables, which are views over files stored in Amazon Simple Storage Service (Amazon S3). Optionally, these can include Delta Lake and Iceberg formats. A data producer can share data externally via a so-called “Snowflake reader” account and the supported APIs. Partitioned data can also be exported to an Amazon S3 bucket using any popular file format.
The concept of domain ownership in the context of a Snowflake-centric data mesh extends from the Snowflake account to an AWS account or AWS Organization. AWS-native services can be used to ingest data, transform, and enrich this data on its way to publishing in Snowflake as a data product.
Data as a Product
Data as a product is about the usability of data by consumers, whether they are data scientists, data analysts, or traditional database administrators (DBAs), across business units and lines of business.
Snowflake has many entry points through which a data owner can make prepared data accessible to consumers. For example, Snowpark libraries enable consumers to query and process data at scale in Snowflake using Java, Python, and Scala. Consumers can also use Snowflake Connectors to move data to and from Kafka or Spark.
For scripting use cases, consumers can leverage Snowflake scripting to write stored procedures and procedural code. Finally, Snowflake also supports a SQL REST API to perform queries. Snowflake is positioned as a data platform to enable a wide variety of consumers and consumer access methods.
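As an illustration of the SQL REST API entry point, the sketch below prepares (but does not send) a request to the `/api/v2/statements` endpoint using only the Python standard library. The account identifier, token, and object names are placeholders, and key-pair JWT authentication is an assumption; an OAuth token would use a different token-type header value.

```python
import json
import urllib.request


def build_statement_request(account_identifier, jwt_token, sql,
                            warehouse, database, schema, role, timeout_s=60):
    """Build a POST request that submits one SQL statement for execution."""
    url = f"https://{account_identifier}.snowflakecomputing.com/api/v2/statements"
    body = {
        "statement": sql,
        "timeout": timeout_s,  # seconds before the statement is cancelled
        "warehouse": warehouse,
        "database": database,
        "schema": schema,
        "role": role,
    }
    headers = {
        "Authorization": f"Bearer {jwt_token}",
        "X-Snowflake-Authorization-Token-Type": "KEYPAIR_JWT",
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    req = build_statement_request(
        "myorg-myaccount", "<jwt>",  # hypothetical account and token
        "SELECT COUNT(*) FROM CAMPAIGN_METRICS",
        "WH_ANALYST", "MARKETING", "ANALYTICS", "MARKETING_ANALYST",
    )
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it; omitted because it needs real credentials.
```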
Now, let’s look at steps to implement data as a product with Snowflake and AWS:
- Raw layer: Assume raw data resides in Amazon S3, ingested from several sources. We’ll call this the raw layer.
- Processed layer: We’ll clean and validate the data from the raw layer and bring it into Snowflake, which we’ll call the processed layer. We can load data from S3 to Snowflake using Snowpipe. The data here is transactional in nature. Please refer to these implementation steps in this AWS blog post.
- Semantic/aggregated layer: Next, we define an aggregated layer that holds aggregated data and calculated business metrics. Each business domain owns the construction of its own layer, which can be defined using SQL, Python, Java, or Scala in Snowflake. For example, a marketing team (account 1) can own one database with its own set of marketing analytics metrics, while a finance team (account 2) owns another database.
- As shown in the architecture diagram in Figure 1, if the marketing team needs to perform their analytics in Amazon QuickSight they can access the database and tables (per defined permissions) from Amazon QuickSight as detailed in these steps.
- Similarly, other teams may work on machine learning and need to access Snowflake data from Amazon SageMaker. They can access the database and tables (per defined permissions) through Amazon SageMaker Data Wrangler as detailed in these steps.
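The processed and aggregated layers described above can be sketched as two generated statements: a Snowpipe definition that copies new files from an S3-backed stage into a table, and a view that computes a metric on top of it. All object names, the CSV file format, and the `SUM`-based metric are hypothetical choices for illustration. `AUTO_INGEST = TRUE` also assumes that S3 event notifications have been configured for the stage's bucket.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def _check(*names):
    """Accept only simple unquoted identifiers."""
    for name in names:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")


def processed_layer_pipe(database, schema, table, stage):
    """Snowpipe that loads files landing in the raw-layer stage into a table."""
    _check(database, schema, table, stage)
    target = f"{database}.{schema}.{table}"
    return (
        f"CREATE PIPE {target}_PIPE AUTO_INGEST = TRUE AS "
        f"COPY INTO {target} FROM @{database}.{schema}.{stage} "
        "FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)"
    )


def aggregated_layer_view(database, schema, source_table, view, group_by, metric):
    """View in the aggregated layer that sums one metric per group."""
    _check(database, schema, source_table, view, group_by, metric)
    prefix = f"{database}.{schema}"
    return (
        f"CREATE OR REPLACE VIEW {prefix}.{view} AS "
        f"SELECT {group_by}, SUM({metric}) AS TOTAL_{metric} "
        f"FROM {prefix}.{source_table} GROUP BY {group_by}"
    )


if __name__ == "__main__":
    print(processed_layer_pipe("MARKETING", "PROCESSED", "CAMPAIGN_EVENTS", "RAW_STAGE"))
    print(aggregated_layer_view(
        "MARKETING", "PROCESSED", "CAMPAIGN_EVENTS",
        "CAMPAIGN_SPEND", "CAMPAIGN_ID", "SPEND",
    ))
```

Keeping each domain's pipe and views in its own database is what lets the marketing and finance teams evolve their metrics independently.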
Self-Service Data Infrastructure
A self-service data infrastructure enables data consumers to directly access, explore, and analyze distributed domain-specific data products on their own by providing a high-level abstraction of infrastructure. This can help reduce the complexity of provisioning and lifecycle management of data products.
Ease of Provisioning, Scaling, Isolation and Cost Attribution
Snowflake is a software-as-a-service (SaaS) product so there is no explicit deployment needed. Snowflake also introduced the concept of a virtual warehouse, which is a cluster of compute resources that are provisioned on demand to perform data processing operations using SQL and Spark. There are several benefits provided by virtual warehouses.
- Virtual warehouses come in “T-shirt” sizes, from X-Small up to 4X-Large. Changing the size scales capacity vertically, up or down. Because a warehouse can be resized at any time, data teams are not locked into an over- or under-sized configuration.
- Automatic and instant start-and-stop of a virtual warehouse eliminates the need to set up or tear down any compute infrastructure. With this, data teams will be charged only when they use the warehouse.
- “Multi-cluster” warehouses support manual or automatic scaling in and out of clusters to provide horizontal scalability based on load.
- Each virtual warehouse scales independently of the others. The separation of storage and compute ensures maximum flexibility of compute allocation for each virtual warehouse.
- We recommend creating a separate virtual warehouse for each persona to provide autonomy and the flexibility to use compute infrastructure based on need. This also helps attribute cost to individual personas.
As a result, data producers and consumers no longer depend on IT, DevOps, or DBAs to procure or manage their infrastructure.
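The per-persona recommendation above can be sketched as a small helper that emits `CREATE WAREHOUSE` and `ALTER WAREHOUSE` statements. The `WH_` naming prefix, the default 60-second auto-suspend, and the persona names are assumptions for illustration. The multi-cluster parameters are only emitted when more than one cluster is requested, since multi-cluster warehouses depend on the Snowflake edition in use.

```python
import re
from dataclasses import dataclass

VALID_SIZES = (
    "XSMALL", "SMALL", "MEDIUM", "LARGE",
    "XLARGE", "XXLARGE", "XXXLARGE", "X4LARGE",
)
_PERSONA = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


@dataclass(frozen=True)
class PersonaWarehouse:
    persona: str
    size: str = "XSMALL"
    auto_suspend_s: int = 60   # suspend after this many idle seconds
    max_clusters: int = 1      # values above 1 enable multi-cluster scaling

    def __post_init__(self):
        if not _PERSONA.match(self.persona):
            raise ValueError(f"unsafe persona name: {self.persona!r}")
        if self.size not in VALID_SIZES:
            raise ValueError(f"unknown warehouse size: {self.size!r}")

    @property
    def name(self):
        return f"WH_{self.persona.upper()}"

    def create_sql(self):
        """Warehouse that starts suspended and resumes on the first query."""
        sql = (
            f"CREATE WAREHOUSE IF NOT EXISTS {self.name} "
            f"WAREHOUSE_SIZE = '{self.size}' "
            f"AUTO_SUSPEND = {self.auto_suspend_s} AUTO_RESUME = TRUE "
            "INITIALLY_SUSPENDED = TRUE"
        )
        if self.max_clusters > 1:
            sql += (
                f" MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = {self.max_clusters}"
                " SCALING_POLICY = 'STANDARD'"
            )
        return sql

    def resize_sql(self, new_size):
        """Vertical scaling: change the T-shirt size of an existing warehouse."""
        if new_size not in VALID_SIZES:
            raise ValueError(f"unknown warehouse size: {new_size!r}")
        return f"ALTER WAREHOUSE {self.name} SET WAREHOUSE_SIZE = '{new_size}'"


if __name__ == "__main__":
    for wh in (PersonaWarehouse("data_engineer", "MEDIUM"),
               PersonaWarehouse("bi_engineer", "SMALL", max_clusters=3)):
        print(wh.create_sql() + ";")
```

One warehouse per persona also makes cost attribution straightforward, because usage is metered per warehouse.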
Simplification of Data Development and Debugging
Zero-copy cloning allows users to make an instant snapshot of a table as a new table without incurring additional data storage overhead. This means data can be instantly replicated from one environment to another without affecting the original workload. Similarly, a snapshot can be taken instantly after important events without downtime or performance impact.
On top of that, the Snowflake Time Travel feature allows users to go back in time and access an older copy of the data, either to compare its state before and after a change or to roll back. These features simplify the development and debugging lifecycle.
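A minimal sketch of both features, generating the relevant statements: `CREATE TABLE ... CLONE` for zero-copy cloning (optionally as of a past point in time) and a `BEFORE (STATEMENT => ...)` query for inspecting a table as it was before a given change. The table names and query ID are hypothetical, and how far back Time Travel can reach depends on the retention period configured for the table.

```python
import re

_DOTTED = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}$")
_QUERY_ID = re.compile(r"^[0-9a-fA-F-]{36}$")


def _check_table(name):
    """Accept table, schema.table, or database.schema.table."""
    if not _DOTTED.match(name):
        raise ValueError(f"unsafe table name: {name!r}")


def clone_table_sql(source, target, seconds_ago=None):
    """Zero-copy clone of source; optionally as it was seconds_ago in the past."""
    _check_table(source)
    _check_table(target)
    sql = f"CREATE TABLE {target} CLONE {source}"
    if seconds_ago is not None:
        if seconds_ago <= 0:
            raise ValueError("seconds_ago must be positive")
        sql += f" AT (OFFSET => -{int(seconds_ago)})"
    return sql


def before_statement_sql(table, query_id):
    """Query a table as it was just before the given statement ran."""
    _check_table(table)
    if not _QUERY_ID.match(query_id):
        raise ValueError(f"unexpected query id format: {query_id!r}")
    return f"SELECT * FROM {table} BEFORE (STATEMENT => '{query_id}')"


if __name__ == "__main__":
    # Clone production into a dev schema, as of one hour ago.
    print(clone_table_sql("SALES.PROD.ORDERS", "SALES.DEV.ORDERS", seconds_ago=3600))
    # Inspect the table before a suspect update (hypothetical query ID).
    print(before_statement_sql(
        "SALES.PROD.ORDERS", "01b2c3d4-0000-1111-2222-333344445555"
    ))
```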
AWS Components of Self-Service Data Infrastructure
AWS provides a rich set of serverless or managed analytics services to complete the data landscape that can be provisioned on demand. AWS services like Amazon AppFlow, AWS Glue, and AWS Lambda can be used to build services for data integration, transformation, and job orchestration.
Amazon QuickSight can be used for self-service data analytics and interactive dashboards, and Amazon SageMaker can be used for artificial intelligence (AI) and machine learning (ML). These serverless or managed analytics services can be leveraged to implement a feature-rich self-service data infrastructure.
Federated Governance
One of the standout features of Snowflake is federated governance, which centralizes data management and enables organizations to maintain consistency across multiple data sources on a unified platform.
Users can create a role in AWS Identity and Access Management (IAM) and assign a Snowflake account as a trusted entity (detailed here), which facilitates secure and consistent user authentication and authorization across both Snowflake and AWS services.
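The trust relationship described above can be sketched as an IAM trust policy document. The sketch assumes the common pattern in which Snowflake reports an IAM user ARN and an external ID for an integration (for example via `DESC INTEGRATION`), and the role's trust policy allows only that principal to assume it. The ARN and external ID shown are placeholders, not real values.

```python
import json


def snowflake_trust_policy(snowflake_iam_user_arn, external_id):
    """Trust policy letting the Snowflake-side IAM user assume the role."""
    if not snowflake_iam_user_arn.startswith("arn:aws:iam::"):
        raise ValueError("expected an IAM ARN")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": snowflake_iam_user_arn},
                "Action": "sts:AssumeRole",
                # The external ID guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = snowflake_trust_policy(
        "arn:aws:iam::123456789012:user/example-snowflake-user",  # placeholder
        "EXAMPLE_EXTERNAL_ID",                                    # placeholder
    )
    print(json.dumps(policy, indent=2))
```

The resulting JSON could be supplied as the role's trust policy when creating it in IAM; the permissions policy that grants access to specific S3 buckets is attached to the role separately.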
The secure data sharing feature allows organizations to collaboratively share insights without duplicating or moving data. This is further enhanced by Amazon S3 integration, where data can be ingested and stored in large volumes, and AWS PrivateLink which ensures secure data transfer between Snowflake and AWS services over a private network connection.
Snowflake’s federated governance provides granular role-based access control (RBAC) that defines varying access levels based on user roles, reducing the risk of unauthorized access. This is supported by IAM which controls access to AWS resources, ensuring a consistent and secure access control mechanism.
It’s worth noting that this is bidirectional: when data flows from AWS to Snowflake, authorization is often controlled using IAM. Conversely, when data flows from Snowflake to AWS, authorization is handled using Snowflake RBAC.
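On the Snowflake side, role-based access control comes down to granting privileges to roles and roles to users. The sketch below generates a read-only grant set for one schema; the role, warehouse, schema, and user names are hypothetical, and a real deployment would likely add future grants and a role hierarchy.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def _check(*names):
    """Accept only simple unquoted identifiers."""
    for name in names:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")


def read_only_grants(role, warehouse, database, schema, users):
    """Statements giving a role read access to one schema, then assigning it."""
    _check(role, warehouse, database, schema, *users)
    fq_schema = f"{database}.{schema}"
    statements = [
        f"CREATE ROLE IF NOT EXISTS {role}",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role}",
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role}",
        f"GRANT USAGE ON SCHEMA {fq_schema} TO ROLE {role}",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {fq_schema} TO ROLE {role}",
    ]
    # Users receive access only through the role, never directly.
    statements += [f"GRANT ROLE {role} TO USER {user}" for user in users]
    return statements


if __name__ == "__main__":
    for stmt in read_only_grants(
        "MARKETING_ANALYST", "WH_ANALYST", "MARKETING", "ANALYTICS", ["ALICE", "BOB"]
    ):
        print(stmt + ";")
```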
Data protection is another key component, with end-to-end data encryption safeguarding data at rest and in transit. This robust security, combined with the capability to establish private connectivity via AWS PrivateLink, helps meet stringent security requirements.
Snowflake also protects sensitive data through a combination of dynamic data masking and column-level security policies. Dynamic data masking allows sensitive data to be automatically masked based on predefined rules, while column-level security policies provide granular control over who can access and view specific data columns.
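As a sketch of dynamic data masking, the helper below generates a masking policy that returns the real value only to an allow-list of roles, plus the statement that attaches it to a column. The policy name, role names, mask text, and the choice of a `STRING` column are assumptions for illustration.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def _check(*names):
    """Accept only simple unquoted identifiers."""
    for name in names:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")


def masking_policy_sql(policy, allowed_roles, mask="***MASKED***"):
    """Policy showing the value to allowed roles and a fixed mask to others."""
    _check(policy, *allowed_roles)
    if not allowed_roles:
        raise ValueError("at least one allowed role is required")
    roles = ", ".join(f"'{role}'" for role in allowed_roles)
    safe_mask = mask.replace("'", "''")  # escape quotes inside the literal
    return (
        f"CREATE MASKING POLICY IF NOT EXISTS {policy} "
        "AS (val STRING) RETURNS STRING -> "
        f"CASE WHEN CURRENT_ROLE() IN ({roles}) THEN val ELSE '{safe_mask}' END"
    )


def apply_policy_sql(database, schema, table, column, policy):
    """Attach the policy to one column of an existing table."""
    _check(database, schema, table, column, policy)
    return (
        f"ALTER TABLE {database}.{schema}.{table} "
        f"MODIFY COLUMN {column} SET MASKING POLICY {policy}"
    )


if __name__ == "__main__":
    print(masking_policy_sql("EMAIL_MASK", ["FINANCE_ADMIN"]))
    print(apply_policy_sql("FINANCE", "CORE", "CUSTOMERS", "EMAIL", "EMAIL_MASK"))
```

Because the policy is evaluated at query time against the caller's current role, the same shared table can serve different consumers without maintaining separate masked copies.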
Comprehensive auditing and monitoring of data access and usage is made possible, providing visibility into who is accessing data, when, and for what purpose. This can be further improved by integrating AWS Lambda, which can execute code in response to changes and helps maintain data quality and consistency within Snowflake.
Finally, Snowflake’s federated governance is built with compliance and certification in mind, including GDPR, HIPAA, SOC 1, SOC 2, and PCI DSS. With the integration of AWS Glue for ETL processes and Amazon API Gateway for custom API creation, organizations can maintain regulatory compliance and enhance data accessibility and management.
Conclusion
Implementing a data mesh architecture with Snowflake as the data platform and leveraging various AWS services offers organizations scalability, agility, and autonomy for their data teams.
The pillars of domain-driven ownership, data as a product, self-service data infrastructure, and federated governance enable seamless collaboration, efficient data provisioning, simplified development, and robust security. By combining the strengths of Snowflake and AWS, organizations can accelerate data-driven decision-making and unlock the full potential of their data assets.
You can also learn more about Snowflake in AWS Marketplace.
Snowflake – AWS Partner Spotlight
Snowflake is an AWS Competency Partner that has reinvented the data warehouse, building a new enterprise-class SQL data warehouse designed from the ground up for the cloud and today’s data.