AWS Big Data Blog

Deploy modern data platforms in minutes with MDAA

Modern Data Architecture Accelerator (MDAA) is an open source framework that replaces infrastructure code with concise YAML configuration. Your team can deploy a governed, production-ready data architecture and reduce deployment time from months to weeks, depending on complexity and team experience.

Organizations building modern data architecture on AWS face a critical challenge: deploying production-ready, governed infrastructure traditionally requires 6–12 months of custom development, thousands of lines of infrastructure code, and continuous remediation cycles to maintain security and compliance. Governance is often added incrementally, treated as an afterthought that creates compliance gaps and engineering rework.

MDAA addresses this by replacing infrastructure code with concise YAML configuration, achieving up to 97.6 percent code reduction (from approximately 1,800 lines of AWS CloudFormation to 45 lines of MDAA YAML) while embedding governance from the start. The complete Governed Lakehouse Starter Kit deploys 491 AWS resources across 12 stacks from approximately 450 lines of YAML configuration. That configuration expands into approximately 29,700 lines of CloudFormation, a 66x expansion in which each line of YAML stands in for roughly 66 lines of production-ready infrastructure code.

In this post, we explore how MDAA transforms data architecture development from months of manual coding to production-ready deployment through configuration-driven infrastructure and embedded governance, examine a real customer transformation, and provide a clear implementation pathway for your own data modernization journey.

Customer use case and challenge

A university system office needed to modernize its analytics architecture across 17 campuses while managing sensitive educational data. Its dependency on a third party created bottlenecks that stretched feature implementation from weeks to months, and its IT team lacked the cloud skills to build modern infrastructure independently.

With MDAA, they achieved:

  • 95 percent reduction in time-to-value for dashboard and feature implementation (from weeks to hours).
  • 17 campuses integrated into a unified, secure architecture.
  • 7.2TB of data and over 8,000 dashboards migrated successfully.
  • Significant cost savings by removing third-party dependencies and reducing license costs.
  • Enhanced security posture for external stakeholders accessing sensitive educational data.

The team used MDAA to implement a modernization strategy with continuous integration and continuous delivery (CI/CD) for automated deployment. The architecture now supports rapid response to stakeholder requests while maintaining strict data governance through AWS Lake Formation.

Their transformation demonstrates what becomes possible when governance is embedded from launch rather than added incrementally: months of manual development give way to production-ready deployment in weeks through configuration-driven infrastructure.

Solution: MDAA and its value propositions

MDAA’s capabilities stem from its modular, composable architecture. The accelerator provides over 40 pre-built modules that encapsulate AWS best practices for security, governance, and operational excellence. Organizations describe the outcomes they want in MDAA-specific YAML configuration files (not CloudFormation or Terraform YAML) and the accelerator automatically translates these configurations into AWS Cloud Development Kit (AWS CDK) constructs, which then deploy via CloudFormation with embedded governance.

Configuration over code. The MDAA framework takes a fundamentally different approach: describe the outcomes you want in YAML, and the accelerator deploys production-ready infrastructure with embedded governance. Consider deploying a governed data lake where fraud detection teams need write access to transaction data, while marketing analytics teams require read-only access to customer behavior data. Traditional approaches require over 1,800 lines of CloudFormation across Amazon Simple Storage Service (Amazon S3) buckets, AWS Key Management Service (AWS KMS) keys, AWS Identity and Access Management (IAM) policies, and Lake Formation permissions. With MDAA, the same governed data lake is expressed in 45 lines of configuration, a 97.6 percent reduction, while helping you apply encryption, least-privilege access, and cross-account governance as built-in defaults.

The configuration deploys multi-zone S3 storage with KMS encryption, Lake Formation permissions with tag-based access control (TBAC) enabled, Amazon SageMaker Unified Studio for data product discovery, and encrypted AWS Glue Data Catalog with automated crawlers. All permissions flow through Lake Formation rather than individual IAM policies.

Embedded governance from day one. Governance is declared in YAML and deployed alongside infrastructure from the first run. Fine-grained access controls, encrypted data catalogs, data quality validation, audit trails, and sensitive data classification are all part of the same configuration. MDAA’s Governed Lakehouse starter kit defines an entire governed data architecture in roughly 450 lines of YAML, which produces approximately 29,700 lines of CloudFormation across 12 stacks (a 98.5 percent reduction in infrastructure code).
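The reduction and expansion figures quoted in this post follow from simple ratios. The short sketch below reproduces them from the rounded line counts given here; the function names are ours and purely illustrative. With a baseline of exactly 1,800 lines the first figure comes out at 97.5 percent, so the quoted 97.6 percent is consistent with a CloudFormation baseline slightly above 1,800 lines.

```python
def reduction_percent(before_lines: int, after_lines: int) -> float:
    """Percentage of lines eliminated when going from before_lines to after_lines."""
    return (1 - after_lines / before_lines) * 100


def expansion_ratio(generated_lines: int, config_lines: int) -> float:
    """Number of generated lines that each configuration line stands in for."""
    return generated_lines / config_lines


# Governed data lake example: roughly 1,800 CloudFormation lines vs. 45 YAML lines.
data_lake = reduction_percent(1800, 45)

# Governed Lakehouse starter kit: about 29,700 CloudFormation lines from about 450 YAML lines.
starter_kit = reduction_percent(29_700, 450)
ratio = expansion_ratio(29_700, 450)

print(f"data lake reduction:   {data_lake:.1f}%")
print(f"starter kit reduction: {starter_kit:.1f}%")
print(f"expansion ratio:       {ratio:.0f}x")
```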

Modular, composable architecture. Each module is purpose-built to handle a specific capability within the data architecture. Modules communicate through AWS Systems Manager Parameter Store, passing resource identifiers (Amazon Resource Names (ARNs), IDs, and names) between stacks. This approach removes hardcoded dependencies. A KMS key created in one module can be referenced by another through parameter resolution, with all dependencies resolved automatically at deployment time.
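To illustrate the idea of parameter resolution, here is a minimal sketch that does not reflect MDAA's actual implementation. It assumes references are strings starting with `ssm:` (the prefix that appears in the configuration samples in this post) and uses an in-memory dictionary as a stand-in for Parameter Store. The parameter name, the ARN, and the `encryptionKeyArn` key are hypothetical.

```python
from typing import Any, Mapping

SSM_PREFIX = "ssm:"


def resolve_refs(config: Any, parameters: Mapping[str, str]) -> Any:
    """Return a copy of config with every 'ssm:<name>' string replaced by its value."""
    if isinstance(config, dict):
        return {key: resolve_refs(value, parameters) for key, value in config.items()}
    if isinstance(config, list):
        return [resolve_refs(item, parameters) for item in config]
    if isinstance(config, str) and config.startswith(SSM_PREFIX):
        name = config[len(SSM_PREFIX):]
        if name not in parameters:
            raise KeyError(f"unresolved parameter reference: {name}")
        return parameters[name]
    return config  # anything else (numbers, plain strings) passes through unchanged


# Stand-in for Parameter Store: the producing module publishes its key ARN here.
published = {
    "/example-org/datalake/kms/arn": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
}

# The consuming module refers to the key by parameter name, not by a hardcoded ARN.
consumer_config = {
    "bucket": {"encryptionKeyArn": "ssm:/example-org/datalake/kms/arn"},
    "tags": ["governed", "curated"],
}

resolved = resolve_refs(consumer_config, published)
print(resolved["bucket"]["encryptionKeyArn"])
```

Because the consumer only knows the parameter name, the producing module can recreate or rotate the key without any change to the consumer's configuration.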

The diagram illustrates the deployed architecture and team-level access flow that MDAA generates from the 45-line configuration.

Progressive architecture patterns. MDAA provides four reference architecture patterns that align to progressive stages of data infrastructure maturity:

  • Basic Data Lake deploys a governed data lake with built-in security controls, data quality checks, and centralized metadata management using AWS Lake Formation and AWS Glue.
  • Data Science Platform extends the data lake with Amazon SageMaker notebooks, feature stores, and machine learning (ML) pipelines so data science teams can experiment and train models on governed data.
  • SageMaker Unified Studio adds a single interface for analytics and ML collaboration, connecting data engineers, analysts, and data scientists in one workspace.
  • Generative AI Platform layers Amazon Bedrock and Retrieval Augmented Generation (RAG) capabilities on top of your existing data foundation, so teams can build generative AI applications grounded in enterprise data.

Each pattern builds on the one before it. You can start with the Basic Data Lake and adopt additional patterns as your team’s needs grow. MDAA’s modular design means you add capabilities without rearchitecting what you already deployed.

The infrastructure is versioned through GitHub, repeatable across environments, and auditable through comprehensive AWS CloudTrail logging. Data engineers focus on data pipelines and business logic while MDAA manages infrastructure complexity and governance integration. This represents the fundamental shift: from writing infrastructure code to describing the outcomes you want through configuration, with governance embedded from the start.

Use case of MDAA: Governed data architecture

DataOps teams spend significant time on governance tasks, including permissions management, compliance validation, and access control, rather than building pipelines and analytics. These aren’t data problems; they’re governance problems that consume engineering capacity meant for higher-value work. MDAA addresses this at the architectural level by declaring governance in the same YAML that deploys the infrastructure.

The following sections walk through how each governance module works in practice.

Publish, discover, subscribe, and consume data products between business units: SageMaker Unified Studio

Amazon SageMaker Unified Studio provides a governed data catalog where data producers publish data products, and consumers discover and subscribe to them. Your deployment with MDAA includes a pre-configured domain, blueprints (managed and custom), projects, and environment profiles, all defined in a single configuration file:

# sagemaker.yaml --- 16 lines that deploy 114 CloudFormation resources
domains:
  domain1:
    dataAdminRole:
      id: ssm:/{{org}}/govern1/generated-role/data-admin/id
    description: SMUS Domain 1
    userAssignment: MANUAL

    tooling:
      vpcId: '{{context:vpc_id}}'
      subnetIds:
        - '{{context:private_subnet_id1}}'
        - '{{context:private_subnet_id2}}'

    groups:
      team1:
        ssoId: '{{context:team1-group-sso-id}}'
      team2:
        ssoId: '{{context:team2-group-sso-id}}'

Behind this configuration, MDAA deploys an Amazon SageMaker Unified Studio domain with dedicated KMS keys, execution and provisioning roles, and single sign-on group profiles for team access. Data producers tag and publish assets with metadata, ownership, and classification. Consumers browse a searchable catalog, see only authorized assets, and request access through a governed workflow. Cross-account and cross-business-unit data sharing flows through a subscription model, ensuring every access grant is tracked, auditable, and revocable.
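The configuration above uses two kinds of placeholders: `{{org}}` and `{{context:...}}`. The sketch below shows one way such late-bound values could be expanded before deployment. It is an illustration under our own assumptions, not MDAA's resolver: we assume `{{name}}` is looked up in a set of general variables and `{{context:name}}` in deployment context, and all sample values are made up.

```python
import re

# Matches {{name}} and {{context:name}}; names may contain letters, digits, "_" and "-".
PLACEHOLDER = re.compile(r"\{\{\s*(context:)?([\w-]+)\s*\}\}")


def expand(text: str, variables: dict, context: dict) -> str:
    """Replace {{name}} from variables and {{context:name}} from context."""

    def substitute(match: re.Match) -> str:
        source = context if match.group(1) else variables
        key = match.group(2)
        if key not in source:
            raise KeyError(f"no value supplied for placeholder {match.group(0)}")
        return str(source[key])

    return PLACEHOLDER.sub(substitute, text)


# Hypothetical values; real deployments would supply their own.
variables = {"org": "example-org"}
context = {"vpc_id": "vpc-0example", "team1-group-sso-id": "example-sso-group-id"}

role_ref = expand("ssm:/{{org}}/govern1/generated-role/data-admin/id", variables, context)
vpc_id = expand("{{context:vpc_id}}", variables, context)
print(role_ref)
print(vpc_id)
```

Failing loudly on a missing value is a deliberate choice in this sketch: a placeholder that silently stays unexpanded would otherwise surface much later as a confusing deployment error.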

Use case of MDAA: Restricting access to cardholder data using Lake Formation

AWS Lake Formation provides fine-grained access control at database and table levels, removing manual IAM policy management. MDAA deploys AWS Lake Formation with pre-configured settings that disable IAMAllowedPrincipals, the critical governance setting that ensures all permissions flow through centralized governance:

# lakeformation-settings.yaml --- 6 lines that deploy 25 CloudFormation resources
lakeFormationAdminRoles:
  - id: generated-role-id:data-admin
createCdkLFAdmin: true
createDataZoneAdminRole: true
iamAllowedPrincipalsDefault: false

That last flag is the single most important governance setting in the platform. Without it, an IAM principal with glue:GetTable can read tables in the catalog, bypassing the entire access control model. Most manual setups miss this or defer it.
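The effect of this flag can be captured in a deliberately simplified decision function. This is a toy model for reasoning about the two modes, not the actual Lake Formation authorization logic, and the parameter names are ours.

```python
def can_read_table(iam_allows_glue_get_table: bool,
                   lake_formation_grant: bool,
                   iam_allowed_principals_default: bool) -> bool:
    """Toy access decision for a single catalog table."""
    if not iam_allows_glue_get_table:
        return False  # the IAM permission is required in both modes
    if iam_allowed_principals_default:
        return True   # IAM alone decides; Lake Formation grants are not consulted
    return lake_formation_grant  # IAM and a Lake Formation grant are both needed


# A principal with glue:GetTable but no Lake Formation grant:
print(can_read_table(True, False, iam_allowed_principals_default=True))
print(can_read_table(True, False, iam_allowed_principals_default=False))
```

In the first call the principal reads the table on the strength of its IAM policy alone. In the second, the same principal is denied until a Lake Formation grant exists, which is the behavior the setting is meant to enforce.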

With the data lake configuration, you declare roles and access policies in YAML: admins get full control, engineers get read access to curated data, and extract, transform, and load (ETL) roles get scoped write access. MDAA compiles these declarations into the correct S3 bucket policies and Lake Formation registrations.

Use case of MDAA: Ensuring data integrity with AWS Glue Data Quality

AWS Glue Data Quality runs automated validation rulesets continuously as part of the pipeline, not as periodic batch checks. MDAA’s data quality module supports over 15 built-in rule types, from completeness and uniqueness checks to statistical thresholds and data freshness validation:

# data-quality.yaml
projectName: example-project

rulesets:
  customer-data-quality:
    description: Validate customer data completeness and uniqueness
    targetTable:
      databaseName: project:databaseName/customer-data
      tableName: customers
    ruleset:
      - ruleType: IsComplete
        column: customer_id
      - ruleType: Uniqueness
        column: email
        comparisonOperator: ">"
        threshold: 0.95
      - ruleType: RowCount
        comparisonOperator: ">"
        value: 100

Quality metrics flow into Amazon CloudWatch for real-time alerting. If anomalies are detected, automated workflows quarantine affected records and alert data engineering teams before issues reach downstream consumers.
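To make the three rules in the ruleset concrete, the following sketch evaluates equivalent checks on in-memory rows. It is not how AWS Glue Data Quality executes rules. In particular, we assume here that uniqueness means the fraction of non-null values that occur exactly once, which may differ in detail from the service's definition. The sample rows are synthetic.

```python
from collections import Counter


def is_complete(rows, column):
    """True when every row has a non-null value in column."""
    return all(row.get(column) is not None for row in rows)


def uniqueness(rows, column):
    """Fraction of non-null values in column that occur exactly once (our assumption)."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 0.0
    counts = Counter(values)
    return sum(1 for value in values if counts[value] == 1) / len(values)


def evaluate(rows):
    """Apply the three checks from the customer-data-quality ruleset."""
    return {
        "IsComplete(customer_id)": is_complete(rows, "customer_id"),
        "Uniqueness(email) > 0.95": uniqueness(rows, "email") > 0.95,
        "RowCount > 100": len(rows) > 100,
    }


# 200 synthetic customers, with one email address deliberately duplicated.
rows = [{"customer_id": i, "email": f"user{i}@example.com"} for i in range(200)]
rows[1]["email"] = rows[0]["email"]

results = evaluate(rows)
small_batch = evaluate(rows[:50])
for rule, passed in results.items():
    print(f"{rule}: {'PASS' if passed else 'FAIL'}")
```

The full batch passes all three rules because 198 of 200 email values are still unique. A 50-row slice of the same data fails only the row-count rule.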

Protecting metadata at rest: AWS Glue Data Catalog encryption

Table schemas, column names, and partition structures can reveal sensitive information about an organization’s data architecture, even without access to the underlying data. AWS Glue Catalog Encryption secures metadata at rest using AWS KMS-managed keys. MDAA configures catalog encryption by default, so schema definitions and connection passwords are encrypted from initial deployment without requiring manual key management setup. Access to catalog metadata follows the same Lake Formation governance controls applied to the data itself, so teams see only the schemas that they’re authorized to query.

Auditing every data access event: CloudTrail integration

Every data access event must be logged and attributable to a specific identity. Without a complete audit trail, demonstrating compliance during a regulatory review becomes a manual, error-prone process. AWS CloudTrail captures API-level activity across the data infrastructure, recording who accesses what data, when, and from which service. MDAA configures CloudTrail integration by default, so audit logging is active from initial deployment rather than added retroactively. Log data flows into a centralized, tamper-resistant store, giving compliance teams a single location to query access history across all business units and accounts.
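As an illustration of the kind of question a compliance team asks of this log data, the sketch below groups a few hand-written records by calling identity. The field names (`eventTime`, `eventSource`, `eventName`, `userIdentity.arn`) follow the CloudTrail record format; the records themselves, including the role names, are invented, and a real review would query the centralized log store rather than a Python list.

```python
from collections import defaultdict


def access_by_identity(records):
    """Group (time, service, action) tuples by the ARN of the calling identity."""
    summary = defaultdict(list)
    for record in records:
        arn = record.get("userIdentity", {}).get("arn", "unknown")
        summary[arn].append(
            (record["eventTime"], record["eventSource"], record["eventName"])
        )
    return dict(summary)


# Invented sample records shaped like CloudTrail events.
records = [
    {
        "eventTime": "2025-01-15T09:12:03Z",
        "eventSource": "glue.amazonaws.com",
        "eventName": "GetTable",
        "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/example-analyst/alice"},
    },
    {
        "eventTime": "2025-01-15T09:12:04Z",
        "eventSource": "lakeformation.amazonaws.com",
        "eventName": "GetDataAccess",
        "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/example-analyst/alice"},
    },
    {
        "eventTime": "2025-01-15T10:30:45Z",
        "eventSource": "glue.amazonaws.com",
        "eventName": "GetTable",
        "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/example-etl/job-42"},
    },
]

summary = access_by_identity(records)
for arn, events in summary.items():
    print(arn, len(events))
```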

Identifying sensitive data automatically: Macie integration

In large environments, sensitive information spreads across dozens of S3 buckets through pipelines, transforms, and ad hoc data drops, and self-reporting data owners consistently produce gaps. Amazon Macie uses machine learning to automatically discover and classify sensitive data in S3, surfacing findings at the object level without manual tagging. MDAA configures Macie across your S3 buckets during deployment, routing findings to Amazon EventBridge where automated workflows can alert owners or trigger remediation.
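A workflow on the receiving end of EventBridge might route findings along these lines. This is a hedged sketch, not part of MDAA: the routing policy (remediate on high severity, otherwise notify the owner) is hypothetical, and the event fields reflect our understanding of the Macie finding event format, so verify them against the Macie documentation before relying on them.

```python
def route_finding(event: dict) -> tuple:
    """Return (action, bucket_name) for an EventBridge event, under a hypothetical policy."""
    if event.get("source") != "aws.macie":
        return ("ignore", None)  # not a Macie event

    detail = event.get("detail", {})
    severity = detail.get("severity", {}).get("description", "Low")
    bucket = detail.get("resourcesAffected", {}).get("s3Bucket", {}).get("name")

    # Hypothetical policy: act immediately on high severity, otherwise tell the owner.
    action = "remediate" if severity == "High" else "notify-owner"
    return (action, bucket)


# Invented sample event; the bucket name and object key are placeholders.
sample_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {
        "type": "SensitiveData:S3Object/Personal",
        "severity": {"description": "High"},
        "resourcesAffected": {
            "s3Bucket": {"name": "example-raw-bucket"},
            "s3Object": {"key": "drops/customers.csv"},
        },
    },
}

decision = route_finding(sample_event)
print(decision)
```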

Together, these controls form a layered defense that reduces compliance risk: Lake Formation governs access to cataloged data, Glue Data Quality validates integrity on arrival, and Macie identifies sensitive data that lands outside governed pipelines.

Multi-account data mesh

MDAA provides extensive support for multi-account data mesh setups, with decentralized data ownership across business units and centralized governance. The data mesh starter kit supports cross-account data product publishing and consumption, allowing organizations to scale data sharing while maintaining consistent security and compliance controls.

Technical implementation

Ready to deploy your modern data architecture? Here are the resources to get started:

MDAA Implementation Guide provides detailed instructions for deploying all starter packages, including architecture patterns, configuration examples, security best practices, and troubleshooting guidance.

MDAA Hands-on Workshop offers step-by-step guided implementation with AWS experts. The workshop covers configuration management best practices, implementation patterns, hands-on labs with real-world scenarios, and cleanup instructions.

GitHub Repository and Documentation provide source code, module reference, and comprehensive documentation.

Organizations approach MDAA from different starting points. Some modernize existing data architectures, migrating from on-premises infrastructure or legacy cloud architectures. Others build new architectures for artificial intelligence and machine learning (AI/ML) initiatives or generative AI applications. Financial services organizations require PCI-DSS compliance from day one. Healthcare organizations need controls that can help support HIPAA. Each journey benefits from MDAA’s configuration-driven approach and embedded governance.

Conclusion

MDAA transforms data architecture development from months of manual coding to production-ready deployment. Configuration-driven infrastructure reduces development time by 40–60 percent while embedding governance from the start. The university system’s 95 percent reduction in time-to-value demonstrates the outcome: organizations deploy secure, compliant, governed data architectures in weeks rather than months.

Financial services organizations can deploy architectures to help them align with PCI-DSS compliance requirements using Lake Formation access controls, Glue Data Quality validation, SageMaker Unified Studio data discovery, comprehensive CloudTrail audit trails, and automated Macie data classification, all inherited from configuration rather than built manually.

Data architecture journeys need not follow six-month timelines with governance added incrementally. MDAA provides an alternative: describe the outcomes you want through YAML configuration, inherit pre-validated security controls, and deploy production-ready infrastructure with comprehensive governance from initial deployment.

Security and compliance is a shared responsibility between AWS and the customer. For more information, see the AWS Shared Responsibility Model.

Need help or have questions? Contact AWS ProServe for personalized guidance on selecting the right package and deployment strategy for your organization.


About the authors

Sudeshna Dash

Sudeshna is a Data Scientist at AWS Professional Services based in Berlin, Germany. She specializes in data architecture, generative AI, and agentic AI systems on AWS. Sudeshna is a contributor to the Modern Data Architecture Accelerator (MDAA) open-source project and helps customers design and deploy governed, production-ready data and AI/ML architectures on AWS.

John Reynolds

John Reynolds is a Principal Engineer with AWS Professional Services based in Seattle, Washington. He leads the architecture and development of Modern Data Architecture Accelerator (MDAA), focusing on turning proven delivery patterns into reusable, production-ready foundations that customers can adopt and extend at scale.