AWS Public Sector Blog

Building national foundation models


National governments and organizations worldwide are racing to build national AI capabilities, but many struggle with the infrastructure complexity and costs involved. The reality is that many successful national foundation model programs are built on Amazon Web Services (AWS) Cloud infrastructure, not on-premises data centers.

This post examines real-world examples of successful national AI programs, explores the hidden economics that make AWS Cloud infrastructure essential, and discusses why cultural sovereignty matters as much as technical capability. Whether you’re planning a national AI strategy or evaluating infrastructure options, these lessons from global leaders will help you avoid common pitfalls and focus resources where they matter most.

Real-world, large-scale examples

If you want to build from scratch rather than merely fine-tune off-the-shelf models, you need scale, reliability, and the right cloud-managed services. Consider the leaders already doing this on AWS: national programs such as AI Singapore, which brings inclusive generative AI models to Southeast Asia (linked at the end of this post), and the frontier labs Anthropic and OpenAI, covered below.

The challenge of owning GPUs on premises

Building large models on premises isn’t only about buying enough servers. GPU hardware has a much shorter lifecycle than standard enterprise CPUs, rapidly evolving as AI performance demands climb. GPU-based servers might need to be refreshed every 2–3 years (or sooner to keep up with new model architectures), whereas typical CPU-based infrastructure often runs much longer before depreciation or repurposing.

This means organizations running on premises must commit to frequent, expensive hardware refresh cycles and plan for accelerated depreciation to stay competitive. And each new generation of GPUs launches with global demand outstripping supply, so organizations, especially in smaller markets, often face long lead times, price premiums, and uncertainty around future upgrades.
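
To put those refresh cycles in perspective, here’s a quick back-of-the-envelope sketch. The 2- and 3-year GPU intervals come from the discussion above; the 5-year CPU interval is an assumed baseline for comparison, and no dollar figures are implied.

```python
# Back-of-the-envelope comparison: what share of the purchase price must be
# written off each year for a given hardware refresh interval?
# The 2- and 3-year GPU intervals reflect the refresh cycles discussed above;
# the 5-year CPU interval is an assumed baseline for comparison.

refresh_intervals_years = {
    "GPU servers (aggressive refresh)": 2,
    "GPU servers (typical refresh)": 3,
    "CPU servers (assumed baseline)": 5,
}

for fleet, years in refresh_intervals_years.items():
    annual_share = 1 / years  # straight-line depreciation of the purchase price
    print(f"{fleet}: ~{annual_share:.0%} of purchase price per year")

# Example output:
# GPU servers (aggressive refresh): ~50% of purchase price per year
# GPU servers (typical refresh): ~33% of purchase price per year
# CPU servers (assumed baseline): ~20% of purchase price per year
```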

What do model builders actually need?

When organizations set out to build LLMs from the ground up, the overwhelming demand from data scientists and machine learning (ML) engineers is to spend time mastering data and models rather than wrestling with hardware, networking, security frameworks, or resource management. The real value is unlocked by tuning architectures, experimenting with new training recipes, and iterating rapidly—not troubleshooting servers or chasing compliance certifications.

Meeting this demand requires tackling several complex infrastructure and assurance needs:

  • High-throughput, scalable storage keeps massive datasets and rapid checkpointing from becoming bottlenecks. In the cloud, this is delivered by Amazon FSx for Lustre, which provides fast, parallel file system access at scale.
  • Fast, reliable networking enables data to move between thousands of GPUs in sync. Elastic Fabric Adapter (EFA) provides the low-latency, high-bandwidth connections that distributed ML workloads require.
  • Seamless management of massive GPU clusters is needed as experiments get bigger and more complex. Amazon SageMaker HyperPod lets teams automate and manage distributed training, freeing up specialists for actual model work.
  • Predictable, cost-effective GPU access is achieved through Amazon EC2 Capacity Blocks for ML, so teams can reserve scalable GPU capacity exactly when needed.
  • Durable, accessible storage for all model outputs, checkpoints, and data is handled by Amazon Simple Storage Service (Amazon S3); a minimal checkpoint-upload sketch follows this list.
  • Portability is increasingly part of the design: organizations use open standards and tools that simplify data migration between environments as needed. Technology and market offerings shift rapidly, and teams frequently reassess their infrastructure choices to take advantage of innovation, competition, and changing compliance standards.
  • Confidence in security and compliance is essential for regulated and government workloads through programs like Federal Risk and Authorization Management Program (FedRAMP) and Information Security Registered Assessors Program (IRAP). AWS infrastructure provides various compliance certifications and frameworks that organizations can use to meet their national privacy and security requirements.
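
To make the storage items above concrete, here’s a minimal sketch that uploads training checkpoints from a local scratch path (for example, an FSx for Lustre mount) to Amazon S3 with boto3. The bucket name, key prefix, and checkpoint directory are placeholders; in practice, teams usually hook this into their training framework’s checkpointing callbacks and run it from a single rank.

```python
"""Minimal sketch: persist training checkpoints to Amazon S3 with boto3.

Assumes AWS credentials are configured. The bucket, prefix, and local
checkpoint directory below are placeholders to replace with your own values.
"""
from pathlib import Path

import boto3

s3 = boto3.client("s3")

BUCKET = "my-model-checkpoints"                  # placeholder bucket name
PREFIX = "national-fm/run-001"                   # placeholder key prefix
LOCAL_CHECKPOINT_DIR = Path("/fsx/checkpoints")  # e.g., an FSx for Lustre mount


def upload_checkpoint(step: int) -> None:
    """Upload every file written for a given training step to S3."""
    step_dir = LOCAL_CHECKPOINT_DIR / f"step-{step:07d}"
    for path in step_dir.rglob("*"):
        if path.is_file():
            key = f"{PREFIX}/{path.relative_to(LOCAL_CHECKPOINT_DIR)}"
            s3.upload_file(str(path), BUCKET, key)
            print(f"uploaded s3://{BUCKET}/{key}")


if __name__ == "__main__":
    # In a real training loop this would run on a checkpoint interval,
    # typically from a single rank to avoid duplicate uploads.
    upload_checkpoint(step=1000)
```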

In every case, these supporting AWS services and compliance frameworks let data scientists and national AI builders focus on what they do best. They can harness local data and modeling expertise while the complexity of infrastructure, security, and regulation stays invisible in the background.

Cultural importance of national LLMs

Although the technical infrastructure gets most attention, the cultural and strategic dimensions of national foundation models (FMs) are equally critical. Language models aren’t merely computational tools—they’re repositories of cultural knowledge, linguistic nuance, and societal values.

National FMs serve purposes that extend far beyond technical capabilities. Models trained on local languages, dialects, and cultural contexts help preserve linguistic diversity that might otherwise be marginalized by dominant global models. They understand local idioms, cultural references, and historical context that global models can miss or misinterpret. Additionally, locally trained models reflect national educational curricula and cultural values while understanding local legal systems, business practices, and regulatory frameworks.

Sustainable infrastructure at scale

Training large-scale generative AI models is energy intensive. Running thousands of GPUs for weeks at a time quickly consumes large amounts of electricity. When organizations try to handle these workloads on premises, the result is typically less efficient use of power, higher cooling requirements, and a much larger carbon footprint, especially when energy is drawn from traditional grids or older facilities.

In contrast, running generative AI workloads in the cloud takes advantage of the global efficiency and renewable investments of cloud providers. AWS continues to invest heavily in expanding data center infrastructure globally while operating it as sustainably as possible, including significant investments in renewable energy projects.

Even the biggest model builders choose cloud

The scale and reliability required for frontier AI development have led the world’s leading AI companies to build on AWS infrastructure—demonstrating that the cloud is the default choice for serious model builders.

In October 2025, AWS completed Project Rainier, one of the world’s largest AI compute clusters featuring nearly 500,000 Trainium2 chips across multiple US data centers, deployed in under a year. Amazon invested $8 billion in Anthropic, which is actively using Project Rainier to train and deploy Claude, with plans to scale to over 1 million chips by the end of 2025.

In November 2025, OpenAI signed a $38 billion, 7-year deal with AWS to access hundreds of thousands of Nvidia GPUs, marking the ChatGPT maker’s first major partnership with AWS. The deal represents OpenAI’s commitment to scaling its infrastructure through proven cloud providers rather than attempting to build and manage data centers independently.

These partnerships underscore the critical reality that even organizations with virtually unlimited capital and the world’s top AI talent choose to build on cloud infrastructure rather than managing their own data centers. The complexity, speed, and scale required for frontier AI development make the cloud the only practical path forward—for both national AI builders and Silicon Valley AI labs.

Inference costs far exceed the costs of training

Much attention focuses on the upfront costs of training FMs, but the real economic challenge lies in inference—serving the model to users at scale. Training an LLM might cost millions of dollars over several months, but inference costs can balloon as usage grows.

Consider that training GPT-3 cost approximately $4.6 million, yet OpenAI’s inference costs for ChatGPT exceed that figure every week at current usage levels. This reality fundamentally changes the infrastructure equation for national AI programs.
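
Using only the figures quoted above, a quick calculation shows how the recurring bill dwarfs the one-time one:

```python
# Back-of-the-envelope comparison using the figures quoted above:
# a one-time training cost of roughly $4.6M versus inference spend
# that is at least that amount every week.
training_cost = 4.6e6          # one-time, approximate
weekly_inference_cost = 4.6e6  # lower bound implied above

annual_inference_cost = weekly_inference_cost * 52
print(f"Annual inference (lower bound): ${annual_inference_cost:,.0f}")
print(f"Multiple of training cost: {annual_inference_cost / training_cost:.0f}x")
# -> roughly $239 million per year, about 52x the one-time training cost
```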

Unlike training, which has a defined endpoint, inference requires continuous GPU capacity that scales with user adoption. This sustained GPU demand creates ongoing infrastructure requirements that grow proportionally with the number of users and queries. Furthermore, low-latency inference often requires compute resources distributed across multiple regions to enable responsive performance for geographically dispersed users.

The economics of inference also differ significantly from those of training. Managed services such as Amazon SageMaker Inference and Amazon Bedrock provide optimized inference that can reduce costs by 50–90 percent compared to self-managed deployments, making cloud infrastructure increasingly attractive for cost-conscious national programs.
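
As an illustration of what managed inference looks like in practice, here’s a minimal sketch that sends a prompt to a hosted model through the Amazon Bedrock Runtime Converse API with boto3. The model ID, Region, and prompt are placeholders; the service handles the underlying GPU capacity, scaling, and availability.

```python
"""Minimal sketch: serve a prompt through Amazon Bedrock's managed inference.

Assumes boto3 is installed, AWS credentials are configured, Bedrock is
available in the chosen Region, and you have access to the placeholder model.
"""
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "your-model-id"  # placeholder: any Bedrock model you have access to

response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize our national AI strategy in one paragraph."}],
        }
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

# The Converse API returns the generated text under output -> message -> content.
print(response["output"]["message"]["content"][0]["text"])
```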

This inference reality makes cloud infrastructure even more compelling for national AI programs. The alternative of maintaining enough on-premises GPU capacity to handle peak inference loads would require massive capital investments that sit idle during low-usage periods.

Start building your national AI capability

Successful national FM programs use cloud infrastructure to focus resources on what matters—data, models, and cultural relevance. The technical infrastructure, compliance frameworks, and cost optimization tools are available today.

Ready to explore how AWS can support your national AI program? Learn about AWS for government or contact our public sector team to discuss your specific requirements. For technical deep dives, explore our blog post on building LLMs for the public sector. To learn more about generative AI in Southeast Asia, read AI Singapore brings inclusive generative AI models to Southeast Asia with AWS.

The tools and infrastructure are available now for you to focus on your models and outcomes and let the cloud handle the complexity underneath. Your nation’s AI future depends on making the right infrastructure choices today.

Craig Lawton

Craig is a principal AI/ML solutions architect at AWS, specializing in AI/ML infrastructure and public sector cloud adoption. He works with government organizations and research institutions to design and implement large-scale AI systems. Craig has over 15 years of experience in cloud computing and distributed systems architecture.