AWS Public Sector Blog

Democratizing climate data science: How Columbia University’s LEAP center built AutoClimDS on AWS

Climate change poses one of the most pressing challenges of our time, yet accessing and analyzing climate data presents a significant barrier for researchers. The fragmented nature of data sources, heterogeneous formats, and steep technical requirements have long limited who can participate in climate research and slowed the pace of scientific discovery. Columbia University’s Learning the Earth with Artificial Intelligence and Physics (LEAP), a National Science Foundation (NSF) Science and Technology Center, collaborated with Amazon Web Services (AWS) to address these challenges. The result is AutoClimDS, an agentic AI system that researchers with no specialized coding expertise can use to conduct climate data science workflows using natural language.

In this post, you will learn how LEAP worked with the AWS Generative AI Innovation Center to build AutoClimDS, and what this collaboration can teach other research institutions seeking to democratize access to complex scientific data.

Lowering barriers to climate research with AI agents

Founded in 2021 with a $25 million grant from the NSF, LEAP represents a new paradigm in climate modeling—one that merges traditional physics-based approaches with machine learning. Led by Columbia University in collaboration with institutions including the NSF National Center for Atmospheric Research (NCAR) and the National Aeronautics and Space Administration (NASA) Goddard Institute for Space Studies (GISS), LEAP aims to improve near-term climate projections and train the next generation of climate data scientists.

“Climate data science faces persistent barriers stemming from the fragmented nature of data sources, heterogeneous formats, and the steep technical expertise required to identify, acquire, and process datasets,” said Ahmed Jaber, research scientist at LEAP. “These challenges limit participation, slow discovery, and reduce the reproducibility of scientific workflows.”

To address these barriers, LEAP developed AutoClimDS, an agentic AI system built around a curated knowledge graph that organizes climate datasets, tools, and workflows. Rather than manually searching disparate data portals, writing complex data retrieval scripts, and harmonizing inconsistent formats, researchers can state their research objective to AutoClimDS in natural language.

“Given only a research objective expressed in natural language, the agent leverages the knowledge graph to autonomously identify relevant data sources, reconcile their metadata, and execute preprocessing steps before generating analytical outputs such as figures and graphs,” the researchers explained in their paper on LEAP, which was published on Amazon Science.

Collaborating with the AWS Generative AI Innovation Center

Building a system capable of autonomously conducting climate research workflows required expertise spanning climate science and cloud-focused AI systems. LEAP turned to the AWS Generative AI Innovation Center, an acceleration team that connects organizations with AWS AI and machine learning experts to help ideate on, build, and deploy generative AI solutions.

The AWS team helped LEAP design a multi-agent architecture where specialized AI agents work together, coordinated by a central orchestrator. This modular approach mirrors how human research teams collaborate: Different experts handle data discovery, acquisition, analysis, and verification.

“We architected a multi-agent system on Amazon Bedrock and Neptune that can automatically reproduce published climate-science analyses from natural-language instructions,” said Karthick Jayavelu, a member of the AWS Generative AI Innovation Center team. “By leveraging a modular, knowledge-graph-powered design, the system lets LEAP integrate new datasets, adapt to evolving foundation models, and scale toward real-time applications like hurricane forecasting—all while preserving the scientific transparency and reproducibility required for peer-reviewed research.”

Building a knowledge graph-powered agentic AI system

AutoClimDS is built on the principle that a knowledge graph is all you need for scalable, agentic workflows for scientific inquiry. At its core is a semantically structured knowledge graph that integrates heterogeneous climate data from sources such as NASA’s Common Metadata Repository and institutional catalogs.

The system employs a multi-agent architecture with specialized components:

  • Orchestrator agent: Interprets user objectives, maintains research state, and delegates tasks to specialized agents.
  • Data discovery agent: Queries the knowledge graph using semantic search to identify relevant datasets.
  • Data acquisition agent: Retrieves and preprocesses data from cloud-focused sources such as those available in the registry of Open Data on AWS.
  • Climate modeling and analytics agent: Integrates datasets with climate model ensembles to produce harmonized simulations.
  • Verification agent: Validates data quality, logical consistency, and adherence to physical constraints.
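
The division of labor in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the AutoClimDS implementation: the `ResearchState` class, the agent functions, the toy catalog, and the sample values are all hypothetical, and a real orchestrator would call a foundation model to plan steps rather than run a fixed sequence.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ResearchState:
    """Shared state the orchestrator carries between agent calls."""
    objective: str
    datasets: List[str] = field(default_factory=list)
    values: List[float] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


def discover(state: ResearchState) -> ResearchState:
    # Hypothetical stand-in for a knowledge-graph search: keyword lookup.
    catalog = {"temperature": "toy_surface_temp", "precipitation": "toy_precip"}
    state.datasets = [
        name for key, name in catalog.items() if key in state.objective.lower()
    ]
    state.log.append("discovery")
    return state


def acquire(state: ResearchState) -> ResearchState:
    # Hypothetical stand-in for data retrieval: made-up sample values.
    toy_data = {
        "toy_surface_temp": [287.1, 287.4, 288.0],  # kelvin
        "toy_precip": [2.1, 2.4],                   # mm/day
    }
    for name in state.datasets:
        state.values.extend(toy_data[name])
    state.log.append("acquisition")
    return state


def verify(state: ResearchState) -> ResearchState:
    # Simple physical-constraint check: neither toy quantity can be negative.
    if any(v < 0 for v in state.values):
        raise ValueError("physically implausible negative value")
    state.log.append("verification")
    return state


def orchestrate(
    objective: str, agents: List[Callable[[ResearchState], ResearchState]]
) -> ResearchState:
    """Pass one state object through each specialized agent in turn."""
    state = ResearchState(objective=objective)
    for agent in agents:
        state = agent(state)
    return state


result = orchestrate(
    "Plot global surface temperature trends", [discover, acquire, verify]
)
print(result.datasets, result.log)
```

Because each agent only reads and writes the shared state, any one of them can be swapped out without touching the others, which is the modularity the architecture is aiming for.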

The AWS architecture supporting AutoClimDS includes:

  • Amazon Bedrock to access foundation models and power generative AI capabilities
  • Amazon Neptune for the knowledge graph database, supporting both symbolic querying and vector-based similarity search using Neptune Analytics
  • AWS Lambda for serverless data transformation and harmonization routines
  • Amazon S3 for scalable storage of climate datasets
  • Amazon Textract to extract and process data from various document formats
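
To make the phrase "symbolic querying and vector-based similarity search" concrete, the sketch below combines the two on a toy in-memory graph. It does not use the Neptune API; the node names, the `domain` property, and the three-dimensional embeddings are invented for illustration only.

```python
import math
from typing import Dict, List, Tuple

# Toy nodes: each dataset has a symbolic property and a made-up embedding.
NODES: Dict[str, dict] = {
    "sst_monthly": {"domain": "ocean", "embedding": [0.9, 0.1, 0.0]},
    "air_temp_daily": {"domain": "atmosphere", "embedding": [0.8, 0.2, 0.1]},
    "precip_daily": {"domain": "atmosphere", "embedding": [0.1, 0.9, 0.2]},
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(
    domain: str, query_embedding: List[float], top_k: int = 1
) -> List[Tuple[str, float]]:
    """Filter symbolically by domain, then rank the survivors by similarity."""
    candidates = [
        (name, cosine(props["embedding"], query_embedding))
        for name, props in NODES.items()
        if props["domain"] == domain
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]


hits = search("atmosphere", [1.0, 0.0, 0.0])
print(hits[0][0])
```

In this toy data, `sst_monthly` is the closest vector to the query but is excluded by the symbolic filter, so the search returns `air_temp_daily`. That interplay between exact graph constraints and fuzzy semantic ranking is what the combined approach provides.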

“To ensure scalability, adaptability, and continuous integration of new climate datasets, the AutoClimDS system follows a modular design,” the research noted. “Data are periodically ingested from external sources, transformed into standardized formats, and harmonized through automated routines. This allows the knowledge graph to be continuously updated as new data becomes available.”
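
A harmonization routine of the kind described in the quote might look like the following sketch. The field names `temp_c` and `temp_f`, the sample records, and the choice of `tas` in kelvin as the standardized target are assumptions made for this example; the post does not describe the actual schemas or transformations AutoClimDS uses.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mapping: source field -> (standard name, converter to kelvin).
FIELD_MAP: Dict[str, Tuple[str, Callable[[float], float]]] = {
    "temp_c": ("tas", lambda v: v + 273.15),
    "temp_f": ("tas", lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15),
    "tas": ("tas", lambda v: v),
}


def harmonize(record: dict) -> dict:
    """Rename known fields to the standard name and convert values to kelvin."""
    out = {"time": record["time"]}
    for key, value in record.items():
        if key in FIELD_MAP:
            standard_name, convert = FIELD_MAP[key]
            out[standard_name] = round(convert(value), 2)
            out["units"] = "K"
    return out


# Three made-up records that describe the same temperature in different ways.
raw: List[dict] = [
    {"time": "2020-01", "temp_c": 15.0},
    {"time": "2020-01", "temp_f": 59.0},
    {"time": "2020-01", "tas": 288.15},
]
harmonized = [harmonize(r) for r in raw]
print(harmonized)
```

After harmonization, all three records carry the same field name, unit, and value, so downstream agents can treat them as one dataset. A function of this shape is also small enough to run as a serverless handler on each ingested batch.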

A critical innovation in the system is a fine-tuned transformer classifier built on ClimateBERT that achieves 99.17% semantic accuracy in linking observational metadata to standardized Earth System Model variables. The system can bridge the gap between raw observational data and the structured format required for climate modeling.
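
The linking task itself, mapping free-text observational metadata to a standardized variable, can be illustrated with a much simpler baseline. The sketch below uses token-overlap (Jaccard) matching as a stand-in; it is not the fine-tuned ClimateBERT classifier and makes no claim about its accuracy. The three-entry variable vocabulary and its descriptions are illustrative assumptions.

```python
from typing import Dict, Set

# Illustrative target vocabulary: variable short name -> description.
STANDARD_VARIABLES: Dict[str, str] = {
    "tas": "near surface air temperature",
    "pr": "precipitation flux",
    "psl": "sea level pressure",
}


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it on spaces, hyphens, and underscores."""
    return set(text.lower().replace("-", " ").replace("_", " ").split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_variable(metadata_text: str) -> str:
    """Return the standard variable whose description best overlaps the text."""
    query = tokens(metadata_text)
    return max(
        STANDARD_VARIABLES,
        key=lambda name: jaccard(query, tokens(STANDARD_VARIABLES[name])),
    )


print(link_variable("Daily mean air temperature near the surface"))
print(link_variable("Mean sea-level pressure"))
```

A word-overlap baseline like this breaks down as soon as the metadata uses synonyms or abbreviations that share no tokens with the target description, which is the gap a learned semantic classifier is meant to close.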

Lessons learned from building a research-grade AI system

The AutoClimDS project demonstrates several key principles for building AI systems that serve the scientific community:

  • Start with the knowledge graph: Rather than jumping directly to AI agents, LEAP first invested in creating a well-structured knowledge graph that provides the semantic foundation for agent reasoning. This infrastructure-first approach prioritizes AI responses grounded in curated, high-quality data.
  • Design for reproducibility: AutoClimDS was designed to reproduce published climate research workflows, demonstrating a new paradigm of AI-driven scientific reproducibility. By validating the system against known results, the team could confirm reliability before deploying it for novel research.
  • Embrace modularity: The multi-agent architecture allows different components to be updated independently. As new data sources emerge or better AI models become available, individual agents can be upgraded without restructuring the entire system.
  • Build for the cloud: By using AWS cloud services, AutoClimDS provides elasticity and facilitates accessibility across domains. Researchers can access computational resources on demand without maintaining expensive local infrastructure.

“The open-source design of our system further supports community contributions, ensuring that the knowledge graph and associated tools can evolve as a shared commons,” said the research team.

Looking ahead: AI-human collaboration in climate research

AutoClimDS represents a proof of concept for a new model of scientific research where AI agents serve as collaborators that handle time-consuming data-wrangling tasks so researchers can focus on hypothesis generation and interpretation.

The system is designed to complement, not replace, human expertise. Because AutoClimDS drastically lowers the technical threshold for engaging in climate data science, non-specialist users can use it to explore climate datasets, and experienced researchers can work more efficiently.

“Our results illustrate a pathway toward democratizing access to climate data and establishing a reproducible, extensible framework for human-AI collaboration in scientific research,” the research team wrote.

As climate change impacts intensify, the need for rapid, reproducible climate research is becoming more urgent. By making climate data science more accessible, systems such as AutoClimDS can accelerate the pace of discovery and enable a more diverse community of researchers to contribute to our understanding of Earth’s changing climate.

To learn how AWS helps research institutions build and launch transformative cloud solutions to advance scientific discovery, visit AWS for Research or contact us today to get started.


The AutoClimDS project is supported by the NSF Science and Technology Center Learning the Earth with Artificial Intelligence and Physics (LEAP) (Award #2019625).

Tian Zheng, PhD

Tian is a professor of statistics at Columbia University and deputy director of LEAP. Her work applies statistical thinking to the rigorous and responsible development of AI systems. She envisioned and led the LEAP–AWS collaboration on an agentic AI system for climate data science.

Ahmed Jaber

Ahmed is an undergraduate researcher in computer science at Columbia University with a focus on natural language processing, machine learning, and agentic AI systems. He played a leading role in the LEAP–AWS collaboration on an agentic AI system for climate data science, contributing to the design and implementation of large-scale reasoning pipelines that integrate multimodal climate datasets. His work applies principled machine learning and statistical reasoning to the development of AI systems that are robust, interpretable, and suitable for high-stakes scientific domains.

Justin Downes

Justin is a senior applied scientist at AWS developing generative AI solutions for public sector customers and managing a team in the Generative AI Innovation Center. He is a Ph.D. candidate in computational social science at George Mason University, researching AI-automated social simulation creation at the intersection of computer vision, NLP, and social science theory.

Karthick Jayavelu

Karthick is a senior deep learning architect at AWS. He provides AWS customers with data and AI technical support, designs solutions for their AI and ML projects, and helps them modernize their workloads on AWS.

Sameer Mohamed

Sameer is a deep learning architect at the AWS Generative AI Innovation Center, where he helps public sector customers build state-of-the-art generative AI solutions. Leveraging his background in machine learning and computer science, Sameer develops innovative generative AI systems that address a wide range of customer challenges.

Taylor McNally

Taylor is a senior generative AI strategist in the AWS Generative AI Innovation Center. He helps customers prioritize, build, and deploy effective AI solutions on AWS. He enjoys a good cup of coffee, the outdoors, and time with his family and energetic dog.