AWS Public Sector Blog
Weill Cornell Medicine digitizes historical medical archives with generative AI on AWS
In a university archive in New York City, handwritten medical records dating back to the Revolutionary War and belonging to the New York Presbyterian Hospital are slowly decaying. These rare patient files from one of America’s oldest hospitals hold centuries of insight into disease progression, medical practice, and the lived experiences of under-researched patients, including women and immigrants. Yet much of this knowledge remains trapped in fragile, fading documents.
Weill Cornell Medicine (WCM) knew these records were too valuable to lose. They turned to generative artificial intelligence (AI) on Amazon Web Services (AWS) to help digitize this archive.
In this blog post, you will learn how WCM used generative AI to build a secure, searchable interface for its historical medical archives and unlock new research opportunities while preserving the integrity of these one-of-a-kind materials.
Preserving medical history with modern technology
WCM is home to the Samuel J. Wood Library, which houses one of the largest collections of historical medical records in the United States. The archives include handwritten case files from the original New York Hospital, America’s second-oldest hospital, spanning the Revolutionary and Civil War eras. These records offer valuable insight into the evolution of American healthcare, from treatments and terminology to the social dynamics reflected in how care was delivered.
WCM wanted a way to digitize these records before they were lost to time and the elements. “The archives contain medical records from the Revolutionary War era and Civil War era at different levels of preservation, and they are falling apart,” explained Dr. Curtis Cole, internist, informaticist, and chief global information officer (CIO) for Cornell University. “We lose them to mold and heat, so we had a goal to digitize them to try to preserve the records.”
WCM had first attempted digitization more than a decade ago. Early efforts using optical character recognition (OCR) tools produced promising results but left significant gaps. The handwriting was difficult for the technology to parse, certain medical symbols were too outdated to pick up, and many terms were no longer in use.
The breakthrough came when project advisor Frank Naeymi-Rad tested a modern generative AI model. “We realized that AI was able to not only understand and translate the scanned images. It could also provide accurate accounts of the content,” Naeymi-Rad said. That experiment prompted the team to revisit the archive digitization project with a generative AI approach.
Working with AWS for advanced generative AI capabilities
WCM has a long-standing relationship with AWS and regularly uses AWS services to support research, academic computing, and innovation. In May 2024, that relationship expanded when WCM was introduced to the AWS Generative AI Innovation Center.
For this medical records digitization project, the AWS team—Noel Singh, Tyler Bursee, and Sandra El Ashry—provided advisory support, helping the WCM team evaluate different approaches, align on best practices, and identify the most effective architecture for their needs.
Building from there to develop a proof-of-concept, Dr. Cole and the WCM library team—Dr. Sarah Ben Maamar, Amanda Garfunkel, and Chiyong Han—worked with Leap of Faith Technologies, a company founded and chaired by Naeymi-Rad that partners with academic institutions to train students in healthcare informatics. Leap of Faith provided graduate students from Illinois Institute of Technology to help develop the solution to preserve this valuable repository of historical records with generative AI.
Developing an AI-powered digitization solution in just two months
The development team completed the digitization solution in just two months. After initial weeks exploring feasibility and evaluating different OCR options like Amazon Textract, the team tested two approaches: using OCR preprocessing followed by the Anthropic Claude Sonnet 3.5 model and sending images directly to Anthropic Claude Sonnet 3.5 through Amazon Bedrock.
The direct approach won. “We found it was more accurate to upload directly to Claude, more accurate than just doing OCR through Amazon Textract,” the development team explained.
The final solution provides a streamlined interface where users upload documents as TIFFs or PDFs. For TIFF images, documents go directly to the Anthropic Claude Sonnet 3.5 model. PDFs use Textract for preprocessing before reaching Anthropic Claude Sonnet 3.5.
Behind the scenes, the architecture connects multiple AWS services. Documents stored in Amazon Simple Storage Service (Amazon S3) trigger processing through AWS Lambda functions. Then, the extracted data from Anthropic Claude Sonnet 3.5 flows through Amazon Simple Queue Service (Amazon SQS) before landing in Amazon DynamoDB for research queries and retrieval.
Claude Sonnet 3.5 proved ideal due to its multimodal capabilities and image understanding to achieve 97.16% accuracy without model retraining. Instead, the team focused on prompt engineering to help Claude distinguish between patient cases and extract structured outputs. Sarah Ben Maamar, associate director for research services at WCM, helped manually transcribe sample documents to create a ground truth dataset for comparison.
Despite the complexity of the source material, the team overcame key technical challenges to deliver a streamlined, scalable system. To handle large TIFF files (often 30 to 40 MB each), they converted images to high-resolution PNGs, preserving quality while enabling direct processing by the model. To make the extracted data searchable and interoperable, they integrated the Intelligent Medical Objects (IMO) API, which maps outdated medical terminology to modern clinical codes like ICD-10 and SNOMED.
Delivering dramatic efficiency gains
Before developing the generative AI digitization solution, transcribing these records was slow and resource-intensive. A team of 20–25 IT staff spent an hour each week on manual transcription, completing just 30 documents in a month—and many of them only partially. Today, the AI-powered solution processes entire casebooks in minutes, with over 97% accuracy.
“This is a night-and-day difference,” said Dr. Ben Maamar. “It allows us to scale access to the archive in a way that simply wasn’t possible before.”
Once fully digitized, the archive will support new research into how diseases were described and treated before modern interventions, how diagnostic language evolved across populations, and how social attitudes shaped care. For example, many women in the archive were diagnosed with “hysteria”—a reflection of the gender norms and medical biases of the time.
“These aren’t just clinical documents,” said Frank Naeymi-Rad. “They reflect how society understood and delivered care.”
WCM is now scaling the generative AI digitization solution to process more than 60,000 additional records and map them to the OMOP (Observational Medical Outcomes Partnership) Common Data Model to support population health studies. Plus, this AI-powered approach is already influencing other Cornell departments, including the veterinary college and AI Innovation Lab, which are adapting the pipeline for use cases like clinical notes and expense processing.
A model for digital preservation in higher education
By applying generative AI to historical records, WCM is preserving a fragile part of American medical history and creating a model that other institutions can follow. “I would tell other institutions to try everything they can in the AI world because it’s evolving so fast,” advised Dr. Ben Maamar. “You’ll be surprised at what these tools can do. And testing their limits is the best way to understand their value.”
For WCM, what was once locked in handwritten files is now accessible and ready to inform new discoveries in healthcare, history, and beyond.
Learn more about how AWS helps institutions build, deploy, and scale automated AI solutions that address university needs. Contact AWS today.
