Cambridge Crystallographic Data Centre Uses AWS and Intel to Accelerate Drug Development

Executive Summary

The Cambridge Crystallographic Data Centre (CCDC) works with AWS and Intel to produce a public-facing database of protein structures with intelligently derived and computationally optimized atomic positions for hydrogen atoms. The CCDC runs curation workflows on Amazon EC2 and uses Intel processors to refine its structural database.

Revolutionizing the Structural Chemistry Field

The Cambridge Crystallographic Data Centre (CCDC) is a nonprofit organization that collates and organizes the Cambridge Structural Database (CSD)—a catalog of over 1.1 million enhanced small molecule structures. The CSD serves as a vital resource for pharmaceutical development, providing insight into small molecule structural features and properties.

Understanding interactions in the CSD can inform drug discovery because similar types of interactions can occur in structures in the Protein Data Bank (PDB). And as new protein structures are discovered and added to the worldwide data bank, the CCDC uses advanced data-modelling techniques to predict the likelihood of hydrogen bonds and other less classical interactions occurring.


The Challenge in Making Statistical Datasets Publicly Available

The CCDC wants to enable drug discovery scientists to understand hydrogen bonding within protein-ligand complexes to help with drug design. But this is difficult when new protein structures are added to the Protein Data Bank every year. Attempting to automatically produce chemically and biologically sensible interpretations of structures starting with a PDB file—including where hundreds of hydrogen atoms are situated—is an academic and technical challenge.

As a pilot project, the CCDC worked with a large pharma company to identify non-optimal binding of small molecules to target protein sites. The team created complex workflows that counted pairs of interactions in specific environments and then extrapolated the frequency of those interactions to the company’s entire in-house database. Although this project was successful, it was CPU-time intensive. Jason Cole, senior research fellow of the CCDC, says, “While my team wanted to provide the scientific community with model-ready structures, we were struggling to do so without adequate capacity and computational power.”
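To make the idea of counting interaction pairs per environment concrete, here is a minimal Python sketch. It is not the CCDC's workflow: the function name, the environment labels, and the toy observations are all assumptions made for illustration. It compares how often a pair occurs in one environment with how often it occurs overall, which is one common way to express a relative frequency.

```python
from collections import Counter


def relative_frequencies(observations):
    """Return observed/expected ratios for (environment, pair) observations.

    Hypothetical helper for illustration only. `observations` is an iterable
    of (environment, interaction_pair) tuples.
    """
    observations = list(observations)
    total = len(observations)
    env_counts = Counter(env for env, _ in observations)
    pair_counts = Counter(pair for _, pair in observations)
    joint_counts = Counter(observations)

    ratios = {}
    for (env, pair), n in joint_counts.items():
        observed = n / env_counts[env]       # share of this pair within the environment
        expected = pair_counts[pair] / total  # share of this pair across all data
        ratios[(env, pair)] = observed / expected
    return ratios


# Toy data (invented): environment label plus a donor/acceptor pair.
toy = [
    ("buried", ("N-H", "O=C")),
    ("buried", ("N-H", "O=C")),
    ("buried", ("O-H", "O=C")),
    ("exposed", ("O-H", "O=C")),
]
rf = relative_frequencies(toy)
```

A ratio above 1 means the pair appears more often in that environment than in the data as a whole; a ratio below 1 means it appears less often.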

The team lacked two things: processors powerful enough to speed up the CCDC’s Python/C++ workflows, and the cloud storage needed to curate protein structures with reliable inferred hydrogen positions and convert them into statistical tables. Because each protein requires structural optimization, the curation time per structure is significant, and the CCDC had no practical way to curate the large volume of data in the Protein Data Bank.
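Since the cost is incurred per structure, work like this is often spread across many workers. The sketch below assumes each structure can be curated independently, which the source does not confirm. `curate_entry` is a hypothetical stand-in for the real optimization step, and the identifiers are placeholders rather than real PDB entries.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def curate_entry(pdb_id: str) -> dict:
    """Hypothetical stand-in for the per-structure curation step."""
    # A real step would add hydrogens and optimize their positions.
    return {"pdb_id": pdb_id, "status": "curated"}


def curate_all(pdb_ids, executor: Executor) -> list:
    """Fan the per-structure work out over any executor.

    Executor.map returns results in the same order as the inputs.
    """
    return list(executor.map(curate_entry, pdb_ids))


# A thread pool keeps this demo small. CPU-bound work would more likely use
# a process pool or separate compute instances.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = curate_all(["1ABC", "2DEF", "3GHI"], pool)
```

Passing the executor in as a parameter keeps the curation logic unchanged when the pool type or the number of workers changes.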

Lowering Computational Costs

The organization’s CEO was a former colleague of the director of precision medicine at Intel, who encouraged the CCDC to apply for Intel’s RISE program. When the CCDC compiled a list of projects that could benefit from Intel’s processing systems, it was clear that the curation workflow would have the biggest potential impact, and not just because of Intel’s CPU-related benefits.

Cole clarifies, “Intel’s super-fast chips would give us a leg up, but another reason we wanted to work with Intel was that our analysts could learn how to deploy large-scale computational projects most efficiently. I knew this technical expertise would help us continue pushing the field of structural chemistry forward.”

Upon accepting the CCDC’s grant application, Intel reached out to its partner contacts at AWS. Combining the power of Intel’s processing systems with the breadth of Amazon Elastic Compute Cloud (Amazon EC2) established an ideal environment for CCDC’s project.

Enhancing Complex Model Performance

The CCDC already recognized AWS as a leader in the cloud storage space but was especially excited by the proven experience of AWS in handling similarly large and complex databases. Amazon EC2 was particularly attractive because of its capacity to scale up CCDC’s storage needs as the project grew. Amazon EC2 could save IT and DevOps teams time with easy network configuration, virtual server launches, and security monitoring.

Beyond the technical benefits of Amazon EC2, the CCDC was impressed with Intel’s high praise of AWS and the AWS team’s ability to connect the organization with professional services providers. “Cloud-based resources have a significant barrier to entry for scientists,” Cole explains. “The AWS team has been brilliant at finding the right people to connect with to get our solution up and running quickly. We are so impressed with their technical know-how and willingness to help.”

Developing a Flexible Curation Workflow

With the help of Intel processors and Amazon EC2, Cole and his team had the freedom to dream up an ideal scientific workflow. “We were able to come up with a public dataset, but more critically, we leveraged the AWS and Intel partnership to perform Relative Frequency analyses and generate accurate binding site examples—a very useful resource for medicinal chemists,” says Cole.

AWS Professional Services, Intel’s processor team, and the CCDC’s internal DevOps team are working together to deploy the newly created protonation workflow on AWS and Intel architectures.

Decreasing the Time and Cost of Drug Development

Ultimately, the Intel-AWS pairing significantly decreased optimization time and made it possible to share the enhanced version of the Cambridge Structural Database with the public. Internally, the Intel-AWS combination helped the CCDC cut its processing time and compute costs in half.

But perhaps more importantly, the new workflow has forever altered the small molecule drug development process. “Every pharmaceutical company in the world has to create its own model-ready structures, a process that takes time away from other strategic activities it could be doing,” says Cole. “With the power of Intel and AWS, we’ve essentially eliminated that process, presenting researchers with predictions of protonation states in important protein structures, saving hundreds of thousands of hours of life sciences research time across the globe.”

The Intel and AWS-driven workflow also set the stage for continuous CSD improvement. For example, the CCDC can now update data tables in tandem with additions to the Protein Data Bank, which was once a cost-prohibitive exercise. And according to Cole, that’s not something that’s ever been done before. “To my knowledge, there is no single, curated, public-facing, modeling-ready dataset, and it would not be possible without the help of AWS and Intel.”
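Updating tables in tandem with new Protein Data Bank entries amounts to processing only the entries not yet seen and merging their counts into the existing table. The sketch below illustrates that idea under stated assumptions. The function names, the interaction label, and the identifiers are invented for the example and do not describe the CCDC's implementation.

```python
from collections import Counter


def update_tables(table, processed_ids, release_ids, count_interactions):
    """Merge counts for entries that have not been processed yet.

    Hypothetical helper: `count_interactions` maps an entry ID to a Counter
    of interaction counts. Returns the list of newly processed IDs.
    """
    new_ids = sorted(set(release_ids) - set(processed_ids))
    for pdb_id in new_ids:
        table.update(count_interactions(pdb_id))  # Counter.update adds counts
    return new_ids


def fake_counts(pdb_id):
    """Placeholder for the real per-structure analysis."""
    return Counter({"N-H...O": 2})


table = Counter({"N-H...O": 10})   # existing statistical table (toy values)
processed = {"1ABC", "2DEF"}       # placeholder IDs already curated
added = update_tables(table, processed, ["1ABC", "2DEF", "3GHI"], fake_counts)
processed.update(added)
```

Only the new entry is processed here, so the cost of an update scales with the number of additions rather than with the size of the whole data bank.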

In addition, CCDC analysts have become experts at deploying large-scale computational projects on Intel and AWS. With subject matter experts in-house, the CCDC can start and complete similar projects in far less time. Overall, the solution accomplished the CCDC’s goals and solidified the CCDC’s position as a world-leading expert in the field of structural chemistry.

Powering the Future of Small Molecule Development

While revolutionary in and of itself, this project only scratches the surface of what the CCDC can achieve with Amazon EC2 and Intel processors. The immediate next step is optimizing CSD structures quantum mechanically, adding another valuable dimension to small molecule drug research methodology. But additional enhancements to the CSD aren’t the only projects waiting in the wings.

The CCDC also wants to improve the efficiency of its algorithms and scoring models that predict protein-ligand docking. “Making this workflow broadly available would allow analysts to strengthen docking for specific targets, further expediting drug discovery and development,” says Cole. “And just like the protonation workflow, refining these models takes significant compute power, making it a perfect opportunity to take advantage of AWS and Intel solutions.”


About the CCDC

The Cambridge Crystallographic Data Centre (CCDC) is a nonprofit organization that collates and organizes the Cambridge Structural Database (CSD), a catalog of over 1.1 million enhanced small molecule structures.

AWS Services Used

  • Amazon EC2

Benefits

  • Cut processing time and compute costs in half
  • Developed a flexible curation workflow
  • Decreased the time and cost of drug development

About the AWS Partner Intel

Intel is the world’s leading designer and manufacturer of high-performance processors for servers, PCs, IoT devices, and mobile devices. AWS and Intel engineers have worked together for more than 10 years, building custom hardware to ensure AWS services run on a platform optimized for customer workloads for the best value. Intel Xeon Scalable processors power Amazon EC2 instances to help enterprises drive performance for their compute-intensive workloads.

Published March 2023