Syntegra: Bridging the Healthcare Data Gap
Guest post by Ofer Mendelevitch, co-founder and CTO, Syntegra
Data is at the heart of everything we do in healthcare: driving treatment decisions, research and development of new drugs, medical breakthroughs for rare disease patients and so much more. Yet, the inability to easily access and share high-quality, patient-level data inhibits us from leveraging its full potential to advance medical innovation and improve patient care.
These long-standing challenges led to the creation of Syntegra in 2019, after my co-founder Dr. Michael Lesh witnessed firsthand the roadblocks researchers face in accessing critical patient data to inform their efforts. I connected with Michael as he searched for solutions to these challenges, and we realized the wide array of use cases for synthetic data across the healthcare ecosystem. Pairing his industry knowledge and patient-focused perspective with my deep technical expertise was a no-brainer. Our mission at Syntegra is to leverage novel, generative deep learning models never before used in the healthcare space to bridge the healthcare data gap, enabling improved research and accelerating how treatments reach patients.
Current Challenges with Healthcare Data
Despite the emergence of big data in healthcare, it remains nearly impossible to share useful, comprehensive, patient-level data sets externally due to strict privacy concerns. Even internally, it can be cumbersome to share data with other teams for analytic and educational purposes.
Amid the expanding focus in healthcare on real-world data (RWD) for innovation and throughout the drug development lifecycle, RWD sources require ever-increasing levels of privacy guarantees. However, current methods of de-identification to protect patient privacy are largely inadequate for two reasons. First, they are vulnerable to re-identification attacks, resulting in continued privacy concerns. Second, the de-identification process significantly reduces data quality through obfuscation of data, addition of noise and removal of small cohorts.
The Syntegra Solution
This clear market opportunity led to Syntegra becoming the first company to use the power of transformer-based language models like GPT-2 and GPT-3 in healthcare to create comprehensive, synthetic datasets that maintain full statistical fidelity and patient privacy. The result is a dramatic increase in availability and usage of healthcare data, enabling tremendous value creation across the healthcare continuum.
Our unique synthetic data engine trains on the underlying structured healthcare data and can then generate new synthetic patient records that contain no link to the original patients yet maintain all statistical properties of the original dataset. This approach allows Syntegra to generate patient-level healthcare data that can be used as a privacy-preserving replacement for real data in both simple and complex analytics, such as building predictive models or conducting survival analysis. In addition, this approach enables the generation of enhanced data that goes beyond the limitations of the real data. For example, the Syntegra engine does not remove small cohorts, which are often so important for research in rare disease. Furthermore, it enables bias normalization and imputation of missing values, ensuring the resulting synthetic dataset is fit for purpose and a significant leap in quality for real-world data.
Syntegra’s approach allows health systems, life sciences, payers and commercial data providers to seamlessly share privacy-guaranteed healthcare information internally and externally without the need for expensive and time-consuming compliance and contractual structures, secure sandboxes and complicated access protocols. Privacy is guaranteed in a way that goes beyond HIPAA and GDPR compliance. And with statistical fidelity preserved at the patient level, something unheard of in healthcare, Syntegra-generated synthetic data can be used immediately for statistical analysis, reporting and predictive models. But to build the necessary trust that Syntegra is accomplishing these goals, measurement and proof are needed. Syntegra has developed a set of metrics to assess both the statistical fidelity and the privacy of synthetic data to help demonstrate the performance of its synthetic data engine.
Leveraging AWS to Empower Growth
As an early-stage startup, we leverage a number of AWS technologies at Syntegra to help us be a better partner to our growing and diverse customer base.
Harnessing the computing power of Amazon EC2 allows us to effectively and efficiently scale our synthetic data engine with customer workloads. We use the AWS Python SDK, boto3, to dynamically create GPU-enabled P3 and P4 EC2 instances based on our custom Amazon Machine Image (AMI) and launch template. Our AMI has pre-configured access permissions, the latest CUDA drivers and Fabric Manager (for the new A100 GPU instances), and a pre-loaded version of our Docker container with all of the necessary Python packages installed, which minimizes the startup time for all jobs. Amazon’s streamlined authentication process for accessing EC2 makes it easier for us to work across different machines and in different settings, letting us work nimbly for a wide range of healthcare stakeholders and across a variety of use cases.
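To make the launch step a little more concrete, here is a minimal sketch of how a GPU instance could be started from a launch template with boto3. It is an illustration under stated assumptions, not our production code: the helper names, the launch template name and the `job-id` tag are hypothetical, and the EC2 client is passed in so the request-building logic can be exercised without AWS credentials.

```python
from typing import Any, Dict


def build_run_instances_request(
    launch_template_name: str,
    instance_type: str,
    job_id: str,
    use_spot: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for the EC2 `run_instances` call.

    The template name and the `job-id` tag key are illustrative assumptions.
    """
    request: Dict[str, Any] = {
        # The launch template carries the AMI, permissions and network setup.
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # Override the instance size per job, e.g. "p3.2xlarge" or "p4d.24xlarge".
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "job-id", "Value": job_id}],
            }
        ],
    }
    if use_spot:
        # Ask for a one-time spot instance that is terminated on interruption.
        request["InstanceMarketOptions"] = {
            "MarketType": "spot",
            "SpotOptions": {
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        }
    return request


def launch_training_instance(ec2_client: Any, **job: Any) -> str:
    """Start one instance and return its instance ID."""
    response = ec2_client.run_instances(**build_run_instances_request(**job))
    return response["Instances"][0]["InstanceId"]
```

With real credentials, the client would typically come from `boto3.client("ec2")`, and a call might look like `launch_training_instance(ec2, launch_template_name="gpu-training", instance_type="p3.2xlarge", job_id="job-42")`, where all three values are placeholders.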
For basic data storage, we use the cost-effective Amazon S3 to preserve all outputs from our EC2 machines even after the machine has been terminated. The Amazon Elastic Container Registry allows us to continue to update our container images without affecting the rest of our workflow, so that we can work across a range of datasets and organizations.
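As a rough illustration of the storage step, the sketch below copies everything in a job's output directory to S3 before the instance goes away. The function names, bucket and key prefix are hypothetical; the only AWS call used is the S3 client's `upload_file(Filename, Bucket, Key)`, and the client is passed in as a parameter.

```python
from pathlib import Path
from typing import Any, List, Tuple


def plan_uploads(output_dir: str, key_prefix: str) -> List[Tuple[Path, str]]:
    """Pair every file under output_dir with the S3 key it should be stored at."""
    root = Path(output_dir)
    prefix = key_prefix.rstrip("/")
    return [
        (path, f"{prefix}/{path.relative_to(root).as_posix()}")
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def upload_outputs(
    s3_client: Any, bucket: str, output_dir: str, key_prefix: str
) -> List[str]:
    """Upload all job outputs and return the list of S3 keys written."""
    keys = []
    for path, key in plan_uploads(output_dir, key_prefix):
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

In practice the client would usually be `boto3.client("s3")`, and a per-job prefix such as `jobs/<job-id>` (an assumption here) keeps the outputs of different runs separate.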
Powering our synthetic data engine through AWS allows us to work smarter and more efficiently while also being cost-effective as Syntegra continues to grow and scale. Spot instances let us keep our costs low, while the AWS API allows us to easily query for the real-time price of spot and non-spot instances. Syntegra has leveraged EC2 spot instances to reduce costs by up to 80% compared to our previous use of on-demand instances. Additionally, during the training of a large language model, Syntegra detects when a spot instance will be terminated by AWS and starts a new spot instance to resume exactly where the previous instance left off. This avoids the cost of cold restarts and saves many hours of instance uptime.
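The sketch below shows one way the spot-price check and the interruption handling described above could fit together. It is a simplified illustration rather than our actual pipeline: the function names are hypothetical, the on-demand price is assumed to be supplied by the caller, and training is reduced to a step counter with caller-provided callbacks. The AWS pieces it relies on are the EC2 `describe_spot_price_history` call and the instance metadata path `spot/instance-action`, which returns HTTP 404 until an interruption is scheduled.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Any, Callable, Optional

IMDS = "http://169.254.169.254/latest"  # EC2 instance metadata service


def latest_spot_price(ec2_client: Any, instance_type: str) -> float:
    """Lowest current Linux spot price (USD/hour) across availability zones."""
    response = ec2_client.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc),
    )
    # SpotPrice is returned as a string, so convert before comparing.
    return min(float(item["SpotPrice"]) for item in response["SpotPriceHistory"])


def savings_fraction(spot_price: float, on_demand_price: float) -> float:
    """Fraction saved by paying spot_price instead of on_demand_price."""
    return 1.0 - spot_price / on_demand_price


def fetch_instance_action(timeout: float = 2.0) -> Optional[dict]:
    """Return the pending spot interruption notice, or None if there is none.

    Only works when run on an EC2 instance (uses IMDSv2 session tokens).
    """
    token_request = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as resp:
        token = resp.read().decode()
    action_request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_request, timeout=timeout) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def train_with_checkpoints(
    total_steps: int,
    start_step: int,
    run_step: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
    interruption_pending: Callable[[], bool],
) -> int:
    """Run training steps from start_step; checkpoint and stop if interrupted.

    Returns the next step to run, which a replacement instance can pass back
    in as start_step to resume where this one left off.
    """
    step = start_step
    while step < total_steps:
        run_step(step)
        step += 1
        if interruption_pending():
            save_checkpoint(step)
            break
    return step
```

On a real instance, `interruption_pending` could be `lambda: fetch_instance_action() is not None`, and the returned step would be stored alongside the checkpoint so that the next spot instance knows where to pick up. Polling every few steps rather than after every step would be a reasonable refinement, since AWS gives roughly two minutes of notice before reclaiming a spot instance.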
Looking to the Future
AWS’ wide range of offerings not only enables us to leverage the services that fit our company’s needs right now, but also allows us to look to the future and think big. Our goal of democratizing healthcare data through the use of synthetic data promises to revolutionize the way we think about accessing and sharing healthcare data and, in turn, has the potential to change the lives of countless patients.
Curious to learn more about synthetic data generation in healthcare? Visit syntegra.io to learn more and connect with us at email@example.com.