Overview
The PDF Obfuscation Pipeline helps healthcare providers, researchers, and data scientists safely access and use sensitive clinical data without compromising patient privacy by masking protected health information (PHI) in PDFs. Whether you are training machine learning models, building decision-support tools, or enabling research collaborations, the pipeline keeps your documents privacy-compliant, legible, and faithful to their original structure.
Masked entities include HOSPITAL, NAME, PATIENT, ID, MEDICALRECORD, IDNUM, COUNTRY, LOCATION, STREET, STATE, ZIP, CONTACT, PHONE, DATE. The output is a PDF document that resembles the input, with fake surrogate text rendered on top of the targeted entities.
IMPORTANT USAGE INFORMATION:
After subscribing to this product and creating a SageMaker endpoint, billing occurs on an HOURLY BASIS for as long as the endpoint is running.
- Charges apply even if the endpoint is idle and not actively processing requests.
- To stop charges, you MUST DELETE the endpoint in your SageMaker console.
- Simply stopping requests will NOT stop billing.
Deleting the endpoint as soon as you are finished ensures you are billed only for the time you actually need the service.
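Deletion can also be scripted rather than done by hand in the console. The following is a minimal sketch, not vendor-provided code: it assumes a boto3 SageMaker client is passed in, and the endpoint name shown in the comment is hypothetical.

```python
def delete_endpoint(sagemaker_client, endpoint_name):
    """Delete a SageMaker endpoint so that hourly billing for it stops.

    `sagemaker_client` is expected to behave like boto3.client("sagemaker");
    it is injected so the function can be exercised without AWS access.
    """
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    return endpoint_name


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# "pdf-obfuscation-endpoint" is a hypothetical endpoint name):
#
#   import boto3
#   delete_endpoint(boto3.client("sagemaker"), "pdf-obfuscation-endpoint")
```

After running something like this, it is worth confirming in the SageMaker console that the endpoint no longer appears.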
Highlights
- **Entity-Level Obfuscation** The pipeline performs targeted obfuscation of sensitive entities such as NAME, PHONE, and more. These entities are replaced by realistic surrogates, preserving the document's usability while ensuring no original data leaks.
- **Customizable Entity Scope** You define what matters. The pipeline allows full customization of which entities to obfuscate or preserve, giving you control over your de-identification strategy.
- **Rendering-Aware Replacement** Replacements are layout-aware. A name such as John Smith is replaced with a similar-length surrogate such as Mike Burke, so the document remains visually consistent and readable.
- **Consistent Entity Replacement** If John Smith is replaced by Mike Tyson on page 1, every other instance of John Smith on every page is replaced the same way. This preserves referential integrity, which is critical for longitudinal or document-linked analysis.
- **Date Shifting Support** You can apply coherent temporal transformations, such as shifting all dates by 2 months, while preserving the internal temporal relationships between events.
- **Open Evaluation Dataset** We built and released a benchmark dataset to help evaluate document-level de-identification. It includes metrics and examples that showcase what this pipeline can achieve in realistic clinical settings.
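To make consistent replacement and date shifting concrete, here is a small illustrative sketch. It is not the pipeline's implementation: the `ConsistentReplacer` class, the `shift_date` helper, the surrogate names, and the 60-day offset (a stand-in for "about 2 months") are all assumptions chosen for the example.

```python
from datetime import date, timedelta


class ConsistentReplacer:
    """Assign one surrogate per original value and reuse it on every occurrence."""

    def __init__(self, surrogates):
        self._pool = list(surrogates)   # unused surrogate values
        self._mapping = {}              # original value -> assigned surrogate

    def replace(self, original):
        if original not in self._mapping:
            if not self._pool:
                raise ValueError("surrogate pool exhausted")
            self._mapping[original] = self._pool.pop(0)
        return self._mapping[original]


def shift_date(d, days):
    """Shift a date by a fixed number of days (same offset for every date)."""
    return d + timedelta(days=days)


# The same original name always maps to the same surrogate.
names = ConsistentReplacer(["Mike Burke", "Anna Reyes"])
page1 = names.replace("John Smith")
page7 = names.replace("John Smith")
other = names.replace("Jane Doe")

# A shared offset moves every date but keeps the gap between events intact.
admission = date(2024, 1, 10)
discharge = date(2024, 1, 17)
shifted_admission = shift_date(admission, 60)
shifted_discharge = shift_date(discharge, 60)
length_of_stay_preserved = (
    shifted_discharge - shifted_admission == discharge - admission
)
```

Because every date receives the same offset, intervals such as length of stay are unchanged, which is the property the date-shifting highlight describes.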
Pricing
| Dimension | Description | Cost/host/hour |
|---|---|---|
| ml.c5.9xlarge Inference (Batch) (Recommended) | Model inference on the ml.c5.9xlarge instance type, batch mode | $95.04 |
| ml.m4.4xlarge Inference (Real-Time) (Recommended) | Model inference on the ml.m4.4xlarge instance type, real-time mode | $95.04 |
| ml.c5.4xlarge Inference (Batch) | Model inference on the ml.c5.4xlarge instance type, batch mode | $95.04 |
| ml.c5.4xlarge Inference (Real-Time) | Model inference on the ml.c5.4xlarge instance type, real-time mode | $95.04 |
| ml.c5.9xlarge Inference (Real-Time) | Model inference on the ml.c5.9xlarge instance type, real-time mode | $95.04 |
| ml.m4.4xlarge Inference (Batch) | Model inference on the ml.m4.4xlarge instance type, batch mode | $95.04 |
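As a rough budgeting aid, the listed rate can be turned into a running-time estimate. This is a sketch under simple assumptions: it applies only the $95.04 per host per hour figure from the table, and the `estimate_cost` helper is a hypothetical name; any other charges on your AWS bill are not modeled.

```python
from decimal import Decimal

# Rate taken from the pricing table (identical for every listed dimension).
HOURLY_RATE_USD = Decimal("95.04")


def estimate_cost(hours, hosts=1):
    """Estimated charge in USD for `hosts` instances running for `hours` hours."""
    return HOURLY_RATE_USD * Decimal(str(hours)) * hosts


one_day = estimate_cost(24)        # one host left running for 24 hours
thirty_days = estimate_cost(720)   # one host left running for 30 days

print(f"24 hours: ${one_day}")
print(f"30 days:  ${thirty_days}")
```

An endpoint left idle for a full day therefore costs the same as one processing requests for a full day, which is why deleting unused endpoints matters.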
Vendor refund policy
No refunds are possible.
Delivery details
Amazon SageMaker model
An Amazon SageMaker model package is a pre-trained machine learning model ready to use without additional training. Use the model package to create a model on Amazon SageMaker for real-time inference or batch processing. Amazon SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
Version release notes
Spark-OCR==6.0.0 Spark-Healthcare==6.0.2 Spark-NLP==6.0.1
Additional details
Inputs
- Summary
Image files and PDF files (single or multiple) are supported.
- Input MIME type
- application/octet-stream
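A real-time request with this MIME type might look like the sketch below. It is an assumption-laden illustration, not the vendor's documented client: it relies on the standard boto3 `invoke_endpoint` call of the `sagemaker-runtime` client, the `obfuscate_pdf` helper and the endpoint name are hypothetical, and treating the response body as the obfuscated PDF bytes is inferred from the overview rather than confirmed here.

```python
def obfuscate_pdf(runtime_client, endpoint_name, pdf_bytes):
    """Send raw PDF bytes to the endpoint and return the raw response bytes.

    `runtime_client` is expected to behave like
    boto3.client("sagemaker-runtime"); it is injected so the function can be
    tried without AWS access.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",  # input MIME type listed above
        Body=pdf_bytes,
    )
    # The response "Body" is a stream; read it fully into memory.
    return response["Body"].read()


# Example usage (assumes boto3 is installed, credentials are configured, and
# an endpoint with the hypothetical name below is running):
#
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   with open("input.pdf", "rb") as f:
#       result = obfuscate_pdf(runtime, "pdf-obfuscation-endpoint", f.read())
#   with open("output.pdf", "wb") as f:
#       f.write(result)
```

Check the vendor's sample notebooks for the exact response format before relying on the output being a ready-to-save PDF.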
Support
Vendor support
For any assistance, please reach out to support@johnsnowlabs.com.
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.