AWS Partner Network (APN) Blog
Discover and Protect Sensitive Data with HCLTech’s DataPatrol Framework Built with Machine Learning on AWS
By Chinmaya Ranjan Mohanty and Partha Sarathi Das, Solution Architects – HCLTech
By Subramanian Thiyagarajan, Technical Architect – HCLTech
By Sandeep Roy, Practice Director – HCLTech
By Jerry Li and Deepak Chandrasekaran – AWS
Most organizations tend to collect large volumes of sensitive data that include personally identifiable information (PII), personal health information (PHI), and payment card industry (PCI) data to provide relevant and customized services.
It’s critical to identify and protect the sensitive data collected from any unauthorized disclosure, and it’s the responsibility of every organization to effectively discover, control, and manage their sensitive data footprints and comply with relevant data protection laws.
The healthcare and life sciences industry, for example, deals with tremendous volumes of sensitive data containing clinical records, patient data, and other PHI, all of which falls under the purview of country-specific laws protecting sensitive data, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States.
The financial services industry, meanwhile, deals with PCI data and other sensitive data that needs to be encrypted at all times. Hence, it becomes important to apply automation and the latest technologies to detect sensitive data at the point of ingestion/integration and take necessary actions to avoid data leaks.
Organizations often have challenges in automatically detecting the growing list of sensitive data types and lack visibility into data security risks, especially when ingesting unstructured data. Customers rely on fully managed data security services that automate protection against sensitive data leaks and leverage the capabilities of machine learning (ML) and pattern matching techniques to swiftly address these limitations.
In this post, you’ll learn about HCLTech’s DataPatrol Framework, which accomplishes critical tasks in the lifecycle of sensitive documents and improves sensitive data discovery and governance across your Amazon Web Services (AWS) environment.
HCLTech is an AWS Premier Tier Services Partner and AWS Marketplace Seller with AWS Competencies in Migration, DevOps, SAP, Storage, and Mainframe Modernization Consulting. HCLTech is also a member of the AWS Managed Service Provider (MSP) and Well-Architected Partner programs.
Solution Overview
DataPatrol helps reduce risk by isolating, encrypting/masking, and notifying responsible teams about any sensitive data coming into the system from third parties. DataPatrol helps customers identify sensitive data at the point of ingestion and take immediate action for protecting against sensitive data leakage.
It also helps customers economically comply with data security requirements. Any industry can leverage DataPatrol to ensure sensitive data is identified and corrective actions are taken to keep it secured and protected.
This framework uses machine learning and pattern-matching techniques for a growing list of sensitive data types, and allows customers to add custom-defined data types using regular expressions for proprietary sensitive data discovery relevant to their business.
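To illustrate the custom data type idea, the sketch below defines a hypothetical proprietary identifier (an employee ID of the form `EMP-` followed by six digits) as a regular expression and counts its occurrences in a piece of text. The type name, the pattern, and the `count_matches` helper are assumptions made for this example; they are not part of DataPatrol or Amazon Macie.

```python
import re

# Hypothetical custom data type: employee IDs such as "EMP-123456".
# The name and pattern are illustrative assumptions, not DataPatrol defaults.
CUSTOM_TYPES = {
    "EMPLOYEE_ID": re.compile(r"\bEMP-\d{6}\b"),
}


def count_matches(text: str) -> dict:
    """Return the number of occurrences of each custom data type in text."""
    return {name: len(pattern.findall(text)) for name, pattern in CUSTOM_TYPES.items()}


# "EMP-12" is too short to match, so only two IDs are counted.
sample = "Ticket raised by EMP-004217 and approved by EMP-118820; ref EMP-12."
print(count_matches(sample))
```

Word boundaries (`\b`) keep the pattern from matching inside longer tokens, which helps limit false positives when a proprietary format resembles ordinary text.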
DataPatrol entirely automates the discovery of sensitive data, isolates the sensitive data based on the detected severity (high/medium/low), and provides a complete suite of pre-built analytics on the findings for further review and action.
Severity levels are configurable and can be defined based on industry/sector specifications and implementation.
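One way to picture a configurable severity definition is a small threshold table that can be swapped per industry or implementation. The sketch below is only an assumption about how such a profile could look: the threshold values and the `classify` helper are hypothetical and do not describe DataPatrol's internal configuration.

```python
# Hypothetical severity profile: minimum number of detections in a file
# required for each level, listed from most to least severe.
# The threshold values are illustrative assumptions.
SEVERITY_THRESHOLDS = [
    ("high", 50),
    ("medium", 10),
    ("low", 1),
]


def classify(detection_count, thresholds=SEVERITY_THRESHOLDS):
    """Return the first level whose threshold is met, or None if nothing was detected."""
    for level, minimum in thresholds:
        if detection_count >= minimum:
            return level
    return None


print(classify(3))    # meets only the "low" threshold
print(classify(75))   # meets the "high" threshold

# A stricter, sector-specific profile can be passed in without code changes.
STRICT_PROFILE = [("high", 1)]
print(classify(3, STRICT_PROFILE))
```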
All identified highly sensitive data (PII, PHI, PCI) is automatically isolated to a location in Amazon Simple Storage Service (Amazon S3) that is encrypted with the 256-bit Advanced Encryption Standard (AES-256), and custom email alerts deliver relevant details to authorized users.
HCLTech’s DataPatrol solution is also integrated with ServiceNow for automated incident creation and workflow management to handle high-severity findings. HCLTech’s iONA (iAct) solution provides seamless integration with the ServiceNow tool, which is leveraged by DataPatrol for incident reporting and management.
Key Features
The following list highlights key features, as well as supporting AWS services, leveraged in building DataPatrol:
Sensitive Data Discovery
- Significance: Ability to swiftly identify and discover sensitive data, including PII, PHI, and PCI.
- Services used: Amazon Macie
- Rationale: Amazon Macie is a fully managed, ML-driven data security service. It supports pattern-matching techniques and makes it easy to create custom data types that can be constantly added and updated.
- Key features: Fully managed, updated ML techniques for PII detection and the ability to define and use custom data types using regular expressions have proven to deliver quality discovery of a variety of sensitive data types from customers’ source data.
Secure Data Isolation and Encryption
- Significance: Provision to move sensitive files from the source to a secure target location and prevent sensitive data leaks.
- Services used: AWS Lambda, Amazon S3
- Rationale: AWS Lambda is an efficient, cost-effective serverless compute service that supports multiple programming languages; Amazon S3 is a highly reliable, secure, low-cost cloud data storage option.
- Key features: Assists in effective isolation of sensitive data files right at the ingestion layer and prevents further leakage to downstream systems.
Severity-Based Email Alerts
- Significance: Ability to notify users with custom email alerts in the event of sensitive data breaches.
- Services used: Amazon EventBridge, Amazon SNS
- Rationale: Amazon EventBridge provides serverless and seamless connection between multiple AWS services; Amazon Simple Notification Service (SNS) provides a flexible approach to publish messages to subscribers. Both services are cost-effective and easy to configure, manage, and scale.
- Key features: Based on the EventBridge events, this workflow automatically triggers Amazon SNS to send subscribers custom email notifications containing critical details on the sensitive data file location along with severity level warnings (high/medium/low).
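The event pattern below sketches how an EventBridge rule might select Macie findings by severity. The `source`, `detail-type`, and `detail.severity.description` fields follow the event shape Macie publishes to EventBridge, to the best of our knowledge; verify them against the current Macie documentation before relying on them. The `matches` helper is a simplified local stand-in for EventBridge's matching that handles only exact-value lists, and the sample event is made up.

```python
import json

# Event pattern for a rule selecting high-severity Macie findings.
HIGH_SEVERITY_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {"severity": {"description": ["High"]}},
}


def matches(pattern, event):
    """Simplified local check: each pattern leaf lists the allowed exact values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the corresponding event object.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Trimmed, made-up event for local experimentation only.
sample_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {"severity": {"score": 3, "description": "High"}},
}

print(json.dumps(HIGH_SEVERITY_PATTERN))
print(matches(HIGH_SEVERITY_PATTERN, sample_event))
```

Separate rules for medium and low severity would differ only in the `description` value, which keeps routing logic in configuration rather than in code.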
Audit and Compliance Reports
- Significance: Equip businesses with a consolidated view of DataPatrol reports in an easily readable format (CSV) to review the sensitive data findings for audit and compliance requirements.
- Services used: AWS Lambda, Amazon S3
- Rationale: AWS Lambda is an efficient, cost-effective serverless compute service that supports multiple programming languages; Amazon S3 is a highly reliable, secure, low-cost cloud data storage option.
- Key features: A consolidated DataPatrol report for each patrolling job will be auto-downloaded to a customer-specified location for quick reviews and actions on the findings.
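As a rough sketch of the consolidation step, the snippet below flattens a list of findings into CSV text using only the Python standard library. The column names are assumptions modeled on the fields this post lists for the detailed findings report, the sample records are made up, and `findings_to_csv` is a hypothetical helper rather than DataPatrol's actual report generator.

```python
import csv
import io

# Assumed report columns; DataPatrol's real report layout may differ.
FIELDS = ["account_id", "bucket_name", "file_key", "finding_type", "severity", "pii_total_count"]


def findings_to_csv(findings):
    """Serialize a list of finding dicts to CSV text, ignoring unknown keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(findings)
    return buffer.getvalue()


# Made-up findings for illustration.
sample_findings = [
    {"account_id": "111122223333", "bucket_name": "raw-input", "file_key": "in/a.csv",
     "finding_type": "SensitiveData:S3Object/Personal", "severity": "High", "pii_total_count": 42},
    {"account_id": "111122223333", "bucket_name": "raw-input", "file_key": "in/b.txt",
     "finding_type": "SensitiveData:S3Object/Financial", "severity": "Medium", "pii_total_count": 3},
]

print(findings_to_csv(sample_findings))
```

In a Lambda function, the returned text could then be written to the customer-specified S3 location; that upload step is omitted here.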
Centralized Management of Sensitive Data Findings
- Significance: Collate, monitor, and process DataPatrol’s sensitive data findings into a centralized hub, providing a comprehensive view of security state and high-priority security issues.
- Services used: AWS Security Hub
- Rationale: Easy integration with AWS Security Hub to automatically publish sensitive data findings for broader analysis and centralized findings management.
- Key features: Integration with AWS Security Hub provides a comprehensive strategy to aggregate and analyze all sensitive data findings within a single window and store as a standard AWS Security Finding Format (ASFF) for further processing.
Incident Reporting and Management
- Significance: Provisioning seamless integration with ServiceNow to automatically raise an incident for every high-severity alert.
- Services used: AWS Lambda, HCLTech’s iONA (iAct) solution, ServiceNow
- Rationale: HCLTech’s in-house iONA (iAct) solution offers seamless integration with ServiceNow; AWS Lambda is an efficient, cost-effective serverless compute service that supports multiple programming languages.
- Key features: DataPatrol is fully integrated with HCLTech’s iONA (iAct) solution to auto-create incidents in ServiceNow for high-severity detection and assigns them to the appropriate user group for further review and action.
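The iONA (iAct) interface is not described in this post, so the sketch below only illustrates the kind of incident record that could be prepared for a high-severity finding. The field names (`short_description`, `description`, `urgency`, `impact`, `assignment_group`) come from ServiceNow's standard incident table; the `build_incident` helper, the default group name, and the sample finding are hypothetical.

```python
import json


def build_incident(finding, assignment_group="Data Security"):
    """Build an incident record for a high-severity finding, or None otherwise."""
    if finding.get("severity") != "High":
        return None
    location = f"s3://{finding['bucket_name']}/{finding['file_key']}"
    return {
        "short_description": f"High-severity sensitive data detected in {location}",
        "description": f"Finding type: {finding['finding_type']}",
        "urgency": "1",  # 1 is the highest urgency in ServiceNow's default scale
        "impact": "1",   # 1 is the highest impact in ServiceNow's default scale
        "assignment_group": assignment_group,  # hypothetical group name
    }


# Made-up finding for illustration.
sample_finding = {
    "bucket_name": "raw-input",
    "file_key": "in/a.csv",
    "finding_type": "SensitiveData:S3Object/Personal",
    "severity": "High",
}

print(json.dumps(build_incident(sample_finding), indent=2))
```

Returning `None` for lower severities mirrors the behavior described above, where only high-severity detections open incidents.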
DataPatrol Dashboard
- Significance: To empower business users with rich preconfigured ML-driven dashboards to review key insights on sensitive data for all the processed DataPatrol jobs.
- Services used: Amazon QuickSight
- Rationale: Amazon QuickSight is a fully managed, ML-powered business intelligence (BI) service with self-service capabilities to launch rich interactive visuals and get ML-driven insights, useful auto-narratives in natural language, and KPI alerts provisioning.
- Key features: Delivers prebuilt ML-driven insights with auto-narratives that are embedded contextually in a dashboard using natural language for quick interpretation. Along with rich visuals and interactive dashboarding features, it includes details on the discovered sensitive data types, categories, jobs, total GBs classified metrics, source file processed details, and more useful insights.
Understanding the Process Flow Design
In this section, we will provide a simple walkthrough of the DataPatrol process flow diagram.
Figure 1 – DataPatrol flow diagram.
The data patrolling job gets auto-triggered after the arrival of source files in the designated input S3 bucket. Post-completion of the sensitive data discovery job, results are stored internally in the DataPatrol results repository, where key findings are extracted programmatically into a consolidated CSV report and stored in a customer-specified S3 bucket for audit and compliance requirements.
For sensitive detections, files are isolated from the source and moved to an isolated bucket with an auto-trigger process to open high-severity incidents in ServiceNow. Then, the incidents are assigned to a specified workgroup for further action.
In parallel, another workflow is auto-triggered to notify authorized users with custom email alerts, providing sensitive file details along with the detected severity information.
Figure 2 – Sample email alerts sent by DataPatrol.
Detailed Solution Architecture
The key tenet for HCLTech’s DataPatrol framework is end-to-end automation across all components of the architecture, from scanning the sensitive data at the point of ingestion to dashboarding key insights for business consumption.
This framework can seamlessly detect several identified and custom sensitive data types catering to any industry, supporting PII protection and strictly adhering to data privacy, compliance, and regulatory needs such as GDPR, PCI-DSS, and HIPAA.
It can support unstructured data and can be plugged into any layer that requires sensitive data discovery for the underlying raw source data.
Figure 3 – DataPatrol solution architecture.
HCLTech’s DataPatrol architecture leverages native AWS services for sensitive data discovery and analytics. Once source data is loaded into the raw Amazon S3 bucket, it is encrypted using SSE-S3, and the load automatically invokes Amazon Macie. Upon completion, Macie stores the sensitive data findings in an S3 results repository encrypted with AWS Key Management Service (AWS KMS).
Amazon EventBridge recognizes the high/medium/low-severity events and leverages Amazon SNS to push custom email alerts, providing subscribed users with processed sensitive data file details along with severity levels.
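A minimal sketch of how such an alert could be composed is shown below. The `build_alert` helper, the subject wording, the topic ARN, and the sample finding are all assumptions for illustration; the returned keys (`TopicArn`, `Subject`, `Message`) match the parameters of the Amazon SNS publish operation.

```python
def build_alert(topic_arn, finding):
    """Return keyword arguments for an SNS publish call describing one finding."""
    severity = finding["severity"]
    location = f"s3://{finding['bucket_name']}/{finding['file_key']}"
    subject = f"[DataPatrol] {severity.upper()} severity: sensitive data detected"
    message = "\n".join([
        f"Severity: {severity}",
        f"File: {location}",
        f"Finding type: {finding['finding_type']}",
    ])
    # SNS email subjects must be shorter than 100 characters, so truncate defensively.
    return {"TopicArn": topic_arn, "Subject": subject[:99], "Message": message}


# Hypothetical topic ARN and made-up finding.
alert = build_alert(
    "arn:aws:sns:us-east-1:111122223333:datapatrol-alerts",
    {"bucket_name": "raw-input", "file_key": "in/a.csv",
     "finding_type": "SensitiveData:S3Object/Personal", "severity": "High"},
)
print(alert["Subject"])
print(alert["Message"])
```

With the AWS SDK for Python installed and credentials configured, the result could be sent with `boto3.client("sns").publish(**alert)`.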
For high-severity detections, multiple workflows trigger automatically:
- Isolation of the source file from the raw S3 bucket to a secure target S3 bucket encrypted with AES-256 using SSE-KMS. The file is encrypted with a customer managed key that the custodian can access, which prevents anyone without access to that KMS key from decrypting and reading the sensitive data.
- Pushing SNS email notifications that contain the sensitive file details to authorized users.
- Auto-creation of high-severity incidents in ServiceNow using HCLTech’s iONA (iAct) solution for further review and action.
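The isolation workflow in the first bullet can be sketched as a copy followed by a delete. The example below assumes an S3 client with the `copy_object` and `delete_object` operations of the AWS SDK for Python (boto3); the bucket names, key alias, and `isolate_object` helper are hypothetical, and a small recording stand-in replaces the real client so the sketch runs without AWS access.

```python
def isolate_object(s3, source_bucket, key, target_bucket, kms_key_id):
    """Copy an object to the isolated bucket with SSE-KMS, then delete the source."""
    s3.copy_object(
        Bucket=target_bucket,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
        ServerSideEncryption="aws:kms",  # request SSE-KMS on the copy
        SSEKMSKeyId=kms_key_id,          # customer managed key (assumed ID or alias)
    )
    # Delete the source only after the copy call returned without raising.
    s3.delete_object(Bucket=source_bucket, Key=key)
    return f"s3://{target_bucket}/{key}"


class RecordingClient:
    """Stand-in for a boto3 S3 client that records calls instead of making them."""

    def __init__(self):
        self.calls = []

    def copy_object(self, **kwargs):
        self.calls.append(("copy_object", kwargs))

    def delete_object(self, **kwargs):
        self.calls.append(("delete_object", kwargs))


client = RecordingClient()
print(isolate_object(client, "raw-input", "in/a.csv", "isolated-files", "alias/datapatrol"))
print([name for name, _ in client.calls])
```

In a real deployment, `boto3.client("s3")` would be passed in place of `RecordingClient()`. Note that a single `copy_object` call is limited to objects of up to 5 GB; larger files would need a multipart copy.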
Custom tables and views are created in Amazon Athena, which make it easier to analyze data stored in S3 immediately.
Amazon QuickSight is leveraged to deliver rich and interactive visuals and ML-driven insights, providing a deeper understanding of the discovered sensitive data types, total GBs of classified metrics, data categories, and source file details with more interesting insights.
AWS Security Hub enables broader analysis of an organization’s overall security posture across AWS accounts and provides seamless integration support for dozens of AWS and AWS Partner products from a single place.
DataPatrol leverages this built-in integration support with AWS Security Hub, where all security findings are automatically normalized before they are ingested, providing a consistent way to view findings identified by resources, severities, and timestamps for quick search and action.
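To give a sense of what a normalized finding looks like, the sketch below builds a minimal record with the required top-level fields of the AWS Security Finding Format (ASFF). Because the built-in integration publishes findings automatically, this code is not something DataPatrol needs to run; it is purely illustrative. The `to_asff` helper, the `GeneratorId` value, and the sample finding are assumptions.

```python
from datetime import datetime, timezone


def to_asff(finding, region="us-east-1"):
    """Map a simplified finding dict to a minimal ASFF-shaped record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    account = finding["account_id"]
    object_arn = f"arn:aws:s3:::{finding['bucket_name']}/{finding['file_key']}"
    return {
        "SchemaVersion": "2018-10-08",
        "Id": finding["finding_id"],
        "ProductArn": f"arn:aws:securityhub:{region}:{account}:product/{account}/default",
        "GeneratorId": "datapatrol-sketch",  # hypothetical generator name
        "AwsAccountId": account,
        "Types": ["Sensitive Data Identifications/PII"],
        "CreatedAt": now,
        "UpdatedAt": now,
        "Severity": {"Label": finding["severity"].upper()},
        "Title": f"Sensitive data detected in {finding['file_key']}",
        "Description": f"Finding type: {finding['finding_type']}",
        "Resources": [{"Type": "AwsS3Object", "Id": object_arn}],
    }


# Made-up finding for illustration.
record = to_asff({
    "account_id": "111122223333", "finding_id": "example-finding-id",
    "bucket_name": "raw-input", "file_key": "in/a.csv",
    "finding_type": "SensitiveData:S3Object/Personal", "severity": "High",
})
print(record["Severity"], record["Resources"])
```

Having every finding share the same resource, severity, and timestamp fields is what makes the consistent search and filtering described above possible.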
Dashboard Visualization Preview
With pre-built ML-driven dashboards, customers can gain quick insight into the type of sensitive data stored on AWS to swiftly identify and reduce data leakage risks and provide timely remediation and improved data security and governance.
DataPatrol uses AWS native services to make it easier to configure, customize, and deploy in your cloud environment with easy upgrades and updates while reducing overall total cost of ownership (TCO).
HCLTech’s DataPatrol dashboard provides rich visuals with key ML-driven insights:
- Total PII data count discovered for all the sensitive data jobs submitted.
- Total count of PII detections identified by its data type.
- Total bytes and GBs classified metrics for each submitted job.
- Auto-narratives embedded with key insights in natural language for quick inferences.
- Total PII detections at project-level detail.
- Detailed findings report that captures key information such as account ID, bucket name, job run date, file details, finding ID, finding type, severity, PII total count, and summary description for both identified and custom data types.
Figure 4 – Screenshot of a DataPatrol dashboard.
Conclusion
In this post, we discussed how HCLTech’s DataPatrol framework can be leveraged for:
- Automating sensitive data discovery in your AWS environment.
- Facilitating seamless inclusion of custom data types for sensitive data, proprietary to the customer.
- Isolating sensitive files to a secure target location to prevent sensitive data leakage.
If sensitive data breaches occur, responsible teams will be notified with email alerts classified based on the severity level. Prebuilt machine learning-powered dashboards will provide comprehensive and useful insights on findings from DataPatrol presented in rich interactive visuals.
For more information or to schedule a demo session, reach out to HCLTech at DNA_DATA_BI_FABRIC@hcl.com. You can also learn more about HCLTech on AWS Marketplace.
HCLTech – AWS Partner Spotlight
HCLTech is an AWS Premier Tier Services Partner and MSP that serves hundreds of global enterprises to solve day-to-day and complex challenges with a dedicated full-stack business unit.