What is data integrity?
Data integrity is the practice of keeping data accurate, error-free, consistent, and fully usable throughout its lifecycle. Maintaining integrity within a single data store is relatively manageable, regardless of the number of access requests or the volume and velocity of the data. However, modern cloud environments require complex, continuous data movement between distributed data stores and services, and high-throughput Online Transaction Processing (OLTP) systems require strict data integrity checks to maintain system consistency. Data engineers must implement data integrity checks on new and existing data stores and processes, including integration, backups, and cloud migrations. This article explores challenges and solutions for data integrity management in the cloud.
Data integrity is the process of maintaining data accuracy, consistency, and completeness throughout its lifecycle. It is a key part of data quality assurance, which ensures an organization's data is relevant and reliable for transaction processing, business intelligence, and analytics. Data integrity encompasses various methods and protocols for validating data while safeguarding sensitive information from unauthorized access.
Why is data integrity important? It ensures that an organization's data remains trustworthy for recording financial and other business activity, as well as decision-making. Data integrity is essential, irrespective of the tools and roles that handle the data and its transformations.
Data integrity is critical in OLTP systems because it ensures accurate processing of business transactions and consistency in financial operations, and prevents issues such as double-booking or lost transactions. Lapses in data integrity can result in consequences that include regulatory non-compliance and decreased customer satisfaction.
What are the challenges in maintaining data integrity?
Ensuring data integrity within an organization requires addressing human and technology-related data management challenges.
OLTP environments
The biggest data integrity challenge in OLTP environments is managing concurrent transactions while maintaining data consistency, especially during high-volume operations. This challenge requires balancing strict Atomicity, Consistency, Isolation, and Durability (ACID) compliance with performance requirements. Here, multiple users must be able to simultaneously modify the same data, without encountering race conditions and deadlocks, while maintaining the system's real-time processing capabilities.
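As a minimal sketch of the atomicity and concurrency concerns described above (the table, accounts, and amounts are hypothetical, and SQLite stands in for a production OLTP database), a funds transfer either commits both updates or rolls back entirely:

```python
import sqlite3

# Minimal sketch: an atomic transfer between two hypothetical accounts.
conn = sqlite3.connect("bank.db", isolation_level=None)  # manage transactions explicitly
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('A', 100.0), ('B', 50.0)")

try:
    # BEGIN IMMEDIATE takes a write lock up front, avoiding races with concurrent writers.
    conn.execute("BEGIN IMMEDIATE")
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 'A'")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 'B'")
    conn.execute("COMMIT")    # both updates become visible together (atomicity)
except sqlite3.Error:
    conn.execute("ROLLBACK")  # neither update persists if anything fails
    raise
```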
Business intelligence and analytics
For business intelligence and analytics use cases, limited integration among data sources and systems prevents companies from maintaining a unified, accurate view of their data assets. Additionally, reliance on manual data entry and collection can introduce typos, omissions, and inconsistencies that compromise data accuracy.
Auditing and data trails
Another challenge is the absence of proper audit trails, making it difficult to track data history from collection to deletion. Organizations risk losing visibility into unauthorized data modifications. Legacy systems further complicate data integrity by using outdated file formats or lacking essential validation functions. Moving data to the cloud enables the implementation of more centralized data quality mechanisms and reduces the time and effort required for data integrity checks.
How is data protected in the cloud?
Data integrity can be divided into two broad types.
Physical integrity
Physical integrity processes protect data from damage and corruption caused by natural disasters, power outages, hardware failures, or other factors affecting physical storage devices. In the cloud, physical integrity is managed automatically by the cloud provider; it falls under the provider's side of the Shared Responsibility Model.
For example, AWS data centers protect the physical devices storing your data with a four-layer security infrastructure. Security features include:
- Strict access controls with server room access secured by multi-factor authentication and electronic controls.
- Intrusion prevention measures, like automatic unauthorized data removal detection.
- Secure storage device management from installation and provisioning to uninstallation and decommissioning.
- Rigorous third-party audits on 2,600+ security requirements, including equipment inspections.
Logical integrity
Logical integrity processes ensure that data meets the underlying rules of the storage system in which it resides. Logical integrity can be further classified into four sub-types (see the sketch after this list):
- Domain integrity ensures data accuracy by restricting values within a specific range, format, or predefined set (e.g., using data types and other similar data constraints).
- Entity integrity ensures individual data records can be uniquely identified through mechanisms like a primary key, preventing duplicate or null values in key fields.
- Referential integrity maintains consistent relationships between tables by enforcing foreign key constraints to prevent isolated data records.
- User-defined integrity implements business-specific rules beyond standard constraints, such as custom validation logic or application-level enforcement.
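To make these sub-types concrete, here is a minimal sketch using SQLite from Python; the customers and orders tables are hypothetical, and the same constraint types are available in managed relational services such as Amazon RDS and Aurora.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity in SQLite

conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,                -- entity integrity: unique, non-null key
    email       TEXT NOT NULL UNIQUE
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,                -- entity integrity
    customer_id INTEGER NOT NULL
                REFERENCES customers(customer_id),  -- referential integrity: no orphaned orders
    status      TEXT NOT NULL
                CHECK (status IN ('NEW', 'PAID', 'SHIPPED')),  -- domain integrity
    total       REAL NOT NULL CHECK (total >= 0)    -- domain integrity
);
""")

def place_order(order_id, customer_id, status, total):
    # User-defined integrity: a business rule enforced at the application level.
    if status == "SHIPPED" and total == 0:
        raise ValueError("Zero-value orders cannot be shipped")
    conn.execute("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 (order_id, customer_id, status, total))
```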
Under the Shared Responsibility Model, the customer is responsible for implementing logical integrity constraints and ensuring data quality.
However, AWS data services provide various mechanisms to support data integrity checking, such as checksum algorithms, data quality monitoring tools, and automated data integrity checks during backups and data synchronization.
Managed services can provide automatic and configurable guardrails for data integrity. Within OLTP systems and databases, logical integrity processes help keep each transaction Atomic, Consistent, Isolated, and Durable.
How do you ensure data integrity in the cloud?
Consider the following measures to implement logical integrity in the AWS cloud.
Implement object data integrity
Most cloud data operations begin with Amazon S3 buckets, which can store any data type as objects. You may frequently move data between Amazon S3 buckets, databases, and other cloud services or on-premises storage. Amazon S3 provides built-in checksum mechanisms to reduce data integrity risks during uploads, downloads, and copies.
A checksum is a fixed-length value generated from data using a specific algorithm. It acts as a digital fingerprint that allows systems to detect data corruption or unintended modifications. When copying objects, Amazon S3 calculates the checksum of the source object, applies it to the destination object, and reports an error if the values do not match. Amazon S3 supports both full object and composite checksums for multipart uploads: full object checksums cover the entire file, while composite checksums aggregate individual part-level checksums.
Use the checksum functionality as explained below.
Uploads
Amazon S3 supports several Secure Hash Algorithms (SHA) and Cyclic Redundancy Check (CRC) algorithms, including CRC-64/NVME, CRC-32, CRC-32C, SHA-1, and SHA-256. If using the AWS Management Console, select the checksum algorithm during upload. If no checksum is specified, Amazon S3 defaults to CRC-64/NVME.
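For example, the following sketch uses the AWS SDK for Python (boto3); the bucket, key, and object body are hypothetical. It uploads an object with a SHA-256 checksum and compares the value Amazon S3 returns against one computed locally:

```python
import base64
import hashlib
import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "reports/2024-01.csv"  # hypothetical names
body = b"order_id,total\n1001,25.00\n"                 # hypothetical object content

# Ask S3 to compute and store a SHA-256 checksum for this object.
response = s3.put_object(
    Bucket=bucket,
    Key=key,
    Body=body,
    ChecksumAlgorithm="SHA256",
)

# S3 returns the checksum as a base64-encoded digest; compare it with a local one.
local_checksum = base64.b64encode(hashlib.sha256(body).digest()).decode()
assert response["ChecksumSHA256"] == local_checksum, "Checksum mismatch detected"
```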
Downloads
When downloading objects, request the stored checksum value to verify data integrity. Depending on whether the upload is complete or still in progress, retrieve checksum values using the GetObject, HeadObject, or ListParts operations.
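A minimal boto3 sketch (bucket and key names are hypothetical) that retrieves the stored checksum of a completed object with HeadObject by enabling checksum mode:

```python
import boto3

s3 = boto3.client("s3")

# Retrieve the stored checksum of a completed object without downloading it.
head = s3.head_object(
    Bucket="example-bucket",
    Key="reports/2024-01.csv",
    ChecksumMode="ENABLED",        # ask S3 to include checksum fields in the response
)
print(head.get("ChecksumSHA256"))  # present if the object was uploaded with SHA-256

# For an in-progress multipart upload, list part-level checksums instead:
# parts = s3.list_parts(Bucket="example-bucket", Key="big-file.bin", UploadId=upload_id)
```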
Copying
If an object is copied using the CopyObject operation, Amazon S3 calculates a checksum covering the entire object. If the object was originally uploaded as a multipart upload, its checksum value will change after the copy, even though the data remains unchanged.
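For instance, a hedged boto3 sketch of such a copy (bucket and key names are hypothetical), requesting a SHA-256 checksum for the destination object:

```python
import boto3

s3 = boto3.client("s3")

# Copy an object and have S3 compute a full-object SHA-256 checksum for the destination.
result = s3.copy_object(
    Bucket="destination-bucket",
    Key="reports/2024-01.csv",
    CopySource={"Bucket": "source-bucket", "Key": "reports/2024-01.csv"},
    ChecksumAlgorithm="SHA256",
)
print(result["CopyObjectResult"].get("ChecksumSHA256"))
```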
Implement data pipeline integrity
Another common use case is moving data to cloud data lakes, warehouses, or managed database services. Manually setting up data integrity checks in such data pipelines is error-prone, tedious, and time-consuming: you must write monitoring code and data quality rules that alert data consumers when data quality deteriorates.
During migration
AWS Database Migration Service (DMS) protects data integrity during migrations to AWS Cloud databases through multiple built-in safeguards and validation mechanisms. DMS performs automatic validation to compare source and target data, identifying and resolving discrepancies through data re-synchronization.
DMS includes checkpoint and recovery features that enable migrations to resume from the last known good state if interruptions occur, while providing comprehensive monitoring and logging capabilities to track migration progress. Additionally, DMS ensures data security through SSL encryption for data in transit and integration with AWS security services.
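As one hedged example, row-level validation is enabled through the replication task settings JSON when you create a task with boto3; the identifiers and ARNs below are placeholders:

```python
import json
import boto3

dms = boto3.client("dms")

# Enable validation so DMS compares source and target data and reports discrepancies.
task_settings = {
    "ValidationSettings": {
        "EnableValidation": True,
        "ThreadCount": 5,
    }
}

dms.create_replication_task(
    ReplicationTaskIdentifier="orders-migration",          # hypothetical name
    SourceEndpointArn="arn:aws:dms:...:endpoint:SOURCE",   # placeholder ARNs
    TargetEndpointArn="arn:aws:dms:...:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:...:rep:INSTANCE",
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({"rules": []}),               # define table selection rules here
    ReplicationTaskSettings=json.dumps(task_settings),
)
```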
Database infrastructure
AWS databases protect data integrity through multiple comprehensive mechanisms and features, including automated backups and Multi-AZ deployments that ensure data durability and consistency. These databases enforce referential integrity through built-in constraints, maintain ACID compliance for transactional consistency, and provide point-in-time recovery capabilities. Managed database services, such as Amazon Relational Database Service (RDS) and Amazon Aurora, allow you to set specific controls for data integrity. For instance, Aurora allows you to set different transaction isolation levels on your OLTP database.
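For example, on an Aurora MySQL-compatible cluster you can set the isolation level per transaction with standard SQL. This sketch assumes a hypothetical cluster endpoint, credentials, and inventory table, and uses the PyMySQL driver:

```python
import pymysql  # connection details below are hypothetical

conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="********",
    database="orders",
)

with conn.cursor() as cur:
    # Request stricter isolation for a sensitive OLTP transaction.
    cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
    cur.execute("START TRANSACTION")
    cur.execute("UPDATE inventory SET quantity = quantity - 1 WHERE sku = %s", ("ABC-123",))
    conn.commit()
```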
For enhanced protection, AWS databases support disaster recovery through multi-region deployments, replicating data across geographically distributed regions. Integration with Amazon CloudWatch helps identify and resolve potential data integrity issues before they impact operations.
Data integration
AWS Glue is a serverless data integration service for preparing and combining data in the AWS Cloud. The AWS Glue Data Quality feature reduces manual data validation effort from days to hours. It automatically recommends quality rules, computes statistics, and monitors and alerts you when it detects incorrect or incomplete data. It works with Data Quality Definition Language (DQDL), a domain-specific language you use to define data integrity rules.
When gathering data from OLTP systems for analytics, you can use AWS Glue pipelines to push data from your databases to analytics services.
You can further publish metrics to Amazon CloudWatch for monitoring and alerting.
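A hedged boto3 sketch of this workflow (the database, table, column, and IAM role names are hypothetical): define a DQDL ruleset against a Data Catalog table, then start an evaluation run:

```python
import boto3

glue = boto3.client("glue")

# A small DQDL ruleset: completeness, uniqueness, and domain checks on hypothetical columns.
ruleset = """
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "status" in ["NEW", "PAID", "SHIPPED"]
]
"""

glue.create_data_quality_ruleset(
    Name="orders-quality-rules",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)

run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder role ARN
    RulesetNames=["orders-quality-rules"],
)
print(run["RunId"])
```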
Implement data backup integrity
Large enterprise projects may have multiple teams taking data backups and accessing Amazon S3 stores from different locations. Data governance becomes a challenge in such distributed backup operations. Note that AWS databases come with built-in backup features.
AWS Backup is a fully managed service that centralizes and automates data protection across AWS services like Amazon Simple Storage Service (S3), Amazon Elastic Compute Cloud (EC2), Amazon FSx, and hybrid workloads in VMware. You can centrally deploy data protection policies to govern, manage, and configure your backup activities across AWS resources and accounts.
AWS Backup is designed to maintain data integrity throughout the data lifecycle, from transmission and storage to processing. It applies rigorous security measures to all stored data, regardless of its type, ensuring strong protection against unauthorized access. You retain complete control over data classification, storage locations, and security policies, allowing you to manage, archive, and safeguard data according to your needs.
AWS Backup works with other AWS services to preserve data integrity through multiple mechanisms, including:
- Continuous checksum validation to prevent corruption.
- Internal checksums to verify data integrity in transit and at rest.
- Automatic redundancy restoration in the event of disk failures.
Data is redundantly stored across multiple physical locations, and network-level checks also help detect corruption during data transfers.
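As a hedged illustration of the centrally deployed backup policies described above, the following boto3 sketch creates a simple backup plan and assigns resources to it by tag; the plan, vault, tag, and role names are hypothetical:

```python
import boto3

backup = boto3.client("backup")

# A simple backup plan: daily backups at 05:00 UTC, retained for 35 days.
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-35-day-retention",
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": "Default",
                "ScheduleExpression": "cron(0 5 * * ? *)",
                "Lifecycle": {"DeleteAfterDays": 35},
            }
        ],
    }
)

# Assign resources to the plan by tag, so every tagged resource is backed up consistently.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "tagged-production-data",
        "IamRoleArn": "arn:aws:iam::123456789012:role/AWSBackupDefaultServiceRole",  # placeholder
        "ListOfTags": [
            {"ConditionType": "STRINGEQUALS", "ConditionKey": "backup", "ConditionValue": "true"}
        ],
    },
)
```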
How can AWS help maintain data integrity?
Data integrity also improves trust in analytics, supports compliance, and ensures data remains valuable throughout its lifecycle. However, for on-premises deployments, ensuring data integrity is challenging and expensive, and can result in hours lost to manual, distributed, and redundant work.
Cloud technologies centralize the process and do most of the heavy lifting for you. Several physical and logical integrity checks are built in by default, and automation mechanisms generate the rules necessary to achieve data integrity. Data engineers only have to configure settings or review the work done by these automated mechanisms. Data integrity enables OLTP systems to maintain accuracy while handling high-volume, real-time transactions, which is critical for reliable business operations.
Get started by creating a free cloud account today.