What is Data Profiling?

Data profiling is the process of reviewing an organization's data in order to understand existing information, how it is stored, and the interconnections between the different data sets. Large enterprises collect data from hundreds or thousands of sources, and this can lead to redundancy, inconsistencies, and other data accuracy issues that impact future analytics projects. Data profiling aims to evaluate data quality using automation tools that identify and report content and usage patterns. It is a crucial pre-processing step before data can be utilized for analytics and business intelligence.

What is a data profile?

A data profile is a report that offers detailed insights into the attributes of a company’s data and any potential data quality issues that it may contain. The report focuses on metadata and statistical information, providing researchers with a comprehensive overview of the data's contents.

Statistical measures in a data profile help determine the quality of the data. They provide information about minimum and maximum values, frequency data, variation, mean and mode, percentiles, and other insights into data distribution.
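
The statistical portion of a data profile can be sketched for a single numeric column with Python's standard library. The column name and sample values below are illustrative, not from any real dataset.

```python
import statistics
from collections import Counter

def profile_column(values):
    """Return basic profile statistics for a list of numeric values."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "stdev": statistics.pstdev(values),  # variation (population std dev)
        "p50": statistics.median(values),    # 50th percentile
        "frequencies": Counter(values),      # value frequency data
    }

# Hypothetical "age" column
ages = [34, 29, 41, 29, 57, 29, 41]
profile = profile_column(ages)
print(profile["min"], profile["max"], profile["mode"])
```

A real profiling tool computes these measures per column across the whole dataset and rolls them into the report.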

The metadata section of the report offers insight into the type of data that a company collects. It includes structural aspects, foreign key analysis to understand the relationships between data sets, and referential integrity analysis to validate consistency across different tables.
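
A referential integrity check of the kind described above can be sketched as follows: every foreign key in a child table should reference an existing row in the parent table. The two tables and their contents are hypothetical.

```python
customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
orders = [
    {"order_id": 101, "customer_id": 1},
    {"order_id": 102, "customer_id": 2},
    {"order_id": 103, "customer_id": 9},  # orphaned reference: no customer 9
]

def orphaned_rows(parent_rows, child_rows, parent_key, foreign_key):
    """Return child rows whose foreign key has no matching parent row."""
    valid = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in valid]

bad = orphaned_rows(customers, orders, "id", "customer_id")
print(bad)  # the order pointing at the missing customer
```

The profile report would flag such orphaned rows as a consistency violation between the two tables.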

Why is data profiling important?

Here are the benefits of data profiling.

Enhance data organization

It is not uncommon for large enterprises to have several datasets that share information or include similar details. By using data profiling, businesses can identify the data source and determine which fields overlap with one another. Identifying redundancy can help to clean up data, improve organization, and facilitate better data-driven processes. Better data quality standards help enhance all data-driven processes in a business while reducing operational costs associated with duplicate efforts.

Improve collaboration

Data profiling reports also generate information about ownership and lineage. The organization gains a better understanding of who owns what data and where it originates. This knowledge enhances accountability and promotes more effective collaboration.

Streamline workflows

Data profiling encompasses automated processes that facilitate metadata identification and tracking of data flows. Your data researchers can spend less time on lengthy manual identification processes and focus on tasks that require more technical expertise. You can also remove any redundancies or inaccuracies and ensure that all the data used meets a higher standard.

Centralized governance

Data profiling centralizes information about data, providing a single-pane view of where data is stored, who owns it, and what information overlaps. You can overcome data silos and improve data access. Taking a holistic approach to documenting and mapping data ensures that everyone in your organization has a better understanding of their data. Profiling also demonstrates the relationship between different datasets and traces how it moves through the system, which is critical for compliance.

What are the use cases of data profiling?

There are several use cases of data profiling.

Data Quality

If a data operation fails, one of the easiest ways of locating the cause is to profile the data. A data profile report indicates if the data is incomplete, inaccurate, or contains an unexpected character that could be causing the error. Data engineers can run data profiles frequently to verify that data operations are functioning as expected and ensure that data remains of high quality.

Data migration

Data engineers can utilize data profile reports to identify when data systems are under stress and determine necessary adjustments for improved operational efficiency. Data profile reports can guide migration decisions to the cloud or any new setup. Data architects can rapidly gather the information needed to work more efficiently and streamline data pipeline development.

Master data management

Master data is the core data used across an organization, typically describing customers, products, suppliers, or other key assets. Master Data Management (MDM) applications are software solutions that enable organizations to manage and maintain the consistency and accuracy of their master data. When teams work on MDM applications, they use data profiles to understand which systems are integrated by the project, the application scope, and whether there are any data inconsistencies. Businesses can utilize data profiling to identify data quality issues, null values, and errors as early as possible, thereby accelerating data standardization and supporting MDM.

What are the types of data profiling?

There are several different data profiling techniques.

Structure discovery

Structure discovery data profiling is a strategy that ensures all data is consistent across a database. It checks all the data in a specific field to verify that it is in the correct format and is structured consistently with all other entries in the field. For example, structure discovery might verify that all mobile numbers in a list have the same number of digits, flagging any that have missing or incompatible values.
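
The phone number example can be sketched as a simple format check. The ten-digit pattern and the sample numbers are assumptions for illustration; a production rule would account for country codes and separators.

```python
import re

# Expected structure: exactly ten digits (an assumed rule for this example)
PHONE = re.compile(r"\d{10}")

def flag_malformed(numbers):
    """Return entries that fail the expected structural format."""
    return [n for n in numbers if not PHONE.fullmatch(n)]

numbers = ["5551234567", "555123456", "55512345678", "5559876543"]
print(flag_malformed(numbers))  # the 9- and 11-digit entries
```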

Content discovery

Content discovery data profiling is a strategy that looks for any systemic issues in the data. These errors can take the form of incorrect values or improperly structured individual elements within the database.
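
A content check can be sketched as a scan for values outside a field's valid domain. The rule here (age must be between 0 and 120) and the sample rows are assumed for illustration.

```python
def invalid_ages(rows):
    """Return rows whose age value is missing or outside the valid range."""
    return [r for r in rows if not (0 <= r.get("age", -1) <= 120)]

rows = [{"age": 34}, {"age": -2}, {"age": 200}, {"age": 58}]
print(invalid_ages(rows))  # the negative and out-of-range entries
```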

Relationship discovery

Relationship discovery data profiling traces how different data sets connect, which data sets are used alongside others, and where data sets overlap. This style of profiling first inspects the metadata to determine which relationships between datasets are most prominent, then narrows the connective thread between fields to show a more holistic view of the relationship.

Metadata discovery

Metadata discovery data profiling compares data to its expected structure by assessing its metadata. It checks that the data behaves and operates as expected. For example, if a field is intended to be numerical but receives alphabetical responses, metadata discovery will flag this discrepancy as an error for further review.
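
The numeric-versus-alphabetical example can be sketched by comparing each field's observed type against the type the metadata declares. The schema and record below are hypothetical.

```python
# Declared schema (metadata): the field name maps to its expected type
schema = {"customer_id": int, "signup_date": str, "balance": float}

def type_mismatches(record, schema):
    """Return fields whose value type differs from the declared type."""
    return [field for field, expected in schema.items()
            if field in record and not isinstance(record[field], expected)]

record = {"customer_id": "A17", "signup_date": "2024-03-01", "balance": 12.5}
print(type_mismatches(record, schema))  # customer_id holds text, not a number
```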

Field-based profiling

Field-based profiling is a strategy that identifies data quality issues in a single field by checking that the data type and characteristics match. This approach can help identify inconsistencies in data or any outliers that may skew the data.

Multi-field profiling employs a similar strategy to understand the relationship between two distinct fields. Also known as cross-field profiling or cross-table profiling, it verifies that two fields are compatible if their data relies on each other. For example, a check could verify whether the state matches the appropriate zip code in customer address listings.
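
The state/ZIP example can be sketched as a cross-field rule. The prefix table below covers only two states and is an illustrative assumption, not a complete mapping.

```python
# Assumed lookup: ZIP code prefix -> expected state
ZIP_PREFIX_TO_STATE = {"98": "WA", "97": "OR"}

def state_zip_mismatches(rows):
    """Return rows where the state disagrees with the state implied by the ZIP."""
    bad = []
    for row in rows:
        expected = ZIP_PREFIX_TO_STATE.get(row["zip"][:2])
        if expected is not None and expected != row["state"]:
            bad.append(row)
    return bad

rows = [{"state": "WA", "zip": "98101"}, {"state": "WA", "zip": "97035"}]
print(state_zip_mismatches(rows))  # the Oregon ZIP labeled as WA
```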

How does data profiling work?

Here are the main stages that data profiling moves through.

Preparation

Preparation involves outlining what you want to achieve with your data profiling. This begins with identifying which form of data profiling is most effective for your business objectives. At this stage, you also identify any metadata fields that you wish to research.

Data discovery

Next, you will identify what data is in your system. This stage aims to collect information on the structure of your data, its formats, content, and potential relationships between datasets. At this stage, you can conduct a statistical analysis to determine certain data features.

Standardization

Standardizing ensures that formats and structures across all your data align. At this stage, you will also eliminate any duplicate data and remove redundancies, thereby reducing the total amount of data that needs to be cleansed in the next step. If you need to apply business rules to standardize your data, this is where data rule validation occurs.
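
Format alignment and deduplication at this stage can be sketched together: normalize each value to a canonical form, then drop exact duplicates. The email field and sample data are assumptions for illustration.

```python
def standardize_emails(emails):
    """Normalize casing and whitespace, then remove duplicates, keeping order."""
    seen, result = set(), []
    for email in emails:
        norm = email.strip().lower()  # canonical format: trimmed, lowercase
        if norm not in seen:          # skip values already standardized
            seen.add(norm)
            result.append(norm)
    return result

raw = ["Ana@Example.com ", "ana@example.com", "ben@example.com"]
print(standardize_emails(raw))  # two unique addresses remain
```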

Cleansing

Cleansing involves detecting and removing errors, enriching the data by connecting it with other data sources, and remedying inconsistencies in the wider datasets.

Improvement

Finally, the data profiling process focuses on improvement, which involves monitoring data quality to ensure that any issues are resolved as quickly as possible. If you have certain data governance or data strategy goals, this stage is where you can ensure compliance and verify that your data is ingested and distributed correctly across your organization.

What are common data profiling functions?

Here are the common data profiling tools and functions.

Mathematical functions

Mathematical functions in data profiling are methods to calculate data completeness and identify any patterns that exist throughout a dataset. For example, absolute value, power, log, etc.

Aggregate functions

Aggregate functions focus on collecting multiple fields from rows or columns and then returning a singular value to summarize that information. For example, average, count, maximum, variance, and so on.
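
An aggregate pass over one column can be sketched like this: many values go in, and each function returns a single summary value. The column values are made up.

```python
import statistics

values = [12, 7, 7, 19, 4]  # hypothetical column
summary = {
    "count": len(values),
    "average": statistics.mean(values),
    "maximum": max(values),
    "variance": statistics.pvariance(values),
}
print(summary)
```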

Text functions

Text functions are strategies to inspect alphabetical data entries, helping to assess the data quality of these string fields and interact with them. For example, find, char, trim, etc.
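
The named text functions map onto common string operations, sketched here on an assumed sample value.

```python
value = "  Seattle, WA  "
trimmed = value.strip()        # trim: remove surrounding whitespace
position = trimmed.find(",")   # find: index of the delimiter, or -1 if absent
first_char = trimmed[0]        # char: inspect the character at a position
print(trimmed, position, first_char)
```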

Date and time functions

Date and time functions allow researchers to inspect data that includes these fields. You can investigate specific dates or times, calculate the difference between dates, or return specific information from these fields. For example, convert time zones, return the month, year, and day from a given date, etc.
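
These operations can be sketched with the standard library: computing the difference between two dates, extracting date parts, and converting a time zone. The dates and the fixed UTC-7 offset are illustrative assumptions.

```python
from datetime import date, datetime, timezone, timedelta

signup = date(2023, 5, 14)
today = date(2024, 5, 14)
print((today - signup).days)                  # difference between two dates
print(signup.year, signup.month, signup.day)  # return parts of a given date

utc = datetime(2024, 5, 14, 12, 0, tzinfo=timezone.utc)
pacific = utc.astimezone(timezone(timedelta(hours=-7)))  # convert time zone
print(pacific.hour)
```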

Window functions

Data profiling tools with window functions allow you to investigate column-based information. You can conduct cross-column profiling and column profiling across a rolling data window. For example, rolling window count, max, etc.
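
A rolling-window profile over a column can be sketched as follows: for each position, compute a summary (here, the maximum) over the last `size` values. The series is illustrative.

```python
def rolling_max(values, size):
    """Return the max of each trailing window of up to `size` values."""
    return [max(values[max(0, i - size + 1): i + 1])
            for i in range(len(values))]

series = [3, 1, 4, 1, 5, 9, 2]
print(rolling_max(series, 3))  # windowed maxima over the last three values
```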

Web functions

Web functions operate on strings containing XML or JSON content. For any data connected to a web service, these functions are effective investigative tools. For example, converting data fields or extracting a value from a JSON object.
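
Extracting a value from a JSON object can be sketched with the standard library. The payload is a hypothetical web service response.

```python
import json

payload = '{"customer": {"id": 42, "tier": "gold"}}'
record = json.loads(payload)        # parse the JSON string into a dict
tier = record["customer"]["tier"]   # extract a nested value
print(tier)
```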

How can AWS support your data profiling requirements?

Amazon SageMaker Catalog provides data quality scores that help you understand the different quality metrics such as completeness, timeliness, and accuracy of your data sources. Amazon SageMaker Catalog integrates with AWS Glue Data Quality and offers APIs to integrate data quality metrics from third-party data quality solutions. Data users can see how data quality metrics change over time for their subscribed assets. To author and run the data quality rules, you can use your data quality tool of choice, such as AWS Glue Data Quality. With data quality metrics in SageMaker Catalog, data consumers can visualize the data quality scores for assets and columns, helping build trust in the data they use for decisions.

AWS Glue is a serverless data integration service that simplifies the process of discovering, preparing, and combining data for analytics, AI/ML, and application development. It provides all the capabilities needed for data integration, allowing you to start analyzing your data and putting it to use in minutes, rather than months.

AWS Glue DataBrew is the visual data preparation feature within AWS Glue that provides data profiling capabilities. You can:

  • Choose from over 250 prebuilt transformations to automate data preparation tasks, all without the need to write any code.
  • Automatically filter anomalies, convert data to standard formats, and correct invalid values.
  • Immediately use the prepared data for analytics and AI/ML projects.

Manually creating data quality rules by writing code to monitor data pipelines is a significant challenge in data profiling. AWS Glue Data Quality is another feature that automatically computes statistics, recommends data quality rules, monitors, and alerts you when it detects issues.

Get started with data profiling on AWS by creating a free account today.