Data Validation at Scale — Detecting and Responding to Data Misbehavior

ODSC - Open Data Science
Jun 29, 2023

Editor’s note: Felipe is a speaker for ODSC Europe this June 14th-15th. Be sure to check out his talk, “Data Validation at Scale — Detecting and Responding to Data Misbehavior,” there!

In today’s data-driven world, companies rely heavily on data to make informed decisions and gain a competitive edge. This also means that low-quality data can have serious negative consequences for businesses. Incorrect or incomplete data can lead to poor decision-making, missed opportunities, and ultimately, financial losses. Data validation can be particularly challenging as the amount of data involved continues to grow steadily. As businesses generate and collect more data than ever before, the task of ensuring that all of this information is accurate and consistent becomes increasingly complex.

In this tutorial, we’ll introduce the concept of data logging and discuss how to validate data at scale by creating metric constraints and generating reports based on the data’s statistical profiles using the whylogs open-source package.

Case Study: Airbnb Listings in Rio de Janeiro

In this tutorial, we will validate data containing Airbnb listing activity and metrics from Rio de Janeiro, Brazil. The dataset was adapted from the Inside Airbnb project. Let’s download the dataframe with:

import pandas as pd

df_target = pd.read_parquet(
    "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Listings/airbnb_listings_target.parquet"
)
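
Before profiling anything, it is worth a quick sanity check that the batch actually contains the columns we will constrain later (this snippet simply assumes the listing columns used further down, such as id, price, and room_type, are present in the parquet file):

# Peek at the shape and at the columns referenced by the constraints below
print(df_target.shape)
print(df_target[["id", "latitude", "longitude", "price", "bedrooms", "room_type"]].head())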

Let’s simulate a scenario where we want to assert the quality of a batch of production data. We will define a set of data quality checks, or constraints, assuming the existence of previous domain knowledge and experience with the data. These constraints operate on top of statistical summaries of data, rather than on the raw data itself. In whylogs, these statistical summaries are called profiles, so let’s begin with a brief introduction to data logging.

Data logging with whylogs

In a production setting, we need ways of monitoring data that are scalable and efficient. For a number of reasons, such as storage requirements or privacy concerns, using raw data for debugging/monitoring purposes might not be feasible.

For this reason, we’ll leverage data logging to generate statistical summaries of our data, which we can then use to track changes in our dataset, ensure data quality and visualize key summary statistics.

First of all, we can install whylogs (with the viz extra, which we’ll use later):

pip install whylogs[viz]
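
Depending on your shell (zsh, for instance, treats square brackets as a glob pattern), the extra may need to be quoted:

pip install "whylogs[viz]"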

Now, let’s create a profile of our target dataframe:

import whylogs as why
results = why.log(df_target)
profile_view = results.profile().view()

A profile is a lightweight statistical fingerprint of your dataset; by generating a profile view, it can be stored for later use or sent over to monitoring platforms. It provides valuable statistics on a column (feature) basis, such as:

  • Counters, such as the number of samples and null values
  • Inferred types, such as integral, fractional, and boolean
  • Estimated cardinality
  • Frequent items
  • Distribution metrics: min, max, median, and quantile values
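
If you want to inspect these statistics directly rather than through a report, the profile view can be summarized as a pandas DataFrame with one row per logged column. A small sketch (the exact metric column names vary across whylogs versions):

# One row per column of df_target, with counts, types, cardinality, and distribution metrics
summary = profile_view.to_pandas()
print(summary.columns)
print(summary.head())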

A profile can be used for several purposes, such as a) data monitoring, b) visualization, c) drift detection, and d) data validation. In the next section, we will see how to perform data validation with Metric Constraints.

Data Validation with Metric Constraints

Constraints are a powerful feature built on top of whylogs profiles that enable you to quickly and easily validate that your data looks the way that it should. There are numerous types of constraints that you can set on your data (that numerical data will always fall within a certain range, that text data will always be in a JSON format, etc) and, if your dataset fails to satisfy a constraint, you can fail your unit tests or your CI/CD pipeline.

There are a number of ways to create Metric Constraints. In this example, we will use out-of-the-box helper constraints to facilitate the process.

We will create the constraints with the help of ConstraintsBuilder. That will allow us to progressively add the constraints we wish to build:

from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import (
    no_missing_values,
    is_in_range,
    smaller_than_number,
    quantile_between_range,
    is_non_negative,
    frequent_strings_in_reference_set,
    column_is_nullable_integral,
)

room_set = {"Private room", "Shared room", "Hotel room", "Entire home/apt"}

builder = ConstraintsBuilder(dataset_profile_view=profile_view)

builder.add_constraint(no_missing_values(column_name="id"))
builder.add_constraint(is_in_range(column_name="latitude", lower=-24, upper=-22))
builder.add_constraint(is_in_range(column_name="longitude", lower=-44, upper=-43))
builder.add_constraint(smaller_than_number(column_name="availability_365", number=366))
builder.add_constraint(quantile_between_range(column_name="price", quantile=0.5, lower=150, upper=437))
builder.add_constraint(is_non_negative(column_name="bedrooms"))
builder.add_constraint(column_is_nullable_integral(column_name="bedrooms"))
builder.add_constraint(frequent_strings_in_reference_set(column_name="room_type", reference_set=room_set))

constraints = builder.build()
constraints.validate()

Calling validate() will return True if all the constraints pass, and False otherwise.
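
Since validate() returns a single boolean, it drops neatly into a unit test or a CI/CD gate, and whylogs can also produce a per-constraint report. A minimal sketch, assuming the generate_constraints_report helper available in recent whylogs releases:

# Print the per-constraint results (constraint name, passed count, failed count)
for result in constraints.generate_constraints_report():
    print(result)

# Fail the test suite (or CI job) if any constraint is violated
assert constraints.validate(), "Data quality constraints failed for this batch"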

We can also visualize the constraints report with the viz module. With it, you can filter the displayed constraints by name or status (pass or fail), and hovering over a constraint’s status reveals the additional context used to determine it:

from whylogs.viz import NotebookProfileVisualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints)

It looks like our data meets almost all of our assertions, with the exception of one — we should check the bedrooms column and see why its type is not the one we expect.
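
One way to investigate is to go back to the raw batch and check how pandas typed the column; missing values, for instance, force an integer column into a float dtype, which would make whylogs infer a fractional rather than an integral type. A quick check along those lines:

# Inspect the raw column behind the failed constraint
print(df_target["bedrooms"].dtype)         # a float dtype here would explain the failure
print(df_target["bedrooms"].isna().sum())  # missing values push pandas to float64
print(df_target["bedrooms"].head())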

What’s Next

In this blog post, we have explored some of the capabilities of whylogs for data validation. However, it’s worth noting that there are a number of additional features within whylogs that we haven’t covered here.

For a more in-depth view of this topic, you can sign up for my upcoming workshop at ODSC Europe this June, “Data Validation at Scale — Detecting and Responding to Data Misbehavior.” In the workshop, we will also see how to automatically generate constraints based on a reference dataset, perform row-level validation, trigger actions on failed conditions, and debug failed conditions.

See you there!

About the Author/ODSC Europe Speaker:

Felipe de Pontes Adachi is a Data Scientist at WhyLabs. He is a core contributor to whylogs, an open-source data logging library, and focuses on writing technical content and expanding the whylogs library in order to make AI more accessible, robust, and responsible. Previously, Felipe was an AI Researcher at WEG, where he researched and deployed Natural Language Processing approaches to extract knowledge from textual information about electric machinery. He holds a Master’s degree in Electronic Systems Engineering from UFSC (Universidade Federal de Santa Catarina), with research focused on developing and deploying fault detection strategies based on machine learning for unmanned underwater vehicles. Felipe has published a series of blog articles about MLOps, Monitoring, and Natural Language Processing in publications such as Towards Data Science, Analytics Vidhya, and Google Cloud Community.

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Subscribe to our fast-growing Medium Publication too, the ODSC Journal, and inquire about becoming a writer.
