AI Fairness in Industry

Carolyn Saplicki
IBM Data Science in Practice
8 min read · May 26, 2023

By Erika Agostinelli (IBM Senior Data Scientist), Stefan van der Stockt (IBM Lead Data Scientist, Master Inventor) and Carolyn Saplicki (IBM Senior Data Scientist)

Industries are increasingly using AI in Sustainability practices to create models that aid with critical asset performance management (APM) initiatives. These models are applied to anything from condition-based monitoring, predictive maintenance, asset failure prediction, and detecting anomalies in daily usage patterns, to predicting the end-of-life of assets. The business success of these initiatives is highly dependent on the veracity of the employed AI models.

AI models in industry track the operation of assets in the field. Models can be employed to perform tasks such as predicting when maintenance should be performed based on (variable) operating conditions. Models can track the conditions that are known to lead to asset degradation over time, and pre-emptively update asset end-of-life expectancies. Condition-based monitoring data could also be used by AI models to orchestrate and schedule the use of assets more optimally, increasing profits, extending asset life, and avoiding unplanned downtime.

All AI models are only as good as the data and development lifecycle that yielded them. In a mature AI lifecycle, every deployed model is constantly monitored to ensure that reality still matches the assumptions (and data) that were used to train the model. Model monitors are employed to consistently check aspects such as input data drift, model quality (“accuracy”), model fairness, model explainability, and others. Model quality, drift, and explainability, while critical and involving sophisticated calculations in their own right, are relatively easy to relate to business goals for industry use cases. But what does Fairness mean in industrial use cases such as manufacturing a car or operating a field of oil wells?

The Common Definition of AI Fairness

In Fairness Explained: Definitions and Metrics, fairness definitions and fairness metrics are presented in the context of a real-world example that predicts a criminal defendant’s likelihood of reoffending. In that scenario, we give a fairness definition based on protected attributes such as race, gender, or religion. We show that there are numerous, often conflicting, definitions of “fair” and that the most appropriate mathematical formulation of a fairness metric depends on the use case itself. We utilize tools such as the Fairness Compass and AI Fairness 360 to understand the existence of bias between privileged (reference) and unprivileged (monitored) groups and then quantify its effects on a model’s outcomes.
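
As a small illustration of these concepts, the sketch below uses the open-source AI Fairness 360 toolkit to compute disparate impact and statistical parity difference between a privileged (reference) and unprivileged (monitored) group. The column names and values are hypothetical, and exact API details may vary slightly across AIF360 versions.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Hypothetical data: "group" is the protected attribute (1 = privileged),
# "outcome" is the binary label (1 = favorable).
df = pd.DataFrame({
    "group":   [1, 1, 1, 1, 0, 0, 0, 0],
    "outcome": [1, 1, 1, 0, 1, 0, 0, 0],
})

dataset = BinaryLabelDataset(
    df=df,
    label_names=["outcome"],
    protected_attribute_names=["group"],
    favorable_label=1,
    unfavorable_label=0,
)

metric = BinaryLabelDatasetMetric(
    dataset,
    privileged_groups=[{"group": 1}],
    unprivileged_groups=[{"group": 0}],
)

# Ratio of favorable-outcome rates: unprivileged / privileged.
print("Disparate impact:", metric.disparate_impact())
# Difference of favorable-outcome rates: unprivileged - privileged.
print("Statistical parity difference:", metric.statistical_parity_difference())
```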

However, approaching the topic of fairness in Industry is not such a straightforward exercise:

  • What is the definition of a protected attribute in any given industrial use case? Which groups are considered privileged and unprivileged? Why does the distinction even matter to the business?
  • What mathematical definition of “fairness” is appropriate to measure the disparity between the above groups?
  • What business objectives are supported by this fairness definition and supporting fairness metric? At what point (“threshold”) does any observed disparity between groups become a bias that needs to be addressed?
  • How should any such bias be mitigated? Does it involve retraining the model on new/different/additional data? Does it require physical changes to assets in the field? Does it require standard operating procedure or other process changes? Does it require a complete rethink of the AI method being used?

These questions are not trivial and have far-reaching consequences. The answers are critical to reflect on when AI is used to avoid unplanned downtime or to decide when or when not to send technicians out to perform work on multi-million-dollar assets.

The Definition of AI Fairness in Industry

AI models learn from historical data and can expose which combinations of characteristics frequently lead to certain outcomes (such as failures). Subject-matter experts (SMEs) can inspect these relationships and patterns and can then make judgements on how well the model outcomes match the observed data.

A salient assumption in the above statement is that, all things being equal, all assets should be treated similarly by any AI model, regardless of an asset’s “non-causal demographics”. AI models typically use data such as sensor readings, weather data, service history, production volumes, etc. to reason about an asset’s condition. Such models rarely (if ever) use non-causal “demographic” data such as an asset’s randomly assigned group ID in the ERP system, proximity to passers-by on the factory floor, or other data items that seemingly do not actually aid in making predictions. Yet these data types might have measurable and profound effects on model outcomes that may not be detected by traditional data science techniques if they are not actively being looked for. Model monitoring could expose that model outcomes may, directly or indirectly, in fact be driven by non-causal factors such as ERP groupings, location, or different supplier contracts.

A definition of asset fairness may be useful here. To get there we need to map fairness concepts to industrial use cases, for example:

  • Protected Attributes: A protected attribute should be defined that maps to business-sensitive “non-causal” data such as location or an arbitrary asset grouping. The criterion should be that any discrepancy between such groups a) should ideally not be present and b) if it DOES exist, should be deemed negative to the business.
  • Fairness Definition: With one or more protected attributes defined, we can now decide what “fair” means. This will inform which fairness metric to use (e.g., disparate impact, statistical parity difference, equality of odds, etc.).
  • Thresholds: Once we have a definition and measure of fairness, we can define the critical thresholds above/below which any detected discrepancy becomes problematic.

Defining the above concepts allows us to detect situations in which the model is giving different ranges of positive/negative outcomes to different assets based on non-causal criteria (that are not even known to the model!). A data science team can then investigate why this is occurring and adapt the model accordingly.
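
To make this concrete, here is a minimal sketch (plain Python with pandas, using hypothetical column names) that compares the rate of favorable model predictions across groups defined by a non-causal attribute such as an arbitrary ERP group ID:

```python
import pandas as pd

def favorable_rate_ratio(scored: pd.DataFrame,
                         protected_col: str,
                         monitored_value,
                         reference_value,
                         prediction_col: str = "prediction",
                         favorable_label: str = "not-failing") -> float:
    """Ratio of favorable-prediction rates: monitored group / reference group."""
    favorable = scored[prediction_col] == favorable_label
    rate_monitored = favorable[scored[protected_col] == monitored_value].mean()
    rate_reference = favorable[scored[protected_col] == reference_value].mean()
    return rate_monitored / rate_reference

# Hypothetical scoring records: predictions plus a non-causal ERP group ID.
scored = pd.DataFrame({
    "erp_group":  ["A", "A", "A", "B", "B", "B"],
    "prediction": ["not-failing", "not-failing", "failing",
                   "not-failing", "failing", "failing"],
})

ratio = favorable_rate_ratio(scored, "erp_group",
                             monitored_value="B", reference_value="A")
print(f"Favorable-rate ratio (B vs A): {ratio:.2f}")
```

A ratio far from 1 would flag that assets in group B receive the favorable outcome at a noticeably different rate than assets in group A, even though the ERP grouping should have no causal bearing on failure.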

Why is it important to track potential bias in APM?

Defining and monitoring the fairness of the ML models that you apply to your asset performance management (APM) strategy gives you greater insight into the presence and magnitude of any detected bias. Ultimately, bias detection activities could uncover areas of improvement in the modelling approach that would otherwise go undetected. Strategy changes may also be needed to address any detected biases.

How can we monitor AI Fairness?

Utilizing SMEs’ knowledge and experience, you can create bias monitors to understand and quantify fairness. In this blog we used Watson OpenScale, a platform for overseeing the model lifecycle with quality, drift, fairness, and explainability monitors. This enables data scientists to understand when and how a model is predicting failures in a biased way, allowing appropriate action to be taken. Learn more about APM model management here.

AI Fairness Monitoring in Industry — An Example

Using Maximo Application Suite and Cloud Pak for Data, we created and deployed an Intelligent Maintenance model that predicts whether an asset will fail. This model utilizes dynamic data and static data. Dynamic data are sensor variables that come from the asset utilization, such as temperature and speed. Static data are stationary variables that come from the asset itself, such as age and location.

Using Watson OpenScale, we monitor the quality of the model using feedback data (ground truth data). We also monitor drift and explainability using payload data (scoring data). OpenScale offers out-of-the-box fairness metrics, but for our use case we decided to create a custom metric, which allows us to formulate more complex and tailored metrics. More information on custom monitors in Watson OpenScale can be found here.

Watson OpenScale

Texas vs New York: Which assets are more likely to be predicted to fail?

To uncover possible bias within our model, we created a custom fairness metric. Our metric essentially asks: Do assets in Texas and New York have different predicted outcomes when the assets’ temperature is below a certain range? This metric can provide us with information to highlight whether the model is biased toward one region in certain asset conditions.

In Watson OpenScale, our custom monitor continuously compares assets in different groups (regions where assets are located) using our fairness metric. Group 1 is Texas. Group 2 is New York. The predicted favorable label is “not-failing”. As Texas has a warmer climate than New York, SMEs are concerned that general heat/humidity may impact the IoT sensors on our assets. Here, we are explicitly looking for a “non-causal” factor.

The metric we are monitoring is a ratio.

  • Ideally, we would aim for a value close to or equal to 1 to suggest equality between the different regions since the assets’ temperature should be the same in both locations.
  • A value < 1 implies higher benefit for the assets in New York since more assets are predicted as not-failing.
  • A value > 1 implies a higher benefit for the assets in Texas since more assets are predicted as not-failing.

Fairness Monitor

We utilized a bi-directional threshold. Our metric threshold is between 0.8 and 1.25, based on the 80% rule. We are concerned if our metric falls below 0.8 or rises above 1.25. Based on our monitor, we can see that the metric consistently violates the lower threshold until around 5 PM, after which it trends upward until it no longer violates either threshold.
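
A minimal, stand-alone sketch of such a custom metric (plain Python, with hypothetical column names and a hypothetical temperature cutoff, not the actual Watson OpenScale custom-monitor code) might look like this:

```python
import pandas as pd

TEMP_CUTOFF = 30.0          # hypothetical "below a certain range" cutoff
LOWER, UPPER = 0.8, 1.25    # bi-directional thresholds based on the 80% rule

def regional_fairness_ratio(payload: pd.DataFrame) -> float:
    """Ratio of 'not-failing' prediction rates for Texas vs New York assets,
    restricted to records where the asset temperature is below the cutoff."""
    cool = payload[payload["temperature"] < TEMP_CUTOFF]
    rate_tx = (cool.loc[cool["region"] == "Texas", "prediction"] == "not-failing").mean()
    rate_ny = (cool.loc[cool["region"] == "New York", "prediction"] == "not-failing").mean()
    return rate_tx / rate_ny   # > 1 favors Texas, < 1 favors New York

# Hypothetical payload records scored by the deployed model.
payload = pd.DataFrame({
    "region":      ["Texas", "Texas", "New York", "New York", "Texas", "New York"],
    "temperature": [25.0, 28.0, 22.0, 27.0, 35.0, 26.0],
    "prediction":  ["not-failing", "failing", "not-failing",
                    "not-failing", "not-failing", "failing"],
})

ratio = regional_fairness_ratio(payload)
if ratio < LOWER or ratio > UPPER:
    print(f"Threshold violation: ratio = {ratio:.2f}")
else:
    print(f"Within thresholds: ratio = {ratio:.2f}")
```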

When a threshold is violated, Data Scientists should first investigate if violations can be explained by the model and/or mitigated through Machine Learning practices. If a metric continues to violate a threshold, Data Scientists, SMEs, and Operations Engineers may need to collaborate to understand the source of the bias which could be connected to “non-causal” data.

Conclusion: Fair Asset Performance Management

AI in Sustainability practices aids APM initiatives such as condition-based monitoring, predictive maintenance, asset failure prediction, detecting anomalies in daily usage patterns, and predicting the end-of-life of assets. These AI models increase profits, extend asset life, and avoid unplanned downtime. It is necessary that every deployed model is monitored to consistently check aspects such as drift, quality, fairness, explainability, and others.

At first glance, it may seem like AI fairness is not relevant to APM problems in Industry. However, fairness is critical in Industry. Fairness monitoring for APM AI models uncovers and quantifies how unexpected causal and non-causal factors can negatively affect your Machine Learning model and, consequently, your APM decisions. This fairness monitoring could uncover areas of improvement in the modelling approach that would otherwise go undetected. Furthermore, fairness metrics can reveal when strategy changes are needed to advance your APM approach.

We recommend that businesses rethink their ML-Ops strategy and include fairness-bias monitoring to ensure fair Asset Performance Management across their enterprise.
