A financial services company uses a machine learning model to approve loan applications. The model is a gradient boosting classifier trained on historical loan data. Recently, the company noticed that the model's approval rate for applicants from a certain demographic group is significantly lower than for other groups, even though the model's overall accuracy remains high. The data science team has been asked to address this potential bias while minimizing the impact on overall model performance. The team has access to the training data and the trained model. They have limited time and budget. Which course of action should the team take first?
This directly addresses the root cause and is resource-efficient.
Why this answer
The most efficient first step is to analyze the training data for bias and then retrain the model with bias mitigation techniques like reweighting. Option A is wrong because collecting more data is resource-intensive and may not address bias. Option C is wrong because feature engineering may not help if the bias is in the labels.
Option D is wrong because post-hoc adjustments can introduce other issues and may not be as effective as addressing bias during training.