Leakage (machine learning)


In statistics and machine learning, leakage is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores to overestimate the model's utility when run in a production environment.
Leakage is often subtle and indirect, making it hard to detect and eliminate. Leakage can cause a modeler to select a suboptimal model that a leakage-free model could otherwise outperform.

Leakage modes

Leakage can occur at many steps in the machine learning process. Its causes can be sub-classified into two possible sources of leakage for a model: features and training examples.

Feature leakage

Column-wise leakage is caused by training the model on columns that are not available at prediction time and that are one of: a duplicate of the label, a proxy for the label, or the label itself. This can include leaks that only partially give away the label.
Examples include a "MonthlySalary" column when predicting "YearlySalary"; "MinutesLate" when predicting "IsLate"; or, more subtly, "NumOfLatePayments" when predicting "ShouldGiveLoan".
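
One rough screening heuristic for this kind of leak is to look for feature columns that track the label almost perfectly. The sketch below is illustrative only: the function names, the toy table, and the 0.99 threshold are assumptions made for this example, and a high correlation is a prompt for manual review rather than proof of leakage.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant column as uncorrelated with the label
    return cov / math.sqrt(var_x * var_y)

def suspicious_columns(table, label, threshold=0.99):
    """Return feature names whose absolute correlation with the label
    meets the (assumed) threshold."""
    target = table[label]
    return [
        name
        for name, values in table.items()
        if name != label and abs(pearson(values, target)) >= threshold
    ]

# Hypothetical toy data: YearlySalary is exactly 12 * MonthlySalary.
table = {
    "Age": [25, 41, 33, 58, 47],
    "MonthlySalary": [3000, 5200, 4100, 4800, 6100],
    "YearlySalary": [36000, 62400, 49200, 57600, 73200],
}
flagged = suspicious_columns(table, label="YearlySalary")
print(flagged)
```

Here only "MonthlySalary" is flagged, since it determines the label exactly. A linear correlation check of this kind would miss non-linear or categorical proxies, so it complements rather than replaces a review of where each column comes from.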

Training example leakage

Row-wise leakage is caused by improper sharing of information between rows of data.
Data leakage types:
For time-dependent datasets, the structure of the system being studied evolves over time. This can introduce systematic differences between the training and validation sets. For example, if a model for predicting stock values is trained on data for a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population. As another example, suppose a model is developed to predict an individual's risk for being diagnosed with a particular disease within the next year.
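
For time-dependent data, one common precaution is to split chronologically, so that every training row predates every validation row and the evaluation resembles predicting the future from the past. The following is a minimal sketch under assumed names (the `chronological_split` helper, the `day` and `value` fields, and the cutoff date are all hypothetical). It keeps future rows out of training but does not remove the systematic differences between periods described above.

```python
from datetime import date

def chronological_split(rows, time_key, cutoff):
    """Put rows dated before `cutoff` in training and the rest in validation."""
    train = [row for row in rows if row[time_key] < cutoff]
    validation = [row for row in rows if row[time_key] >= cutoff]
    return train, validation

# Hypothetical toy records; the field names are assumptions for illustration.
rows = [
    {"day": date(2019, 3, 1), "value": 101.5},
    {"day": date(2021, 6, 15), "value": 130.2},
    {"day": date(2018, 11, 20), "value": 97.0},
    {"day": date(2020, 1, 10), "value": 110.4},
    {"day": date(2022, 2, 5), "value": 125.9},
]
train, validation = chronological_split(rows, "day", cutoff=date(2020, 1, 1))

# Sanity check: no training row is dated on or after any validation row.
latest_train_day = max(row["day"] for row in train)
earliest_validation_day = min(row["day"] for row in validation)
print(latest_train_day, earliest_validation_day)
```

A random shuffle of the same rows could place 2022 records in training and 2018 records in validation, which is the kind of row-wise information sharing this split is meant to avoid.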

Detection