Should you train preprocessing on the test set?


This is a mistake because it leaks information from your test set into your training process.

Consider this example, first a processing routine is applied:

def processing(df):
    # Illustrative body (the original was left blank): standardize
    # using statistics computed over the FULL dataset -- this is the leak.
    return (df - df.mean()) / df.std()

df = processing(df)

And only then is the data split into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.33, random_state=42)

This is wrong, and only right by accident: the preprocessing has already seen the rows that will later form the test set, so the test score no longer measures performance on unseen data.

Do this the other way around. First split, then train your preprocessing on the train set.
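A minimal sketch of the correct order, using scikit-learn's StandardScaler as the data-dependent preprocessing step and a small toy array standing in for df[features] and df[target] (both are assumptions, not from the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for df[features] and df[target].
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# 2. Fit the preprocessing on the training data ONLY.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# 3. Apply the already-fitted parameters to the test data.
X_test = scaler.transform(X_test)
```

The test set is transformed with the training set's mean and standard deviation, so it stays genuinely unseen.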

See this answer:

“You should do the same preprocessing on all your data however if that preprocessing depends on the data (e.g. standardization, pca) then you should calculate it on your training data and then use the parameters from that calculation to apply it to your validation and test data.

For example if you are centering your data (subtracting the mean) then you should calculate the mean on your training data ONLY and then subtract that same mean from all your data (i.e. subtract the mean of the training data from the validation and test data, DO NOT calculate 3 separate means).”
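The centering example from the quote can be sketched with plain NumPy (the random toy data here is an assumption for illustration):

```python
import numpy as np

# Toy train/validation/test sets drawn from the same distribution.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, size=(100, 3))
X_val = rng.normal(loc=5.0, size=(20, 3))
X_test = rng.normal(loc=5.0, size=(20, 3))

# One mean, computed on the training data ONLY ...
train_mean = X_train.mean(axis=0)

# ... subtracted from all three sets (NOT three separate means).
X_train_c = X_train - train_mean
X_val_c = X_val - train_mean
X_test_c = X_test - train_mean
```

Note that only the training set is exactly centered afterwards; the validation and test sets end up merely close to centered, which is the point: they are treated with parameters estimated without them.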