# Should you train preprocessing on the test set?

NO. NEVER TRAIN YOUR PREPROCESSING ON THE TEST SET.

This is a mistake because it leaks information from your test set into your training process, so the test score no longer measures performance on truly unseen data.

Consider this example. First, a processing routine is applied to the whole dataset:

    def processing(df):
        # Preprocessing steps are elided here.
        ...
        return df

    # Applied to the full dataset, before any split.
    df = processing(df)


Only later is the data split into a train set and a test set:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.33, random_state=42)


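This is wrong. Anything the routine learns from the data (means, scaling factors, imputation values, category lists) is computed over rows that later end up in the test set. It is harmless only by accident, in the case where the routine learns nothing from the data.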
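
To make the leak concrete, here is a minimal sketch. It assumes the preprocessing is a scikit-learn `StandardScaler` and uses made-up random data; neither comes from the example above. It shows that a scaler fitted before the split depends on the test rows, while one fitted on the train set does not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

# Fitted on everything: the mean is computed over train AND test rows.
leaky_mean = StandardScaler().fit(X).mean_
# Fitted on the train set only: the test rows play no part.
clean_mean = StandardScaler().fit(X_train).mean_

# Change only the test rows and refit both ways.
X_shifted = np.vstack([X_train, X_test + 10.0])
leaky_mean_shifted = StandardScaler().fit(X_shifted).mean_
clean_mean_shifted = StandardScaler().fit(X_train).mean_

print("fit on all data:", leaky_mean, "->", leaky_mean_shifted)
print("fit on train only:", clean_mean, "->", clean_mean_shifted)
```

The statistic fitted on all data moves when only the test rows change, which is exactly the information that should never reach training.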

Do this the other way around. First split, then train your preprocessing on the train set only, and apply the already-trained preprocessing to both the train set and the test set.
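
A minimal sketch of the correct order follows. It assumes scikit-learn's `StandardScaler` as a stand-in for the preprocessing, plus a toy DataFrame with placeholder columns `x1`, `x2`, and `y`; all of these are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with placeholder column names, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = (df["x1"] + df["x2"] > 0).astype(int)
features, target = ["x1", "x2"], "y"

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.33, random_state=42
)

# 2. Train (fit) the preprocessing on the train set only.
scaler = StandardScaler().fit(X_train)

# 3. Apply the already-fitted preprocessing to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The test set is still transformed, but only with parameters learned from the train set. Wrapping the preprocessing and the model in a `sklearn.pipeline.Pipeline` is one way to get this ordering automatically, including inside cross-validation.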