
This is another lesson that I learned while building machine learning systems. For the other lessons that I learned, click here.

When you are creating a dataset, make a small statistics file.

I learned this the hard way. When you are creating your dataset, make sure to also create a file that contains the most important statistics of that dataset at the time of creation. At the very least, you want the minimum, maximum, count, and number of unique values for every variable.
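As a rough illustration, here is a minimal sketch of collecting those four statistics in a single pass. It assumes the dataset can be read as an iterable of dictionaries, one per row; the function name `column_stats` and the example columns are made up for this post, not taken from any library.

```python
def column_stats(rows):
    """Collect min, max, count, and number of unique values per column.

    `rows` is assumed to be an iterable of dicts (one dict per record).
    Missing values (None) are skipped.
    """
    stats = {}
    seen = {}  # column name -> set of distinct values seen so far
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue
            if name not in stats:
                stats[name] = {"min": value, "max": value, "count": 0}
                seen[name] = set()
            col = stats[name]
            col["min"] = min(col["min"], value)
            col["max"] = max(col["max"], value)
            col["count"] += 1
            seen[name].add(value)
    for name, col in stats.items():
        col["n_unique"] = len(seen[name])
    return stats


# Hypothetical toy dataset, only for illustration.
rows = [
    {"user_id": 1, "city": "berlin", "age": 31},
    {"user_id": 2, "city": "paris", "age": 27},
    {"user_id": 3, "city": "berlin", "age": 45},
]
stats = column_stats(rows)
```

Because the rows are consumed one at a time, this can run inside the same loop that writes the dataset. Keeping a set of every distinct value can get memory hungry for very high-cardinality columns, so treat this as a starting point.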

For me, the number of unique values is very important because I often end up making embeddings. And to make embeddings, you need to know how many distinct values you want to embed.
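To make that concrete, here is a small sketch of how the stored unique count decides the size of an embedding table. It uses plain Python lists instead of a deep learning framework, and the `stats` dictionary, the `city` column, and `build_embedding_table` are all hypothetical names chosen for the example.

```python
import random


def build_embedding_table(n_unique, dim, seed=0):
    """Create one randomly initialised vector of length `dim` per distinct value."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_unique)]


# Hypothetical statistics, as they might be read back from the statistics file.
stats = {"city": {"n_unique": 3}}

# The number of rows in the table comes straight from the stored unique count.
table = build_embedding_table(stats["city"]["n_unique"], dim=4)

# Hypothetical mapping from raw value to row index in the table.
vocab = {"amsterdam": 0, "berlin": 1, "paris": 2}
vector = table[vocab["berlin"]]
```

In a real model the table would be a trainable layer, but the sizing question is the same: without the unique count you have to scan the whole dataset again before you can even define it.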

But really, this file can be anything, ranging from a pickled dictionary to a dataframe or just some plain old text. Trust me, you will need these values later down the line. This is especially important when your dataset becomes big, because then it can take a nontrivial amount of time to create the dataset again just to get these numbers back.
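As one possible take on the plain text option, the sketch below writes the statistics to a JSON file and reads them back. JSON is my choice for the example, not a requirement; the file name `dataset_stats.json`, the helper names, and the numbers are made up, and a temporary directory stands in for wherever your dataset lives.

```python
import json
import os
import tempfile

# Hypothetical statistics for a single column, only for illustration.
stats = {"age": {"min": 27, "max": 45, "count": 3, "n_unique": 3}}


def save_stats(stats, path):
    """Write the statistics as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stats, f, indent=2, sort_keys=True)


def load_stats(path):
    """Read the statistics back without touching the dataset itself."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# In practice this would sit next to the dataset; a temp dir is used here.
stats_path = os.path.join(tempfile.mkdtemp(), "dataset_stats.json")
save_stats(stats, stats_path)
reloaded = load_stats(stats_path)
```

JSON only handles basic types such as numbers, strings, lists, and dictionaries, so values like dates would need converting to strings first. A pickled dictionary avoids that, at the cost of not being readable in a text editor.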

So, keep in mind when you are creating a dataset to make a small statistics file!