
If there is one thing I’ve learned recently, it is that you really want to validate your inputs and outputs early.

Say you are a machine learning or software engineer building a service that takes data from somewhere, does something with it, and then puts the transformed data somewhere else.

The very first thing you should do is to validate your inputs and validate your outputs.

Get a test like this up and running ASAP. It will save you a lot of headaches down the line.

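import json
import logging

import pydantic

logger = logging.getLogger(__name__)

def test_integration():
    # ... main routine ...
    f = download_as_json(bucket, key)
    try:
        data_model.model_validate(f)  # data_model is a pydantic.BaseModel subclass
    except pydantic.ValidationError as ve:
        # Log the offending payload together with the validation errors.
        logger.error(f"Failed validating json: {json.dumps(f, indent=2)}\n{ve}")
        raise  # re-raise so the test fails with the full pydantic report
    logger.info(f"Successfully validated json: {json.dumps(f, indent=2)}")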

When you have data coming in, you want to be really sure that what you take in is what you think you are taking in.

For example, we thought we would get a pd.DataFrame with a column of integers, but it turns out that if the column contains missing values, pandas turns it into floats by default!
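Here is a minimal sketch of that behaviour, assuming a reasonably recent pandas. The series contents are made up for illustration, and the nullable "Int64" dtype at the end is just one possible way to keep the integers, not necessarily what we ended up using.

```python
import pandas as pd

# A column of whole numbers with no gaps gets an integer dtype.
complete = pd.Series([1, 2, 3])
print(complete.dtype)

# Add one missing value and the same column is upcast to float64,
# because the default integer dtype cannot represent NaN.
with_gap = pd.Series([1, None, 3])
print(with_gap.dtype)     # float64
print(with_gap.tolist())  # [1.0, nan, 3.0]

# The nullable extension dtype "Int64" (capital I) keeps the integers
# and marks the gap with pd.NA instead.
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable.dtype)     # Int64
```

Checking dtypes like this at the boundary is cheap, and it is exactly the kind of surprise a validation test surfaces on day one.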

When you have data coming out, you want to be really sure that what you spit out is actually what you want to spit out.

For example, we thought we were writing stringified ints like "1", but because of the float conversion mentioned above they came out as "1.0".
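An output check along these lines would have caught it. This is only a sketch, assuming pydantic v2; `OutputRecord`, `user_id` and `is_valid_output` are hypothetical names invented for the example.

```python
import pydantic


class OutputRecord(pydantic.BaseModel):
    # Hypothetical output schema: the id must be a string of digits, e.g. "1".
    user_id: str = pydantic.Field(pattern=r"^\d+$")


def is_valid_output(record: dict) -> bool:
    """Return True if the record matches the output schema."""
    try:
        OutputRecord.model_validate(record)
    except pydantic.ValidationError:
        return False
    return True


value = 1.0  # what you get back once the column has been upcast to float

print(str(value))                                     # 1.0
print(is_valid_output({"user_id": str(int(value))}))  # True
print(is_valid_output({"user_id": str(value)}))       # False
```

The pattern constraint is deliberately strict: anything that is not a plain run of digits fails before it gets written anywhere.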

The moral of the story: get some input and output checking up and running ASAP. Pydantic is great for this.
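One caveat on the input side: by default pydantic v2 validates in lax mode, which quietly coerces a whole-number float such as 1.0 into an int. If you want floats rejected outright, strict mode is one option. A minimal sketch, where `IncomingBatch`, `ids` and `accepts` are made-up names for illustration:

```python
import pydantic


class IncomingBatch(pydantic.BaseModel):
    # Hypothetical input schema; strict mode turns off type coercion.
    model_config = pydantic.ConfigDict(strict=True)

    ids: list[int]


def accepts(payload: dict) -> bool:
    """Return True if the payload passes strict validation."""
    try:
        IncomingBatch.model_validate(payload)
    except pydantic.ValidationError:
        return False
    return True


print(accepts({"ids": [1, 2, 3]}))        # True
print(accepts({"ids": [1.0, 2.0, 3.0]}))  # False: floats are not ints in strict mode
```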
