When in doubt, batch requests
When in doubt, always try to batch your operations.
In the context of machine learning operations (MLOps), I’ve learned that batching requests is almost always the better answer. Why? Mainly because network hops are expensive and inference is optimized for matrix math. If you must choose between 100 single requests or one batched request of similar total size, choose the batched request. The batched request takes longer in absolute terms, but amortized over the number of requests in the batch, the time per request is shorter.
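To make that trade-off concrete, here is a minimal sketch comparing the two approaches against a hypothetical HTTP prediction service (the `/predict` and `/predict_batch` endpoints and the localhost URL are assumptions, not a real API): every single request pays the network and serialization overhead again, while the batched request pays it once.

```python
import time

import requests  # assumption: the service speaks plain HTTP/JSON

API = "http://localhost:8000"  # hypothetical endpoint, not a real service

def predict_one_by_one(items):
    # 100 network hops: each request pays connection, serialization and queueing overhead again
    return [requests.post(f"{API}/predict", json={"item": x}).json() for x in items]

def predict_batched(items):
    # 1 network hop: the overhead is paid once and amortized over the whole batch
    return requests.post(f"{API}/predict_batch", json={"items": items}).json()

items = list(range(100))

start = time.perf_counter()
predict_one_by_one(items)
print("single requests:", time.perf_counter() - start)

start = time.perf_counter()
predict_batched(items)
print("batched request:", time.perf_counter() - start)
```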
Let me share two real examples:
- Example 1 (Model state storage): We built an API that had to do a `get_state()` call to a database somewhere else. We did this around 200 times per batched request, which is a lot. In the end we parallelized this operation, but if we had built out a `get_state_batch()`, I’m sure it would have been even faster. We did exactly that with `set_state()`, for which we had a `set_state_batch()`, and it was orders of magnitude faster (see the first sketch after this list).
- Example 2 (Model inference): We built an API that did model inference, and we took some engineering time to make this API operate on batched requests. This took some cross-team effort because of request multiplexing and load balancing, but in the end we managed to squeeze 100 requests together into a single big request. Because of that, we cut our machines down from 5 to 1, resulting in large cost savings. Total time per request was larger, but the average latency over the individual requests was a lot shorter (see the second sketch after this list).
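For the state-storage case, here is a minimal sketch of what a `get_state_batch()` could look like. The `db` client is an assumption (anything Redis-like with a bulk `mget` read next to the single-key `get` would do); only the `get_state`/`get_state_batch` names come from the post.

```python
# Hypothetical sketch: fetching model state for ~200 keys per batched request.
# `db` stands in for whatever database client the API uses; it is assumed to
# expose a bulk read (mget) next to the single-key read (get).

def get_state(db, key):
    # one network round trip per key -> ~200 round trips in total
    return db.get(key)

def get_state_batch(db, keys):
    # one network round trip for all keys; the fan-out happens server-side
    return db.mget(keys)

# Usage:
#   states = [get_state(db, k) for k in keys]   # slow: N network hops
#   states = get_state_batch(db, keys)          # fast: 1 network hop
```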
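For the inference case, a small sketch of why batching helps on the compute side: one matrix multiply over a stacked batch replaces 100 small ones, which is exactly what inference libraries and hardware are optimized for. The "model" below is just a random weight matrix, purely for illustration.

```python
# Illustration only: the "model" is a random weight matrix; the point is that
# one matrix multiply over a stacked batch replaces 100 small ones.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((512, 10))      # stand-in for a trained model

def infer_one(features):
    return features @ weights                 # (512,) @ (512, 10) -> (10,)

def infer_batch(batch):
    return batch @ weights                    # (100, 512) @ (512, 10) -> (100, 10)

features = rng.standard_normal((100, 512))    # 100 incoming requests

one_by_one = np.stack([infer_one(x) for x in features])
batched = infer_batch(features)
assert np.allclose(one_by_one, batched)       # same answers, far less per-call overhead
```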
See also Request Batch on martinfowler.com