How to efficiently compress a list of integers with a maximum value in Python
Here's a quick piece of code that I've been using repeatedly in my projects. Currently, I'm working on storing large amounts of data efficiently, dealing with millions of records that need to fit into packets under 4 KB. This makes compression and optimized storage crucial. Since I've found this technique to be quite handy, I thought it'd be helpful to share it with you all, hoping you find it useful as well.
Imagine you have a list of integers lst = [1, 2, 3, ..., 100, 2, 3, 100] in Python that you need to compress, and you also know that this list has a maximum value (100 in this case).
To compress and store this array, you can do the following: pack it into bytes, compress it with zlib, and then base64 encode it.
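Before wrapping this in a class, the three steps can be sketched as two plain functions. This is only a minimal illustration: the names encode_ints and decode_ints are made up for this example, and it assumes every value fits in a 2-byte unsigned short (0 to 65535).

```python
import base64
import struct
import zlib
from typing import List


def encode_ints(values: List[int]) -> str:
    """Pack, compress, and base64 encode a list of small non-negative ints.

    Hypothetical helper: assumes every value is in the range 0..65535.
    """
    # Step 1: pack as little-endian ("<") 2-byte unsigned shorts ("H")
    packed = struct.pack(f"<{len(values)}H", *values)
    # Step 2: compress the raw bytes with zlib
    compressed = zlib.compress(packed)
    # Step 3: base64 encode so the result is a plain ASCII string
    return base64.b64encode(compressed).decode("ascii")


def decode_ints(text: str) -> List[int]:
    """Reverse the three steps of encode_ints."""
    raw = zlib.decompress(base64.b64decode(text))
    # Every value takes exactly 2 bytes, so the count is len(raw) // 2
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


lst = [1, 2, 3, 100, 2, 3, 100]
assert decode_ints(encode_ints(lst)) == lst
```

The round trip at the end checks that decoding gives back the original list. How much space you actually save depends on how repetitive the data is.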
import base64
import struct
import zlib
from typing import Iterable, List, Union


class CappedIntBlob(List[int]):
    """Capped integer Base64 Large Object, aka CappedIntBlob.

    Represents a list of 2-byte unsigned short integers, capped at max_val.
    When converted to str, all integers are packed into 2-byte values,
    compressed with zlib, and then base64 encoded.
    The constructor accepts either a base64 string or an iterable of integers.
    """

    def __init__(self, contents: Union[str, Iterable[int]], max_val: int = 10):
        """Constructor

        @param contents: the contents, either an iterable of ints or a base64 string
            holding zlib-compressed 2-byte unsigned short integers.
        @param max_val: upper bound; larger values are clamped to it. Must not
            exceed 65535, the largest value an unsigned short can hold.
        """
        self.max = max_val
        lst: Iterable[int]
        if isinstance(contents, str):
            # If input is a string, decode and decompress it
            bs: bytes = zlib.decompress(base64.b64decode(contents))
            # < = little endian, H = 2-byte unsigned short
            lst = struct.unpack("<" + "H" * (len(bs) // 2), bs)
        elif isinstance(contents, Iterable):
            # Clamp every value to max_val
            lst = [min(int(x), self.max) for x in contents]
        else:
            raise ValueError("Expecting a string or an iterable for contents")
        super().__init__(lst)

    def __str__(self):
        """Convert to string

        @return: base64 string encoding of the zlib-compressed list of ints
            in 2-byte unsigned short representation.
        """
        packed = struct.pack("<" + "H" * len(self), *self)
        return base64.b64encode(zlib.compress(packed)).decode("utf-8")
And this is how you use it:
>>> CappedIntBlob([1,2,3])
[1, 2, 3]
>>> str(CappedIntBlob([1,2,3]))
'eJxjZGBiYGYAAAAaAAc='
>>> CappedIntBlob("eJxjZGBiYGYAAAAaAAc=")
[1, 2, 3]
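Values above max_val (10 by default) are clamped, so these two lists encode to the same string:
assert str(CappedIntBlob([5, 10, 20])) == str(CappedIntBlob([5, 100, 200]))  # passes: both are stored as [5, 10, 10]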
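Since the maximum value is known up front, one possible variation is to pick the narrowest struct format that can hold it, instead of always using 2 bytes. The sketch below is not part of CappedIntBlob; format_for_max and pack_capped are hypothetical names, and it assumes non-negative integers. Whether this shrinks the final zlib output depends on the data, so treat it as something to measure rather than a guaranteed win.

```python
import struct
from typing import Iterable


def format_for_max(max_val: int) -> str:
    """Return the narrowest unsigned struct format code that can hold max_val.

    Hypothetical helper for illustration only.
    """
    if max_val < 0:
        raise ValueError("max_val must be non-negative")
    if max_val <= 0xFF:
        return "B"  # 1-byte unsigned char
    if max_val <= 0xFFFF:
        return "H"  # 2-byte unsigned short
    if max_val <= 0xFFFFFFFF:
        return "I"  # 4-byte unsigned int
    return "Q"  # 8-byte unsigned long long


def pack_capped(values: Iterable[int], max_val: int) -> bytes:
    """Clamp values to max_val and pack them little-endian at the chosen width."""
    code = format_for_max(max_val)
    clamped = [min(int(v), max_val) for v in values]
    return struct.pack(f"<{len(clamped)}{code}", *clamped)


# With a cap of 100 each value needs 1 byte; with a cap of 1000 it needs 2
assert len(pack_capped([5, 100, 200], 100)) == 3
assert len(pack_capped([5, 100, 200], 1000)) == 6
```

The decoder has to know the same max_val (or the format code) to unpack the bytes correctly, so it would need to be stored or agreed on alongside the data.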