Benchmark
The embedded structure allows fast reads without loading the whole archive, which is the main advantage of this package.
In the following, we benchmark random read performance and compare it with the HDF5 format.
Data Generation
A 5000 × 5000 square matrix of random floating-point numbers is used. The matrix is dumped to disk with different configurations.
- For `msglc`, `small_obj_optimization_threshold` varies from 4KB to 4MB, and `numpy_encoder` is switched off so the matrix is stored as plain JSON instead of a binary blob.
- For `h5py`, the chunk size is computed so that each block has a size similar to `small_obj_optimization_threshold`. Compression is optionally switched on.
The following code snippets show the relevant functions.
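Since the exact scripts are not reproduced here, the sketch below shows one possible implementation of the two write paths. The msglc calls (`configure`, `dump`) and their import paths are assumptions based on the package's documented interface and may need adjusting; the `chunk_side` helper is purely illustrative; the h5py side uses the standard `create_dataset` call.

```python
# Sketch of the write benchmark (assumed msglc API; verify against the msglc docs).
import numpy as np
import h5py
from msglc import dump                 # assumed import path
from msglc.config import configure     # assumed import path

SIZE = 5000
matrix = np.random.rand(SIZE, SIZE)

def chunk_side(threshold_bytes: int) -> int:
    # Square chunk whose byte size (8 bytes per float64 element) roughly
    # matches the msglc small-object threshold.
    return max(1, int((threshold_bytes / 8) ** 0.5))

def write_msglc(path: str, threshold: int) -> None:
    # numpy_encoder is off, so the matrix is stored as plain JSON-like lists
    # rather than a binary blob (parameter names as described above).
    configure(small_obj_optimization_threshold=threshold, numpy_encoder=False)
    dump(path, matrix.tolist())

def write_h5py(path: str, threshold: int, compress: bool) -> None:
    side = chunk_side(threshold)
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "data",
            data=matrix,
            chunks=(side, side),
            compression="gzip" if compress else None,
        )
```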
The write time of msglc is roughly constant across configurations, because the packer always needs to traverse the whole JSON object.
Depending on the configuration, h5py requires different amounts of time to dump the matrix.

msglc is intended for data that is written to disk for cold storage and does not require frequent changes.
When compression is on, h5py also needs to traverse the whole object, just like msglc, and thus requires a similar amount of time.
Read Test
We mainly test random reads. To this end, we repeatedly read random locations in the matrix and measure the time required.
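One way to time such a test is sketched below. The `LazyReader` class, its import path, and the dot-separated `read()` access are assumptions about the msglc reader interface; the h5py side relies on ordinary chunked dataset indexing.

```python
# Sketch of the random-read timing loop (assumed msglc reader API).
import time
import numpy as np
import h5py
from msglc.reader import LazyReader    # assumed import path

SIZE = 5000

def random_read_msglc(path: str, count: int) -> float:
    rng = np.random.default_rng()
    start = time.perf_counter()
    with LazyReader(path) as reader:
        for _ in range(count):
            i, j = rng.integers(0, SIZE, size=2)
            # Assumed path-style access: fetch one element without
            # deserialising the whole archive.
            _ = reader.read(f"{i}.{j}")
    return time.perf_counter() - start

def random_read_h5py(path: str, count: int) -> float:
    rng = np.random.default_rng()
    start = time.perf_counter()
    with h5py.File(path, "r") as f:
        dataset = f["data"]
        for _ in range(count):
            i, j = rng.integers(0, SIZE, size=2)
            _ = dataset[i, j]  # only the chunk containing this element is read
    return time.perf_counter() - start
```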
The following is the result of reading 1000 elements.

The following is the result of reading 10000 elements.
