Reducing DuckDB Data Lake Storage by 52% with SciPy-style Sparse Arrays

Categories: duckdb, sparse
Author: Kian Ghodoussi
Published: September 21, 2025

It was pretty exciting to discover that nearly three-quarters of our storage volume consisted of near-zero float32s.

Background

At Sturdy Statistics, we work with sparsity-inducing hierarchical mixture models for unsupervised and semi-supervised text organization. These models are extensions of topic models, with the ability to add arbitrary levels of hierarchy and additional branches, and to embed metadata into the structure of the graph. Our models learn a set of high-level “topics” from the data and then annotate the original data with these topics. We use DuckDB to store this data in a tabular structure at the paragraph and document level.

Because we explicitly model the power-law behavior of text data, our priors adapt to a rich-get-richer scheme. This has the really nice side effect of enforcing sparsity. There are a number of analytical benefits to this sparsity, but for the purpose of this post, the primary benefit is some sweet storage and speed wins.

Sparsity

The next steps were:

  1. Write arrays into DuckDB in a sparse format
  2. Write a set of wrappers that enable me to operate on these sparse arrays as if they are standard DuckDB lists
  3. Benchmark performance and storage wins (I’m optimistic)

For (1), I decided to use SciPy-style sparse arrays (dimension: int, indices: list(int8), values: list(float32)). For example: [7, 0, 0, 0, 23] => (5, [1, 5], [7, 23])

Example Conversion in NumPy
import numpy as np

def numpy_to_sparse(x):
    # Zero out entries that are negligible relative to the largest value
    x = np.where(x < np.max(x) / 3e3, 0.0, x)
    # Keep the positions and values of the remaining non-zero entries
    inds = np.where(x != 0)[0]
    vals = x[inds]
    dim = len(x)
    return dim, inds.astype(np.uint16), vals.astype(np.float32)

def numpy_from_sparse(sparse):
    dim, inds, data = sparse
    # Scatter the stored values back into a dense float32 array
    arr = np.zeros(dim, dtype=np.float32)
    arr[inds] = data
    return arr
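As a quick sanity check, here is a round trip through the two helpers above on some made-up values; the threshold in numpy_to_sparse also drops the near-zero entry.

x = np.array([0.0, 0.4, 0.0, 0.0, 0.0001, 2.1], dtype=np.float32)

dim, inds, vals = numpy_to_sparse(x)
# 0.0001 falls below max(x) / 3e3 ≈ 0.0007, so it is dropped as near-zero
print(dim, inds, vals)                         # 6 [1 5] [0.4 2.1]

print(numpy_from_sparse((dim, inds, vals)))    # [0.  0.4 0.  0.  0.  2.1]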

When I started working on the wrapper functions, I pretty quickly had to decide whether to store this sparse format as one column or to split it up into three different columns. The benefit of storing it in one column is that I could fully abstract away the fact that these arrays are sparse. However, because DuckDB is a columnar data store, I opted to split out the indices and the values so that I had the flexibility to access either one alone without having to read both. Our data is sparse enough that the indices alone provide a mechanism to rapidly filter excerpts that contain a topic to some degree. If I stored the indices in the same column as the values, I would have to read roughly 5x as much data (int8 alone vs int8 + float32). The fact that I could think of one immediate performance win made my choice easy. I also fully removed the dimensionality parameter, opting to store it in the table metadata.

If my goal were to integrate this into DuckDB, I would have opted for the more ergonomic approach.
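For reference, here is a minimal sketch of what that two-column layout can look like in DuckDB. The table and column names (paragraph_topics, doc_id, topic_inds, topic_vals, model_meta) are illustrative rather than our production schema, and the types follow the uint16/float32 choice from the numpy snippet above.

import duckdb

con = duckdb.connect("sparse_demo.duckdb")

# One row per paragraph; indices and values live in separate list columns,
# so a query that only filters on topic membership never has to read the values.
con.execute("""
    CREATE TABLE IF NOT EXISTS paragraph_topics (
        doc_id     BIGINT,
        topic_inds USMALLINT[],  -- sparse indices (uint16)
        topic_vals FLOAT[]       -- corresponding float32 weights
    )
""")

# The dimensionality K is stored once as table-level metadata
# rather than being repeated on every row.
con.execute("CREATE TABLE IF NOT EXISTS model_meta (meta_key VARCHAR, meta_value VARCHAR)")
con.execute("INSERT INTO model_meta VALUES ('n_topics', '512')")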

I put together a few helper functions that enabled me to perform all the operations I needed. We use these functions internally and expose them publicly in our API.

Sparse SQL Functions
CREATE OR REPLACE TEMPORARY MACRO sparse_list_extract(position, inds, vals) AS  
    COALESCE(vals[list_position(inds, position)], 0);

CREATE OR REPLACE TEMPORARY MACRO sparse_list_select(positions, inds, vals) AS  
    list_transform(positions, x-> sparse_list_extract(x, inds, vals));

CREATE OR REPLACE TEMPORARY MACRO dense_x_sparse_dot_product(arr, inds, vals) AS 
    list_dot_product(list_select(arr, inds)::DOUBLE[], vals::DOUBLE[]);

CREATE OR REPLACE TEMPORARY MACRO sparse_to_dense(K, inds, vals) AS 
    list_transform(
        generate_series(1, K), 
        x-> coalesce(list_extract(vals, list_position(inds, x)), 0)
    );

These helpers replace array indexing, multi-indexing, and converting sparse lists to dense. I was surprised by how rarely I actually needed to perform that last conversion: instead, for operations such as Hellinger distance or dot products, I was able to reduce the dimensionality of the arrays to the overlap in sparse index values (spoiler: performance wins are upcoming).
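As an example of how these read in practice, here is roughly how the first macro gets used from Python. The table and column names match the illustrative schema sketched earlier, and the temporary macros are assumed to have already been created on this connection.

import duckdb

con = duckdb.connect("sparse_demo.duckdb")
# ... run the CREATE MACRO statements above on this connection first ...

# Weight of topic 7 for every paragraph, with 0 filled in when the topic is absent.
con.sql("""
    SELECT doc_id,
           sparse_list_extract(7, topic_inds, topic_vals) AS topic_7_weight
    FROM paragraph_topics
""").show()

# The indices alone are enough to filter to paragraphs that contain topic 7 at all,
# without ever reading the values column.
con.sql("""
    SELECT doc_id
    FROM paragraph_topics
    WHERE list_contains(topic_inds, 7)
""").show()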

Benchmarking the Code

My first attempt was to benchmark DuckDB file sizes at various levels of sparsity. I wanted a really clear and granular understanding of how sparsity and dimensionality affect storage, so I decided to put together some artificial data to benchmark the results.

Hint: this was a bad idea

I used numpy to generate a KxD matrix where K is the number of topics (the embedding dimension in neural terms) and D is the number of documents. I set the first S rows to 1/K and the rest to zero. I took these matrices and created two DuckDB files: one with raw storage and the other sparsely encoded.
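Here is a rough reconstruction of that setup rather than my exact script: it reuses numpy_to_sparse from above, and the sizes and file names (raw.duckdb, sparse.duckdb) are made up for illustration.

import os
import numpy as np
import pandas as pd
import duckdb

K, D, S = 512, 10_000, 8          # topics, documents, non-zero entries per document
dense = np.zeros((K, D), dtype=np.float32)
dense[:S, :] = 1.0 / K            # first S rows are 1/K, the rest stay zero

docs = [col for col in dense.T]   # one K-dimensional vector per document

# Raw storage: one K-dimensional float32 list per document.
df_raw = pd.DataFrame({"topic_vals": [d.tolist() for d in docs]})
con = duckdb.connect("raw.duckdb")
con.execute("CREATE TABLE topics AS SELECT topic_vals::FLOAT[] AS topic_vals FROM df_raw")
con.close()

# Sparse storage: indices and values split across two much shorter lists.
triples = [numpy_to_sparse(d) for d in docs]
df_sparse = pd.DataFrame({
    "topic_inds": [inds.tolist() for _, inds, _ in triples],
    "topic_vals": [vals.tolist() for _, _, vals in triples],
})
con = duckdb.connect("sparse.duckdb")
con.execute(
    "CREATE TABLE topics AS "
    "SELECT topic_inds::USMALLINT[] AS topic_inds, topic_vals::FLOAT[] AS topic_vals FROM df_sparse"
)
con.close()

print(os.path.getsize("raw.duckdb"), os.path.getsize("sparse.duckdb"))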

I was shocked to see that there was virtually no difference in the file size. I was pretty stunned that storing a 512-dimensional array in DuckDB was no different from storing two 8-dimensional arrays.

In my eagerness to get an easy 75% reduction in file size, I completely overlooked the fact that DuckDB runs Snappy compression on its files. This immediately deflated my initial optimism.

However, when it comes to compression benchmarks, compressing contiguous stretches of zeros and ones is about as easy as it gets. I decided to conduct some more rigorous benchmarks using random numbers in random places. I saw a slightly better result, achieving an average file size reduction of 80%.

Note: at this point my entire DuckDB file was a single database table consisting of a single list column, stored either raw or sparsely across two columns.

This was a solid win, albeit one much less than I hoped for. But it was enough to see this project through to the next stage.

Real Data (or the only benchmark that mattered)

I downloaded a random subset of our production DuckDB files to migrate to our sparse format for storage and performance benchmarking. I repurposed my previous script to read the original DuckDB files, convert the relevant list fields to a sparse format, and write a new file. The result made me simultaneously excited, confused, and hesitant.

This was a major reduction in our file sizes, with over a 50% reduction for some of our largest files. What happened?

I previously mentioned that my test focused on a single column at a time. In production, we store several sparsely populated arrays alongside whatever arbitrary metadata our integrations or a user might upload. I was expecting this extra data to dilute the total file size reduction. It appears instead that the additional data handicapped the compression of the raw arrays to such a degree that we saw storage wins well beyond my previous benchmarks, not just relative to the array storage but even relative to the entire file size.

While we didn’t see the 75% reduction in size I was hoping for due to compression, we were able to achieve an average 52% reduction across all files, with the largest reductions coming from the largest files.

Scaling with Dimensionality

As I, to some extent, expected, the greater the dimensionality of the array, the greater the storage win. Because we tend to increase the number of topics on larger datasets, our largest datasets also have the largest arrays which led to the largest reduction in size.

While the large-dataset performance is a big win, we care nearly as much, if not more, about general performance because we support so many use cases. Our public Deep Dive Search creates a few hundred indices per day, split between our fast mode (48 dimensions) and our comprehensive mode (96 dimensions).

Our production platform integrations train a slightly lower number of analyses per day, with a default of 196 dimensions.

Our large-scale models tend to be a more deliberate effort, with maybe 1-2 trained per week. Obviously we hope each of these numbers goes up over time, but at the moment the total amount of storage is split relatively evenly between these categories of models.

Performance

I was both optimistic and concerned about the impact of sparse storage on performance. It turned out that we saw no observable changes in our search query performance or our topic retrieval performance. This was likely because I was able to leverage my dense_x_sparse_dot_product SQL function to avoid casting any sparse arrays to dense.
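For context, this is roughly the shape of those queries, again using the illustrative schema from earlier: the dense query vector is only ever indexed at the sparse positions inside the macro, so nothing gets expanded back into a dense 512-dimensional list.

import duckdb

con = duckdb.connect("sparse_demo.duckdb")
# Assumes the temporary macros above have been created on this connection.

# A dense K-dimensional query vector, built as a literal for simplicity;
# here it puts all of its weight on a single position.
query_vec = [0.0] * 512
query_vec[6] = 1.0
qv_literal = "[" + ", ".join(str(v) for v in query_vec) + "]"

top_docs = con.execute(f"""
    SELECT doc_id,
           dense_x_sparse_dot_product({qv_literal}, topic_inds, topic_vals) AS score
    FROM paragraph_topics
    ORDER BY score DESC
    LIMIT 10
""").fetchall()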

Our TopicSearch and TopicDiff APIs, however, saw extremely meaningful performance degradations (2-5x slower than before). These APIs are much more analytical, computing statistics on top of our topic counts, and they expect to operate on dense arrays. In my first pass I simply converted the sparse arrays to dense, which obviously added significant overhead.

However, our new sparse storage format opened the door to leveraging UNNEST much more efficiently, thanks to the lower dimensionality of the arrays and the separate indices and values columns. Because of sparse storage, an UNNEST that would previously have expanded my number of rows by a factor of 500 now expanded it by a factor of 3-8.
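Sketched against the same illustrative schema, the pattern looks something like this: each row unnests only its handful of non-zero (index, value) pairs rather than all K topics, and the aggregation then happens entirely inside DuckDB.

import duckdb

con = duckdb.connect("sparse_demo.duckdb")

# Total weight and paragraph count per topic, computed from the sparse columns.
con.sql("""
    SELECT topic_ind,
           SUM(topic_val) AS total_weight,
           COUNT(*)       AS n_paragraphs
    FROM (
        -- UNNESTs in the same SELECT are zipped, so index i stays aligned with value i
        SELECT doc_id,
               UNNEST(topic_inds) AS topic_ind,
               UNNEST(topic_vals) AS topic_val
        FROM paragraph_topics
    )
    GROUP BY topic_ind
    ORDER BY total_weight DESC
""").show()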

This not only sped up my DuckDB code but also allowed me to migrate a major portion of my Python post-processing code to DuckDB. With some tinkering I was able to see a solid 2-10x performance improvement, with the larger improvements occurring on the larger datasets.

Conclusion

My main conclusion, one that I’ve learned before and will doubtless learn again, is that the only benchmark that matters is on your own data. Simulations and estimates are helpful thought exercises, but their results are not the ones that matter.

The other conclusion: one of my goals was to make it as seamless as possible to treat sparse arrays like dense arrays. However, a sparse array is a completely different data structure, and it opens the door to a lot of performance improvements if you really take advantage of how it is stored under the hood.

I have these sparse array functions saved in a very small GitHub repo. I also put together some documentation with examples. We plan to reach out to DuckDB in the near future to see if there is any interest in supporting sparse-style arrays. If anyone else is interested in running with this idea in the short term, either for their own project or as an open source contribution, I’d love to chat or help out however I can.