I Thought This Would Take 20 Minutes

engineering
performance
interoperability

Author: Mike McCourt
Published: February 1, 2026

A short story about good abstractions colliding

I recently lost an entire day to a data ingestion task that I was confident would take about twenty minutes.

This post is about that day.

I’m not criticizing the tools: Clojure, tech.v3.dataset, tmducken, DuckDB, and the NumPy file format are all excellent. I use them because they’re well-designed and reliable. But this simple-seeming task hit a perfect storm of incompatibilities among several (individually) good abstractions. Along the way, I learned a lot about how representation and language boundaries can kill performance.

The Problem

I’m working in Clojure, and I need to import a NumPy binary file into a DuckDB table. My .npy file contains a large 2D array. The array is:

  • Row-major (C order)
  • Little-endian
  • float32
  • Small enough to easily fit into memory
  • Large enough that boxing every element is prohibitive (i.e., I can’t afford to represent each number as an independent heap object.)

(I wasn’t using Python or NumPy itself; I routinely use the NumPy format in our C programs because it’s simple, well-specified, and trivial to map directly into memory.)

My goal was to load this data into DuckDB with the following schema:

  • A single column
  • Each row contains an array (vector) of float32s
  • Eventually converted to a sparse array

Conceptually, this is a very natural representation: in machine learning terms, each row is an observation, and the list is a fixed-length feature vector. While DuckDB is a column-major database, this isn’t an incompatibility: a column of vectors is physically stored in a way that mirrors a row-major file. Because the logical structure of the database matches the physical layout of my NumPy file, I should have been able to bulk-append the data with zero copying, zero transposing, and zero boxing.
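
For concreteness, the target schema as DuckDB DDL looks roughly like this; the table name and the fixed vector width (384) are placeholders for illustration:

;; Placeholder DDL for the target: a single fixed-width float32 array column.
(def target-ddl
  "CREATE TABLE observations (c1 FLOAT[384])")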

This seemed like a straightforward plumbing task.

Decoding NumPy (The Easy Part)

The first step was decoding the raw NumPy payload into a Java primitive array.

My NumPy files are little-endian; the JVM’s default byte order is big-endian. That means one pass over the data to decode and byte-swap. This is annoying, but straightforward and unavoidable when moving data from C to the JVM.
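
Before touching the payload, the .npy header has to be read to learn the dtype, shape, and ordering. The format is genuinely simple: a magic string, a version, a little-endian header length, and an ASCII Python-dict literal. Here’s a minimal sketch that assumes a well-formed version-1.0 header (the regex handling is deliberately naive):

(defn npy-header
  "Minimal sketch of parsing a v1.0 .npy header: a 6-byte magic, 2 version
   bytes, a little-endian uint16 header length, then an ASCII dict literal
   describing dtype, ordering, and shape."
  [^bytes bs]
  (let [magic      (String. bs 1 5)
        header-len (bit-or (bit-and (aget bs 8) 0xff)
                           (bit-shift-left (bit-and (aget bs 9) 0xff) 8))
        dict       (String. bs 10 (int header-len))]
    (assert (= "NUMPY" magic) "not a .npy file")
    {:descr         (second (re-find #"'descr':\s*'([^']+)'" dict))
     :fortran-order (some? (re-find #"'fortran_order':\s*True" dict))
     :shape         (mapv #(Long/parseLong %)
                          (re-seq #"\d+"
                                  (second (re-find #"'shape':\s*\(([^)]*)\)" dict))))
     :data-offset   (+ 10 header-len)}))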

For floats, the process roughly looks like this:

(ns ...   ; namespace name elided
  (:require
    [sturdy.fs :as sfs])
  (:import
    (java.nio ByteBuffer ByteOrder)))

(let [^bytes bs (sfs/slurp-bytes file)      ; payload bytes (header handling elided)
      bb        (doto (ByteBuffer/wrap bs)
                  (.order ByteOrder/LITTLE_ENDIAN))
      n         (quot (alength bs) 4)       ; 4 bytes per float32
      arr       (float-array n)]
  (dotimes [i n]
    (aset-float arr i (.getFloat bb)))
  arr)

Once decoded, I had a Java primitive array containing the full row-major payload. Loading this into a tech.v3.dataset was straightforward and required no copying or transposing:

(require '[tech.v3.dataset :as ds])
(require '[tech.v3.datatype :as dtype])

(let [buf      (dtype/->buffer data)
      rowviews (mapv (fn [r]
                       ;; zero-copy view into buf
                       (dtype/sub-buffer buf (* r cols) cols))
                     (range rows))]
  (ds/->dataset {:c1 rowviews}))

Here’s the resulting function in action:

(np/npy->dataset-rowlists "test-resources/npy-fixtures/shape_2x3__dtype_f4.npy")
;; => _unnamed [2 1]:
;;    |               :c1 |
;;    |-------------------|
;;    | [-3.0 -1.75 -0.5] |
;;    |   [0.75 2.0 3.25] |

This representation matches my target schema exactly: one column, where each row is a list of floats. Since tech.v3.dataset supports list-valued columns, the path forward seemed trivial.

The plan was to simply pass this dataset to tmducken and call insert-dataset!.
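
In code, that plan was only a few lines. Here’s a rough sketch, with connection setup abbreviated and with the caveat that the exact tmducken signatures may differ slightly (`dataset` is the list-valued dataset built above):

(require '[tmducken.duckdb :as duckdb])

(duckdb/initialize!)
(def conn (-> (duckdb/open-db "observations.duckdb")   ; placeholder path
              (duckdb/connect)))

(duckdb/create-table! conn dataset)   ; schema derived from the dataset
(duckdb/insert-dataset! conn dataset) ; the call I expected to just work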

Unfortunately, I hit a hard wall: tmducken currently cannot append datasets with list-valued columns.

The False Start: Fighting the Formats

That one limitation changed everything. To get the data in, I had to transform it into a format tmducken could accept: a “wide” dataset with hundreds of scalar columns (one per feature). This meant I had to:

  1. Transpose the data in Clojure (from row-major list to column-major scalars).
  2. Ingest the wide table.
  3. Transpose again inside DuckDB (collapsing columns back into a list) using SQL.

The Clojure Transpose

Transposing a large dataset in a dynamic language without boxing is tricky. I ended up writing a macro to generate type-specific loops to maintain primitive performance:

(defmacro def-transposer
  "Define a transpose fn for a primitive array type.
   src-tag is one of: bytes, shorts, ints, longs, floats, doubles
   array-ctor is e.g. float-array
   aset-fn is e.g. aset-float"
  [fname src-tag array-ctor aset-fn]
  (let [src  (with-meta (gensym "src")  {:tag src-tag})
        rows (gensym "rows")
        cols (gensym "cols")
        dsts (gensym "dsts")
        dstj (with-meta (gensym "dst") {:tag src-tag})]
    `(defn ~fname
       [~src ~rows ~cols]
       (let [~dsts (vec (repeatedly ~cols #(~array-ctor ~rows)))]
         (loop [i# 0
                base# 0]
           (when (< i# ~rows)
             (loop [j# 0]
               (when (< j# ~cols)
                 (let [~dstj (~dsts j#)]
                   (~aset-fn ~dstj i# (aget ~src (+ base# j#))))
                 (recur (long (inc j#)))))
             (recur (long (inc i#)) (long (+ base# ~cols)))))
         ~dsts))))

Which I then instantiated once per primitive type:

(def-transposer transpose-float  floats  float-array  aset-float)
(def-transposer transpose-double doubles double-array aset-double)
(def-transposer transpose-byte   bytes   byte-array   aset-byte)
(def-transposer transpose-short  shorts  short-array  aset-short)
(def-transposer transpose-int    ints    int-array    aset-int)
(def-transposer transpose-long   longs   long-array   aset-long)
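
Wiring a transposer into the “wide” dataset then looks roughly like this; `data`, `rows`, and `cols` are the decoded payload and dimensions from earlier, and the generated column names (plus the glossed-over column ordering) are illustrative:

;; Build the wide dataset: one scalar column per feature.
(let [col-arrays (transpose-float data rows cols)
      col-names  (mapv #(keyword (str "f" %)) (range cols))]
  (ds/->dataset (zipmap col-names col-arrays)))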

I did the same for a “split” operation (used for Fortran-ordered arrays), and added a small dispatch table to select the correct implementation based on an array’s order and dtype.
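
The dispatch table itself is nothing fancy. A sketch, where the `split-*` functions are the Fortran-order counterparts mentioned above and the keys and names are illustrative:

;; Pick the reshaping fn from the array's element type and memory order.
(def ^:private reshape-fns
  {[:float32 :c]       transpose-float
   [:float64 :c]       transpose-double
   [:int32   :c]       transpose-int
   [:float32 :fortran] split-float
   [:float64 :fortran] split-double
   [:int32   :fortran] split-int})

(defn array->columns
  [data {:keys [dtype order rows cols]}]
  (if-let [reshape (reshape-fns [dtype order])]
    (reshape data rows cols)
    (throw (ex-info "unsupported dtype/order" {:dtype dtype :order order}))))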

It worked, and it was fast enough, but it was a lot of machinery for a task that conceptually shouldn’t exist.

The SQL Re-Transpose

Once the wide data was in DuckDB, I had to fold it back into the target schema. Because the table had hundreds of columns, I had to build the SQL programmatically:

(defn wide->array
  [conn src-table-name dst-table-name {:keys [dtype col-names]}]
  (let [sql-type (duckdb-type dtype)
        num-cols (count col-names)
        arr-spec (->> col-names
                      (map name)
                      (string/join ", ")
                      (format "array_value(%s)"))
        sql'     (format "SELECT line_no, %s::%s[%d] AS vals FROM %s ORDER BY line_no"
                         arr-spec
                         sql-type
                         num-cols
                         src-table-name)
        sql      (format "CREATE OR REPLACE TEMPORARY TABLE %s AS %s"
                         dst-table-name
                         sql')]
    (duckdb/run-query! conn sql)))
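
For a toy example with three hypothetical feature columns f0, f1, and f2 (and placeholder table names), the generated SQL comes out like this:

;; CREATE OR REPLACE TEMPORARY TABLE observations AS
;; SELECT line_no, array_value(f0, f1, f2)::FLOAT[3] AS vals
;; FROM observations_wide ORDER BY line_no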

This worked, but it felt inelegant. DuckDB is optimized for columnar storage, but managing hundreds of individual column streams simultaneously—only to immediately collapse them—creates significant metadata overhead and memory pressure.

I had successfully fought the JVM’s type system on one side and the SQL schema on the other. But the result was two transposes and a copy, just to end up exactly where I started.

The Real Solution: Stop Fighting

Stepping back, I realized I was forcing a square peg into a round hole through sheer brute force. The better solution was to abandon the “table” concept entirely during ingestion and treat the data as a sparse coordinate list.

Instead of transposing the array, I can unnest the data as I read it. I decode the stream and accumulate row and column indices for every value. This requires no transpose, doesn’t explode my column count, and naturally handles both C-ordered and Fortran-ordered arrays.
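
A minimal sketch of the idea, reusing the decoded primitive array and dimensions from earlier. For brevity this version boxes the accumulated triples; the real code keeps everything primitive and also handles Fortran order and the other dtypes:

(defn dense->coo
  "Sketch: row-major float array -> {:row_no :col_no :val} triples,
   skipping zeros. Boxes each triple for brevity."
  [^floats data rows cols]
  (let [acc (java.util.ArrayList.)]
    (dotimes [i rows]
      (dotimes [j cols]
        (let [v (aget data (+ (* i cols) j))]
          (when-not (zero? v)
            (.add acc {:row_no i :col_no j :val v})))))
    (ds/->dataset acc)))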

Here’s my new function in action:

(npy->dataset-unnested-nz "test-resources/npy-fixtures/shape_2x3__dtype_u4.npy")
;; => _unnamed [6 3]:
;;    | :row_no | :col_no |       :val |
;;    |--------:|--------:|-----------:|
;;    |       0 |       0 | 4294967291 |
;;    |       0 |       1 |         12 |
;;    |       0 |       2 |         29 |
;;    |       1 |       0 |         46 |
;;    |       1 |       1 |         63 |
;;    |       1 |       2 |         80 |

Unnesting expands the storage somewhat—4 bytes for a single float32 becomes 14 bytes per element (value + row index + col index). However, my data is sparse, so this form allows me to simply skip zero values. This offsets the extra storage cost and, in my case, actually saved memory.

The Results

Switching to this approach was a massive win.

  • Complexity: The transposing macros and dynamic SQL generation vanished.
  • Performance: The ingestion time dropped from ~850 ms to under 80 ms—a more than 10x speedup.

More importantly, the solution (almost) respects the structure the data already has. Instead of fighting each language on either side of the translation, the pipeline now moves the data more naturally from its source to its destination.

The Lesson

The lesson I’m taking away isn’t “use UNNEST” — it’s that intractable performance problems often come from abstractions meeting at the wrong boundary. (Though “always use UNNEST” isn’t a bad takeaway, either.)

The data started out contiguous, typed, and structured. In my first attempt, as it crossed library boundaries, those guarantees were erased, and most of my effort went into reconstructing them. The “obvious” path of treating the data as a dense table forced me to fight against the physical reality of how the data was stored.

By shifting the perspective to a coordinate list, the friction disappeared.

If you’re interested, I wrapped this work into a small, open-source Clojure library for reading NumPy .npy files and moving them efficiently into columnar systems. You can find the code on GitHub, or browse the documentation on cljdoc.