Showcase: Processing 10M H01 Synapses in 24m on a 13-Year-Old Workstation (Alternative to JVM/Spark)

Hey everyone,

I’ve been working heavily with the H01 human cortex dataset recently, and I found myself incredibly frustrated by the memory bloat and massive overhead of traditional big-data frameworks when trying to parse the raw JSON exports into structured Parquet.

To avoid the need for an expensive HPC cluster or heavy cloud instances, I built a lightweight, low-latency distributed pipeline designed to run on resource-constrained hardware. The routing logic is small enough that tasks could, in theory, execute on edge devices with as few as 2 cores and as little as 4 MB of RAM.
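To give a sense of how small the coordination side can be, here's a rough sketch (illustrative only, not the actual pipeline code; the paths and function name are made up) of how an agent could claim work units over a shared mount using an atomic marker file, so no central lock server or broker is needed:

```python
# Hypothetical lease-claiming loop for an agent. Illustration only: SHARD_DIR,
# LEASE_DIR and claim_shard() are invented names, not the real pipeline API.
import os
import socket
from pathlib import Path

SHARD_DIR = Path("/mnt/shared/h01_exports")   # assumed shared storage mount
LEASE_DIR = SHARD_DIR / "leases"

def claim_shard():
    """Try to claim one unprocessed export file by atomically creating a lease marker."""
    LEASE_DIR.mkdir(exist_ok=True)
    for shard in sorted(SHARD_DIR.glob("*.json")):
        lease = LEASE_DIR / (shard.name + ".lease")
        try:
            # O_CREAT | O_EXCL fails if another agent already holds this lease,
            # so the filesystem itself arbitrates who gets the shard.
            fd = os.open(lease, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        os.write(fd, socket.gethostname().encode())
        os.close(fd)
        return shard
    return None  # nothing left to claim
```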

I wanted to share a recent benchmark to see if anyone else in the community is exploring decentralized edge computing for these types of heavy connectome pipelines.

The Benchmark: I ran this on a 13-year-old Dell workstation to stress-test the memory footprint and I/O efficiency.

  • Input: ~7.1 GB of semi-structured JSON (10 export files, ~9.99 million raw synapse records).

  • Output: ~405 MB of structured, columnar Parquet data (~962k rows per shard).

  • Processing Time: ~24 minutes.

  • Peak Memory Footprint: 16.2 GB (utilizing only 6% of available system RAM).

The architecture uses a decentralized Controller/Agent leasing model to prevent I/O bottlenecks. The agents drop dead-weight columns (like empty contact_area fields) inline, execute the transformation entirely in memory, and write directly to shared storage.
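For anyone curious what the agent-side transform looks like conceptually, here's a minimal sketch. It is not the actual implementation; I'm assuming newline-delimited JSON records, a consistent field set across records, and contact_area as the dead-weight column, and it streams in bounded batches so peak memory stays flat:

```python
# Illustrative agent-side transform: stream JSON records, drop dead-weight
# columns inline, and write one Parquet shard. Assumes newline-delimited JSON
# and that every record carries the same fields; the real exports may differ.
import json
import pyarrow as pa
import pyarrow.parquet as pq

DROP_COLUMNS = {"contact_area"}   # dead-weight fields discarded inline
BATCH_ROWS = 200_000              # cap on rows held in memory per write

def shard_to_parquet(src_path: str, dst_path: str) -> None:
    writer = None
    batch = []

    def flush():
        nonlocal writer
        if not batch:
            return
        table = pa.Table.from_pylist(batch)
        if writer is None:
            writer = pq.ParquetWriter(dst_path, table.schema, compression="zstd")
        writer.write_table(table)
        batch.clear()

    with open(src_path) as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            # Strip dead-weight columns before they ever reach the Arrow table.
            batch.append({k: v for k, v in record.items() if k not in DROP_COLUMNS})
            if len(batch) >= BATCH_ROWS:
                flush()
        flush()

    if writer is not None:
        writer.close()
```

The batching is the main point: each agent only ever materializes one small slice of its shard, which is why the footprint stays well below the input size even on old hardware.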

Would love to hear if anyone else is tackling ConnectomeDB/H01 bottlenecks without massive clusters, or if you have thoughts on optimizing these output schemas for downstream graph embeddings!