Hey everyone,
I’ve been working heavily with the H01 human cortex dataset recently, and I found myself incredibly frustrated by the memory bloat and overhead of traditional big-data frameworks when parsing the raw JSON exports into structured Parquet.
To bypass the need for an expensive HPC cluster or heavy cloud instances, I built a highly optimized, low-latency distributed pipeline designed to run on resource-constrained hardware. The underlying routing logic is lightweight enough that tasks can theoretically execute on edge devices with as little as 2 cores and 4 MB of RAM.
I wanted to share a recent benchmark to see if anyone else in the community is exploring decentralized edge computing for these types of heavy connectome pipelines.
The Benchmark: I ran this on a 13-year-old Dell workstation to stress-test the memory footprint and I/O efficiency.
- Input: ~7.1 GB of semi-structured JSON (10 export files, ~9.99 million raw synapse records).
- Output: ~405 MB of structured, columnar Parquet data (~962k rows per shard).
- Processing Time: ~24 minutes.
- Peak Memory Footprint: 16.2 GB (utilizing only 6% of available system RAM).
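For anyone curious how the memory footprint stays bounded while chewing through multi-GB JSON exports: the trick is to never materialize a whole file, only fixed-size batches of parsed records. This is just a minimal sketch of that pattern (not the actual pipeline code); `stream_batches` and the toy records are illustrative names, and in the real pipeline each batch would be handed to a Parquet writer such as pyarrow rather than collected in a list.

```python
import json
from itertools import islice

def stream_batches(lines, batch_size=50_000):
    """Yield lists of parsed records without ever holding the full file in memory."""
    it = iter(lines)
    while True:
        batch = [json.loads(line) for line in islice(it, batch_size)]
        if not batch:
            return
        yield batch

# Toy stand-in for one export shard (5 synapse records, batch size 2 -> batches of 2, 2, 1).
raw = [json.dumps({"pre_id": i, "post_id": i + 1, "contact_area": None})
       for i in range(5)]
batches = list(stream_batches(raw, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

With a real batch size in the tens of thousands, peak memory is proportional to one batch plus the columnar write buffer, not the input file size.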
The architecture uses a decentralized Controller/Agent leasing model to prevent I/O bottlenecks. The agents drop dead-weight columns (like empty contact_area fields) inline, execute the transformation entirely in memory, and write directly to shared storage.
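To make the leasing idea concrete, here is a rough sketch of what a Controller/Agent lease pool could look like. This is my attempt to illustrate the concept, not the project's actual implementation: the class names, the expiry mechanism, and the `DEAD_COLUMNS` set are all hypothetical.

```python
import time

class Controller:
    """Hands out time-limited leases on input shards so no two agents
    process the same file; expired leases return to the pool."""
    def __init__(self, shards, lease_seconds=300):
        self.lease_seconds = lease_seconds
        self.available = list(shards)
        self.leased = {}  # shard -> expiry timestamp

    def acquire(self, now=None):
        now = time.time() if now is None else now
        # Reclaim leases whose agents went silent.
        for shard, expiry in list(self.leased.items()):
            if expiry <= now:
                del self.leased[shard]
                self.available.append(shard)
        if not self.available:
            return None
        shard = self.available.pop()
        self.leased[shard] = now + self.lease_seconds
        return shard

    def release(self, shard):
        """Agent finished the shard; drop its lease for good."""
        self.leased.pop(shard, None)

# Hypothetical set of columns that are empty in every record.
DEAD_COLUMNS = {"contact_area"}

def transform(record):
    """Drop dead-weight columns inline, before the columnar write."""
    return {k: v for k, v in record.items() if k not in DEAD_COLUMNS}

ctrl = Controller(["export_00.json", "export_01.json"], lease_seconds=300)
shard = ctrl.acquire()
row = transform({"pre_id": 1, "post_id": 2, "contact_area": None})
```

The appeal of leasing over a work queue is that a crashed agent never strands a shard: its lease simply expires and the controller re-offers the file to the next idle agent.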
Would love to hear if anyone else is tackling ConnectomeDB/H01 bottlenecks without massive clusters, or if you have thoughts on optimizing these output schemas for downstream graph embeddings!