Hello! Has anyone here ever written NIfTI I/O code that wraps the nifti_clib code directly in Cython? Was that a big win in terms of run-time? We are interested in reading many files in parallel, so we’d like to use OpenMP in the Cython code to parallelize reads. It seems like this would be the best way to do it, if we don’t particularly care about portability? Or maybe we’re thinking about this the wrong way?
I doubt you’ll get a lot of benefit from nifti_clib. Loading and correctly handling headers is not the bottleneck; most often it’s gzip. If you have indexed_gzip installed, you will get all the benefits of Cython, although not of a parallel gzip algorithm. I have seen (but not used) rapidgzip, which might be of interest before writing your own library.
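To make the indexed_gzip point concrete, here is a minimal sketch, assuming a 4D .nii.gz with a hypothetical filename. nibabel picks up indexed_gzip automatically when it is installed, so random access through the data proxy goes through the compiled zran code rather than Python-level gzip:

```python
# Minimal sketch: nibabel uses indexed_gzip automatically when installed.
# The filename is a hypothetical placeholder.
import nibabel as nib

img = nib.load("sub-01_bold.nii.gz")   # header parsed, data not yet read

# Pulling a single volume only decompresses the blocks that cover it
# (fast with indexed_gzip, much slower with the stdlib gzip fallback).
vol0 = img.dataobj[..., 0]

# Reading everything still pays the full decompression cost once.
data = img.get_fdata(dtype="float32")
```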
Assuming what you want at the end of the day is either a nibabel image or a numpy array, then I would suggest that you consider the ArrayProxy class, which is how we represent the information needed to load data on-demand: nibabel/nibabel/arrayproxy.py at master · nipy/nibabel · GitHub. In particular, the __array__() and __getitem__() methods are the main ways people will access the data (np.asanyarray(img.dataobj)). An optimizing loader could work against the ArrayProxy spec.
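As a rough sketch (with a hypothetical filename), these are the two access paths such a loader would need to honour:

```python
# Sketch of the two ArrayProxy access paths an optimizing loader would
# need to support; the filename is hypothetical.
import numpy as np
import nibabel as nib

img = nib.load("example.nii.gz")
proxy = img.dataobj                 # ArrayProxy: nothing read yet

full = np.asanyarray(proxy)         # __array__(): read the whole array
slab = proxy[:, :, 10]              # __getitem__(): read just one slice
# scl_slope / scl_inter scaling is applied by the proxy in both cases
```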
Just in case: have you had a chance to confirm and test that the filesystem you’d be working on handles parallel I/O well? I ask because the cluster I tend to work on has only one filesystem that is guaranteed to be secure enough for files with personally identifying information. That filesystem is, tragically, unable to cope with many simultaneous filesystem interactions, and it’s slow, too. Even simpler forms of parallelism have caused problems after a certain scale (e.g., running tens of participants through fMRIPrep).
@effigies: to your point, what if we plan to unzip the files we are working with in advance? We’d much rather uncompress once up front, since for our application we’re going to need to make multiple passes over the data. Would you expect a benefit from Cython in that case?
Any thoughts about going full zarr for this use-case?
We’ll definitely need to do some empirical benchmarking of the different approaches, but it’s really helpful to hear about issues we might run into, and to get the pointer to ArrayProxy, of course.
Regarding zarr, could you say a bit more about the specific scenario? For example, is this for fitting some DL model, perhaps with tensorflow or pytorch? If so, I’d hazard that it’s worth at least starting with their built-in dataset tools and seeing whether the achieved speed is sufficient. By built-in tools, I mean tensorflow records (see their discussion of I/O optimizations) or pytorch datasets/dataloaders (see their discussion of parallel file access, which I understand to be based on a multiprocess model, where each process reads files that contain only the data array as either a serialized numpy or pyarrow object). If that’s not your scenario, then sorry for the noise (though the tensorflow discussion of I/O optimizations may still be helpful, since it covers not only parallel data extraction but also strategies like overlapping I/O with compute through prefetching).
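A minimal sketch of that pytorch multiprocess pattern, assuming each volume has already been exported to its own .npy file and that all volumes share a shape (paths, batch size, and worker count are hypothetical):

```python
# Each DataLoader worker process reads its own .npy files in parallel.
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class VolumeDataset(Dataset):
    def __init__(self, root):
        self.paths = sorted(Path(root).glob("*.npy"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        vol = np.load(self.paths[idx])      # plain array, no NIfTI parsing
        return torch.from_numpy(vol)        # assumes all volumes share a shape


loader = DataLoader(VolumeDataset("volumes/"), batch_size=4,
                    num_workers=8, pin_memory=True)
for batch in loader:
    ...  # feed the model
```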
Definitely one approach. And you could just as easily apply scale factors and save as float32 (or any appropriate dtype) at that point, which would mean you can memmap or use any other method to achieve random access.
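Something like the following one-time conversion, with hypothetical filenames, is what I have in mind: get_fdata() applies the scale factors, the result is written out as raw float32, and later passes memmap it for random access.

```python
# One-time conversion sketch: apply the NIfTI scale factors, write a raw
# float32 .npy, then memmap it on later passes. Filenames are hypothetical.
import numpy as np
import nibabel as nib

img = nib.load("example.nii.gz")
data = img.get_fdata(dtype=np.float32)   # scl_slope / scl_inter applied
np.save("example_f32.npy", data)

# Later passes: random-access reads served through the OS page cache.
mm = np.load("example_f32.npy", mmap_mode="r")
frame = np.asarray(mm[..., 0])
```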
Possibly a very specific access mechanism, but it’s going to be hard to beat numpy.memmap(). Possibly the ArrayProxy.__getitem__ slicer calculations could be sped up, but I have always assumed that the I/O is the slow bit, and we farm that out to numpy. That said, since that code was written 15+ years ago, it’s possible that numpy now provides functions that eliminate the need for most of it.
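For an uncompressed .nii, a direct memmap of the data block looks roughly like this (hypothetical filename; note this gives you the on-disk dtype and does not apply scl_slope/scl_inter):

```python
# Sketch: memmap the data block of an uncompressed .nii directly, taking
# offset, dtype, and shape from the header.
import numpy as np
import nibabel as nib

img = nib.load("example.nii")          # must be uncompressed
hdr = img.header

mm = np.memmap("example.nii", mode="r",
               dtype=hdr.get_data_dtype(),
               offset=hdr.get_data_offset(),
               shape=hdr.get_data_shape(),
               order="F")               # NIfTI data is Fortran-ordered
frame = np.asarray(mm[..., 0])
```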
If you’re going to be putting chunks on S3, I don’t see any reason not to, although I’m not 100% sure what all going full zarr implies. You may want to look at GitHub - neuroscales/nifti-zarr: A draft specification for the nifti-zarr format (PR#1 has significant discussion, PR#7 has the most recent proposal) for a round-trippable Zarr interpretation of Nifti.
> I’m not 100% sure what all going full zarr implies.
Yeah - basically what you said: put chunks on S3, ignoring the fact that the data originally came from a NIfTI, and keep only minimal metadata (I like the idea of casting everything to a space-conserving dtype!). Use zarr from then on. That’s admittedly a kludge.
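Roughly what I’m picturing, as a sketch only (bucket name, chunking, the s3fs credential setup, and the zarr-python v2-style API are all assumptions on my part):

```python
# Rough sketch of the "full zarr" kludge: cast, rechunk, and write the
# array to an S3-backed store, keeping only minimal metadata.
import numpy as np
import nibabel as nib
import zarr
import s3fs

img = nib.load("example.nii.gz")
data = img.get_fdata(dtype=np.float32)         # scale factors applied, space-conserving dtype

fs = s3fs.S3FileSystem()                       # assumes credentials are already configured
store = s3fs.S3Map(root="my-bucket/example.zarr", s3=fs)

z = zarr.open(store, mode="w", shape=data.shape,
              chunks=(64, 64, 64, 1), dtype="float32")
z[:] = data                                    # chunked write to S3
z.attrs["affine"] = img.affine.tolist()        # the minimal metadata we keep
```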
Hi all, I don’t think I can add much to this discussion, apart from mentioning that, if you’re going to use e.g. numpy.memmap, there’s probably no need to use Cython. Cython is a very good choice if you need to interface with an existing C library, but it sounds like that may not be necessary for you.
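For example, something like the following already reads many files in parallel with no compiled extension involved; the directory layout, worker count, and per-file reduction are hypothetical placeholders.

```python
# Hedged sketch: plain multiprocessing gives parallel file reads without
# any Cython/OpenMP. Paths and the per-file computation are placeholders.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import numpy as np
import nibabel as nib


def frame_means(path):
    img = nib.load(str(path))              # uncompressed .nii -> memmapped access
    data = np.asanyarray(img.dataobj)
    return data.mean(axis=-1)              # any per-file computation goes here


if __name__ == "__main__":
    paths = sorted(Path("data/").glob("*.nii"))
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(frame_means, paths))
```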