When running the abcd-hcp-pipeline via Docker, I get an error about a missing file that I intentionally deleted. The file is no longer in the BIDS directory, so I’m not sure why the pipeline is still looking for it. I’m assuming it must be using a cached BIDSLayout object, but I don’t know how to clear such a cache so that it re-reads the directory from disk and sees that the file has been removed.
Command used (and if a helper script was used, a link to the helper script or the command generated):
Standard Docker container running v0.1.4 of the abcd-hcp-pipeline with basic subject/session/etc. input parameters
Traceback (most recent call last):
  File "/app/run.py", line 397, in <module>
    _cli()
  File "/app/run.py", line 69, in _cli
    return interface(**kwargs)
  File "/app/run.py", line 277, in interface
    for session in session_generator:
  File "/app/helpers.py", line 39, in read_bids_dataset
    layout = BIDSLayout(bids_input, index_metadata=True)
  File "/usr/local/lib/python3.6/dist-packages/bids/layout/layout.py", line 212, in __init__
    indexer.index_metadata()
  File "/usr/local/lib/python3.6/dist-packages/bids/layout/index.py", line 198, in index_metadata
    with open(bf.path, 'r') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/bids/sub-11923040/ses-01/anat/sub-11923040_ses-01_run-01_T2w.json'
In the log output above you can see the error about a file with run-01. This subject had two runs of the T1w and T2w images. The automated pipeline was failing to properly merge the two runs, so I merged them myself: I removed the run-01 and run-02 .nii.gz and .json files and replaced them with a single file using an acq-mean specifier instead (e.g. acq-mean_T2w). So I would expect the pipeline to find sub-11923040_ses-01_acq-mean_T2w.json, but it is still looking for sub-11923040_ses-01_run-01_T2w.json even though that file is no longer in the directory tree on disk.
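For illustration, the replacement is just a voxel-wise mean of the two runs, roughly along these lines (a nibabel sketch, assuming the runs are already aligned; the exact commands I used aren’t important for this issue):

import nibabel as nib
import numpy as np

# Illustrative sketch only: average two (already aligned) T2w runs into a
# single acq-mean image. File names follow the BIDS entities used above.
run1 = nib.load('sub-11923040_ses-01_run-01_T2w.nii.gz')
run2 = nib.load('sub-11923040_ses-01_run-02_T2w.nii.gz')

mean_data = np.mean([run1.get_fdata(), run2.get_fdata()], axis=0)
nib.save(nib.Nifti1Image(mean_data, run1.affine),
         'sub-11923040_ses-01_acq-mean_T2w.nii.gz')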
I’m guessing the pybids Python package might have cached the BIDSLayout object somewhere, which would make sense for efficiency to avoid rebuilding a large layout with many subjects, but it isn’t being refreshed when I rerun the container pipeline. Any thoughts on whether such caching occurs and where such a file might be stored on disk so that I can remove it? The Docker container is reloaded each time on the HPC cluster, so the cache can’t be inside the container; it must be somewhere on the mounted directories, but I’m at a loss.
I don’t know enough about the abcd-bids-pipeline, but there is a cache for pybids; from a cursory look at the pipeline, it doesn’t appear to be used.
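For what it’s worth, the call shown in your traceback is just BIDSLayout(bids_input, index_metadata=True), with no database argument, so no on-disk cache should be involved. For comparison, enabling the pybids cache in a recent release looks roughly like this (a sketch; the database_path and reset_database arguments are from newer pybids and may not exist in the version inside the container):

from bids import BIDSLayout

# Opt-in on-disk cache in recent pybids: the index is stored under
# database_path, and reset_database=True forces a fresh re-index from disk.
# (These arguments are assumptions about a newer pybids release.)
layout = BIDSLayout('/bids', database_path='/tmp/bids_db', reset_database=True)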
Having said that, the more likely issue is not the cache, but that the JSON sidecar file is expected by BIDS to be there.
@effigies I’m a little confused, because it looks to me like the spec does not require this file but only recommends it. Any suggestions?
For now, can you make an empty file (or a JSON file containing just an empty object) to see if that satisfies pybids?
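Something like this quick sketch would do it (the path is the one from your traceback; this is only a test/workaround, not a fix):

import json

# Recreate the sidecar the indexer is complaining about, as an empty JSON
# object. Path copied from the traceback; adjust if needed.
missing = '/bids/sub-11923040/ses-01/anat/sub-11923040_ses-01_run-01_T2w.json'
with open(missing, 'w') as f:
    json.dump({}, f)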
Hi @adelavega,
Thanks for the reply. The JSON file is there. Below is the anat directory with the T1w and T2w files (after I deleted the run-01 and run-02 versions and replaced them with acq-mean versions). So the layout should have everything it needs for the acq-mean versions, but the BIDSLayout object still expects the deleted run-01 and run-02 files to be there and throws an error because they’re gone.
PyBIDS does not require any JSON files to be present; it just walks over what it finds. I would do:
host $ docker run <docker args> --entrypoint=bash <image>
container$ ls -l /bids/sub-11923040/ses-01/anat/
That will let you see the view that the process has. If that all looks normal, you can then use:
container$ python -m pdb /app/run.py
That will allow you to look in more detail at the state of the program when it crashes. I would try to figure out what bf is and see if there are any variables that help determine how it came to point at a non-existent file.
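If pdb is awkward to drive, another quick check from that same container shell is to ask pybids directly which T2w files it indexes, skipping the metadata pass so it never reaches the code that crashes (a sketch; the subject/session entities are taken from your error message):

from bids import BIDSLayout

# Skip JSON sidecar indexing so index_metadata(), where the crash
# happens, is never reached.
layout = BIDSLayout('/bids', index_metadata=False)

# List the T2w files pybids actually sees for this subject/session.
print(layout.get(subject='11923040', session='01', suffix='T2w',
                 return_type='filename'))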
I agree with @adelavega that there shouldn’t be an attempt to reuse a cached database.
Running with pdb in an interactive Docker session was a great idea. It turns out the problem disappears in that environment. Interactive jobs on this HPC cluster run on different nodes than jobs submitted to the standard queue, so now I’m wondering if there is some funny business with the HPC filesystem rather than with pybids. Thank you again for the quick replies; I’m still perplexed, but I’ll keep digging.
Sure enough, there is some funny business going on with the HPC. Apparently the system has a filesystem caching layer that speeds up file access so that users running heavy I/O workloads don’t slow things down for everyone else. Unfortunately, that caching meant the compute node still saw a stale view of the BIDS directory, which produced my issue and made it tough to figure out. The key insight was running on an interactive node, which doesn’t use the caching system and therefore avoided the strange error. Anyway, just to close the loop: problem solved, and it was not pybids’ fault. Thanks again for the help!