How to access LFP files listed in the IBL as being accessible

I am trying to download several LFP files from a brain region. In the tutorial they show CA1 as an example, so I used it here. There are supposedly 173 sessions' worth of data and 173 insertions which have a channel in CA1. But when I loop through those to pull each LFP, many of the .bin files appear to be missing and I get an error. I think I am doing something wrong.

My code:

from one.api import ONE

import spikeglx

from brainbox.io.one import load_channel_locations

one = ONE(password='international')

#Searching for datasets

brain_acronym = 'CA1'

# query sessions endpoint

sessions, sess_details = one.search(atlas_acronym=brain_acronym, query_type='remote', details=True)

print(f'No. of detected sessions: {len(sessions)}')

# query insertions endpoint

insertions = one.search_insertions(atlas_acronym=brain_acronym)

print(f'No. of detected insertions: {len(insertions)}')

Returns:

No. of detected sessions: 173
No. of detected insertions: 173

But then I get an error when I search for these insertions:

session_list = [x for x in sessions]

# probe id and experiment id

eid = session_list[0]

pid, probename = one.eid2pid(eid)

band = 'lf' # either 'ap','lf'

# Find the relevant datasets and download them

dsets = one.list_datasets(eid, collection=f'raw_ephys_data/{probename}', filename='*.lf.*')

data_files, _ = one.load_datasets(eid, dsets, download_only=False)

bin_file = next(df for df in data_files if df.suffix == '.cbin')

# Use spikeglx reader to read in the whole raw data

sr = spikeglx.Reader(bin_file)

Returns an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[34], line 10
      8 dsets = one.list_datasets(eid, collection=f'raw_ephys_data/{probename}', filename='*.lf.*')
      9 data_files, _ = one.load_datasets(eid, dsets, download_only=False)
---> 10 bin_file = next(df for df in data_files if df.suffix == '.cbin')
     12 # Use spikeglx reader to read in the whole raw data
     13 sr = spikeglx.Reader(bin_file)

TypeError: 'NoneType' object is not iterable

Dear Angus,
The issue here is that you are converting an EID (session) to a PID (insertion); however, there can be multiple probe insertions within a single session.
This is what happens here:

The variables pid and probename contain 2 insertions (len(probename) == 2).

As a result, the query you subsequently make for the datasets is invalid, as probename is now a list of length 2 (and not a string):

# Find the relevant datasets and download them
dsets = one.list_datasets(eid, collection=f'raw_ephys_data/{probename}', filename='*.lf.*')

As a result, dsets is an empty list [] and the rest of the code cannot run.

The documentation here first identifies a PID of interest (a single string) and then converts it to an EID and probe name:

https://int-brain-lab.github.io/iblenv/notebooks_external/loading_raw_ephys_data.html#Option-2:-Download-all-of-raw-ephys-data

You can directly use the insertions variable you created in the example code above to query for the datasets; a sketch is below. Let us know if you run into issues this way.
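
For example, something along these lines should work (a rough sketch following the documentation linked above, using the insertions variable from your search; variable names are just illustrative):

from one.api import ONE
import spikeglx

one = ONE(password='international')

# take one insertion (PID) from the search above, e.g. the first one
pid = insertions[0]

# convert the insertion ID into a session ID (EID) and a probe name
eid, probename = one.pid2eid(pid)

# find the LFP ('lf' band) datasets for that probe and download them
dsets = one.list_datasets(eid, collection=f'raw_ephys_data/{probename}', filename='*.lf.*')
data_files, _ = one.load_datasets(eid, dsets, download_only=True)
bin_file = next(df for df in data_files if df.suffix == '.cbin')

# use the spikeglx reader to read in the whole raw data
sr = spikeglx.Reader(bin_file)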
Cheers


Thanks, I figured that out a couple of days ago. I've nearly got my script running.

I'm working on a shared server and I don't want to eat up too much memory. So one last thing that would be useful to know is whether my strategy for wiping the data after I use it is OK. Since I'm looping through all the LFP data (and eventually the rest of the data), I am going to delete what gets downloaded after I've run an analysis on the session and saved the outputs to .csvs and .pngs.

So I’ve set up my cache here:

(base) [acampbell@itchy IBL_data_cache]$ pwd
/space/scratch/IBL_data_cache
(base) [acampbell@itchy IBL_data_cache]$ ls
cache  cache_info.json  cortexlab  datasets.pqt  histology  hoferlab  QC.json  sessions.pqt
(base) [acampbell@itchy IBL_data_cache]$

I'm planning on making bash calls inside my Python script that find the data for that mouse and wipe it out at the end of every loop or thread. It will take the path from the session info dictionaries, build a path to the data, and then wipe the folder containing everything: the LFP, spiking and task data. Will a simple bash subprocess call running rm -r cortexlab/Subjects/KS020/2020-02-07, for instance, mess with any of the cache management scripts? Are there specific functions in the ONE API I could use to do that which would play nicely with the rest of the code?

Hello,
I am not aware of ONE tools for managing cache files.
One suggestion could be to glob for raw_ephys_data folders and remove them in particular.

From the shell I often use the find command like this to monitor the disk space taken by a given file type.

 find -name "*.cbin" -exec du -ch {} +

You can easily modify to get the file names and potentially delete them.
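
If you prefer to stay in Python, a rough equivalent (not an official ONE cache-management function, and assuming the cache layout shown above) would be to glob for the raw_ephys_data folders and remove them with shutil:

from pathlib import Path
import shutil

cache_dir = Path('/space/scratch/IBL_data_cache')

# report the size of every raw_ephys_data folder under the cache
for folder in cache_dir.rglob('raw_ephys_data'):
    if not folder.is_dir():
        continue
    size_gb = sum(f.stat().st_size for f in folder.rglob('*') if f.is_file()) / 1e9
    print(f'{folder}: {size_gb:.1f} GB')
    # uncomment to actually delete the raw ephys data for that session
    # shutil.rmtree(folder)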


@owinter So I have tried to remove the files, but some process is still using them. I can't quite identify it, since I deleted all the variables related to the files.

# note session_id is the eid
# dont_wipe_these_sessions is a list used during debugging to avoid repeatedly
# re-downloading, or if I want to keep that session around for some other reason

if (session_id not in dont_wipe_these_sessions):
    session_path = str(one.eid2path(session_id))
    # build the full path to the directory to delete
    dir_to_delete = f"{session_path}/raw_ephys_data"
    print(dir_to_delete)

    # call the bash command to remove the files so the directory can also be removed
    remove_from_path_command = "find " + dir_to_delete + " -type d -exec rm -rf {} +"
    subprocess.run(remove_from_path_command, shell=True)

    # then remove the raw_ephys_data directory itself
    remove_dir_command = "find " + session_path + " -type d -name 'raw_ephys_data' -exec rm -r {} +"
    subprocess.run(remove_dir_command, shell=True)

But this results in this error:

rm: cannot remove ‘/space/scratch/IBL_data_cache/churchlandlab_ucla/Subjects/UCLA033/2022-02-15/001/raw_ephys_data/probe00/.nfs00000000000cdb1800000286’: Device or resource busy
rm: cannot remove ‘/space/scratch/IBL_data_cache/churchlandlab_ucla/Subjects/UCLA033/2022-02-15/001/raw_ephys_data/probe01/.nfs00000000000cdb2400000287’: Device or resource busy
rm: cannot remove ‘/space/scratch/IBL_data_cache/churchlandlab_ucla/Subjects/UCLA033/2022-02-15/001/raw_ephys_data/probe00/.nfs00000000000cdb1800000286’: Device or resource busy
rm: cannot remove ‘/space/scratch/IBL_data_cache/churchlandlab_ucla/Subjects/UCLA033/2022-02-15/001/raw_ephys_data/probe01/.nfs00000000000cdb2400000287’: Device or resource busy
rm: cannot remove ‘/space/scratch/IBL_data_cache/churchlandlab_ucla/Subjects/UCLA033/2022-02-15/001/raw_ephys_data/probe00/.nfs00000000000cdb1800000286’: Device or resource busy
rm: cannot remove ‘/space/scratch/IBL_data_cache/churchlandlab_ucla/Subjects/UCLA033/2022-02-15/001/raw_ephys_data/probe01/.nfs00000000000cdb2400000287’: Device or resource busy

I ended up having to delete the one variable and recreate it at every loop. I am a bit concerned that when I try to implement multiprocessing or multithreading, having another one process running would interfere with that process's or thread's ability to clear the memory.
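
For reference, this is roughly the per-loop cleanup I am aiming for: closing the spikeglx reader explicitly (assuming Reader has a close() method that releases the memory-mapped .cbin, which seems to be what leaves the .nfs files behind) and then removing the folder with shutil instead of bash calls:

import shutil

# run the analysis, then make sure the memory-mapped raw file is released
sr = spikeglx.Reader(bin_file)
try:
    pass  # ... analysis on sr goes here ...
finally:
    sr.close()  # assumed to release the open .cbin handle

# with no open handles left, the raw ephys folder should delete cleanly
session_path = one.eid2path(session_id)
shutil.rmtree(session_path / 'raw_ephys_data', ignore_errors=True)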

I guess I can just wipe all the sessions from memory at the end, but I prefer to never really be eating up too much shared memory. The dandi-cli has this functionality to just call what you need and then wipe it, and I think it's a nice feature. I jury-rigged it into the Allen Brain SDK script I wrote.

I will try running this in a multiprocessing function to see what happens.