Destripping the IBL LFP takes a very long time, can I restrict which channels I run it on?

babou_osellot · January 4, 2024, 6:28pm

I am currently running analyses on the LFP data from IBL sessions containing CA1 recordings, and the destriping step is taking prohibitively long. Storing the data after being destripped may be an option for me, but it would be rude to take up so many resources on the server I’m sharing. The alternative is that I have to destripe every single LFP file for over 201 sessions I want to use. So, for somewhere between 200-600 individual probes, on the server, this step seems to take on average ~24-25 minutes. I am running destripe on the entire probe for the whole session as I need to check the LFP across the entire session. I can parallelize this operation, but I think I could at the most take up 5 cores at a time. I think this would still mean the script needs 3 days or more to run. And I haven’t even included the other steps which will add even more time. I may need to run this pipeline multiple times as well at different thresholds.

I really only need to run this on CA1 channels and two other channels in other randomly chosen regions for controls. But the destripe() function, as I understand it, relies on channels in spatial proximity to remove the artifacts: https://int-brain-lab.github.io/iblenv/_modules/ibllib/dsp/voltage.html

I suppose I could remove LFP data not recorded during the actual session before destripping, as I have no interest in non-trial related data, but aside from this and parallelization, I can’t think of alternatives.

As I see it I have 5 options:

What are some options for me to save time on this step? Am I correct that I need the LFP from all the channels for destripe() to work?
I have also seen this object: _iblqc_ephysChannels.apRMS, do you have the destripped data accessible somewhere? Or is this only for preprocessing?

Or, if I need to store the destripped data…

How much space would I need to store the destripped data? Each server I have available is ~500 GB, and the lab’s main focus is not this. We need that space for running other software. The Allen institute only stores every 4th channel of their LFP. I could do the same here I suppose.
I am using frequency bands for my analysis I noticed that you have this data provided:
https://int-brain-lab.github.io/iblenv/notebooks_external/loading_ephys_data.html#Exploring-Ephys-Data
I supposed I could just use this data instead, but do you provide instantaneous frequency over the whole recording? That would be this object here correct: _iblqc_ephysSpectralDensity.power

Which of the 4 options seems best?

My code:

import os
import subprocess 
import numpy as np
import pandas as pd
from scipy import io, signal, stats
#from fitter import Fitter, get_common_distributions, get_distributions
import scipy.ndimage
from scipy.ndimage import gaussian_filter
from scipy.ndimage import gaussian_filter1d
import matplotlib.pyplot as plt
# for ripple detection
import ripple_detection
from ripple_detection import filter_ripple_band
import ripple_detection.simulate as ripsim # for making our time vectors
import piso #can be difficult to install, https://piso.readthedocs.io/en/latest/
from tqdm import tqdm
from iblatlas.atlas import AllenAtlas
from iblatlas.regions import BrainRegions
from one.api import ONE
import spikeglx
from brainbox.io.one import load_channel_locations
from brainbox.io.spikeglx import Streamer
from neurodsp.voltage import destripe

# load in the brain atlas and the brain region object for working with the ccf and ABI region id's in channels objects
ba = AllenAtlas()
br = BrainRegions() # br is also an attribute of ba so could to br = ba.regions

#Searching for datasets
brain_acronym = 'CA1'
# query sessions endpoint
sessions, sess_details = one.search(atlas_acronym=brain_acronym, query_type='remote', details=True)

# query insertions endpoint
insertions = one.search_insertions(atlas_acronym=brain_acronym)
session_list = [x for x in sessions]

# probe id and experiment id
for sesh in session_list[140:141]:
    eid = sesh
    pid, probename = one.eid2pid(eid)
    print(f'Probe: {pid}, Probe name: {probename}')
    
    band = 'lf' # either 'ap','lf'
    #for this_probe in pid:
    for this_probe in probename:
        print(this_probe)
        # Find the relevant datasets and download them
        dsets = one.list_datasets(eid, collection=f'raw_ephys_data/{this_probe}', filename='*.lf.*')
        print(type(dsets))
        print(len(dsets))
        data_files, _ = one.load_datasets(eid, dsets, download_only=False)
        bin_file = next(df for df in data_files if df.suffix == '.cbin')

        # Use spikeglx reader to read in the whole raw data
        print("sr = spikeglx.Reader(bin_file)")
        start_time = time.time()
        sr = spikeglx.Reader(bin_file)
        print(f"done, time elapsed: {time.time() - start_time}")
        # Important: remove sync channel from raw data, and transpose
        print("raw = sr[:, :-sr.nsync].T")
        start_time = time.time()
        raw = sr[:, :-sr.nsync].T
        print(f"done, time elapsed: {time.time() - start_time}")
        # Reminder : If not done before, remove first the sync channel from raw data
        # Apply destriping algorithm to data
        print("destriped = destripe(raw, fs=sr.fs)")
        start_time = time.time()
        destriped = destripe(raw, fs=sr.fs)
        print(f"destripped shape : {destriped.shape}")
        print(f"done, time elapsed: {time.time() - start_time}")
        # create time index, this line does not actually do that
        print("getting lfp_time_index data")
        lfp_time_index_path = one.list_datasets(eid, collection=f'raw_ephys_data/{this_probe}', filename='*sync.npy')
        lfp_time_index, _ = one.load_datasets(eid, lfp_time_index_path, download_only=False)
        lfp_time_index =lfp_time_index[0]
        print(f"lfp_time_index shape : {lfp_time_index.shape}")
        print("done")
        
               
        #load in the channel metadata
        print("getting channels data")
        collectionname = f'alf/{this_probe}/pykilosort'
        channels = one.load_object(eid, 'channels', collection=collectionname)
        channels.allen2017_25um_acronym = br.id2acronym( channels['brainLocationIds_ccf_2017'] )
        print(f"channels shape : {channels.shape}")
        print("done")

babou_osellot · January 4, 2024, 7:54pm

Saving the destripped LFP may be an option a single probe compressed is 43 GB, however I could just keep three of the channels which would be ~300 Mb.

I was hoping at a later date though to compare across regions which means I’d have to run this step all over again. I think forcing the end user to run the destripping step themselves may limit the accessibility of this dataset, it certainly makes doing exploratory analysis with the LFP very tedious. Each time I want to look at a different session or probe I need to wait 20 minutes for destripping to run, and I have fairly powerful servers which I don’t have to share outside of the lab. Most people on shared reasoures have time limits on jobs they can run and interactive sessions will be even more time limited. It would be much better to simply be able to load the preprocessed data like the Allen Insitute has done.

On the other hand sometimes things just take a long time and running an analysis over 200 mice with between 1-3 probes in them might just have to be done in batches. Sometimes data analysis is just slow and not much can be done about it.

Gaelle_Chapuis · January 5, 2024, 6:33am

Hello,

We are aware that the raw LF is impractical for large scale analysis, and are planning to perform signal processing in order to reduce its size and practicality.
We have internally not done analysis on the LF and this is why although the data is fully accessible, it is not yet in the best format for analysis.

@owinter may want to add on this point, and comment on your approach options.

Kind regards

owinter · January 5, 2024, 10:10am

Hello,

I suggest to use the destripe_lfp dedicated function, as the AP band destripe is much greedier and the spatial filter not adapted for LF band.
Once you have done this, the destriping on LFP data should be mostly I/O (it’s one FFT, iFFT and and CAR).

If you work over a network (NFS mount or similar) I would also advise to make sure you read the data only once.

Note that there is a whole discussion to have about whether or not we should do CAR, and since you are destriping the whole files, how to perform an automated wrong channel detection before doing so.

As for the practicality of the format, again this is raw data and we haven’t run large scale analysis nor quality control on raw LFP voltage traces yet. There are transformations to perform so the full BWM LFP data fits on a consumer external drive, we just haven’t done this yet.

babou_osellot · January 5, 2024, 6:25pm

@Gaelle_Chapuis Yes the data set is quite new, I hope I wasn’t coming across as critical. Its amazing, the metadata is really well organized. I’m sure you’re still having internal debates about what to share and how to go about it. Especially since this project will lay the foundation for the format of future largescale ephys projects. I was just trying to get advice and provide insight into a user experience.

@owinter I will try the destripe_lfp() algorithm, thank you for pointing it out to me. Since you have not done automated detection of bad channels should there be any clear signs of poor quality I should do on my end? Given the size of the dataset I’m happy to have fairly high quality standards and restrict it to a subset. I know this isn’t bad channel detection persay but 'm already using two external channels outside Hp to detect movement artifacts by looking for excessively high correlations during high freq events.

Also I am interested in file formatting issues for neuroscience data, so if you have any readings on that topic please share. Are you moving towards using HDF5 rather than .imec?