I am currently running analyses on the LFP data from IBL sessions containing CA1 recordings, and the destriping step is taking prohibitively long. Storing the destriped data may be an option, but it would be inconsiderate to take up that many resources on the server I'm sharing. The alternative is to destripe every single LFP file for the 201 sessions I want to use. So, for somewhere between 200 and 600 individual probes, this step takes on average ~24-25 minutes each on the server. I am running destripe on the entire probe for the whole session, as I need to check the LFP across the entire session. I can parallelize this operation, but I could take up at most 5 cores at a time, which I think still means the script needs 3 days or more to run, and that is before the other steps, which will add even more time. I may also need to run the pipeline multiple times at different thresholds.
I really only need to run this on CA1 channels, plus two channels in other, randomly chosen regions as controls. But the destripe() function, as I understand it, relies on channels in spatial proximity to remove the artifacts: https://int-brain-lab.github.io/iblenv/_modules/ibllib/dsp/voltage.html
I suppose I could discard LFP data not recorded during the actual session before destriping, as I have no interest in non-trial-related data, but aside from this and parallelization I can't think of alternatives.
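For illustration, trimming to the task period is something like the sketch below, assuming trial start/end times (e.g. from the ALF trials object) and an LFP sampling rate of ~2500 Hz. The function name and the trial times here are made up for the example:

```python
import numpy as np

def session_sample_window(trial_starts, trial_ends, fs, pad_s=1.0):
    """Return (first_sample, last_sample) covering all trials plus padding."""
    t0 = max(float(trial_starts.min()) - pad_s, 0.0)
    t1 = float(trial_ends.max()) + pad_s
    return int(t0 * fs), int(t1 * fs)

# Stand-in trial times (seconds); in practice these would come from the
# session's trials object.
starts = np.array([10.0, 50.0, 90.0])
ends = np.array([15.0, 55.0, 95.0])
s0, s1 = session_sample_window(starts, ends, fs=2500.0)
print(s0, s1)
# Then slice the reader before destriping, e.g.:
# raw = sr[s0:s1, :-sr.nsync].T
```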
As I see it I have four options:

- What are some options for me to save time on this step? Am I correct that I need the LFP from all the channels for destripe() to work?
- I have also seen this object: _iblqc_ephysChannels.apRMS. Do you have the destriped data accessible somewhere, or is this only for preprocessing?

Or, if I need to store the destriped data myself:

- How much space would I need to store the destriped data? Each server I have available is ~500 GB, and the lab's main focus is not this, so we need that space for running other software. The Allen Institute only stores every 4th channel of their LFP; I suppose I could do the same here.
- I am using frequency bands for my analysis, and I noticed that you provide this data: https://int-brain-lab.github.io/iblenv/notebooks_external/loading_ephys_data.html#Exploring-Ephys-Data
  I suppose I could just use this data instead, but do you provide instantaneous frequency over the whole recording? That would be this object, correct: _iblqc_ephysSpectralDensity.power?

Which of the four options seems best?
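On the storage question, my back-of-envelope estimate looks like this; all numbers are my assumptions (384 LFP channels at 2500 Hz, float32 output from destripe, ~1 h sessions), not measurements:

```python
# All numbers here are assumptions, not measurements.
n_channels = 384        # Neuropixels LFP channels
fs_hz = 2500            # LFP sampling rate
bytes_per_sample = 4    # float32 output from destripe
session_s = 3600        # ~1 h session
n_sessions = 201

per_session_gb = n_channels * fs_hz * bytes_per_sample * session_s / 1e9
total_tb = per_session_gb * n_sessions / 1e3
every_4th_gb = per_session_gb / 4   # Allen-style channel subsampling
print(f"{per_session_gb:.1f} GB per probe-session, ~{total_tb:.2f} TB total, "
      f"{every_4th_gb:.1f} GB per probe-session if keeping every 4th channel")
```

By this rough math a single probe-session is already ~14 GB, so the ~500 GB servers would fill up after a few dozen probes even with subsampling.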
My code:
import os
import time
import subprocess
import numpy as np
import pandas as pd
from scipy import io, signal, stats
#from fitter import Fitter, get_common_distributions, get_distributions
import scipy.ndimage
from scipy.ndimage import gaussian_filter
from scipy.ndimage import gaussian_filter1d
import matplotlib.pyplot as plt
# for ripple detection
import ripple_detection
from ripple_detection import filter_ripple_band
import ripple_detection.simulate as ripsim # for making our time vectors
import piso #can be difficult to install, https://piso.readthedocs.io/en/latest/
from tqdm import tqdm
from iblatlas.atlas import AllenAtlas
from iblatlas.regions import BrainRegions
from one.api import ONE
import spikeglx
from brainbox.io.one import load_channel_locations
from brainbox.io.spikeglx import Streamer
from neurodsp.voltage import destripe
# load in the brain atlas and the brain region object for working with the ccf and ABI region id's in channels objects
ba = AllenAtlas()
br = BrainRegions() # br is also an attribute of ba, so one could do br = ba.regions
# Searching for datasets
one = ONE()  # instantiate the ONE client before querying
brain_acronym = 'CA1'
# query sessions endpoint
sessions, sess_details = one.search(atlas_acronym=brain_acronym, query_type='remote', details=True)
# query insertions endpoint
insertions = one.search_insertions(atlas_acronym=brain_acronym)
session_list = [x for x in sessions]
# probe id and experiment id
for sesh in session_list[140:141]:
    eid = sesh
    pid, probename = one.eid2pid(eid)
    print(f'Probe: {pid}, Probe name: {probename}')
    band = 'lf'  # either 'ap', 'lf'
    # for this_probe in pid:
    for this_probe in probename:
        print(this_probe)
        # Find the relevant datasets and download them
        dsets = one.list_datasets(eid, collection=f'raw_ephys_data/{this_probe}', filename='*.lf.*')
        print(type(dsets))
        print(len(dsets))
        data_files, _ = one.load_datasets(eid, dsets, download_only=True)
        bin_file = next(df for df in data_files if df.suffix == '.cbin')
        # Use the spikeglx reader to read in the whole raw data
        print("sr = spikeglx.Reader(bin_file)")
        start_time = time.time()
        sr = spikeglx.Reader(bin_file)
        print(f"done, time elapsed: {time.time() - start_time}")
        # Important: remove the sync channel from the raw data, and transpose
        print("raw = sr[:, :-sr.nsync].T")
        start_time = time.time()
        raw = sr[:, :-sr.nsync].T
        print(f"done, time elapsed: {time.time() - start_time}")
        # Apply the destriping algorithm to the data
        print("destriped = destripe(raw, fs=sr.fs)")
        start_time = time.time()
        destriped = destripe(raw, fs=sr.fs)
        print(f"destriped shape : {destriped.shape}")
        print(f"done, time elapsed: {time.time() - start_time}")
        # Load the sync file to build the LFP time index
        print("getting lfp_time_index data")
        lfp_time_index_path = one.list_datasets(eid, collection=f'raw_ephys_data/{this_probe}', filename='*sync.npy')
        lfp_time_index, _ = one.load_datasets(eid, lfp_time_index_path, download_only=False)
        lfp_time_index = lfp_time_index[0]
        print(f"lfp_time_index shape : {lfp_time_index.shape}")
        print("done")
        # Load in the channel metadata
        print("getting channels data")
        collectionname = f'alf/{this_probe}/pykilosort'
        channels = one.load_object(eid, 'channels', collection=collectionname)
        channels.allen2017_25um_acronym = br.id2acronym(channels['brainLocationIds_ccf_2017'])
        print(f"channels keys : {list(channels.keys())}")
        print("done")
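After destriping I would then subset to the CA1 rows roughly as below; the acronym array and the "destriped" array here are tiny stand-ins for channels.allen2017_25um_acronym and the real (n_channels, n_samples) output, just to show the indexing I have in mind:

```python
import numpy as np

# Tiny stand-ins for channels.allen2017_25um_acronym and the destriped array
acronyms = np.array(['CA1', 'CA1', 'DG-mo', 'VISa'])
destriped_demo = np.arange(8, dtype=float).reshape(4, 2)  # (n_channels, n_samples)

# Boolean mask over channels whose acronym starts with 'CA1'
ca1_mask = np.array([a.startswith('CA1') for a in acronyms])
ca1_lfp = destriped_demo[ca1_mask]
print(ca1_mask.sum(), ca1_lfp.shape)
```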