I’m not an expert, but I recently went through this for another project. I first converted the images and labels into a set of TFRecords. The format is tricky to work with, but it’s (apparently) optimized for loading with tf.data.Dataset. The script below shows an example of getting this to work.
Note that the network I was using worked with 3D images, so the batches of data had shape (batch_size, x_dim, y_dim, z_dim, 1), rather than what is typical for 2D images: (batch_size, x_dim, y_dim, n_channels).
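For example (a sketch; the 182×218×182 dimensions are just an assumption roughly matching a 1 mm MNI template, substitute your own), a single volume picks up a trailing channel axis before batching:

import numpy as np
vol = np.zeros((182, 218, 182), dtype=np.float32)  # stand-in for one loaded volume
vol_with_channel = vol[..., np.newaxis]             # shape (182, 218, 182, 1)
# stacking a batch of these gives (batch_size, 182, 218, 182, 1)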
Also, preprocessing images on the fly with functions like tf.image.per_image_standardization can slow things down substantially. In the example below, preprocessing is instead done when the TFRecord files are written, inside the function serialize_example.
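If you did want the standardization instead, the swap inside serialize_example might look roughly like this (a sketch, assuming nii0 is the NumPy array loaded there):

# instead of the min-max rescaling
nii = tf.image.per_image_standardization(
    nii0.astype(np.float32)  # zero mean, unit variance over the whole volume
)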
Sorry to dump a pile of code. Follow up if you have any questions.
import math
from numbers import Number
import numpy as np
import tensorflow as tf
import nibabel as nib
from nilearn import plotting

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def serialize_example(img: str, label: Number) -> tf.train.Example:
    # convert to numpy array
    nii0 = np.asanyarray(nib.load(img).dataobj)
    # rescale to 0-1 and cast to float32 (the dtype expected when parsing)
    # if you want to use tf.image.per_image_standardization(), this would be the place
    # to do so instead of the rescaling
    nii = ((nii0 - nii0.min()) / (nii0.max() - nii0.min())).astype(np.float32)
    feature = {
        'label': _float_feature(label),
        'image_raw': _bytes_feature(tf.io.serialize_tensor(nii).numpy())
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

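# optional sanity check (uses the same example file as below): confirm that a
# serialized volume parses back to a float32 tensor
# ex = serialize_example(img='MNI152_T1_1mm_brain.nii.gz', label=0.0)
# raw = ex.features.feature['image_raw'].bytes_list.value[0]
# restored = tf.io.parse_tensor(raw, out_type=tf.float32)
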
def write_records(niis, labels, n_per_record: int, outfile: str) -> None:
    """
    Store a list of niftis (and their associated labels) in tfrecords for use as a dataset.
    """
    n_niis = len(niis)
    n_records = math.ceil(n_niis / n_per_record)
    for i, shard in enumerate(range(0, n_niis, n_per_record)):
        print(f"writing record {i} of {n_records-1}")
        with tf.io.TFRecordWriter(
            f"{outfile}_{i:0>3}-of-{n_records-1:0>3}.tfrecords",
            options=tf.io.TFRecordOptions(compression_type="GZIP")
        ) as writer:
            for nii, label in zip(niis[shard:shard+n_per_record], labels[shard:shard+n_per_record]):
                example = serialize_example(img=nii, label=label)
                writer.write(example.SerializeToString())

def parse_1_example(example) -> tuple:
    # deserialize the image and add a trailing channel dimension
    X = tf.io.parse_tensor(example['image_raw'], out_type=tf.float32)
    return tf.expand_dims(X, 3), example['label']

def decode_example(record_bytes) -> dict:
    example = tf.io.parse_example(
        record_bytes,
        features={
            'label': tf.io.FixedLenFeature([], dtype=tf.float32),
            'image_raw': tf.io.FixedLenFeature([], dtype=tf.string)
        }
    )
    return example

def get_batched_dataset(files, batch_size: int = 32, shuffle_size: int = 1024) -> tf.data.Dataset:
    dataset = (
        tf.data.Dataset.list_files(files)  # note shuffling of filenames is on by default
        .flat_map(lambda x: tf.data.TFRecordDataset(x, compression_type="GZIP", num_parallel_reads=8))
        .map(decode_example, num_parallel_calls=tf.data.AUTOTUNE)
        .map(parse_1_example, num_parallel_calls=tf.data.AUTOTUNE)
        .cache()  # remove if all examples don't fit in memory (note interaction with shuffling of files, above)
        .shuffle(shuffle_size)
        .batch(batch_size, num_parallel_calls=tf.data.AUTOTUNE)
        .prefetch(tf.data.AUTOTUNE)
    )
    return dataset

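# note: with flat_map the record files are read one at a time; with many record files,
# one possible variant (a sketch, not required) is to overlap reads across files:
#   tf.data.Dataset.list_files(files).interleave(
#       lambda f: tf.data.TFRecordDataset(f, compression_type="GZIP"),
#       cycle_length=4,                       # number of files read concurrently
#       num_parallel_calls=tf.data.AUTOTUNE,
#   )
# followed by the same map/cache/shuffle/batch/prefetch chain as above
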
# store an example set of images with labels 0-9
# (e.g., put the nifti MNI152_T1_1mm_brain.nii.gz in the working directory)
mni_nii = ['MNI152_T1_1mm_brain.nii.gz']*10

# the number of examples stored in each tfrecord is configurable;
# aim for enough examples per record that each file ends up larger than ~100 MB
write_records(mni_nii, list(range(10)), 10, "tmp")

# read the records back. this will be the list of files generated by write_records()
# a full dataset will have a list with many records
list_of_records = ['tmp_000-of-000.tfrecords']
ds = get_batched_dataset(list_of_records, batch_size=2, shuffle_size=10)
# ds can now be passed to model.fit
# but first!!!!!
# the serialization involves quite a few steps, so it is a good idea to verify that
# the images look okay when loaded back
(Xs, Ys) = next(ds.as_numpy_iterator())
# (batch_size, )
# order will depend on shuffle (turn off all shuffling to verify order)
Ys.shape
# (batch_size, x_dim, y_dim, z_dim, 1)
Xs.shape
# convert the first element in the batch to an in-memory nibabel Nifti1Image for display
nii = nib.Nifti1Image(np.squeeze(Xs[0,]), affine=np.eye(4)*2)
# this should look like a typical brain
plotting.plot_anat(nii)
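# note: when running this as a plain script (rather than in a notebook), you may also
# need plotting.show() to actually display the figure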