Why is copying the data array not standard when using get_fdata?

JohannesWiesner · August 26, 2019, 2:00pm

I wrote a function that standardizes an image array ‘cluster-wise’ based on extracted regions. I was surprised to find out that the get_fdata() method seems to only return a reference to the image array and not a real copy of the image array. I found out about this because after running my function my original image array was also manipulated. I find this behavior a little bit unintuitive since get_fdata or get_data intuitively sound like you would create a copy of the data array. If I would write a get_data function my first idea would probably be to write it with copying as the default and referencing as a special case (just like in pandas, where copying is always standard unless you provide inplace = True). For a real copy you apparently have to use copy_data_array = np.asarray(img.dataobj) which is not as straightforward to read. Does anyone know why copying the image data array is not standard in get_fdata()?

effigies · August 26, 2019, 3:51pm

Can you clarify what the source of your image data is? Is it being created in memory or loaded from a file? These will have slightly different behaviors.

Assuming that you’ve loaded from file, then the way this actually works is that get_fdata() does return a copy of the image data, but that copy is cached as an _fdata_cache object. Further calls will just return _fdata_cache, so if you modify the returned value, that will persist.

If you don’t want to use caching, you can say:

img.get_fdata(caching="unchanged")

If it’s already cached, then you can uncache:

img.uncache()

If you’ve created the file yourself, then yeah, you’re probably going to need to make a copy, as nibabel will only make a copy if you’re changing the dtype, in an effort to avoid multiplying your memory consumption.

JohannesWiesner · August 27, 2019, 9:51am

The image is loaded from disk. I then run the following function with the image and a mask image as arguments:

def cluster_standardize(stat_img,mask_img,ignore_zeros=True):
    
    region_imgs,indices = connected_regions(stat_img,
                                      extract_type='connected_components',
                                      min_region_size=1,
                                      smoothing_fwhm=None)
    
    stat_img_data = np.asarray(stat_img.dataobj)
    
    for idx,region_img in enumerate(iter_img(region_imgs)):
        
        region_img_indices = np.where(region_img.get_fdata() != 0 )
        
        # get the size of each region 
        region_size = len(region_img_indices[0])
        stat_img_data_region_values = stat_img_data[region_img_indices]
        
        # if region consists of more than one voxel calculate the mean and
        # standard deviation from all voxels within that region
        if region_size > 1:
            stat_img_data_region_mean = stat_img_data_region_values.mean()
            stat_img_data_region_std = stat_img_data_region_values.std()
        
        # if region consists of only one voxel, standardize that voxel
        # according to mean and standard deviation of whole brain
        else:
            stat_img_data_masked = apply_mask(stat_img,mask_img)
            
            if ignore_zeros is True:
                stat_img_data_masked = stat_img_data_masked[np.nonzero(stat_img_data_masked)]
                
            stat_img_data_region_mean = stat_img_data_masked.mean()
            stat_img_data_region_std = stat_img_data_masked.std()
        
        # z-standardize region values
        stat_img_data_region_values_std = ((stat_img_data_region_values - stat_img_data_region_mean) / stat_img_data_region_std)
        
        # replace original values with z-standardized values
        stat_img_data[region_img_indices] = stat_img_data_region_values_std
        
    return new_img_like(stat_img,stat_img_data)

The critical line is:

stat_img_data = np.asarray(stat_img.dataobj)

The error happened when this line was:

stat_img_data = stat_img.get_fdata()

When using .get_fdata() inside the function the original data array from the input image was also changed.

effigies · August 27, 2019, 11:49am

I see. Yes, that makes sense that would cause you issues. You can definitely use the strategies for manipulating the cache mentioned above. I might also consider creating a new data object, rather than modifying the contents of the get_fdata() result:

new_img_data = np.zeros_like(stat_img.get_fdata())

...

new_img_data[region_img_indices] = stat_img_data_region_values_std

This feels cleaner to me, as there’s less chance of original values remaining in the output, for example, if the set of non-zero voxels in the two images differs.