Caution with anonymization and gzipped files: you might be publishing subject information

I'm curious whether any of you have already encountered this issue or are aware of it: your compressed NIfTI images might contain subject information. I was lucky enough to catch it in time, but while preparing to publish my data via OpenNeuro.org, I almost published the names of my subjects as well, hidden in the compressed NIfTI files.

So, what’s the problem?

Let’s take a BOLD image from the OpenNeuro.org dataset ds000108. A good example is the functional image sub-01_ses-test_task-fingerfootlips_bold.nii.gz.

If you double-click the compressed NIfTI file, the archive viewer shows the following:

[Screenshot: open_gzip]

And if you extract the data in the .nii.gz file manually, the uncompressed file name will be 8_finger_foot_lips.nii. In itself, this doesn’t seem so problematic.

But in my case, the filename stored in the compressed NIfTI image was still the one assigned by the scanner, i.e. the scan sequence and the subject's name.

How did I miss that?

  1. Checking the header information of the compressed NIfTI file (e.g. with fslhd or nibabel) only returns the name of the compressed file as its filename, i.e. sub-01_ses-test_task-fingerfootlips_bold.nii.gz.

  2. Also, if you gunzip the NIfTI image, the output file is named after the compressed file, i.e. sub-01_ses-test_task-fingerfootlips_bold.nii. If you then compress it again with gzip, the original hidden name is gone.

  3. If you use gzip or gunzip with the -l flag (i.e. list compressed file contents), you get the following output:

    gzip -l sub-01_ses-test_task-fingerfootlips_bold.nii.gz 
       compressed   uncompressed  ratio  uncompressed_name
         24454931       45220192  45.9%  sub-01_ses-test_task-fingerfootlips_bold.nii
    

    So here too, the uncompressed_name is reported as sub-01_ses-test_task-fingerfootlips_bold.nii, which is not the name actually stored in the file!

But if you decompress the NIfTI image using gunzip with the --name flag (i.e. restore the original name and time stamp):

gunzip --name sub-01_ses-test_task-fingerfootlips_bold.nii.gz

You will get the original file name 8_finger_foot_lips.nii.
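The hidden name can also be read without decompressing anything. Below is a minimal sketch, assuming the header layout described in RFC 1952 (a 10-byte fixed header, an optional extra field, then an optional zero-terminated original name). The function name `read_gzip_header` is made up for this illustration, and the demo stream is built in memory rather than taken from the real dataset.

```python
import gzip
import io
import struct

FEXTRA, FNAME = 0x04, 0x08  # flag bits defined in RFC 1952


def read_gzip_header(stream):
    """Return (stored_name, mtime) from a gzip stream; stored_name may be None."""
    fixed = stream.read(10)
    if len(fixed) < 10 or fixed[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = fixed[3]
    (mtime,) = struct.unpack("<I", fixed[4:8])  # time stamp, seconds since epoch
    if flags & FEXTRA:
        # Skip the optional extra field: 2-byte little-endian length, then data.
        (xlen,) = struct.unpack("<H", stream.read(2))
        stream.read(xlen)
    name = None
    if flags & FNAME:
        raw = bytearray()
        while (byte := stream.read(1)) not in (b"", b"\x00"):
            raw += byte
        name = raw.decode("latin-1")
    return name, mtime


# Demo: an in-memory gzip stream that carries a scanner-style name.
buf = io.BytesIO()
with gzip.GzipFile(filename="8_finger_foot_lips.nii", mode="wb",
                   fileobj=buf, mtime=1234567890) as gz:
    gz.write(b"not a real image")
buf.seek(0)
print(read_gzip_header(buf))  # ('8_finger_foot_lips.nii', 1234567890)
```

For a file on disk, the same function should work on a handle opened with `open(path, "rb")`. Note that the time stamp is stored alongside the name, which is why gzip documents --name as restoring both.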

Where does the problem come from?

I think the problem is due to renaming the compressed NIfTI file. For example, if you run

cp sub-01_ses-test_task-fingerfootlips_bold.nii.gz new_file.nii.gz

The new_file.nii.gz will nonetheless still contain the old filename 8_finger_foot_lips.nii.

How to handle this issue?

At the moment, I see only two strategies: either rename the NIfTI images before compressing them, or gunzip all your .nii.gz files and compress them again with gzip. The latter overwrites the original name.

Has any of you already encountered the same issue? How did you handle it?
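The second strategy can be sketched in Python as well. This is only an illustration under stated assumptions: `recompress_without_name` is a hypothetical helper, it holds the whole image in memory, and it relies on Python's `gzip.GzipFile` leaving the name field out when it is given an empty `filename`, and writing a zero time stamp when `mtime=0`.

```python
import gzip
import io


def recompress_without_name(data: bytes) -> bytes:
    """Decompress gzip bytes and re-compress them with no stored name or time stamp."""
    raw = gzip.decompress(data)
    out = io.BytesIO()
    # filename="" leaves the FNAME header field unset; mtime=0 zeroes the time stamp.
    with gzip.GzipFile(filename="", mode="wb", fileobj=out, mtime=0) as gz:
        gz.write(raw)
    return out.getvalue()


# Demo: a stream that carries a scanner-style name, before and after scrubbing.
named_buf = io.BytesIO()
with gzip.GzipFile(filename="8_finger_foot_lips.nii", mode="wb",
                   fileobj=named_buf) as gz:
    gz.write(b"voxel data")
named = named_buf.getvalue()
clean = recompress_without_name(named)

print(bool(named[3] & 0x08))  # True: FNAME flag set, a name is stored
print(bool(clean[3] & 0x08))  # False: no name stored any more
```

To scrub a file on disk, one would read its bytes, pass them through the helper, and write the result to a new file. Verifying the new file before replacing the original seems safer than overwriting in place.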


You can also create the files with the --no-name flag.

Demonstration:

$ echo test > test.txt
$ gzip --no-name test.txt
$ mv test.txt.gz test2.txt.gz
$ gunzip --name test2.txt.gz 
$ ls test*.txt 
test2.txt

Thanks for reporting this!

Luckily I was not able to replicate it for files created by dcm2niix:

root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# dcm2niix -b y -z y .
Chris Rorden's dcm2niiX version v1.0.20180404 (OpenJPEG build) GCC4.8.4 (64-bit Linux)
Found 176 DICOM image(s)
Warning: Empty protocol name(s) (0018,1030)
Warning: Unable to append protocol name (0018,1030) to filename (it is empty).
Convert 176 DICOM as ./___20170306094534_19 (224x224x176x1)
compress: "/usr/bin/pigz" -n -f -6 "./___20170306094534_19.nii"
Conversion required 5.647209 seconds (0.538827 for core code).
root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# ls -al *.nii*
-rwxr-xr-x 1 root root 8606261 May  1 16:30 ___20170306094534_19.nii.gz
root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# mv ___20170306094534_19.nii.gz anon.nii.gz
root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# gunzip --name anon.nii.gz
root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# ls -al *.nii*
-rwxr-xr-x 1 root root 17662304 May  1 16:30 anon.nii

However, if I use gzip directly, I do see the same behaviour:

root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# gzip anon.nii
root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# mv anon.nii.gz anon2.nii.gz
root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# gunzip --name anon2.nii.gz
root@b070f3a309ca:/d/data/tom/sourcedata/11_MPRAGE_EnchancedContrast# ls -al *.nii*
-rwxr-xr-x 1 root root 17662304 May  1 16:32 anon.nii

This means that people using gzip to create .nii.gz files need to be careful, but users of dcm2niix (including heudiconv and dcm2bids) should not be affected.

@ChrisGorgolewski Perhaps the BIDS validator should warn for files whose stored names don’t match their current names (minus the .gz)?
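As a rough idea of what such a check could look like (this is not how the BIDS validator works, just a sketch with hypothetical function names), one could walk a dataset and report every .gz file whose stored name differs from its current name minus the .gz suffix. Files with no stored name are treated as fine.

```python
import gzip
import tempfile
from pathlib import Path


def stored_name(path):
    """Return the name stored in the gzip header of `path`, or None if absent."""
    with open(path, "rb") as f:
        header = f.read(10)
        if len(header) < 10 or header[:2] != b"\x1f\x8b":
            return None  # not a gzip file
        flags = header[3]
        if flags & 0x04:  # FEXTRA: skip the optional extra field
            xlen = int.from_bytes(f.read(2), "little")
            f.seek(xlen, 1)
        if not flags & 0x08:  # FNAME not set: nothing stored
            return None
        name = bytearray()
        while (byte := f.read(1)) not in (b"", b"\x00"):
            name += byte
        return name.decode("latin-1")


def find_mismatches(root):
    """Map each .gz file under `root` to its stored name when the two disagree."""
    mismatches = {}
    for path in sorted(Path(root).rglob("*.gz")):
        name = stored_name(path)
        if name is not None and name != path.name[:-3]:
            mismatches[path.name] = name
    return mismatches


# Demo: one file renamed after compression, one left untouched.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for fname in ("8_finger_foot_lips.nii.gz", "untouched.nii.gz"):
        with gzip.open(tmp / fname, "wb") as gz:
            gz.write(b"voxel data")
    (tmp / "8_finger_foot_lips.nii.gz").rename(tmp / "sub-01_bold.nii.gz")
    report = find_mismatches(tmp)

print(report)  # {'sub-01_bold.nii.gz': '8_finger_foot_lips.nii'}
```

Only the renamed file is flagged. A real check would probably also want to report the full path rather than just the file name.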


@miykael the source of the problem is not the copy (cp) command, but rather the compression step using gzip or pigz. As @effigies notes, you should create the files with the -n or --no-name argument (beware, this is case-sensitive). If you look at the output of dcm2niix above, it calls pigz with the -n option. As noted in the gzip help:

-n --no-name don't save original file name or time stamp

Also, I would suggest using pigz instead of gzip: it provides much faster parallel compression and a slightly faster decompressor.
