Run renaming and content-modifying script with Datalad Run

YumekaMengjiaLYU · November 12, 2021, 9:19pm

Hi friends,

I have a script that renames the data directories to BIDS format and inserts “IntendedFor” fields into the JSON files. I would like to run the script with datalad run so that the changes are tracked with provenance.

The problem is that I keep getting permission errors when running my script through datalad run, even though I specified the input and output parameters.

datalad run -m "rename files" \
    --input inputs/data \
    --output participants.tsv \
    --output problem_fmapjsons.txt \
    --output inputs/data \
    "python3 code/main.py"

where inputs/data is the BIDS root directory.

The error I keep getting is

PermissionError: [Errno 13] Permission denied: '/gpfs/fs001/cbica/projects/RBC/mengjia_space/HCPD-BIDIFY/inputs/data/sub-0968878/ses-V1/fmap/sub-0968878_ses-V1_dir-PA_run-01_epi.json'

sub-0968878/ses-V1/fmap/sub-0968878_ses-V1_dir-PA_run-01_epi.json is the file after it was renamed from its original, non-BIDS-conforming filename. For some reason, the renamed directories and files do not seem to be saved, retrieved, or unlocked during datalad run, which should get and unlock the inputs.

Any input is SO appreciated!!!

YumekaMengjiaLYU · November 12, 2021, 9:21pm

@yarikoptic I would love to have your inputs!!!

yarikoptic · November 13, 2021, 4:54am

odd

my minimal attempt to reproduce worked out ok:

(git-annex)lena:/tmp/testds[master]git
$> echo '1' >> 123        
zsh: permission denied: 123
(dev3) 1 24904 ->1.....................................:Fri 12 Nov 2021 11:50:34 PM EST:.
(git-annex)lena:/tmp/testds[master]git
$> datalad run --input 123 --output 123 bash -c "echo 1 >> 123"
[INFO   ] Making sure inputs are available (this may take some time) 
unlock(ok): 123 (file)
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
add(ok): 123 (file)                                                                                       
save(ok): . (dataset)

and then

(git-annex)lena:/tmp/testds[master]git
$> datalad run --input . --output . bash -c "mv 123 124; echo 1 >> 124"
[INFO   ] Making sure inputs are available (this may take some time) 
unlock(ok): 123 (file)
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
delete(ok): 123 (file)                                                                                    
add(ok): 124 (file)                                                                                       
save(ok): . (dataset)                                                                                     
(dev3) 1 24912.....................................:Fri 12 Nov 2021 11:54:02 PM EST:.
(git-annex)lena:/tmp/testds[master]git
$> ls -lta
total 52
drwx------  9 yoh  yoh   4096 Nov 12 23:54 .git/
drwx------  4 yoh  yoh   4096 Nov 12 23:54 ./
lrwxrwxrwx  1 yoh  yoh    108 Nov 12 23:54 124 -> .git/annex/objects/W7/mg/MD5E-s4--f2160c8ffedf48068f2e1137e0a3a7e7/MD5E-s4--f2160c8ffedf48068f2e1137e0a3a7e7
drwx------  2 yoh  yoh   4096 Nov 12 23:49 .datalad/
-rw-------  1 yoh  yoh     55 Nov 12 23:49 .gitattributes
drwxrwxrwt 26 root root 28672 Nov 12 23:49 ../
(dev3) 1 24913.....................................:Fri 12 Nov 2021 11:54:05 PM EST:.
(git-annex)lena:/tmp/testds[master]git
$> cat 124
1
1

In both cases datalad run unlocked the file, and thus the changes were saved. The rename in the 2nd example also worked out. Do you see unlock commands reported? What version of datalad are you using?

YumekaMengjiaLYU · November 14, 2021, 1:41am

Thanks so much for your reply!! I don’t see any unlock commands – there are only get commands reported.

The version of datalad is 0.14.6

I wonder if this has to do with the amount of data that datalad has to get/unlock. The datalad dataset inputs/data contains ~3 TB of study data for 652 subjects.

Thanks again for your prompt reply!

Best,
Mengjia

YumekaMengjiaLYU · November 14, 2021, 2:29am

I just checked on the last such datalad run job I ran on our cluster.
The run is:

datalad run -m "rename files" \
    --input inputs/data \
    --output participants.tsv \
    --output problem_fmapjsons.txt \
    --output inputs/data \
    "python3 code/main_11_12.py"

It got the following error:

[WARNING] Received an exception CommandError: 'git -c diff.ignoreSubmodules=none rm -- sub-0298758/ses-V1/rfMRI_REST1_PA/LINKED_DATA/PHYSIO/Physio_combined_c495c9a3-7e1f-43aa-9d09-2332738117ec.csv sub-0298758/ses-V1/rfMRI_REST1_PA/LINKED_DATA/PSYCHOPY/REST_HCD0298758_V1_A_run2.mp4 sub-0298758/ses-V1/rfMRI_REST1_PA/LINKED_DATA/PSYCHOPY/REST_HCD0298758_V1_A_run2_design.csv sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-AP_run-01_epi.json sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-AP_run-01_epi.nii.gz sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-PA_run-01_epi.json sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-PA_run-01_epi.nii.gz sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_bold.json sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_bold.nii.gz sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_sbref.json sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_sbref.nii.gz' failed with exitcode 128 under /gpfs/fs001/cbica/projects/RBC/mengjia_space/hcpd_11_12/inputs/data [out: 'rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/LINKED_DATA/PHYSIO/Physio_combined_c495c9a3-7e1f-43aa-9d09-2332738117ec.csv'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/LINKED_DATA/PSYCHOPY/REST_HCD0298758_V1_A_run2.mp4'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/LINKED_DATA/PSYCHOPY/REST_HCD0298758_V1_A_run2_design.csv'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-AP_run-01_epi.json'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-AP_run-01_epi.nii.gz'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-PA_run-01_epi.json'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_dir-PA_run-01_epi.nii.gz'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_bold.json'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_bold.nii.gz'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_sbref.json'
rm 'sub-0298758/ses-V1/rfMRI_REST1_PA/sub-0298758_ses-V1_task-rest_acq-REST1_dir-PA_run-02_sbref.nii.gz''] [err: 'fatal: Unable to write new index file'] [cmd.py:run:408].
| Canceling not-yet running jobs and waiting for completion of running.
| You can force earlier forceful exit by Ctrl-C. 
[INFO] Canceled 0 out of 0 jobs. 0 left running. 
Traceback (most recent call last):

Its stdout log file contains many records of get commands.

YumekaMengjiaLYU · November 14, 2021, 4:02am

In addition, would you recommend running datalad.api.get/unlock in the python script that would be executed with datalad run? I have a hunch that if I manually datalad.api.get/unlock the necessary files in the script, I might be able to resolve the error. I will try this on a small subset of the data. Thank you!
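
For illustration, the get/unlock idea could look roughly like the sketch below. It is untested against this dataset and it is not confirmed to resolve the permission error. The helper names find_fmap_jsons and get_and_unlock are hypothetical, and the pattern sub*/*/fmap/*epi.json is borrowed from the datalad run calls in this thread. Only datalad.api.get and datalad.api.unlock are real DataLad calls.

```python
from pathlib import Path


def find_fmap_jsons(bids_root):
    """Return sorted paths of fieldmap JSON sidecars below a BIDS root."""
    root = Path(bids_root)
    # Assumed layout: <root>/sub-*/ses-*/fmap/*epi.json
    return sorted(str(p) for p in root.glob("sub*/*/fmap/*epi.json"))


def get_and_unlock(paths, dataset):
    """Fetch annexed content and make it writable before the script edits it."""
    # Imported lazily so the path helper above works without DataLad installed.
    import datalad.api as dl

    if not paths:
        return
    dl.get(path=paths, dataset=dataset)
    dl.unlock(path=paths, dataset=dataset)


# Hypothetical usage at the top of the renaming script:
# get_and_unlock(find_fmap_jsons("inputs/data"), dataset="inputs/data")
```

Files have to be collected after any renaming step, since the pattern only matches the BIDS-style names.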

YumekaMengjiaLYU · November 14, 2021, 4:27am

Sorry to bother you again, but is there a way for me to clone using a glob pattern?
datalad clone -d . ~/RBC_RAWDATA/bidsdatasets/HCP_D/HCD00* inputs/data
It does not seem to work on my end.
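
A possible workaround, assuming each directory matched by HCD00* is its own DataLad dataset, is to expand the pattern first and clone the matches one at a time, since datalad clone takes a single source per call. The sketch below does that with datalad.api.clone. The helpers plan_clones and clone_all are hypothetical names, and only a wildcard in the last path component is handled.

```python
from pathlib import Path


def plan_clones(source_pattern, dest_root):
    """Pair every directory matching the glob with a destination below dest_root."""
    pattern = Path(source_pattern).expanduser()
    # Only the final path component may contain wildcards in this sketch.
    sources = sorted(p for p in pattern.parent.glob(pattern.name) if p.is_dir())
    return [(str(src), str(Path(dest_root) / src.name)) for src in sources]


def clone_all(source_pattern, dest_root, superdataset="."):
    """Clone each matching dataset as a subdataset of `superdataset`."""
    import datalad.api as dl  # lazy import: only needed when actually cloning

    for source, dest in plan_clones(source_pattern, dest_root):
        dl.clone(source=source, path=dest, dataset=superdataset)


# Hypothetical usage, mirroring the command above:
# clone_all("~/RBC_RAWDATA/bidsdatasets/HCP_D/HCD00*", "inputs/data")
```

Each match ends up in its own directory under inputs/data rather than all being merged into inputs/data itself.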

YumekaMengjiaLYU · November 14, 2021, 11:56pm

I’ve tried to split up the actions (rename, add intendedfor json fields etc) into several datalad run calls as below:

datalad run \
    -i code/HCPD_BIDS.sh \
    -i inputs/data/HCD* \
    --explicit \
    --expand both \
    -o inputs/data/sub* \
    -o participants.tsv \
    -m 'rename directories and files in bulk' \
    "bash code/HCPD_BIDS.sh"

#datalad get sub*/*/fmap/*epi.json
#datalad unlock sub*/*/fmap/*epi.json


datalad run \
    -i code/json_intendedfor.py \
    -i "inputs/data/sub*/*/*/*epi.json" \
    --explicit \
    --expand both \
    -o inputs/data/sub*/*/*/*epi.json \
    -m 'add the intendedfor column in the json files' \
    "python code/json_intendedfor.py"
#datalad get sub*/*/fmap/*epi.json
#datalad unlock sub*/*/fmap/*epi.json

datalad run \
    -i code/participant_tsv.py \
    --explicit \
    -o participants.tsv \
    -m "fill participant.tsv with subject id" \
    "python code/participant_tsv.py"

datalad run \
    -i code/intendedfor_path.py \
    -i "sub*/*/fmap/*epi.json" \
    --explicit \
    --expand inputs \
    -o problem_fmapjsons.txt \
    -m 'collect problem json files without the intendedfor field'\
    "python code/intendedfor_path.py"

After the script finished, I checked the git log and found out that only one datalad run command was executed:

commit 172078ac6a97fd861d2265015a6b7bc321f14ce0 (HEAD -> master)
Date:   Sat Nov 13 22:39:56 2021 -0500

    [DATALAD RUNCMD] collect problem json files without the intendedfor field
    
    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "python code/intendedfor_path.py",
     "dsid": "60d95bed-78d4-4b3a-8a85-7ef1c37f066d",
     "exit": 0,
     "extra_inputs": [],
     "inputs": [
      "code/intendedfor_path.py",
      "sub*/*/fmap/*epi.json"
     ],
     "outputs": [
      "problem_fmapjsons.txt"
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^

Any input would be greatly appreciated!!
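
One thing that may be worth checking in the calls above is that some glob patterns are quoted ("inputs/data/sub*/*/*/*epi.json") and some are not (inputs/data/HCD*, inputs/data/sub*). An unquoted pattern is expanded by the shell before datalad run sees it, so only the first match directly follows -i or -o and the remaining matches arrive as separate words. The sketch below shows the difference using a throwaway directory with made-up folder names. It needs a POSIX sh and does not call DataLad, so whether this explains the missing commits here is only a guess.

```python
import subprocess
import tempfile
from pathlib import Path


def shell_argv(command_line, cwd):
    """Return the argument list a POSIX shell would hand to a program."""
    # printf prints every argument it receives on its own line.
    script = "printf '%s\\n' " + command_line
    result = subprocess.run(["sh", "-c", script], cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


with tempfile.TemporaryDirectory() as tmp:
    for name in ("HCD001", "HCD002", "HCD003"):  # hypothetical subject folders
        (Path(tmp) / name).mkdir()
    unquoted = shell_argv("-i HCD*", tmp)    # the shell expands the pattern
    quoted = shell_argv('-i "HCD*"', tmp)    # the pattern is passed through as-is

print(unquoted)  # ['-i', 'HCD001', 'HCD002', 'HCD003']
print(quoted)    # ['-i', 'HCD*']
```

With the quoted form the pattern reaches datalad run as a single argument, which matches the one call in this thread that was recorded.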

YumekaMengjiaLYU · November 15, 2021, 12:17am

Dear @yarikoptic ,

Thanks so much for your time and help. I will work on it for a while, and I apologize for asking so many questions (which are probably confusing).

Thanks again,
Mengjia

YumekaMengjiaLYU · November 15, 2021, 2:26am

I would like to close this topic, but I do not seem to have permission to do so. Thanks again for your help and time~

Best,
Mengjia