Creating datalad dataset with existing directories

I’d like to add datalad on top of my existing directory structure, where I have separate folders for the code and the data. In the future I’d like to add different data sources to my work and also keep track of changes to my current dataset. Datalad seems like a fine tool for this. The problem I have is creating a dataset over the existing directories. My directory structure looks roughly like this:

project
|-> code
|-> data
|   |-> original
|   |-> processed

I can create a dataset with datalad create --force ./project. The problem is that I’d also like to make the original folder a subdataset. When I try datalad create --force -d ./project ./project/data/original, I get:

[ERROR ] CommandError: command '['git', '--work-tree=.', 'submodule', 'status']' failed with exitcode 139
| Failed to run ['git', '--work-tree=.', 'submodule', 'status'] under '/PATH/project'. Exit code=139. out= err=/share/apps/git-annex/6.20180227/git-core/git-submodule: line 979: 49578 Segmentation fault git ${wt_prefix:+-C "$wt_prefix"} ${prefix:+--super-prefix "$prefix"} submodule--helper status ${GIT_QUIET:+--quiet} ${cached:+--cached} ${recursive:+--recursive} "$@"
| [cmd.py:run:520] [subdatasets.py:_parse_git_submodules:114] (InvalidGitRepositoryError)

Shouldn’t this be the way to create these subdatasets? I’m using datalad 0.9.2 and git-annex 6.20180227.

EDIT:

I found that the forced create doesn’t add anything to the dataset. So I ran datalad add -r project, and after that I was able to create the subdataset. However, running datalad subdatasets in the project folder prints nothing. Shouldn’t it list the subdataset I just created? Also, datalad ls shows only

. [annex] master ✗ 2018-03-23/15:42:55 ✗

in the project folder and

. dir

in the data folder.

I don’t yet understand what is happening. Could you please run
datalad subdatasets -r

I get nothing when I run that.
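
To dig a bit deeper, I could check at the plain git level whether anything was registered at all (my guess at the relevant commands, run in the project folder):

cd project
cat .gitmodules                          # should contain a [submodule "data/original"] section
git submodule status                     # should list the subdataset commit and path
git config --get-regexp '^submodule\.'   # registration entries in the repo config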

With datalad plugin wtf I get
Dataset information
===================
path: PATH/project
repo: AnnexRepo

and

submodule.original.active: true
submodule.original.url: PATH/project/original

The last one might be from a previous test, where I created the original dataset in the project folder and not in the data folder.
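
If that entry really is leftover from the previous test, I assume plain git could drop it again (a guess on my part; the section name is taken from the wtf output above, and the second command only applies if the entry also exists in .gitmodules):

git config --remove-section submodule.original                      # drop the stale entry from .git/config
git config --file .gitmodules --remove-section submodule.original   # same for .gitmodules, if present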

I’m not sure if this is related, but when I try to run a script in the code folder I get

datalad run python code/testProcess.py
run(impossible): /PATH/project (dataset) [unsaved modifications present, cannot detect changes by command]

but with datalad save I get

(restmegenv)[rantala2@login2]/scratch/nbe/restmeg/test_everything% datalad save
save(notneeded): /PATH/project (dataset)

I removed one extra folder before this, however, and I guess that is why the run fails. But why doesn’t save notice that change?
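
In case it helps with debugging, this is what I would check at the git level to see which modification datalad is tripping over (my guess, since datalad’s change detection sits on top of git):

git status                        # modified/untracked paths in the superdataset
git submodule foreach git status  # the same check inside each registered subdataset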

OK, I removed all of the hidden folders and files and tried again. This time everything seemed to work. In the process, however, I lost all files in the repo (all that was left were symbolic links to the hidden folders I’d removed). Luckily this was just a small test directory. Is there some official way to remove datalad from a directory? Just to reiterate, the correct way to create these over an existing directory structure is:

datalad create --force project
cd project/data
datalad create --force -d ../ data/original
cd ..
datalad add .
cd data/original
datalad add .

Is this correct?

Is there some official way to remove datalad from a directory?

Do you mean something like datalad remove --nocheck directory/, where you are trying to remove a dataset while disregarding the fact that data files might no longer be available elsewhere?

I just saw that the original ERROR message includes a Segmentation fault… not good! Although it is not clear whether that comes from git-submodule or something else. Could you still replicate it?

The commands look good. But it should work even without an explicit cd anywhere:

yoh@hopa:/tmp> mkdir -p project/{code,data/{original,processed}}/subdir

yoh@hopa:/tmp> echo 2 > project/data.dat

yoh@hopa:/tmp> echo 1 > project/data/original/subdir/datafile.dat

yoh@hopa:/tmp> datalad create -f ./project/
[INFO   ] Creating a new annex repo at /tmp/project 
create(ok): /tmp/project (dataset)                                               

yoh@hopa:/tmp> datalad create --force -d ./project ./project/data/original 
[INFO   ] Creating a new annex repo at /tmp/project/data/original 
add(ok): data/original (dataset) [added new subdataset]                          
add(notneeded): .gitmodules (file) [already included in the dataset]             
add(notneeded): data/original (dataset) [nothing to add from /tmp/project/data/original]
save(ok): /tmp/project (dataset)                                                 
create(ok): data/original (dataset)
action summary:
  add (notneeded: 2, ok: 1)
  create (ok: 1)
  save (ok: 1)

yoh@hopa:/tmp> datalad ls -r project/
project                 [annex]  master  ✗ 2018-03-29/23:55:16  ✗
project/data/original   [annex]  master  ✗ 2018-03-29/23:55:16  ✗

yoh@hopa:/tmp> datalad add -r -m "initial data added" project
add(ok): /tmp/project/data/original/subdir/datafile.dat (file)                   
add(ok): /tmp/project/data/original (dataset)
add(ok): /tmp/project/data.dat (file)                                            
add(ok): /tmp/project (dataset)
save(ok): /tmp/project/data/original (dataset)
save(ok): /tmp/project (dataset)
action summary:
  add (ok: 4)
  save (ok: 2)

yoh@hopa:/tmp> datalad ls -r project/
project                 [annex]  master  ✗ 2018-03-29/23:55:26  ✓
project/data/original   [annex]  master  ✗ 2018-03-29/23:55:25  ✓


Hi,

Sorry for the long hiatus. I couldn’t reproduce the problem again. My question about removing datalad was about removing datalad from a folder without removing anything from the filesystem. If for some reason I no longer wanted to use datalad for the project, how would I do that? datalad remove deletes the folders from the filesystem, and the same happens with datalad uninstall. If I just remove the .datalad and .git folders I lose all of the data too, since what is left is just symbolic links. Is the correct way to first unlock everything (datalad unlock -r) and then remove the hidden folders from each dataset?

That would be sad, of course, but if you must, check this out:

% git annex uninit --help
git-annex uninit - de-initialize git-annex and clean out repository
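
So a rough teardown sketch, assuming you want the plain files back with no annex/datalad metadata left, could look like this (untested for your layout, so please try it on a copy first, and handle the subdataset before the superdataset):

cd project/data/original
git annex uninit           # move content out of the annex, back to plain files
rm -rf .git .datalad       # drop the remaining git/datalad metadata
cd ../..
git annex uninit
rm -rf .git .datalad .gitmodules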

I think I’ll have to remove datalad from this project. :( I’m not sure whether the trouble is IO speed on the remote disk or the size/number of the files, but datalad save takes hours to finish, and git runs at 100% CPU the whole time. The tool had promise, but for one reason or another it is really slow on my system.

Without knowing details about your setup it is hard to say anything. If your “remote disk” is very slow, that might explain this. Unlike a plain file system, where a “save” would just transport the data once, git-annex additionally needs to checksum each file in order to put it into the annex. So if you have huge files on a slow (network?) drive, this can be very slow.

However, decentralized data management enables you to NOT use shared network drives in some situations, specifically avoiding this problem.

A potentially related issue would be having several (hundreds of) subdatasets that are all modified or have new content. In that case it might make a difference how exactly you call save in order to optimize runtime.
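
For example, a single recursive save from the top is typically cheaper than saving each subdataset separately, since everything is handled in one pass (assuming your datalad version supports --recursive for save; the commit message is just an example):

datalad save -r -m "one pass over super- and subdatasets"
# instead of saving each (sub)dataset on its own:
# cd data/original; datalad save; cd ../..; datalad save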

The total dataset is large and so are the files: over 3 TB in total, at about 1 GB per file. I have only one subdataset, so that isn’t the problem. “Network drive” is the word I was searching for. The shared network drive is provided by the university and is needed so that I can access the data from a remote computing cluster.

That would explain the runtime. Assuming you have a full-speed gigabit connection to that computing cluster (~125 MB/s in theory, closer to 100 MB/s in practice), transferring all 3 TB to the client (your machine) for hashing takes roughly 3,000,000 MB / 100 MB/s ≈ 30,000 s, i.e. at least 8 hours. This is a very inefficient use of git-annex. There is no way to perform client-side hashing without transferring the data at least once, and hashing has to happen when a file is initially added to a dataset, because it is the hash (only) that goes into git, not the actual file content. Depending on the type of network drive (NFS, CIFS, or something else?), a single transfer might not even be enough.

For any such procedure, it is most effective to run it as close to (or possibly right on) the file server as possible.
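
Concretely, instead of hashing everything through the mounted network share, you could run the expensive add/save step on a machine with local access to the storage (the hostname below is made up; the path is taken from your prompt above):

ssh fileserver.example.org                  # hypothetical host with local access to the storage
cd /scratch/nbe/restmeg/test_everything     # the dataset location from your prompt
datalad add -r -m "initial data added" .    # hashing now reads from local disk, not over the network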

If you can provide more info on your use case (connection speeds, filesystems, operating systems, data access patterns, i.e. where the data lives and which machines need to access it to perform what kind of read/write operations), it might be possible to give more useful advice.