Creating datalad dataset with existing directories

datalad

#1

I’d like to add datalad to my existing directory structure, where I have separate folders for the code and the data. For future use I’d like to add different data sources to my work and also keep track of the changes to my current dataset. Datalad seems to be a fine tool for this. The problem I have is creating a dataset over the existing directories. My directory structure is a bit like this

project
|->code
|->data
| | -> original
| | -> processed

I can create a dataset with datalad create --force ./project. The problem is that I’d like to turn the original folder into a subdataset. When I try datalad create --force -d ./project ./project/data/original, I get

[ERROR ] CommandError: command '['git', '--work-tree=.', 'submodule', 'status']' failed with exitcode 139
| Failed to run ['git', '--work-tree=.', 'submodule', 'status'] under '/PATH/project'. Exit code=139. out= err=/share/apps/git-annex/6.20180227/git-core/git-submodule: line 979: 49578 Segmentation fault git ${wt_prefix:+-C "$wt_prefix"} ${prefix:+--super-prefix "$prefix"} submodule--helper status ${GIT_QUIET:+--quiet} ${cached:+--cached} ${recursive:+--recursive} "$@"
| [cmd.py:run:520] [subdatasets.py:_parse_git_submodules:114] (InvalidGitRepositoryError)

Shouldn’t this be the way to create these subdatasets? I’m using datalad 0.9.2 and git-annex 6.20180227.

EDIT:

I found that the forced create doesn’t add anything to the dataset. So I ran datalad add -r project, and after this I was able to create the subdataset. However, running datalad subdatasets in the project folder returns nothing. Shouldn’t I get the subdataset that I just created? Also, datalad ls shows only
. [annex] master ✗ 2018-03-23/15:42:55 ✗
In the project folder and
. dir
in the data folder.


#2

I don’t yet understand what is happening. Could you please run
datalad subdatasets -r


#3

I get nothing when I run that.

With datalad plugin wtf I get
Dataset information
===================
path: PATH/project
repo: AnnexRepo

and

submodule.original.active: true
submodule.original.url: PATH/project/original

The last one might be from a previous test, where I created the original dataset in the project folder and not in the data folder.

I’m not sure whether this is related, but when I try to run a script in the code folder I get
(restmegenv) datalad run python code/testProcess.py
run(impossible): /PATH/project (dataset) [unsaved modifications present, cannot detect changes by command]

but with datalad save I get

(restmegenv)[rantala2@login2]/scratch/nbe/restmeg/test_everything% datalad save
save(notneeded): /PATH/project (dataset)

I did remove one extra folder before this, however, and I guess that is why the run fails. But why doesn’t the save notice that change?


#4

Ok, I removed all of the hidden folders and files and tried again. This time everything seemed to work. In the process, however, I lost all files in the repo (all that was left were symbolic links into the hidden folders I’d removed). Luckily this was just a small test directory. Is there some official way to remove datalad from a directory? Just to reiterate, the correct way to create these over an existing directory structure is:

datalad create --force project
cd project/data
datalad create --force -d ../ data/original
cd ..
datalad add .
cd data/original
datalad add .

Is this correct?


#5

Is there some official way to remove datalad from a directory?

Do you mean something like datalad remove --nocheck directory/, where you are trying to remove a dataset while disregarding the fact that the data files might no longer be available elsewhere?


#6

I just saw that the original ERROR message includes a Segmentation fault… not good! It is not clear, though, whether that comes from git-submodule or something else. Could you still replicate it?


#7

The commands look good, but it should work even without an explicit cd anywhere:

yoh@hopa:/tmp> mkdir -p project/{code,data/{original,processed}}/subdir

*yoh@hopa:/tmp> echo 2 > project/data.dat

*yoh@hopa:/tmp> echo 1 > project/data/original/subdir/datafile.dat

yoh@hopa:/tmp> datalad create -f ./project/
[INFO   ] Creating a new annex repo at /tmp/project 
create(ok): /tmp/project (dataset)                                               

yoh@hopa:/tmp> datalad create --force -d ./project ./project/data/original 
[INFO   ] Creating a new annex repo at /tmp/project/data/original 
add(ok): data/original (dataset) [added new subdataset]                          
add(notneeded): .gitmodules (file) [already included in the dataset]             
add(notneeded): data/original (dataset) [nothing to add from /tmp/project/data/original]
save(ok): /tmp/project (dataset)                                                 
create(ok): data/original (dataset)
action summary:
  add (notneeded: 2, ok: 1)
  create (ok: 1)
  save (ok: 1)

yoh@hopa:/tmp> datalad ls -r project/
project                 [annex]  master  ✗ 2018-03-29/23:55:16  ✗
project/data/original   [annex]  master  ✗ 2018-03-29/23:55:16  ✗

yoh@hopa:/tmp> datalad add -r -m "initial data added" project
add(ok): /tmp/project/data/original/subdir/datafile.dat (file)                   
add(ok): /tmp/project/data/original (dataset)
add(ok): /tmp/project/data.dat (file)                                            
add(ok): /tmp/project (dataset)
save(ok): /tmp/project/data/original (dataset)
save(ok): /tmp/project (dataset)
action summary:
  add (ok: 4)
  save (ok: 2)

yoh@hopa:/tmp> datalad ls -r project/
project                 [annex]  master  ✗ 2018-03-29/23:55:26  ✓
project/data/original   [annex]  master  ✗ 2018-03-29/23:55:25  ✓


#8

Hi,

Sorry for the long hiatus. I couldn’t reproduce the problem again. My question about removing datalad was about removing datalad from the folder without removing anything from the filesystem. If for some reason I no longer wanted to use datalad for the project, how would I do that? datalad remove removes the folders from the filesystem, and the same happens with datalad uninstall. If I just remove the .datalad and .git folders I lose all of the data too, since what is left is just symbolic links. Is the correct way to first unlock everything (datalad unlock -r) and then remove the hidden folders from each dataset?
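
The data loss described here can be illustrated without datalad or git-annex at all. The following is a simulation only, assuming a simplified layout (the real annex object store uses deeper, hashed paths): a file is a symlink pointing into .git/annex/objects, so deleting .git leaves a dangling link. The file name and payload are made up.

```shell
#!/bin/sh
# Simulation only: mimics an annexed file with plain coreutils.
set -e
demo="$(mktemp -d)"
mkdir -p "$demo/.git/annex/objects"
echo "payload" > "$demo/.git/annex/objects/key1"
# The visible file is just a symlink to the content under .git/annex/objects.
ln -s ".git/annex/objects/key1" "$demo/datafile.dat"

before="$(cat "$demo/datafile.dat")"    # readable while .git exists

rm -rf "$demo/.git"                     # "removing the hidden folders"

# List symlinks whose target no longer exists.
dangling="$(cd "$demo" && find . -type l ! -exec test -e {} \; -print)"
echo "content before: $before"
echo "dangling after: $dangling"
rm -rf "$demo"
```

The same find expression can be run in a real dataset to check for broken links before and after any cleanup.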


#9

That would be sad, of course, but if you must, check this out:

% git annex uninit --help
git-annex uninit - de-initialize git-annex and clean out repository
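
For nested datasets, one possible order of operations is sketched below. This is an assumption, not an official datalad procedure: it presumes git annex uninit is run inside each dataset, subdatasets first, and that the .git and .datalad folders are removed only afterwards. The helper print_teardown is hypothetical and only prints the commands; try them on a copy of the data first.

```shell
#!/bin/sh
# Hypothetical helper: print, but do not run, a possible teardown per dataset.
# Assumption: subdatasets are passed before the dataset that contains them.
print_teardown() {
    for ds in "$@"; do
        # uninit is run in a subshell so the working directory is unchanged.
        echo "(cd $ds && git annex uninit)"
        echo "rm -rf $ds/.git $ds/.datalad"
    done
}

teardown_plan="$(print_teardown project/data/original project)"
echo "$teardown_plan"
```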