I’m a beginner with datalad. I’ve been reading docs and playing around for a month or two and cannot figure out how to create the type of datalad dataset I’d like to create.
I have some existing HTTP-accessible data sets that I would like to convert into a datalad dataset. I have tried datalad download-url and it seems to work locally, but when I then push that to GitHub, clone, and try ‘datalad get’, the remote files are not downloaded. I just see symlinks to files that do not exist locally.
Put differently, I’d like to use datalad as a wrapper around existing web-based data files. Then, if some files are on host A and others on host B, users should be able to access them all easily in one place via datalad. I don’t see this usage example in the handbook, but based on the documentation it seems like datalad supports it.
Somewhat related - if I ‘datalad download-url’ a CSV file, and then try to drop that file, it remains in the repository. I see
drop (notneeded: 1)
And when I push the dataset, the CSV is pushed to GitHub. I’d like the CSV to not be pushed; instead it should be fetched from the original HTTP location used with download-url when third parties clone and ‘get’ the data.
A good strategy is to post the actual commands you used so we can see what needs to be adjusted in your invocations. Otherwise we just have to guess:
To push to GitHub, make sure you push the git-annex branch as well. datalad push does that for you.
CSV in git and not in git-annex: most likely you created the dataset using -c text2git, so the CSV, which is a text file, was added to git.
But my guesses can be wrong
Also I would recommend to apt install neurodebian and then upgrade datalad from it
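To illustrate the point about the git-annex branch, here is a minimal sketch of publishing with datalad push rather than a plain git push. The function name publish_dataset and the default sibling name origin are assumptions made for illustration; the sibling must already be configured (for example with git remote add).

```shell
# Hypothetical helper: publish the current dataset to a sibling.
# A bare `git push -u origin main` only sends the branch you name, whereas
# `datalad push` also publishes the git-annex branch, which records where
# annexed file content can be found (for example a web URL).
publish_dataset() {
    sibling="${1:-origin}"   # assumed sibling name, defaults to "origin"
    datalad push --to "$sibling"
}
```

Run publish_dataset from inside the dataset. In a fresh clone, git annex whereis test.csv should then list the locations git-annex knows about for that file, which is a quick way to check that the URL information made it to GitHub.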
Thanks for the suggestions.
apt install neurodebian doesn’t do anything on a stock 20.04 Ubuntu install. However, I am now running datalad 0.14.5 through ‘conda install datalad’. This seems to be a better upgrade method for non-neuro users (I’m a climate scientist, not a neuroscientist).
Here are the specific commands I’m running, trying to create a new datalad dataset that provides access to existing web-accessible data.
# create a datalad dataset
datalad create -D "test" test
# fetch a file
datalad download-url -m "Download sample data" "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=doi:10.22008/promice/data/ice_discharge/d/v02/V0WDHH" -O "test.csv"
# generate a new file
datalad run -m "Last data point" "(head -n1 test.csv; tail -n1 test.csv) > latest.csv"
git remote add origin email@example.com:cryo-data/testing.git
git push -u origin main
Everything looks good here… the CSV files aren’t really uploaded to GitHub
datalad clone firstname.lastname@example.org:cryo-data/testing.git
datalad get * # nothing happens
With v0.14 I now get the following error message when I run datalad get, which may be helpful, although after searching online I was still not able to decipher it (I’m new to git-annex too).
get(ok): test.csv (file) [from web…]
[ERROR ] not available; (Note that these git remotes have annex-ignore set: origin) [get(/home/kdm/tmp/datalad/testing/latest.csv)]
get(error): latest.csv (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get (error: 1, ok: 1)
Aha! I have made progress :).
datalad get * does get the ‘test.csv’ from the website as expected. It does not get the ‘latest.csv’ file generated with datalad run. I can re-create it with datalad rerun though.
Is there a way to have datalad get trigger a rerun? Or would it be better to have a Makefile in the repository and tell people to datalad clone and then make, so that make can both get and run as needed?
My use case may involve thousands of files and a minor or major modification to each file, for example re-projecting something geospatial. It would be nice to be able to use the shell wildcard features of the datalad get command to get subsets of the final files, which are created by 1) downloading and 2) running a simple command. Selective re-running based on Unix wildcards seems more complicated. Is this use case supported?
So, you got it working, great. As for rerun: just do it.
$> datalad rerun 181c61d2afb20d4e6166b0b4f8cdb02c72909c04
[INFO ] run commit 181c61d; (Last data point)
[WARNING] no content present; cannot unlock [unlock(/tmp/testing/latest.csv)]
[INFO ] == Command start (output follows) =====
[INFO ] == Command exit (modification check follows) =====
add(ok): latest.csv (file)
add (ok: 1)
remove (ok: 1)
save (notneeded: 1)
$> head latest.csv
Date,Discharge [Gt yr-1]
As for “automagically” doing a rerun when you get a file, see: datalad-run external special remote · Issue #2850 · datalad/datalad · GitHub
But an overall note: datalad (even with run and rerun) is not a workflow manager. So indeed you might want to use make or the like to automate mass conversions etc. Maybe eventually a datalad-run special remote could assist.
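Until something like that exists, selective regeneration by wildcard can be scripted. The following is only a sketch: the function name regenerate is hypothetical, and it assumes that the commit which last modified each derived file was recorded by datalad run.

```shell
# Hypothetical helper: regenerate derived files in a fresh clone by
# re-running the commit that last modified each of them.
# Assumption: that commit is a `datalad run` record.
regenerate() {
    for output in "$@"; do
        # Find the most recent commit that touched this file.
        commit=$(git log -n 1 --format=%H -- "$output")
        if [ -z "$commit" ]; then
            echo "no history for $output, skipping" >&2
            continue
        fi
        datalad rerun "$commit"
    done
}
```

Called as regenerate *latest*.csv, the shell expands the wildcard against the symlinks in the working tree, so only the matching outputs are re-created. If the original run did not declare its inputs, fetch them first (for example with datalad get test.csv), since rerun can only retrieve inputs that were recorded.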
I’m a bit confused by
Given use-cases like:
What are those if not ‘workflow manager’ (or in the case of the first link, what is the difference between ‘workflow manager’ and ‘data management workflow’? :)).
Sorry for using ambiguous semantics. By “workflow manager” I meant something like
nipype, etc., or even dvc (see e.g. the handbook chapter): a platform which allows establishing a DAG of dependencies between inputs/outputs to allow for efficient recomputation upon changes to any particular
Although datalad run does establish a DAG of changes (and even input/output relationships, if provided by the user) within git history, it is not intended to duplicate or replace the functionality of the full-fledged “workflow managers” mentioned above. It is primarily meant to provide a convenient mechanism to
- ensure transitioning between clean (committed) states;
- record provenance (command, inputs/outputs) of the results of running a command in a medium we love that is native to git (a commit), thus allowing retrospection with the tools at hand (e.g.,
- give a chance (but not necessarily a guarantee) for a particular part of the history (one commit or more) to be rerun (thus somewhat extending git’s “cherry-pick”ing from pure “done manually and reflected as a patch” to “done by a script or command”).
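As a sketch of the points above, the command from earlier in this thread can be recorded with its input and output declared explicitly. The function name record_last_point is hypothetical; the file names are the ones used above.

```shell
# Hypothetical helper: record the "last data point" step with declared
# inputs and outputs, so the provenance ends up in the commit.
record_last_point() {
    # --input is retrieved before the command runs;
    # --output is unlocked so the command may overwrite it.
    datalad run -m "Last data point" \
        --input test.csv \
        --output latest.csv \
        "(head -n1 test.csv; tail -n1 test.csv) > latest.csv"
}
```

After calling record_last_point in a clean dataset, git log -1 should show a commit whose message carries the run record (command, inputs, outputs), and datalad rerun can later be pointed at that commit.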
A “data management workflow” is a higher-level concept describing approaches to managing data (sharing, tracking changes, etc.).
I hope this helps.