How to create a new datalad dataset with existing HTTP-accessible data

Hello,

I’m a beginner with datalad. I’ve been reading the docs and playing around for a month or two, and I cannot figure out how to create the type of datalad dataset I’d like to create.

I have some existing HTTP-accessible data sets that I would like to convert into a datalad dataset. I have tried datalad download-url and it seems to work locally, but when I then push that to GitHub, clone, and try ‘datalad get’, the remote files are not downloaded; I just see symlinks to files that do not exist locally.

Put differently, I’d like to use datalad as a wrapper around existing web-based data files. Then, if some files are on host A and others on host B, users should be able to access them all easily in one place via datalad. I don’t see this usage example in the handbook, but it seems like datalad supports it based on the documentation.

Thank you,

Ken

datalad 0.12.4
Ubuntu 20.04

Somewhat related - if I ‘datalad download-url’ a CSV file, and then try to drop that file, it remains in the repository. I see

action summary:
drop (notneeded: 1)

And when I push the dataset, the CSV is pushed to GitHub. I’d like the CSV not to be pushed; instead it should be fetched from the original HTTP location used with download-url when third parties clone and ‘get’ the data.

A good strategy is to post the actual commands you used so we can see what needs to be adjusted in your invocations. Otherwise we just have to guess:

To push to GitHub, make sure you push the git-annex branch as well; datalad push does that for you.
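In case it is not clear what that means in git terms, roughly (datalad push should take care of it, so this is just the manual equivalent):

#+BEGIN_SRC bash
# push both the data branch and the git-annex metadata branch
git push origin main git-annex
# or simply let datalad handle both
datalad push --to origin
#+END_SRC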

CSV in git and not git-annex: most likely you created the dataset using -c text2git, so the CSV, being a text file, was added to git.
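If you want to verify, something like this should tell you (run inside the dataset):

#+BEGIN_SRC bash
# an annexed file has git-annex info; a file committed straight to git does not
git annex info test.csv
# the -c text2git rule ends up here, so you can see what is routed to git
cat .gitattributes
#+END_SRC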

But my guesses can be wrong :wink:

Also, I would recommend apt install neurodebian and then upgrading datalad from it.
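Roughly like this (a sketch only; exact package availability depends on your release):

#+BEGIN_SRC bash
sudo apt install neurodebian   # enables the NeuroDebian apt repository
sudo apt update
sudo apt install datalad       # should now pull a newer datalad build
#+END_SRC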

Hi Yaroslav,

Thanks for the suggestions. apt install neurodebian doesn’t do anything on a stock 20.04 Ubuntu install. However, I am now running datalad 0.14.5 through ‘conda install datalad’. This seems to be a better upgrade method for non-neuro users (I’m a climate scientist, not a neuroscientist).
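For reference, what I ran (I believe the conda-forge channel is where the recent builds come from, so specifying it explicitly may be safer):

#+BEGIN_SRC bash
conda install -c conda-forge datalad
datalad --version   # 0.14.5
#+END_SRC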

Here are the specific commands I’m running, trying to create a new datalad dataset that provides access to existing web-accessible data.

Create a datalad dataset

#+BEGIN_SRC bash
datalad create -D "test" test
cd test

# fetch a file
datalad download-url -m "Download sample data"  "https://dataverse01.geus.dk/api/access/datafile/:persistentId?persistentId=doi:10.22008/promice/data/ice_discharge/d/v02/V0WDHH" -O "test.csv"

# generate a new file
datalad run -m "Last data point" "(head -n1 test.csv; tail -n1 test.csv) > latest.csv"
#+END_SRC
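(A variant of that last step I have not tried yet, but which may matter below: datalad run accepts --input/--output and the {inputs}/{outputs} placeholders, so the run record knows which files the command needs and produces.)

#+BEGIN_SRC bash
datalad run -m "Last data point" \
  --input test.csv --output latest.csv \
  "(head -n1 {inputs}; tail -n1 {inputs}) > {outputs}"
#+END_SRC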

Publish dataset

#+BEGIN_SRC bash
git remote add origin git@github.com:cryo-data/testing.git
git push -u origin main
datalad push
#+END_SRC
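(A quick sanity check afterwards — I assume this is a reasonable way to confirm the annex metadata made it to GitHub and that the web is registered as a content source:)

#+BEGIN_SRC bash
git branch -r                # should list origin/git-annex next to origin/main
git annex whereis test.csv   # should list the "web" special remote as a source
#+END_SRC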

Everything looks good here: the CSV files aren’t actually uploaded to GitHub.

Clone dataset

#+BEGIN_SRC bash
cd ~/tmp/datalad
datalad clone git@github.com:cryo-data/testing.git
cd testing
datalad get * # nothing happens
#+END_SRC

With v0.14 I now get the following error message when I run datalad get, which may be helpful, although after searching online I was still not able to decipher it (I’m new to git-annex too).

get(ok): test.csv (file) [from web…]
[ERROR ] not available; (Note that these git remotes have annex-ignore set: origin) [get(/home/kdm/tmp/datalad/testing/latest.csv)]
get(error): latest.csv (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
action summary:
get (error: 1, ok: 1)
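If I understand the git-annex side correctly (I may well not), the difference between the two files should be visible with:

#+BEGIN_SRC bash
git annex whereis test.csv    # has the "web" source recorded by download-url
git annex whereis latest.csv  # only my original machine has its content, which GitHub cannot serve
#+END_SRC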

Aha! I have made progress :). datalad get * does get the ‘test.csv’ from the website as expected.

I can’t get the ‘latest.csv’ file generated with datalad run, though I can re-create it with datalad rerun.

Is there a way to have get trigger a rerun? Or would it be better to have a Makefile in the repository and tell people to datalad clone and then make, so that make can both get and run as needed?

My use case may involve 1000s of files and a minor or major modification to each file, for example re-projecting something geospatial. It would be nice to be able to use the shell wildcard features of the datalad get command to get subsets of the final files, which are created by 1) downloading and 2) running a simple command. Selective re-running based on Unix wildcards seems more complicated. Is this use case supported?

Thank you,

-k.

so - you got it working, great. As for rerun – just do it :wink:

(git-annex)lena:/tmp/testing[main]
$> datalad rerun 181c61d2afb20d4e6166b0b4f8cdb02c72909c04
[INFO   ] run commit 181c61d; (Last data point) 
[WARNING] no content present; cannot unlock [unlock(/tmp/testing/latest.csv)] 
remove(ok): latest.csv
[INFO   ] == Command start (output follows) ===== 
[INFO   ] == Command exit (modification check follows) ===== 
add(ok): latest.csv (file)                                                                                
action summary:                                                                                           
  add (ok: 1)
  remove (ok: 1)
  save (notneeded: 1)

$> head latest.csv 
Date,Discharge [Gt yr-1]
2021-06-18,498.782

For “automagically” doing rerun on get, see: datalad-run external special remote · Issue #2850 · datalad/datalad · GitHub

But an overall note – datalad (even having run and rerun) is not a workflow manager. So indeed you might like to use make or the like to automate mass conversions etc. Maybe eventually the datalad-run special remote could assist.
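For example, a rough, hypothetical sketch (convert.sh and the directory layout here are made up) of how a simple script or a make rule would sit on top of datalad for mass conversion:

#+BEGIN_SRC bash
# (re)run the conversion only for derived files that do not exist yet
for f in raw/*.csv; do
  out="derived/$(basename "$f")"
  [ -e "$out" ] && continue            # already converted and present
  # --input makes datalad fetch the raw file first; --output records the result
  datalad run -m "convert $(basename "$f")" \
    --input "$f" --output "$out" \
    "./convert.sh {inputs} {outputs}"
done
#+END_SRC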

I’m a bit confused by this.

Given use-cases like:

What are those, if not a ‘workflow manager’? (Or, in the case of the first link, what is the difference between a ‘workflow manager’ and a ‘data management workflow’? :))

Sorry for using ambiguous semantics. By “workflow manager” I meant something like snakemake, make, nipype, etc., or even DVC (see, e.g., the handbook chapter) – a platform that lets you establish a DAG of dependencies between inputs and outputs, allowing efficient recomputation upon changes to any particular input. Although datalad run does establish a DAG of changes (and even input/output relationships, if provided by the user) within git history, it is not intended to duplicate or replace the functionality of the full-fledged “workflow managers” mentioned above. It is primarily there to provide a convenient mechanism to:

  1. ensure transitions between clean (committed) states;
  2. record the provenance (command, inputs/outputs) of the results of running a command in a git-native medium we love (a commit), thus allowing retrospection with the tools at hand (e.g., git log; see the example after this list);
  3. give a chance (but not necessarily a guarantee) for a particular part of the history (one commit or more) to be rerun (thus somewhat extending git’s “cherry-pick”ing from pure “done manually and reflected as a patch” to “done by a script or command”).
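For example (item 2 in practice — the run record is stored in an ordinary commit message, so plain git can show it; assuming latest.csv from above):

#+BEGIN_SRC bash
git log -n1 --format=%B -- latest.csv   # prints the recorded command and message
#+END_SRC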

A “data management workflow” is a higher-level concept describing approaches to managing data (sharing, tracking changes, etc.).
I hope this helps.