What is the preferred strategy for creating and updating an archive.7z in a RIA store?

Summary of what happened:

I am evaluating the RIA Store model for a dataset but am not clear on details about a workflow that uses the optional 7zipped archive.

Command used (and if a helper script was used, a link to the helper script or the command generated):

Given a dataset, I know that I can create a RIA sibling and add annexed files to it via something like the following

ria=[somepath]
alias=mydata
datalad create-sibling-ria -s ria-backup --alias ${alias} --new-store-ok "ria+file://${ria}"
datalad push --to ria-backup

After that, my impression is that the recommended way to create the archive is

datalad export-archive-ora -d . ${ria}/alias/${alias}/archives/archive.7z

This creates a RIA store that looks something like the following

[...]
β”œβ”€β”€ 825
β”‚  └── 647de-74c1-4a38-8163-e03cf23c1814
β”‚     β”œβ”€β”€ annex
β”‚     β”‚  └── objects
β”‚     β”‚     β”œβ”€β”€ 0p
β”‚     β”‚     β”‚  └── mp
β”‚     β”‚     β”‚     └── SHA256E-s197785421--bfe1f8cc2daab0b7758579a8a1a787e2283f7e47fe49c37ea5ae83766992e83c.nii.gz
β”‚     β”‚     β”‚        └── SHA256E-s197785421--bfe1f8cc2daab0b7758579a8a1a787e2283f7e47fe49c37ea5ae83766992e83c.nii.gz
[...]
β”‚     β”œβ”€β”€ archives
β”‚     β”‚  └── archive.7z
[...]
β”œβ”€β”€ alias
β”‚  └── mydata -> ../825/647de-74c1-4a38-8163-e03cf23c1814
β”œβ”€β”€ error_logs
└── ria-layout-version

But I’m confused about how I’d update the RIA store.

  • What happens after I annex more files in the original dataset, or modify previously annexed files? That is, does the archive.7z need to be recreated from scratch?
  • How should I drop the regular annex in the RIA store? Is there a tool for deduplicating the RIA store so that the only copy of the annexed files is the one stored in archive.7z?
    • From experimenting, it seems like I can delete 825/647de-74c1-4a38-8163-e03cf23c1814/annex/objects, but that seems risky because there is no guarantee that the files are actually stored inside archive.7z.

Version:

git annex version
git-annex version: 10.20230407
build flags: Assistant Webapp Pairing FsEvents TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24 bloomfilter-2.0.1.0 cryptonite-0.30 DAV-1.3.4 feed-1.3.2.1 ghc-9.4.4 http-client-0.7.13.1 persistent-sqlite-2.13.1.1 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: darwin aarch64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10

datalad.__version__
Out[2]: '0.18.3'

Environment (Docker, Singularity, custom installation):

❯ mamba env export
name: datalad-demo
channels:
  - conda-forge
dependencies:
  - bzip2=1.0.8=h3422bc3_4
  - ca-certificates=2022.12.7=h4653dfc_0
  - libcxx=16.0.2=h4653b0c_0
  - libexpat=2.5.0=hb7217d7_1
  - libffi=3.4.2=h3422bc3_5
  - libsqlite=3.40.0=h76d750c_1
  - libzlib=1.2.13=h03a7124_4
  - ncurses=6.3=h07bb92c_1
  - openssl=3.1.0=h53f4e23_2
  - p7zip=16.02=hbdafb3b_1001
  - pip=23.1.1=pyhd8ed1ab_0
  - python=3.11.3=h1456518_0_cpython
  - readline=8.2=h92ec313_1
  - setuptools=67.7.2=pyhd8ed1ab_0
  - tk=8.6.12=he1e0b03_0
  - tzdata=2023c=h71feb2d_0
  - wheel=0.40.0=pyhd8ed1ab_0
  - xz=5.2.6=h57fd34a_0

Data formatted according to a validatable standard? Please provide the output of the validator:

na

Relevant log outputs (up to 20 lines):

na

Screenshots / relevant information:

Thanks!

Hi @psadil , thank you for the question. I am looking into it and will get back to you shortly (meanwhile, I have created an issue in our knowledge base: What is the preferred strategy for creating and updating an archive.7z in a RIA store? Β· Issue #47 Β· psychoinformatics-de/knowledge-base Β· GitHub).


Hi @psadil. I have looked more deeply into the issue, dug through the source, and ran a few tests. Here is what I found:

In general, datalad export-archive-ora always invokes 7z with the update flag, i.e. u. That means that any content already in the target archive will be preserved, new content may be added, and no content should be deleted. This holds as long as nothing goes wrong while the archive is being written, e.g. no power loss, full disk, etc. To reduce the risk that something goes wrong during the 7z operation, I would probably create a copy of the archive.7z file contained in the RIA store, update the copy, and move it to its final location afterward. That might not be possible due to space limitations on your system, though. And it might be overly cautious.
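A minimal sketch of that copy-update-move pattern (file names are placeholders; keeping the copy next to the original ensures the final mv is a rename on the same filesystem):

archive="${ria}/alias/${alias}/archives/archive.7z"

# work on a copy, so an interrupted 7z run cannot corrupt the live archive
cp "${archive}" "${archive}.tmp"

# "u" adds missing entries and refreshes changed ones;
# entries already in the archive are never deleted
7z u "${archive}.tmp" <content-to-add>

# swap in the updated archive only after 7z finished successfully
mv "${archive}.tmp" "${archive}"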

Having said that, let me try to answer your questions, although further below I will motivate and suggest using 7z directly instead of datalad export-archive-ora.

Question 1:

Q:

What happens after I annex more files in the original dataset, or modify previously annexed files? That is, does the archive.7z need to be recreated from scratch?

A:

There are a few aspects to annexing more files.

  1. Annexing more files or modifying previously annexed files will usually create additional entries in the local annex key store (in .git/annex/objects).
  2. If you push the dataset to the RIA store, those new entries will be stored in the respective directory in the store, i.e. in <XXX>/<X...X>/annex/objects.
  3. You don’t need to recreate the archive from scratch. If you want to update the archive with the command line shown above, you can do that (see the example after this list). As mentioned earlier, export-archive-ora will update the target 7z file, so the content that existed will be retained and new content will be added. Nothing will be removed.
    Please note that this should only be done if the local dataset and the RIA store are in sync, i.e. right after you pushed your dataset and before anybody else pushes to it. The reason is that export-archive-ora uses the local dataset to determine what should go into the archive. If the state of the local dataset and the state of the RIA store do not match, weird things might happen.
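Concretely, reusing the sibling name and variables from the original post, one update cycle would look like this (a sketch, not a tested recipe):

# bring the RIA store in sync with the local dataset first
datalad push --to ria-backup

# then refresh the archive from the (now matching) local dataset
datalad export-archive-ora -d . "${ria}/alias/${alias}/archives/archive.7z"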

Question 2:

Q:

  • How should I drop the regular annex in the RIA store? Is there a tool for deduplicating the RIA store so that the only copy of the annexed files is the one stored in archive.7z?
    • From experimenting, it seems like I can delete 825/647de-74c1-4a38-8163-e03cf23c1814/annex/objects, but that seems risky because there is no guarantee that the files are actually stored inside archive.7z.

A:

  • The regular annex in the RIA store can simply be deleted if you are sure that all of its content is in the archive.7z file.

    • If you delete annex objects that are not in the archive file, those objects can no longer be retrieved from the RIA store, for example via datalad get. So do not delete objects until you are certain that they are contained in archive.7z.
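One way to gain that certainty is to compare the loose object keys in the store with the entries in the archive, for example (a rough sketch; dsdir is the dataset directory from the tree listing above, adapt as needed):

dsdir="${ria}/825/647de-74c1-4a38-8163-e03cf23c1814"

# keys that exist as loose objects in the store
find "${dsdir}/annex/objects" -type f -exec basename {} \; | sort -u > objects.txt

# entry names contained in the archive
7z l -slt "${dsdir}/archives/archive.7z" | sed -n 's/^Path = //p' \
  | xargs -n1 basename | sort -u > archived.txt

# anything printed here is a loose object that is NOT yet in archive.7z
comm -23 objects.txt archived.txt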

Recommendation

Due to the possible inconsistency between the local annex object store of the dataset and the annex object store in the respective RIA store, I would not advise using the command line that you posted in the original post. It should work if you know exactly who modifies the RIA store at what time.

I talked to the datalad team and the recommendation was to use 7z directly in update mode on the RIA store.

So here is what I would probably do (a shell sketch of the core steps follows the list):

  1. Lock the RIA store for “maintenance”.
  2. If you can afford it space-wise, create a backup copy of archive.7z. That can be in a temporary space.
  3. Update archive.7z with the content of <dataset-uuid[3:]>/annex/objects, using the update flag, i.e. u, when modifying the archive.
  4. Check that the current content of <dataset-uuid[3:]>/annex/objects is contained in the archive. If not, use the backup to recover, adjust, and try again.
  5. Remove all content in <dataset-uuid[3:]>/annex/objects.
  6. Unlock the RIA store.
  7. Delete the backup copy of the archive.
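In shell, steps 2 to 5 and 7 could look roughly like this. This is a sketch: it assumes ${ria} is an absolute path and that the archive entries are stored relative to annex/objects (which is why 7z is run from inside that directory); the locking in steps 1 and 6 is an organizational measure rather than a command.

dsdir="${ria}/825/647de-74c1-4a38-8163-e03cf23c1814"
backup=$(mktemp -d)

# step 2: backup copy of the archive
cp "${dsdir}/archives/archive.7z" "${backup}/"

# step 3: update the archive in "u" mode, preserving existing entries
( cd "${dsdir}/annex/objects" && 7z u "${dsdir}/archives/archive.7z" . )

# step 4: verify the archive content, e.g. with the comm-based check shown earlier

# step 5: remove the loose objects only after the check passed
rm -rf "${dsdir}/annex/objects"/*

# step 7: discard the backup
rm -rf "${backup}"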

Just my 2 cents, hope that helps

That is great! Thank you for the detailed explanation!

Thinking out loud, I wonder whether this workflow could be integrated into DataLad. I can imagine something that would allow a push that targets a RIA store to update the archive.7z without also filling up <dataset-uuid[3:]>/annex/objects – or, if not push, then some sort of new export-[...] subcommand. Update: Strategies for a simpler RIA-Store & 7 Zip archive workflow Β· Issue #7376 Β· datalad/datalad Β· GitHub

Thanks for the suggestion. I think it is an interesting idea, and we have actually thought about β€œstreamlining” the archival process (and some other aspects of RIA stores).

I saw you created a related issue on GitHub. Thanks a lot.
