Processing time of “datalad” is slower than that of “git”?

Hi, DataLad team.

I’m a developer from Japan who forked “G-Node/gogs” and DataLad.
I’d like to ask about the performance of DataLad.

I compared the processing time of two command sequences using a 1 MB file.
I think the two sequences are functionally equivalent:
① using “datalad” CLI
1. datalad save
2. datalad update
3. datalad push

② using “git” CLI
1. git add
2. git commit
3. git pull
4. git push

I expected the two processing times to be almost the same because the sequences are functionally equivalent.
But my expectation was largely wrong.
I measured the average processing time over all trials:
Using the “datalad” CLI, the average processing time is 9.513 s.
Using the “git” CLI, the average processing time is 2.636 s.
This result indicates that “datalad” processing is slower than “git” processing.
But I don’t understand why the processing time differs between “datalad” and “git”.
Do you know why datalad processing is slow?

I prepared the test environment below.

① GIN server.
Machine specs are below. (If you need memory and storage details, I can provide them later.)

◇GIN version and commit
The live branch, at tag gin-live-2020-10-24.

◇Operating system
Linux ubuntu 5.15.0-41-generic #44~20.04.1-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

◇Database
postgres

② JupyterHub container.
Machine specs are below. (If you need memory and storage details, I can provide them later.)

◇Operating system
Linux version 5.4.0-110-generic (buildd@ubuntu) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)) #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022

Common test conditions
File size: 1 MB
Number of trials: 50
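
For reference, the 50 test files of 1 MB each can be generated with something like this (a sketch; the file names are arbitrary):

import os

# directory that the test notebooks below scan for *.txt files
experiment_path = '/home/jovyan/experiments/benchmark/datalad_cmd-git'
os.makedirs(experiment_path, exist_ok=True)

# create 50 files of exactly 1 MB each
for i in range(50):
    path = os.path.join(experiment_path, f'test_{i:02d}.txt')
    with open(path, 'w') as f:
        f.write('0' * 1024 * 1024)  # 1 MB of text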

test case 1 (① using “datalad” CLI)

I use only the datalad CLI.
I measured the processing time of the following sequence of “datalad” commands:
1. datalad save
2. datalad update
3. datalad push

My test code (test_code1.ipynb) is below.

test_code1.ipynb

import os
import glob
import time
import csv

%cd ~/

# path to the test data
experiment_path = '/home/jovyan/experiments/benchmark/datalad_cmd-git'

# collect the test files (*.txt)
filelist = glob.glob(experiment_path + '/*.txt')

# run the test
results = {}
for filepath in filelist:
    save_path = filepath
    time_sta = time.time()
    !datalad save --to-git $save_path
    !datalad update $save_path -s gin --how 'merge'
    !datalad push --to gin $save_path
    time_end = time.time()

    # record the elapsed time for this file
    elapsed_time = time_end - time_sta
    fileNM = os.path.basename(save_path)
    results[fileNM] = elapsed_time

# write the results to CSV ('x' mode fails if the file already exists)
outputpath = experiment_path + "/datalad_cmd-git.csv"
with open(outputpath, 'x') as f:
    writer = csv.writer(f)
    for k, v in results.items():
        writer.writerow([k, v])

test case 2 (② using “git” CLI)

I use only the git CLI.
I measured the processing time of the following sequence of “git” commands:
1. git add
2. git commit
3. git pull
4. git push

My test code (test_code2.ipynb) is below.

import os
import glob
import time
import csv

%cd ~/

# path to the test data
experiment_path = '/home/jovyan/experiments/benchmark/git-git'

# collect the test files (*.txt)
filelist = glob.glob(experiment_path + '/*.txt')

# run the test
results = {}
for filepath in filelist:
    save_path = filepath
    time_sta = time.time()
    !git add $save_path
    # -m avoids opening an editor, which would block in a notebook
    !git commit -m "add $save_path"
    !git pull gin
    !git push gin
    time_end = time.time()

    # record the elapsed time for this file
    elapsed_time = time_end - time_sta
    fileNM = os.path.basename(save_path)
    results[fileNM] = elapsed_time

# write the results to CSV ('x' mode fails if the file already exists)
outputpath = experiment_path + "/git-git.csv"
with open(outputpath, 'x') as f:
    writer = csv.writer(f)
    for k, v in results.items():
        writer.writerow([k, v])
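
The average times reported above can then be computed from the two result CSVs with a short helper like this (a sketch; the paths match the test code above):

import csv

# compute the mean elapsed time from one of the result CSVs written above
def average_time(csv_path):
    with open(csv_path) as f:
        times = [float(row[1]) for row in csv.reader(f)]
    return sum(times) / len(times)

print(average_time('/home/jovyan/experiments/benchmark/datalad_cmd-git/datalad_cmd-git.csv'))
print(average_time('/home/jovyan/experiments/benchmark/git-git/git-git.csv'))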

Do you know why datalad processing is slow?

As with any tool that uses other tools underneath, the “processing time” of datalad is indeed likely to exceed that of git. We are trying to keep the overhead small but, oh well, there is overhead. To answer “why” specifically for your case, you would need to look at the specific invocations of git (and possibly git-annex) that datalad does underneath. You can see them if running with -l debug. You can also set the DATALAD_LOG_TIMESTAMP=1 environment variable and get the log prefixed with timestamps.
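
For example, in a notebook like the ones above this could look like (a sketch; $save_path is a file path variable as in the test code):

import os

# prefix every log line with a timestamp for subsequent datalad calls
os.environ['DATALAD_LOG_TIMESTAMP'] = '1'

# rerun one of the benchmarked commands with debug logging enabled
!datalad -l debug save --to-git $save_path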

You can see that if running with -l debug.

I collected a debug log with timestamps for the “datalad” commands below.

  1. datalad save
  2. datalad update
  3. datalad push

Each datalad command executes many operations under the hood!
In “datalad save”, the executed processing is NOT ONLY “git add” and “git commit” but also “git config”, “git rev-parse”, “git ls-files”, “git diff”, and “git symbolic-ref”.
In “datalad update”, the executed processing is NOT ONLY “git pull” (git fetch and git merge) but also “git config”, “git symbolic-ref”, “git rev-parse”, “git annex version”, and “git annex merge”.
In “datalad push”, the executed processing is NOT ONLY “git push” but also “git config”, “git ls-tree”, “git push --dry-run”, “git symbolic-ref”, “git annex version”, “git annex findref”, “git annex wanted”, “git annex copy”, “git fetch”, “git for-each-ref”, “git annex sync”, and “git branch”.
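
For reference, here is a sketch of how the spawned subcommands can be tallied from a captured debug log (the log file name is hypothetical, and the pattern assumes the log quotes the spawned command list like ['git', 'rev-parse', ...], which may differ across datalad versions):

import re
from collections import Counter

# tally the git / git-annex subcommands recorded in a debug log capture
counts = Counter()
with open('datalad_save_debug.log') as f:  # hypothetical capture file
    for line in f:
        # adjust this pattern if your datalad version formats logs differently
        m = re.search(r"\['git', '(annex', '[a-z-]+|[a-z-]+)'", line)
        if m:
            counts[m.group(1).replace("', '", ' ')] += 1
print(counts.most_common())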

But I think these other “git” and “git-annex” commands are needed to reconcile data and metadata between the local and remote repositories when handling git and git-annex content through datalad.
Is my understanding right?

In a repository managed with the datalad CLI, should I use only the datalad CLI, rather than mixing git, git-annex, and datalad, to keep the data and metadata in the repository reliable?

Primarily, those other commands are executed to get an extended assessment of the state of the repository. E.g., git commands would not care to commit untracked files; datalad does. We also deal with a number of oddities of git/git-annex behavior, so we often need to know their versions to figure out how to react (hence that annex version call), etc. So I think your thought is right indeed :wink:

As for the 2nd question: you can use git/git-annex directly if you like/need to do so. There should be no “datalad idiosyncrasy” really in the results. It is just that if you were working with hierarchies of datasets, such direct use of git/git-annex could become tedious a little too fast. I guess you could make use of git submodule foreach or datalad foreach-dataset (which has parallelization etc.) in those cases. Just prefer datalad create over a manual mkdir/git init/git annex init, since we would also take care of establishing the dataset uuid etc.
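
For instance (a sketch; the dataset name is made up):

# create a dataset instead of manual mkdir / git init / git-annex init,
# so the dataset uuid etc. are set up as well
!datalad create my_dataset

# run a plain git command across a whole hierarchy of datasets
!datalad foreach-dataset -r git status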