Hi, DataLad team.
I’m a developer from Japan who has forked “G-Node/gogs” and DataLad.
I would like to ask about the performance of DataLad.
I compared the processing time of two sequences of commands, using a 1 MB file.
I believe the two sequences do the same work.
① Using the “datalad” CLI
1. datalad save
2. datalad update
3. datalad push
② Using the “git” CLI
1. git add
2. git commit
3. git pull
4. git push
I expected the two processing times to be almost the same, because the two sequences do the same work.
But my expectation was far off.
I measured the average processing time of each sequence.
Using the “datalad” CLI, the average processing time is 9.513 s.
Using the “git” CLI, the average processing time is 2.636 s.
This result indicates that the “datalad” sequence is roughly 3.6 times slower than the “git” sequence.
However, I don’t understand why the processing time differs so much between “datalad” and “git”.
Do you know why the datalad processing is slow?
I prepared the test environment described below.
① GIN Server.
The machine spec is below. (If you need memory and storage details, I can provide them later.)
◇GIN version and commit
Live branch, tagged gin-live-2020-10-24
◇Operating system
Linux ubuntu 5.15.0-41-generic #44~20.04.1-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
◇Database
postgres
② JupyterHub Container
The machine spec is below. (If you need memory and storage details, I can provide them later.)
◇Operating system
Linux version 5.4.0-110-generic (buildd@ubuntu) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)) #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022
Common test conditions
File size: 1 MB
Number of trials: 50
Test case 1 (① using the “datalad” CLI)
I use only the datalad CLI.
I measured the total processing time of the following sequence of “datalad” commands:
1. datalad save
2. datalad update
3. datalad push
My test code (test_code1.ipynb) is here.
test_code1.ipynb
import os
import papermill as pm
from colorama import Fore
from IPython.display import clear_output
import glob
import time
import csv

%cd ~/

# path to the test data
experiment_path = '/home/jovyan/experiments/benchmark/datalad_cmd-git'

# collect the test files
filelist = []
search_str = experiment_path + '/*.txt'
files = glob.glob(search_str)
for file in files:
    filelist.append(file)

# run the test: time save + update + push for each file
results = {}
for filepath in filelist:
    save_path = filepath
    time_sta = time.time()
    !datalad save --to-git $save_path
    !datalad update $save_path -s gin --how 'merge'
    !datalad push --to gin $save_path
    time_end = time.time()
    # elapsed time for the whole sequence
    elapsed_time = time_end - time_sta
    fileNM = os.path.basename(save_path)
    results[fileNM] = elapsed_time

# write the results to CSV (mode 'x' fails if the file already exists)
outputpath = experiment_path + "/datalad_cmd-git.csv"
with open(outputpath, 'x', newline='') as f:
    writer = csv.writer(f)
    for k, v in results.items():
        writer.writerow([k, v])
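To see which of the three datalad commands accounts for most of the elapsed time, each step could be timed separately instead of timing the whole sequence at once. Below is a minimal sketch of that idea, not the code I actually ran. It assumes plain Python with `subprocess` instead of the notebook `!` syntax, and the helper names `time_steps` and `datalad_steps` are hypothetical; the commands themselves are the same three as in the test code above.

```python
import subprocess
import time


def time_steps(steps):
    """Run each (label, argv) pair in order and return {label: elapsed seconds}."""
    timings = {}
    for label, argv in steps:
        start = time.perf_counter()
        # check=False: keep timing the remaining steps even if one step fails
        subprocess.run(argv, check=False)
        timings[label] = time.perf_counter() - start
    return timings


def datalad_steps(path, sibling="gin"):
    """Build the save/update/push commands used in test case 1 (hypothetical helper)."""
    return [
        ("save", ["datalad", "save", "--to-git", path]),
        ("update", ["datalad", "update", path, "-s", sibling, "--how", "merge"]),
        ("push", ["datalad", "push", "--to", sibling, path]),
    ]
```

For example, `time_steps(datalad_steps(save_path))` would return one timing per command for a single file, so the per-step values can be written to the CSV next to the total.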
Test case 2 (② using the “git” CLI)
I use only the git CLI.
I measured the total processing time of the following sequence of “git” commands:
1. git add
2. git commit
3. git pull
4. git push
My test code (test_code2.ipynb) is here.
import os
import papermill as pm
from colorama import Fore
from IPython.display import clear_output
import glob
import time
import csv

%cd ~/

# path to the test data
experiment_path = '/home/jovyan/experiments/benchmark/git-git'

# collect the test files
filelist = []
search_str = experiment_path + '/*.txt'
files = glob.glob(search_str)
for file in files:
    filelist.append(file)

# run the test: time add + commit + pull + push for each file
results = {}
for filepath in filelist:
    save_path = filepath
    time_sta = time.time()
    !git add $save_path
    # commit is called without a -m message here
    !git commit
    !git pull gin
    !git push gin
    time_end = time.time()
    # elapsed time for the whole sequence
    elapsed_time = time_end - time_sta
    fileNM = os.path.basename(save_path)
    results[fileNM] = elapsed_time

# write the results to CSV (mode 'x' fails if the file already exists)
outputpath = experiment_path + "/git-git.csv"
with open(outputpath, 'x', newline='') as f:
    writer = csv.writer(f)
    for k, v in results.items():
        writer.writerow([k, v])
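The averages quoted at the top come from the per-file times in these CSV files. A minimal sketch of how such a summary could be computed is below. It assumes each row has the form `filename,elapsed_seconds` as written by the test code, and the function name `summarize` is hypothetical.

```python
import csv
import statistics


def summarize(csv_path):
    """Return (mean, sample standard deviation, count) of the elapsed times in a result CSV."""
    with open(csv_path, newline="") as f:
        # column 0 is the file name, column 1 is the elapsed time in seconds
        times = [float(row[1]) for row in csv.reader(f) if row]
    mean = statistics.mean(times)
    # the sample standard deviation needs at least two values
    spread = statistics.stdev(times) if len(times) > 1 else 0.0
    return mean, spread, len(times)
```

Calling `summarize` on `datalad_cmd-git.csv` and on `git-git.csv` would give the two averages together with their spread, which helps judge whether the gap is consistent across the 50 trials.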