Distributed datasets with DataLad
https://wgpages.netlify.app/datalad
What is DataLad?
DataLad is a version control system for your data. It is built on top of Git and git-annex, and is available both as a command-line tool and as a Python API.
Git
Git is a version control system designed to keep track of software projects and their history, to merge edits from multiple authors, and to work with branches (distinct project copies) and merge them into the main project. Since Git was designed for version control of text files, it can also be applied to writing projects, such as manuscripts, theses, website repositories, etc.
I assume that most attendees are familiar with Git, but we can certainly do a quick command-line Git demo.
Git can also keep track of binary (non-text) files and/or large data files, but putting such files under version control, and especially modifying them, will inflate the size of the repository.
git-annex
Git-annex was built on top of Git and was designed to share and synchronize large files in a distributed fashion. The file content is managed separately from the dataset’s structure / metadata: the latter is kept under Git version control, while file contents are stored in separate directories. If you look inside a git-annex repository, you will see that files are replaced with symbolic links, and in fact you don’t have to have the actual data stored locally, e.g. if you want to reduce disk space usage.
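For example, inspecting an annexed file might look like this (a quick illustration; the file name is hypothetical):
ls -l rawdata.bin             # an annexed file appears as a symbolic link into .git/annex/objects
git annex whereis rawdata.bin # list all known locations of its content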
DataLad
DataLad builds on top of Git and git-annex, retaining all their features, but adds a few other functions:
- Datasets can be nested, and most DataLad commands have a --recursive option that will traverse subdatasets and do “the right thing” (see the sketch below).
- DataLad can run commands on data, and if a dataset is not present locally, DataLad will automatically get the required input files from a remote repository.
- DataLad can keep track of data provenance, e.g. datalad download-url will download files, add them to the repository, and keep a record of data origin.
- A few other features.
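For example, nesting might look like this (a minimal sketch; the dataset names are made up):
datalad create parent                      # create a top-level dataset
datalad create -d parent parent/sub        # create `sub` and register it as a subdataset of `parent`
datalad save -d parent -r -m "changes"     # --recursive: save the parent and all its subdatasets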
As you will see in this workshop, most DataLad workflows involve running all three commands (git, git-annex, and datalad), so we’ll be using the functionality of all three layers.
Installation
On a Mac with Homebrew installed:
brew upgrade
brew install git-annex
brew install datalad
With pip (Python’s package manager) use one of these two:
pip install datalad # if you don't run into permission problems
pip install --user datalad # to force installation into user space
With conda:
conda install -c conda-forge datalad
conda update -c conda-forge datalad
DataLad also needs Git and git-annex; install them separately if they are not already present. For more information, visit the official installation guide.
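Whichever method you use, a quick check that everything is in place (optional):
datalad --version
git annex version | head -1
git --version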
On a cluster you can install DataLad into your $HOME directory:
module load git-annex # need this each time you use DataLad
module load python
virtualenv --no-download ~/datalad-env
source ~/datalad-env/bin/activate
pip install --no-index --upgrade pip
pip install datalad
deactivate
alias datalad=$HOME/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
Alternatively, you can install DataLad into your group’s /project directory:
module load git-annex # need this each time you use DataLad
module load python
cd ~/projects/def-sponsor00/shared
virtualenv --no-download datalad-env
source datalad-env/bin/activate
pip install --no-index --upgrade pip
pip install datalad
deactivate
chmod -R og+rX datalad-env
Then everyone in the group can activate DataLad with:
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
Initial configuration
All these settings go into ~/.gitconfig:
git config --global --add user.name "First Last" # your name
git config --global --add user.email name@domain.ca # your email address
git config --global init.defaultBranch main
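To confirm the settings took effect, you can inspect that file (optional):
cat ~/.gitconfig   # or: git config --global --list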
Basics
Create a new dataset
Note: Some files in your dataset will be stored as plain files, while others will be put in the annex, i.e. they will be replaced with symbolic links and might not even be stored locally. Annexed files cannot be modified directly (more on this later). The command
datalad run-procedure --discover
shows you a list of available configurations. On my computer they are:
- text2git: do not put anything that is a text file into the annex, i.e. process text files with regular Git
- yoda: configure a dataset according to the yoda principles
- noannex: put everything under regular Git control
cd ~/tmp
datalad create --description "our first dataset" -c text2git test # use `text2git` configuration
cd test
ls
git log
Add some data
Let’s use some file examples from the official DataLad handbook:
mkdir books
wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf -O books/theLinuxCommandLine.pdf
wget -q https://homepages.uc.edu/~becktl/byte_of_python.pdf -O books/aByteOfPython.pdf
ls books
datalad status
datalad save -m "added a couple of books on Linux and Python"
ls books
git log -n 1 # check last commit
git log -n 1 -p # check last commit in details
git config --global alias.one "log --graph --date-order --date=short --pretty=format:'%C(cyan)%h %C(yellow)%ar %C(auto)%s%+b %C(green)%ae'"
git one # custom alias
git log --oneline # a short alternative
Let’s add another couple of books using a built-in downloading command:
datalad download-url https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf \
--dataset . -m "added a reference book about git" -O books/proGit.pdf
datalad download-url http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \
--dataset . -m "added bash guide for beginners" -O books/bashGuideForBeginners.pdf
ls books
tree
datalad status # nothing to be saved
git log # `datalad download-url` took care of that
git annex whereis books/proGit.pdf # show the available copies (including the URL source)
git annex whereis books # show the same for all books
Create and commit a short text file:
cat << EOT > notes.txt
We have downloaded 4 books.
EOT
datalad save -m "added notes.txt"
git log -n 1 # see the last commit
git log -n 1 -p # and its file changes
Notice that the text file was not annexed: there is no symbolic link. This means that we can modify it easily:
echo "Text files are not in the annex." >> notes.txt
datalad save -m "edited notes.txt"
Subdatasets
Let’s clone a remote dataset and store it locally as a subdataset:
datalad clone --dataset . https://github.com/datalad-datasets/machinelearning-books # get its structure
tree
du -s machinelearning-books # not much data there (large files were not downloaded)
cd machinelearning-books
datalad status --annex # if all files were present: 9 annex'd files (74.4 MB recorded total size)
datalad status --annex all # check how much data we have locally: 0.0 B/74.4 MB present/total size
datalad status --annex all A.Shashua-Introduction_to_Machine_Learning.pdf # 683.7 KB
Ok, this file is not too large, so we can download it easily:
datalad get A.Shashua-Introduction_to_Machine_Learning.pdf
datalad status --annex all # now we have 683.7 KB/74.4 MB present/total size
open A.Shashua-Introduction_to_Machine_Learning.pdf # it should open
datalad drop A.Shashua-Introduction_to_Machine_Learning.pdf # delete the local copy
git log # this particular dataset's history (none of our commands show here: we did not modify it)
cd ..
Running scripts
cd machinelearning-books # assuming we are back in the root of our test dataset
git annex find --not --in=here # show remote files
mkdir code
cat << EOT > code/titles.sh
for file in \$(git annex find --not --in=here); do
echo \$file | sed 's/^.*-//' | sed 's/.pdf//' | sed 's/_/ /g'
done
EOT
cat code/titles.sh
datalad save -m "added a short script to write a list of book titles"
datalad run -m "create a list of books" "bash code/titles.sh > list.txt"
cat list.txt
git log # the command run record went into the log
Now we will modify and rerun this script:
datalad unlock code/titles.sh # move the script out of the annex to allow edits
cat << EOT > code/titles.sh
for file in \$(git annex find --not --in=here); do
title=\$(echo \$file | sed 's/^.*-//' | sed 's/.pdf//' | sed 's/_/ /g')
echo \"\$title\"
done
EOT
datalad save -m "correction: enclose titles into quotes" code/titles.sh
git log -n 5 # note the hash of the last commit
datalad rerun ba90706
more list.txt
datalad diff --from ba90706 --to f88e2ce # show the filenames only
Finally, let’s extract the title page from one of the books, A.Shashua-Introduction_to_Machine_Learning.pdf. First, let’s open the book itself:
open A.Shashua-Introduction_to_Machine_Learning.pdf # this book is not here!
The book is not here … That’s not a problem for DataLad, as it can process a file that is stored remotely (as long as it is part of the dataset): it will automatically get the required input file.
datalad run -m "extract the title page" \
--input "A.Shashua-Introduction_to_Machine_Learning.pdf" \
--output "title.pdf" \
"convert -density 300 {inputs}[0] -quality 90 {outputs}"
git log
git annex find --in=here # show local files: it downloaded the book, extracted the first page
open title.pdf
Five workflows
- two users on a shared cluster filesystem working with the same dataset,
- one user, one dataset spread over multiple drives, with data redundancy,
- publish a dataset on GitHub with annexed files in a special private remote,
- publish a dataset on GitHub with publicly-accessible annexed files on Nextcloud, and
- (if we have time) managing multiple Git repos under one dataset
Workflow 1: two users on a shared cluster filesystem working with the same dataset
For simplicity, let’s assume both users share the same GID (group ID), i.e. they are from the same research group. Extending this workflow to multiple users with different GIDs can be done via access control lists (ACLs), as sketched below.
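For example, granting access to a user from another group could look roughly like this (a hypothetical sketch: user002 is a placeholder, and ACL support depends on the filesystem):
setfacl -R -m u:user002:rwX /project/def-sponsor00/collab    # grant access to existing files
setfacl -R -d -m u:user002:rwX /project/def-sponsor00/collab # and make it the default for new files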
Start with Git
First, let’s consider a shared Git-only (no git-annex, no DataLad) repository in /project, and how the two users can both push to it.
user001
git config --global user.name "First User"
git config --global user.email "user001@westdri.ca"
git config --global init.defaultBranch main
cd /project/def-sponsor00
git init --bare --shared collab
ls -l | grep collab # note the group permissions and the SGID (recursive)
cd
git clone /project/def-sponsor00/collab
cd collab
dd if=/dev/urandom of=test1 bs=1024 count=$(( RANDOM + 1024 ))
dd if=/dev/urandom of=test2 bs=1024 count=$(( RANDOM + 1024 ))
dd if=/dev/urandom of=test3 bs=1024 count=$(( RANDOM + 1024 ))
git add test*
git commit -m "added test{1..3}"
git push
user002
git config --global user.name "Second User"
git config --global user.email "user002@westdri.ca"
git clone /project/def-sponsor00/collab
cd collab
echo "making some changes" > readme.txt
git add readme.txt
git commit -m "added readme.txt"
git push
Add DataLad datasets
user001
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
chmod -R u+wX ~/collab
/bin/rm -rf /project/def-sponsor00/collab ~/collab
cd /project/def-sponsor00
git init --bare --shared collab
ls -l | grep collab # note the group permissions and the SGID (recursive)
cd
datalad create --description "my collab" -c text2git collab # create a dataset using `text2git` template
cd collab
dd if=/dev/urandom of=test1 bs=1024 count=$(( RANDOM + 1024 ))
dd if=/dev/urandom of=test2 bs=1024 count=$(( RANDOM + 1024 ))
dd if=/dev/urandom of=test3 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test1,2,3"
git remote add origin /project/def-sponsor00/collab
# git push --set-upstream origin main # if we were using Git only
datalad push --to origin --data anything # transfer all annexed content
git annex whereis test1 # 1 copy
datalad push --to origin --data anything # I find that I need to run it twice to actually transfer annexed data
git annex whereis test1 # 2 copies (here and origin)
du -s /project/def-sponsor00/collab
After making sure there is a remote copy, you can drop a local copy:
datalad drop test1
git annex whereis test1 # 1 copy
To get this file at any time in the future, you would run:
datalad get test1
Let’s actually drop all files for which there is a remote copy, to save on local disk space:
for file in $(git annex find --in=origin); do
datalad drop $file
done
git annex whereis test* # only remote copies left
datalad status --annex all # check local data usage
du -s .
To allow other users to write to the DataLad repo, setting git init --shared ... on /project/def-sponsor00/collab is not sufficient, as it does not set proper permissions for /project/def-sponsor00/collab/annex. We have to do it manually:
cd /project/def-sponsor00/collab
chmod -R g+ws annex # so that user002 could push with datalad
user002
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
git config --global --add safe.directory /project/60105/collab # allow git to work with files from other users
git config --global --add safe.directory /project/def-sponsor00/collab # do the same for the linked version
/bin/rm -rf collab
datalad clone --description "user002's copy" /project/def-sponsor00/collab collab
cd collab
du -s .
git annex find --in=here # show local files (none at the moment)
datalad get test1 # download the file
dd if=/dev/urandom of=test4 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test4"
git log
git remote -v # our remote is origin
datalad push --to origin --data anything
git annex find --in=origin # test4 now in both places
echo "I started working with this dataset as well" > notes.txt
git add notes.txt
git commit -m "added notes.txt"
datalad push
Now user001 can see the two new files:
user001
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
cd ~/collab
datalad update --how merge # download most recent data from origin
cat notes.txt # it is here
git annex find --in=here # none of the annexed content (e.g. test4) is here
datalad get test4 # but we can get it easily
git annex find --in=here
datalad status --annex all # check how much data we have locally: present/total space
git annex find --lackingcopies 0 # show files that are stored only in one place
git annex whereis test* # show location for all files
You also see information about what is stored in “user002’s copy”. However, you should take it with a grain of salt. For example, if user002 drops some files locally and does not run datalad push, origin (and hence user001) will have no knowledge of that fact.
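One way to refresh your knowledge of the remote content is to fetch the latest git-annex metadata without merging (run inside ~/collab):
datalad update          # without --how, this only fetches, updating the location tracking
git annex whereis test4 # still only as accurate as the last push from each clone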
Workflow 2: one user, one dataset spread over multiple drives, with data redundancy
Initially I created this scenario with two external USB drives. In the interest of time, I simplified it to a single external drive, but it can easily be extended to any number of drives.
First, let’s create an always-present dataset on the computer that will also keep track of all data stored in its clone on a removable USB drive:
cd ~/tmp
datalad create --description "Central location" -c text2git distributed
cd distributed
git config receive.denyCurrentBranch updateInstead # allow clones to update this dataset
mkdir books
wget -q https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf -O books/theLinuxCommandLine.pdf
wget -q https://homepages.uc.edu/~becktl/byte_of_python.pdf -O books/aByteOfPython.pdf
datalad save -m "added a couple of books"
ls books
du -s . # 4.9M stored here
Create a clone on a portable USB drive:
cd /Volumes/t7
datalad clone --description "t7" ~/tmp/distributed distributed
cd distributed
du -s . # no actual data was copied, just the links
git remote rename origin central
cd books
wget -q https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf -O proGit.pdf
wget -q http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf -O bashGuideForBeginners.pdf
datalad save -m "added two more books"
git log # we have history from both drives (all 4 books)
git annex find --in=here # but only 2 books are stored here
git annex find --not --in=here # and 2 books are not stored here
for book in $(git annex find --not --in=here); do
git annex whereis $book # show their location: they are in central
done
datalad push --to central --data nothing # push metadata to central
Operations from the central dataset:
cd ~/tmp/distributed
git annex find --in=here # show local files: 2 books
git annex find --not --in=here # show remote files: 2 books
datalad status --annex all # check local data usage: 4.6 MB/17.6 MB present/total size
git annex find --lackingcopies 0 # show files that are stored only in one place
git annex whereis books/* # show location
Let’s mount t7 and get one of its books:
datalad get books/bashGuideForBeginners.pdf # try getting this book from a remote => error
... get(error): books/bashGuideForBeginners.pdf (file) [not available]
git remote # nothing: central does not know where the remotes are stored
datalad siblings add -d . --name t7 --url /Volumes/t7/distributed
git remote # now it knows where to find the remotes
datalad get books/bashGuideForBeginners.pdf # successful!
Now unmount t7.
git annex whereis books/bashGuideForBeginners.pdf # 2 copies (here and t7)
open books/bashGuideForBeginners.pdf
Let’s remove the local copy of this book:
datalad drop books/bashGuideForBeginners.pdf # error: it tried to contact t7 to verify the remaining physical copies
datalad drop --reckless availability books/bashGuideForBeginners.pdf # do not check remotes (potentially dangerous)
git annex whereis books/bashGuideForBeginners.pdf # only 1 copy left on t7
Letting remotes know about central changes
Let’s add the DataLad Handbook to central:
cd ~/tmp/distributed
datalad download-url http://handbook.datalad.org/_/downloads/en/stable/pdf/ \
--dataset . -m "added the DataLad Handbook" -O books/datalad.pdf
The remote knows nothing about this new book. Let’s push this update out! Make sure to mount t7 and then run the following:
cd /Volumes/t7/distributed
git config receive.denyCurrentBranch updateInstead # allow clones to update this dataset
cd ~/tmp/distributed
datalad push --to t7 --data nothing # push metadata, but not the data
Alternatively, we could update from the USB drive:
cd /Volumes/t7/distributed
datalad update -s central --how=merge
Now let’s check things from t7’s perspective:
cd /Volumes/t7/distributed
ls books/ # datalad.pdf is there
git annex whereis books/datalad.pdf # it is in central only (plus on the web)
Data redundancy
Now imagine that we want to back up all files that are currently stored in a single location, so that each file always has a second copy on the other drive.
cd /Volumes/t7/distributed
for file in $(git annex find --lackingcopies 0); do
datalad get $file
done
datalad push --to central --data nothing # update the central
git annex find --lackingcopies 0 # still two files have only 1 copy
git annex find --in=here # but they are both here already ==> makes sense
Let’s go to central and do the same:
cd ~/tmp/distributed
for file in $(git annex find --lackingcopies 0); do
datalad get $file
done
git annex find --lackingcopies 0 # none: now all files have at least two copies
git annex whereis # show where everything is
The file books/datalad.pdf is in two locations, although one of them is the web. You can correct that manually: go to t7 and run datalad get there.
Try dropping a local file:
datalad drop books/theLinuxCommandLine.pdf # successful, since t7 is also mounted
datalad get books/theLinuxCommandLine.pdf # get it back
Set the minimum number of copies and try dropping again:
git annex numcopies 2
datalad drop books/theLinuxCommandLine.pdf # can't: need minimum 2 copies!
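If you later do need to drop it, you could relax the requirement again (a sketch; use with care, since it reduces redundancy):
git annex numcopies 1                      # lower the minimum back to one copy
datalad drop books/theLinuxCommandLine.pdf # now succeeds, as a copy remains on t7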
Workflow 3: publish a dataset on GitHub with annexed files in a special private remote
At some stage, you might want to publish a dataset on GitHub that contains some annexed data. The problem is that annexed data could be large, and you can quickly run into problems with GitHub’s storage/bandwidth limitations. Moreover, free accounts on GitHub do not support working with annexed data.
With DataLad, however, you can host large/annexed files elsewhere and still have the dataset published on GitHub. This is done with so-called special remotes. The published dataset on GitHub stores the information about where to obtain the annexed file contents when you run datalad get.
Special remotes can point to Amazon S3, Dropbox, Google Drive, WebDAV, sftp servers, etc.
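For example, one of the simplest special remotes is a plain directory on a mounted backup drive (a sketch; the name mybackup and the path are hypothetical):
git annex initremote mybackup type=directory directory=/Volumes/backup/annex encryption=none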
Let’s create a small dataset with an annexed file:
cd ~/tmp
chmod -R u+wX publish && /bin/rm -r publish
datalad create --description "published dataset" -c text2git publish
cd publish
dd if=/dev/urandom of=test1 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test1"
Next, we can set up a special remote on the Alliance’s Nextcloud service. DataLad talks to special remotes via the rclone protocol, so we need to install it (along with the git-annex-remote-rclone utility) and then configure an rclone remote of type WebDAV:
brew install rclone
brew install git-annex-remote-rclone
rclone config
new remote
Name: nextcloud
Type of storage: 46 / WebDAV
URL: https://nextcloud.computecanada.ca/remote.php/webdav/
Vendor: 1 / Nextcloud
User name: razoumov
Password: type and confirm your password
no bearer_token
no advanced config
keep this remote
quit
Inside our dataset we set up a nextcloud remote that will write into the directory annexedData:
git annex initremote nextcloud type=external externaltype=rclone encryption=none target=nextcloud prefix=annexedData
git remote -v
datalad siblings
datalad push --to nextcloud --data anything
If you want to share your annexedData folder with another CCDB user, log in to https://nextcloud.computecanada.ca with your CC credentials, click “share” on annexedData, then optionally type in the name/username of the user to share with.
Next, we publish the dataset on GitHub. The following command creates an empty repository called testPublish on GitHub and sets a publication dependency: all new annexed content will automatically go to Nextcloud when we push to GitHub.
datalad create-sibling-github -d . testPublish --publish-depends nextcloud
datalad siblings # +/- indicates the presence/absence of a remote data annex at this remote
datalad push --to github
dd if=/dev/urandom of=test2 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test2"
datalad push --to github # automatically pushes test2 to nextcloud!
Imagine we are another user trying to download the dataset. In this demo I will use the same credentials, but in principle this could be another researcher (at least for reading only):
user001
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
datalad clone https://github.com/razoumov/testPublish.git publish # note that access to nextcloud is not enabled yet
cd publish
du -s . # the annexed file is not here
git annex whereis --in=here # no annexed file stored locally
git annex whereis test* # two copies: "published dataset" and nextcloud
datalad update --how merge # if you need to update the local copy (analogue of `git pull`)
rclone config # set up exactly the same configuration as before
datalad siblings -d . enable --name nextcloud # enable access to this special remote
datalad siblings # should now see nextcloud
datalad get test1
git annex whereis --in=here # now we have a local copy
dd if=/dev/urandom of=test3 bs=1024 count=$(( RANDOM + 1024 ))
datalad save -m "added test3"
datalad push --to origin # push non-annexed files to GitHub
datalad push --to nextcloud # push annexed files
datalad push --to origin # update the location info on GitHub
Back in the original “published dataset” on my laptop:
datalad update --how merge
ls # now can see test3
datalad get test3
git annex whereis test3 # it is here
Workflow 4: publish a dataset on GitHub with publicly-accessible annexed files on Nextcloud
Starting from scratch, let’s push some files to Nextcloud with rclone:
cd ~/tmp
chmod -R u+wX publish && /bin/rm -r publish
dd if=/dev/urandom of=test1 bs=1024 count=$(( RANDOM + 1024 ))
rclone copy test1 nextcloud: # works since we've already set up the `nextcloud` remote in rclone
Log in to https://nextcloud.computecanada.ca with your CC credentials, click “share” on test1, followed by “share link” and “copy link”. Add /download to the copied link to form something like https://nextcloud.computecanada.ca/index.php/s/YeyNrjJfpQQ7WTq/download.
datalad create --description "published dataset" -c text2git publish
cd publish
cat << EOF > list.csv
file,link
test1,https://nextcloud.computecanada.ca/index.php/s/YeyNrjJfpQQ7WTq/download
EOF
datalad addurls --fast list.csv '{link}' '{file}' # --fast means do not download, just add URL
git annex whereis test1 # one copy (web)
Later, when needed, we can download this file with datalad get test1.
datalad create-sibling-github -d . testPublish2 # create an empty repo on GitHub
datalad siblings # +/- indicates the presence/absence of a remote data annex at this remote
datalad push --to github
user001
module load git-annex # need this each time you use DataLad
alias datalad=/project/def-sponsor00/shared/datalad-env/bin/datalad # best to add this line to your ~/.bashrc file
chmod -R u+wX publish && /bin/rm -r publish
datalad clone https://github.com/razoumov/testPublish2.git publish # "remote origin not usable by git-annex"
cd publish
git annex whereis test1 # one copy (web)
datalad get test1
git annex whereis test1 # now we have a local copy
Workflow 5: (if we have time) managing multiple Git repos under one dataset
Create a new dataset and inside clone a couple of subdatasets:
cd ~/tmp
datalad create -c text2git envelope
cd envelope
# let's clone a few regular Git (not DataLad!) repos
datalad clone --dataset . https://github.com/razoumov/radiativeTransfer.git projects/radiativeTransfer
datalad clone --dataset . https://github.com/razoumov/sharedSnippets projects/sharedSnippets
git log # can see those two new subdatasets
Go into one of these subdatasets, modify a file, and commit it to GitHub:
cd projects/sharedSnippets
# manually add an empty line to mpiContainer.md
git status
git add mpiContainer.md
git commit -m "added another line to mpiContainer.md"
git push
This directory is still a pure Git repository, i.e. there are no DataLad files.
Let’s clone our entire dataset to another location:
cd ~/tmp
datalad install --description "copy of envelope" -r -s envelope copy # `clone` has no recursive option
cd copy
cd projects/sharedSnippets
git log # cloned as of the moment of that dataset's creation; no recent update there yet
Recursively update all child Git repositories:
git remote -v # remote is origin = GitHub
cd ../.. # to ~/tmp/copy
git remote -v # remote is origin = ../envelope
# pull recent changes from "proper origin" for each subdataset
datalad update -s origin --how=merge --recursive
cd projects/sharedSnippets
git log