4 Git support

4.1 Introduction

The PKM has built-in support for Git. It aims to give users a way to synchronize the PKM with one or more Git repositories on remote Git servers over HTTP, HTTPS or SSH, with as much freedom as possible. A single request can run a list of arbitrary Git commands through the standard git command line interface (CLI). The PKM takes care of authentication with the remote Git server (limited to one server per command) for the 'clone', 'fetch', 'pull' and 'push' commands. The PKM can store the user’s Git credentials, either completely or partially (e.g. without the password), in a kind of “wallet” kept in the user’s custom data in MongoDB and encrypted with AES256. The PKM server manages the on-disk Git working trees (the workspaces) and Git directories (the .git directories) in the directory git-root (configurable in pkm_config.json). The PKM synchronizes the PKM files in the MongoDB database with those in the Git working trees on the local file system, and vice versa. The PKM also manages read-only bits in the PKM files: a dirty bit, which speeds up synchronization, and an unmerged bit, which helps identify unmerged files (see Section 3.2.2).

Figure 5 below outlines how the support for Git works:

Figure 5: Built-in support for Git distributed version-control system

The Git program runs under the supervision of the PKM server. The PKM supplies the user’s credentials to Git indirectly, since Git does not read them from a plain pipe (see Section 4.3, Issue 3). It collects the textual output of Git through a pipe and reads the versioned files from the local file system. Communication with the remote Git servers is left to the standard Git program.

4.2 Terminology

This section defines the Git-related terminology used throughout this document.

Figure 6 below summarizes the terminology of Git:

A Git working tree is an on-disk workspace for the developer. Each working tree has an associated Git directory, where Git stores all the information about the repository and its history. In most situations, the Git directory is a subdirectory of the working tree named '.git'. However, Git can also manage Git directories located outside the working tree. In addition, a Git working tree can have secondary working trees, created with the git worktree add command. The original working tree is called the main working tree, and the secondary ones are called linked working trees. The PKM data model for Git working trees, presented in Section 3.2.13, reflects this organization.

4.3 Safety & Security

The main safety and security issues that may arise when running Git in a server environment, and that we faced while implementing Git support in the PKM, are described below.

Issue 1. A Git command can walk up through parent directories in search of a Git working tree, possibly escaping from the project root directory on the host server. Our solution is to make the PKM run 'git rev-parse --show-toplevel' before running any Git command. This command fails when Git finds no working tree. When it succeeds, the PKM checks that the reported Git working tree really lies within the directory git-root/<project-name>, i.e. the directory that contains all Git working trees of a project.
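The check described in Issue 1 can be sketched as follows. This is a minimal illustration, not the PKM's actual implementation: it assumes Python 3.9 or later and a git executable on the PATH, and the function names is_within, find_toplevel and working_tree_is_confined are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Optional


def is_within(base: Path, candidate: Path) -> bool:
    """Return True if candidate, once resolved, is base or lies below it."""
    base = base.resolve()
    candidate = candidate.resolve()
    return candidate == base or base in candidate.parents


def find_toplevel(cwd: Path) -> Optional[Path]:
    """Ask Git for the working tree root; return None if Git finds none."""
    result = subprocess.run(
        ["git", "rev-parse", "--show-toplevel"],
        cwd=cwd, capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    return Path(result.stdout.strip())


def working_tree_is_confined(cwd: Path, project_root: Path) -> bool:
    """Accept a command only if its working tree lies under project_root.

    project_root stands for git-root/<project-name> (hypothetical parameter).
    """
    toplevel = find_toplevel(cwd)
    return toplevel is not None and is_within(project_root, toplevel)
```

Resolving both paths before comparing them matters here: a purely textual prefix comparison would accept a path such as git-root/<project-name>/../other.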

Issue 2. Some Git commands accept system-wide paths as command line options or parameters. We had to prevent such paths from pointing out of bounds, for instance with the 'clone' command. For most commands, Git already checks that the paths target files in the Git working tree where the command runs. For other commands, like 'clone', the PKM parses the command and checks the value of each option and argument that Git interprets as a path before spawning Git. Currently only the following commands are parsed: 'clone', 'fetch', 'pull', 'push', 'config' and 'worktree'. Other commands of which we are not aware may require special attention.

Issue 3. Some commands, like 'clone', require authentication to a remote server. We had to ensure that a request to the PKM server is never blocked (leaving the client without a response) because the Git program waits forever for credentials. This can happen because Git authentication is interactive, with one prompt for the user name and one prompt for the password. Redirecting the input of Git to supply the user’s credentials is tricky because Git reads the credentials from the terminal rather than from a pipe. Our solution is to make the PKM use the sshpass command to provide Git (or ssh) with the credentials. When the PKM detects that a command may need the user’s credentials, Git runs under the supervision of sshpass, which itself runs under the supervision of the PKM server. Because sshpass cannot provide more than one credential per Git command, the PKM checks the command options to ensure that the command targets only one remote server. When the remote server is only named on the command line (e.g. 'origin') rather than given as a URL, the PKM also looks at each remote in the Git local configuration (i.e. .git/config). The PKM then checks that credentials for the remote server are available, either in the request body or in the user’s “wallet”, before actually spawning Git. For HTTP/HTTPS, the PKM provides the user’s name through the URL by rewriting the URL on the fly. As Git may record the rewritten HTTP/HTTPS URLs in the Git config remotes, the PKM makes these URLs anonymous again in the Git config remotes.
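The HTTP/HTTPS URL rewriting mentioned in Issue 3 can be illustrated with the following sketch. It only shows the idea of embedding a user name in a URL and stripping it again; the function names with_user and anonymize and the example host are hypothetical, and the PKM's actual rewriting rules may differ.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_user(url: str, user_name: str) -> str:
    """Embed user_name in an HTTP/HTTPS URL, replacing any existing user info."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url  # other transports are left untouched in this sketch
    host = parts.netloc.rpartition("@")[2]  # drop any "user[:password]@" prefix
    netloc = f"{quote(user_name, safe='')}@{host}"
    return urlunsplit(parts._replace(netloc=netloc))


def anonymize(url: str) -> str:
    """Remove the user info from an HTTP/HTTPS URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=host))


if __name__ == "__main__":
    rewritten = with_user("https://git.example.org/team/repo.git", "alice")
    print(rewritten)
    print(anonymize(rewritten))
```

Percent-encoding the user name keeps characters such as '@' from being mistaken for the separator between the user info and the host.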

4.4 User’s credential “wallet”

Each user has a kind of “wallet” that stores the user’s Git credentials, either completely or partially (e.g. without the password). This “wallet” is held in the property git_user_credentials of the PKM user (see Section 3.2.1).

PkmUser
{
  "name": "name",
  "password": null,
  "first_name": "first_name",
  "last_name": "last_name",
  "email": "email",
  "phone": "phone",
  "roles" :
  [
    { "db": "db", "role": "role" }
  ],
  "git_user_credentials":
  [
    {
      "git_remote_url": "string",
      "git_user_name": "string",
      "git_password": "string",
      "git_ssh_private_key": "string"
    }
  ]
}

Only the owning user can view and modify their credentials. The credentials travel as plain text in the REST API, so the REST API should be secured with HTTPS, which the PKM server supports. In the MongoDB database, they are encrypted with AES256. The decryption key (32 bytes) is stored permanently in a file named 'secret' (configurable in pkm_config.json) on the server disk, readable only by the PKM system user and inaccessible to the group and others, or on the same Docker volume as pkm_config.json.
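As an illustration of the file permissions described above, the following sketch creates a 32-byte key file that is read-only for its owner and refuses to load a key that the group or others can access. It assumes a POSIX file system. The function names are hypothetical, and the way the PKM actually generates and loads its key is not described in this section.

```python
import os
import secrets
from pathlib import Path


def create_secret_file(path: Path) -> None:
    """Create a new 32-byte key file that only its owner may read."""
    key = secrets.token_bytes(32)  # 32 bytes = 256 bits, the AES-256 key size
    # O_EXCL refuses to overwrite an existing key; 0o400 is owner read-only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o400)
    with os.fdopen(fd, "wb") as key_file:
        key_file.write(key)


def load_secret(path: Path) -> bytes:
    """Load the key, rejecting files that the group or others can access."""
    mode = path.stat().st_mode & 0o777
    if mode & 0o077:
        raise PermissionError(f"{path} is accessible to group or others")
    key = path.read_bytes()
    if len(key) != 32:
        raise ValueError("expected a 32-byte key")
    return key
```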

4.5 Operations

The PKM REST API supports the following Git-related operations:

Run a Git command sequence

POST /git/run/{dbName}?asynchronous=…
{
    "git_commands": [ [ "command", "arg1", "arg2", … ], … ],
    "options": {
        "dont_delete_pkm_files": false,
        "dont_delete_git_working_tree_files": false,
        "git_user_credentials": […]
    }
}

This operation runs the Git command sequence passed in the request body in the project with the given name {dbName}. The response body contains a Git job. In asynchronous mode (i.e. asynchronous=true), the job is enqueued in a job queue and the client can poll it, using the property id of the job as the job identifier. By default, jobs run synchronously.
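A client might build the request for this operation as sketched below. The sketch only constructs the request and does not send it. The base URL, the project name and the example command are hypothetical, treating the first element of each command as the Git subcommand is an assumption, and any authentication header the PKM server requires is omitted because this section does not describe it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_run_request(base_url, db_name, git_commands,
                      asynchronous=False, options=None):
    """Build (without sending) a POST /git/run/{dbName} request."""
    query = urlencode({"asynchronous": "true" if asynchronous else "false"})
    url = f"{base_url.rstrip('/')}/git/run/{quote(db_name, safe='')}?{query}"
    body = {"git_commands": git_commands, "options": options or {}}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical server, project and command, for illustration only.
    request = build_run_request(
        "https://pkm.example.org", "my_project", [["status"]],
        asynchronous=True,
    )
    print(request.get_method(), request.full_url)
```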

Note that this operation uses both the user’s credentials from the request body and the ones from the user’s credential “wallet”, which can be complete or partial. It merges the user’s credentials giving priority to the ones in the request body.
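The merge can be pictured with the following sketch. It assumes that credentials are matched by git_remote_url and merged field by field, which this section does not state explicitly; the function name merge_credentials is hypothetical.

```python
def merge_credentials(wallet, request):
    """Merge credential entries per remote URL; request values take priority.

    Assumption: entries are matched on "git_remote_url" and merged field by
    field, so a partial wallet entry can be completed by the request body.
    """
    merged = {}
    # Request entries are processed last, so their values overwrite the wallet's.
    for entry in list(wallet) + list(request):
        target = merged.setdefault(entry["git_remote_url"], {})
        for field, value in entry.items():
            if value is not None:  # a missing value never erases a known one
                target[field] = value
    return list(merged.values())
```

With this scheme, a wallet entry that holds only the user name and a request entry that holds only the password for the same remote combine into one complete credential.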

An overview of the algorithm for the run operation is the following:

1. Scan the Git working trees on the disk, then make anonymous the HTTP/HTTPS URLs in the Git config remotes
2. Get the list of all on-disk filenames
3. Get the Git user’s credentials
4. Synchronize on-disk files with PKM:
   - Get all dirty files from the database and write them on the file system,
   - then delete files on the file system supported by the PKM which are not in the database (*)
5. Run all the Git commands on the file system
6. Scan the Git working trees on the disk, then make anonymous the HTTP/HTTPS URLs in the Git config remotes
7. Update the Git working trees metadata in the PKM
8. Get the list of all on-disk filenames
9. Get the list of versioned filenames
10. Get the list of unmerged filenames
11. Synchronize PKM with on-disk files:
   - Put all files from the Git directory on the file system into the database,
   - then delete files in the database which are no longer in the Git working directory (*)

Steps 1 to 4 retrieve the user’s credentials and prepare the disk storage for the execution of the Git commands. These steps essentially dump the dirty files from the PKM to the disk and possibly delete the on-disk files that are no longer in the PKM. Step 5 runs the Git commands, which may cause directories and files on the disk to appear or disappear. Steps 6 to 11 update the PKM, essentially by filling it with the on-disk files and deleting the PKM files that are no longer on the disk.

The actual implementation of this algorithm has tuning options (dont_delete_pkm_files and dont_delete_git_working_tree_files) that support a lazy synchronization between the PKM (the database) and the Git working trees: they avoid deleting files on one side that have disappeared from the other side (see (*) in the algorithm).
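The two synchronization steps marked (*) and the effect of the tuning options can be sketched with plain sets of filenames. This is a simplified model under stated assumptions: it supposes that dont_delete_git_working_tree_files guards the deletion on disk (step 4) and dont_delete_pkm_files guards the deletion in the database (step 11), as the option names suggest, and it ignores the restriction to file types supported by the PKM. The names SyncPlan, plan_disk_sync and plan_pkm_sync are hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class SyncPlan(NamedTuple):
    to_write: Set[str]   # filenames to copy to the target side
    to_delete: Set[str]  # filenames to remove from the target side


def plan_disk_sync(dirty_in_pkm: Iterable[str], all_in_pkm: Iterable[str],
                   on_disk: Iterable[str],
                   dont_delete_git_working_tree_files: bool = False) -> SyncPlan:
    """Step 4: write dirty PKM files to disk, delete disk files absent from the PKM."""
    to_delete = set() if dont_delete_git_working_tree_files \
        else set(on_disk) - set(all_in_pkm)
    return SyncPlan(set(dirty_in_pkm), to_delete)


def plan_pkm_sync(on_disk: Iterable[str], all_in_pkm: Iterable[str],
                  dont_delete_pkm_files: bool = False) -> SyncPlan:
    """Step 11: put disk files into the PKM, delete PKM files absent from disk."""
    to_delete = set() if dont_delete_pkm_files \
        else set(all_in_pkm) - set(on_disk)
    return SyncPlan(set(on_disk), to_delete)
```

Setting either option to true leaves the corresponding side untouched for files that have vanished from the other side, which is the lazy behavior described above.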

Poll a Git job

GET /git/job/{jobId}

When a Git job runs asynchronously, this operation allows the client to poll the Git job with the given identifier {jobId} until the job completes. The job is dequeued when the client polls a completed job (either finished or failed).
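A client-side polling loop could look like the sketch below. Because this section does not describe the fields of a Git job other than its id, the completion test is supplied by the caller. The function name poll_job, the parameters and the "state" field used in the usage example are hypothetical.

```python
import time
from typing import Callable, Dict


def poll_job(fetch_job: Callable[[], Dict], is_complete: Callable[[Dict], bool],
             interval_s: float = 1.0, max_attempts: int = 60) -> Dict:
    """Call fetch_job repeatedly until is_complete accepts the returned job.

    fetch_job is expected to perform GET /git/job/{jobId} and return the
    decoded job; it is injected so that this sketch needs no server.
    """
    for attempt in range(max_attempts):
        job = fetch_job()
        if is_complete(job):
            return job
        if attempt + 1 < max_attempts:
            time.sleep(interval_s)
    raise TimeoutError(f"job not completed after {max_attempts} attempts")


if __name__ == "__main__":
    # Fake server responses; the "state" field is a hypothetical placeholder.
    responses = iter([{"id": 1, "state": "running"}, {"id": 1, "state": "finished"}])
    done = poll_job(lambda: next(responses),
                    lambda job: job["state"] in ("finished", "failed"),
                    interval_s=0)
    print(done)
```

Since the server dequeues a job once a completed result has been polled, the client should keep the returned job rather than poll for it again.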

Access to Git config file

GET /git/config/{dbName}/{gitWorkingTree}

PUT /git/config/{dbName}/{gitWorkingTree}
{
    "git_config": "content…"
}

These operations provide read/write access to the Git local configuration (i.e. .git/config).

Access to files in Git working trees

GET /git/files/{dbName}/{gitWorkingTree}

GET /git/files/{dbName}/{gitWorkingTree}/{filename}

POST or PUT /git/files/{dbName}/{gitWorkingTree}
[
    PKM files…
]

DELETE /git/files/{dbName}/{gitWorkingTree}

DELETE /git/files/{dbName}/{gitWorkingTree}/{filename}

These operations provide read/write access to the files in Git working trees. They are the only way to access .gitignore files in the Git working trees.

Show the status of Git working trees, i.e. the Git working trees metadata

GET /git/working_trees/{dbName}

GET /git/working_trees/{dbName}/{gitWorkingTree}

These operations return some metadata about Git working trees. They are convenient for quickly inspecting the status of the Git working trees without running Git commands.

Delete Git working trees

DELETE /git/working_trees/{dbName}?dontDeletePkmFiles=…

DELETE /git/working_trees/{dbName}/{gitWorkingTree}?dontDeletePkmFiles=…

These operations delete Git working trees created with 'clone'. They can also delete the corresponding files in the PKM. There is no other way to delete Git working trees, and the only way to create a new Git working tree is to run a 'clone' command.