User:Glauxosdever/Version Control Software

From OSDev Wiki
Jump to navigation Jump to search

This article is a stub! This page or section is a stub. You can help the wiki by accurately contributing to it.

Version Control Software (VCS) is fundamental when it comes to managing your source code repository. It basically allows you (singular or plural) to work on several features in parallel and to revert your code to a previous known working state. Most OSDev projects use git as their VCS and github or gitlab for repository hosting. But what happens if your goal is to write an OS that is different to anything that can run git (so no git for your OS), but you also want to do meaningful stuff with your repository on your OS, and/or self-host your repository on a server that is running your OS? There are two possible answers: 1) reimplement git, or 2) write your own VCS. If you picked 1), you should read about the git internals. If you picked 2), you should continue reading this page.

Roadmap

There are two possible approaches to writing a VCS for your OS.

Approach A

Initially, use git or any other ready VCS. Then, when your OS is ready to be self-hosting, write a VCS that runs on your OS and switch all development to it. This will have the drawback that your new repository will not hold the past history of the period that you used git or any other ready VCS.

Approach B

Write a VCS that runs on your host machine before starting developing your OS. Then, when your OS is ready to be self-hosting, write a compatible VCS that runs on your OS. Switching development to it should be (mostly) seamless.

Types of VCS

TODO

Writing a Basic VCS

In this section, we will write a very basic VCS. It will not even remotely support any remote functionality (pun intended), but it will be enough to get you started. We will write it in Python, since Python is a good programming language for prototyping stuff.

The Internals

We will create a snapshot-based VCS, i.e. we won't store the diff of the files in each commit. Instead, we will store the full files in order for retrieving source trees to be an efficient operation. You may of course want to change that later.

We will store the repository on the filesystem (you could use a database instead if you like it). The root directory of the repository will be .repo, which will contain the subdirectories branches, commits and files. As you probably guessed, branches will contain information about branches. However, commits will only contain information about the contents of commits, i.e. no author, time, and such information -- that's in order to deduplicate space when two commits are similar in contents. files will only contain information about the file contents, i.e. no author, time, and such information -- again, that's in order to deduplicate space when two files are similar in contents.

Additionally, the working directory for each branch will be in a directory at the same level as the .repo directory, which will be named after the branch.

Branches

The branches directory will contain subdirectories named after the names of the branches. For example, if we create a branch "master", a subdirectory master will be created. Inside, we will have a commits file that will contain, per line per commit, a commit hash and a timestamp.

The commits file may look like this:

d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41 1547387622
ba04414fa9bf63b8a39f658483edc0f6b8a8c759f707ec6846134986db9dc929 1547388056

Commits

The commits directory will contain subdirectories named after the hashes of the commits. For example, if we create a commit "d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41", a subdirectory d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41 will be created. Inside, we will have a branches file that will list the branches where this commit it exists. Also, we will have a files file that will list, per line per file, its hash and its path.

The branches file may look like this:

master
feature1
feature2

The files file may look like this:

7d2b62d9a6f9a0ebfdde033e3cfb197f4d46998c22853861eab603cbd1246920 Makefile
487ea42328a8738a12b378dce4329e2432a0938ff12832ec932fa88c948ca015 src/test.c

A commit hash will be derived using the SHA256 algorithm, where the input will be the concatenation of all paths sorted in alphabetical order and all file contents in the same order as the paths. You may want to change that possibly (but, if you include the timestamp, you won't be able to deduplicate similar commits).

  • Note: Real VCSs (e.g. git) don't put all hashes in one directory. Instead, they split the hashes into directory levels. That's in order to minimise directory search times.
  • Note: Git includes the timestamp in the input string for deriving the commit hash.

Files

The files directory will contain subdirectories named after the hashes of the file revisions. Inside, we will have a commits file that will list the commits where this file revision exists. Also, the data file will contain the contents of the file as is appears during that revision.

The commits file may look like this:

d97cf57b44b1b43d8d66e360c6d0c8a66a04ee18e092a4cad4684e26c86fbf41
ba04414fa9bf63b8a39f658483edc0f6b8a8c759f707ec6846134986db9dc929
  • Note: Real VCSs employ compression for the file contents, in order to decrease disk space usage.

The Code

So we get the following .py files...

utils.py

This file contains four functions as described in the comments. It's still nothing VCS-specific.

# This function reads <filename> into a list of lines.
# The trailing '\n' character of each line is discarded.
def readlines(filename):
    fp = open(filename, "r")
    lines = fp.read().splitlines()
    return lines

# This function writes a list of lines into <filename>.
# The trailing '\n' character of each line is added.
def writelines(filename, lines):
    fp = open(filename, "w")
    for line in lines:
        fp.write(line + "\n") 
    fp.close()

# This function reads <filename> into a list of lists of words.
def readwordlines(filename):
    lines = readlines(filename)
    wordlines = [line.split(" ") for line in lines]
    return wordlines

# This function writes a list of lists of words into <filename>.
def writewordlines(filename, wordlines):
    lines = [" ".join(wordline) for wordline in wordlines]
    writelines(filename, lines)

init.py

This file contains code to initialise an empty repository. You should make it executable.

#! /usr/bin/python

import os

import utils

def main():
    # Create necessary directories.
    os.mkdir(".repo", 0755)
    os.mkdir(".repo/branches", 0755)
    os.mkdir(".repo/commits", 0755)
    os.mkdir(".repo/files", 0755)

if __name__ == "__main__":
    main()

branch.py

This file contains code to create an empty branch. You should also make it executable.

#! /usr/bin/python

import os
import sys

import utils

def main():
    # Get branch name.
    branch = sys.argv[1]
    # Here you should probably check if the branch name is allowed.
    # For example, you may want to disallow certain names that have
    # a special meaning in your internal representation.

    # Create the .repo/branches/<branch> directory.
    os.mkdir(".repo/branches/" + branch, 0755)

    # Create an empty .repo/branches/<branch>/commits file.
    fp = open(".repo/branches/" + branch + "/commits", "w")
    fp.close()

    # Create the <branch> directory, if it doesn't already exist.
    try:
        os.mkdir(branch, 0755)
    except OSError:
        # Directory already exists.
        pass

if __name__ == "__main__":
    main()

commit.py

Now stuff is getting complex. I hope I have explained it well in the comments...

You should also make this file executable.

#! /usr/bin/python

import hashlib
import os
import sys
import time

import utils

def main():
    # Get branch name.
    branch = sys.argv[1]

    # Get current time.
    now = int(time.time())

    # Get the list of all files in the branch source tree.
    filepaths = []
    for cur, dirs, files in os.walk(branch):
        for f in files:
            # Get the paths without the beginning "<branch>/".
            # For example, the path "master/src/main.c" becomes "src/main.c".
            filepaths.append((cur + "/" + f).split("/", 1)[1])

    # Sort the paths as it's important for them to be in a specific order
    # for the SHA256 hash function. Imagine if you got two different hashes
    # for the same directory structure; that would be broken!
    filepaths.sort()

    # Get the list of all file contents.
    # Normally, the paths should be traversed sequentially, thus the contents
    # should be in the correct order for the SHA256 hash function.
    # TODO: I don't know if this is the case, you should verify it
    #       if you are serious about using this code to get started.
    filecontents = []
    for filepath in filepaths:
        fp = open(branch + "/" + filepath, "r")
        filecontents.append(fp.read())
        fp.close()

    # Get file paths string.
    filepathsstr = ""
    for filepath in filepaths:
        filepathsstr += filepath

    # Get file contents string.
    filecontentsstr = ""
    for filecontent in filecontents:
        filecontentsstr += filecontent

    # Derive the commit hash.
    commithash = hashlib.sha256(filepathsstr + filecontentsstr).hexdigest()

    # Create the .repo/commits/<commit> directory. If the directory already
    # exists, we will just update the list of branches where it is present.
    commitexists = False
    try:
        os.mkdir(".repo/commits/" + commithash, 0755)
    except OSError:
        commitexists = True

    if commitexists:
        # Update the .repo/commits/<commit>/branches file, in case <branch>
        # is not found in it.
        commitbranches = utils.readlines(".repo/commits/" + commithash + "/branches")
        found = False
        for i in range(0, len(commitbranches)):
            if commitbranches[i] == branch:
                found = True
                break

        if not found:
            commitbranches.append(branch)
            utils.writelines(".repo/commits/" + commithash + "/branches", commitbranches)

    else:
        # Loop through paths and contents.
        filehashes = []
        for i in range(0, len(filepaths)):
            # Create the file hash. We will use the file contents.
            filehash = hashlib.sha256(filecontents[i]).hexdigest()

            # Append the file hash to the list of file hashes.
            filehashes.append(filehash)

            # Create the .repo/files/<file> directory. If the file already exists,
            # we will just update the list of commits where it is present.
            fileexists = False
            try:
                os.mkdir(".repo/files/" + filehash, 0755)
            except OSError:
                # File already exists.
                fileexists = True

            if fileexists:
                # Update the .repo/files/<file>/commits file.
                filecommits = utils.readlines(".repo/files/" + filehash + "/commits")
                filecommits.append(commithash)
                utils.writelines(".repo/files/" + filehash + "/commits", filecommits)
            else:
                # Create the .repo/files/<file>/data file.
                fp = open(".repo/files/" + filehash + "/data", "w")
                fp.write(filecontents[i])
                fp.close()

                # Create the .repo/files/<file>/commits file.
                filecommits = [commithash]
                utils.writelines(".repo/files/" + filehash + "/commits", filecommits)

        # Create the .repo/commits/<commit>/files file.
        commitfiles = []
        for i in range(0, len(filepaths)):
            commitfiles.append([filehashes[i], filepaths[i]])
        utils.writewordlines(".repo/commits/" + commithash + "/files", commitfiles)

        # Create the .repo/commits/<commit>/branches file.
        commitbranches = [branch]
        utils.writelines(".repo/commits/" + commithash + "/branches", commitbranches)

    # Update the .repo/branches/<branch>/commits file.
    branchcommits = utils.readwordlines(".repo/branches/" + branch + "/commits")
    branchcommits.append([commithash, "{}".format(now)])
    utils.writewordlines(".repo/branches/" + branch + "/commits", branchcommits)

if __name__ == "__main__":
    main()

fsck.py

This one is very important. It does an integrity check of the repository (also useful when testing if the VCS that will run on your OS is compatible with this one).

It is also quite complex, I hope it is correct and that I explained stuff well in the comments. Again, make it executable.

#! /usr/bin/python

import hashlib
import os

import utils

def fsckfile(f):
    # Get file data.
    fp = open(".repo/files/" + f + "/data", "r")
    data = fp.read()
    fp.close()

    # Check if the file hash is correct.
    filehash = hashlib.sha256(data).hexdigest()
    if filehash != f:
        print("Integrity check failed: File " + f + " should have hash " + filehash)

def checkcommitinfile(commit, f):
    # Get the list of commits that <f> lists. <commit> should be listed.
    commits = utils.readlines(".repo/files/" + f + "/commits")
    found = False
    for c in commits:
        if c == commit:
            found = True
            break

    if not found:
        print("Integrity check failed: File " + f + " is listed in commit " + commit + ", but file " + f + " doesn't list commit " + commit)

def fsckcommit(commit):
    # Get the list of files that are listed in <commit>.
    files = utils.readwordlines(".repo/commits/" + commit + "/files")

    # Get the list of file hashes and paths based on files above.
    filehashes = []
    filepaths = []
    for f in files:
        filehashes.append(f[0])
        filepaths.append(f[1])

    # Get file paths string.
    filepathsstr = ""
    for filepath in filepaths:
        filepathsstr += filepath

    # Get file contents string. Also detect if some files are missing.
    filecontentsstr = ""
    missing = False
    for filehash in filehashes:
        try:
            fp = open(".repo/files/" + filehash + "/data")
            filecontentsstr += fp.read()
            fp.close()
        except IOError:
            print("Integrity check failed: Commit " + commit + " lists file " + filehash + ", but file " + filehash + " doesn't exist")
            missing = True

    if not missing:
        # Check if the commit hash is correct.
        commithash = hashlib.sha256(filepathsstr + filecontentsstr).hexdigest()
        if commithash != commit:
            print("Integrity check failed: Commit " + commit + " should have hash " + commithash)

        # Check if the relation of this commit and its files is bidirectional.
        for filehash in filehashes:
            checkcommitinfile(commit, filehash)

def checkbranchincommit(branch, commit):
    # Get the list of branches that <commit> lists. <branch> should be listed.
    branches = utils.readlines(".repo/commits/" + commit + "/branches")
    found = False
    for b in branches:
        if b == branch:
            found = True
            break

    if not found:
        print("Integrity check failed: Commit " + commit + " is listed in branch " + branch + ", but commit " + commit + " doesn't list branch " + branch)

def fsckbranch(branch):
    # Get the list of commits that are listed in <branch>.
    commits = utils.readwordlines(".repo/branches/" + branch + "/commits")

    # Get the list of commit hashes based on commits above.
    commithashes = []
    for commit in commits:
        commithashes.append(commit[0])
    
    # Check if the relation of this branch and its commits is bidirectional.
    # Also detect if any commits are missing.
    for commithash in commithashes:
        if os.path.exists(".repo/commits/" + commithash):
            checkbranchincommit(branch, commithash)
        else:
            print("Integrity check failed: Branch " + branch + " lists commit " + commithash + ", but commit " + commithash + " doesn't exist")

def main():
    # Check integrity for each file.
    for cur, dirs, files in os.walk(".repo/files"):
        for d in dirs:
            fsckfile(d)

    # Check integrity for each commit.
    for cur, dirs, files in os.walk(".repo/commits"):
        for d in dirs:
            fsckcommit(d)

    # Check integrity for each branch.
    for cur, dirs, files in os.walk(".repo/branches"):
        for d in dirs:
            fsckbranch(d)

if __name__ == "__main__":
    main()

merge.py

TODO