Version Control

Steven J Zeil

Last modified: Dec 26, 2023

Abstract

Version control (a.k.a. version management) is concerned with the management of change in the software artifacts being developed.

In this lesson we look at the kinds of practical problems that arise during software development and that can be addressed by proper version control.


1 Issues in Version Control

The issues addressed by version control are history, collaboration, and exploration.

2 Background - Changing the Code Base


ed

One of the earliest Unix text editors, ed applies a series of editing commands like ‘a’ to append text after the current line, ‘i’ to insert text before the current line, ‘d’ to delete the current line, etc.


diff

diff compares two files line by line, listing the differences between them. With the --ed option, that listing takes the form of an ed script, so that

diff --ed file1 file2 > file12.diff
echo w file2 >> file12.diff
#
#  many days later...
#
ed file1 < file12.diff

would “rebuild” file2 from file1 and the diff.
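The same idea, keeping only file1 plus a record of the changes and rebuilding file2 on demand, can be sketched in Python. This is an illustration built on the standard difflib module, not a reimplementation of diff or ed; the function names make_edit_script and apply_edit_script and the sample file contents are assumptions made for the example.

```python
import difflib


def make_edit_script(old_lines, new_lines):
    """Record only what must change to turn old_lines into new_lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    script = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace old_lines[i1:i2] with the stored new text.
            script.append((i1, i2, new_lines[j1:j2]))
    return script


def apply_edit_script(old_lines, script):
    """Rebuild the new file from the old file plus the saved script."""
    result = list(old_lines)
    # Apply from the bottom up so earlier line numbers stay valid.
    for i1, i2, replacement in reversed(script):
        result[i1:i2] = replacement
    return result


file1 = ["alpha", "beta", "gamma", "delta"]
file2 = ["alpha", "BETA", "gamma", "delta", "epsilon"]

script = make_edit_script(file1, file2)    # the saved "diff"
rebuilt = apply_edit_script(file1, script)  # many days later...
print(rebuilt == file2)
```

Only the script needs to be stored alongside file1; unchanged lines are never copied into it.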


patch

patch takes a slightly more sophisticated approach to the idea of applying a diff output to a file:

diff file1 file2 > file12.diff
  ⋮
patch file1 file12.diff

2.1 Integrating Changes

Suppose that we have two patch files created from the same base file file1:

patch -o file2a file1 patchA
patch -o file2b file1 patchB

Change integration is the problem of combining both sets of changes to form a desired file file2.
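One way to see when integration is easy and when it is hard is to ask whether the two sets of changes touch the same region of the base file. The sketch below is a rough illustration using Python's difflib; the helper names and the sample contents are assumptions, and adjacent changes are deliberately treated as overlapping to stay on the safe side.

```python
import difflib


def changed_ranges(base, revised):
    """Return the (start, end) line ranges of base that a revision alters."""
    matcher = difflib.SequenceMatcher(a=base, b=revised, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _, _ in matcher.get_opcodes()
            if tag != "equal"]


def overlaps(range_a, range_b):
    """True when two base ranges share lines or sit directly next to each other."""
    (a1, a2), (b1, b2) = range_a, range_b
    return a1 <= b2 and b1 <= a2


def can_integrate_independently(base, version_a, version_b):
    """True when no change in A touches a base region that B also changed."""
    return not any(overlaps(ra, rb)
                   for ra in changed_ranges(base, version_a)
                   for rb in changed_ranges(base, version_b))


file1 = ["a", "b", "c", "d"]
file2a = ["a", "B", "c", "d"]   # patchA changes the second line
file2b = ["a", "b", "c", "D"]   # patchB changes the fourth line
file2c = ["a", "x", "c", "d"]   # hypothetical third patch, also on the second line

print(can_integrate_independently(file1, file2a, file2b))
print(can_integrate_independently(file1, file2a, file2c))
```

When the check succeeds, both patches can be applied to the base in either order; when it fails, a person has to decide what file2 should contain.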

3 Approaches and Tools

Version Control Systems

If we could extend patch to multiple files at once, we could, in theory, patch an entire software system to move it from version 1 to version 2, then patch it again to move to version 3, etc.

Version control systems keep a master or main copy of the code in a repository. They provide a mechanism by which each different developer can check out copies of the code into their own working directory and, later, check in or commit their changed files to the repository.

Repositories and Working Directories

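Version control systems distinguish between the repository, which holds the master copy of the code, and the working directories in which individual developers edit their checked-out copies.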

3.1 Where are the repositories?

 

  1. The Local Model

    The earliest VCSs (rcs, sccs) assumed that all developers worked in their own directories on a single shared file system. The repository was kept in a separate (often hidden) directory. Checking out and committing code was largely a matter of ordinary file operations like copying and renaming.

     

  2. The Centralized Model

    As it became more common for developers to be working on separate machines connected via a network, version control systems moved to a centralized model (CVS, Subversion) where a central repository was kept on a machine somewhere on the network. For a developer to check out code, files would be transferred from the central repository machine to the machine and working directory of the developer.

     

  3. The Distributed Model

    As storage became cheaper, the idea of storing multiple versions of the code on a developer’s machine came to seem more reasonable. Modern VCSs (git, mercurial) use a distributed model, where each developer has not only a working directory but also their own copy of the repository with all of the new and old versions of the code. The various repositories are periodically instructed to synchronize with one another over the network, so that every developer is, at least approximately, up to date.

    Among other advantages, this approach adds robustness. In the local and centralized models, a corrupted repository or a failed hard drive can lose everything. In the distributed model, you can often recover from such mishaps because multiple developers can merge their repositories to recover the entire project history.

     

    Most teams who work with distributed VCSs will still designate one repository as the central origin of their copies, simply because this simplifies communication among the team. Instead of each of N developers trying to synchronize individually with each of the N-1 other developers, each developer simply synchronizes with the central origin repository.

    A popular approach is to host a team’s origin repository on a service such as GitHub or GitLab.
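The communication savings from a designated origin can be made concrete with a little arithmetic. The sketch below counts synchronization links under the assumption that each distinct pair of repositories counts once; the function names are made up for the illustration.

```python
def pairwise_sync_links(developers):
    """Links needed when every developer synchronizes with every other one."""
    return developers * (developers - 1) // 2


def origin_sync_links(developers):
    """Links needed when every developer synchronizes only with the origin."""
    return developers


for n in (3, 10, 50):
    print(n, pairwise_sync_links(n), origin_sync_links(n))
```

The pairwise count grows quadratically with team size, while the origin-based count grows linearly.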

4 Issues and Approaches

Earlier we identified the primary issues addressed by version control as managing history, collaboration, and exploration.

Let’s talk about how these are managed under version control.

4.1 History

The primary operations affecting history are checking files out of the repository and checking changes in (more commonly, in newer systems, called “committing”) to the repository.

When we commit

When we check out files:

In older version control systems, the revision numbers were assigned in ascending order.

Older version control systems would commit individual files. This meant that often-changed files would wind up with revision numbers much, much larger than those of more stable files. As a result, answering a question like “which are the file revision numbers corresponding to last month’s 1.1.0 release?” was nearly impossible. The answer would be different for each file. The revision numbers themselves turn out to have, in practice, very little meaning.

In newer version control systems, all files in the working directory are committed as a group. Consequently there is always a single answer to the question “which are the file revision numbers corresponding to last month’s 1.1.0 release?” The idea of keeping the revision numbers simple, increasing them on each commit, has also disappeared. The revision numbers are now long hexadecimal numbers, often computed as a hash of the working directory contents.
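To illustrate how a single hexadecimal identifier can stand for an entire snapshot, here is a minimal sketch using Python's hashlib. It is not the scheme used by git or any other particular VCS; the function name revision_id, the choice of SHA-1, and the way files are fed to the hash are all assumptions made for the example.

```python
import hashlib


def revision_id(files):
    """Hash a whole snapshot (path -> text) into one hexadecimal identifier."""
    digest = hashlib.sha1()
    for path in sorted(files):          # sorted so dictionary order never matters
        for field in (path, files[path]):
            data = field.encode("utf-8")
            # Length-prefix each field so adjacent fields cannot blur together.
            digest.update(str(len(data)).encode("ascii") + b":" + data)
    return digest.hexdigest()


snapshot_a = {"main.c": "int main() { return 0; }\n", "README": "demo\n"}
snapshot_b = {"main.c": "int main() { return 1; }\n", "README": "demo\n"}

print(revision_id(snapshot_a))
print(revision_id(snapshot_b))
```

Changing any file in the snapshot changes the identifier, so one identifier pins down the revision of every file at once.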

The implementation of checking in and out has traditionally focused on conserving storage space by saving diffs of successive versions rather than the full text of each file version.

The repository would usually contain the current version of the file plus enough diffs/patches to move back to any prior revision.
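A rough sketch of that storage layout follows: the newest text is kept in full, and each commit adds one backward-pointing delta. The class FileHistory, its revision numbering starting at 1, and the delta format are assumptions chosen for the illustration, not the storage format of rcs or any other real system.

```python
import difflib


def delta(source, target):
    """Edits that turn source into target, as (start, end, new_lines)."""
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    return [(i1, i2, target[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_delta(source, edits):
    result = list(source)
    for start, end, new_lines in reversed(edits):   # bottom up keeps indices valid
        result[start:end] = new_lines
    return result


class FileHistory:
    """Keeps the newest text in full plus reverse deltas to older revisions."""

    def __init__(self, lines):
        self.current = list(lines)     # revision 1
        self.reverse_deltas = []       # entry k turns revision k+2 into k+1

    def commit(self, lines):
        self.reverse_deltas.append(delta(lines, self.current))
        self.current = list(lines)
        return len(self.reverse_deltas) + 1   # the new revision number

    def checkout(self, revision):
        lines = self.current
        # Walk backwards from the newest revision, undoing one commit at a time.
        for edits in reversed(self.reverse_deltas[revision - 1:]):
            lines = apply_delta(lines, edits)
        return lines


v1 = ["a", "b", "c"]
v2 = ["a", "b2", "c"]
v3 = ["a", "b2", "c", "d"]
history = FileHistory(v1)
history.commit(v2)
history.commit(v3)
print(history.checkout(1), history.checkout(2), history.checkout(3))
```

Checking out the newest revision costs nothing extra, while older revisions cost one delta application per commit that has to be undone.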

 

4.2 Collaboration

I had not been working long at one of my first programming jobs when, taking a break from editing a file of FORTRAN code, I asked one of the other programmers what he was working on. He told me, and the subsequent conversation went something like…

“Wait, you’re editing foo.f4? But I’m editing foo.f4.”

“Oh…Well, why don’t you go ahead and save your changes?”

“Noooo. Why don’t you save your changes first?”

The point being, of course, that whoever saves last, wins, because their changes will overwrite anything saved earlier.

This sort of thing happens all the time when you have multiple programmers working on the same set of code files.

In the early local-model VCSs, this scenario was prevented by a locking protocol. When a developer checks out files, the developer receives, by default, an unlocked (read-only) copy of the files. If that developer wishes to edit one of the files, they must specifically request a locked (writable) copy. The repository only permits one locked copy of any particular file at a time. When the developer with a locked file commits a changed version of that file, the file is unlocked once more. Any developer can then request a locked version.
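The locking protocol can be modeled in a few lines. The sketch below is a toy model under simplifying assumptions (a single in-memory repository, one file, and a made-up class name LockingRepository); it is not how rcs or sccs are implemented.

```python
class LockingRepository:
    """Toy model: at most one developer may hold the lock on a file."""

    def __init__(self, files):
        self.files = dict(files)   # file name -> contents
        self.locks = {}            # file name -> developer holding the lock

    def checkout(self, name):
        """Anyone may take an unlocked (read-only) copy."""
        return self.files[name]

    def checkout_locked(self, name, developer):
        holder = self.locks.get(name)
        if holder is not None and holder != developer:
            raise PermissionError(f"{name} is locked by {holder}")
        self.locks[name] = developer
        return self.files[name]

    def commit(self, name, developer, contents):
        if self.locks.get(name) != developer:
            raise PermissionError(f"{developer} does not hold the lock on {name}")
        self.files[name] = contents
        del self.locks[name]       # committing unlocks the file again


repo = LockingRepository({"foo.f4": "original"})
repo.checkout_locked("foo.f4", "John")
try:
    repo.checkout_locked("foo.f4", "Jane")
    jane_got_lock = True
except PermissionError:
    jane_got_lock = False          # Jane must wait for John's commit

repo.commit("foo.f4", "John", "John's changes")
repo.checkout_locked("foo.f4", "Jane")   # now succeeds
print(jane_got_lock, repo.locks)
```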

The locking mechanism could prevent simultaneous edits of files, if everyone played by the rules. But, in practice, if John has a locked copy of a file and gets called into an afternoon-long meeting, and Jane wants to edit that file, the temptation to circumvent the locking system is often irresistible. If John leaves at the end of the day without committing his changes, and then calls in sick the next day, then it may feel absolutely necessary to circumvent the locking system. And it’s not hard to do. Changing a file from read-only to writable is just a matter of changing permissions, and every programmer should know how to do that. Of course, the repository doesn’t permit commits of an unlocked file, but that just means that Jane will wait until John commits, then request the lock and commit her own changes, probably overwriting all of John’s work.

Imagine how much worse this scenario would be in a large open-source project where many people are working as volunteers, often working furiously for a day or two and then disappearing for weeks at a time.

The very notion of a single lock over all project participants is tricky to implement in a network context, even in the centralized model. And it’s simply not possible in a distributed model where each developer has their own copy of the repository.

So, beginning with the centralized model VCSs, a different approach has been taken to managing collaboration. We accept that simultaneous edits are simply going to be a fact of life. The VCS focuses on:

  1. Detecting when a simultaneous edit has occurred.
  2. Automatically merging the changes when it is safe to do so.
  3. Asking a developer to merge the potentially unsafe changes.

Suppose that developers John and Jane have both checked out version 5 of a system. If Jane first commits her changes to a repository, the changes are accepted and become the basis of version 6 of the system. If John later attempts to commit changes, part of the commit info includes the fact that John is working from the (now outdated) version 5. The VCS sees that John is submitting changes to the old version, and flags this action as a conflict.
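The detection step in this scenario amounts to comparing the version a commit was based on with the repository's newest version. A minimal sketch, assuming simple integer version numbers and hypothetical names Repository and CommitConflict:

```python
class CommitConflict(Exception):
    """Raised when a commit is based on a version that is no longer the newest."""


class Repository:
    def __init__(self, head_version):
        self.head_version = head_version

    def commit(self, developer, base_version):
        if base_version != self.head_version:
            raise CommitConflict(
                f"{developer} edited version {base_version}, "
                f"but the repository is at version {self.head_version}")
        self.head_version += 1
        return self.head_version


repo = Repository(head_version=5)
jane_result = repo.commit("Jane", base_version=5)   # accepted, creates version 6
try:
    repo.commit("John", base_version=5)             # John still has version 5
    john_result = "accepted"
except CommitConflict as conflict:
    john_result = f"conflict: {conflict}"

print(jane_result)
print(john_result)
```

A real VCS would go on to attempt a merge at this point; the sketch stops at detection.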

What happens next depends on the particular VCS, but a typical scenario is this:

If the merge is unsuccessful, you will be left with files that contain blocks of text indicating both sets of changes, e.g.,

<<<<<<< commit by jane@developer.com
for (int i = 0; i < N-1; ++i)
{
    ++a[i];
}
=======
for (int i = 0; i <= N; ++i)
{
    a[i] += 1;
}
>>>>>>> commit by john@developer.com

These can be resolved in any text editor and, in most cases, will need to be resolved before the code will compile again. You can also search for files containing the characteristic <<<<< marker using grep and similar tools.
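For projects where grep is not handy, the same search is easy to script. The sketch below walks a directory tree and reports files containing a line that starts with the conflict marker; the function name files_with_conflicts and the sample files written to a temporary directory are assumptions for the demonstration.

```python
import pathlib
import tempfile


def files_with_conflicts(root):
    """Return paths under root containing a line that starts with '<<<<<<<'."""
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if any(line.startswith("<<<<<<<") for line in text.splitlines()):
            hits.append(path)
    return hits


# Demonstration on two throwaway files, one clean and one with markers.
with tempfile.TemporaryDirectory() as root:
    base = pathlib.Path(root)
    (base / "clean.cpp").write_text("int x = 0;\n", encoding="utf-8")
    (base / "loop.cpp").write_text(
        "<<<<<<< commit by jane@developer.com\n"
        "++a[i];\n"
        "=======\n"
        "a[i] += 1;\n"
        ">>>>>>> commit by john@developer.com\n",
        encoding="utf-8")
    conflicted = [p.name for p in files_with_conflicts(root)]

print(conflicted)
```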

Many IDEs provide special side-by-side editors for reviewing and resolving conflicts. Eclipse, for example, includes one.

After a successful merge, rerun your tests. Just because the VCS thinks that those changes could be safely combined does not actually mean that they were algorithmically compatible. Even “safe” merges can break the logic of a program.

4.3 Exploration

4.3.1 Branches

Suppose that we have worked through a few revisions and then get an idea that might not pay off.

We can start a branch to explore our idea while others continue work on the main trunk. A branch is a new line of revisions that can be advanced independently of the main sequence of code revisions.

The VCS will provide commands that switch our working directory from one branch to another.

While we are working in our new branch, other members of our team can proceed to work on and commit changes to the original, main branch.
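A branch can be pictured as a named line of revisions that shares its early history with the line it was started from. The following toy model is only a sketch of that picture; the class BranchingHistory, the revision labels, and the branch name "idea" are invented for the example and do not reflect how any particular VCS stores branches.

```python
class BranchingHistory:
    """Toy model: each branch is its own list of revisions."""

    def __init__(self, first_revision):
        self.branches = {"main": [first_revision]}
        self.current = "main"

    def branch(self, name):
        # The new line of revisions starts from the current branch's history.
        self.branches[name] = list(self.branches[self.current])

    def switch(self, name):
        if name not in self.branches:
            raise KeyError(f"no such branch: {name}")
        self.current = name

    def commit(self, revision):
        self.branches[self.current].append(revision)


history = BranchingHistory("r1")
history.commit("r2")

history.branch("idea")        # start exploring
history.switch("idea")
history.commit("idea-1")

history.switch("main")        # teammates keep working on main
history.commit("r3")

print(history.branches["main"])
print(history.branches["idea"])
```

Commits made on one branch leave the other untouched, which is what lets the exploration proceed without disturbing the main line.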

 

4.3.2 Merging

If the idea in the branch does not pay off, the branch can simply be abandoned.

If you decide to adopt the changes in the branch, you can elect to merge it back into the trunk.

 

The VCS will provide special commands to initiate a merge.

After a merge

 

4.3.3 Merge Commits

Any merge has the potential to detect conflicts just like the ones we described in the previous section.

In fact, conflicts are much more likely during merging than during normal commits, simply because merging typically deals with changes from multiple commits at one time.

If you go too long before merging, you risk winding up in merge hell, faced with massive amounts of conflicting code, of which you yourself have written only a small portion.

This scenario is scary enough that, historically, many users of local and centralized VCSs would use branching only if absolutely necessary. And, faced with a nasty set of conflicts, developers have been known to resort to drastic measures as described in the cartoon on the right.

However, in modern VCSs, it’s no longer possible to avoid branching.

4.3.4 The Dirty Secret of Distributed Version Control

In distributed version control systems like git, each developer has their own local copy of the repository.

Even if you never create a new branch, your main branch in your local repository is a distinct branch from the main branch in all the other developers’ repositories and in the central origin repository (if you have one).

  • “Synchronizing” repositories is actually a process of merging their corresponding branches.

So branching and merging are pervasive in distributed VCSs. We might as well embrace them.

4.3.5 Avoiding Merge Hell

The simplest way to avoid merge hell is to never go too long between merges.

  1. If you are working in the same branch as your teammates, commit your own changes frequently, and check out their commits even more frequently.

    Do not let days go by while you pile up more and more changes to the code.

    In git, start every work session with a pull and end it with a commit and possibly a push.

  2. If you are working in a separate branch and are not ready to merge it into the main branch, then merge the main branch into yours frequently.

    Over time, a long-running branch can get so far out of sync with changes being made to the trunk that the final merge becomes difficult or even impossible.

    • An effective strategy for combating this is to periodically merge the main trunk into the branch
      • the reverse of the “normal” merge direction

 


from The System