Inside Git Stash

Modesty Zhang

5.00/5 (2 votes)

Mar 9, 2024

CPOL

9 min read

6324

Illuminate git internal on object model via a `git stash` implementation, including push, apply, pop, drop, list and clear, all with isomorphic-git

When evaluating git integration solutions for a CDE (Cloud Development Environments), I’m impressed by the technical aspects of isomorphic-git. The codebase is well-organized, documentation is in-sync, and test suite is abundant. The project boasts a robust feature set and delivers great performance. It seemed to fit the project needs perfectly until a feature gap for stash was discovered.

The feature request for git stash has tagged 'help needed' for years, apparently bridging the gap would benefit more people. A deep dive into the git object model and the intricate details of isomorphic-git is required to add the stash functionality to the project.

Now that having a PR ready, this post shares the learnings and findings as well as the implementation techniques with the community, hoping to benefit those who are looking for a similar solution or are simply interested in the object model and data structure in general.

Stash vs Branch

Git object model employs Merkle tree to ensure data integrity, enable efficient storage and retrieval, and facilitate data verification and synchronization. This recursive tree structure, composed of hashes, represents snapshots of arbitrarily large hierarchical data sets – namely, our project's working directory’s revision history.

In Git, both branch and stash function as "refs" (references) pointing to commits. They share the ability to be manipulated via Git commands for navigating commit history. However, their commonality are limited.

Conceptually, branches are references to the terminal points with a commit tree, they are integrated intrinsically into the project's permanent history. In contrast, stashes reside within a separate namespace (refs/stash). Git generates a commit object to encapsulate staged and working directory modifications when you create a stash, but this commit neither integrates into the commit history nor associates with a specific branch. Furthermore, refs/stash entries are local and excluded from git push operations.

A stash is a specialized ref that preserves a work-in-progress (WIP) state as a commit. This allows restoration to that state without altering the branch history. In essence, it is a dangling commit possessing two parents: the commit at HEAD when the stash was created and the index state at that moment. If unstaged changes are included via stash push (the default), a third parent is added to represent a snapshot of the working directory tree at the time of stash creation.

A stash can be seen as a temporary branch that is not meant to be merged into the main branch. It is useful for saving changes that are not ready to be committed or pushed, or for switching between branches without losing work. Unlike branches, stashes are not named and are stored in a stack-like structure, where the most recent stash is at the top. Stashes have their own way to be accessed, including stash list, stash show, and stash apply commands.

While branches and stashes are both implemented as refs, their intended use and impact on the repository are distinct. Implementing git stash does not benefit from the functionality of branch. Instead, direct manipulation of low-level Git objects is required to generate parent commits, trees, and reflogs. This task involves interaction with loose objects, but not packfiles, which are optimized for object transport with efficient storage and indexing.

Before we start to implement our own, let's inspect what regular git client creates for git stash. We can poke into what's inside refs/stash from command line:

$ cat .git/refs/stash

0f47ba5d10ed30779c078d31b3fc47331e4bb98b <-- current stash commit SHA

From the commit SHA, we can find the stash commit object:

$ git cat-file -t 0f47ba5d10ed30779c078d31b3fc47331e4bb98b

commit <-- type of the object

and the commit object has two parents:

$ git cat-file -p 0f47ba5d10ed30779c078d31b3fc47331e4bb98b

tree a0d51b4516951595dc4fd6ecb9beb50da495e210

parent b72fb56d1f53d4120ee6fd206ea1e77ee4071c5f

parent 41d7e412a828f5c6838f6d064881803316afeea3

author modesty <pdf2json@github.com> 1707931505 -0800

committer modesty <pdf2json@github.com> 1707931505 -0800

It points to a tree and has two parents, each represents index and working directory changes, respectively. Our task is to create a same structured commit object for `git stash` in isomorphic-git, it's the core for other stash operations, including apply, drop, pop, list and clear, etc.

Git Objects: Recap

Now that we understand the unique nature of stashes, a concise review of Git object types will illuminate their internal representation. Stashes leverage three of the four fundamental Git object types: commits, trees, and blobs (we will omit tags for this discussion).

Recall the relationship between commits and trees: a tree can be conceptualized as a directory snapshot, embedding both subdirectories (represented as tree objects) and references to any number of file content blobs. Blob objects contain compressed file content, their SHA-1 hashes serving as references within their parent tree. This "children do not know parent" model makes sharing object very efficient, if a file exists in multiple branches (which is common case), only its SHA is referenced, not the file itself.

To examine the internal structure of a stash, we can inspect its associated tree object:

$ git cat-file -p a0d51b4516951595dc4fd6ecb9beb50da495e210 <-- the tree SHA above

100644 blob 1bff7bfa72823e7cd90a6e9304156571e53b0988 .eslintrc.json
100644 blob 3a15fc6e4f3d3b359f9f2900fd6e1d61824fe5c2 .gitignore
040000 tree ddfdb61937565b864aaa210a187890a7dfefdf0d .husky
100644 blob bcea5db4c05c21cbd19ea6a5fae4cb6c71e8dc19 README.md
100644 blob cb00c3f18ee24b4e75abdf9883fd6f284f048ec2 build.properties
100644 blob 0594052072a108a4446678e35e035bb0955ffb23 package-lock.json
100644 blob 3ef98a63860e011e1b573ab36302542d68bc320a package.json
040000 tree 8c4401ac77ef562385f401d90feeca7a9d6ed304 packages
100644 blob 35259cddd2bb0c6f17e40edbaa03f4feba19d142 pom.xml
040000 tree 806243fa28d66c2904f8d4eb0623a0718383f63a scripts
100644 blob 3ac4170137102fd0d14e9e2e345a794f28b21b4f tsconfig.base.json
100644 blob 2e7ef9708ca1cd337a8a57ab101b986a7a2ed8a2 webpack.base.js

The inherent elegance of the Merkle tree model is evident in its efficient and concise representation of filesystem directory contents. Each tree entry encapsulates file system permissions, filename, and the SHA-1 hash of any file or subdirectory it contains.

Below is a simplified diagram illustrating the relationships between these Git object types:

Commit Object
│
├─ Tree Object 1 ─ Blob Object (File content)
│  │
│  ├─ Tree Object 2 ─ Blob Object (File content)
│  │
│  └─ Blob Object (File content)
│
└─ Parent Commit Object (Previous commit)
   │
   └─ Tree Object ─ Blob Object (File content)

Crucially, the Git object model does not explicitly store "diffs" or "logs." When executing commands like git show, git log, or git status, the output is dynamically generated by analyzing and comparing trees and their constituent objects. This fundamental principle is essential to the implementation of git stash: by constructing objects and establishing relationships in accordance with the Git object model, we can seamlessly introduce new stash functionality.

Create a Stash

Implementing git stash within isomorphic-git necessitates direct utilization of low-level APIs. Functions such as writeCommit (distinct from the higher-level commit), writeTree, TREE, WORKDIR, STAGE, and crucially, walk facilitate this process.

Below are the general steps involved in generating a stash, which by default encapsulates tracked changes from both the index (staged) and working directory (unstaged):

Obtain the HEAD of the current branch: This commit serves as the first parent of the stash commit, embodying the repository state at the time of stash creation.
```
// prepare the stash commit: first parent is the current branch HEAD
const headCommit = await GitRefManager.resolve({fs, gitdir, ref: 'HEAD'})
```

Construct a tree representing index changes: If staged alterations exist, generate a commit object rooted in the index tree. This commit is the second parent of the stash commit.

const indexTree = await writeTreeChanges
                  (fs, dir, gitdir, [ TREE({ref: 'HEAD'}),'stage'])
if (indexTree) {
  // this indexTree will be the tree of the next commit object
  // create a commit from the index tree, which has one parent, 
  // the current branch HEAD
  const stashCommitOne = await stashMgr.writeStashCommit({
    message: `stash-Index: WIP on ${branch} - ${new Date().toISOString()}`,
    tree: indexTree, // stashCommitTree
    parent: stashCommitParents,
  })
  stashCommitParents.push(stashCommitOne)
  stashCommitTree = indexTree
}

Construct a tree representing working directory changes: If unstaged modifications to tracked files are present, generate a commit object with the working tree as its root. This commit forms the third parent of the stash commit.

const workingTree = await writeTreeChanges(fs, dir, gitdir, [workDirCompareBase,'workdir'])

if (workingTree) {
  // create a commit from the working directory tree, 
  // which has one parent, either the one we just had, or the headCommit
  const workingHeadCommit = await stashMgr.writeStashCommit({
    message: `stash-WorkDir: WIP on ${branch} - ${new Date().toISOString()}`,
    tree: workingTree,
    parent: [stashCommitParents[stashCommitParents.length - 1]],
  })

  stashCommitParents.push(workingHeadCommit)
  stashCommitTree = workingTree
}

Create the stash commit: With all parent commits prepared, assemble the final stash commit.

// create another commit from the tree, 
// which has three parents: HEAD and the commit we just made:
const stashCommit = await stashMgr.writeStashCommit({
  message: message.trim() || `stash: WIP on ${branch} - ${new Date().toISOString()}`,
  tree: stashCommitTree,
  parent: stashCommitParents,
})

Record the stash in .git/refs/stash: Persist the newly created stash in the Git object store.
```
// next, write this commit into .git/refs/stash:
await stashMgr.writeStashRef(stashCommit)
```

Append a reflog entry: document the stash creation commits within the reflog for historical tracking.

// write the stash commit to the logs
await stashMgr.writeStashReflogEntry({
  stashCommit,
  message: `WIP on ${branch}: ${headCommit.substring(0, 7)} ${headMsg}`,
})

Restore the working directory to a clean state: Revert the working directory to its pre-stash condition.

// finally, go back to a clean working directory

await checkout({fs, dir, gitdir, ref: branch, track: false,
  force: true, // force checkout to discard changes
})

Apply a Stash

While a stash is represented internally as a commit, simply checking out the associated commit tree does not fully replicate the intended stash behavior. Notably, a standard checkout would leave the working directory in a detached HEAD state and neglect to restore modifications to the index (staging area). A tailored approach is necessary for comprehensive stash application.

The core strategy is straightforward: we analyze the stash commit and its parent commit trees, applying any detected changes to the working directory and index as appropriate. Here's the procedural breakdown:

Retrieve parent commits from the stash ref: Access the parent commits associated with the stash commit stored within .git/refs/stash.

// get the stash commit object
const stashCommit = await stashMgr.readStashCommit()
const { parent: stashParents = null } = stashCommit.commit ? stashCommit.commit : {}
if (!stashParents || !Array.isArray(stashParents)) {
  return // no stash found
}

Iterate over parent commits and apply differences: For each parent commit, compare its tree representation with the preceding parent. Apply any identified changes to either the working directory or staging area.

// compare the stash commit tree with its parent commit
for (let i = 0; i < stashParents.length - 1; i++) {
  const applyingCommit = await _readCommit({fs, cache: {}, gitdir, oid: stashParents[i + 1]})
  const wasStaged = applyingCommit.commit.message.startsWith('stash-Index')
  await applyTreeChanges(fs, dir, gitdir, stashParents[i + 1], stashParents[i], wasStaged)
}

That's it!

We can skip other common git stash operations, like pop (it's really apply + drop), drop (remove the one refs from .git/refs/stash and the top entry in reflogs), list (read and format reflogs), and clear (remove both .git/refs/stash and reflogs), nothing interesting about them.

However, notice the invocation of writeTreeChanges when we create either index or work directory trees? plus applyTreeChanges above? These are the essence of stash, also where I spent my most time on, let's take a closer look.

Tree Walker

At the heart of our stash implementation lies a customized tree comparison mechanism. This enables us to generate trees representing the index (staged changes) and working directory changes, essential for encapsulating the stash state. Isomorphic-git's walk API provides the foundation for this functionality.

1. Tree Entry Filtering

We strategically employ the WalkerIterate callback to filter out irrelevant tree entries, ensuring the resulting tree reflects the intended stash contents. The isStage variable helps discern between index and working directory tree processing, while the changedEntries variable captures the detected differences.

// if parent is skipped, skip the children
// API ref: type WalkerIterate = (walk: WalkerIterateCallback, children: 
//          IterableIterator<Array<WalkerEntry>>) => Promise<Array<any>>;: // API 
// API ref: type WalkerIterateCallback = 
//          (entries: Array<WalkerEntry>) => Promise<Array<any>>;

const iterate = async (walk, children) => {
  const filtered = []
  for (const child of children) {
    const [head, stage] = child
    if (isStage) {
      if (stage) {
        // for deleted file in work dir, it also needs to be added on stage
        if (await fs.exists(`${dir}/${stage.toString()}`)) {
          filtered.push(child)
        } else {
          changedEntries.push([null, stage]) // record the change (deletion) 
                                             // while stop the iteration
        }
      }
     } else {
      if (head) {
        // for deleted file in workdir, "stage" (workdir in our case) will be undefined
        if (!stage) {
          changedEntries.push([head, null]) // record the change (deletion) 
                                            // while stop the iteration
        } else {
          filtered.push(child)              // workdir, tracked only
        }
      }
    }
  }

  return filtered.length ? Promise.all(filtered.map(walk)) : []
}

2. Analyzing Changes with Tree Mapping

The WalkerMap callback pinpoints changes at the individual file and subtree level. Like in WalkerIterate, head and stage variables represent the trees being compared. It leverages the changedEntries structure to track modifications. This callback returns a TreeEntry-like object, a crucial building block for tree construction.

// transform WalkerEntry objects into the desired format
// API ref: type WalkerMap = 
//          (filename: string, entries: Array<(WalkerEntry|null)>) => Promise<any>;

const map = async (filepath, [head, stage]) => {
  if (filepath === '.' ||
    (await GitIgnoreManager.isIgnored({ fs, dir, gitdir, filepath }))
  ) {
    return
  }

  if (stage) {
    if ( !head ||
      ((await head.oid()) !== (await stage.oid()) &&
      (await stage.oid()) !== undefined)
    ) {
      changedEntries.push([head, stage])
    }

    return {
      mode: await stage.mode(),
      path: filepath,
      oid: await stage.oid(),
      type: await stage.type(),
    }
  }
}

3. Composing Tree Structures

The WalkerReduce callback facilitates the hierarchical composition of trees. It assembles TreeEntry instances and their children, ultimately determining the final tree structure.

// combine mapped entries with their parent results
// API ref: type WalkerReduce = (parent: any, children: Array<any>) => Promise<any>;
  const reduce = async (parent, children) => {
    children = children.filter(Boolean) // Remove undefined entries
    if (!parent) {
      return children.length > 0 ? children : undefined
    } else {
      parent.children = children
      return parent
    }
  }

4. Validation and Flattening

An additional recursive process verifies that each TreeEntry possesses a valid oid. This step handles children entries for tree-type objects, writing them using the _writeTree function. Blob entries undergo validation through the checkAndWriteBlob function. This stage prepares the entries for seamless integration with the writeTree API.

async function processTreeEntries(fs, dir, gitdir, entries) {
  // make sure each tree entry has valid oid
  async function processTreeEntry(entry) {
    if (entry.type === 'tree') {
      if (!entry.oid) {
        // Process children entries if the current entry is a tree
        const children = await Promise.all(entry.children.map(processTreeEntry))
        // Write the tree with the processed children
        entry.oid = await _writeTree({
          fs,
          gitdir,
          tree: children,
        })
        entry.mode = 0o40000 // directory
      }
    } else if (entry.type === 'blob') {
      entry.oid = await checkAndWriteBlob(
        fs,
        gitdir,
        dir,
        entry.path,
        entry.oid
      )
      entry.mode = 0o100644 // file
    }

    // remove path from entry.path
    entry.path = entry.path.split('/').pop()
    return entry
  }

  return Promise.all(entries.map(processTreeEntry))
}

5. Tree Generation

The walk function, configured with our custom callbacks, orchestrates the tree generation process. The final step involves invoking writeTree to physically create the tree objects that will be associated with stash commits.

const entries = await _walk({
    fs,
    cache: {},
    dir,
    gitdir,
    trees,
    map,
    reduce,
    iterate,
  })

  if (changedEntries.length === 0 || entries.length === 0) {
    return null // no changes found to stash
  }

  const processedEntries = await processTreeEntries(fs, dir, gitdir, entries)

  const treeEntries = processedEntries.filter(Boolean).map(entry => ({
    mode: entry.mode,
    path: entry.path,
    oid: entry.oid,
    type: entry.type,
  }))

  return _writeTree({ fs, gitdir, tree: treeEntries })

That's all to create the tree objects that would associate with stash commits.

Applying a stash necessitates another tree walk operation. However, the iterate and reduce callbacks are not required here. Instead, the map callback directly handles applying the changes. It encapsulates the logic for deletions, additions, and updates to both files (blobs) and directories (trees).

export async function applyTreeChanges(
  fs,
  dir,
  gitdir,
  stashCommit,
  parentCommit,
  wasStaged
) {
  return _walk({
    fs,
    cache: {},
    dir,
    gitdir,
    trees: [TREE({ ref: parentCommit }), TREE({ ref: stashCommit })],
    map: async (filepath, [parent, stash]) => {
      if (
        filepath === '.' ||
        (await GitIgnoreManager.isIgnored({ fs, dir, gitdir, filepath }))
      ) {
        return
      }
      const type = stash ? await stash.type() : await parent.type()
      if (type !== 'tree' && type !== 'blob') {
        return
      }

      const currentFilepath = join(dir, filepath)

      // deleted tree or blob
      if (!stash && parent) {
        if (type === 'tree') {
          await fs.rmdir(currentFilepath)
        } else if (type === 'blob') {
          await fs.rm(currentFilepath)
        }
        return
      }

      const oid = await stash.oid()
      if (!parent || (await parent.oid()) !== oid) {
        // only apply changes if changed from the parent commit or 
        // doesn't exist in the parent commit
        if (type === 'tree') {
          await fs.mkdir(currentFilepath)
        } else {
          const { blob } = await readBlob({ fs, dir, gitdir, oid })
          await fs.write(currentFilepath, blob)
          // await fs.chmod(currentFilepath, await stash.mode())
        }
        if (wasStaged) {
          await add({ fs, dir, gitdir, filepath })
        }
      }
      return {
        path: filepath,
        oid: await stash.oid(),
        type: await stash.type(),
      } // return the applied tree entry
    },
  })
}

Conclusion

Git's stash functionality provides a compelling use case for the Merkle tree, demonstrating its versatility beyond data validation and synchronization into the realm of difference tracking and state preservation. Git's efficient data organization, indexing, querying, and data compression techniques – all facilitated by Merkle trees – underscore the model's power for managing large, complex datasets at scale. Furthermore, the concept of multiple-parent commits ingeniously enables the stash workflow without disrupting core Git operations.

Potential avenues for refinement and enhancement include:

Conflict Resolution: Incorporating conflict detection and resolution mechanisms for stash apply and pop.
Streamlined Implementation: Exploring optimizations, such as combining tree walks for the default stash use case.
Specialized Stash Modes: Supporting advanced scenarios like "stash staged only" and "stash with untracked files" while retaining the flexibility of separate tree walks.

This exploration into the internals of Git's stash implementation has been highly instructive. I anticipate continuing these investigations and refinements beyond the initial pull request.

History

9^th March, 2024: Initial version