dirstate-v2: Add internal documentation
authorSimon Sapin <simon.sapin@octobus.net>
Fri, 01 Oct 2021 12:17:09 +0200
changeset 48166 e8a576de703f
parent 48165 d467e44f71d7
child 48167 1b0f8aafedea
dirstate-v2: Add internal documentation It can be viewed by running `hg help internals.dirstate-v2` Since that command rewraps paragraphs, the source text is written with semantic line breaks: https://sembr.org/ Differential Revision: https://phab.mercurial-scm.org/D11546
mercurial/help.py
mercurial/helptext/internals/dirstate-v2.txt
rust/hg-core/src/dirstate_tree/on_disk.rs
tests/test-help.t
--- a/mercurial/help.py	Fri Oct 01 12:27:17 2021 +0200
+++ b/mercurial/help.py	Fri Oct 01 12:17:09 2021 +0200
@@ -365,6 +365,11 @@
             loaddoc(b'config', subdir=b'internals'),
         ),
         (
+            [b'dirstate-v2'],
+            _(b'dirstate-v2 file format'),
+            loaddoc(b'dirstate-v2', subdir=b'internals'),
+        ),
+        (
             [b'extensions', b'extension'],
             _(b'Extension API'),
             loaddoc(b'extensions', subdir=b'internals'),
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mercurial/helptext/internals/dirstate-v2.txt	Fri Oct 01 12:17:09 2021 +0200
@@ -0,0 +1,375 @@
+The *dirstate* is what Mercurial uses internally to track
+the state of files in the working directory,
+such as set by commands like `hg add` and `hg rm`.
+It also contains some cached data that help make `hg status` faster.
+The name refers both to `.hg/dirstate` on the filesystem
+and the corresponding data structure in memory while a Mercurial process
+is running.
+
+The original file format, retroactively dubbed `dirstate-v1`,
+is described at https://www.mercurial-scm.org/wiki/DirState.
+It is made of a flat sequence of unordered variable-size entries,
+so accessing any information in it requires parsing all of it.
+Similarly, saving changes requires rewriting the entire file.
+
+The newer `dirsate-v2` file format is designed to fix these limitations
+and make `hg status` faster.
+
+User guide
+==========
+
+Compatibility
+-------------
+
+The file format is experimental and may still change.
+Different versions of Mercurial may not be compatible with each other
+when working on a local repository that uses this format.
+When using an incompatible version with the experimental format,
+anything can happen including data corruption.
+
+Since the dirstate is entirely local and not relevant to the wire protocol,
+`dirstate-v2` does not affect compatibility with remote Mercurial versions.
+
+When `share-safe` is enabled, different repositories sharing the same store
+can use different dirstate formats.
+
+Enabling `dirsate-v2` for new local repositories
+------------------------------------------------
+
+When creating a new local repository such as with `hg init` or `hg clone`,
+the `exp-dirstate-v2` boolean in the `format` configuration section
+controls whether to use this file format.
+This is disabled by default as of this writing.
+To enable it for a single repository, run for example::
+
+    $ hg init my-project --config format.exp-dirstate-v2=1
+
+Checking the format of an existing local repsitory
+--------------------------------------------------
+
+The `debugformat` commands prints information about
+which of multiple optional formats are used in the current repository,
+including `dirstate-v2`::
+
+    $ hg debugformat
+    format-variant     repo
+    fncache:            yes
+    dirstate-v2:        yes
+    […]
+
+Upgrading or downgrading an existing local repository
+-----------------------------------------------------
+
+The `debugupgrade` command does various upgrades or downgrades
+on a local repository
+based on the current Mercurial version and on configuration.
+The same `format.exp-dirstate-v2` configuration is used again.
+
+Example to upgrade::
+
+    $ hg debugupgrade --config format.exp-dirstate-v2=1
+
+Example to downgrade to `dirstate-v1`::
+
+    $ hg debugupgrade --config format.exp-dirstate-v2=0
+
+Both of this commands do nothing but print a list of proposed changes,
+which may include changes unrelated to the dirstate.
+Those other changes are controlled by their own configuration keys.
+Add `--run` to a command to actually apply the proposed changes.
+
+Backups of `.hg/requires` and `.hg/dirstate` are created
+in a `.hg/upgradebackup.*` directory.
+If something goes wrong, restoring those files should undo the change.
+
+Note that upgrading affects compatibility with older versions of Mercurial
+as noted above.
+This can be relevant when a repository’s files are on a USB drive
+or some other removable media, or shared over the network, etc.
+
+Internal filesystem representation
+==================================
+
+Requirements file
+-----------------
+
+The `.hg/requires` file indicates which of various optional file formats
+are used by a given repository.
+Mercurial aborts when seeing a requirement it does not know about,
+which avoids older version accidentally messing up a respository
+that uses a format that was introduced later.
+For versions that do support a format, the presence or absence of
+the corresponding requirement indicates whether to use that format.
+
+When the file contains a `exp-dirstate-v2` line,
+the `dirstate-v2` format is used.
+With no such line `dirstate-v1` is used.
+
+High level description
+----------------------
+
+Whereas `dirstate-v1` uses a single `.hg/disrtate` file,
+in `dirstate-v2` that file is a "docket" file
+that only contains some metadata
+and points to separate data file named `.hg/dirstate.{ID}`,
+where `{ID}` is a random identifier.
+
+This separation allows making data files append-only
+and therefore safer to memory-map.
+Creating a new data file (occasionally to clean up unused data)
+can be done with a different ID
+without disrupting another Mercurial process
+that could still be using the previous data file.
+
+Both files have a format designed to reduce the need for parsing,
+by using fixed-size binary components as much as possible.
+For data that is not fixed-size,
+references to other parts of a file can be made by storing "pseudo-pointers":
+integers counted in bytes from the start of a file.
+For read-only access no data structure is needed,
+only a bytes buffer (possibly memory-mapped directly from the filesystem)
+with specific parts read on demand.
+
+The data file contains "nodes" organized in a tree.
+Each node represents a file or directory inside the working directory
+or its parent changeset.
+This tree has the same structure as the filesystem,
+so a node representing a directory has child nodes representing
+the files and subdirectories contained directly in that directory.
+
+The docket file format
+----------------------
+
+This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
+and `mercurial/dirstateutils/docket.py`.
+
+Components of the docket file are found at fixed offsets,
+counted in bytes from the start of the file:
+
+* Offset 0:
+  The 12-bytes marker string "dirstate-v2\n" ending with a newline character.
+  This makes it easier to tell a dirstate-v2 file from a dirstate-v1 file,
+  although it is not strictly necessary
+  since `.hg/requires` determines which format to use.
+
+* Offset 12:
+  The changeset node ID on the first parent of the working directory,
+  as up to 32 binary bytes.
+  If a node ID is shorter (20 bytes for SHA-1),
+  it is start-aligned and the rest of the bytes are set to zero.
+
+* Offset 44:
+  The changeset node ID on the second parent of the working directory,
+  or all zeros if there isn’t one.
+  Also 32 binary bytes.
+
+* Offset 76:
+  Tree metadata on 44 bytes, described below.
+  Its separation in this documentation from the rest of the docket
+  reflects a detail of the current implementation.
+  Since tree metadata is also made of fields at fixed offsets, those could
+  be inlined here by adding 76 bytes to each offset.
+
+* Offset 120:
+  The used size of the data file, as a 32-bit big-endian integer.
+  The actual size of the data file may be larger
+  (if another Mercurial processis in appending to it
+  but has not updated the docket yet).
+  That extra data must be ignored.
+
+* Offset 124:
+  The length of the data file identifier, as a 8-bit integer.
+
+* Offset 125:
+  The data file identifier.
+
+* Any additional data is current ignored, and dropped when updating the file.
+
+Tree metadata in the docket file
+--------------------------------
+
+Tree metadata is similarly made of components at fixed offsets.
+These offsets are counted in bytes from the start of tree metadata,
+which is 76 bytes after the start of the docket file.
+
+This metadata can be thought of as the singular root of the tree
+formed by nodes in the data file.
+
+* Offset 0:
+  Pseudo-pointer to the start of root nodes,
+  counted in bytes from the start of the data file,
+  as a 32-bit big-endian integer.
+  These nodes describe files and directories found directly
+  at the root of the working directory.
+
+* Offset 4:
+  Number of root nodes, as a 32-bit big-endian integer.
+
+* Offset 8:
+  Total number of nodes in the entire tree that "have a dirstate entry",
+  as a 32-bit big-endian integer.
+  Those nodes represent files that would be present at all in `dirstate-v1`.
+  This is typically less than the total number of nodes.
+  This counter is used to implement `len(dirstatemap)`.
+
+* Offset 12:
+  Number of nodes in the entire tree that have a copy source,
+  as a 32-bit big-endian integer.
+  At the next commit, these files are recorded
+  as having been copied or moved/renamed from that source.
+  (A move is recorded as a copy and separate removal of the source.)
+  This counter is used to implement `len(dirstatemap.copymap)`.
+
+* Offset 16:
+  An estimation of how many bytes of the data file
+  (within its used size) are unused, as a 32-bit big-endian integer.
+  When appending to an existing data file,
+  some existing nodes or paths can be unreachable from the new root
+  but they still take up space.
+  This counter is used to decide when to write a new data file from scratch
+  instead of appending to an existing one,
+  in order to get rid of that unreachable data
+  and avoid unbounded file size growth.
+
+* Offset 20:
+  These four bytes are currently ignored
+  and reset to zero when updating a docket file.
+  This is an attempt at forward compatibility:
+  future Mercurial versions could use this as a bit field
+  to indicate that a dirstate has additional data or constraints.
+  Finding a dirstate file with the relevant bit unset indicates that
+  it was written by a then-older version
+  which is not aware of that future change.
+
+* Offset 24:
+  Either 20 zero bytes, or a SHA-1 hash as 20 binary bytes.
+  When present, the hash is of ignore patterns
+  that were used for some previous run of the `status` algorithm.
+
+* (Offset 44: end of tree metadata)
+
+Optional hash of ignore patterns
+--------------------------------
+
+The implementation of `status` at `rust/hg-core/src/dirstate_tree/status.rs`
+has been optimized such that its run time is dominated by calls
+to `stat` for reading the filesystem metadata of a file or directory,
+and to `readdir` for listing the contents of a directory.
+In some cases the algorithm can skip calls to `readdir`
+(saving significant time)
+because the dirstate already contains enough of the relevant information
+to build the correct `status` results.
+
+The default configuration of `hg status` is to list unknown files
+but not ignored files.
+In this case, it matters for the `readdir`-skipping optimization
+if a given file used to be ignored but became unknown
+because `.hgignore` changed.
+To detect the possibility of such a change,
+the tree metadata contains an optional hash of all ignore patterns.
+
+We define:
+
+* "Root" ignore files as:
+
+  - `.hgignore` at the root of the repository if it exists
+  - And all files from `ui.ignore.*` config.
+
+  This set of files is sorted by the string representation of their path.
+
+* The "expanded contents" of an ignore files is the byte string made
+  by the concatenation of its contents followed by the "expanded contents"
+  of other files included with `include:` or `subinclude:` directives,
+  in inclusion order. This definition is recursive, as included files can
+  themselves include more files.
+
+This hash is defined as the SHA-1 of the concatenation (in sorted
+order) of the "expanded contents" of each "root" ignore file.
+(Note that computing this does not require actually concatenating byte ranges into
+contiguous memory.
+Instead a SHA-1 hasher object can be created and fed separate byte ranges one by
+one.)
+
+The data file format
+--------------------
+
+This is implemented in `rust/hg-core/src/dirstate_tree/on_disk.rs`
+and `mercurial/dirstateutils/v2.py`.
+
+The data file contains two types of data: paths and nodes.
+
+Paths and nodes can be organized in any order in the file, except that sibling
+nodes must be next to each other and sorted by their path. Contiguity lets
+the parent refer to them all by their count with a single pseudo-pointer,
+instead of storing one pseudo-pointer per child node. Sorting allows using
+binary seach to find a child node with a given name in `O(log(n))` byte ranges
+comparisons.
+
+The current implemention writes paths and child node before a given node
+for ease of figuring out the value of pseudo-pointers by the time the are to be
+written, but this is not an obligation and readers must not rely on it.
+
+A path is stored as a byte string anywhere in the file, without delimiter.
+It is refered to by one or more node by a pseudo-pointer to its start, and its
+length in bytes. Since there is no delimiter,
+when a path is a substring of another the same bytes could be reused,
+although the implementation does not exploit this as of this writing.
+
+A node is stored on 43 bytes with components at fixed offsets. Paths and
+child nodes relevant to a node are stored externally and referenced though
+pseudo-pointers.
+
+All integers are stored in big-endian. All pseudo-pointers are 32-bit integers
+counting bytes from the start of the data file. Path lengths and positions
+are 16-bit integers, also counted in bytes.
+
+Node components are:
+
+* Offset 0:
+  Pseudo-pointer to the full path of this node,
+  from the working directory root.
+
+* Offset 4:
+  Length of the full path.
+
+* Offset 6:
+  Position of the last `/` path separator within the full path,
+  in bytes from the start of the full path,
+  or zero if there isn’t one.
+  The part of the full path after this position is the "base name".
+  Since sibling nodes have the same parent, only their base name vary
+  and needs to be considered when doing binary search to find a given path.
+
+* Offset 8:
+  Pseudo-pointer to the "copy source" path for this node,
+  or zero if there is no copy source.
+
+* Offset 12:
+  Length of the copy source path, or zero if there isn’t one.
+
+* Offset 14:
+  Pseudo-pointer to the start of child nodes.
+
+* Offset 18:
+  Number of child nodes, as a 32-bit integer.
+  They occupy 43 times this number of bytes
+  (not counting space for paths, and further descendants).
+
+* Offset 22:
+  Number as a 32-bit integer of descendant nodes in this subtree,
+  not including this node itself,
+  that "have a dirstate entry".
+  Those nodes represent files that would be present at all in `dirstate-v1`.
+  This is typically less than the total number of descendants.
+  This counter is used to implement `has_dir`.
+
+* Offset 26:
+  Number as a 32-bit integer of descendant nodes in this subtree,
+  not including this node itself,
+  that represent files tracked in the working directory.
+  (For example, `hg rm` makes a file untracked.)
+  This counter is used to implement `has_tracked_dir`.
+
+* Offset 30 and more:
+  **TODO:** docs not written yet
+  as this part of the format might be changing soon.
--- a/rust/hg-core/src/dirstate_tree/on_disk.rs	Fri Oct 01 12:27:17 2021 +0200
+++ b/rust/hg-core/src/dirstate_tree/on_disk.rs	Fri Oct 01 12:17:09 2021 +0200
@@ -1,22 +1,6 @@
 //! The "version 2" disk representation of the dirstate
 //!
-//! # File format
-//!
-//! In dirstate-v2 format, the `.hg/dirstate` file is a "docket that starts
-//! with a fixed-sized header whose layout is defined by the `DocketHeader`
-//! struct, followed by the data file identifier.
-//!
-//! A separate `.hg/dirstate.{uuid}.d` file contains most of the data. That
-//! file may be longer than the size given in the docket, but not shorter. Only
-//! the start of the data file up to the given size is considered. The
-//! fixed-size "root" of the dirstate tree whose layout is defined by the
-//! `Root` struct is found at the end of that slice of data.
-//!
-//! Its `root_nodes` field contains the slice (offset and length) to
-//! the nodes representing the files and directories at the root of the
-//! repository. Each node is also fixed-size, defined by the `Node` struct.
-//! Nodes in turn contain slices to variable-size paths, and to their own child
-//! nodes (if any) for nested files and directories.
+//! See `mercurial/helptext/internals/dirstate-v2.txt`
 
 use crate::dirstate_tree::dirstate_map::{self, DirstateMap, NodeRef};
 use crate::dirstate_tree::path_with_basename::WithBasename;
--- a/tests/test-help.t	Fri Oct 01 12:27:17 2021 +0200
+++ b/tests/test-help.t	Fri Oct 01 12:17:09 2021 +0200
@@ -1121,6 +1121,7 @@
        censor        Censor
        changegroups  Changegroups
        config        Config Registrar
+       dirstate-v2   dirstate-v2 file format
        extensions    Extension API
        mergestate    Mergestate
        requirements  Repository Requirements
@@ -3566,6 +3567,13 @@
   Config Registrar
   </td></tr>
   <tr><td>
+  <a href="/help/internals.dirstate-v2">
+  dirstate-v2
+  </a>
+  </td><td>
+  dirstate-v2 file format
+  </td></tr>
+  <tr><td>
   <a href="/help/internals.extensions">
   extensions
   </a>