dirstate-v2: Document flags/mode/size/mtime fields of tree nodes
author Simon Sapin <simon.sapin@octobus.net>
Mon, 11 Oct 2021 18:23:17 +0200
changeset 48188 77fc340acad7
parent 48187 b669e40fbbd6
child 48189 6e01bcd111d2
This file format modification was previously left incomplete because of planned upcoming changes. Not all of these changes have been made yet, but documenting what exists today will help talking more widely about it.

Differential Revision: https://phab.mercurial-scm.org/D11625
mercurial/helptext/internals/dirstate-v2.txt
mercurial/pure/parsers.py
rust/hg-core/src/dirstate_tree/on_disk.rs
--- a/mercurial/helptext/internals/dirstate-v2.txt	Wed Sep 08 10:47:10 2021 +0200
+++ b/mercurial/helptext/internals/dirstate-v2.txt	Mon Oct 11 18:23:17 2021 +0200
@@ -371,6 +371,114 @@
   (For example, `hg rm` makes a file untracked.)
   This counter is used to implement `has_tracked_dir`.
 
-* Offset 30 and more:
-  **TODO:** docs not written yet
-  as this part of the format might be changing soon.
+* Offset 30:
+  Some boolean values packed as bits of a single byte.
+  Starting from least-significant, bit masks are::
+
+    WDIR_TRACKED = 1 << 0
+    P1_TRACKED = 1 << 1
+    P2_INFO = 1 << 2
+    HAS_MODE_AND_SIZE = 1 << 3
+    HAS_MTIME = 1 << 4
+
+  Other bits are unset. The meanings of these bits are:
+
+  `WDIR_TRACKED`
+      Set if the working directory contains a tracked file at this node’s path.
+      This is typically set and unset by `hg add` and `hg rm`.
+
+  `P1_TRACKED`
+      Set if the working directory’s first parent changeset
+      (whose node identifier is found in tree metadata)
+      contains a tracked file at this node’s path.
+      This is a cache to reduce manifest lookups.
+
+  `P2_INFO`
+      Set if the file has been involved in some merge operation.
+      Either because it was actually merged,
+      or because the version in the second parent (p2) was ahead,
+      or because some rename moved it there.
+      In any of these cases `hg status` will want it displayed as modified.
+
+  Files that would be mentioned at all in the `dirstate-v1` file format
+  have a node with at least one of the above three bits set in `dirstate-v2`.
+  Let’s call these files "tracked anywhere",
+  and "untracked" the nodes with all three of these bits unset.
+  Untracked nodes are typically for directories:
+  they hold child nodes and form the tree structure.
+  Implementations should strive to clean up nodes
+  that are entirely unused,
+  but other untracked nodes may also exist.
+  For example, a future version of Mercurial might in some cases
+  add nodes for untracked and/or ignored files in the working directory
+  in order to optimize `hg status`
+  by enabling it to skip `readdir` in more cases.
+
+  When a node is for a file tracked anywhere,
+  the rest of the node data is three fields:
+
+  * Offset 31:
+    If `HAS_MODE_AND_SIZE` is unset, four zero bytes.
+    Otherwise, a 32-bit integer for the Unix mode (as in `stat_result.st_mode`)
+    expected for this file to be considered clean.
+    Only the `S_IXUSR` bit (owner has execute permission) is considered.
+
+  * Offset 35:
+    If `HAS_MTIME` is unset, four zero bytes.
+    Otherwise, a 32-bit integer for the expected modification time of the file
+    (as in `stat_result.st_mtime`),
+    truncated to its 31 least-significant bits.
+    Unlike in dirstate-v1, negative values are not used.
+
+  * Offset 39:
+    If `HAS_MODE_AND_SIZE` is unset, four zero bytes.
+    Otherwise, a 32-bit integer for the expected size of the file,
+    truncated to its 31 least-significant bits.
+    Unlike in dirstate-v1, negative values are not used.
+
+  If an untracked node has `HAS_MTIME` *unset*, this space is unused:
+
+  * Offset 31:
+    12 bytes set to zero
+
+  If an untracked node has `HAS_MTIME` *set*,
+  what follows is the modification time of a directory
+  represented with separate seconds and sub-second components
+  since the Unix epoch:
+
+  * Offset 31:
+    The number of seconds as a signed (two’s complement) 64-bit integer.
+
+  * Offset 39:
+    The number of nanoseconds as a 32-bit integer.
+    Always greater than or equal to zero, and strictly less than a billion.
+    Increasing this component makes the modification time
+    go forward or backward in time depending
+    on the sign of the integral seconds component.
+    (Note: this is buggy because there is no negative zero integer,
+    but will be changed soon.)
+
+  The presence of a directory modification time means that at some point,
+  this path in the working directory was observed:
+
+  - To be a directory
+  - With the given modification time
+  - That time was already strictly in the past when observed,
+    meaning that later changes cannot happen in the same clock tick
+    and must cause a different modification time
+    (unless the system clock jumps back and we get unlucky,
+    which is not impossible but deemed unlikely enough).
+  - All direct children of this directory
+    (as returned by `std::fs::read_dir`)
+    either have a corresponding dirstate node,
+    or are ignored by ignore patterns whose hash is in tree metadata.
+
+  This means that if `std::fs::symlink_metadata` later reports
+  the same modification time
+  and ignore patterns haven’t changed,
+  a run of status that is not listing ignored files
+  can skip calling `std::fs::read_dir` again for this directory,
+  and iterate child dirstate nodes instead.
+
+
+* (Offset 43: end of this node)
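
As a reading aid for the documentation hunk above, here is a minimal Python sketch of how the flags byte at offset 30 and the 12 bytes that follow could be decoded. It is not Mercurial's implementation: the function name `decode_node_tail` and the returned dictionaries are hypothetical, and big-endian byte order with unsigned 32-bit fields is an assumption, since this hunk does not state either.

```python
import struct

# Bit masks of the flags byte at offset 30, as listed in the documentation.
WDIR_TRACKED = 1 << 0
P1_TRACKED = 1 << 1
P2_INFO = 1 << 2
HAS_MODE_AND_SIZE = 1 << 3
HAS_MTIME = 1 << 4

# A node is for a file "tracked anywhere" if any of these three bits is set.
TRACKED_ANYWHERE = WDIR_TRACKED | P1_TRACKED | P2_INFO


def decode_node_tail(tail):
    """Decode the 13 bytes at offsets 30..43 of a node (hypothetical helper).

    Assumption: integers are stored big-endian.
    """
    if len(tail) != 13:
        raise ValueError("expected 13 bytes: 1 flags byte + 12 data bytes")
    flags, data = tail[0], tail[1:]

    if flags & TRACKED_ANYWHERE:
        # Offsets 31, 35, 39: mode, mtime, size, each a 32-bit integer.
        mode, mtime, size = struct.unpack(">III", data)
        has_mode_and_size = bool(flags & HAS_MODE_AND_SIZE)
        return {
            "kind": "file",
            "mode": mode if has_mode_and_size else None,
            "mtime": mtime if flags & HAS_MTIME else None,
            "size": size if has_mode_and_size else None,
        }

    if flags & HAS_MTIME:
        # Offset 31: signed 64-bit seconds; offset 39: 32-bit nanoseconds.
        seconds, nanoseconds = struct.unpack(">qI", data)
        return {
            "kind": "directory",
            "seconds": seconds,
            "nanoseconds": nanoseconds,
        }

    # Untracked node without HAS_MTIME: the 12 data bytes are unused (zero).
    return {"kind": "unused"}
```

The same 12 bytes are read with two different layouts depending on the three "tracked anywhere" bits, which is why the sketch checks those bits before looking at `HAS_MTIME`.
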
--- a/mercurial/pure/parsers.py	Wed Sep 08 10:47:10 2021 +0200
+++ b/mercurial/pure/parsers.py	Mon Oct 11 18:23:17 2021 +0200
@@ -55,7 +55,7 @@
     - p1_tracked: is the file tracked in working copy first parent
     - p2_info: the file has been involved in some merge operation. Either
                because it was actually merged, or because the p2 version was
-               ahead, or because some renamed moved it there. In either case
+               ahead, or because some rename moved it there. In either case
                `hg status` will want it displayed as modified.
 
     # about the file state expected from p1 manifest:
--- a/rust/hg-core/src/dirstate_tree/on_disk.rs	Wed Sep 08 10:47:10 2021 +0200
+++ b/rust/hg-core/src/dirstate_tree/on_disk.rs	Mon Oct 11 18:23:17 2021 +0200
@@ -64,44 +64,24 @@
     uuid: &'on_disk [u8],
 }
 
+/// Fields are documented in the *Tree metadata in the docket file*
+/// section of `mercurial/helptext/internals/dirstate-v2.txt`
 #[derive(BytesCast)]
 #[repr(C)]
 struct TreeMetadata {
     root_nodes: ChildNodes,
     nodes_with_entry_count: Size,
     nodes_with_copy_source_count: Size,
-
-    /// How many bytes of this data file are not used anymore
     unreachable_bytes: Size,
-
-    /// Current version always sets these bytes to zero when creating or
-    /// updating a dirstate. Future versions could assign some bits to signal
-    /// for example "the version that last wrote/updated this dirstate did so
-    /// in such and such way that can be relied on by versions that know to."
     unused: [u8; 4],
 
-    /// If non-zero, a hash of ignore files that were used for some previous
-    /// run of the `status` algorithm.
-    ///
-    /// We define:
-    ///
-    /// * "Root" ignore files are `.hgignore` at the root of the repository if
-    ///   it exists, and files from `ui.ignore.*` config. This set of files is
-    ///   then sorted by the string representation of their path.
-    /// * The "expanded contents" of an ignore files is the byte string made
-    ///   by concatenating its contents with the "expanded contents" of other
-    ///   files included with `include:` or `subinclude:` files, in inclusion
-    ///   order. This definition is recursive, as included files can
-    ///   themselves include more files.
-    ///
-    /// This hash is defined as the SHA-1 of the concatenation (in sorted
-    /// order) of the "expanded contents" of each "root" ignore file.
-    /// (Note that computing this does not require actually concatenating byte
-    /// strings into contiguous memory, instead SHA-1 hashing can be done
-    /// incrementally.)
+    /// See *Optional hash of ignore patterns* section of
+    /// `mercurial/helptext/internals/dirstate-v2.txt`
     ignore_patterns_hash: IgnorePatternsHash,
 }
 
+/// Fields are documented in the *The data file format*
+/// section of `mercurial/helptext/internals/dirstate-v2.txt`
 #[derive(BytesCast)]
 #[repr(C)]
 pub(super) struct Node {
@@ -114,45 +94,6 @@
     children: ChildNodes,
     pub(super) descendants_with_entry_count: Size,
     pub(super) tracked_descendants_count: Size,
-
-    /// Depending on the bits in `flags`:
-    ///
-    /// * If any of `WDIR_TRACKED`, `P1_TRACKED`, or `P2_INFO` are set, the
-    ///   node has an entry.
-    ///
-    ///   - If `HAS_MODE_AND_SIZE` is set, `data.mode` and `data.size` are
-    ///     meaningful. Otherwise they are set to zero
-    ///   - If `HAS_MTIME` is set, `data.mtime` is meaningful. Otherwise it is
-    ///     set to zero.
-    ///
-    /// * If none of `WDIR_TRACKED`, `P1_TRACKED`, `P2_INFO`, or `HAS_MTIME`
-    ///   are set, the node does not have an entry and `data` is set to all
-    ///   zeros.
-    ///
-    /// * If none of `WDIR_TRACKED`, `P1_TRACKED`, `P2_INFO` are set, but
-    ///   `HAS_MTIME` is set, the bytes of `data` should instead be
-    ///   interpreted as the `Timestamp` for the mtime of a cached directory.
-    ///
-    ///   The presence of this combination of flags means that at some point,
-    ///   this path in the working directory was observed:
-    ///
-    ///   - To be a directory
-    ///   - With the modification time as given by `Timestamp`
-    ///   - That timestamp was already strictly in the past when observed,
-    ///     meaning that later changes cannot happen in the same clock tick
-    ///     and must cause a different modification time (unless the system
-    ///     clock jumps back and we get unlucky, which is not impossible but
-    ///     but deemed unlikely enough).
-    ///   - All direct children of this directory (as returned by
-    ///     `std::fs::read_dir`) either have a corresponding dirstate node, or
-    ///     are ignored by ignore patterns whose hash is in
-    ///     `TreeMetadata::ignore_patterns_hash`.
-    ///
-    ///   This means that if `std::fs::symlink_metadata` later reports the
-    ///   same modification time and ignored patterns haven’t changed, a run
-    ///   of status that is not listing ignored   files can skip calling
-    ///   `std::fs::read_dir` again for this directory,   iterate child
-    ///   dirstate nodes instead.
     flags: Flags,
     data: Entry,
 }
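
The directory modification-time cache described in the doc comment removed above (and now documented in dirstate-v2.txt) comes down to a small check at status time. The following Python sketch illustrates that check under stated assumptions: `can_skip_read_dir` and all of its parameters are hypothetical names, not part of Mercurial, and the sketch covers only the comparison, not how the values are read from disk or from the filesystem.

```python
def can_skip_read_dir(
    cached_mtime,
    observed_mtime,
    cached_ignore_hash,
    current_ignore_hash,
    listing_ignored,
):
    """Return True if status may iterate child dirstate nodes instead of
    listing the directory (hypothetical helper, not Mercurial's code).

    `cached_mtime` is the (seconds, nanoseconds) pair stored in the node,
    or None if the node has no cached directory mtime. `observed_mtime`
    is the pair reported by the filesystem now.
    """
    if listing_ignored:
        # Ignored children may lack a dirstate node, so they can only be
        # found by listing the directory.
        return False
    if cached_mtime is None:
        # Nothing was cached for this directory.
        return False
    if cached_ignore_hash != current_ignore_hash:
        # Ignore patterns changed, so the set of ignored children may differ.
        return False
    # Same modification time: the directory listing is assumed unchanged.
    return cached_mtime == observed_mtime
```

Any failed condition falls back to listing the directory, so the cache in this sketch can only save work, never hide a change that the conditions above would detect.
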