internals: document bundle2 format
author Gregory Szorc <gregory.szorc@gmail.com>
Sat, 17 Feb 2018 11:19:52 -0700
changeset 36451 1fa35ca345a5
parent 36450 d478c8cd89d1
child 36452 ab81e5a8fba5
internals: document bundle2 format

It seems like a good idea to have thorough documentation of the bundle2 data format, including the format of each part and the capabilities.

The added documentation is far from complete. For example, we don't fully capture the semantics of each capability and part. But a start is better than nothing, which was pretty much where we were before.

Differential Revision: https://phab.mercurial-scm.org/D2298
contrib/wix/help.wxs
mercurial/help.py
mercurial/help/internals/bundle2.txt
mercurial/help/internals/bundles.txt
tests/test-help.t
--- a/contrib/wix/help.wxs	Mon Feb 26 23:54:40 2018 +0530
+++ b/contrib/wix/help.wxs	Sat Feb 17 11:19:52 2018 -0700
@@ -40,6 +40,7 @@
 
         <Directory Id="help.internaldir" Name="internals">
           <Component Id="help.internals" Guid="$(var.help.internals.guid)" Win64='$(var.IsX64)'>
+            <File Id="internals.bundle2.txt"      Name="bundle2.txt" />
             <File Id="internals.bundles.txt"      Name="bundles.txt" KeyPath="yes" />
             <File Id="internals.censor.txt"       Name="censor.txt" />
             <File Id="internals.changegroups.txt" Name="changegroups.txt" />
--- a/mercurial/help.py	Mon Feb 26 23:54:40 2018 +0530
+++ b/mercurial/help.py	Sat Feb 17 11:19:52 2018 -0700
@@ -197,6 +197,8 @@
     return loader
 
 internalstable = sorted([
+    (['bundle2'], _('Bundle2'),
+     loaddoc('bundle2', subdir='internals')),
     (['bundles'], _('Bundles'),
      loaddoc('bundles', subdir='internals')),
     (['censor'], _('Censor'),
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/mercurial/help/internals/bundle2.txt	Sat Feb 17 11:19:52 2018 -0700
@@ -0,0 +1,677 @@
+Bundle2 refers to a data format that is used for both on-disk storage
+and over-the-wire transfer of repository data and state.
+
+The data format allows the capture of multiple components of
+repository data. Contrast with the initial bundle format, which
+only captured *changegroup* data (and couldn't store bookmarks,
+phases, etc).
+
+Bundle2 is used for:
+
+* Transferring data from a repository (e.g. as part of an ``hg clone``
+  or ``hg pull`` operation).
+* Transferring data to a repository (e.g. as part of an ``hg push``
+  operation).
+* Storing data on disk (e.g. the result of an ``hg bundle``
+  operation).
+* Transferring the results of a repository operation (e.g. the
+  reply to an ``hg push`` operation).
+
+At its highest level, a bundle2 payload is a stream that begins
+with some metadata and consists of a series of *parts*, with each
+part describing repository data or state or the result of an
+operation. New bundle2 parts are introduced over time when there is
+a need to capture a new form of data. A *capabilities* mechanism
+exists to allow peers to understand which bundle2 parts the other
+understands.
+
+Stream Format
+=============
+
+A bundle2 payload consists of a magic string (``HG20``) followed by
+stream level parameters, followed by any number of payload *parts*.
+
+It may help to think of the stream level parameters as *headers* and the
+payload parts as the *body*.
+
+Stream Level Parameters
+-----------------------
+
+Following the magic string is data that defines parameters applicable to the
+entire payload.
+
+Stream level parameters begin with a 32-bit unsigned big-endian integer.
+The value of this integer defines the number of bytes of stream level
+parameters that follow.
+
+The *N* bytes of raw data contain a space separated list of parameters.
+Each parameter consists of a required name and an optional value.
+
+Parameters have the form ``<name>`` or ``<name>=<value>``.
+
+Both the parameter name and value are URL quoted.
+
+Names MUST start with a letter. If the first letter is lower case, the
+parameter is advisory and can safely be ignored. If the first letter
+is upper case, the parameter is mandatory and the handler MUST stop if
+it is unable to process it.
+
+Stream level parameters apply to the entire bundle2 payload. Lower-level
+options should go into a bundle2 part instead.
+
+The following stream level parameters are defined:
+
+compression
+   Compression format of payload data. ``GZ`` denotes zlib. ``BZ``
+   denotes bzip2. ``ZS`` denotes zstandard.
+
+   When defined, all bytes after the stream level parameters are
+   compressed using the compression format defined by this parameter.
+
+   If this parameter isn't present, data is raw/uncompressed.
+
+   This parameter MUST be mandatory (its name starts with an upper case
+   letter) because attempting to consume streams without knowing how to
+   decode the underlying bytes will result in errors.
+
+Payload Part
+------------
+
+Following the stream level parameters are 0 or more payload parts. Each
+payload part consists of a header and a body.
+
+The payload part header is preceded by a 32-bit unsigned big-endian integer
+defining the number of bytes in the header that follows. The special
+value ``0`` indicates the end of the bundle2 stream.
+
+The binary format of the part header is as follows:
+
+* 8-bit unsigned size of the part name
+* N-bytes alphanumeric part name
+* 32-bit unsigned big-endian part ID
+* N bytes part parameter data
+
+The *part name* identifies the type of the part. A part name with an
+UPPERCASE letter is mandatory. Otherwise, the part is advisory. A
+consumer should abort if it encounters a mandatory part it doesn't know
+how to process. See the sections below for each defined part type.
+
+The *part ID* is an identifier used to refer to a specific part. It
+should be unique within the bundle2 payload.
+
+Part parameter data consists of:
+
+* 1 byte number of mandatory parameters
+* 1 byte number of advisory parameters
+* 2 * N bytes of sizes of parameter keys and values
+* N * M blobs of values for parameter keys and values
+
+Following the 2 bytes of mandatory and advisory parameter counts are
+2-tuples of bytes of the sizes of each parameter. e.g.
+(<key size>, <value size>).
+
+Following that are the raw values, without padding. Mandatory parameters
+come first, followed by advisory parameters.
+
+Each parameter's key MUST be unique within the part.
+
+Following the part parameter data is the part payload. The part payload
+consists of a series of framed chunks. The frame header is a 32-bit signed
+big-endian integer defining the size of the chunk. The N bytes of raw
+payload data follow.
+
+The part payload consists of 0 or more chunks.
+
+A chunk with size ``0`` denotes the end of the part payload. Therefore,
+there will always be at least 1 32-bit integer following the payload
+part header.
+
+A chunk size of ``-1`` is used to signal an *interrupt*. If such a chunk
+size is seen, the stream processor should process the next bytes as a new
+payload part. After this payload part, processing of the original,
+interrupted part should resume.
+
+Capabilities
+============
+
+Bundle2 is a dynamic format that can evolve over time. For example,
+when a new repository data concept is invented, a new bundle2 part
+is typically invented to hold that data. In addition, parts performing
+similar functionality may come into existence if there is a better
+mechanism for performing certain functionality.
+
+Because the bundle2 format evolves over time, peers need to understand
+what bundle2 features the other can understand. The *capabilities*
+mechanism is how those features are expressed.
+
+Bundle2 capabilities are logically expressed as a dictionary
+where the keys are strings and the values are lists of
+strings.
+
+Capabilities are encoded for exchange between peers. The encoded
+capabilities blob consists of a newline (``\n``) delimited list of
+entries. Each entry has the form ``<key>`` or ``<key>=<value>``,
+depending on whether the capability has a value.
+
+The capability name is URL quoted (``%XX`` encoding of URL unsafe
+characters).
+
+The value, if present, is formed by URL quoting each value in
+the capability list and concatenating the result with a comma (``,``).
+
+For example, take capabilities ``novaluekey`` (no value) and ``listvaluekey``
+(values ``value 1`` and ``value 2``). These would be encoded as:
+
+   listvaluekey=value%201,value%202\nnovaluekey
+
+The sections below detail the defined bundle2 capabilities.
+
+HG20
+----
+
+Denotes that the peer supports the bundle2 data format.
+
+bookmarks
+---------
+
+Denotes that the peer supports the ``bookmarks`` part.
+
+Peers should not issue mandatory ``bookmarks`` parts unless this
+capability is present.
+
+changegroup
+-----------
+
+Denotes which versions of the *changegroup* format the peer can
+receive. Values include ``01``, ``02``, and ``03``.
+
+The peer should not generate changegroup data for a version not
+specified by this capability.
+
+checkheads
+----------
+
+Denotes which forms of heads checking the peer supports.
+
+If ``related`` is in the value, then the peer supports the ``check:heads``
+part and the peer is capable of detecting race conditions when applying
+changelog data.
+
+digests
+-------
+
+Denotes which hashing formats the peer supports.
+
+Values are names of hashing functions. Values include ``md5``, ``sha1``,
+and ``sha512``.
+
+error
+-----
+
+Denotes which ``error:`` parts the peer supports.
+
+Value is a list of strings of ``error:`` part names. Valid values
+include ``abort``, ``unsupportedcontent``, ``pushraced``, and ``pushkey``.
+
+Peers should not issue an ``error:`` part unless the type of that
+part is listed as supported by this capability.
+
+listkeys
+--------
+
+Denotes that the peer supports the ``listkeys`` part.
+
+hgtagsfnodes
+------------
+
+Denotes that the peer supports the ``hgtagsfnodes`` part.
+
+obsmarkers
+----------
+
+Denotes that the peer supports the ``obsmarkers`` part and which versions
+of the obsolescence data format it can receive. Values are strings like
+``V<N>``, e.g. ``V1``.
+
+phases
+------
+
+Denotes that the peer supports the ``phases`` part.
+
+pushback
+--------
+
+Denotes that the peer supports sending/receiving bundle2 data in response
+to a bundle2 request.
+
+This capability is typically used by servers that employ server-side
+rewriting of pushed repository data. For example, a server may wish to
+automatically rebase pushed changesets. When this capability is present,
+the server can send a bundle2 response containing the rewritten changeset
+data and the client will apply it.
+
+pushkey
+-------
+
+Denotes that the peer supports the ``pushkey`` part.
+
+remote-changegroup
+------------------
+
+Denotes that the peer supports the ``remote-changegroup`` part and
+which protocols it can use to fetch remote changegroup data.
+
+Values are protocol names. e.g. ``http`` and ``https``.
+
+stream
+------
+
+Denotes that the peer supports ``stream*`` parts in order to support
+*stream clone*.
+
+Values are which ``stream*`` parts the peer supports. ``v2`` denotes
+support for the ``stream2`` part.
+
+Bundle2 Part Types
+==================
+
+The sections below detail the various bundle2 part types.
+
+bookmarks
+---------
+
+The ``bookmarks`` part holds bookmarks information.
+
+This part has no parameters.
+
+The payload consists of entries defining bookmarks. Each entry consists of:
+
+* 20 bytes binary changeset node.
+* 2 bytes big endian short defining bookmark name length.
+* N bytes defining bookmark name.
+
+Receivers typically update bookmarks to match the state specified in
+this part.
+
+changegroup
+-----------
+
+The ``changegroup`` part contains *changegroup* data (changelog, manifestlog,
+and filelog revision data).
+
+The following part parameters are defined for this part.
+
+version
+   Changegroup version string. e.g. ``01``, ``02``, and ``03``. This parameter
+   determines how to interpret the changegroup data within the part.
+
+nbchanges
+   The number of changesets in this changegroup. This parameter can be used
+   to aid in the display of progress bars, etc during part application.
+
+treemanifest
+   Whether the changegroup contains tree manifests.
+
+targetphase
+   The target phase of changesets in this part. Value is an integer of
+   the target phase.
+
+The payload of this part is raw changegroup data. See
+:hg:`help internals.changegroups` for the format of changegroup data.
+
+check:bookmarks
+---------------
+
+The ``check:bookmarks`` part is inserted into a bundle as a means for the
+receiver to validate that the sender's known state of bookmarks matches
+the receiver's.
+
+This part has no parameters.
+
+The payload is a binary stream of bookmark data. Each entry in the stream
+consists of:
+
+* 20 bytes binary node that bookmark is associated with
+* 2 bytes unsigned short defining length of bookmark name
+* N bytes containing the bookmark name
+
+If all bits in the node value are ``1``, then this signifies a missing
+bookmark.
+
+When the receiver encounters this part, for each bookmark in the part
+payload, it should validate that the current bookmark state matches
+the specified state. If it doesn't, then the receiver should take
+appropriate action. (In the case of pushes, this mismatch signifies
+a race condition and the receiver should consider rejecting the push.)
+
+check:heads
+-----------
+
+The ``check:heads`` part is a means to validate that the sender's state
+of DAG heads matches the receiver's.
+
+This part has no parameters.
+
+The body of this part is an array of 20 byte binary nodes representing
+changeset heads.
+
+Receivers should compare the set of heads defined in this part to the
+current set of repo heads and take action if there is a mismatch in that
+set.
+
+Note that this part applies to *all* heads in the repo.
+
+check:phases
+------------
+
+The ``check:phases`` part validates that the sender's state of phase
+boundaries matches the receiver's.
+
+This part has no parameters.
+
+The payload consists of an array of 24 byte entries. Each entry consists
+of a big endian 32-bit integer defining the phase followed by a 20 byte
+binary node value.
+
+For each changeset defined in this part, the receiver should validate
+that its current phase matches the phase defined in this part. The
+receiver should take appropriate action if a mismatch occurs.
+
+check:updated-heads
+-------------------
+
+The ``check:updated-heads`` part validates that the sender's state of
+DAG heads updated by this bundle matches the receiver's.
+
+This type is nearly identical to ``check:heads`` except the heads
+in the payload are only a subset of heads in the repository. The
+receiver should validate that all nodes specified by the sender are
+branch heads and take appropriate action if not.
+
+error:abort
+-----------
+
+The ``error:abort`` part conveys a fatal error.
+
+The following part parameters are defined:
+
+message
+   The string content of the error message.
+
+hint
+   Supplemental string giving a hint on how to fix the problem.
+
+error:pushkey
+-------------
+
+The ``error:pushkey`` part conveys an error in the *pushkey* protocol.
+
+The following part parameters are defined:
+
+namespace
+   The pushkey domain that exhibited the error.
+
+key
+   The key whose update failed.
+
+new
+   The value we tried to set the key to.
+
+old
+   The old value of the key (as supplied by the client).
+
+ret
+   The integer result code for the pushkey request.
+
+in-reply-to
+   Part ID that triggered this error.
+
+This part is generated if there was an error applying *pushkey* data.
+Pushkey data includes bookmarks, phases, and obsolescence markers.
+
+error:pushraced
+---------------
+
+The ``error:pushraced`` part conveys that an error occurred and
+the likely cause is losing a race with another pusher.
+
+The following part parameters are defined:
+
+message
+   String error message.
+
+This part is typically emitted when a receiver examining ``check:*``
+parts encountered inconsistency between incoming state and local state.
+The likely cause of that inconsistency is another repository change
+operation (often another client performing an ``hg push``).
+
+error:unsupportedcontent
+------------------------
+
+The ``error:unsupportedcontent`` part conveys that a bundle2 receiver
+encountered a part or content it was not able to handle.
+
+The following part parameters are defined:
+
+parttype
+   The name of the part that triggered this error.
+
+params
+   ``\0`` delimited list of parameters.
+
+hgtagsfnodes
+------------
+
+The ``hgtagsfnodes`` type defines file nodes for the ``.hgtags`` file
+for various changesets.
+
+This part has no parameters.
+
+The payload is an array of pairs of 20 byte binary nodes. The first node
+is a changeset node. The second node is the ``.hgtags`` file node.
+
+Resolving tags requires resolving the ``.hgtags`` file node for changesets.
+On large repositories, this can be expensive. Repositories cache the
+mapping of changeset to ``.hgtags`` file node on disk as a performance
+optimization. This part allows that cached data to be transferred alongside
+changeset data.
+
+Receivers should update their ``.hgtags`` cache file node mappings with
+the incoming data.
+
+listkeys
+--------
+
+The ``listkeys`` part holds content for a *pushkey* namespace.
+
+The following part parameters are defined:
+
+namespace
+   The pushkey domain this data belongs to.
+
+The part payload contains a newline (``\n``) delimited list of
+tab (``\t``) delimited key-value pairs defining entries in this pushkey
+namespace.
+
+obsmarkers
+----------
+
+The ``obsmarkers`` part defines obsolescence markers.
+
+This part has no parameters.
+
+The payload consists of obsolescence markers using the on-disk markers
+format. The first byte defines the version format.
+
+The receiver should apply the obsolescence markers defined in this
+part. A ``reply:obsmarkers`` part should be sent to the sender, if possible.
+
+output
+------
+
+The ``output`` part is used to display output on the receiver.
+
+This part has no parameters.
+
+The payload consists of raw data to be printed on the receiver.
+
+phase-heads
+-----------
+
+The ``phase-heads`` part defines phase boundaries.
+
+This part has no parameters.
+
+The payload consists of an array of 24 byte entries. Each entry consists
+of a big endian 32-bit integer defining the phase followed by a 20 byte
+binary node value.
+
+pushkey
+-------
+
+The ``pushkey`` part communicates an intent to perform a ``pushkey``
+request.
+
+The following part parameters are defined:
+
+namespace
+   The pushkey domain to operate on.
+
+key
+   The key within the pushkey namespace that is being changed.
+
+old
+   The old value for the key being changed.
+
+new
+   The new value for the key being changed.
+
+This part has no payload.
+
+The receiver should perform a pushkey operation as described by this
+part's parameters.
+
+If the pushkey operation fails, a ``reply:pushkey`` part should be sent
+back to the sender, if possible. The ``in-reply-to`` part parameter
+should reference the source part.
+
+pushvars
+--------
+
+The ``pushvars`` part defines environment variables that should be
+set when processing this bundle2 payload.
+
+The part's advisory parameters define environment variables.
+
+There is no part payload.
+
+When received, part parameters are prefixed with ``USERVAR_`` and the
+resulting variables are defined in the hooks context for the current
+bundle2 application. This part provides a mechanism for senders to
+inject extra state into the hook execution environment on the receiver.
+
+remote-changegroup
+------------------
+
+The ``remote-changegroup`` part defines an external location of a bundle
+to apply. This part can be used by servers to serve pre-generated bundles
+hosted at arbitrary URLs.
+
+The following part parameters are defined:
+
+url
+   The URL of the remote bundle.
+
+size
+   The size in bytes of the remote bundle.
+
+digests
+   A space separated list of the digest types provided in additional
+   part parameters.
+
+digest:<type>
+   The hexadecimal representation of the digest (hash) of the remote bundle.
+
+There is no payload for this part type.
+
+When encountered, clients should attempt to fetch the URL being advertised
+and read and apply it as a bundle.
+
+The ``size`` and ``digest:<type>`` parameters should be used to validate
+that the downloaded bundle matches what was advertised. If a mismatch occurs,
+the client should abort.
+
+reply:changegroup
+-----------------
+
+The ``reply:changegroup`` part conveys the results of application of a
+``changegroup`` part.
+
+The following part parameters are defined:
+
+return
+   Integer return code from changegroup application.
+
+in-reply-to
+   Part ID of part this reply is in response to.
+
+reply:obsmarkers
+----------------
+
+The ``reply:obsmarkers`` part conveys the results of applying an
+``obsmarkers`` part.
+
+The following part parameters are defined:
+
+new
+   The integer number of new markers that were applied.
+
+in-reply-to
+   The part ID that this part is in reply to.
+
+reply:pushkey
+-------------
+
+The ``reply:pushkey`` part conveys the result of a *pushkey* operation.
+
+The following part parameters are defined:
+
+return
+   Integer result code from pushkey operation.
+
+in-reply-to
+   Part ID that triggered this pushkey operation.
+
+This part has no payload.
+
+replycaps
+---------
+
+The ``replycaps`` part notifies the receiver that a reply bundle should
+be created.
+
+This part has no parameters.
+
+The payload consists of a bundle2 capabilities blob.
+
+stream2
+-------
+
+The ``stream2`` part contains *streaming clone* version 2 data.
+
+The following part parameters are defined:
+
+requirements
+   URL quoted repository requirements string. Requirements are delimited by a
+   comma (``,``).
+
+filecount
+   The total number of files being transferred in the payload.
+
+bytecount
+   The total size of file content being transferred in the payload.
+
+The payload consists of raw stream clone version 2 data.
+
+The ``filecount`` and ``bytecount`` parameters can be used for progress and
+reporting purposes. The values may not be exact.
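
Note on the new bundle2.txt: as a rough illustration of the "Stream Format" and "Capabilities" sections added above, the following is a minimal Python sketch, not Mercurial's implementation. All function names (``read_parts``, ``parse_part_header``, ``encode_caps``, and so on) are hypothetical and exist only for this example. It assumes an uncompressed stream, treats any mandatory stream level parameter as unknown, and does not handle interrupt chunks (size ``-1``).

```python
import io
import struct
from urllib.parse import quote, unquote


def read_exact(fh, n):
    """Read exactly n bytes or fail (hypothetical helper)."""
    data = fh.read(n)
    if len(data) != n:
        raise ValueError("truncated bundle2 stream")
    return data


def read_int32(fh):
    """Read a 32-bit big-endian integer (signed, so -1 is representable)."""
    return struct.unpack(">i", read_exact(fh, 4))[0]


def parse_stream_params(blob):
    """Split the space separated, URL quoted stream level parameters."""
    params = {}
    for item in blob.decode("ascii").split(" "):
        if not item:
            continue
        name, sep, value = item.partition("=")
        params[unquote(name)] = unquote(value) if sep else None
    return params


def parse_part_header(header):
    """Decode part name, part ID and parameters from a part header."""
    namesize = header[0]
    name = header[1:1 + namesize].decode("ascii")
    offset = 1 + namesize
    (partid,) = struct.unpack_from(">I", header, offset)
    offset += 4
    mandatory, advisory = header[offset], header[offset + 1]
    offset += 2
    total = mandatory + advisory
    # (key size, value size) byte pairs for every parameter come first.
    sizes = struct.unpack_from(">%dB" % (2 * total), header, offset)
    offset += 2 * total
    params = []
    for i in range(total):
        ksize, vsize = sizes[2 * i], sizes[2 * i + 1]
        key = header[offset:offset + ksize]
        offset += ksize
        value = header[offset:offset + vsize]
        offset += vsize
        # Mandatory parameters are listed before advisory ones.
        params.append((key, value, i < mandatory))
    return name, partid, params


def read_parts(fh):
    """Return (stream params, [(name, part ID, params, payload), ...])."""
    if read_exact(fh, 4) != b"HG20":
        raise ValueError("not a bundle2 stream")
    params = parse_stream_params(read_exact(fh, read_int32(fh)))
    for name in params:
        # This sketch knows no stream parameters, so any mandatory one
        # (upper case first letter, e.g. compression) must stop processing.
        if name[:1].isupper():
            raise NotImplementedError("mandatory parameter: %s" % name)
    parts = []
    while True:
        headersize = read_int32(fh)
        if headersize == 0:  # end of stream marker
            break
        name, partid, pparams = parse_part_header(read_exact(fh, headersize))
        chunks = []
        while True:
            size = read_int32(fh)
            if size == 0:  # end of this part's payload
                break
            if size < 0:
                raise NotImplementedError("interrupt chunks not handled")
            chunks.append(read_exact(fh, size))
        parts.append((name, partid, pparams, b"".join(chunks)))
    return params, parts


def encode_caps(caps):
    """Encode a {name: [values]} dict as a capabilities blob."""
    entries = []
    for name in sorted(caps):
        key = quote(name, safe="")
        values = [quote(v, safe="") for v in caps[name]]
        entries.append(key + "=" + ",".join(values) if values else key)
    return "\n".join(entries)
```

Passing the ``novaluekey`` / ``listvaluekey`` example from the Capabilities section to ``encode_caps`` should reproduce the encoded blob shown there. A real consumer would also need to decompress the stream when a compression parameter is present and to honor mandatory part names.
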
--- a/mercurial/help/internals/bundles.txt	Mon Feb 26 23:54:40 2018 +0530
+++ b/mercurial/help/internals/bundles.txt	Sat Feb 17 11:19:52 2018 -0700
@@ -63,8 +63,7 @@
 
 ``HG20`` is currently the only defined bundle2 version.
 
-The ``HG20`` format is not yet documented here. See the inline comments
-in ``mercurial/exchange.py`` for now.
+The ``HG20`` format is documented at :hg:`help internals.bundle2`.
 
 Initial ``HG20`` support was added in Mercurial 3.0 (released May
 2014). However, bundle2 bundles were hidden behind an experimental flag
--- a/tests/test-help.t	Mon Feb 26 23:54:40 2018 +0530
+++ b/tests/test-help.t	Sat Feb 17 11:19:52 2018 -0700
@@ -993,6 +993,7 @@
   
       To access a subtopic, use "hg help internals.{subtopic-name}"
   
+       bundle2       Bundle2
        bundles       Bundles
        censor        Censor
        changegroups  Changegroups
@@ -3059,6 +3060,13 @@
   <tr><td colspan="2"><h2><a name="topics" href="#topics">Topics</a></h2></td></tr>
   
   <tr><td>
+  <a href="/help/internals.bundle2">
+  bundle2
+  </a>
+  </td><td>
+  Bundle2
+  </td></tr>
+  <tr><td>
   <a href="/help/internals.bundles">
   bundles
   </a>