internals: document compression negotiation
As part of adding zstd support to all of the things, we'll need
to teach the wire protocol to support non-zlib compression formats.
This commit documents how we'll implement that.
To understand how we arrived at this proposal, let's look at how
things are done today.
The wire protocol today doesn't have a unified format. Instead,
there is a limited facility for differentiating replies as successful
or not. And, each command essentially defines its own response format.
A significant deficiency in the current protocol is the lack of
payload framing over the SSH transport. In the HTTP transport,
chunked transfer is used and the end of an HTTP response body (and
the end of a Mercurial command response) can be identified by a 0
length chunk. This is how HTTP chunked transfer works. But in the
SSH transport, there is no such framing, at least for certain
responses (notably the response to "getbundle" requests). Clients
can't simply read until end of stream because the socket is
persistent and reused for multiple requests. Clients need to know
when they've encountered the end of a request but there is nothing
simple for them to key off of to detect this. So what happens is
the client must decode the payload (as opposed to being dumb and
forwarding frames/packets). This means the payload itself needs
to support identifying end of stream. In some cases (bundle2), it
also means the payload can encode "error" or "interrupt" events
telling the client to e.g. abort processing. The lack of framing
on the SSH transport and the transfer of its responsibilities to
e.g. bundle2 is a massive layering violation and a wart on the
protocol architecture. It needs to be fixed someday by inventing a
proper framing protocol.
So about compression.
The client transport abstractions have a "_callcompressable()"
API. This API is called to invoke a remote command that will
send a compressible response. The response is essentially a
"streaming" response (no framing data at the Mercurial layer)
that is fed into a decompressor.
On the HTTP transport, the decompressor is zlib and only zlib.
There is currently no mechanism for the client to specify an
alternate compression format. And, clients don't advertise what
compression formats they support or ask the server to send a
specific compression format. Instead, it is assumed that non-error
responses to "compressible" commands are zlib compressed.
On the SSH transport, there is no compression at the Mercurial
protocol layer. Instead, compression must be handled by SSH
itself (e.g. `ssh -C`) or within the payload data (e.g. bundle
compression).
For the HTTP transport, adding new compression formats is pretty
straightforward. Once you know what decompressor to use, you can
stream data into the decompressor until you reach a 0 size HTTP
chunk, at which point you are at end of stream.
So our wire protocol changes for the HTTP transport are pretty
straightforward: the client and server advertise what compression
formats they support and an appropriate compression format is
chosen. We introduce a new HTTP media type to hold compressed
payloads. The header of the payload defines the compression format
being used. Whoever is on the receiving end can sniff the first few
bytes route to an appropriate decompressor.
Support for multiple compression formats is advertised on both
server and client. The server advertises a "compression" capability
saying which compression formats it supports and in what order they
are preferred. Clients advertise their support for multiple
compression formats and media types via the introduced "X-HgProto"
request header.
Strictly speaking, servers don't need to advertise which compression
formats they support. But doing so allows clients to fail fast if
they don't support any of the formats the server does. This is useful
in situations like sending bundles, where the client may have to
perform expensive computation before sending data to the server.
Rather than simply advertise a list of supported compression formats,
we introduce an additional "httpmediatype" server capability
advertising which media types the server supports. This means servers
are explicit about what formats they exchange. IMO, this is superior
to inferring support from other capabilities (like "compression").
By advertising compression support on each request in the "X-HgProto"
header and media type and direction at the server level, we are able
to gradually transition existing commands/responses to the new media
type and possibly compression. Contrast with the old world, where we
only supported a single media type and the use of compression was
built-in to the semantics of the command on both client and server.
In the new world, if "application/mercurial-0.2" is supported,
compression is supported. It's that simple.
It's worth noting that we explicitly don't use "Accept,"
"Accept-Encoding," "Content-Encoding," or "Transfer-Encoding" for
content negotiation and compression. People knowledgeable of the HTTP
specifications will say that we should use these because that's
what they are designed to be used for. They have a point and I
sympathize with the argument. Earlier versions of this commit even
defined supported media types in the "Accept" header. However, my
years of experience rolling out services leveraging HTTP has taught
me to not trust the HTTP layer, especially if you are going outside
the normal spec (such as using a custom "Content-Encoding" value to
represent zstd streams). I've seen load balancers, proxies, and other
network devices do very bad and unexpected things to HTTP messages
(like insisting zlib compressed content is decoded and then re-encoded
at a different compression level or even stripping compression
completely). I've found that the best way to avoid surprises when
writing protocols on top of HTTP is to use HTTP as a dumb transport as
much as possible to minimize the chances that an "intelligent" agent
between endpoints will muck with your data. While the widespread use of
TLS is mitigating many intermediate network agents interfering with
HTTP, there are still problems at the edges, with e.g. the origin HTTP
server needing to convert HTTP to and from WSGI and buggy or
feature-lacking HTTP client implementations. I've found the best way to
avoid these problems is to avoid using headers like "Content-Encoding"
and to bake as much logic as possible into media types and HTTP message
bodies. The protocol changes in this commit do rely on a custom HTTP
request header and the "Content-Type" headers. But we used them before,
so we shouldn't be increasing our exposure to "bad" HTTP agents.
For the SSH transport, we can't easily implement content negotiation
to determine compression formats because the SSH transport has no
content negotiation capabilities today. And without a framing protocol,
we don't know how much data to feed into a decompressor. So in order
to implement compression support on the SSH transport, we'd need to
invent a mechanism to represent content types and an outer framing
protocol to stream data robustly. While I'm fully capable of doing
that, it is a lot of work and not something that should be undertaken
lightly. My opinion is that if we're going to change the SSH transport
protocol, we should take a long hard look at implementing a grand
unified protocol that attempts to address all the deficiencies with
the existing protocol. While I want this to happen, that would be
massive scope bloat standing in the way of zstd support. So, I've
decided to take the easy solution: the SSH transport will not gain
support for multiple compression formats. Keep in mind it doesn't
support *any* compression today. So essentially nothing is changing
on the SSH front.
# Copyright 2005, 2006 Benoit Boissinot <benoit.boissinot@ens-lyon.org>
#
# This software may be used and distributed according to the terms of the
# GNU General Public License version 2 or any later version.
'''commands to sign and verify changesets'''
from __future__ import absolute_import
import binascii
import os
import tempfile
from mercurial.i18n import _
from mercurial import (
cmdutil,
commands,
error,
match,
node as hgnode,
util,
)
cmdtable = {}
command = cmdutil.command(cmdtable)
# Note for extension authors: ONLY specify testedwith = 'ships-with-hg-core' for
# extensions which SHIP WITH MERCURIAL. Non-mainline extensions should
# be specifying the version(s) of Mercurial they are tested with, or
# leave the attribute unspecified.
testedwith = 'ships-with-hg-core'
class gpg(object):
def __init__(self, path, key=None):
self.path = path
self.key = (key and " --local-user \"%s\"" % key) or ""
def sign(self, data):
gpgcmd = "%s --sign --detach-sign%s" % (self.path, self.key)
return util.filter(data, gpgcmd)
def verify(self, data, sig):
""" returns of the good and bad signatures"""
sigfile = datafile = None
try:
# create temporary files
fd, sigfile = tempfile.mkstemp(prefix="hg-gpg-", suffix=".sig")
fp = os.fdopen(fd, 'wb')
fp.write(sig)
fp.close()
fd, datafile = tempfile.mkstemp(prefix="hg-gpg-", suffix=".txt")
fp = os.fdopen(fd, 'wb')
fp.write(data)
fp.close()
gpgcmd = ("%s --logger-fd 1 --status-fd 1 --verify "
"\"%s\" \"%s\"" % (self.path, sigfile, datafile))
ret = util.filter("", gpgcmd)
finally:
for f in (sigfile, datafile):
try:
if f:
os.unlink(f)
except OSError:
pass
keys = []
key, fingerprint = None, None
for l in ret.splitlines():
# see DETAILS in the gnupg documentation
# filter the logger output
if not l.startswith("[GNUPG:]"):
continue
l = l[9:]
if l.startswith("VALIDSIG"):
# fingerprint of the primary key
fingerprint = l.split()[10]
elif l.startswith("ERRSIG"):
key = l.split(" ", 3)[:2]
key.append("")
fingerprint = None
elif (l.startswith("GOODSIG") or
l.startswith("EXPSIG") or
l.startswith("EXPKEYSIG") or
l.startswith("BADSIG")):
if key is not None:
keys.append(key + [fingerprint])
key = l.split(" ", 2)
fingerprint = None
if key is not None:
keys.append(key + [fingerprint])
return keys
def newgpg(ui, **opts):
"""create a new gpg instance"""
gpgpath = ui.config("gpg", "cmd", "gpg")
gpgkey = opts.get('key')
if not gpgkey:
gpgkey = ui.config("gpg", "key", None)
return gpg(gpgpath, gpgkey)
def sigwalk(repo):
"""
walk over every sigs, yields a couple
((node, version, sig), (filename, linenumber))
"""
def parsefile(fileiter, context):
ln = 1
for l in fileiter:
if not l:
continue
yield (l.split(" ", 2), (context, ln))
ln += 1
# read the heads
fl = repo.file(".hgsigs")
for r in reversed(fl.heads()):
fn = ".hgsigs|%s" % hgnode.short(r)
for item in parsefile(fl.read(r).splitlines(), fn):
yield item
try:
# read local signatures
fn = "localsigs"
for item in parsefile(repo.vfs(fn), fn):
yield item
except IOError:
pass
def getkeys(ui, repo, mygpg, sigdata, context):
"""get the keys who signed a data"""
fn, ln = context
node, version, sig = sigdata
prefix = "%s:%d" % (fn, ln)
node = hgnode.bin(node)
data = node2txt(repo, node, version)
sig = binascii.a2b_base64(sig)
keys = mygpg.verify(data, sig)
validkeys = []
# warn for expired key and/or sigs
for key in keys:
if key[0] == "ERRSIG":
ui.write(_("%s Unknown key ID \"%s\"\n")
% (prefix, shortkey(ui, key[1][:15])))
continue
if key[0] == "BADSIG":
ui.write(_("%s Bad signature from \"%s\"\n") % (prefix, key[2]))
continue
if key[0] == "EXPSIG":
ui.write(_("%s Note: Signature has expired"
" (signed by: \"%s\")\n") % (prefix, key[2]))
elif key[0] == "EXPKEYSIG":
ui.write(_("%s Note: This key has expired"
" (signed by: \"%s\")\n") % (prefix, key[2]))
validkeys.append((key[1], key[2], key[3]))
return validkeys
@command("sigs", [], _('hg sigs'))
def sigs(ui, repo):
"""list signed changesets"""
mygpg = newgpg(ui)
revs = {}
for data, context in sigwalk(repo):
node, version, sig = data
fn, ln = context
try:
n = repo.lookup(node)
except KeyError:
ui.warn(_("%s:%d node does not exist\n") % (fn, ln))
continue
r = repo.changelog.rev(n)
keys = getkeys(ui, repo, mygpg, data, context)
if not keys:
continue
revs.setdefault(r, [])
revs[r].extend(keys)
for rev in sorted(revs, reverse=True):
for k in revs[rev]:
r = "%5d:%s" % (rev, hgnode.hex(repo.changelog.node(rev)))
ui.write("%-30s %s\n" % (keystr(ui, k), r))
@command("sigcheck", [], _('hg sigcheck REV'))
def sigcheck(ui, repo, rev):
"""verify all the signatures there may be for a particular revision"""
mygpg = newgpg(ui)
rev = repo.lookup(rev)
hexrev = hgnode.hex(rev)
keys = []
for data, context in sigwalk(repo):
node, version, sig = data
if node == hexrev:
k = getkeys(ui, repo, mygpg, data, context)
if k:
keys.extend(k)
if not keys:
ui.write(_("no valid signature for %s\n") % hgnode.short(rev))
return
# print summary
ui.write(_("%s is signed by:\n") % hgnode.short(rev))
for key in keys:
ui.write(" %s\n" % keystr(ui, key))
def keystr(ui, key):
"""associate a string to a key (username, comment)"""
keyid, user, fingerprint = key
comment = ui.config("gpg", fingerprint, None)
if comment:
return "%s (%s)" % (user, comment)
else:
return user
@command("sign",
[('l', 'local', None, _('make the signature local')),
('f', 'force', None, _('sign even if the sigfile is modified')),
('', 'no-commit', None, _('do not commit the sigfile after signing')),
('k', 'key', '',
_('the key id to sign with'), _('ID')),
('m', 'message', '',
_('use text as commit message'), _('TEXT')),
('e', 'edit', False, _('invoke editor on commit messages')),
] + commands.commitopts2,
_('hg sign [OPTION]... [REV]...'))
def sign(ui, repo, *revs, **opts):
"""add a signature for the current or given revision
If no revision is given, the parent of the working directory is used,
or tip if no revision is checked out.
The ``gpg.cmd`` config setting can be used to specify the command
to run. A default key can be specified with ``gpg.key``.
See :hg:`help dates` for a list of formats valid for -d/--date.
"""
with repo.wlock():
return _dosign(ui, repo, *revs, **opts)
def _dosign(ui, repo, *revs, **opts):
mygpg = newgpg(ui, **opts)
sigver = "0"
sigmessage = ""
date = opts.get('date')
if date:
opts['date'] = util.parsedate(date)
if revs:
nodes = [repo.lookup(n) for n in revs]
else:
nodes = [node for node in repo.dirstate.parents()
if node != hgnode.nullid]
if len(nodes) > 1:
raise error.Abort(_('uncommitted merge - please provide a '
'specific revision'))
if not nodes:
nodes = [repo.changelog.tip()]
for n in nodes:
hexnode = hgnode.hex(n)
ui.write(_("signing %d:%s\n") % (repo.changelog.rev(n),
hgnode.short(n)))
# build data
data = node2txt(repo, n, sigver)
sig = mygpg.sign(data)
if not sig:
raise error.Abort(_("error while signing"))
sig = binascii.b2a_base64(sig)
sig = sig.replace("\n", "")
sigmessage += "%s %s %s\n" % (hexnode, sigver, sig)
# write it
if opts['local']:
repo.vfs.append("localsigs", sigmessage)
return
if not opts["force"]:
msigs = match.exact(repo.root, '', ['.hgsigs'])
if any(repo.status(match=msigs, unknown=True, ignored=True)):
raise error.Abort(_("working copy of .hgsigs is changed "),
hint=_("please commit .hgsigs manually"))
sigsfile = repo.wfile(".hgsigs", "ab")
sigsfile.write(sigmessage)
sigsfile.close()
if '.hgsigs' not in repo.dirstate:
repo[None].add([".hgsigs"])
if opts["no_commit"]:
return
message = opts['message']
if not message:
# we don't translate commit messages
message = "\n".join(["Added signature for changeset %s"
% hgnode.short(n)
for n in nodes])
try:
editor = cmdutil.getcommiteditor(editform='gpg.sign', **opts)
repo.commit(message, opts['user'], opts['date'], match=msigs,
editor=editor)
except ValueError as inst:
raise error.Abort(str(inst))
def shortkey(ui, key):
if len(key) != 16:
ui.debug("key ID \"%s\" format error\n" % key)
return key
return key[-8:]
def node2txt(repo, node, ver):
"""map a manifest into some text"""
if ver == "0":
return "%s\n" % hgnode.hex(node)
else:
raise error.Abort(_("unknown signature version"))