# HG changeset patch
# User Matt Mackall
# Date 1452200277 21600
# Node ID c8d3392f76e14751da518b6a52860686bbadf25e
# Parent  dad6404ccddb236ef0f1b88c741eec71bf6a5563
encoding: handle UTF-16 internal limit with fromutf8b (issue5031)

Default builds of Python have a Unicode type that isn't actually full
Unicode but UTF-16, which encodes non-BMP codepoints to a pair of BMP
codepoints with surrogate escaping. Since our UTF-8b hack escaping
uses a plane that overlaps with the UTF-16 escaping system, this gets
extra complicated. In addition, unichr() for codepoints greater than
U+FFFF may not work either.

This changes the code to reuse getutf8char to walk the byte string, so
we only rely on Python for unpacking our U+DCxx characters.

diff -r dad6404ccddb -r c8d3392f76e1 mercurial/encoding.py
--- a/mercurial/encoding.py	Wed Nov 11 21:18:02 2015 -0500
+++ b/mercurial/encoding.py	Thu Jan 07 14:57:57 2016 -0600
@@ -516,17 +516,27 @@
     True
     >>> roundtrip("\\xef\\xef\\xbf\\xbd")
     True
+    >>> roundtrip("\\xf1\\x80\\x80\\x80\\x80")
+    True
     '''
 
     # fast path - look for uDxxx prefixes in s
     if "\xed" not in s:
         return s
 
-    u = s.decode("utf-8")
+    # We could do this with the unicode type but some Python builds
+    # use UTF-16 internally (issue5031) which causes non-BMP code
+    # points to be escaped. Instead, we use our handy getutf8char
+    # helper again to walk the string without "decoding" it.
+
     r = ""
-    for c in u:
-        if ord(c) & 0xffff00 == 0xdc00:
-            r += chr(ord(c) & 0xff)
-        else:
-            r += c.encode("utf-8")
+    pos = 0
+    l = len(s)
+    while pos < l:
+        c = getutf8char(s, pos)
+        pos += len(c)
+        # unescape U+DCxx characters
+        if "\xed\xb0\x80" <= c <= "\xed\xb3\xbf":
+            c = chr(ord(c.decode("utf-8")) & 0xff)
+        r += c
     return r
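
For readers who want to try the byte-walking approach outside the Mercurial tree, here is a minimal standalone sketch. It is not the code from this patch: it assumes Python 3 `bytes` rather than the Python 2 `str` used above, and the names `utf8_char_at` and `unescape_utf8b` are hypothetical. `utf8_char_at` is a simplified stand-in for `getutf8char` that falls back to a single byte on malformed input instead of raising. On Python 3 the strict UTF-8 decoder rejects surrogates, so the sketch passes `"surrogatepass"` when unpacking a U+DCxx sequence.

```python
def utf8_char_at(s: bytes, pos: int) -> bytes:
    """Return the bytes of the character starting at s[pos].

    Hypothetical stand-in for getutf8char: a structurally complete
    UTF-8 sequence is returned whole, anything else as a single byte.
    """
    first = s[pos]
    if 0xC2 <= first <= 0xDF:
        n = 2
    elif 0xE0 <= first <= 0xEF:
        n = 3
    elif 0xF0 <= first <= 0xF4:
        n = 4
    else:
        # ASCII, a stray continuation byte, or an invalid lead byte
        return s[pos:pos + 1]
    chunk = s[pos:pos + n]
    # Accept only full-length sequences whose trailing bytes are 10xxxxxx
    if len(chunk) != n or any(b & 0xC0 != 0x80 for b in chunk[1:]):
        return s[pos:pos + 1]
    return chunk


def unescape_utf8b(s: bytes) -> bytes:
    """Turn UTF-8 encoded U+DCxx escapes back into the raw byte xx."""
    # Fast path: every U+Dxxx character starts with the byte 0xED
    if b"\xed" not in s:
        return s
    out = bytearray()
    pos = 0
    while pos < len(s):
        c = utf8_char_at(s, pos)
        pos += len(c)
        # U+DC00..U+DCFF encode as ED B0 80 .. ED B3 BF
        if b"\xed\xb0\x80" <= c <= b"\xed\xb3\xbf":
            # Only this three-byte sequence is handed to the decoder
            out.append(ord(c.decode("utf-8", "surrogatepass")) & 0xFF)
        else:
            out += c
    return bytes(out)
```

Because non-BMP characters are copied through as raw bytes and never become Unicode objects, the result does not depend on whether the interpreter stores text as UTF-16 or UCS-4. For example, `unescape_utf8b(b"\xed\xb2\x80")` yields `b"\x80"`, while the four-byte sequence `b"\xf0\x9f\x98\x80"` passes through unchanged.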