pycompat: use os.fsencode() to re-encode sys.argv
authorManuel Jacob <me@manueljacob.de>
Wed, 24 Jun 2020 14:44:21 +0200
changeset 44998 f2de8f31cb59
parent 44997 93aa152d4295
child 44999 d1471dbbdd63
pycompat: use os.fsencode() to re-encode sys.argv Historically, the previous code made sense, as Py_EncodeLocale() and fs.fsencode() could possibly use different encodings. However, this is not the case anymore for Python 3.2, which uses the locale encoding as the filesystem encoding (this is not true for later Python versions, but see below). See https://vstinner.github.io/painful-history-python-filesystem-encoding.html for a source and more background information. Using os.fsencode() is safer, as the documentation for sys.argv says that it can be used to get the original bytes. When doing further changes, the Python developers will take care that this continues to work. One concrete case where os.fsencode() is more correct is when enabling Python's UTF-8 mode. Py_DecodeLocale() will use UTF-8 in this case. Our previous code would have encoded it using the locale encoding (which might be different), whereas os.fsencode() will encode it with UTF-8. Since we don’t claim to support the UTF-8 mode, this is not really a bug and the patch can go to the default branch. It might be a good idea to not commit this to the stable branch, as it could in theory introduce regressions.
mercurial/pycompat.py
--- a/mercurial/pycompat.py	Thu Jun 25 22:40:04 2020 +0900
+++ b/mercurial/pycompat.py	Wed Jun 24 14:44:21 2020 +0200
@@ -98,7 +98,6 @@
     import codecs
     import functools
     import io
-    import locale
     import struct
 
     if os.name == r'nt' and sys.version_info >= (3, 6):
@@ -156,16 +155,10 @@
 
     if getattr(sys, 'argv', None) is not None:
         # On POSIX, the char** argv array is converted to Python str using
-        # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which isn't
-        # directly callable from Python code. So, we need to emulate it.
-        # Py_DecodeLocale() calls mbstowcs() and falls back to mbrtowc() with
-        # surrogateescape error handling on failure. These functions take the
-        # current system locale into account. So, the inverse operation is to
-        # .encode() using the system locale's encoding and using the
-        # surrogateescape error handler. The only tricky part here is getting
-        # the system encoding correct, since `locale.getlocale()` can return
-        # None. We fall back to the filesystem encoding if lookups via `locale`
-        # fail, as this seems like a reasonable thing to do.
+        # Py_DecodeLocale(). The inverse of this is Py_EncodeLocale(), which
+        # isn't directly callable from Python code. In practice, os.fsencode()
+        # can be used instead (this is recommended by Python's documentation
+        # for sys.argv).
         #
         # On Windows, the wchar_t **argv is passed into the interpreter as-is.
         # Like POSIX, we need to emulate what Py_EncodeLocale() would do. But
@@ -178,19 +171,7 @@
         if os.name == r'nt':
             sysargv = [a.encode("mbcs", "ignore") for a in sys.argv]
         else:
-
-            def getdefaultlocale_if_known():
-                try:
-                    return locale.getdefaultlocale()
-                except ValueError:
-                    return None, None
-
-            encoding = (
-                locale.getlocale()[1]
-                or getdefaultlocale_if_known()[1]
-                or sys.getfilesystemencoding()
-            )
-            sysargv = [a.encode(encoding, "surrogateescape") for a in sys.argv]
+            sysargv = [fsencode(a) for a in sys.argv]
 
     bytechr = struct.Struct('>B').pack
     byterepr = b'%r'.__mod__