similar: compare between actual file contents for exact identity
author FUJIWARA Katsunori <foozy@lares.dti.ne.jp>
Fri, 03 Mar 2017 02:57:06 +0900
changeset 31210 e1d035905b2e
parent 31209 dd2364f5180a
child 31211 ecbd378d9a7e
Before this patch, the similarity detection logic (for addremove and automv) depended entirely on SHA-1 digests. This causes incorrect rename detection if:

- the removal of file A and the addition of file B occur in the same commit, and
- the SHA-1 hash values of files A and B are the same

For example, this may prevent security experts from managing sample files for the SHAttered issue in a Mercurial repository.

https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
https://shattered.it/

A hash collision in itself is not so serious for the core repository functionality of Mercurial, though, as mpm describes below.

https://www.mercurial-scm.org/wiki/mpm/SHA1

This patch compares the actual file contents after the hash comparison, to confirm exact identity.

Even after this patch, SHA-1 is still used, because it is good enough for quickly detecting the existence of an "(almost) same" file:

- replacing SHA-1 would decrease performance, and
- what to replace it with is still ambiguous

Getting the content of the removed file (= rfctx.data()) at each exact comparison should be cheap enough, even though getting the content of the added one is costly.

======= ============== =====================
file    fctx           data() reads from
======= ============== =====================
removed filectx        in-memory revlog data
added   workingfilectx storage
======= ============== =====================
mercurial/similar.py
--- a/mercurial/similar.py	Thu Mar 02 21:49:30 2017 -0800
+++ b/mercurial/similar.py	Fri Mar 03 02:57:06 2017 +0900
@@ -35,9 +35,13 @@
     for i, fctx in enumerate(added):
         repo.ui.progress(_('searching for exact renames'), i + len(removed),
                 total=numfiles, unit=_('files'))
-        h = hashlib.sha1(fctx.data()).digest()
+        adata = fctx.data()
+        h = hashlib.sha1(adata).digest()
         if h in hashes:
-            yield (hashes[h], fctx)
+            rfctx = hashes[h]
+            # compare between actual file contents for exact identity
+            if adata == rfctx.data():
+                yield (rfctx, fctx)
 
     # Done
     repo.ui.progress(_('searching for exact renames'), None)
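
The hash-then-compare check in the hunk above can be tried outside Mercurial with a minimal sketch like the following. It is not the code in mercurial/similar.py: the names find_exact_renames and sha1_digest, the digest parameter, and the use of plain dicts of bytes in place of file contexts are all assumptions made for illustration.

```python
import hashlib


def sha1_digest(data):
    """Return the raw SHA-1 digest of a bytes object."""
    return hashlib.sha1(data).digest()


def find_exact_renames(removed, added, digest=sha1_digest):
    """Yield (removed_name, added_name) pairs with byte-identical content.

    removed, added: dicts mapping a file name to its content as bytes
    (hypothetical stand-ins for Mercurial's file contexts).
    digest: hash function used as the fast pre-filter (hypothetical
    parameter, exposed only so that a collision can be simulated).
    """
    # Index the removed files by digest. If two removed files share a
    # digest, the later one wins in this sketch.
    hashes = {}
    for rname, rdata in removed.items():
        hashes[digest(rdata)] = rname

    for aname, adata in added.items():
        h = digest(adata)
        if h in hashes:
            rname = hashes[h]
            # An equal digest is only a hint; compare the actual
            # contents before reporting an exact rename.
            if adata == removed[rname]:
                yield (rname, aname)


if __name__ == "__main__":
    removed = {"a.txt": b"hello"}
    added = {"b.txt": b"hello", "c.txt": b"help!"}

    # A deliberately weak digest (first byte only) makes all three
    # files collide, which stands in for a real SHA-1 collision.
    def weak(data):
        return data[:1]

    print(list(find_exact_renames(removed, added, digest=weak)))
```

With the weak digest, c.txt collides with a.txt but is rejected by the content comparison, so only the a.txt to b.txt pair is reported.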