changelog: avoid slicing raw data until needed
author Gregory Szorc <gregory.szorc@gmail.com>
Sun, 06 Mar 2016 15:40:20 -0800
changeset 28495 70c2f8a98276
parent 28494 63653147e9bb
child 28496 b592564a803c
Before, we were slicing the original raw text and storing individual variables with values corresponding to each field. This is avoidable overhead.

With this patch, we store the offsets of the fields at construction time and perform the slice when a property is accessed.

This appears to show a very marginal performance win on its own and the gains are so small as to not be worth reporting. However, this patch marks the end of our parsing refactor, so it is worth reporting the gains from the entire series:

    revset                                                          before    after     after/before
    author(mpm)                                                     0.896565  0.795987  89%
    desc(bug)                                                       0.887169  0.803438  90%
    date(2015)                                                      0.878797  0.773961  88%
    extra(rebase_source)                                            0.865446  0.761603  88%
    author(mpm) or author(greg)                                     1.801832  1.576025  87%
    author(mpm) or desc(bug)                                        1.812438  1.593335  88%
    date(2015) or branch(default)                                   0.968276  0.875270  90%
    author(mpm) or desc(bug) or date(2015) or extra(rebase_source)  3.656193  3.183104  87%

Pretty consistent speed-up across the board for any revset accessing changelog revision data. Not bad!

It's also worth noting that PyPy appears to experience a similar to marginally greater speed-up as well!

According to statprof, revsets accessing changelog revision data are now clearly dominated by zlib decompression (16-17% of execution time). Surprisingly, it appears the most expensive part of revision parsing is the various text.index() calls to search for newlines! These appear to cumulatively add up to 5+% of execution time. I reckon implementing the parsing in C would make things marginally faster.

If accessing larger strings (such as the commit message), encoding.tolocal() is the most expensive procedure outside of decompression.
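
The offset-based approach described in the message can be illustrated with a small standalone sketch. This is not the Mercurial class itself: the name LazyEntry and the sample entry text are made up for illustration, and the bin() and encoding.tolocal() conversions used in the real properties are left out so the example runs without Mercurial.

```python
class LazyEntry(object):
    """Hypothetical stand-in for the changelog entry class in the patch.

    Only the newline offsets are computed up front; each field is sliced
    out of the raw text when its property is read.
    """

    __slots__ = ('_offsets', '_text')

    def __init__(self, text):
        nl1 = text.index('\n')
        nl2 = text.index('\n', nl1 + 1)
        nl3 = text.index('\n', nl2 + 1)

        # An empty file list means nl3 is the first character of the
        # double newline that precedes the description.
        if text[nl3 + 1] == '\n':
            doublenl = nl3
        else:
            doublenl = text.index('\n\n', nl3 + 1)

        self._offsets = (nl1, nl2, nl3, doublenl)
        self._text = text

    @property
    def manifest(self):
        return self._text[0:self._offsets[0]]

    @property
    def user(self):
        off = self._offsets
        return self._text[off[0] + 1:off[1]]

    @property
    def files(self):
        off = self._offsets
        if off[2] == off[3]:
            return []
        return self._text[off[2] + 1:off[3]].split('\n')

    @property
    def description(self):
        return self._text[self._offsets[3] + 2:]


# Made-up entry text: manifest, user, date line, files, blank line, message.
entry = LazyEntry('abc123\nalice\n0 0\nfile1\nfile2\n\nfix bug')
nofiles = LazyEntry('abc123\nalice\n0 0\n\nfix bug')
```

Reading a property repeatedly repeats the slice, so this layout favors callers that touch only one or two fields per entry, which is the revset case measured above.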
mercurial/changelog.py
--- a/mercurial/changelog.py	Sun Mar 06 13:13:54 2016 -0800
+++ b/mercurial/changelog.py	Sun Mar 06 15:40:20 2016 -0800
@@ -151,11 +151,8 @@
     """
 
     __slots__ = (
-        '_rawdateextra',
-        '_rawdesc',
-        '_rawfiles',
-        '_rawmanifest',
-        '_rawuser',
+        '_offsets',
+        '_text',
     )
 
     def __new__(cls, text):
@@ -185,41 +182,41 @@
         # changelog v0 doesn't use extra
 
         nl1 = text.index('\n')
-        self._rawmanifest = text[0:nl1]
-
         nl2 = text.index('\n', nl1 + 1)
-        self._rawuser = text[nl1 + 1:nl2]
-
         nl3 = text.index('\n', nl2 + 1)
-        self._rawdateextra = text[nl2 + 1:nl3]
 
         # The list of files may be empty. Which means nl3 is the first of the
         # double newline that precedes the description.
         if text[nl3 + 1] == '\n':
-            self._rawfiles = None
-            self._rawdesc = text[nl3 + 2:]
+            doublenl = nl3
         else:
             doublenl = text.index('\n\n', nl3 + 1)
-            self._rawfiles = text[nl3 + 1:doublenl]
-            self._rawdesc = text[doublenl + 2:]
+
+        self._offsets = (nl1, nl2, nl3, doublenl)
+        self._text = text
 
         return self
 
     @property
     def manifest(self):
-        return bin(self._rawmanifest)
+        return bin(self._text[0:self._offsets[0]])
 
     @property
     def user(self):
-        return encoding.tolocal(self._rawuser)
+        off = self._offsets
+        return encoding.tolocal(self._text[off[0] + 1:off[1]])
 
     @property
     def _rawdate(self):
-        return self._rawdateextra.split(' ', 2)[0:2]
+        off = self._offsets
+        dateextra = self._text[off[1] + 1:off[2]]
+        return dateextra.split(' ', 2)[0:2]
 
     @property
     def _rawextra(self):
-        fields = self._rawdateextra.split(' ', 2)
+        off = self._offsets
+        dateextra = self._text[off[1] + 1:off[2]]
+        fields = dateextra.split(' ', 2)
         if len(fields) != 3:
             return None
 
@@ -247,14 +244,15 @@
 
     @property
     def files(self):
-        if self._rawfiles is None:
+        off = self._offsets
+        if off[2] == off[3]:
             return []
 
-        return self._rawfiles.split('\n')
+        return self._text[off[2] + 1:off[3]].split('\n')
 
     @property
     def description(self):
-        return encoding.tolocal(self._rawdesc)
+        return encoding.tolocal(self._text[self._offsets[3] + 2:])
 
 class changelog(revlog.revlog):
     def __init__(self, opener):
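
The _rawdate and _rawextra properties in the diff both rely on split(' ', 2) to separate the time, the timezone offset, and an optional extra blob that may itself contain spaces. A minimal sketch of that behavior follows, using hypothetical free functions rawdate and rawextra and a made-up date line. The final return of fields[2] is an assumption, since that line falls outside the hunk shown.

```python
def rawdate(dateextra):
    # At most two splits: the first two fields are time and timezone.
    return dateextra.split(' ', 2)[0:2]


def rawextra(dateextra):
    fields = dateextra.split(' ', 2)
    if len(fields) != 3:
        # No third field means the entry carries no extra data.
        return None
    # Assumption: the remainder is returned as-is for later decoding.
    return fields[2]


with_extra = '1457307620 28800 branch:stable with spaces'
without_extra = '1457307620 28800'
```

Limiting the split to two separators keeps any spaces inside the extra data intact instead of breaking it into further fields.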