mercurial: Changelog

changelog: avoid slicing raw data until needed Before, we were slicing the original raw text and storing individual variables with values corresponding to each field. This is avoidable overhead. With this patch, we store the offsets of the fields at construction time and perform the slice when a property is accessed. This appears to show a very marginal performance win on its own and the gains are so small as to not be worth reporting. However, this patch marks the end of our parsing refactor, so it is worth reporting the gains from the entire series: author(mpm) 0.896565 0.795987 89% desc(bug) 0.887169 0.803438 90% date(2015) 0.878797 0.773961 88% extra(rebase_source) 0.865446 0.761603 88% author(mpm) or author(greg) 1.801832 1.576025 87% author(mpm) or desc(bug) 1.812438 1.593335 88% date(2015) or branch(default) 0.968276 0.875270 90% author(mpm) or desc(bug) or date(2015) or extra(rebase_source) 3.656193 3.183104 87% Pretty consistent speed-up across the board for any revset accessing changelog revision data. Not bad! It's also worth noting that PyPy appears to experience a similar to marginally greater speed-up as well! According to statprof, revsets accessing changelog revision data are now clearly dominated by zlib decompression (16-17% of execution time). Surprisingly, it appears the most expensive part of revision parsing are the various text.index() calls to search for newlines! These appear to cumulatively add up to 5+% of execution time. I reckon implementing the parsing in C would make things marginally faster. If accessing larger strings (such as the commit message), encoding.tolocal() is the most expensive procedure outside of decompression.

changelog: parse description last Before, we first searched for the double newline as the first step in the parse then moved to the front of the string and worked our way to the back again. This made sense when we were splitting the raw text on the double newline. But in our new newline scanning based approach, this feels awkward. This patch updates the parsing logic to parse the text linearly and deal with the description field last. Because we're avoiding an extra string scan, revsets appear to demonstrate a very slight performance win. But the percentage change is marginal, so the numbers aren't worth reporting.

changelog: lazily parse files More of the same. Again, modest revset performance wins: author(mpm) 0.896565 0.822961 0.805156 desc(bug) 0.887169 0.847054 0.798101 date(2015) 0.878797 0.811613 0.786689 extra(rebase_source) 0.865446 0.797756 0.777408 author(mpm) or author(greg) 1.801832 1.668172 1.626547 author(mpm) or desc(bug) 1.812438 1.677608 1.613941 date(2015) or branch(default) 0.968276 0.896032 0.869017

changelog: lazily parse date/extra field This is probably the most complicated patch in the parsing refactor. Because the date and extras are encoded in the same field, we stuff the entire field into a dedicated variable and add a property for accessing the sub-components of each. There is some duplicated code here. But the code is relatively simple, so it shouldn't be a big deal. We see revset performance wins across the board: author(mpm) 0.896565 0.876713 0.822961 desc(bug) 0.887169 0.895514 0.847054 date(2015) 0.878797 0.820987 0.811613 extra(rebase_source) 0.865446 0.823811 0.797756 author(mpm) or author(greg) 1.801832 1.784160 1.668172 author(mpm) or desc(bug) 1.812438 1.822756 1.677608 date(2015) or branch(default) 0.968276 0.910981 0.896032 author(mpm) or desc(bug) or date(2015) or extra(rebase_source) 3.656193 3.516788 3.265024 We see a speed-up on revsets accessing date and extras because the new parsing code only parses what you access. Even though they are stored the same text field, we avoid parsing dates when accessing extras and vice-versa. But strangely revsets accessing both date and extras appeared to speed up as well! I'm not sure if this is due to refactoring the parsing code or due to an optimization in revsets. You can't argue with the results!

changelog: lazily parse user Same strategy as before. Revsets not accessing the user demonstrate a slight performance win: desc(bug) 0.887169 0.910400 0.895514 date(2015) 0.878797 0.870697 0.820987 extra(rebase_source) 0.865446 0.841644 0.823811 date(2015) or branch(default) 0.968276 0.945792 0.910981

changelog: lazily parse manifest node Like the description, we store the raw bytes and convert from hex on access. This patch also marks the beginning of our new parsing method, which is based on newline offsets and doesn't rely on str.split(). Many revsets showed a performance improvement: author(mpm) 0.896565 0.869085 0.868598 desc(bug) 0.887169 0.928164 0.910400 extra(rebase_source) 0.865446 0.871500 0.841644 author(mpm) or author(greg) 1.801832 1.791589 1.731503 author(mpm) or desc(bug) 1.812438 1.851003 1.798764 date(2015) or branch(default) 0.968276 0.974027 0.945792

changelog: lazily parse description Before, the description field was converted to a localstr at parse time. With this patch, we store the raw description and convert to a localstr when it is first accessed. We see a revset speedup for revsets that don't access the description: author(mpm) 0.896565 0.914234 0.869085 date(2015) 0.878797 0.891980 0.862525 extra(rebase_source) 0.865446 0.912514 0.871500 author(mpm) or author(greg) 1.801832 1.860402 1.791589 date(2015) or branch(default) 0.968276 0.994673 0.974027 author(mpm) or desc(bug) or date(2015) or extra(rebase_source) 3.656193 3.721032 3.643593 As you can see, most of these revsets are already faster than from before this refactoring: we have already offset the performance loss from the introduction of the new class representing parsed changelog entries!

context: use changelogrevision Upcoming patches will make the changelogrevision object perform lazy parsing. Let's switch to it. Because we're switching from a tuple to an object, everthing that accesses the internal cached attribute needs to be updated to access via attributes. A nice side-effect is this makes the code easier to read! Surprisingly, this appears to make revsets accessing this data slightly faster (values are before series, p1, this patch): author(mpm) 0.896565 0.929984 0.914234 desc(bug) 0.887169 0.935642 0.921073 date(2015) 0.878797 0.908094 0.891980 extra(rebase_source) 0.865446 0.922624 0.912514 author(mpm) or author(greg) 1.801832 1.902112 1.860402 author(mpm) or desc(bug) 1.812438 1.860977 1.844850 date(2015) or branch(default) 0.968276 1.005824 0.994673 author(mpm) or desc(bug) or date(2015) or extra(rebase_source) 3.656193 3.743381 3.721032

changelog: add class to represent parsed changelog revisions Currently, changelog entries are parsed into their respective components at read time. Many operations are only interested in a subset of fields of a changelog entry. The parsing and storing of all the fields adds avoidable overhead. This patch introduces the "changelogrevision" class. It takes changelog raw text and exposes the parsed results as attributes. The code for parsing changelog entries has been moved into its construction function. changelog.read() has been modified to use the new class internally while maintaining its existing API. Future patches will make revision parsing lazy. We implement the construction function of the new class with __new__ instead of __init__ so we can use a named tuple to represent the empty revision. This saves overhead and complexity of coercing later versions of this class to represent an empty instance. While we are here, we add a method on changelog to obtain an instance of the new type. The overhead of constructing the new class regresses performance of revsets accessing this data: author(mpm) 0.896565 0.929984 desc(bug) 0.887169 0.935642 105% date(2015) 0.878797 0.908094 extra(rebase_source) 0.865446 0.922624 106% author(mpm) or author(greg) 1.801832 1.902112 105% author(mpm) or desc(bug) 1.812438 1.860977 date(2015) or branch(default) 0.968276 1.005824 author(mpm) or desc(bug) or date(2015) or extra(rebase_source) 3.656193 3.743381 Once lazy parsing is implemented, these revsets will all be faster than before. There is no performance change on revsets that do not access this data. There /could/ be a performance regression on operations that perform several changelog reads. However, I can't think of anything outside of revsets and `hg log` (basically the same as a revset) that would be impacted.