mercurial: Changelog

shelve: use matcher to restrict prefetch to just the modified files Shelve currently operates by: - make a temp commit - identify all the bases necessary to shelve, put them in the bundle - use exportfile to export the temp commit to the bundle ('file' here means "export to this fd", not "export this file") - remove the temp commit exportfile calls prefetchfiles, and prefetchfiles uses a matcher to restrict what files it's going to prefetch; if it's not provided, it's alwaysmatcher. This means that `hg shelve` in a remotefilelog repo can possibly download the file contents of everything in the repository, even when it doesn't need to. It luckily is restricted to the narrowspec (if there is one), but this is still a lot of downloading that's just unnecessary, especially if there's a "smart" VCS-aware filesystem involved. exportfile is called with exactly one revision to emit, so we're just restricting it to prefetching the files from that revision. The base revisions having separate files should not be a concern since they're handled already; example: commit 10 is draft and modifies foo/a.txt and foo/b.txt commit 11 is draft and modifies foo/a.txt my working directory that I'm shelving modifies foo/b.txt By the time we get to exportfile, commit 10 and 11 are already handled, so the matcher only specifying foo/b.txt does not cause any problems. I verified this by doing an `hg unbundle` on the bundle that shelve produces, and getting the full contents of those commits back out, instead of just the files that were modified in the shelve. Differential Revision: https://phab.mercurial-scm.org/D5268

revlog: automatically read from opened file handles The revlog reading code commonly opens a new file handle for reading on demand. There is support for passing a file handle to revlog.revision(). But it is marked as an internal argument. When revlogs are written, we write() data as it is available. But we don't flush() data until all revisions are written. Putting these two traits together, it is possible for an in-process revlog reader during active writes to trigger the opening of a new file handle on a file with unflushed writes. The reader won't have access to all "available" revlog data (as it hasn't been flushed). And with the introduction of the previous patch, this can lead to the revlog raising an error due to a partial read. I witnessed this behavior when applying changegroup data (via `hg pull`) before issue6006 was fixed via different means. Having this and the previous patch in play would have helped cause errors earlier rather than manifesting as hash verification failures. While this has been a long-standing issue, I believe the relatively new delta computation code has tickled it into being more common. This is because the new delta computation code will compute deltas in more scenarios. This can lead to revlog reading. While the delta computation code is probably supposed to reuse file handles, it appears it isn't doing so in all circumstances. But the issue runs deeper than that. Theoretically, any code can access revision data during revlog writes. It appears we were just getting lucky that it wasn't. (The "add revision callback" passed to addgroup() provides an avenue to do this.) If I changed the revlog's behavior to not cache the full revision text or to clear caches after revision insertion during addgroup(), I was able to produce crashes 100% of the time when writing changelog revisions. This is because changelog's add revision callback attempts to resolve the revision data to access the changed files list. And without the revision's fulltext being cached, we performed a revlog read, which required opening a new file handle. This attempted to read unflushed data, leading to a partial read and a crash. This commit teaches the revlog to store the file handles used for writing multiple revisions during addgroup(). It also teaches the code for resolving a file handle when reading to use these handles, if available. This ensures that *any* reads (regardless of their source) use the active writing file handles, if available. These file handles have access to the unflushed data because they wrote it. This allows reads to complete without issue. Differential Revision: https://phab.mercurial-scm.org/D5267

revlog: detect incomplete revlog reads _readsegment() is supposed to return N bytes of revlog revision data starting at a file offset. Surprisingly, its behavior before this patch never verified that it actually read and returned N bytes! Instead, it would perform the read(), then return whatever data was available. And even more surprisingly, nothing in the call chain appears to have been validating that it received all the data it was expecting. This behavior could lead to partial or incomplete revision chunks being operated on. This could result in e.g. cached deltas being applied against incomplete base revisions. The delta application process would happily perform this operation. Only hash verification would detect the corruption and save us. This commit changes the behavior of raw revlog reading to validate that we actually read() the number of bytes that were requested. We will raise a more specific error faster, rather than possibly have it go undetected or manifest later in the call stack, at delta application or hash verification. Differential Revision: https://phab.mercurial-scm.org/D5266

revlog: use single file handle when de-inlining revlog _getsegmentforrevs() will eventually call into _datareadfp() to resolve a file handle to read revision data. If no file handle is passed into _getsegmentforrevs(), it opens a new one. Explicit is better than implicit. This commit changes _enforceinlinesize() to open a file handle explicitly when converting inline revlogs to split revlogs and to pass this file handle into _getsegmentforrevs(). I haven't measured, but this change should improve performance, as we no longer reopen the revlog for reading for every revision in the revlog when it is converted from inline to split. Instead, we open it at most once and use it for the duration of the operation. That being said, I /think/ the chunk cache may mitigate the number of file opens required. Differential Revision: https://phab.mercurial-scm.org/D5265

store: raise ProgrammingError if unable to decode a storage path Right now, the function magically return False which is dangerous, so let's raise ProgrammingError. Suggested by Augie in D5139. Differential Revision: https://phab.mercurial-scm.org/D5264

tests: document a known failing interaction between narrow and lfs This is one of the two remaining aborts I found looking into issue5794. I've got no idea what's wrong with the hook, since the changes there fixed the other two problems noted in that bug report. It seems like it might go away when the narrow issue is fixed, but let's make sure this doesn't get lost. The stacktrace for the hook seems to indicate that the missing file *is* in ctx: remote: Traceback (most recent call last): remote: File "c:\Users\Matt\projects\hg\hgext\lfs\__init__.py", line 253, in checkrequireslfs remote: if any(f in ctx and match(f) and ctx[f].islfs() for f in ctx.files()): remote: File "c:\Users\Matt\projects\hg\hgext\lfs\__init__.py", line 253, in <genexpr> remote: if any(f in ctx and match(f) and ctx[f].islfs() for f in ctx.files()): remote: File "c:\Users\Matt\projects\hg\hgext\lfs\wrapper.py", line 191, in filectxislfs remote: return _islfs(self.filelog(), self.filenode()) remote: File "c:\Users\Matt\projects\hg\mercurial\context.py", line 631, in filenode remote: return self._filenode remote: File "c:\Users\Matt\projects\hg\mercurial\util.py", line 1528, in __get__ remote: result = self.func(obj) remote: File "c:\Users\Matt\projects\hg\mercurial\context.py", line 579, in _filenode remote: return self._filelog.lookup(self._fileid) remote: File "c:\Users\Matt\projects\hg\mercurial\filelog.py", line 68, in lookup remote: self._revlog.indexfile) remote: File "c:\Users\Matt\projects\hg\mercurial\utils\storageutil.py", line 218, in fileidlookup remote: raise error.LookupError(fileid, identifier, _('no match found')) remote: LookupError: data/inside2/f.i@f59b4e021835: no match found

logtoprocess: drop support for ui.log() call with invalid msg arguments (BC) Before, the logtoprocess extension put a formatted message into $MSG1, and its arguments to $MSG2... If the specified arguments couldn't be formatted because of a caller bug, an unformatted message was passed in to $MSG1 instead of exploding. This behavior doesn't make sense. Since I'm planning to formalize the ui.log() interface such that we'll no longer have to extend the ui class, I want to remove any features not conforming to the ui.log() API. So this patch removes the support for ill-formed arguments, and $MSG{n} (where n > 1) parameters which seems useless as long as the message can be formatted. The $MSG1 variable isn't renamed for the maximum compatibility. In future patches, a formatted msg will be passed to a processlogger object, instead of overriding the ui.log() function. .. bc:: The logtoprocess extension no longer supports invalid ``ui.log()`` arguments. A log message is always formatted and passed in to the ``$MSG1`` environment variable.