Skip to content

Fix access violations when cloning/fetching via HTTPS #38

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 20, 2015

Conversation

dscho
Copy link
Member

@dscho dscho commented Mar 17, 2015

Wow. This has really taken quite a lot of time to debug. This developer alone spent 15 hours tracking it down and squashing it, and @nalla as well as @sschuberth helped as well.

@dscho
Copy link
Member Author

dscho commented Mar 17, 2015

I just realized that the most important place to call maybe_reinitialize_environ() is the make_environ_block() which was the actual culprit (but mingw_getenv() is called several times between curl_global_init() and that function, papering over my omission). Will fix tomorrow.

@nalla
Copy link

nalla commented Mar 17, 2015

That damn thing drove me crazy the last two days...

@dscho
Copy link
Member Author

dscho commented Mar 18, 2015

That damn thing drove me crazy the last two days...

Me, too, believe me! ;-) Thanks for helping me debug this!

It is unlikely that we have an empty environment, ever, but *if* we do,
when `environ_size - 1` is passed to `bsearchenv()` it is misinterpreted
as a real large integer.

To make the code truly defensive, refuse to do anything at all if the
size is negative (which should not happen, of course).

Signed-off-by: Johannes Schindelin <[email protected]>
@nalla
Copy link

nalla commented Mar 18, 2015

I wonder if

t5539-fetch-http-shallow.sh ........................ skipped: no web server found at '/usr/sbin/apache2'
t5540-http-push-webdav.sh .......................... skipped: no web server found at '/usr/sbin/apache2'
t5541-http-push-smart.sh ........................... skipped: no web server found at '/usr/sbin/apache2'
t5542-push-http-shallow.sh ......................... skipped: no web server found at '/usr/sbin/apache2'
t5550-http-fetch-dumb.sh ........................... skipped: no web server found at '/usr/sbin/apache2'
t5551-http-fetch-smart.sh .......................... skipped: no web server found at '/usr/sbin/apache2'
t5561-http-backend.sh .............................. skipped: no web server found at '/usr/sbin/apache2'

would have caught that error in the first place. If not, maybe we could add some general MinGW tests about environment handling?

@dscho dscho force-pushed the mingw-environment branch from ea60bc0 to b690df8 Compare March 18, 2015 07:37
@dscho
Copy link
Member Author

dscho commented Mar 18, 2015

I wonder if [web server tests] would have caught that error in the first place.

Possibly. But frankly, I do not want to install a full-fledged Apache Web Server just for that... ;-)

If we are serious to catch that bug if it reoccurs, we need to make a special test-mingw-environ.c that #undefs putenv, then deliberately messes with the environment and after that spawns some program like /usr/bin/true.exe.

@dscho
Copy link
Member Author

dscho commented Mar 18, 2015

I just realized that the most important place to call maybe_reinitialize_environ() is the make_environ_block() which was the actual culprit (but mingw_getenv() is called several times between curl_global_init() and that function, papering over my omission).

Fixed here.

@dscho
Copy link
Member Author

dscho commented Mar 18, 2015

Hold off! I am going to add one more commit.

@dscho
Copy link
Member Author

dscho commented Mar 18, 2015

Yep, that 15d371c was the commit I still wanted to have included. Now this PR is good to go for review and merge.

@dscho
Copy link
Member Author

dscho commented Mar 18, 2015

Actually, for the benefit of interested parties, I would like to share the strategy that eventually let me diagnose this problem properly and even fix it.

The first thing was to find out where this exit code 128 came from. Unfortunately, GIT_TRACE misled me substantially. The reason is that I had to run my tests in mintty to reproduce (in hindsight, it is clear that the CHARSET variable was set when starting in cmd, and therefore cURL did not modify the environment, hiding the problem) and unlike in cmd, the stderr in that mintty is not connected directly to the console. Therefore, the error messages that went to stderr just before the crash were not shown at all, putting me off track.

I found out about this problem after hours of hunting when I saw that the echo $? (that I put after the clone command to print out the exit status) printed in the middle of a truncated error message. I worked around this by applying this patch:

diff --git a/usage.c b/usage.c
index ed14645..d0e1fcd 100644
--- a/usage.c
+++ b/usage.c
@@ -146,6 +146,7 @@ int error(const char *err, ...)
    va_start(params, err);
    error_routine(err, params);
    va_end(params);
+fflush(stderr);
    return -1;
 }

At this point I could at least rely on my debug statements (for which I use error(...) because it is quicker to write than fprintf(stderr, ...) and it also appends a new-line character automatically) were printed to the console. Splendid.

Now another hours-long hunt started, where I put in debug statements into the likely code paths, starting with remote-curl.c's main() function, compile git-remote-https.exe, copy it into /mingw32/libexec/git-core/, run a clone and print the exit status. Lather, rinse, repeat. (Actually, I had already hunted with debug statements starting in the cmd_clone() function and had already identified git-remote-https as at least part of the culprit.)

To reduce the turn-around time, I had two windows open, one with vi to edit the code, and another where I ran a command-line like this:

make git-remote-https.exe && cp git-remote-https.exe /mingw32/libexec/git-core/ && git clone https://github.com/dscho/git dscho-git; echo $?

The && chain ensures that, say, a compile error prevents the latter commands from executing. Note that there is a semicolon before the echo printing the exit status because I expected the clone to fail most of the time.

Every once in a while, I would stage only the most recent debug statement with git add -p <file>, followed by git stash -k && git reset, essentially keeping only that debug statement and stashing all the others.

Eventually, I arrived at this diff (in addition to the error() diff shown above):

diff --git a/compat/mingw.c b/compat/mingw.c
index 38a79d0..5730141 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -990,6 +990,8 @@ static wchar_t *make_environment_block(char **deltaenv)

    /* create environment block from temporary environment */
    for (i = 0; tmpenv[i]; i++) {
+error("i: %d", i);
+error("strlen: %d", strlen(tmpenv[i]));
        size = 2 * strlen(tmpenv[i]) + 2; /* +2 for final \0 */
        ALLOC_GROW(wenvblk, (wenvpos + size) * sizeof(wchar_t), wenvsz);
        wenvpos += xutftowcs(&wenvblk[wenvpos], tmpenv[i], size) + 1;

Note that I had two debug statements because the second would not even be executed because the strlen() would already fail while building the parameter list.

Further debug statements revealed that tmpenv[i] pointed to an invalid address and caused the crash, and a subsequent compile cycle showed that tmpenv[environ_size - 1] was not NULL as assumed by our code.

Another debug statement later, it was clear that some code other than mingw_putenv() modified the environment between the last mingw_putenv() call and the call to make_environment_block().

This started another hunt – several hours, debugging is a patient persons' game – for the code that had modified that pointer outside of compat/mingw.c. To help with debugging this, I eventually came up with something much like this:

static char *dscho_zero = NULL;
static char **dscho_pointer = &dscho_zero;

void dscho_assert(const char *file, int lineno)
{
    if (*dscho_pointer)
        die("%s:%d NONONO!", file, lineno);
    error("%s:%d STILL OK", file, lineno);
}

In mingw_putenv(), I would insert this snippet:

if (environ_size == 69) dscho_pointer = environ + (environ_size - 1);

Then I would litter the likely code paths with copies of the following snippet:

{ extern void dscho_assert(const char*, int); dscho_assert(__FILE__, __LINE__); }

This eventually led to the realization that the environment changes in the call to curl_global_init() (being terminated in a different manner than expected by compat/mingw.c's environment handling).

From that point it was all smooth sailing: it turned out that the difference between the environment before and after that function call was the addition of CHARSET=cp1252 and inspecting the next environ entry revealed that it was not NULL as expected, but pointing to an empty string.

The assumption of Git's MinGW-specific environment code is that we have full control over it and can ensure that it is sorted and terminated by a NULL pointer. Obviously this assumption is incorrect (sorry, @kblees, I should have thought of that when I reviewed your PR in the first place, my mistake!).

The way I handled this is to detect when outside changes occur (it is not a totally fool-proof way, of course, it would have to be an extensive test that the environment is still sorted and terminated by a NULL pointer and that environ_size is correct, that would be pretty expensive I think) and fix it up to meet our assumptions again in that case.

Phew!

@dscho
Copy link
Member Author

dscho commented Mar 18, 2015

@sschuberth would you have time to review this PR? (I know that @t-b is busy...)

@sschuberth
Copy link

@dscho I'd have some time tomorrow morning again. And I probably need some time to understand the underlying issue. (Is it something about a memory region being reallocated and thus potentially moved, leaving our pointer to the original region dandling?)

@dscho
Copy link
Member Author

dscho commented Mar 19, 2015

Is it something about a memory region being reallocated and thus potentially moved, leaving our pointer to the original region dandling?

No, it is about extending the environment and then terminating it by a pointer to a single NUL character. Our code so far assumes that the environment is terminated by a NULL pointer, though. Therefore make_environment_block() just ran beyond the end of the environment and accessed uninitialized and invalid addresses.

The issue would be easily papered over simply by this fix: 15d371c363f5062cbd88104602f89291a0d83a58. No more running beyond the end of the environment, case closed, right?

Actually, no. The entire idea of initializing our own environment is so that we can ensure that it is UTF-8 encoded. For performance, @kblees wrote the code in such a way that it sorts the entries after conversion, to be able to use a binary search in mingw_getenv(). Of course, mingw_putenv() needs to ensure that the environment stays sorted, even when new entries are added.

Except that mingw_putenv() only overrides putenv() for Git's own code. When calling curl_global_init(), the called code side-steps mingw_putenv() and appends a new environment variable. Therefore, the assumption of mingw_getenv() that the environment is sorted now becomes incorrect.

The idea of b690df82565f900d2d91e0936e354a229ab8c5e0 is to detect that situation quickly and remedy it.

I agree that it is not a total solution, but it is definitely necessary to fix the bug.

I could imagine, for example, that the external code re-allocates the environment in such a way that realloc() (called from here) says: "Wait a minute, this was not malloc()ed in the first place, I am going to exit() with an error". I do not know that because I have not tested the behavior of outside putenv() thoroughly. It could also be worse, much, much worse, in that putenv() may assume to never need to grow at all and run beyond our allocated buffer.

There is also the possibility of a situation where a couple of environment variables get removed and then others get appended. If that happens between calls to maybe_reinitialize_environ() and the new size is at most the size of the old environment, my current code does not detect that it should reinitialize the environment: the entries may be unsorted at that stage. If such a situation becomes a possibility, we have to go back and detect when mingw_getenv() wants to return NULL, and look harder whether the environment is sorted (this is an O(n) operation, and in this loop we would also have to fall back to the non-binary search for the environment variable), possibly running qsort() afterwards to fix the problem.

Lastly, I have a hunch that we will eventually need to use critical sections to guard this code against multi-threaded usage.

But those are really all bridges I would like to cross later, when – and if! – needed.

@dscho
Copy link
Member Author

dscho commented Mar 19, 2015

I guess I should enhance the commit message. Will do so after running some errands, i.e. in a couple of hours.

Outside of our Windows-specific code, the end of the environment can be
marked also by a pointer to a NUL character, not only by a NULL pointer
as our code assumed so far.

That led to a buffer overrun in `make_environment_block()` when running
`git-remote-https` in `mintty` (because `curl_global_init()` added the
`CHARSET` environment variable *outside* of `mingw_putenv()`, ending the
environment in a pointer to an empty string).

Side note for future debugging on Windows: when running programs in
`mintty`, the standard input/output/error is not connected to a Win32
Console, but instead is pipe()d. That means that even stderr may not be
written completely before a crash, but has to be fflush()ed explicitly.
For example, when debugging crashes, the developer should insert an
`fflush(stderr);` at the end of the `error()` function defined in
usage.c.

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho force-pushed the mingw-environment branch from 15d371c to ac5283c Compare March 19, 2015 09:55
@dscho
Copy link
Member Author

dscho commented Mar 19, 2015

Okay, I think the commit messages are better now (and the order of the commits is more intuitive now, albeit no longer chronological). @sschuberth please have a look.

The environment is modified in most surprising circumstances, and not
all of them are under Git's control. For example, calling
curl_global_init() on Windows will ensure that the CHARSET variable is
set, adding one if necessary.

While the previous commit worked around crashes triggered by such
outside changes of the environment by relaxing the requirement that the
environment be terminated by a NULL pointer, the other assumption made
by `mingw_getenv()` and `mingw_putenv()` is that the environment is
sorted, for efficient lookup via binary search.

Let's make real sure that our environment is intact before querying or
modifying it, and reinitialize our idea of the environment if necessary.

With this commit, before working on the environment we look briefly for
indicators that the environment was modified outside of our control, and
to ensure that it is terminated with a NULL pointer and sorted again in
that case.

Note: the indicators are maybe not sufficient. For example, when a
variable is removed, it will not be noticed. It might also be a problem
if outside changes to the environment result in a modified `environ`
pointer: it is unclear whether such a modification could result in a
problem when `mingw_putenv()` needs to `realloc()` the environment
buffer.

For the moment, however, the current fix works well enough, so let's
only face the potential problems when (and if!) they occur.

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho force-pushed the mingw-environment branch from ac5283c to 22cc9c9 Compare March 19, 2015 12:09
@sschuberth
Copy link

Ok @dscho, feel free to merge.

dscho added a commit that referenced this pull request Mar 20, 2015
Fix access violations when cloning/fetching via HTTPS
@dscho dscho merged commit 8e0da87 into git-for-windows:master Mar 20, 2015
@dscho dscho deleted the mingw-environment branch March 20, 2015 06:55
@dscho
Copy link
Member Author

dscho commented Mar 20, 2015

@sschuberth please note that I would really prefer strongly if you merged PRs after you reviewed them and are fine with merging. It feels like quite a bit of really superfluous back and forth to say "feel free to merge" instead of just merging it, and it also feels wrong to have the original author merge her own work after others reviewed it – just imagine that you send a patch to the Git mailing list, Junio and others review it, say it is fine, and then you push it to the official repository? That just sounds wrong, does it not?

@sschuberth
Copy link

That just sounds wrong, does it not?

Not necessarily. It depends, on many things. For example:

In a project with "dictatorship" (I do not mean to make this sound negative), i.e. with a single person responsible for final approval, and very few persons having the permission to push, you have no choice but to make the reviewer (i.e. maintainer / owner) merge.

But in a project where almost all contributors have equal push rights (like ours), the past years in my day job have proven that merges by the owner make more sense, both to have the change to make additional changes that none of the reviewers missed (like you wanted to do here), and to have control over when the PR is integrated.

Edit: But as GitHub, as opposed to e.g. Gerrit, has no nice way of recording / documenting review ratings / acknowledgements, I'll following your wish and merge PRs in the future if I'm fine with them instead of leaving that to the author.

dscho added a commit that referenced this pull request Mar 21, 2015
Fix access violations when cloning/fetching via HTTPS
dscho pushed a commit that referenced this pull request Mar 24, 2015
Fix access violations when cloning/fetching via HTTPS
@kblees
Copy link

kblees commented Mar 25, 2015

Wow, kudos for tracking down this issue.

I believe this was caused by disabling USE_NED_ALLOCATOR in the Makefile in the MSys2 case. With nedmalloc enabled, MSVCRT's putenv simply fails with EINVAL. While this is not nice for 3rd party libraries that try to modify the environment, at least it doesn't destroy our internal data structures.

This raises a few questions:

  • Why disable nedmalloc in the first place? Are we prepared to live with the ~50% performance drop for every malloc / realloc?
  • If nedmalloc was just disabled to deal with cURL-allocated memory, wouldn't it be more sensible to use curl_global_init_mem(), passing git's xmalloc / xrealloc wrappers?
  • Why do we need special compile settings for MSys2 at all? Git is still a native windows (MinGW) application, right? Whether git.exe is called from cmd.exe, msys-bash/perl or msys2-bash/perl shouldn't matter. Or in other words: if we need special compile-time settings to make git.exe work in msys2-whatever, chances are it will no longer work in cmd.exe, mingw-tcl etc..
  • Wouldn't 3rd party libs using the environment via stock MSVCRT functions expect it to be GetACP()-encoded? E.g. libz trying to use our UTF-8-encoded $TEMP or $PATH will most likely fail if the paths contain non-ASCII characters.

Especially the last point leads me to the conclusion that the issue can only be properly fixed by separating our environment from the MSVCRT environment. Something like this:

diff --git a/compat/mingw.c b/compat/mingw.c
index 85dfdb9..819ccb6 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -964,6 +964,8 @@ static int do_putenv(char **env, const char *name, int size, int free_old);
 static int environ_size = 0;
 /* allocated size of environ array, in bytes */
 static int environ_alloc = 0;
+/* our (UTF-8-encoded) version of the environment */
+char **mingw_environ;

 /*
  * Create environment block suitable for CreateProcess. Merges current
diff --git a/compat/mingw.h b/compat/mingw.h
index 5adc91c..e79b22f 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -213,6 +213,11 @@ char *mingw_getenv(const char *name);
 int mingw_putenv(const char *namevalue);
 #define putenv mingw_putenv
 #define unsetenv mingw_putenv
+extern char **mingw_environ;
+#ifdef environ
+#undef environ
+#endif
+#define environ mingw_environ

 int mingw_gethostname(char *host, int namelen);
 #define gethostname mingw_gethostname

This means that environment changes by git will not be seen by 3rd party libs and vice versa, but I think for the problem at hand this is acceptable.

Regarding this patch series, I believe the first two patches don't fix anything, and the commit messages are rather misleading than helpful. For example, environ_size is always at least 1 after being initialized in mingw_startup, even if the environment is empty. And if some developer happens to call mingw_getenv in mingw_startup before initializing the environment, I feel it is safer to just segfault than being 'defensive' and return NULL.

@dscho
Copy link
Member Author

dscho commented Mar 26, 2015

Why disable nedmalloc in the first place?

That was almost certainly an oversight.

Especially the last point leads me to the conclusion that the issue can only be properly fixed by separating our environment from the MSVCRT environment.

I understand that it is tempting to build on top of the original idea, but the fundamental issue that was unveiled by my bug hunt is that we play games with the environment. This PR is just a hot-fix so that we can keep going forward to a Git for Windows 2.x release, but the clean solution is to stop playing games with the environment, not to enhance the games.

I believe the first two patches don't fix anything,

You are correct. That is why they clearly state to introduce defensive programming, i.e. fending off future bugs.

the commit messages are rather misleading than helpful. For example, environ_size is always at least 1 after being initialized in mingw_startup, even if the environment is empty.

Well, please understand that this is unnecessarily offensive a statement.

The commit message states exactly what is going on:

It is unlikely that we have an empty environment, ever, but if we do, when environ_size - 1 is passed to bsearchenv() it is misinterpreted as a real large integer.

Yes, it implies that there is a significant error in having environ_size a signed integer and then silently converting it to an unsigned integer by passing it to bsearchenv() without any boundary checking. That error aside, the part "but if we do" is not theoretical at all: This developer was bitten quite well indeed by the undocumented and unnecessary assumption that mingw_getenv() never be called before mingw_startup() completes.

While it is true that the current code as of master does not access the environment before mingw_startup() is done initializing the environ/environ_size pair, good practice dictates that the code be defensive enough to work even if that assumption is broken. The simple rationale: it is all too easy for a different developer to not even know about this assumption. Or even for the same developer who introduced that assumption, six months from now.

This is not a theoretic issue, but a very practical one: The lack of this defensive programming cost me a sizable chunk of the two days I had to spend hunting this bug.

To recapitulate (because you have not followed this issue during the time it was current): there was a Heisenbug which prevented fetching via HTTPS. In the beginning, I could only reproduce by running the entire net installer (and not always; sometimes it would succeed), and the worst part: the symptom was simply a failed clone. No error message, no nothing. In hindsight it is very easy to see that the debugging was made especially hard due to the fact that it was an unreported segmentation fault in git-remote-https.exe that was spawned off of the transport helper, which times out if the called program does not react within a certain time window. Not knowing any of this, but realizing that attaching a gdb does not get me anywhere, I was stuck to introducing debug print statements (as described above; you should really read it) and ended up chasing my own tail for a bit because already the getenv(...) call caused a crash, but not the one I was trying to debug. I have to admit that I was unpleasantly surprised to find that our getenv() replacement allowed a crash so easily.

Please also understand that after having to spend such an amount of time on fixing this bug, the comment "And if some developer happens to call mingw_getenv in mingw_startup before initializing the environment, I feel it is safer to just segfault than being 'defensive' and return NULL" do come over quite as really unnecessarily smug and unhelpful. The entire reason why I had so much trouble with debugging this was that there were fragile assumptions in that code, and that it just "segfaulted" with no useful information whatsoever. We can do better than this.

dscho referenced this pull request Apr 7, 2015
Make sure that the nedmalloc patch is applied before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants