PHPLIB-461: Fix functional tests for change streams on sharded clusters #659

alcaeus · 2019-08-05T07:44:27Z

https://jira.mongodb.org/browse/PHPLIB-461

Due to the way getMore works on sharded clusters, we can't rely on the first getMore call after an insert to return the data we expect. I'll quote the response from the query team to my questions about why getMore behaves this way.

First, it will be useful to understand the semantics of getMore on a tailable awaitData cursor such as that created by $changeStream:

If there are documents available when the getMore starts running, then getMore will read up to batchSize OR EOF, whichever comes first. It will then return immediately without waiting for any further results.

If there are no documents available when the getMore starts running, it will wait up to the specified awaitData timeout to see if any documents become available.

If documents become available while the getMore is waiting, it will adopt the behaviour outlined in (1). In other words: once the getMore has obtained at least one result, it cancels its awaitData timeout and does not wait again.

The second thing to understand is the behaviour of change streams in a sharded cluster. For reasons outlined in this comment, change streams must open a cursor on every shard even when monitoring an unsharded collection. Because each change stream is globally-sorted, it cannot return result A from shard X until it has obtained at least one result from every other shard that sorts later than result A. In other words, it cannot return A to the client until every other shard has promised that they will never produce a result that sorts before A. To guard against the possibility that one or more shards never return any results, each shard has a no-op writer that updates its oplog every periodicNoopIntervalSecs seconds. This allows the shard to return a high-water-mark promising that it will never return a future result which sorts before that point, so we are guaranteed that the change stream will eventually be allowed to return result A to the client.

With an maxAwaitTimeMS of 500 ms we stay below the minimum periodicNoopIntervalSecs of 1 second, which can cause our getMore commands on shards without data to time out before a no-op is read from the oplog. One way to solve this would be to increase the maxAwaitTimeMS to a value above 1000 ms to ensure that getMore calls don't time out before another no-op is written, but this causes tests to be significantly slower. Furthermore, we should never make any assumptions as to what batch a change stream event is returned in, only that it will eventually be returned in the correct order. To ensure the tests don't unnecessarily fail on sharded clusters, they will now execute multiple getMore calls if necessary, only failing if a given number (currently 5) of getMore calls fails to return any data. This was also necessary since resume operations would still yield empty batches on the initial call to aggregate.

This PR also contains a workaround for PHPC-1419, where a NonResumableChangeStreamError error label is not applied to the ServerException create for a ChangeStreamFatalError. This causes the iterator to repeatedly retry resuming the change stream for an unresumable error. This workaround can be removed once the above issue was fixed.

alcaeus · 2019-08-05T08:08:25Z

Note: build failure is because of a segmentation fault. I highly suspect it's for the same reason that @jmikola is investigating in #638.

tests/Operation/WatchFunctionalTest.php

alcaeus · 2019-09-02T10:58:35Z

Review feedback addressed in separate commit. I'll squash once the build is stable and after updating to a newer driver release that fixes PHPC-1419.

tests/Operation/WatchFunctionalTest.php

jmikola · 2019-09-04T01:27:04Z

tests/FunctionalTestCase.php

@@ -334,9 +334,6 @@ protected function skipIfChangeStreamIsNotSupported()
                if (! $this->isShardedClusterUsingReplicasets()) {
                    $this->markTestSkipped('$changeStream is only supported with replicasets');
                }
-
-                // Temporarily skip tests because of an issue with change streams in the driver
-                $this->markTestSkipped('$changeStreams currently don\'t on replica sets');


The PR itself LGTM, but I realized that removing this skip re-introduced the test failures.

I assume those were not happening in previous Travis builds since this skip was first introduced in 2008bc7 as part of #655?

We can discuss offline about whether to merge as-is for now and defer sorting out the double-free for replica set and sharded replica set tests in PHPLIB-432.

Yes and no. Not sure what made the errors appear consistently, but the skip only affected sharded clusters, not the replicaset build. With this branch, the replicaset build always fails while the build on a sharded cluster at least passes sometimes.

Since this entire PR deals with fixing tests on sharded clusters, I don't see a reason why to merge this without running the tests on a sharded cluster. I'd first like to investigate the cause for the double free some more before just plain skipping those tests that cause the failures.

jmikola · 2019-09-04T22:41:51Z

@alcaeus: Regarding the unexpected passing builds despite the lingering double-free issue (PHPLIB-432), I noticed that the job for this most recent, passing build ran on PHP 7.3.9 (built August 30). IIRC, previous failed builds were using 7.3.2 (built February 11).

I wasn't able to find the Travis build history for the branch you were using to research this issue, but you may want to check that to see if your recently passing builds also display a new version. If so, it's possible this was fixed upstream; however, I don't see anything in https://github.com/travis-ci/travis-ci/issues or https://github.com/travis-ci/php-src-builder to suggest it was an intentional fix.

EDIT: #638 is also passing after a fresh rebase on master using RS and sharded RS topologies from PHP 5.6 through 7.4snapshot.

alcaeus · 2019-09-05T09:33:17Z

I'm still observing the failure on PHP 7.3.9, see the most recent build log for example: https://travis-ci.com/alcaeus/mongo-php-library/jobs/231388623#L239. This branch was rebased to pull in latest changes from this PR, which interestingly enough no longer shows this behaviour, neither in the PR build on travis-ci.org nor in my branch build on travis-ci.com. I was afraid that the behaviour might only exhibit itself on travis-ci.com (since that uses different infrastructure IIRC), but that doesn't seem to be the case either.

Restarting the build a couple of times did not bring those failures back, so I'm even more at a loss than before.

alcaeus · 2019-11-19T08:12:13Z

@jmikola it looks like #694 got rid of the use-after-free issue we were seeing on previous builds. I re-ran the offending builds multiple times and was no longer able to provoke that behaviour.

alcaeus self-assigned this Aug 5, 2019

alcaeus requested a review from kvwalker August 5, 2019 07:45

alcaeus force-pushed the phplib-461 branch from b6edc40 to 361e4eb Compare August 8, 2019 05:26

alcaeus requested review from jmikola and removed request for kvwalker August 9, 2019 05:23

This was referenced Aug 20, 2019

PHPLIB-457: Audit and implement any missing change stream spec prose tests #662

Merged

Update to latest ext-mongodb and MongoDB server versions #669

Merged

alcaeus force-pushed the phplib-461 branch 3 times, most recently from d4fa1a0 to 72495a3 Compare August 28, 2019 11:05

jmikola reviewed Aug 30, 2019

View reviewed changes

jmikola reviewed Sep 2, 2019

View reviewed changes

tests/Operation/WatchFunctionalTest.php Outdated Show resolved Hide resolved

jmikola mentioned this pull request Sep 2, 2019

Assert initial resume token in resumeAfter test #676

Merged

alcaeus force-pushed the phplib-461 branch 2 times, most recently from dc74e7a to dbd1f0d Compare September 3, 2019 06:05

jmikola approved these changes Sep 4, 2019

View reviewed changes

jmikola mentioned this pull request Sep 4, 2019

PHPLIB-457: Add missing change stream prose test 14 #675

Merged

jmikola reviewed Sep 4, 2019

View reviewed changes

alcaeus force-pushed the phplib-461 branch from dbd1f0d to f6366c7 Compare September 4, 2019 18:20

alcaeus force-pushed the phplib-461 branch from f6366c7 to a3941c8 Compare September 5, 2019 14:06

alcaeus force-pushed the phplib-461 branch from 6bf6c2f to a096e20 Compare November 15, 2019 10:28

alcaeus added 2 commits November 15, 2019 11:39

PHPLIB-461: Fix ChangeStream tests on sharded clusters

32bade4

PHPLIB-461: Lower periodicNoopIntervalSecs

944e729

alcaeus force-pushed the phplib-461 branch 2 times, most recently from 8384953 to 944e729 Compare November 15, 2019 11:41

jmikola approved these changes Nov 21, 2019

View reviewed changes

alcaeus added a commit that referenced this pull request Nov 25, 2019

Merge pull request #659

ee0b453

alcaeus merged commit 944e729 into mongodb:master Nov 25, 2019

alcaeus deleted the phplib-461 branch February 15, 2021 13:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PHPLIB-461: Fix functional tests for change streams on sharded clusters #659

PHPLIB-461: Fix functional tests for change streams on sharded clusters #659

Uh oh!

alcaeus commented Aug 5, 2019 •

edited

Loading

Uh oh!

alcaeus commented Aug 5, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alcaeus commented Sep 2, 2019

Uh oh!

Uh oh!

jmikola Sep 4, 2019

Uh oh!

alcaeus Sep 4, 2019

Uh oh!

jmikola commented Sep 4, 2019 •

edited

Loading

Uh oh!

alcaeus commented Sep 5, 2019

Uh oh!

alcaeus commented Nov 19, 2019

Uh oh!

Uh oh!

PHPLIB-461: Fix functional tests for change streams on sharded clusters #659

PHPLIB-461: Fix functional tests for change streams on sharded clusters #659

Uh oh!

Conversation

alcaeus commented Aug 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alcaeus commented Aug 5, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

alcaeus commented Sep 2, 2019

Uh oh!

Uh oh!

jmikola Sep 4, 2019

Choose a reason for hiding this comment

Uh oh!

alcaeus Sep 4, 2019

Choose a reason for hiding this comment

Uh oh!

jmikola commented Sep 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alcaeus commented Sep 5, 2019

Uh oh!

alcaeus commented Nov 19, 2019

Uh oh!

Uh oh!

alcaeus commented Aug 5, 2019 •

edited

Loading

jmikola commented Sep 4, 2019 •

edited

Loading