-
Notifications
You must be signed in to change notification settings - Fork 266
PHPLIB-461: Fix functional tests for change streams on sharded clusters #659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d4fa1a0
to
72495a3
Compare
Review feedback addressed in separate commit. I'll squash once the build is stable and after updating to a newer driver release that fixes PHPC-1419. |
dc74e7a
to
dbd1f0d
Compare
@@ -334,9 +334,6 @@ protected function skipIfChangeStreamIsNotSupported() | |||
if (! $this->isShardedClusterUsingReplicasets()) { | |||
$this->markTestSkipped('$changeStream is only supported with replicasets'); | |||
} | |||
|
|||
// Temporarily skip tests because of an issue with change streams in the driver | |||
$this->markTestSkipped('$changeStreams currently don\'t on replica sets'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR itself LGTM, but I realized that removing this skip re-introduced the test failures.
I assume those were not happening in previous Travis builds since this skip was first introduced in 2008bc7 as part of #655?
We can discuss offline about whether to merge as-is for now and defer sorting out the double-free for replica set and sharded replica set tests in PHPLIB-432.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and no. Not sure what made the errors appear consistently, but the skip only affected sharded clusters, not the replicaset build. With this branch, the replicaset build always fails while the build on a sharded cluster at least passes sometimes.
Since this entire PR deals with fixing tests on sharded clusters, I don't see a reason why to merge this without running the tests on a sharded cluster. I'd first like to investigate the cause for the double free some more before just plain skipping those tests that cause the failures.
@alcaeus: Regarding the unexpected passing builds despite the lingering double-free issue (PHPLIB-432), I noticed that the job for this most recent, passing build ran on PHP 7.3.9 (built August 30). IIRC, previous failed builds were using 7.3.2 (built February 11). I wasn't able to find the Travis build history for the branch you were using to research this issue, but you may want to check that to see if your recently passing builds also display a new version. If so, it's possible this was fixed upstream; however, I don't see anything in https://github.com/travis-ci/travis-ci/issues or https://github.com/travis-ci/php-src-builder to suggest it was an intentional fix. EDIT: #638 is also passing after a fresh rebase on |
I'm still observing the failure on PHP 7.3.9, see the most recent build log for example: https://travis-ci.com/alcaeus/mongo-php-library/jobs/231388623#L239. This branch was rebased to pull in latest changes from this PR, which interestingly enough no longer shows this behaviour, neither in the PR build on travis-ci.org nor in my branch build on travis-ci.com. I was afraid that the behaviour might only exhibit itself on travis-ci.com (since that uses different infrastructure IIRC), but that doesn't seem to be the case either. Restarting the build a couple of times did not bring those failures back, so I'm even more at a loss than before. |
8384953
to
944e729
Compare
https://jira.mongodb.org/browse/PHPLIB-461
Due to the way
getMore
works on sharded clusters, we can't rely on the firstgetMore
call after an insert to return the data we expect. I'll quote the response from the query team to my questions about whygetMore
behaves this way.With an
maxAwaitTimeMS
of 500 ms we stay below the minimumperiodicNoopIntervalSecs
of 1 second, which can cause ourgetMore
commands on shards without data to time out before a no-op is read from the oplog. One way to solve this would be to increase themaxAwaitTimeMS
to a value above 1000 ms to ensure thatgetMore
calls don't time out before another no-op is written, but this causes tests to be significantly slower. Furthermore, we should never make any assumptions as to what batch a change stream event is returned in, only that it will eventually be returned in the correct order. To ensure the tests don't unnecessarily fail on sharded clusters, they will now execute multiplegetMore
calls if necessary, only failing if a given number (currently 5) ofgetMore
calls fails to return any data. This was also necessary since resume operations would still yield empty batches on the initial call toaggregate
.This PR also contains a workaround for PHPC-1419, where a
NonResumableChangeStreamError
error label is not applied to theServerException
create for aChangeStreamFatalError
. This causes the iterator to repeatedly retry resuming the change stream for an unresumable error. This workaround can be removed once the above issue was fixed.