Skip to content

PHPLIB-467: Connections survive primary step-down #667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 28, 2019

Conversation

alcaeus
Copy link
Member

@alcaeus alcaeus commented Aug 13, 2019

@alcaeus alcaeus requested a review from kvwalker August 13, 2019 12:10
@alcaeus alcaeus self-assigned this Aug 13, 2019
@alcaeus alcaeus changed the title PHPC-1294: Connections survive primary step-down [WIP] PHPC-1294: Connections survive primary step-down Aug 13, 2019
@alcaeus alcaeus removed the request for review from kvwalker August 13, 2019 12:13
@jmikola
Copy link
Member

jmikola commented Aug 20, 2019

FYI: I opened mongodb/specifications#634 to clarify topology requirements in the spec test.

Copy link
Member

@jmikola jmikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion for tweaking waitForMasterReelection(), but LGTM either way.

/** @var Collection */
private $collection;

private function doSetUp()
Copy link
Member

@jmikola jmikola Aug 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest dropping the collection under test on tearDown, since the tests write to it; however, I realized that we don't consistently do that across the test suite. For example: MongoDB\Tests\Collection\FunctionalTestCase::tearDown() drops collections after a successful test and leaves behind data after a failure in case we need to inspect it.

I don't think this is worth doing here, but I did want to share the thought. Long term, I'd be more interested in auditing/profiling the test suite to see where we can improve execution time by reducing things like drops. I know database drops can be fairly expensive, so those would be worth looking at to see if they serve a useful purpose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it for now. As you mentioned, leaving data behind to check out after test failures can make sense, and the spec tests drop collections during setup anyways. At a previous job I've played around with only dropping collections that were written to during tests (using APM to figure out which collections were written to), but the performance gain from that was negligible IIRC.

@alcaeus alcaeus changed the title [WIP] PHPC-1294: Connections survive primary step-down PHPC-1294: Connections survive primary step-down Aug 22, 2019
@alcaeus
Copy link
Member Author

alcaeus commented Aug 23, 2019

@jmikola it looks like the getMoreIteration test causes failures down the line, which unfortunately only appear when running the entire test suite. These errors disappear when skipping that test. Looking at the replica set configuration, it seems likely that there is in fact another step down sometime after running that initial step down: the first replica set member has a significantly higher priority, and in my local tests I was able to confirm that the first server becomes primary a short while after stepping down.

I'm not sure how to best fix this: giving the replica set members the same priority would definitely prevent another step down. The alternative is making the server step down for significantly longer than 5 seconds while hoping that there is no other step down during the tests. What do you think?

@jmikola
Copy link
Member

jmikola commented Aug 24, 2019

the first replica set member has a significantly higher priority

IIRC, the priorities were configured as such for PHPC since it had some tests that select primary or secondary nodes and assert their tags. Those are inherently fragile tests and I'm dubious of how much value they provide, but beyond that I don't think they have any relevance to PHPLIB. It's only coming up now because we evidently based the PHPLIB cluster configs on what we already had for PHPC.

I'd suggest removing the priority definitions and confirming that it has no adverse affects on the rest of the PHPLIB test suite (AFAIK, it should not). Bumping the replSetStepDown value higher may work today, but IMO it would just be kicking the can further down the road and invite a failure if our test suite runs beyond that delay.

@jmikola
Copy link
Member

jmikola commented Aug 24, 2019

I just realized that this was originally a PHPC ticket but you chose to implement in PHPLIB. If so, let's move the JIRA issue over to PHPLIB and revise the commit message.

It bears mentioning that we do have some existing tests in PHPC that start their own servers with MO, but those are mainly creating standalones to test fail points. I think you made the right call to use our existing cluster for these tests. This could have probably been implemented as phpt tests in PHPC, but your decision to use PHPLIB does avoid a possible conflict with fragile replica set tag tests I mentioned above, so all is well there :)

@alcaeus alcaeus force-pushed the phpc-1294 branch 2 times, most recently from b2996d4 to 91f63f1 Compare August 26, 2019 14:31
@alcaeus alcaeus changed the title PHPC-1294: Connections survive primary step-down PHPLIB-467: Connections survive primary step-down Aug 26, 2019
Copy link
Member

@jmikola jmikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small suggestions but LGTM if all is passing.

@@ -15,7 +15,6 @@
"smallfiles": true
},
"rsParams": {
"priority": 99,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these removals actually break some PHPLIB tests, or was that resolved? IIRC, you mentioned something about it on Monday.

Copy link
Member Author

@alcaeus alcaeus Aug 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They pass on travis-ci, but sometimes fail on my machine. I'm currently investigating the cause of these sporadic failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to figure out build failures, but at least they're not confined to my machine anymore. Re-adding the priority fixes the current failures but reintroduces others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a look at the libmongoc implementation of these tests, and they work exactly the same: grab the connection count from the current primary, perform an operation that fails, check connection count again. This works fine, unless the first test that triggers a stepdown ran. If it ran, connection counts are sometimes off, without knowing exactly why. Moving the stepdown test to the end solves this.

I feel bad moving the test, but I'm not sure what could be the cause of this. We don't manage connections at all, and the corresponding tests pass for libmongoc: https://evergreen.mongodb.com/task_log_raw/mongo_c_driver_gcc54_test_latest_replica_set_noauth_sasl_openssl_0acb4a539a62248c2a7f1e0b4a866340c463359c_19_08_27_00_33_53/0?type=T#L3694.

@alcaeus alcaeus force-pushed the phpc-1294 branch 2 times, most recently from d531beb to 3e50254 Compare August 28, 2019 10:39
Copy link
Member

@jmikola jmikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point about test ordering acknowledged. I'd suggest creating a follow-up ticket to address the fragile ordering. You can cross-reference it with this issue and have it depend on PHPC-1150 (SDAM monitoring).

alcaeus added a commit to alcaeus/mongo-php-library that referenced this pull request Aug 28, 2019
@alcaeus alcaeus merged commit 51081f4 into mongodb:master Aug 28, 2019
@alcaeus alcaeus deleted the phpc-1294 branch August 28, 2019 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants