
Commit 2bedff2

SPEC-1708 Increase replSetStepDown timeout and run replSetFreeze instead of retrying (#826)
Increase waitForPrimaryChange timeout to work around slow elections on Windows.
1 parent: db4c07b, commit: 2bedff2

4 files changed: +38 additions, -40 deletions


source/connections-survive-step-down/tests/README.rst

Lines changed: 4 additions & 15 deletions
@@ -64,26 +64,16 @@ Perform the following operations:
 - Insert 5 documents into a collection with a majority write concern.
 - Start a find operation on the collection with a batch size of 2, and
 retrieve the first batch of results.
-- Send a ``{replSetStepDown: 5, force: true}`` command to the current primary and verify that
+- Send a ``{replSetFreeze: 0}`` command to any secondary and verify that the
+command succeeded. This command will unfreeze the secondary and ensure that
+it will be eligible to be elected immediately.
+- Send a ``{replSetStepDown: 30, force: true}`` command to the current primary and verify that
 the command succeeded.
 - Retrieve the next batch of results from the cursor obtained in the find
 operation, and verify that this operation succeeded.
 - If the driver implements the `CMAP`_ specification, verify that no new `PoolClearedEvent`_ has been
 published. Otherwise verify that `connections.totalCreated`_ in `serverStatus`_ has not changed.
 
-**Note:** The "replSetStepDown" command often fails with the following
-transient error (see `SERVER-48154`_)::
-
-{
-"ok" : 0,
-"errmsg" : "Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 1000ms. opId: 922, op: conn30, connId: 30.",
-"code" : 24,
-"codeName" : "LockTimeout",
-}
-
-When running the "replSetStepDown" command, drivers MUST retry until the
-command succeeds. The number of retries should be limited to avoid an infinite
-failure loop. For example, the Python driver uses a 10 second retry period.
 
 Not Master - Keep Connection Pool
 `````````````````````````````````
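The updated steps send two commands in a fixed order: first ``{replSetFreeze: 0}`` to a secondary, then ``{replSetStepDown: 30, force: true}`` to the primary, checking that each succeeds. A minimal sketch of that sequence, assuming a hypothetical `run_command(target, command)` hook that sends a command document to a member chosen by role and returns the server reply; the function names are illustrative, and how a driver test actually targets "any secondary" is up to that driver.

```python
from typing import Any, Callable, Dict

# Hypothetical hook: run_command(target, command) sends `command` to a member
# selected by role ("secondary" or "primary") and returns the server reply.
RunCommand = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def check_ok(reply: Dict[str, Any], name: str) -> None:
    """Raise if the server reply does not report success (ok: 1)."""
    if reply.get("ok") != 1:
        raise RuntimeError(f"{name} failed: {reply!r}")


def step_down_with_unfrozen_secondary(run_command: RunCommand, step_down_secs: int = 30) -> None:
    # Unfreeze a secondary so that it is eligible to be elected immediately.
    check_ok(run_command("secondary", {"replSetFreeze": 0}), "replSetFreeze")
    # Force the current primary to step down; the stepped-down member stays
    # ineligible to become primary for step_down_secs seconds.
    check_ok(
        run_command("primary", {"replSetStepDown": step_down_secs, "force": True}),
        "replSetStepDown",
    )
```

Because the stepped-down primary cannot be re-elected during the 30 second step-down period, a different member has to win the election, and the unfrozen secondary is already eligible to do so.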
@@ -186,4 +176,3 @@ server communication.
 .. _PoolClearedEvent: /source/connection-monitoring-and-pooling/connection-monitoring-and-pooling.rst#events
 .. _serverStatus: https://docs.mongodb.com/manual/reference/command/serverStatus
 .. _connections.totalCreated: https://docs.mongodb.com/manual/reference/command/serverStatus/#serverstatus.connections.totalCreated
-.. _SERVER-48154: https://jira.mongodb.org/browse/SERVER-48154
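This commit drops the README rule that drivers MUST retry ``replSetStepDown`` when it fails with the transient ``LockTimeout`` error (code 24, see SERVER-48154), with the number of retries limited to avoid an infinite failure loop (the note cites a 10 second retry period in the Python driver); the tests now run ``replSetFreeze`` instead. For comparison, a minimal sketch of that retired retry policy, assuming the command is wrapped in a zero-argument callable that returns the server reply. The function name, the fixed 0.1 s backoff, and the injectable clock and sleep are illustrative, not taken from any driver.

```python
import time
from typing import Any, Callable, Dict

LOCK_TIMEOUT = 24  # "code" of the transient LockTimeout error quoted in the removed note


def run_with_retry(
    run_command: Callable[[], Dict[str, Any]],
    retry_period_secs: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    backoff_secs: float = 0.1,
) -> Dict[str, Any]:
    """Retry a command that fails with LockTimeout until it succeeds or the
    retry period elapses; any other failure is raised immediately."""
    deadline = clock() + retry_period_secs
    while True:
        reply = run_command()
        if reply.get("ok") == 1:
            return reply
        # Give up on non-transient errors, or once the retry period is used up.
        if reply.get("code") != LOCK_TIMEOUT or clock() >= deadline:
            raise RuntimeError(f"command failed: {reply!r}")
        sleep(backoff_secs)
```

Bounding the loop by elapsed time rather than by attempt count keeps the worst-case test duration predictable regardless of how quickly each failed attempt returns.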

source/server-discovery-and-monitoring/tests/README.rst

Lines changed: 4 additions & 19 deletions
@@ -304,26 +304,12 @@ MongoClient from the one used for test operations. For example::
 
 - name: runAdminCommand
 object: testRunner
-command_name: replSetStepDown
+command_name: replSetFreeze
 arguments:
 command:
-replSetStepDown: 1
-secondaryCatchUpPeriodSecs: 1
-force: false
-
-**Note:** The "replSetStepDown" command often fails with the following
-transient error (see `SERVER-48154`_)::
-
-{
-"ok" : 0,
-"errmsg" : "Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 1000ms. opId: 922, op: conn30, connId: 30.",
-"code" : 24,
-"codeName" : "LockTimeout",
-}
-
-When running the "replSetStepDown" command, drivers MUST retry until the
-command succeeds. The number of retries should be limited to avoid an infinite
-failure loop. For example, the Python driver uses a 10 second retry period.
+replSetFreeze: 0
+readPreference:
+mode: Secondary
 
 waitForPrimaryChange
 ''''''''''''''''''''
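In the updated ``runAdminCommand`` example, a ``readPreference`` with mode ``Secondary`` sits next to ``command`` under ``arguments``, so the runner must route ``replSetFreeze`` to a secondary rather than to the primary. A sketch of how a test runner might interpret such an operation, assuming a hypothetical `run_admin(command, mode)` hook supplied by the driver (in PyMongo, for instance, `Database.command` accepts a `read_preference` argument). The helper names and the `command_name` consistency check are illustrative, not part of the test format.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical driver hook: sends `command` to the admin database using the
# given read preference mode (None when the operation specifies none).
AdminRunner = Callable[[Dict[str, Any], Optional[str]], Dict[str, Any]]

# The replSetFreeze operation added by this commit, as it appears in the JSON test.
OPERATION = json.loads("""
{
  "name": "runAdminCommand",
  "object": "testRunner",
  "command_name": "replSetFreeze",
  "arguments": {
    "command": {"replSetFreeze": 0},
    "readPreference": {"mode": "Secondary"}
  }
}
""")


def run_admin_command_op(op: Dict[str, Any], run_admin: AdminRunner) -> Dict[str, Any]:
    """Execute one runAdminCommand test operation via the injected hook."""
    if op["name"] != "runAdminCommand":
        raise ValueError(f"unexpected operation {op['name']!r}")
    args = op["arguments"]
    command = args["command"]
    # The first key of a command document names the command.
    if next(iter(command)) != op["command_name"]:
        raise ValueError("command_name does not match the command document")
    mode = args.get("readPreference", {}).get("mode")
    return run_admin(command, mode)
```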
@@ -456,4 +442,3 @@ Run the following test(s) on MongoDB 4.4+.
 .. Section for links.
 
 .. _Server Description Equality: /source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#server-description-equality
-.. _SERVER-48154: https://jira.mongodb.org/browse/SERVER-48154

source/server-discovery-and-monitoring/tests/integration/rediscover-quickly-after-step-down.json

Lines changed: 16 additions & 3 deletions
@@ -45,14 +45,27 @@
 "name": "recordPrimary",
 "object": "testRunner"
 },
+{
+"name": "runAdminCommand",
+"object": "testRunner",
+"command_name": "replSetFreeze",
+"arguments": {
+"command": {
+"replSetFreeze": 0
+},
+"readPreference": {
+"mode": "Secondary"
+}
+}
+},
 {
 "name": "runAdminCommand",
 "object": "testRunner",
 "command_name": "replSetStepDown",
 "arguments": {
 "command": {
-"replSetStepDown": 1,
-"secondaryCatchUpPeriodSecs": 1,
+"replSetStepDown": 30,
+"secondaryCatchUpPeriodSecs": 30,
 "force": false
 }
 }
@@ -61,7 +74,7 @@
 "name": "waitForPrimaryChange",
 "object": "testRunner",
 "arguments": {
-"timeoutMS": 5000
+"timeoutMS": 15000
 }
 },
 {
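The ``waitForPrimaryChange`` operation's ``timeoutMS`` goes from 5000 to 15000 here (and in the YAML source) to tolerate slow elections on Windows. A polling sketch of what that timeout bounds, assuming a hypothetical `current_primary()` callable that returns the address of the primary the client currently knows about, or `None`. A real runner would more likely wait on SDAM topology events than poll, and the 0.1 s poll interval is arbitrary.

```python
import time
from typing import Callable, Optional


def wait_for_primary_change(
    current_primary: Callable[[], Optional[str]],
    prior_primary: str,
    timeout_ms: int = 15000,
    poll_interval_secs: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the first known primary that differs from prior_primary,
    or raise TimeoutError once timeout_ms has elapsed."""
    deadline = clock() + timeout_ms / 1000.0
    while True:
        primary = current_primary()
        # None means no primary is known yet, e.g. while an election is running.
        if primary is not None and primary != prior_primary:
            return primary
        if clock() >= deadline:
            raise TimeoutError(
                f"primary did not change from {prior_primary} within {timeout_ms} ms"
            )
        sleep(poll_interval_secs)
```

Here `prior_primary` plays the role of the address saved by the preceding ``recordPrimary`` operation.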

source/server-discovery-and-monitoring/tests/integration/rediscover-quickly-after-step-down.yml

Lines changed: 14 additions & 3 deletions
@@ -31,19 +31,30 @@ tests:
 - _id: 4
 - name: recordPrimary
 object: testRunner
+# Unfreeze a secondary with replSetFreeze:0 to ensure a speedy election.
+- name: runAdminCommand
+object: testRunner
+command_name: replSetFreeze
+arguments:
+command:
+replSetFreeze: 0
+readPreference:
+mode: Secondary
 # Run replSetStepDown on the meta client.
 - name: runAdminCommand
 object: testRunner
 command_name: replSetStepDown
 arguments:
 command:
-replSetStepDown: 1
-secondaryCatchUpPeriodSecs: 1
+replSetStepDown: 30
+secondaryCatchUpPeriodSecs: 30
 force: false
 - name: waitForPrimaryChange
 object: testRunner
 arguments:
-timeoutMS: 5000
+# We use a relatively large timeout here to workaround slow
+# elections on Windows, possibly caused by SERVER-48154.
+timeoutMS: 15000
 # Rediscover the new primary.
 - name: insertMany
 object: collection
