Commit f93d781

Author: Divjot Arora
SPEC-1505 Use a whitelist for classifying change stream errors (#736)
1 parent dca9840 commit f93d781

8 files changed: 5810 additions and 30 deletions

source/change-streams/change-streams.rst

Lines changed: 53 additions & 25 deletions
@@ -9,8 +9,8 @@ Change Streams
 :Status: Accepted
 :Type: Standards
 :Minimum Server Version: 3.6
-:Last Modified: April 3, 2019
-:Version: 1.7.0
+:Last Modified: February 10, 2020
+:Version: 1.8.0
 
 .. contents::
 
@@ -43,35 +43,60 @@ Resumable Error
 
 An error is considered resumable if it meets any of the following criteria:
 
-- any error encountered which is not a server error (e.g. a timeout error or
+- Any error encountered which is not a server error (e.g. a timeout error or
   network error)
 
-- *any* server error response from a getMore command excluding those
-  containing the error label `NonResumableChangeStreamError` and those
-  containing the following error codes
+- For servers with wire version 9 or higher (server version 4.4 or higher), any
+  server error with the `ResumableChangeStreamError` error label.
+
+- For servers with wire version less than 9, a server error with one of the
+  following codes:
 
   .. list-table::
     :header-rows: 1
 
     * - Error Name
       - Error Code
-    * - Interrupted
-      - 11601
-    * - CappedPositionLost
-      - 136
-    * - CursorKilled
-      - 237
+    * - HostUnreachable
+      - 6
+    * - HostNotFound
+      - 7
+    * - NetworkTimeout
+      - 89
+    * - ShutdownInProgress
+      - 91
+    * - PrimarySteppedDown
+      - 189
+    * - ExceededTimeLimit
+      - 262
+    * - SocketException
+      - 9001
+    * - NotMaster
+      - 10107
+    * - InterruptedAtShutdown
+      - 11600
+    * - InterruptedDueToReplStateChange
+      - 11602
+    * - NotMasterNoSlaveOk
+      - 13435
+    * - NotMasterOrSecondary
+      - 13436
+    * - StaleShardVersion
+      - 63
+    * - StaleEpoch
+      - 150
+    * - StaleConfig
+      - 13388
+    * - RetryChangeStream
+      - 234
+    * - FailedToSatisfyReadPreference
+      - 133
+    * - ElectionInProgress
+      - 216
 
 An error on an aggregate command is not a resumable error. Only errors on a
 getMore command may be considered resumable errors.
 
-The criteria for resumable errors is similar to the discussion in the SDAM
-spec's section on `Error Handling`_, but includes additional error codes. See
-`What do the additional error codes mean?`_ for the reasoning behind these
-additional errors.
-
-.. _Error Handling: ../server-discovery-and-monitoring/server-discovery-and-monitoring.rst#error-handling
-
 --------
 Guidance
 --------
@@ -439,7 +464,7 @@ A change stream MUST track the last resume token, per `Updating the Cached Resum
 
 Drivers MUST raise an error on the first document received without a resume token (e.g. the user has removed ``_id`` with a pipeline stage), and close the change stream. The error message SHOULD resemble “Cannot provide resume functionality when the resume token is missing”.
 
-A change stream MUST attempt to resume a single time if it encounters any resumable error. A change stream MUST NOT attempt to resume on any other type of error, with the exception of a “not master” server error. If a driver receives a “not master” error (for instance, because the primary it was connected to is stepping down), it will treat the error as a resumable error and attempt to resume.
+A change stream MUST attempt to resume a single time if it encounters any resumable error per `Resumable Error`_. A change stream MUST NOT attempt to resume on any other type of error.
 
 In addition to tracking a resume token, change streams MUST also track the read preference specified when the change stream was created. In the event of a resumable error, a change stream MUST perform server selection with the original read preference before attempting to resume.
 
@@ -676,13 +701,14 @@ It was decided to remove this example from the specification for the following r
 - There is something to be said for an API that allows cooperation by default. The model in which a call to next only blocks until any response is returned (even an empty batch), allows for interruption and cooperation (e.g. interaction with other event loops).
 
 ----------------------------------------
-What do the additional error codes mean?
+Why is a whitelist of error codes preferable to a blacklist?
 ----------------------------------------
 
-The `CursorKilled` or `Interrupted` error implies some other actor killed the cursor.
-
-The `CappedPositionLost` error implies falling off of the back of the oplog,
-so resuming is impossible.
+Change streams originally used a blacklist of error codes to determine which errors were not resumable. However, this
+allowed for the possibility of infinite resume loops if an error was not correctly blacklisted. Due to the fact that
+all errors aside from transient issues such as failovers are not resumable, the resume behavior was changed to use a
+whitelist. Part of this change was to introduce the ``ResumableChangeStreamError`` label so the server can add new error
+codes to the whitelist without requiring changes to drivers.
 
 -------------------------------------------------------------------------------------------
 Why do we need to send a default ``startAtOperationTime`` when resuming a ``ChangeStream``?
@@ -795,3 +821,5 @@ Changelog
 | 2019-07-15 | Clarify resume process for change streams started with     |
 |            | the ``startAfter`` option.                                 |
 +------------+------------------------------------------------------------+
+| 2020-02-10 | Changed error handling approach to use a whitelist         |
++------------+------------------------------------------------------------+
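
Reviewer's sketch: the classification rule that the new "Resumable Error" text describes could look roughly like the Python below. This is an illustration under stated assumptions, not driver code. The names `ServerError`, `is_resumable`, `command_name`, and `max_wire_version` are hypothetical; only the error codes, the label name, and the wire version threshold come from the diff.

```python
# Codes copied from the table in the change-streams.rst diff
# (they apply only to servers with wire version < 9).
RESUMABLE_CODES = frozenset({
    6, 7, 89, 91, 189, 262, 9001, 10107, 11600, 11602,
    13435, 13436, 63, 150, 13388, 234, 133, 216,
})
RESUMABLE_LABEL = "ResumableChangeStreamError"


class ServerError(Exception):
    """Hypothetical stand-in for a driver's server-error type."""

    def __init__(self, code, labels=()):
        super().__init__(f"server error {code}")
        self.code = code
        self.labels = frozenset(labels)


def is_resumable(error, command_name, max_wire_version):
    """Return True if `error` should trigger a single resume attempt."""
    # Only errors on getMore may be resumable; aggregate errors never are.
    if command_name != "getMore":
        return False
    # Anything that is not a server error (e.g. timeout, network error).
    if not isinstance(error, ServerError):
        return True
    # Wire version 9+ (server 4.4+): trust the server-supplied label.
    if max_wire_version >= 9:
        return RESUMABLE_LABEL in error.labels
    # Older servers: fall back to the whitelist of codes.
    return error.code in RESUMABLE_CODES
```

With this shape, error code 50 (the code used by the new spec test) is not resumable on either path unless a 4.4+ server attaches the label.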

source/change-streams/tests/README.rst

Lines changed: 8 additions & 5 deletions
@@ -93,7 +93,10 @@ Spec Test Runner
 
 Before running the tests
 
-- Create a MongoClient ``globalClient``, and connect to the server
+- Create a MongoClient ``globalClient``, and connect to the server.
+  When executing tests against a sharded cluster, ``globalClient`` must only connect to one mongos. This is because tests
+  that set failpoints will only work consistently if both the ``configureFailPoint`` and failing commands are sent to the
+  same mongos.
 
 For each YAML file, for each element in ``tests``:
 
@@ -128,8 +131,8 @@ For each YAML file, for each element in ``tests``:
 - If there are any ``expectations``
 
   - For each (``expected``, ``idx``) in ``expectations``
-
-    - Assert that ``actual[idx]`` MATCHES ``expected``
+    - If ``actual[idx]`` is a ``killCursors`` event, skip it and move to ``actual[idx+1]``.
+    - Else assert that ``actual[idx]`` MATCHES ``expected``
 
 - Close the MongoClient ``client``
 
@@ -155,12 +158,12 @@ The following tests have not yet been automated, but MUST still be tested. All t
 #. ``ChangeStream`` will throw an exception if the server response is missing the resume token (if wire version is < 8, this is a driver-side error; for 8+, this is a server-side error)
 #. After receiving a ``resumeToken``, ``ChangeStream`` will automatically resume one time on a resumable error with the initial pipeline and options, except for the addition/update of a ``resumeToken``.
 #. ``ChangeStream`` will not attempt to resume on any error encountered while executing an ``aggregate`` command. Note that retryable reads may retry ``aggregate`` commands. Drivers should be careful to distinguish retries from resume attempts. Alternatively, drivers may specify `retryReads=false` or avoid using a [retryable error](../../retryable-reads/retryable-reads.rst#retryable-error) for this test.
-#. ``ChangeStream`` will not attempt to resume after encountering error code 11601 (Interrupted), 136 (CappedPositionLost), or 237 (CursorKilled) while executing a ``getMore`` command.
+#. **Removed**
 #. ``ChangeStream`` will perform server selection before attempting to resume, using initial ``readPreference``
 #. Ensure that a cursor returned from an aggregate command with a cursor id and an initial empty batch is not closed on the driver side.
 #. The ``killCursors`` command sent during the "Resume Process" must not be allowed to throw an exception.
 #. ``$changeStream`` stage for ``ChangeStream`` against a server ``>=4.0`` and ``<4.0.7`` that has not received any results yet MUST include a ``startAtOperationTime`` option when resuming a change stream.
-#. ``ChangeStream`` will resume after a ``killCursors`` command is issued for its child cursor.
+#. **Removed**
 #. For a ``ChangeStream`` under these conditions:
 
    - Running against a server ``>=4.0.7``.
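
Reviewer's sketch: one possible reading of the updated expectation-matching step in the test runner, in Python. The README wording leaves open how skipped ``killCursors`` events shift later comparisons; this sketch assumes a skip advances a separate position in the captured events. The event shape (a dict with a `command_name` key), the function name `check_expectations`, and the `matches` callback (standing in for the README's MATCHES relation) are all hypothetical.

```python
def check_expectations(actual, expectations, matches):
    """Compare captured command events to expectations, ignoring killCursors.

    `actual` is a list of captured events, `expectations` the list from the
    test file, and `matches(actual_event, expected)` returns a bool.
    """
    pos = 0  # position in `actual`, which may run ahead of the expectation index
    for expected in expectations:
        # Skip any killCursors events before comparing.
        while pos < len(actual) and actual[pos]["command_name"] == "killCursors":
            pos += 1
        assert pos < len(actual), "fewer captured events than expectations"
        assert matches(actual[pos], expected), f"event at {pos} does not match"
        pos += 1
```

A runner would call this once per test, after the operations have run and the change stream has been iterated.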

source/change-streams/tests/change-streams-errors.json

Lines changed: 47 additions & 0 deletions
@@ -108,6 +108,53 @@
           ]
         }
       }
+    },
+    {
+      "description": "change stream errors on MaxTimeMSExpired",
+      "minServerVersion": "4.2",
+      "failPoint": {
+        "configureFailPoint": "failCommand",
+        "mode": {
+          "times": 1
+        },
+        "data": {
+          "failCommands": [
+            "getMore"
+          ],
+          "errorCode": 50,
+          "closeConnection": false
+        }
+      },
+      "target": "collection",
+      "topology": [
+        "replicaset",
+        "sharded"
+      ],
+      "changeStreamPipeline": [
+        {
+          "$project": {
+            "_id": 0
+          }
+        }
+      ],
+      "changeStreamOptions": {},
+      "operations": [
+        {
+          "database": "change-stream-tests",
+          "collection": "test",
+          "name": "insertOne",
+          "arguments": {
+            "document": {
+              "z": 3
+            }
+          }
+        }
+      ],
+      "result": {
+        "error": {
+          "code": 50
+        }
+      }
     }
   ]
 }

source/change-streams/tests/change-streams-errors.yml

Lines changed: 29 additions & 0 deletions
@@ -73,3 +73,32 @@ tests:
       error:
         code: 280
         errorLabels: [ "NonResumableChangeStreamError" ]
+  -
+    description: change stream errors on MaxTimeMSExpired
+    minServerVersion: "4.2"
+    failPoint:
+      configureFailPoint: failCommand
+      mode: { times: 1 }
+      data:
+        failCommands: ["getMore"]
+        errorCode: 50 # An error code that's not on the old blacklist or whitelist
+        closeConnection: false
+    target: collection
+    topology:
+      - replicaset
+      - sharded
+    changeStreamPipeline:
+      -
+        $project: { _id: 0 }
+    changeStreamOptions: {}
+    operations:
+      -
+        database: *database_name
+        collection: *collection_name
+        name: insertOne
+        arguments:
+          document:
+            z: 3
+    result:
+      error:
+        code: 50
