# Retryable Write Tests

## Introduction

The YAML and JSON files in this directory are platform-independent tests meant to exercise a driver's implementation of
retryable writes. These tests utilize the [Unified Test Format](../../unified-test-format/unified-test-format.md).

Several prose tests, which are not easily expressed in YAML, are also presented in this file. Those tests will need to
be manually implemented by each driver.

Tests will require a MongoClient created with options defined in the tests. Integration tests will require a running
MongoDB cluster with server versions 3.6.0 or later. The `{setFeatureCompatibilityVersion: 3.6}` admin command will also
need to have been executed to enable support for retryable writes on the cluster. Some tests may have more stringent
version requirements depending on the fail points used.
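
For example, the compatibility version can be raised from `mongosh` (a sketch; note the server expects the version as a
string value):

```javascript
// Run against the admin database; enables 3.6 features, including
// retryable writes support, on the cluster.
db.adminCommand({ setFeatureCompatibilityVersion: "3.6" });
```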

## Use as Integration Tests

Integration tests are expressed in YAML and can be run against a replica set or sharded cluster as denoted by the
top-level `runOn` field. Tests that rely on the `onPrimaryTransactionalWrite` fail point cannot be run against a sharded
cluster because the fail point is not supported by mongos.

The tests exercise the following scenarios:

- Single-statement write operations
  - Each test expecting a write result will encounter at most one network error for the write command. Retry attempts
    should return without error and allow the operation to succeed. Observation of the collection state will assert
    that the write occurred at most once.
  - Each test expecting an error will encounter successive network errors for the write command. Observation of the
    collection state will assert that the write was never committed on the server.
- Multi-statement write operations
  - Each test expecting a write result will encounter at most one network error for some write command(s) in the batch.
    Retry attempts should return without error and allow the batch to ultimately succeed. Observation of the collection
    state will assert that each write occurred at most once.
  - Each test expecting an error will encounter successive network errors for some write command in the batch. The
    batch will ultimately fail with an error, but observation of the collection state will assert that the failing
    write was never committed on the server. We may observe that earlier writes in the batch occurred at most once.

We cannot test a scenario where the first and second attempts both encounter network errors but the write does actually
commit during one of those attempts. This is because (1) the fail point only triggers when a write would be committed
and (2) the skip and times options are mutually exclusive. That said, such a test would mainly assert the server's
correctness for at-most once semantics and is not essential to assert driver correctness.
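
The at-most-one-retry rule exercised by these scenarios can be sketched as follows (a simplified illustration, not any
driver's actual code; `attempt` is a hypothetical function standing in for sending the write command):

```javascript
// Execute a write with at most one retry. `attempt` throws an Error with
// a truthy `retryable` property to simulate a retryable network error.
function executeRetryableWrite(attempt) {
  try {
    return attempt();
  } catch (err) {
    if (!err.retryable) throw err; // non-retryable errors surface immediately
    // Exactly one retry attempt; a second failure propagates to the caller.
    return attempt();
  }
}
```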

## Split Batch Tests

The YAML tests specify bulk write operations that are split by command type (e.g. sequence of insert, update, and delete
commands). Multi-statement write operations may also be split due to `maxWriteBatchSize`, `maxBsonObjectSize`, or
`maxMessageSizeBytes`.

For instance, an insertMany operation with five 10 MiB documents executed using OP_MSG payload type 0 (i.e. entire
command in one document) would be split into five insert commands in order to respect the 16 MiB `maxBsonObjectSize`
limit. The same insertMany operation executed using OP_MSG payload type 1 (i.e. command arguments pulled out into a
separate payload vector) would be split into two insert commands in order to respect the 48 MB `maxMessageSizeBytes`
limit.
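
The arithmetic in this example can be sketched with a greedy packing function (an illustration of the counts above, not
the exact server or driver batch-splitting algorithm):

```javascript
// Count how many write commands are needed when packing documents of the
// given sizes (in bytes) into commands limited to `limitBytes` each.
function countSplitCommands(docSizes, limitBytes) {
  let commands = 0;
  let currentBytes = 0;
  for (const size of docSizes) {
    if (currentBytes > 0 && currentBytes + size > limitBytes) {
      commands += 1; // close the current command and start a new one
      currentBytes = 0;
    }
    currentBytes += size;
  }
  return currentBytes > 0 ? commands + 1 : commands;
}

const MiB = 1024 * 1024;
const fiveLargeDocs = Array(5).fill(10 * MiB);
// Payload type 0: the 16 MiB maxBsonObjectSize caps each command.
console.log(countSplitCommands(fiveLargeDocs, 16 * MiB)); // 5
// Payload type 1: the 48 MB maxMessageSizeBytes caps each command.
console.log(countSplitCommands(fiveLargeDocs, 48 * 1000 * 1000)); // 2
```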

Noting when a driver might split operations, the `onPrimaryTransactionalWrite` fail point's `skip` option may be used to
control when the fail point first triggers. Once triggered, the fail point will transition to the `alwaysOn` state until
disabled. Driver authors should also note that the server attempts to process all documents in a single insert command
within a single commit (i.e. one insert command with five documents may only trigger the fail point once). This behavior
is unique to insert commands (each statement in an update and delete command is processed independently).

If testing an insert that is split into two commands, a `skip` of one will allow the fail point to trigger on the second
insert command (because all documents in the first command will be processed in the same commit). When testing an update
or delete that is split into two commands, the `skip` should be set to the number of statements in the first command to
allow the fail point to trigger on the second command.
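
For example, for an update split into a first command with two statements and a second command, a configuration along
these lines (a sketch; consult the server's fail point documentation for the exact `data` options available on your
server version) would first trigger on the second command:

```javascript
{
  configureFailPoint: "onPrimaryTransactionalWrite",
  // Skip the first two statement commits (the two statements in the first
  // update command); the fail point then triggers on the second command and
  // remains alwaysOn until disabled.
  mode: { skip: 2 },
  data: { closeConnection: true }
}
```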

## Command Construction Tests

Drivers should also assert that command documents are properly constructed with or without a transaction ID, depending
on whether the write operation is supported.
[Command Logging and Monitoring](../../command-logging-and-monitoring/command-logging-and-monitoring.rst) may be used to
check for the presence of a `txnNumber` field in the command document. Note that command documents may always include an
`lsid` field per the [Driver Session](../../sessions/driver-sessions.md) specification.

These tests may be run against both a replica set and a sharded cluster.

Drivers should test that transaction IDs are never included in commands for unsupported write operations:

- Write commands with unacknowledged write concerns (e.g. `{w: 0}`)
- Unsupported single-statement write operations
  - `updateMany()`
  - `deleteMany()`
- Unsupported multi-statement write operations
  - `bulkWrite()` that includes `UpdateMany` or `DeleteMany`
- Unsupported write commands
  - `aggregate` with a write stage (e.g. `$out`, `$merge`)

Drivers should test that transaction IDs are always included in commands for supported write operations:

- Supported single-statement write operations
  - `insertOne()`
  - `updateOne()`
  - `replaceOne()`
  - `deleteOne()`
  - `findOneAndDelete()`
  - `findOneAndReplace()`
  - `findOneAndUpdate()`
- Supported multi-statement write operations
  - `insertMany()` with `ordered=true`
  - `insertMany()` with `ordered=false`
  - `bulkWrite()` with `ordered=true` (no `UpdateMany` or `DeleteMany`)
  - `bulkWrite()` with `ordered=false` (no `UpdateMany` or `DeleteMany`)
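
The check over captured CommandStartedEvent documents can be sketched as follows (an illustrative helper, assuming the
command document is available as a plain object):

```javascript
// Decide whether a captured command document carries a transaction ID.
// `lsid` may always be present per the sessions specification, so only
// `txnNumber` distinguishes a retryable write attempt.
function hasTransactionId(commandDocument) {
  return Object.prototype.hasOwnProperty.call(commandDocument, "txnNumber");
}
```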

## Prose Tests

The following tests ensure that retryable writes work properly with replica sets and sharded clusters.

### 1. Test that retryable writes raise an exception when using the MMAPv1 storage engine.

For this test, execute a write operation, such as `insertOne`, which should generate an exception. Assert that the error
message is the replacement error message:

```
This MongoDB deployment does not support retryable writes. Please add
retryWrites=false to your connection string.
```

and the error code is 20.

> [!NOTE]
> Drivers that rely on `serverStatus` to determine the storage engine in use MAY skip this test for sharded clusters,
> since `mongos` does not report this information in its `serverStatus` response.

### 2. Test that drivers properly retry after encountering PoolClearedErrors.

This test MUST be implemented by any driver that implements the CMAP specification.

This test requires MongoDB 4.3.4+ for both the `errorLabels` and `blockConnection` fail point options.

1. Create a client with maxPoolSize=1 and retryWrites=true. If testing against a sharded deployment, be sure to connect
   to only a single mongos.

2. Enable the following failpoint:

   ```javascript
   {
     configureFailPoint: "failCommand",
     mode: { times: 1 },
     data: {
       failCommands: ["insert"],
       errorCode: 91,
       blockConnection: true,
       blockTimeMS: 1000,
       errorLabels: ["RetryableWriteError"]
     }
   }
   ```

3. Start two threads and attempt to perform an `insertOne` simultaneously on both.

4. Verify that both `insertOne` attempts succeed.

5. Via CMAP monitoring, assert that the first check out succeeds.

6. Via CMAP monitoring, assert that a PoolClearedEvent is then emitted.

7. Via CMAP monitoring, assert that the second check out then fails due to a connection error.

8. Via Command Monitoring, assert that exactly three `insert` CommandStartedEvents were observed in total.

9. Disable the failpoint.

### 3. Test that drivers return the original error after encountering a WriteConcernError with a RetryableWriteError label.

This test MUST:

- be implemented by any driver that implements the Command Monitoring specification,
- only run against replica sets, as mongos does not propagate the NoWritesPerformed label to drivers, and
- be run against server versions 6.0 and above.

Additionally, this test requires drivers to set a fail point after an `insertOne` operation but before the subsequent
retry. Drivers that are unable to set a failCommand after the CommandSucceededEvent SHOULD use mocking or write a unit
test to cover the same sequence of events.

1. Create a client with `retryWrites=true`.

2. Configure a fail point with error code `91` (ShutdownInProgress):

   ```javascript
   {
     configureFailPoint: "failCommand",
     mode: { times: 1 },
     data: {
       failCommands: ["insert"],
       errorLabels: ["RetryableWriteError"],
       writeConcernError: { code: 91 }
     }
   }
   ```

3. Via the command monitoring CommandSucceededEvent, configure a fail point with error code `10107` (NotWritablePrimary)
   and a NoWritesPerformed label:

   ```javascript
   {
     configureFailPoint: "failCommand",
     mode: { times: 1 },
     data: {
       failCommands: ["insert"],
       errorCode: 10107,
       errorLabels: ["RetryableWriteError", "NoWritesPerformed"]
     }
   }
   ```

   Drivers SHOULD only configure the `10107` fail point command if the succeeded event is for the `91` error configured
   in step 2.

4. Attempt an `insertOne` operation on any record for any database and collection. For the resulting error, assert that
   the associated error code is `91`.

5. Disable the fail point:

   ```javascript
   {
     configureFailPoint: "failCommand",
     mode: "off"
   }
   ```
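
The guard described in step 3 can be sketched as a predicate over the succeeded event (names here are illustrative; the
reply shape follows the command monitoring specification):

```javascript
// Returns true when the CommandSucceededEvent corresponds to the step-2
// fail point: an `insert` reply carrying a writeConcernError with code 91.
// Only then should the nested 10107 fail point be configured.
function shouldConfigureNestedFailPoint(commandName, reply) {
  return (
    commandName === "insert" &&
    reply.writeConcernError !== undefined &&
    reply.writeConcernError.code === 91
  );
}
```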

### 4. Test that in a sharded cluster writes are retried on a different mongos when one is available.

This test MUST be executed against a sharded cluster that has at least two mongos instances, supports
`retryWrites=true`, has enabled the `configureFailPoint` command, and supports the `errorLabels` field (MongoDB 4.3.1+).

> [!NOTE]
> This test cannot reliably distinguish "retry on a different mongos due to server deprioritization" (the behavior
> intended to be tested) from "retry on a different mongos due to normal SDAM randomized suitable server selection".
> Verify relevant code paths are correctly executed by the tests using external means such as logging, a debugger, a
> code coverage tool, etc.

1. Create two clients `s0` and `s1` that each connect to a single mongos from the sharded cluster. They must not connect
   to the same mongos.

2. Configure the following fail point for both `s0` and `s1`:

   ```javascript
   {
     configureFailPoint: "failCommand",
     mode: { times: 1 },
     data: {
       failCommands: ["insert"],
       errorCode: 6,
       errorLabels: ["RetryableWriteError"]
     }
   }
   ```

3. Create a client `client` with `retryWrites=true` that connects to the cluster using the same two mongoses as `s0` and
   `s1`.

4. Enable failed command event monitoring for `client`.

5. Execute an `insert` command with `client`. Assert that the command failed.

6. Assert that two failed command events occurred. Assert that the failed command events occurred on different mongoses.

7. Disable the fail points on both `s0` and `s1`.
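
The deprioritization behavior this test exercises can be sketched as a selection rule (an illustration, not any driver's
actual server-selection code): on retry, prefer a suitable mongos other than the one that just failed, falling back to
it only when no alternative exists:

```javascript
// Pick a server for a retry attempt. `suitable` is the list of suitable
// mongos addresses; `deprioritized` holds addresses that failed the
// first attempt.
function selectForRetry(suitable, deprioritized) {
  const preferred = suitable.filter((s) => !deprioritized.includes(s));
  const pool = preferred.length > 0 ? preferred : suitable;
  // Randomized choice among the remaining candidates, as in normal
  // server selection.
  return pool[Math.floor(Math.random() * pool.length)];
}
```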
| 258 | + |
| 259 | +### 5. Test that in a sharded cluster writes are retried on the same mongos when no others are available. |
| 260 | + |
| 261 | +This test MUST be executed against a sharded cluster that supports `retryWrites=true`, has enabled the |
| 262 | +`configureFailPoint` command, and supports the `errorLabels` field (MongoDB 4.3.1+). |
| 263 | + |
| 264 | +Note: this test cannot reliably distinguish "retry on a different mongos due to server deprioritization" (the behavior |
| 265 | +intended to be tested) from "retry on a different mongos due to normal SDAM behavior of randomized suitable server |
| 266 | +selection". Verify relevant code paths are correctly executed by the tests using external means such as a logging, |
| 267 | +debugger, code coverage tool, etc. |

1. Create a client `s0` that connects to a single mongos from the cluster.

2. Configure the following fail point for `s0`:

   ```javascript
   {
     configureFailPoint: "failCommand",
     mode: { times: 1 },
     data: {
       failCommands: ["insert"],
       errorCode: 6,
       errorLabels: ["RetryableWriteError"],
       closeConnection: true
     }
   }
   ```

3. Create a client `client` with `directConnection=false` (when not set by default) and `retryWrites=true` that connects
   to the cluster using the same single mongos as `s0`.

4. Enable succeeded and failed command event monitoring for `client`.

5. Execute an `insert` command with `client`. Assert that the command succeeded.

6. Assert that exactly one failed command event and one succeeded command event occurred. Assert that both events
   occurred on the same mongos.

7. Disable the fail point on `s0`.

## Changelog

- 2024-05-30: Migrated from reStructuredText to Markdown.

- 2024-02-27: Convert legacy retryable writes tests to unified format.

- 2024-02-21: Update prose tests 4 and 5 to work around SDAM behavior preventing execution of deprioritization code
  paths.

- 2024-01-05: Fix typo in prose test title.

- 2024-01-03: Note server version requirements for fail point options and revise tests to specify the `errorLabels`
  option at the top level instead of within `writeConcernError`.

- 2023-08-26: Add prose tests for retrying in a sharded cluster.

- 2022-08-30: Add prose test verifying correct error handling for errors with the NoWritesPerformed label, which is to
  return the original error.

- 2022-04-22: Clarifications to `serverless` and `useMultipleMongoses`.

- 2021-08-27: Add `serverless` to `runOn`. Clarify behavior of `useMultipleMongoses` for `LoadBalanced` topologies.

- 2021-04-23: Add `load-balanced` to test topology requirements.

- 2021-03-24: Add prose test verifying `PoolClearedErrors` are retried.

- 2019-10-21: Add `errorLabelsContain` and `errorLabelsOmit` fields to `result`.

- 2019-08-07: Add Prose Tests section.

- 2019-06-07: Mention $merge stage for aggregate alongside $out.

- 2019-03-01: Add top-level `runOn` field to denote server version and/or topology requirements for the test file.
  Removes the `minServerVersion` and `maxServerVersion` top-level fields, which are now expressed within `runOn`
  elements.

  Add test-level `useMultipleMongoses` field.