
PHPLIB-1237 Implement Parallel Multi File Export Bench #1169


Merged · 4 commits · Sep 22, 2023

Conversation

@GromNaN (Member) commented on Sep 21, 2023

Fix PHPLIB-1237
Follows #1166

Parallel Benchmarks specs: LDJSON multi-file export

Upgrade to AMPHP with Fibers.

Implementations:

  1. 🥇 Using multiple forked threads
  2. 🥈 Using amphp/parallel with a worker pool (sketched after this list)
  3. 🥉 Using a single process
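
As an illustration of the worker-pool variant, here is a minimal sketch using the amphp/parallel v2 API; ExportTask is a hypothetical Task implementation standing in for the actual PR code:

use Amp\Future;
use Amp\Parallel\Worker;

// ExportTask is a hypothetical Amp\Parallel\Worker\Task implementation
// whose run() method writes one chunk of LDJSON files.
$executions = [];
foreach (array_chunk($files, $filesPerTask) as $chunk) {
    $executions[] = Worker\submit(new ExportTask($chunk));
}

// Block until every worker has finished its chunk.
Future\await(array_map(
    static fn (Worker\Execution $e) => $e->getFuture(),
    $executions,
));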

The scenario has 3 critical steps that I tried to optimize:

Find documents

I grouped files by chunks to perform a single query for several files and reduce server round trips.

I had to use limit/skip to paginate the results. I tried an aggregation pipeline to get the first ID of each chunk (shown below), but it takes ~600ms, which is too significant for something that provides no benefit to the query itself.

Aggregation pipeline to get the first ID of each chunk
function getFirstIdOfEachChunk(int $chunkSize = 5_000): array
{
    $cursor = Utils::getCollection()->aggregate([
        ['$sort' => ['_id' => 1]],
        // We work with the _id field only
        ['$project' => ['_id' => 1]],
        // Add a rank field to each document
        ['$setWindowFields' => ['sortBy' => ['_id' => 1], 'output' => ['rank' => ['$rank' => (object) []]]]],
        // Keep one _id every $chunkSize documents (the first of each chunk)
        ['$match' => ['rank' => ['$mod' => [$chunkSize, 1]]]],
    ], [
        'typeMap' => ['root' => 'bson'],
    ]);

    $ids = [];
    foreach ($cursor as $document) {
        $ids[] = $document->get('_id');
    }

    return $ids;
}
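
The limit/skip pagination that was kept instead is a plain find() per chunk; a minimal sketch, where $chunkIndex and $chunkSize are illustrative variables rather than the exact PR code:

// Fetch the documents for one chunk of files in a single query.
$cursor = Utils::getCollection()->find([], [
    'sort' => ['_id' => 1],
    'skip' => $chunkIndex * $chunkSize,
    'limit' => $chunkSize,
]);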

Encoding documents to JSON

Initially I used the MongoDB\BSON\Document methods toCanonicalExtendedJSON and toRelaxedExtendedJSON, but these methods are slow compared to json_encode, and they don't return the expected document format.

Even though the bson typeMap is faster for the query itself, I had to use the array typeMap to convert to JSON: json_encode($document->toPHP()) is slower than encoding an array directly.
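
For example, querying with the array typeMap lets each document go straight into json_encode; a sketch, not the exact PR code:

$cursor = Utils::getCollection()->find([], [
    'typeMap' => ['root' => 'array', 'document' => 'array', 'array' => 'array'],
]);

$lines = [];
foreach ($cursor as $document) {
    // One LDJSON line per document.
    $lines[] = json_encode($document);
}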

Writing the file

I tested several approaches for writing the file. One fwrite per document was 1.1x slower than a single file_put_contents of the whole file. I saw no benefit from asynchronous writes using stream_set_blocking, and AMP's WritableStream was much slower. In any case, writing is not what takes the longest overall.
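
The retained approach buffers all lines and performs a single write per file; a minimal sketch building on the $lines array above:

// A single write per file was ~1.1x faster than one fwrite() per document.
file_put_contents($path, implode("\n", $lines) . "\n");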

@GromNaN GromNaN requested review from jmikola and alcaeus September 21, 2023 23:53
@GromNaN (Member, Author) commented on Sep 22, 2023

We can see that forking is faster than using a single process or starting worker processes (the fork pattern is sketched after the results table below).

Thanks to the benchmark, we can see that there's a balance to be found in choosing the best number of threads.

benchmark                     subject          set     mem_peak   mode
ParallelMultiFileImportBench  benchBulkWrite   by 1    1.230mb    14.302s
ParallelMultiFileImportBench  benchBulkWrite   by 2    1.230mb    13.488s
ParallelMultiFileImportBench  benchBulkWrite   by 4    1.230mb    13.353s
ParallelMultiFileImportBench  benchBulkWrite   by 8    1.230mb    13.425s
ParallelMultiFileImportBench  benchBulkWrite   by 13   1.230mb    13.418s
ParallelMultiFileImportBench  benchBulkWrite   by 20   1.230mb    13.170s
ParallelMultiFileImportBench  benchBulkWrite   by 100  1.230mb    13.305s
ParallelMultiFileImportBench  benchInsertMany          2.501mb    13.810s
ParallelMultiFileImportBench  benchFork        by 1    1.230mb    9.058s
ParallelMultiFileImportBench  benchFork        by 2    1.230mb    8.573s
ParallelMultiFileImportBench  benchFork        by 4    1.230mb    7.856s
ParallelMultiFileImportBench  benchFork        by 8    1.230mb    7.454s
ParallelMultiFileImportBench  benchFork        by 13   1.230mb    7.498s
ParallelMultiFileImportBench  benchFork        by 20   1.230mb    7.316s
ParallelMultiFileImportBench  benchFork        by 100  1.230mb    13.343s
ParallelMultiFileImportBench  benchAmpWorkers  by 1    18.216mb   11.598s
ParallelMultiFileImportBench  benchAmpWorkers  by 2    15.240mb   9.641s
ParallelMultiFileImportBench  benchAmpWorkers  by 4    5.744mb    8.824s
ParallelMultiFileImportBench  benchAmpWorkers  by 8    4.270mb    8.059s
ParallelMultiFileImportBench  benchAmpWorkers  by 13   3.605mb    7.807s
ParallelMultiFileImportBench  benchAmpWorkers  by 20   3.199mb    7.530s
ParallelMultiFileImportBench  benchAmpWorkers  by 100  2.757mb    13.473s
ParallelMultiFileExportBench  benchSequential  by 1    7.023mb    18.278s
ParallelMultiFileExportBench  benchSequential  by 2    7.011mb    12.877s
ParallelMultiFileExportBench  benchSequential  by 4    7.005mb    10.146s
ParallelMultiFileExportBench  benchSequential  by 8    7.002mb    8.848s
ParallelMultiFileExportBench  benchSequential  by 13   7.002mb    8.244s
ParallelMultiFileExportBench  benchSequential  by 20   7.002mb    7.994s
ParallelMultiFileExportBench  benchSequential  by 100  7.002mb    7.556s
ParallelMultiFileExportBench  benchFork        by 1    929.808kb  11.601s
ParallelMultiFileExportBench  benchFork        by 2    929.808kb  7.574s
ParallelMultiFileExportBench  benchFork        by 4    929.808kb  5.754s
ParallelMultiFileExportBench  benchFork        by 8    929.808kb  4.873s
ParallelMultiFileExportBench  benchFork        by 13   929.808kb  4.426s
ParallelMultiFileExportBench  benchFork        by 20   929.808kb  4.201s
ParallelMultiFileExportBench  benchFork        by 100  929.808kb  7.464s
ParallelMultiFileExportBench  benchAmpWorkers  by 1    11.425mb   13.744s
ParallelMultiFileExportBench  benchAmpWorkers  by 2    7.895mb    9.107s
ParallelMultiFileExportBench  benchAmpWorkers  by 4    5.329mb    6.493s
ParallelMultiFileExportBench  benchAmpWorkers  by 8    3.812mb    5.396s
ParallelMultiFileExportBench  benchAmpWorkers  by 13   3.252mb    4.856s
ParallelMultiFileExportBench  benchAmpWorkers  by 20   2.850mb    4.414s
ParallelMultiFileExportBench  benchAmpWorkers  by 100  2.377mb    7.712s
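
The fastest variant follows the classic fork/wait pattern; a minimal sketch, assuming ext-pcntl and a hypothetical exportChunk() helper (the Utils::reset() call is explained in the review thread below):

$pids = [];
foreach (array_chunk($files, $filesPerChild) as $chunk) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child: reset so that a new Manager (and thus a new libmongoc
        // client) is created for this PID instead of re-using the parent's.
        Utils::reset();
        exportChunk($chunk); // hypothetical helper writing one chunk of files
        exit(0);
    }
    $pids[] = $pid;
}

// Parent: wait for all children to complete.
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}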

Comment on lines +197 to +198
// We don't use MongoDB\BSON\Document::toCanonicalExtendedJSON() because
// it is slower than json_encode() on an array.
Member commented:
Note that json_encode does not produce the same result as toCanonicalExtendedJSON. That said, it does produce the result we wanted.

I also had to do a double-take on this, so I wrote a small benchmark:

    benchToJSONViaToCanonicalExtendedJSON...I2 - Mo48.200μs (±0.00%)
    benchToJSONViaToRelaxedExtendedJSON.....I2 - Mo48.591μs (±6.09%)
    benchToJSONViaJsonEncode................I2 - Mo12.410μs (±3.55%)

The last benchmark actually calls json_encode($document->toPHP(['root' => 'array'])), so I'm a bit surprised that it is faster. This might provide an opportunity to revisit the JSON serialisation logic in libbson.
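
For reference, the three conversions being benchmarked look like this, assuming $document is a MongoDB\BSON\Document:

$canonical = $document->toCanonicalExtendedJSON();
$relaxed = $document->toRelaxedExtendedJSON();
// Fastest of the three, despite the intermediate toPHP() conversion:
$json = json_encode($document->toPHP(['root' => 'array']));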

Member (Author) replied:

Tracked in PHPC-2299

@GromNaN GromNaN merged commit 21946b8 into mongodb:master Sep 22, 2023
@GromNaN GromNaN deleted the PHPLIB-1237-export branch September 22, 2023 09:46
// Reset to ensure that the existing libmongoc client (via the Manager) is not re-used by the child
// process. When the child process constructs a new Manager, the differing PID will result in creation
// of a new libmongoc client.
Utils::reset();
Member commented:
Is there existing state prior to benchFork() where a Manager would have been registered? Perhaps from beforeIteration() dropping and creating a collection?

Member (Author) replied:
Without the reset, I get this error:

Fatal error: Uncaught MongoDB\Driver\Exception\BulkWriteException: assertion src/mongo/util/hex.cpp:113 in /Users/jerome/Develop/mongo-php-library/benchmark/src/DriverBench/ParallelMultiFileImportBench.php:211

Member replied:
Possibly this: https://github.com/mongodb/mongo/blob/r7.0.1/src/mongo/util/hex.cpp#L113

Seems odd that the server would throw back a cryptic assertion failure as its error message, though. If it's easily reproducible this might warrant a bug report in the SERVER project.
