Skip to content

Add mutation batch id to index backfill #3202

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from
Closed

Conversation

thebrianchen
Copy link

This PR adds mutation batch id support to the index backfiller. All documents in the same mutation batch are processed, even if it goes over the limit (the alternative is to add an offset for mutations, but batches are limited to 500 writes).

Still todo:

  • Current implementation performs a collection group query with documents + overlay, but local mutations are not written to the collection_groups table that support collection group queries. This means that the current implementation only supports local mutations that already have a remote document in the same collection.

@thebrianchen thebrianchen self-assigned this Dec 2, 2021
@google-cla google-cla bot added the cla: yes Override cla label Dec 2, 2021
}

/** Updates the provided remote documents to include documents in the overlay. */
private ImmutableSortedMap<DocumentKey, MutableDocument>
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried refactoring out shared code between getDocumentsMatchingCollectionQueryFromOverlayCache and getDocumentsCollectionQueryWithBatchId, but I think it can probably be cleaner.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are probably at a point where we should just remove the non-overlay code.

@wu-hui

@google-oss-bot
Copy link
Contributor

google-oss-bot commented Dec 2, 2021

Coverage Report 1

Affected Products

No changes between base commit (46b3dfd) and merge commit (b025f3a).

Test Logs

Notes

  • Commit (b025f3a) is created by Prow via merging PR base commit (46b3dfd) and head commit (4b10ea0).
  • Run gradle <product>:checkCoverage to produce HTML coverage reports locally. After gradle commands finished, report files can be found under <product-build-dir>/reports/jacoco/.

  1. https://storage.googleapis.com/firebase-sdk-metric-reports/vUOXETjUI6.html

@google-oss-bot
Copy link
Contributor

google-oss-bot commented Dec 2, 2021

Size Report 1

Affected Products

  • firebase-firestore

    TypeBase (46b3dfd)Merge (b025f3a)Diff
    aar1.23 MB1.23 MB+3.69 kB (+0.3%)
    apk (release)3.33 MB3.33 MB+1.87 kB (+0.1%)

Test Logs

Notes

  • Commit (b025f3a) is created by Prow via merging PR base commit (46b3dfd) and head commit (4b10ea0).

  1. https://storage.googleapis.com/firebase-sdk-metric-reports/Hwitdog1OW.html

Copy link
Contributor

@schmidt-sebastian schmidt-sebastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early feedback (did not look at test). Looks pretty good, but we may want to think about how we can remove the need to read documents multiple times.

See also for some changes that may affect this: #3193

*/
public static <T extends Comparable<T>> void diffCollections(
SortedSet<T> before, SortedSet<T> after, Consumer<T> onAdd, Consumer<T> onRemove) {
diffCollections(before.iterator(), after.iterator(), before.comparator(), onAdd, onRemove);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overload is actually different from the one we now default to. This method takes sorted data - the method you are calling now reorders the data (and copies it) even when already sorted. I think you solve the problem you encountered - that SortedSet<T>.comparator() can return null - by creating a default comparator based on T.compareTo() if the set's comparator is null.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks. Added the code back and added a null check.

* change past `sinceBatchId` are returned.
*/
Map<DocumentKey, Pair<Integer, Mutation>> getOverlaysWithBatchId(
ResourcePath collection, int sinceBatchId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised we don't have a high-level class that contains both the mutation and the batch ID for Overlays. We should probably create one and combine this method with the existing method just above. (cc @wu-hui)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed this method into the original getOverlays() and changed logic to support Pair.


// Start indexing mutations only after catching up for read time. This must be done separately
// since the backfill's first pass is in read time order rather than mutation batch id order.
int documentsRemaining = documentsRemainingUnderCap - oldestDocuments.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to a helper method for better encapsulation? We might also want to do the same for the RemoteDocumentCache code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved into LocalDocumentsView

// order by batch id.
int newLargestBatchId =
documentToBatchId.get(documentsByBatchId.get(documentsByBatchId.size() - 1));
newOffset =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this into getNewOffset? (as a final call at the end of the method)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N/A after moving into LocalDocumentsView.

* corresponding batch id.
*/
// TODO(indexing): Support adding local mutations to collection group table.
public Map<Document, Integer> getDocumentsBatchIdsMatchingCollectionGroupQuery(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your code would be cleaner if this method returned Map<BatchId, List<Document>>.

The name is also quite a mouthful - and the method should probably just accept a String collectionGroup rather than a query as it ignores all other query constraints.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but cannot think of a cleaner name.

}

/** Updates the provided remote documents to include documents in the overlay. */
private ImmutableSortedMap<DocumentKey, MutableDocument>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are probably at a point where we should just remove the non-overlay code.

@wu-hui

@@ -105,4 +105,32 @@ public void removeOverlaysForBatchId(int batchId) {

return result;
}

@Override
public Map<DocumentKey, Pair<Integer, Mutation>> getOverlaysWithBatchId(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please find a way to remove this duplication - even if we do not add an Overlay class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combined the two methods into a single one.

@@ -111,4 +112,34 @@ public void removeOverlaysForBatchId(int batchId) {

return result;
}

@Override
public Map<DocumentKey, Pair<Integer, Mutation>> getOverlaysWithBatchId(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Brian Chen and others added 2 commits December 3, 2021 08:21
documentsProcessed = backfiller.backfill();
assertEquals(1, documentsProcessed);
SortedSet<IndexEntry> after = indexManager.getExistingIndexEntries(docA.getKey(), fieldIndex);
assertNotEquals(before.first().getDirectionalValue(), after.first().getDirectionalValue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of relying on getExistingIndexEntries, can you just the public API and invoke getDocumentsMatchingTarget with two different targets (foo > 10 and foo > 20 for example). Before the update, you can verify that one of them matches and then after the update two match.

Same comment applies below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@schmidt-sebastian schmidt-sebastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests look great - but we probably do not need to use the private API.

@thebrianchen
Copy link
Author

Main changes:

  • Documents are no longer memoized.
  • LDV still calculates the new offset and returns the result as pair since it's cleaner than passing in the set of document keys in 2 separate calls.
  • Overlay class added.
  • Mutations now have the largest batch id

/test check-changed
/test device-check-changed

Copy link
Contributor

@schmidt-sebastian schmidt-sebastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a huge step in the right direction. We should talk about the refactor to add (and use) the BatchId in the Document class. I see that you added the code to populate it - maybe it will now be easy enough to also use it? It would allow you to stop passing Pair<integer, document> types.

@@ -47,11 +48,14 @@
void removeOverlaysForBatchId(int batchId);

/**
* Returns all saved overlays for the given collection.
* Returns a mapping of overlays containing the largest batch id for each overlay mutation in the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably revert this documentation change now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* @param collection The collection path to get the overlays for.
* @param sinceBatchId The minimum batch ID to filter by (exclusive). Only overlays that contain a
* change past `sinceBatchId` are returned.
* @return Mapping of each document key in the collection to its largest batch id and overlay
* mutation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

processedCollectionGroups.add(collectionGroup);
}
return maxDocumentsToProcess - documentsRemaining;
}

/** Writes entries for the fetched field indexes. */
/**
* Writes entries for the corresponding field indexes in the provided collection group. Returns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update as their are no "corresponding" indexes (at least none are passed to the method)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

lowestOffset = lowestOffset == null ? IndexOffset.NONE : lowestOffset;

// Add earliest batch id to offset
int earliestBatchId = getEarliestBatchId(fieldIndexes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you inline this call? You are iterating over the list of field indexes twice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// Apply overlays to empty mutable document map to generate base.
for (Map.Entry<DocumentKey, Overlay> entry : overlays.entrySet()) {
remoteDocuments.put(entry.getKey(), MutableDocument.newInvalidDocument(entry.getKey()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not need to fetch the documents from the RDC?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the document overlay cache contains the most up to date version of the documents, we don't need to.

mutation.applyToLocalView(docEntry.getValue(), null, batchId, Timestamp.now());
}
// Finally, insert the documents that still match the query.
if (query.matches(docEntry.getValue())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do this once at the end of getNextDocumentsAndOffset.

FWIW, we might not need to do this at all if we pass a collection group here instead of the query. None of the IndexBackfill methods need a query.

return getDocumentKey().compareTo(other.getDocumentKey());
cmp = getDocumentKey().compareTo(other.getDocumentKey());
if (cmp != 0) return cmp;
return Integer.signum(getLargestBatchId() - other.getLargestBatchId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer.compare? (for noobs like me)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -182,6 +184,10 @@ public SnapshotVersion getReadTime() {
return readTime;
}

public int getMutationBatchID() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem used but I wish it was :)

* <p>Holds information about a mutation and the largest batch id in Firestore when the mutation was
* created.
*/
public class Overlay {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to do one of the following:

  • Make this an AutoValue
  • Make this a Kotlin value class
  • Add equals/hashCode

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned it into an AutoValue.

/** Returns the offset for the index based on the newly indexed documents. */
private IndexOffset getNewOffset(
ImmutableSortedMap<DocumentKey, Document> documents, IndexOffset currentOffset) {
if (documents.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW This should eventually live in the IndexBackfiller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline, this probably does not and we need to leave it here.

@google-oss-bot
Copy link
Contributor

google-oss-bot commented Jan 14, 2022

@thebrianchen: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
check-changed 4b10ea0 link /test check-changed

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Author

@thebrianchen thebrianchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments are now outdated + I don't really remember their original context. Please review this as a rough draft. Will clean up the comments later.

*
* @param collection The collection path to get the overlays for.
* @param sinceBatchId The minimum batch ID to filter by (exclusive). Only overlays that contain a
* change past `sinceBatchId` are returned.
* @return Mapping of each document key in the collection to its largest batch id and overlay
* mutation.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -47,11 +48,14 @@
void removeOverlaysForBatchId(int batchId);

/**
* Returns all saved overlays for the given collection.
* Returns a mapping of overlays containing the largest batch id for each overlay mutation in the
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

processedCollectionGroups.add(collectionGroup);
}
return maxDocumentsToProcess - documentsRemaining;
}

/** Writes entries for the fetched field indexes. */
/**
* Writes entries for the corresponding field indexes in the provided collection group. Returns
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

lowestOffset = lowestOffset == null ? IndexOffset.NONE : lowestOffset;

// Add earliest batch id to offset
int earliestBatchId = getEarliestBatchId(fieldIndexes);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/
public Map<Document, Integer> getDocumentsInOverlayCacheByBatchId(
Query query, IndexOffset offset) {
Map<DocumentKey, MutableDocument> remoteDocuments = new HashMap<>();
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// Apply overlays to empty mutable document map to generate base.
for (Map.Entry<DocumentKey, Overlay> entry : overlays.entrySet()) {
remoteDocuments.put(entry.getKey(), MutableDocument.newInvalidDocument(entry.getKey()));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the document overlay cache contains the most up to date version of the documents, we don't need to.

overlay.applyToLocalView(docEntry.getValue(), null, Timestamp.now());
Overlay overlay = overlays.get(docEntry.getKey());
Mutation mutation = overlay != null ? overlay.getMutation() : null;
if (mutation != null) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 327 to 333
if (!results.containsKey(batchId)) {
List<Document> documents = new ArrayList<>();
documents.add(document);
results = results.insert(batchId, documents);
} else {
results.get(batchId).add(document);
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* <p>Holds information about a mutation and the largest batch id in Firestore when the mutation was
* created.
*/
public class Overlay {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned it into an AutoValue.

@thebrianchen
Copy link
Author

Better version being implemented in #3323

@firebase firebase locked and limited conversation to collaborators Feb 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants