Compound keys implementation, using product order #375

erezsh · 2023-01-28T13:50:10Z

min_key/max_key are now Vectors with a product order, because the product order has much simpler geometric properties than the previously proposed lexicographic order.
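For readers unfamiliar with the term, here is a minimal sketch of how a product order differs from a lexicographic one on fixed-length keys. The helper names `product_le` and `in_box` are hypothetical and are not taken from this PR, and the half-open box convention is an assumption made for the illustration.

```python
from typing import Tuple

Key = Tuple[int, ...]


def product_le(a: Key, b: Key) -> bool:
    """a <= b in the product order: every component of a is <= the matching component of b."""
    if len(a) != len(b):
        raise ValueError("keys must have the same length")
    return all(x <= y for x, y in zip(a, b))


def in_box(key: Key, min_key: Key, max_key: Key) -> bool:
    """True if key lies in the half-open box [min_key, max_key) (assumed convention)."""
    return all(lo <= k < hi for lo, k, hi in zip(min_key, key, max_key))


a, b = (1, 5), (2, 3)

# Python tuples compare lexicographically: the first component decides.
lexicographic = a <= b

# In the product order, a and b are incomparable: 1 <= 2 but 5 > 3.
product = product_le(a, b)
incomparable = not product_le(a, b) and not product_le(b, a)
```

Under a product order, the keys between a min_key and a max_key form an axis-aligned box, so membership can be checked one column at a time. A lexicographic range has no such per-column description.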

GCCree · 2023-02-01T22:37:46Z

Just out of curiosity, would a side-effect of compound key support mean that single non-numeric keys would also be supported as part of this? Assuming you could pass a single key whose values could be vectorized and whose min/max could be determined by the product order of themselves? I don't know if that makes sense. I believe currently the single key that you pass has to be numeric, but from the description above it seems like you could support text keys if you vectorized the text values?

erezsh · 2023-02-02T01:46:04Z

@GCCree The current implementation requires the compound key to have a fixed length. Also, it's quite inefficient for compound keys with many items (say, >=5).

However, we already support alphanumerics, and that could possibly be extended to support arbitrary text, at some performance cost, though a much smaller one than the cost of compound keys.

nolar

I probably need some more time to review this with some drawings and charts — the logic behind the implementation is not very clear.

As far as I remember, the biggest challenge in the first implementation was that we did not know the profiles of columns and their ranges. A rather typical example (one I use in ClickHouse) is a compound key (org_id, entity_uuid). Most orgs have nothing in them, while several orgs are overpopulated with entities. As a result, the distribution is uneven.

With the first implementation, the idea was to split the 1st field into ranges, then the 2nd field into ranges, and then combine them. This would lead to way too many "empty" sections of the addressable key space: those "empty" orgs need no splitting, and only the few "heavy" orgs do, each with its own intensity.
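To illustrate the concern, here is a minimal sketch of the per-column split on a skewed (org_id, entity) key. The data and the helpers `split_range` and `cell_of` are made up for the illustration, and integer entity ids stand in for UUIDs to keep it short.

```python
from collections import Counter


def split_range(lo, hi, n):
    """Split the half-open integer range [lo, hi) into n near-equal half-open ranges."""
    bounds = [lo + (hi - lo) * i // n for i in range(n + 1)]
    return list(zip(bounds, bounds[1:]))


def cell_of(value, ranges):
    """Index of the range that contains value."""
    for i, (lo, hi) in enumerate(ranges):
        if lo <= value < hi:
            return i
    raise ValueError(f"{value} is outside all ranges")


# Hypothetical skewed data: two "heavy" orgs, one tiny org, the rest empty.
rows = [(7, e) for e in range(500)] + [(42, e) for e in range(300)] + [(90, 0)]

org_ranges = split_range(0, 100, 10)      # split the 1st field
entity_ranges = split_range(0, 1000, 10)  # split the 2nd field

# Combining the two splits gives a 10 x 10 grid of cells.
counts = Counter((cell_of(o, org_ranges), cell_of(e, entity_ranges)) for o, e in rows)

total_cells = len(org_ranges) * len(entity_ranges)
non_empty_cells = len(counts)
empty_cells = total_cells - non_empty_cells
```

With this data only a handful of the 100 cells contain rows, and the rest of the grid is empty key space.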

Can you please briefly describe how this problem is addressed in this approach? This will help me understand the concept of "product order" better. Thanks.

data_diff/table_segment.py

erezsh · 2023-02-13T13:19:31Z

> the biggest challenge in the first implementation was that we did not know the profiles of columns and their ranges

This was one of the challenges.

The problem with the first implementation was that we didn't divide the space up equally. This implementation solves it.

Another problem that came up later is dividing the space when one column is an integer while the other is alphanumeric. This implementation also handles that case.

As for compound keys that are not evenly distributed, this is still a performance problem, but not a huge one, since empty regions will only be iterated into once and then discarded.
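A rough sketch of that behaviour, assuming a simple recursive scheme that halves every key column and stops at a row-count threshold. The functions `split_box` and `segments`, the threshold, and the in-memory row list are all hypothetical; this is not the code in data_diff/table_segment.py.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple

Key = Tuple[int, ...]
Box = Tuple[Key, Key]


def in_box(key: Key, lo: Key, hi: Key) -> bool:
    """True if key lies in the half-open box [lo, hi)."""
    return all(l <= k < h for l, k, h in zip(lo, key, hi))


def split_box(lo: Key, hi: Key) -> List[Box]:
    """Halve every dimension, giving up to 2**d sub-boxes that partition [lo, hi)."""
    per_dim = []
    for l, h in zip(lo, hi):
        mid = (l + h) // 2
        # A dimension of width <= 1 cannot be halved any further.
        per_dim.append([(l, h)] if mid == l else [(l, mid), (mid, h)])
    return [
        (tuple(r[0] for r in combo), tuple(r[1] for r in combo))
        for combo in product(*per_dim)
    ]


def segments(rows: Sequence[Key], lo: Key, hi: Key, max_rows: int,
             visited: List[Box]) -> Iterator[Tuple[Key, Key, int]]:
    """Yield (lo, hi, row_count) for non-empty boxes small enough to process."""
    visited.append((lo, hi))
    inside = [r for r in rows if in_box(r, lo, hi)]
    if not inside:
        return  # empty region: visited once, then discarded
    is_single_point = all(h - l <= 1 for l, h in zip(lo, hi))
    if len(inside) <= max_rows or is_single_point:
        yield (lo, hi, len(inside))
        return
    for sub_lo, sub_hi in split_box(lo, hi):
        yield from segments(inside, sub_lo, sub_hi, max_rows, visited)


# Hypothetical skewed data: two "heavy" orgs and one tiny org.
rows = [(7, e) for e in range(500)] + [(42, e) for e in range(300)] + [(90, 0)]
visited: List[Box] = []
result = list(segments(rows, (0, 0), (100, 1000), max_rows=100, visited=visited))
```

In this sketch an empty box costs one membership check and is never subdivided, while heavy regions keep being split. Halving every column yields up to 2**d sub-boxes per step for d key columns, which shows how the cost of such a scheme grows with the number of key columns.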

In the future we can try to optimize the algorithm further, for example according to statistical guesses, or by re-querying the limits of the subsections. But since it is already sufficiently complicated, and I was under time pressure to deliver this feature, I decided we can postpone this attempt to a separate PR.

If you need help understanding my approach, I can try explaining it live.

erezsh requested a review from nolar January 31, 2023 12:54

nolar reviewed Feb 13, 2023

data_diff/table_segment.py (resolved)

erezsh mentioned this pull request Feb 24, 2023

Multiple columns key #110

Closed

erezsh added 5 commits February 28, 2023 16:34

Refactor: extract to function split_key_space()

8443265

Refactor: split_key_space() now also returns start & end of range

27561c6

Implemented compound keys. min_key/max_key are now Vectors with product order.

b6cf899

Added more tests

e65dede

Fix types; Fix test for presto

6a2c3ec

erezsh force-pushed the compound_keys branch from f146070 to 6a2c3ec Compare February 28, 2023 15:37

erezsh merged commit 9267b9f into master Feb 28, 2023

dlawin mentioned this pull request Mar 1, 2023

--dbt Support compound PKs #427

Closed