This repository was archived by the owner on May 17, 2024. It is now read-only.

Compound keys implementation, using product order #375

Merged: 5 commits, Feb 28, 2023


Contributor

@erezsh erezsh commented Jan 28, 2023

min_key/max_key are now Vectors with a product order, as it has much simpler geometric properties than the previously proposed lexicographic order.
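To illustrate the difference between the two orders mentioned above, here is a minimal sketch. The helper names (`product_le`, `in_box`, `in_lex_range`) and the half-open range convention are assumptions made for this example and are not taken from the PR's code. It shows that under product order a `min_key`/`max_key` pair bounds a rectangular box, while the same pair under lexicographic order admits keys outside that rectangle.

```python
from typing import Tuple

Key = Tuple[int, ...]


def product_le(a: Key, b: Key) -> bool:
    """a <= b in product order: every component of a is <= the matching one in b."""
    if len(a) != len(b):
        raise ValueError("compound keys must have the same length")
    return all(x <= y for x, y in zip(a, b))


def in_box(key: Key, min_key: Key, max_key: Key) -> bool:
    """Half-open box test, component by component (assumed convention)."""
    return all(lo <= k < hi for lo, k, hi in zip(min_key, key, max_key))


def in_lex_range(key: Key, min_key: Key, max_key: Key) -> bool:
    """The same bounds read lexicographically (Python's built-in tuple order)."""
    return min_key <= key < max_key


a, b = (1, 5), (2, 3)
# Lexicographically a < b, but in product order the two keys are incomparable.
comparable = product_le(a, b) or product_le(b, a)

min_key, max_key = (0, 0), (2, 10)
key = (1, 12)  # second component lies outside [0, 10)
inside_box = in_box(key, min_key, max_key)
inside_lex = in_lex_range(key, min_key, max_key)
```

Here `inside_box` is false and `inside_lex` is true: the lexicographic range is not a rectangle, which is one way to read the "simpler geometric properties" remark.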

@erezsh erezsh requested a review from nolar January 31, 2023 12:54

GCCree commented Feb 1, 2023

Just out of curiosity, would a side-effect of compound key support mean that single non-numeric keys would also be supported as part of this? Assuming you could pass a single key whose values could be vectorized and whose min/max could be determined by the product order of themselves? I don't know if that makes sense. I believe currently the single key that you pass has to be numeric, but from the description above it seems like you could support text keys if you vectorized the text values?

Contributor Author

erezsh commented Feb 2, 2023

@GCCree The current implementation depends on the compound key having a fixed length. Also, it's quite inefficient for compound keys with many items (say, >=5).

However, we already support alphanumerics, and that could possibly be extended to support arbitrary text, at some performance cost, though a much smaller one than the cost of compound keys.
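The thread does not say why many key items are expensive. One plausible reading, offered here only as an assumption, is that splitting a box along every dimension multiplies the number of sub-boxes. The sketch below uses a hypothetical `split_box` helper (not the PR's function) that cuts each dimension into `n` segments and counts the result.

```python
from itertools import product


def split_box(min_key, max_key, n):
    """Cut each dimension of a half-open box into n segments; return all sub-boxes."""
    per_dim = []
    for lo, hi in zip(min_key, max_key):
        # n + 1 evenly spaced integer checkpoints from lo to hi
        points = [lo + (hi - lo) * i // n for i in range(n + 1)]
        per_dim.append(list(zip(points[:-1], points[1:])))
    return [
        (tuple(seg[0] for seg in combo), tuple(seg[1] for seg in combo))
        for combo in product(*per_dim)
    ]


# Number of sub-boxes for keys with k items, using 4 segments per item.
box_counts = {k: len(split_box((0,) * k, (1000,) * k, 4)) for k in (1, 2, 3, 5)}
```

With 4 segments per item, the count is `4 ** k`, so a 5-item key already produces 1024 sub-boxes per split.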

Contributor

@nolar nolar left a comment

I probably need some more time to review this with some drawings and charts — the logic behind the implementation is not very clear.

As far as I remember, the biggest challenge in the first implementation was that we did not know the profiles of columns and their ranges. A rather typical example (one I use in ClickHouse) is a compound key (org_id, entity_uuid). Most orgs have nothing in them, while several orgs are overpopulated with entities. As a result, the distribution is uneven.

With the first implementation, the idea was to split the 1st field into ranges, then the 2nd field into ranges, and then combine them. This would lead to way too many "empty" sections of the addressable key space: those "empty" orgs need no splitting, and only the several "heavy" orgs need it, each with its own intensity.

Can you please briefly describe how this problem is addressed in this approach? This will help me understand the concept of "product order" better. Thanks.
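The concern about empty sections can be made concrete with a small sketch. The data below is synthetic and purely hypothetical: two "heavy" orgs out of an assumed range of 100, with `entity_uuid` stood in for by plain integers to keep the arithmetic simple. The `bucket` helper is likewise an illustration, not code from the project.

```python
from collections import Counter

# Synthetic rows (hypothetical): only orgs 7 and 42 hold any entities.
rows = [(7, e) for e in range(900)] + [(42, e) for e in range(100)]


def bucket(value, lo, hi, n):
    """Index of the equal-width segment of [lo, hi) that contains value."""
    return (value - lo) * n // (hi - lo)


N = 10  # segments per column
counts = Counter(
    (bucket(org, 0, 100, N), bucket(ent, 0, 1000, N)) for org, ent in rows
)

total_cells = N * N      # every combination of an org range and an entity range
non_empty = len(counts)  # cells that actually contain rows
print(f"{non_empty} of {total_cells} cells contain rows")
```

For this data, 10 of the 100 cells hold rows, so a uniform grid spends most of its sections on empty key space.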

Contributor Author

erezsh commented Feb 13, 2023

> the biggest challenge in the first implementation was that we did not know the profiles of columns and their ranges

This was one of the challenges.

The problem with the first implementation was that we didn't divide the space up equally. This implementation solves it.

Another problem that came up later is dividing the space when one column is an integer while the other is alphanumeric. This implementation also addresses that situation.

As for compound keys that are not evenly distributed, this is still a performance problem, but not a huge one, since each empty region is only visited once and then discarded.

In the future we can try to further optimize the algorithm, for example by using statistical guesses, or by re-querying the limits of the subsections. But since it is already sufficiently complicated, and I was under time pressure to deliver this feature, I decided we can postpone this attempt to a separate PR.
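The "visited once, then discarded" behaviour can be sketched as a recursive walk over boxes. This is a simplified illustration under stated assumptions, not the PR's algorithm: it counts in-memory keys as a stand-in for whatever the real implementation queries per segment, and `bisect_box`, `walk`, and the `threshold` parameter are hypothetical names.

```python
from itertools import product


def in_box(key, lo, hi):
    """Half-open, component-wise containment test."""
    return all(l <= k < h for l, k, h in zip(lo, key, hi))


def bisect_box(lo, hi):
    """Halve every dimension that can be halved; return the resulting sub-boxes."""
    per_dim = []
    for l, h in zip(lo, hi):
        mid = (l + h) // 2
        # A dimension of width 1 cannot be halved, so it is kept whole.
        per_dim.append([(l, h)] if mid == l else [(l, mid), (mid, h)])
    return [
        (tuple(s[0] for s in combo), tuple(s[1] for s in combo))
        for combo in product(*per_dim)
    ]


def walk(keys, lo, hi, threshold, visited):
    """Return (lo, hi, count) leaves holding at most `threshold` keys."""
    visited.append((lo, hi))
    inside = [k for k in keys if in_box(k, lo, hi)]
    if not inside:
        return []  # empty region: visited once, never split further
    subs = bisect_box(lo, hi)
    if len(inside) <= threshold or len(subs) == 1:
        return [(lo, hi, len(inside))]
    leaves = []
    for sub_lo, sub_hi in subs:
        leaves.extend(walk(inside, sub_lo, sub_hi, threshold, visited))
    return leaves


# Hypothetical skewed data: one heavy org and two nearly empty ones.
keys = [(7, e) for e in range(50)] + [(1, 3), (12, 9)]
visited = []
leaves = walk(keys, (0, 0), (16, 64), threshold=10, visited=visited)
```

In this sketch, only boxes that contain more than `threshold` keys are split again, so the extra work concentrates on the heavy org while empty boxes cost a single visit each.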

If you need help understanding my approach, I can try explaining it live.

@erezsh erezsh mentioned this pull request Feb 24, 2023
@erezsh erezsh merged commit 9267b9f into master Feb 28, 2023
@dlawin dlawin mentioned this pull request Mar 1, 2023
3 participants