Compound keys implementation, using product order #375
Just out of curiosity, would a side-effect of compound key support mean that single non-numeric keys would also be supported as part of this? Assuming you could pass a single key whose values could be vectorized and whose min/max could be determined by the product order of themselves? I don't know if that makes sense. I believe currently the single key that you pass has to be numeric, but from the description above it seems like you could support text keys if you vectorized the text values?
@GCCree The current implementation depends on the compound key having a fixed length. It is also quite inefficient for compound keys with many items (say, >=5). However, we already support alphanumerics, and that could possibly be extended to support arbitrary text, at some performance cost, but one much smaller than the cost of compound keys.
I probably need some more time to review this with some drawings and charts — the logic behind the implementation is not very clear.
As far as I remember, the biggest challenge in the first implementation was that we did not know the profiles of columns and their ranges. A rather typical example (as I use in ClickHouse) is a compound key (org_id, entity_uuid). Most orgs have nothing in them, while several orgs are overpopulated with entities. As a result, the distribution is uneven.
With the first implementation, the idea was to split the 1st field into ranges, then the 2nd field into ranges, and then combine them. This would lead to far too many "empty" sections of the addressable key space: the "empty" orgs need no splitting, and only the few "heavy" orgs do, each with its own intensity.
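To make the concern concrete, here is a minimal sketch of the "split each field, then combine" idea on a skewed (org_id, entity) key. It is not the code from this PR: `split_range`, `grid_segments` and `in_box` are hypothetical helpers, and integer entity ids stand in for UUIDs.

```python
from itertools import product

def split_range(lo, hi, n):
    """Split the half-open integer range [lo, hi) into n contiguous sub-ranges."""
    step, rem = divmod(hi - lo, n)
    bounds = [lo]
    for i in range(n):
        # Spread any remainder over the first `rem` sub-ranges.
        bounds.append(bounds[-1] + step + (1 if i < rem else 0))
    return list(zip(bounds[:-1], bounds[1:]))

def grid_segments(mins, maxs, n):
    """Split every field into n ranges and combine them into n**k boxes."""
    per_field = [split_range(lo, hi, n) for lo, hi in zip(mins, maxs)]
    return list(product(*per_field))

def in_box(key, box):
    """True if every component of key falls in the matching [lo, hi) range."""
    return all(lo <= k < hi for k, (lo, hi) in zip(key, box))

# Skewed data (made up for illustration): org 3 is heavy, org 7 is small,
# and the other eight orgs are empty.
keys = [(3, e) for e in range(50)] + [(7, e) for e in range(5)]

boxes = grid_segments(mins=(0, 0), maxs=(10, 50), n=5)
non_empty = [b for b in boxes if any(in_box(k, b) for k in keys)]
print(len(boxes), len(non_empty))
```

With these made-up numbers, most of the 25 boxes contain no rows at all, which is the "empty sections" effect described above.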
Can you please briefly describe how this problem is addressed in this approach? This will help me understand the concept of "product order" better. Thanks.
This was one of the challenges. The problem with the first implementation was that we didn't divide the space up equally. This implementation solves it. Another problem that came up later is dividing the space when one column is an integer while the other is alphanumeric. This implementation also addresses that situation. As for compound keys that are not evenly distributed, this is still a performance problem, but not a huge one, since empty regions will only be iterated into once and then discarded. In the future we can try to further optimize the algorithm, for example according to statistical guesses, or by re-querying the limits of the subsections. But since it is already sufficiently complicated, and I was under time pressure to deliver this feature, I decided we can postpone this attempt to a separate PR. If you need help understanding my approach, I can try explaining it live.
f146070 to 6a2c3ec
min_key/max_key are now Vectors with a product order, as it has much simpler geometric properties than the previously proposed lexicographic order.
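For readers unfamiliar with the term, the following is a rough sketch of what a product order on key vectors means and why its ranges are geometrically simple. The function names are hypothetical and this is not the `Vector` class from the PR; it only assumes that keys are fixed-length tuples and that a range is a half-open box.

```python
def product_le(a, b):
    """Product order: a <= b only if every component of a is <= that of b."""
    return len(a) == len(b) and all(x <= y for x, y in zip(a, b))

def strictly_below(a, b):
    """Every component of a is strictly less than that of b."""
    return len(a) == len(b) and all(x < y for x, y in zip(a, b))

def in_product_range(key, min_key, max_key):
    """Half-open box: min_key <= key component-wise, key < max_key component-wise."""
    return product_le(min_key, key) and strictly_below(key, max_key)

def in_lex_range(key, min_key, max_key):
    """Same bounds under Python's built-in lexicographic tuple comparison."""
    return min_key <= key < max_key

min_key, max_key = (0, 0), (10, 10)

# Inside the box under both orders.
print(in_product_range((5, 5), min_key, max_key), in_lex_range((5, 5), min_key, max_key))

# (5, 50) is outside the box, yet it falls inside the lexicographic range,
# because lexicographic order stops comparing at the first differing component.
print(in_product_range((5, 50), min_key, max_key), in_lex_range((5, 50), min_key, max_key))

# Product order is only a partial order: these two keys are incomparable.
print(product_le((1, 5), (2, 3)), product_le((2, 3), (1, 5)))
```

Under the product order, the keys between min_key and max_key form an axis-aligned box, so each column can be bounded independently. Under lexicographic order, the same bounds leave the second column unbounded for every first-column value strictly between the endpoints.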