Document Table Constraint Enforcement Behavior in Custom Table Providers Guide #16340
Thank you @kosiew -- this is a great improvement on what we have.
I think we could merge this PR as is and update it in a follow-on PR, so I am approving it.
The optimizer also does not assume that these constraints hold when
rewriting queries. For example, declaring a column as a primary key will
not allow the optimizer to skip a `DISTINCT` aggregation.
I didn't think this was true -- I was pretty sure there are some ordering / functional dependency checks that rely on declared constraints, but I couldn't find them quickly when searching
Maybe @mustafasrepo remembers 🤔
hi @alamb,
You're right.
I tested this in datafusion-cli
-- Test 1: Create table with more data to see if DISTINCT appears
CREATE TABLE test_pk_large (
id INTEGER PRIMARY KEY,
name VARCHAR(50)
);
-- Insert duplicate names but unique IDs
INSERT INTO test_pk_large VALUES
(1, 'Alice'),
(2, 'Alice'),
(3, 'Bob'),
(4, 'Bob'),
(5, 'Charlie');
-- Test DISTINCT on primary key column
EXPLAIN SELECT DISTINCT id FROM test_pk_large;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ bytes: 376 │ |
| | │ format: memory │ |
| | │ rows: 1 │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
-- Test 2
CREATE TABLE test_no_pk (
id INTEGER,
name VARCHAR(50)
);
-- Insert unique IDs (same as before)
INSERT INTO test_no_pk VALUES
(1, 'Alice'),
(2, 'Alice'),
(3, 'Bob'),
(4, 'Bob'),
(5, 'Charlie');
EXPLAIN SELECT DISTINCT id FROM test_no_pk;
+---------------+-------------------------------+
| plan_type | plan |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ group_by: id │ |
| | │ │ |
| | │ mode: │ |
| | │ FinalPartitioned │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ CoalesceBatchesExec │ |
| | │ -------------------- │ |
| | │ target_batch_size: │ |
| | │ 8192 │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ RepartitionExec │ |
| | │ -------------------- │ |
| | │ partition_count(in->out): │ |
| | │ 10 -> 10 │ |
| | │ │ |
| | │ partitioning_scheme: │ |
| | │ Hash([id@0], 10) │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ RepartitionExec │ |
| | │ -------------------- │ |
| | │ partition_count(in->out): │ |
| | │ 1 -> 10 │ |
| | │ │ |
| | │ partitioning_scheme: │ |
| | │ RoundRobinBatch(10) │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ AggregateExec │ |
| | │ -------------------- │ |
| | │ group_by: id │ |
| | │ mode: Partial │ |
| | └─────────────┬─────────────┘ |
| | ┌─────────────┴─────────────┐ |
| | │ DataSourceExec │ |
| | │ -------------------- │ |
| | │ bytes: 376 │ |
| | │ format: memory │ |
| | │ rows: 1 │ |
| | └───────────────────────────┘ |
| | |
+---------------+-------------------------------+
In other words, the declared constraints do affect the optimizer: with the primary key declared, the `DISTINCT` aggregation is dropped from the plan.
I'll remove this paragraph.
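To make the consequence of the two plans above concrete, here is a minimal toy model of the behavior, not DataFusion code. The names `Table`, `unique_columns`, `plan_distinct`, and `execute` are hypothetical and exist only for this sketch. It assumes an optimizer that skips deduplication whenever a column is declared unique and never checks the data, so a table that declares a key but holds duplicates could return duplicates from `DISTINCT`.

```python
# Toy model only: this is NOT DataFusion code. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Table:
    rows: list                                        # one dict per row
    unique_columns: set = field(default_factory=set)  # declared, never validated


def plan_distinct(table, column):
    """Pick the operators a toy optimizer would use for SELECT DISTINCT column."""
    if column in table.unique_columns:
        # Declared unique: trust the declaration and skip deduplication.
        return ["scan"]
    return ["scan", "aggregate"]


def execute(table, column, plan):
    """Run the toy plan and return the resulting column values."""
    values = [row[column] for row in table.rows]
    if "aggregate" in plan:
        # Deduplicate while preserving first-seen order.
        values = list(dict.fromkeys(values))
    return values


# The same data, which violates the declared key (id 1 appears twice).
rows = [{"id": 1, "name": "Alice"}, {"id": 1, "name": "Bob"}]
declared = Table(rows, unique_columns={"id"})
undeclared = Table(rows)

print(execute(declared, "id", plan_distinct(declared, "id")))      # [1, 1]
print(execute(undeclared, "id", plan_distinct(undeclared, "id")))  # [1]
```

If the real optimizer behaves like this sketch, a custom `TableProvider` that declares a constraint would need to guarantee it itself, which is the kind of expectation the documentation should spell out.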
Co-authored-by: Andrew Lamb <[email protected]>
…e constraints documentation
Thank you
Thank you @kosiew and @xudong963
Which issue does this PR close?
Rationale for this change
Table constraints like primary keys, uniqueness, and foreign keys are common features in relational systems, but DataFusion does not currently enforce or optimize based on most of them. This lack of enforcement isn't clearly documented, which can lead to confusion for TableProvider authors and users expecting standard SQL behavior. This PR aims to clarify that and guide users with expectations and references for typical implementations.
What changes are included in this PR?
`custom-table-providers.md` file describing how DataFusion currently treats table constraints.
Are these changes tested?
N/A – This change is purely documentation-related and does not include or require any code or behavior changes.
Are there any user-facing changes?
Yes – this change updates the documentation to make constraint behavior more transparent for users implementing custom `TableProvider`s.