Document Table Constraint Enforcement Behavior in Custom Table Providers Guide #16340

Merged
merged 8 commits into from
Jun 12, 2025

Conversation

kosiew
Contributor

@kosiew kosiew commented Jun 9, 2025

Which issue does this PR close?

Rationale for this change

Table constraints like primary keys, uniqueness, and foreign keys are common features in relational systems, but DataFusion does not currently enforce or optimize based on most of them. This lack of enforcement isn't clearly documented, which can lead to confusion for TableProvider authors and users expecting standard SQL behavior. This PR aims to clarify that and guide users with expectations and references for typical implementations.

What changes are included in this PR?

  • Adds documentation to the custom-table-providers.md file describing how DataFusion currently treats table constraints.
  • Notes that some constraints (like nullability) are enforced, but others (like uniqueness or PK/FK constraints) are not.
  • References relevant background discussion and highlights the optimizer's current limitations in leveraging constraint metadata.
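One rough way to check the non-enforcement described above is a short probe in datafusion-cli. This is only a sketch: the table name `pk_probe` is hypothetical and not part of this PR, and the outcome is an assumption based on the PR description rather than verified output. If primary keys are not enforced, the `INSERT` should be accepted even though both rows share `id = 1`.

```sql
-- Hypothetical probe table; the name is illustrative only.
CREATE TABLE pk_probe (
    id INTEGER PRIMARY KEY,
    name VARCHAR(50)
);

-- Both rows use the same id. An engine that enforced the
-- primary key would reject this statement.
INSERT INTO pk_probe VALUES
    (1, 'Alice'),
    (1, 'Bob');

-- Inspect what was actually stored.
SELECT id, name FROM pk_probe;
```

A custom TableProvider that needs uniqueness guarantees would have to perform its own validation before accepting such rows.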

Are these changes tested?

N/A – This change is purely documentation-related and does not include or require any code or behavior changes.

Are there any user-facing changes?

Yes – this change updates the documentation to make constraint behavior more transparent for users implementing custom TableProviders.

@github-actions bot added the documentation label Jun 9, 2025
Contributor

@alamb alamb left a comment

Thank you @kosiew -- this is a great improvement on what we have

I think we could merge this PR as is and update it in a follow-on PR too, so I am approving it.

Comment on lines 40 to 42
The optimizer also does not assume that these constraints hold when
rewriting queries. For example, declaring a column as a primary key will
not allow the optimizer to skip a `DISTINCT` aggregation.

I didn't think this was true -- I was pretty sure there is some ordering / functional dependency check that relies on declared constraints, but I couldn't find it quickly when searching

Maybe @mustafasrepo remembers 🤔

Contributor Author

@kosiew kosiew Jun 11, 2025

hi @alamb,

You're right.

I tested this in datafusion-cli

-- Test 1: Create table with more data to see if DISTINCT appears
CREATE TABLE test_pk_large (
    id INTEGER PRIMARY KEY,
    name VARCHAR(50)
);

-- Insert duplicate names but unique IDs
INSERT INTO test_pk_large VALUES 
    (1, 'Alice'),
    (2, 'Alice'),
    (3, 'Bob'),
    (4, 'Bob'),
    (5, 'Charlie');

-- Test DISTINCT on primary key column
EXPLAIN SELECT DISTINCT id FROM test_pk_large;

+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 376        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

-- Test 2
CREATE TABLE test_no_pk (
    id INTEGER,
    name VARCHAR(50)
);

-- Insert unique IDs (same as before)
INSERT INTO test_no_pk VALUES 
    (1, 'Alice'),
    (2, 'Alice'),
    (3, 'Bob'),
    (4, 'Bob'),
    (5, 'Charlie');

EXPLAIN SELECT DISTINCT id FROM test_no_pk;

+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │        group_by: id       │ |
|               | │                           │ |
|               | │           mode:           │ |
|               | │      FinalPartitioned     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          10 -> 10         │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │      Hash([id@0], 10)     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │      RepartitionExec      │ |
|               | │    --------------------   │ |
|               | │ partition_count(in->out): │ |
|               | │          1 -> 10          │ |
|               | │                           │ |
|               | │    partitioning_scheme:   │ |
|               | │    RoundRobinBatch(10)    │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       AggregateExec       │ |
|               | │    --------------------   │ |
|               | │        group_by: id       │ |
|               | │       mode: Partial       │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 376        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+

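In other words, the declared constraints do affect the optimizer: with the primary key declared, the plan for `SELECT DISTINCT id` contains no `AggregateExec`, while the table without the key needs one.
I'll remove this paragraph.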
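A possible follow-up check, sketched here under the assumption that inserts are not validated against the declared key: load duplicate ids into a table with a primary key and compare the plan with the actual result. The table name `pk_dupes` is hypothetical, and no output is shown because this has not been run as part of the PR.

```sql
-- Hypothetical table; assumes duplicate ids are accepted on insert.
CREATE TABLE pk_dupes (
    id INTEGER PRIMARY KEY,
    name VARCHAR(50)
);

INSERT INTO pk_dupes VALUES
    (1, 'Alice'),
    (1, 'Bob');

-- Check whether the plan still omits the aggregation for DISTINCT.
EXPLAIN SELECT DISTINCT id FROM pk_dupes;

-- Compare with the rows actually returned.
SELECT DISTINCT id FROM pk_dupes;
```

If the plan omits the aggregation while the data violates the key, the result may contain duplicates, which would be worth calling out in the documentation.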

Member

@xudong963 xudong963 left a comment

Thank you

@alamb
Contributor

alamb commented Jun 12, 2025

Thank you @kosiew and @xudong963

@alamb alamb merged commit 0e84041 into apache:main Jun 12, 2025
5 checks passed
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add documentation on constraint enforcements
3 participants