Skip to content
This repository was archived by the owner on May 17, 2024. It is now read-only.

add dbt integration #364

Merged
merged 3 commits into from
Feb 4, 2023
Merged

add dbt integration #364

merged 3 commits into from
Feb 4, 2023

Conversation

dlawin
Copy link
Contributor

@dlawin dlawin commented Jan 5, 2023

Adds options to datadiff: data-diff --dbt and data-diff --dbt --cloud

With some suboptions:
--dbt-profiles-dir
--dbt-project-dir

The main options parse a dbt project's artifacts and profiles.yml in order to easily run multiple diffs without specifying the typical connection strings and table names.

Currently, it will run a diff for each successful model in the project's last run.

The main configuration step is to add one or both of the following to the vars section of the dbt_project.yml:

vars:
  data_diff:
    prod_database: PROD_DATABASE_NAME
    prod_schema: PROD_SCHEMA_NAME

If a dbt project is organized at a schema level (e.g. all models write to a single schema), both are needed. If the dbt project writes to multiple schemas, just the database var is needed.

If you're using the --dbt-cloud option, that requires a Datafold api key on the env var DATAFOLD_API_KEY. Additionally a datasource_id in the vars section:

vars:
  data_diff:
    prod_database: PROD_DATABASE_NAME
    prod_schema: PROD_SCHEMA_NAME
    datasource_id: 123

Caveats:

  • Limited database support (Snowflake, BigQuery so far)
  • Dbt version >= 1.0.0
    • Other versions may be fine if the format of the artifacts don't change dramatically. I have not tested < 1.0.0
  • I've only added unit tests for the most part. There are two integration tests that can be used to debug a "real" dbt diff session, but you need to set an env variable to enable them.
  • Cloud diffs just provide a link to the diff running on Datafold (no results)
  • Local diffs run sequentially, could improve performance by running in parallel

@dlawin dlawin self-assigned this Jan 5, 2023
@erezsh
Copy link
Contributor

erezsh commented Jan 5, 2023

This change of syntax is a breaking change for every user of data-diff, not to mention our own documentation and videos.

Why not add a --dbt flag instead?

@dlawin
Copy link
Contributor Author

dlawin commented Jan 5, 2023

This change of syntax is a breaking change for every user of data-diff, not to mention our own documentation and videos.

Why not add a --dbt flag instead?

Mainly to support flags for the dbt functionality. I think it would be confusing to have flags that only apply when the --dbt flag is present. This would be odd for example:
data-diff --dbt --cloud

What would the behavior of data-diff --cloud or data-diff -proj /path be?

@leoebfolsom
Copy link
Contributor

This is very minor, but we should be consistent around saying database/table 1/2 vs database/table a/b. We should use a/b instead of 1/2.

Erez will recall that I was very annoying about this when he was naming columns for materialized tables. While I don't feel strongly about 1 vs a, we settled on a, and that is reflected in this guide: https://docs.datafold.com/guides/os_diff

🙏

@erezsh
Copy link
Contributor

erezsh commented Jan 12, 2023

@dlawin

I think it would be confusing to have flags that only apply when the --dbt flag is present

We already have this kind of behavior for hashdiff vs joindiff. It might be confusing, but I don't know of an easy way to avoid it.

Also, I believe we can allow the data-diff dbt ... syntax without making any of these changes.

We already do a bit of syntax shenanigans here to allow the db table1 table2 syntax. It shouldn't be hard to tweak it, so it tests if database1 == "dbt" and acts accordingly.

The last alternative is to create a new entry-point, like data-diff-dbt ..., which I think is a bit of an overkill, but might be a bit less confusing.

Either way, breaking the API is a definite no-no.

@dlawin
Copy link
Contributor Author

dlawin commented Jan 26, 2023

@dlawin

I think it would be confusing to have flags that only apply when the --dbt flag is present

We already have this kind of behavior for hashdiff vs joindiff. It might be confusing, but I don't know of an easy way to avoid it.

Also, I believe we can allow the data-diff dbt ... syntax without making any of these changes.

We already do a bit of syntax shenanigans here to allow the db table1 table2 syntax. It shouldn't be hard to tweak it, so it tests if database1 == "dbt" and acts accordingly.

The last alternative is to create a new entry-point, like data-diff-dbt ..., which I think is a bit of an overkill, but might be a bit less confusing.

Either way, breaking the API is a definite no-no.

Switched to option flags, dbt specific functionality is prefixed by --dbt-

@dlawin
Copy link
Contributor Author

dlawin commented Jan 26, 2023

I did move some of the option functionality up to the main method when I thought it made sense (--debug, --no-tracking, etc.)

For now, the majority of the other options are ignored when using --dbt or --dbt --cloud, but we can add that over time

@dlawin dlawin force-pushed the dbt branch 3 times, most recently from 4ce9f39 to 27e0a87 Compare January 27, 2023 18:00
@williebsweet williebsweet requested a review from nolar January 27, 2023 18:43
@dlawin
Copy link
Contributor Author

dlawin commented Feb 1, 2023

tested compatibility with dbt:
1.0
1.1
1.2
1.3
1.4

I didn't cover every adapter (just snowflake), but that should not affect the dbt-core artifacts being parsed.

@dlawin
Copy link
Contributor Author

dlawin commented Feb 1, 2023

Hey @williebsweet @nolar, I think I've addressed all of the feedback. Re-requested review

@dlawin dlawin requested a review from erezsh February 3, 2023 18:22
Copy link
Contributor

@erezsh erezsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good enough for me

@erezsh erezsh merged commit 13f9beb into datafold:master Feb 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants