Add database conversion scripts #1005

Merged

Conversation

tgnottingham
Contributor

We already have some scripts that partially convert between SQLite and Postgres DBs.
This change adds scripts that fully convert a SQLite DB to Postgres and vice versa,
largely intended to make it easier to test changes on both supported DB types. It also sneaks
in some changes to make the schemas of the two DB types more consistent with each other.

@tgnottingham
Contributor Author

I tested correctness by creating at least one row of every table in SQLite, doing a round-trip conversion (SQLite -> new Postgres -> new SQLite), and checking that the SQL dumps were identical before and after. Not a perfect test, but I think it's good enough.
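
In shell terms, the round trip was along these lines (the sqlite-to-postgres argument order shown here is illustrative):

$ ./target/release/sqlite-to-postgres before.db postgres://user:password@localhost:5432
$ ./target/release/postgres-to-sqlite postgres://user:password@localhost:5432 after.db
$ diff <(sqlite3 before.db .dump) <(sqlite3 after.db .dump)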

I tested speed by using the DB at http://perf-data.rust-lang.org/export.db.sz, of which only the collection table had any significant size. Using individual insert statements was fine for the SQLite DB, but unbearably slow for Postgres, so I opted to convert the data to CSV and use Postgres's COPY FROM STDIN command, which was much faster, if a bit ugly, due to some null-case handling and the interoperation between the csv and tokio_postgres APIs.

I'd be happy to create a solution that cleans up the workarounds related to those issues and doesn't rely on our data not containing a particular sentinel string to represent null. It was just going to take a decent chunk of code to do right, and I didn't want to add to the review burden if it wasn't that important.
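
For a sense of the shape of the COPY path, a trimmed-down sketch (the sentinel constant and function signature here are illustrative, not the exact code in this PR):

use bytes::Bytes;
use futures_util::SinkExt;
use tokio_postgres::{Client, Error};

// Illustrative sentinel; the real value is exactly the string the data
// must never legitimately contain.
const NULL_SENTINEL: &str = "<NULL>";

// Stream an in-memory CSV buffer into Postgres via COPY, mapping the
// sentinel string to SQL NULL.
async fn copy_csv_in(client: &Client, table: &str, csv: Vec<u8>) -> Result<u64, Error> {
    let stmt = format!("COPY {} FROM STDIN (FORMAT csv, NULL '{}')", table, NULL_SENTINEL);
    let sink = client.copy_in(stmt.as_str()).await?;
    futures_util::pin_mut!(sink);
    sink.send(Bytes::from(csv)).await?;
    // finish() completes the COPY and returns the number of rows written.
    sink.finish().await
}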

@tgnottingham tgnottingham force-pushed the sqlite-postgres-converters branch from efddeb4 to 7206d37 on September 16, 2021 20:47
@tgnottingham
Contributor Author

Force push was just an update to a comment.

@Mark-Simulacrum
Member

Gave each file a fairly quick skim -- I think the approaches used seem OK, so we can likely merge this.

However, I think we should delete the export-to-sqlite.rs script, right? It seems like it's made obsolete by the postgres-to-sqlite script here. That said, in order to do that, we'll probably want a configuration option which skips export of the self profile query data, as it is way too large for us to export from the production db. (Probably we should investigate not storing all that data as we are today, since the table is unwieldy).

 Schema |             Name             | Type  |   Owner    |  Size   | Description
--------+------------------------------+-------+------------+---------+-------------
 public | artifact                     | table | rustc-perf | 576 kB  |
 public | artifact_collection_duration | table | rustc-perf | 240 kB  |
 public | benchmark                    | table | rustc-perf | 48 kB   |
 public | collection                   | table | rustc-perf | 492 MB  |
 public | collector_progress           | table | rustc-perf | 13 MB   |
 public | error                        | table | rustc-perf | 11 MB   |
 public | error_series                 | table | rustc-perf | 48 kB   |
 public | migrations                   | table | rustc-perf | 48 kB   |
 public | pstat                        | table | rustc-perf | 2671 MB |
 public | pstat_series                 | table | rustc-perf | 536 kB  |
 public | pull_request_build           | table | rustc-perf | 392 kB  |
 public | raw_self_profile             | table | rustc-perf | 148 MB  |
 public | rustc_compilation            | table | rustc-perf | 38 MB   |
 public | self_profile_query           | table | rustc-perf | 56 GB   |
 public | self_profile_query_series    | table | rustc-perf | 68 MB   |

I'm testing an export of the production database now, which takes forever (several hours at least) with self_profile_query, but I figure it will be a nice artifact to have available, though an unwieldy one.


When trying to build this locally to test out the export script, cargo build --release -p database failed -- I think probably due to some missing features = [...] declarations? This might be preexisting, not sure, but could you fix this?

error[E0277]: the trait bound `DateTime<Utc>: _::_serde::Serialize` is not satisfied
    --> database/src/bin/sqlite-to-postgres.rs:359:5
     |
359  |     requested: Nullable<DateTime<Utc>>,
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `_::_serde::Serialize` is not implemented for `DateTime<Utc>`
     |
note: required because of the requirements on the impl of `_::_serde::Serialize` for `Nullable<DateTime<Utc>>`
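
I'd guess the fix is enabling chrono's serde feature in the database crate's Cargo.toml, something like the following (version number illustrative):

chrono = { version = "0.4", features = ["serde"] }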

@Mark-Simulacrum
Member

FWIW:

Copied 6160 rows from artifact table in 411.434401ms (14972 rows/second)
Copied 4203 rows from artifact_collection_duration table in 138.601804ms (30324 rows/second)
Copied 51 rows from benchmark table in 68.618983ms (743 rows/second)
Copied 7015220 rows from collection table in 18.94866753s (370222 rows/second)
Copied 188128 rows from collector_progress table in 763.427603ms (246425 rows/second)
Copied 48 rows from error_series table in 69.234538ms (693 rows/second)
Copied 1959 rows from error table in 3.246739192s (603 rows/second)
Copied 6449 rows from pstat_series table in 136.869589ms (47118 rows/second)
Copied 53662434 rows from pstat table in 161.743591828s (331775 rows/second)
Copied 1615 rows from pull_request_build table in 149.155674ms (10828 rows/second)
Copied 2201666 rows from raw_self_profile table in 11.513527173s (191224 rows/second)
Copied 657514 rows from rustc_compilation table in 2.644713239s (248614 rows/second)
Copied 716364 rows from self_profile_query_series table in 5.145879719s (139211 rows/second)
Copied 786711365 rows from self_profile_query table in 19569.481443377s (40201 rows/second)

@Mark-Simulacrum
Member

(FYI -- pulled in part of this PR as part of #1019, so you'll want to rebase. Hopefully painless!)

@tgnottingham tgnottingham force-pushed the sqlite-postgres-converters branch from d979f22 to 3d19e4f on September 18, 2021 18:17
@tgnottingham
Contributor Author

I added some options so that postgres-to-sqlite covers the export-to-sqlite functionality, albeit with more info needed from the user. I can add an alias option like --partial that does something commonly wanted, if that helps.

$ ./target/release/postgres-to-sqlite --help
postgres-to-sqlite 
Exports a rustc-perf Postgres database to a SQLite database

USAGE:
    postgres-to-sqlite [FLAGS] [OPTIONS] <POSTGRES_DB> <SQLITE_DB>

FLAGS:
        --fast-unsafe    Enable faster execution at the risk of corrupting SQLite database in the event of a crash
    -h, --help           Prints help information
    -V, --version        Prints version information

OPTIONS:
        --exclude-tables <TABLES>    Exclude given tables (as foreign key constraints allow) [possible values: artifact,
                                     artifact_collection_duration, benchmark, collection, collector_progress,
                                     error_series, error, pstat_series, pstat, pull_request_build, raw_self_profile,
                                     rustc_compilation, self_profile_query_series, self_profile_query]
        --since-weeks-ago <WEEKS>    Exclude data associated with artifacts whose date value precedes <WEEKS> weeks ago

ARGS:
    <POSTGRES_DB>    Postgres database connection string, e.g. postgres://user:password@localhost:5432
    <SQLITE_DB>      SQLite database file

The --exclude-tables option isn't forgiving at all if you exclude a table that an unexcluded table has foreign key references to: it will panic when the foreign key constraint is violated. That would be painful late in a long export process. So...don't get it wrong. :)
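
For example, excluding both self-profile tables together keeps the foreign keys satisfied (whether multiple values are comma-separated here is an assumption about the CLI parsing):

$ ./target/release/postgres-to-sqlite --exclude-tables self_profile_query,self_profile_query_series postgres://user:password@localhost:5432 out.db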


Copied 786711365 rows from self_profile_query table in 19569.481443377s (40201 rows/second)

Oof. Added the --fast-unsafe option to do what the export-to-sqlite script did by default -- set the journal_mode and synchronous pragmas to "off". Also, if exporting is something people want to do more frequently, inserting multiple rows per insert statement can supposedly help quite a bit. It just requires more code.
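
For reference, the pragma part of --fast-unsafe amounts to roughly this rusqlite sketch (not the exact code in the PR):

use rusqlite::{Connection, Result};

fn apply_fast_unsafe(conn: &Connection) -> Result<()> {
    // `PRAGMA journal_mode = OFF` returns the resulting mode as a row,
    // so read it back rather than using execute().
    let mode: String = conn.query_row("PRAGMA journal_mode = OFF", [], |row| row.get(0))?;
    assert_eq!(mode.to_lowercase(), "off");
    // `PRAGMA synchronous = OFF` returns no rows.
    conn.execute_batch("PRAGMA synchronous = OFF")?;
    Ok(())
}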

@tgnottingham tgnottingham force-pushed the sqlite-postgres-converters branch from 3d19e4f to ed4f7b1 on September 18, 2021 18:32
@tgnottingham
Contributor Author

I updated the Dockerfile to reference postgres-to-sqlite instead of export-to-sqlite. If anything outside of rustc-perf references that script, it will need updating, of course.

@tgnottingham tgnottingham force-pushed the sqlite-postgres-converters branch from ed4f7b1 to 086cdc7 on September 18, 2021 18:38
@Mark-Simulacrum
Member

I don't think we use the export script anywhere (anymore) from docker, so that should be fine to update.

It would be nice to have a helper flag --no-self-profile or similar that excludes just the right tables -- the self-profile tables that are slow to export (so we'd keep the URLs table but not the summarized results rows).

I don't think we need to worry right now about optimizing the export/import too much. I suspect that at least in part the massive slowness might (though I have no real evidence) be because we're not committing to the database - I saw the .wal file growing into the tens of gigabytes. If we're checking foreign keys + constraints during processing, this will fail badly on the write-ahead log if it's not indexed/nicely searchable, pretty much no matter what. So maybe we should be committing in between... I don't really know enough about the internals of SQLite to experiment here.
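
To illustrate what I mean by committing in between, a hypothetical sketch (table, columns, and batch size all made up):

use rusqlite::{params, Connection, Result};

const BATCH: usize = 100_000;

fn copy_rows(conn: &Connection, rows: &[(i64, String)]) -> Result<()> {
    let mut tx = conn.unchecked_transaction()?;
    for (i, (id, name)) in rows.iter().enumerate() {
        tx.execute("INSERT INTO t (id, name) VALUES (?1, ?2)", params![id, name])?;
        // Commit periodically so the journal never grows unbounded.
        if (i + 1) % BATCH == 0 {
            tx.commit()?;
            tx = conn.unchecked_transaction()?;
        }
    }
    tx.commit()
}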

If you'd like me to, I'm happy to throw the full export on the S3 bucket (42 GB, 17 GB zstd-compressed) so you can play with it in both directions. (Not really sure why this is so much smaller than Postgres - maybe we're missing some data?)

@tgnottingham
Contributor Author

It would be nice to have a helper flag --no-self-profile or similar that excludes just the right tables -- the self-profile tables that are slow to export (so we'd keep the URLs table but not the summarized results rows).

Done. Really just excludes self_profile_query and self_profile_query_series, the latter of which apparently isn't large, but isn't really needed if you don't export the former.

I don't think we need to worry right now about optimizing the export/import too much. I suspect that at least in part the massive slowness might (though I have no real evidence) be because we're not committing to the database - I saw the .wal file growing into the tens of gigabytes. If we're checking foreign keys + constraints during processing, this will fail badly on the write-ahead log if it's not indexed/nicely searchable, pretty much no matter what. So maybe we should be committing in between... I don't really know enough about the internals of SQLite to experiment here.

Could be. I have very little DB knowledge, much less SQLite-specific knowledge. I've seen it commonly recommended to drop constraints when doing mass DB conversions like this, then add them back at the end, to improve performance. But I think that would be somewhat painful to do in a maintainable way, what with the migrations machinery. The --fast-unsafe option (which is not really unsafe if you're exporting to a new DB...) at least makes the WAL file a non-issue, IIUC.

If you'd like me to, I'm happy to throw the full export on the S3 bucket (42 GB, 17 GB zstd-compressed) so you can play with it in both directions. (Not really sure why this is so much smaller than Postgres - maybe we're missing some data?)

Yeah, that would be nice! I think I'm probably done trying to speed up the exports for the time being, but it would definitely be helpful to have for testing in general.

As for the size difference, from a quick search it looks like it's not too uncommon for the two to differ by quite a bit in one direction or the other.

@Mark-Simulacrum
Member

Sounds good. I'll upload the file later then (not replacing the existing one though, given the size).

Will take a look at the commits here in a bit and hopefully either leave a final review or merge :)

@Mark-Simulacrum
Member

https://perf-data.rust-lang.org/ptosql.db.zst is the file.

@Mark-Simulacrum
Member

Generally looks good. Left one comment on the added migration that seems good to resolve (or drop the migration for now, which may be way easier).

I think we have some basic database checks right now in CI -- mostly just "can run migrations". I wonder if you'd be up for running the conversion in both directions as well -- even on empty data sets it'll confirm we don't have any malformed queries. That won't check that new tables aren't forgotten, but at least any adjustments to old ones should get picked up (I guess not new columns...).

@tgnottingham
Contributor Author

I think we have some basic database checks right now in CI -- mostly just "can run migrations". I wonder if you'd be up for running the conversion in both directions as well -- even on empty data sets it'll confirm we don't have any malformed queries. That won't check that new tables aren't forgotten, but at least any adjustments to old ones should get picked up (I guess not new columns...).

You mean adding something to .github/workflows/ci.yml to test this in CI, right? Sure, I can do that.

SQLite allows primary keys to be null. For consistency with Postgres and
sanity, add a not null constraint to the benchmark table's primary key.

Also, add a facility to support a greater variety of migrations.

The feature was already used by another crate in the workspace, so the
database crate would build with `cargo build`, but not with `cargo build
-p database`. This fixes that.

This should be fairly safe to do. Although the side-effects of each step
will be visible to subsequent steps, for the checks we're running, it's
unlikely to matter. Saves a little build time.
@tgnottingham tgnottingham force-pushed the sqlite-postgres-converters branch from 95a9290 to b846359 on September 19, 2021 21:59
@tgnottingham
Contributor Author

Added two commits to update CI checks. I combined the database checks into one job. Let me know if that seems like a bad idea.

Updated the first commit with some new migration code. Recommend viewing that diff with whitespace changes hidden. Verified that it bails when a migration violates a foreign key constraint (did this by having a migration delete a specific row that another table referenced). The panic message looks like this:

thread 'main' panicked at 'Foreign key violation encountered during migration
table: error_series,
column: crate,
row_id: Some(1),
foreign table: benchmark,
foreign column: name
migration ID: 11', database/src/pool/sqlite.rs:161:21

@Mark-Simulacrum
Member

OK, I think this seems good to merge. I did have a thought just now (maybe a quick followup PR, though feel free to add to this one): we could use the export.db.sz on perf-data.rust-lang.org (the smaller export) to more thoroughly exercise import/export than with empty DBs. It'd also probably give us a more thorough test of the site loading code, fwiw.

For now that database is somewhat outdated (a couple months old at this point), but I imagine eventually we'll set up some kind of regular export.

@tgnottingham
Contributor Author

Sure, I'll add that to this PR.

@Mark-Simulacrum
Member

OK, happy to merge once that's done :)

@tgnottingham
Contributor Author

Investigating the CI failure.

@tgnottingham tgnottingham force-pushed the sqlite-postgres-converters branch 3 times, most recently from 47750c8 to 8047729 on September 20, 2021 00:07
@tgnottingham tgnottingham force-pushed the sqlite-postgres-converters branch from 8047729 to 7566cda on September 20, 2021 00:21
@tgnottingham
Contributor Author

The CI issue was that I was diffing the before and after round-trip SQLite dumps, but the before dump didn't have the latest migrations applied. Fixed that. The dumps can also differ by whitespace, so I accounted for that, too.

Noticed that postgres-to-sqlite is ~7x slower in CI than on my 2015 laptop running Postgres in a container. May want to do the multiple-rows-per-insert approach at some point to help with that. But not now. :)
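
For later reference, multi-row inserts would look roughly like this (hypothetical table and chunk size; SQLite's default 999-bound-variable limit caps chunk size times column count):

use rusqlite::{Connection, Result, ToSql};

const CHUNK: usize = 400;

fn insert_rows(conn: &Connection, rows: &[(i64, String)]) -> Result<()> {
    for chunk in rows.chunks(CHUNK) {
        // One INSERT statement carrying `chunk.len()` rows.
        let placeholders = vec!["(?, ?)"; chunk.len()].join(", ");
        let sql = format!("INSERT INTO t (id, name) VALUES {}", placeholders);
        let mut params: Vec<&dyn ToSql> = Vec::with_capacity(chunk.len() * 2);
        for (id, name) in chunk {
            params.push(id);
            params.push(name);
        }
        conn.execute(&sql, params.as_slice())?;
    }
    Ok(())
}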

@Mark-Simulacrum
Member

Sounds great! I'll go ahead and wait for all the CI checks to pass and then merge this (plausibly tomorrow).

@Mark-Simulacrum Mark-Simulacrum merged commit 672da94 into rust-lang:master Sep 20, 2021
@tgnottingham tgnottingham deleted the sqlite-postgres-converters branch September 25, 2021 17:16