Add database conversion scripts #1005
Conversation
I tested correctness by creating at least one row of every table in SQLite, doing a round-trip conversion (SQLite -> new Postgres -> new SQLite), and checking that the SQL dumps were identical before and after. Not a perfect test, but I think it's good enough. I tested speed by using the DB at http://perf-data.rust-lang.org/export.db.sz, of which only the

I'd be happy to create a solution that cleans up the workarounds related to those issues, and doesn't rely on our data not containing a particular sentinel string to represent null; it was just going to take a decent chunk of code to do that right, and I didn't want to add to the review burden if it wasn't that important.
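The dump comparison at the heart of this round-trip test can be sketched roughly as below. This is an illustrative helper, not the actual script code: it normalizes whitespace line by line before comparing, so cosmetic differences in dump formatting don't register as data corruption.

```rust
// Sketch: compare two SQL dumps after normalizing whitespace, so a
// SQLite -> Postgres -> SQLite round trip can be checked for data
// fidelity. Function names are illustrative, not from the real scripts.
fn normalize_dump(dump: &str) -> Vec<String> {
    dump.lines()
        .map(|l| l.split_whitespace().collect::<Vec<_>>().join(" "))
        .filter(|l| !l.is_empty())
        .collect()
}

fn dumps_match(before: &str, after: &str) -> bool {
    normalize_dump(before) == normalize_dump(after)
}

fn main() {
    let before = "CREATE TABLE artifact (id INTEGER);\nINSERT INTO artifact VALUES (1);";
    let after = "CREATE TABLE artifact  (id INTEGER);\n\nINSERT INTO artifact VALUES (1);";
    assert!(dumps_match(before, after));
    println!("round trip ok");
}
```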
Force-pushed from efddeb4 to 7206d37
Force push was just an update to a comment.
Gave each file a fairly quick skim -- I think the approaches used seem OK, can likely merge this. However, I think we should delete the export-to-sqlite.rs script, right? It seems like it's made obsolete by the postgres-to-sqlite script here. That said, in order to do that, we'll probably want a configuration option which skips export of the self profile query data, as it is way too large for us to export from the production db. (Probably we should investigate not storing all that data as we are today, since the table is unwieldy).
I'm testing an export of the production database now, which takes forever (several hours at least) with self_profile_query but I figure will be a nice artifact to have available, though an unwieldy one. When trying to build this locally to test out the export script
FWIW:
Force-pushed from 7206d37 to d979f22
(FYI -- pulled in part of this PR as part of #1019, so you'll want to rebase. Hopefully painless!)
Force-pushed from d979f22 to 3d19e4f
I added some options so that postgres-to-sqlite covers the export-sqlite functionality, albeit with more info needed from the user. I can add an alias option like
The
Oof. Added the
Force-pushed from 3d19e4f to ed4f7b1
I updated
Force-pushed from ed4f7b1 to 086cdc7
I don't think we use the export script anywhere (anymore) from docker, so that should be fine to update. It would be nice to have a helper flag --no-self-profile or similar that drops just the right tables, so we skip the self-profile tables that are slow to export (we should keep the urls table but not the summarize results rows). I don't think we need to worry right now about optimizing the export/import too much. I suspect that at least in part the massive slowness might (though I have no real evidence) be because we're not committing to the database - I saw the .wal file growing into the tens of gigabytes. If we're checking foreign keys + constraints during processing, then this will fail badly on the write-ahead log if it's not indexed/nicely searchable, pretty much no matter what. So maybe we should be committing in between... I don't really know enough about the internals of SQLite to experiment here. If you'd like me to, I'm happy to throw the full export on the S3 bucket (42 GB, 17 GB zstd compressed) so you can play with it in both directions. (Not really sure why this is so much smaller than Postgres - maybe we're missing some data?)
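The "committing in between" idea above can be sketched as follows, with a mock connection standing in for the real SQLite client: rows are inserted in fixed-size batches with a commit after each one, so uncommitted pages don't pile up in the WAL. This is a sketch of the technique only; a real implementation would use the database client's transaction API.

```rust
// Sketch: periodic commits during a bulk copy, instead of one giant
// transaction. `MockConn` is a stand-in for a real SQLite connection.
struct MockConn {
    commits: usize,
    rows: usize,
}

impl MockConn {
    fn insert(&mut self, _row: &str) {
        self.rows += 1;
    }
    fn commit(&mut self) {
        self.commits += 1;
    }
}

fn copy_rows(conn: &mut MockConn, rows: &[&str], batch_size: usize) {
    for chunk in rows.chunks(batch_size) {
        for row in chunk {
            conn.insert(row);
        }
        // Bounded WAL growth: at most `batch_size` uncommitted rows.
        conn.commit();
    }
}

fn main() {
    let rows: Vec<String> = (0..10).map(|i| format!("row{i}")).collect();
    let refs: Vec<&str> = rows.iter().map(|s| s.as_str()).collect();
    let mut conn = MockConn { commits: 0, rows: 0 };
    copy_rows(&mut conn, &refs, 3);
    assert_eq!(conn.rows, 10);
    assert_eq!(conn.commits, 4); // ceil(10 / 3) batches
    println!("committed {} times", conn.commits);
}
```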
Done. Really just excludes
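A `--no-self-profile` style exclusion like the one discussed might look like the sketch below. The table list is hypothetical except for `self_profile_query`, which the thread mentions; the exact set of tables the real flag excludes isn't shown here.

```rust
// Sketch: filter the table list for export, dropping only the large
// self-profile data tables. Table names other than `self_profile_query`
// are placeholders for this example.
fn tables_to_export(all_tables: &[&str], no_self_profile: bool) -> Vec<String> {
    all_tables
        .iter()
        .filter(|t| !(no_self_profile && t.starts_with("self_profile")))
        .map(|t| t.to_string())
        .collect()
}

fn main() {
    let all = ["artifact", "benchmark", "self_profile_query", "error"];
    let kept = tables_to_export(&all, true);
    assert_eq!(kept, vec!["artifact", "benchmark", "error"]);
    println!("{kept:?}");
}
```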
Could be. I have very little DB knowledge, much less SQLite-specific knowledge. I've seen it commonly recommended to delete constraints when doing mass DB conversions like this, then add them back at the end, to improve performance. But that would be somewhat painful to do I think, at least in a maintainable way, what with the migrations stuff. The
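One cheaper variant of "delete constraints, then add them back" that avoids touching the schema or the migrations: SQLite can disable foreign-key enforcement for the importing connection and re-check everything at the end. The pragmas below are real SQLite features; how they'd be wired into the conversion scripts is only a sketch.

```rust
// Sketch: skip per-row foreign-key checks during a bulk import, then
// re-enable enforcement and run a one-shot integrity check at the end.
const BEFORE_IMPORT: &str = "PRAGMA foreign_keys = OFF";
const AFTER_IMPORT: &[&str] = &[
    "PRAGMA foreign_keys = ON",
    // Reports any rows that violate foreign-key constraints.
    "PRAGMA foreign_key_check",
];

fn main() {
    println!("{BEFORE_IMPORT};");
    println!("-- bulk import happens here --");
    for stmt in AFTER_IMPORT {
        println!("{stmt};");
    }
}
```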
Yeah, that would be nice! I think I'm probably done trying to speed up the exports for the time being, but it would definitely be helpful to have for testing in general. As far as the size difference, from a quick search, it looks like it's not too uncommon for them to differ by quite a bit in one direction or the other. |
Sounds good. I'll upload the file later then (not replacing the existing one though, given the size). Will take a look at the commits here in a bit and hopefully leave either a final review or merge :) |
https://perf-data.rust-lang.org/ptosql.db.zst is the file. |
Generally looks good. Left one comment on the added migration that seems good to resolve (or drop the migration for now, which may be way easier).
I think we have some basic database checks right now in CI -- mostly just "can run migrations". I wonder if you'd be up for running the migrations in both directions as well -- even on empty data sets it'll confirm we don't have any malformed queries. That won't check that new tables aren't forgotten, but at least any adjustments to old ones should get picked up (I guess not new columns...).
You mean adding something to
SQLite allows primary keys to be null. For consistency with Postgres and sanity, add a NOT NULL constraint to the benchmark table's primary key. Also, add a facility to support a greater variety of migrations.
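SQLite's `ALTER TABLE` cannot add a `NOT NULL` constraint to an existing column, so a migration like this one typically rebuilds the table: create a corrected copy, move the rows across, then swap the names. The sketch below shows the pattern; the column list is a placeholder, not the benchmark table's real schema.

```rust
// Sketch of the table-rebuild migration pattern SQLite requires for
// adding a NOT NULL constraint. Column names are placeholders.
const MIGRATION: &[&str] = &[
    "CREATE TABLE benchmark_new (name TEXT PRIMARY KEY NOT NULL)",
    "INSERT INTO benchmark_new SELECT name FROM benchmark",
    "DROP TABLE benchmark",
    "ALTER TABLE benchmark_new RENAME TO benchmark",
];

fn main() {
    for stmt in MIGRATION {
        println!("{stmt};");
    }
}
```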
The feature was already used by another crate in the workspace, so the database crate would build with `cargo build`, but not with `cargo build -p database`. This fixes that.
This should be fairly safe to do. Although the side-effects of each step will be visible to subsequent steps, for the checks we're running, it's unlikely to matter. Saves a little build time.
Force-pushed from 95a9290 to b846359
Added two commits to update CI checks. I combined the database checks into one job. Let me know if that seems like a bad idea. Updated the first commit with some new migration code. Recommend viewing that diff with whitespace changes hidden. Verified that it bails when a migration violates a foreign key constraint (did this by having a migration delete a specific row that another table referenced). The panic message looks like this:
OK, I think this seems good to merge. I did have a thought just now (maybe a quick followup PR, though feel free to add to this one): we could use the export.db.sz on perf-data.rust-lang.org (the smaller export) to more thoroughly exercise import/export than with empty DBs. It'd also probably give us a more thorough test of the site loading code, fwiw. For now that database is somewhat outdated (a couple months old at this point), but I imagine eventually we'll set up some kind of regular export.
Sure, I'll add that to this PR.

OK, happy to merge once that's done :)

Investigating the CI failure.
Force-pushed from 47750c8 to 8047729
Force-pushed from 8047729 to 7566cda
CI issue was that I was diffing the before and after round-trip SQLite dumps, but the before dump didn't have the latest migrations applied. Fixed that. The dumps can also differ by whitespace, so accounted for that, too. Noticed that postgres-to-sqlite is ~7x slower in CI than on my 2015 laptop running Postgres in a container. May want to do the multiple-rows-per-insert approach at some point to help with that. But not now. :)
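The multiple-rows-per-insert idea mentioned above amounts to building one statement with many placeholder groups, so each round trip to the database carries a batch of rows. A minimal sketch (table and column names are illustrative):

```rust
// Sketch: build a single INSERT carrying `rows` placeholder groups,
// e.g. "INSERT INTO t (a, b) VALUES (?, ?), (?, ?), ...".
fn multi_row_insert(table: &str, columns: &[&str], rows: usize) -> String {
    let group = format!("({})", vec!["?"; columns.len()].join(", "));
    let values = vec![group; rows].join(", ");
    format!("INSERT INTO {} ({}) VALUES {}", table, columns.join(", "), values)
}

fn main() {
    let sql = multi_row_insert("pull_request_build", &["bors_sha", "pr"], 3);
    assert_eq!(
        sql,
        "INSERT INTO pull_request_build (bors_sha, pr) VALUES (?, ?), (?, ?), (?, ?)"
    );
    println!("{sql}");
}
```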
Sounds great! I'll go ahead and wait for all the CI checks to pass and then merge this (plausibly tomorrow). |
We already have some scripts that sort of partially convert between SQLite and Postgres DBs.
This change adds scripts that fully convert a SQLite DB to a Postgres DB and vice versa,
largely intended to make it easier to test changes on both supported DB types. It also sneaks
in some changes to make the schemas of the two DB types more consistent with each other.