Skip to content

[Incremental] Catch conversion failure in Fingerprint #35405

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jan 16, 2021

Conversation

davidungar
Copy link
Contributor

@davidungar davidungar commented Jan 13, 2021

When creating a Fingerprint from a string, if the string is not valid hex, information is lost, creating the potential for miscompilation. However, I believe this situation only happens in some tests, for example, Driver/Dependencies/mutual-interface-hash-fine.swift. This change checks for that failure, but it cannot be merged as-is because some tests will (rightly) fail. Before the advent of Fingerprint the swift-dependency-tool would write out the whole string, and thus the tests worked correctly.

To see the failure, try converting the .swift files from before this PR in Driver/Dependencies/Inputs/mutual-interface-hash-fine from yaml to binary and back.

@davidungar davidungar requested a review from CodaFi January 13, 2021 05:23
Copy link
Contributor

@CodaFi CodaFi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just convert the tests - it's not hard to teach Slava's tool to do this.

Fingerprints created by the frontend will otherwise round-trip losslessly, so you are correct that only the tests are wonky here.

@davidungar
Copy link
Contributor Author

We should just convert the tests - it's not hard to teach Slava's tool to do this.

Fingerprints created by the frontend will otherwise round-trip losslessly, so you are correct that only the tests are wonky here.

I see two issues: one meriting a fix and the other, an understandable oversight meriting discussion:
The definite problem is that there was no check in the fromString: input. That code could not deal with a non-hex number, but it did not check. You did put in an assertion which is great, but it turned out that that assertion was in the wrong place to check the input. I'll be refining the PR to fix that.

What merits discussion is that Fingerprint chooses limit the quantity of information to 128 bits, instead of an unlimited string. I understand how the present state of the compiler would inspire that limitation. But that limitation is inconsistent with the present state of the tests. I expect that a fixed, hard limit on the size of a fingerprint is good for compiler performance, but might there be a way to get the best of both worlds?

@CodaFi
Copy link
Contributor

CodaFi commented Jan 13, 2021

What merits discussion is that Fingerprint chooses limit the quantity of information to 128 bits

Increasing the number of bits in a hash does not correlate at all with an increase in desirable properties. For example, SHA-1 is a 256-bit hash, but it is deeply insecure and weaker than (the ironically named for this discussion) SHA-256, and less collision resistant than SipHash.

@davidungar
Copy link
Contributor Author

What merits discussion is that Fingerprint chooses limit the quantity of information to 128 bits

Increasing the number of bits in a hash does not correlate at all with an increase in desirable properties. For example, SHA-1 is a 256-bit hash, but it is deeply insecure and weaker than (the ironically named for this discussion) SHA-256, and less collision resistant than SipHash.

While there exist larger fingerprints with worse performance--a fair point!--that does not rule out the potential need for larger fingerprints in order to get better performance. From an information theoretic standpoint, a fingerprint with more bits can potentially fulfill its purpose better. By placing a fixed, hard limit on the size of a fingerprint, the existing code will make it harder to expand the fingerprint in the future, should we deem it necessary to do so.

That necessity cannot be ruled out because we don't have much data for our code-base regarding the efficacy of the fingerprints, that is, how often a fingerprint fails to catch a mutation in the code. I have heard that other projects have some experience here, but have not investigated. Thus, it seems conceivable that we might enlarge fingerprints someday. The Fingerprint abstraction will help do that if/when we do.

I am OK with leaving length fixed for now; however, we'll have to fix the tests in a manner which will make them slightly harder to debug in the future.

@CodaFi
Copy link
Contributor

CodaFi commented Jan 13, 2021

By placing a fixed, hard limit on the size of a fingerprint, the existing code will make it harder to expand the fingerprint in the future, should we deem it necessary to do so.

I don't understand this objection. We have the ability to change the version and format of swiftdeps at any time. If e.g. a larger bitwidth is needed, we have the headroom to make such a change.

@davidungar
Copy link
Contributor Author

By placing a fixed, hard limit on the size of a fingerprint, the existing code will make it harder to expand the fingerprint in the future, should we deem it necessary to do so.

I don't understand this objection. We have the ability to change the version and format of swiftdeps at any time. If e.g. a larger bitwidth is needed, we have the headroom to make such a change.

Yes we do, although it would be harder than when the fingerprint was a string.
I objected because I didn't understand the relevance of:

Increasing the number of bits in a hash does not correlate at all with an increase in desirable properties. For example, SHA-1 is a 256-bit hash, but it is deeply insecure and weaker than (the ironically named for this discussion) SHA-256, and less collision resistant than SipHash.

The fact that a poorer yet larger hash exists struck me as irrelevant to the merits of modifying the code to accommodate varying-length fingerprints today.

I also failed to understand the "does not correlate at all" statement, given my understanding of Shannon information theory.

Anyway, I'll look into fixing up the tests and adding those fixes to this PR.

@CodaFi
Copy link
Contributor

CodaFi commented Jan 13, 2021

Information theory is simply one aspect. You're considering the data format of the hash, instead of the hash algorithm itself and its result. If you have such a variable-length algorithm in mind, I invite you to propose it and implement it and we'll give it a shootout with the compiler performance suite.

Note that a variable-length format is not, in itself, a good thing. One could imagine a terrible hash function on variable-length outputs where the combiner simply appends a 0 or 1 to a bitstring for each quantum of data. You'd collide quite frequently, but you'd have all the entropy you could cram into the thing at your disposal.

@davidungar
Copy link
Contributor Author

Information theory is simply one aspect. You're considering the data format of the hash, instead of the hash algorithm itself and its result. If you have such a variable-length algorithm in mind, I invite you to propose it and implement it and we'll give it a shootout with the compiler performance suite.

Note that a variable-length format is not, in itself, a good thing. One could imagine a terrible hash function on variable-length outputs where the combiner simply appends a 0 or 1 to a bitstring for each quantum of data. You'd collide quite frequently, but you'd have all the entropy you could cram into the thing at your disposal.

Absolutely right! But doesn't information theory imply that although no fingerprint will be perfect, a longer fingerprint has the potential to be better than a shorter one, as long as the fingerprint is still shorter than the code its hashing.

Given that you've chosen an excellent hash, if someday it seems to be not quite good enough, might it not be the case that we need to enlarge it?

@davidungar
Copy link
Contributor Author

@swift-ci please smoke test

@davidungar davidungar changed the title [DNM; Incremental] Catch conversion failure in Fingerprint [Incremental] Catch conversion failure in Fingerprint Jan 14, 2021
@davidungar
Copy link
Contributor Author

Fingerprint::fromString now always checks for information loss, returning an optional. I believe this check is worthwhile because:

  1. The data being checked are read in from a file, either in the driver or when testing. It's good to check for file corruption, and
  2. Given all the other work in reading the file, I don't think we'll see a noticeable performance regression in the driver.

Copy link
Contributor

@CodaFi CodaFi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat. :shipit:

@CodaFi
Copy link
Contributor

CodaFi commented Jan 15, 2021

@swift-ci test

@davidungar
Copy link
Contributor Author

Thank you, @CodaFi

@swift-ci
Copy link
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 67028b9f4f76e85e4a29a039398b52746c1c9cda

@swift-ci
Copy link
Contributor

Build failed
Swift Test OS X Platform
Git Sha - 67028b9f4f76e85e4a29a039398b52746c1c9cda

@davidungar
Copy link
Contributor Author

@swift-ci please smoke test

@davidungar
Copy link
Contributor Author

@swift-ci please smoke test

@davidungar
Copy link
Contributor Author

@swift-ci please smoke test

@davidungar
Copy link
Contributor Author

@swift-ci please smoke test

@davidungar davidungar merged commit 0c99379 into swiftlang:main Jan 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants