-
Notifications
You must be signed in to change notification settings - Fork 931
Dedup imports_granularity = "Item"
#4737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dedup imports_granularity = "Item"
#4737
Conversation
Not sure I understand the comment. Two
|
To reiterate what I mentioned in #4725:
When Also note that it's possible for The contents of this proposed test file is actually a good picture of why the proposed solution does not truly address the problem either, despite the inefficiency. If I were going to work on this then I think I might try reaching for utilizing a HashSet (although think that would require a custom Hash impl) or try to reuse some of the existing code that performs the dedupe with the other granularity variants |
I believe I now understand my misunderstanding. Just to make sure - in the following input code, the duplicate // rustfmt-imports_granularity: Item
// rustfmt-reorder_imports: false
// rustfmt-group_imports: StdExternalCrate
use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer; On the other hand, with |
Yup, you've got it!
Correct, provided |
07d8cca
to
57c2dd8
Compare
Changed. Hopefully it is o.k. now. Ended up using |
That's a good idea 👍
Yup I'm aware, but that doesn't mean that one could not use a hashset as part of the process to accept a vec of use trees and produce a deduped and flattend one. I'm not familiar with the itertools internals, but I'd wager a good amount that they are internally leveraging something like a hashset or hashmap for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I like the itertools idea, should minimize refactoring and avoid re-inventing any wheels. Think we can make this a little cleaner though, few comments/requests inline below
src/formatting/imports.rs
Outdated
@@ -197,6 +201,7 @@ pub(crate) fn flatten_use_trees(use_trees: Vec<UseTree>) -> Vec<UseTree> { | |||
} | |||
tree | |||
}) | |||
.unique_by(|ut| format!("{}", ut)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would much rather implement Hash
on UseTree
so that we can just use unique()
directly instead of taking the hit on calling format
.
We're really just dealing with the path so I think this would suffice (along with the default Hash derive on UseSegment):
impl Hash for UseTree {
fn hash<H: Hasher>(&self, state: &mut H) {
self.path.hash(state);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before I try to implement this, I would like to make sure fully understand the implementation. My main question is what is default Hash derive on UseSegment
? How and were it is implemented?
I am asking because my first approach was to implement Hasher
for both UseTree
and UseSegment
. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree
format
and realized that using the format
is about the same from performance point of view.
One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works. Taking as an example the hash function you suggested above. First, it doesn't have a return value, so where is the hash value stored and how it is used? Second, through the iterations I assume that several numeric hash valuses are calculated for the non-List UseSegments
. If this is the case, it is not clear to me if and how these hash values are "combined" to form the hash of the initial UseTree
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main question is what is default Hash derive on UseSegment? How and were it is implemented?
I was referring to the Hash trait, and that we should able to leverage the derive attribute for UseSegment instead of having to manually implement the Hash trait ourselves. I'm not sure we'll be able to leverage derive
on UseTree because that only works when all the fields themselves have a Hash impl, and I know that not all of the AST nodes do (feel free to check/test though!). Once there's a Hash impl for both UseTree and UseSegment then you can use unique()
directly, instead of formatting to a temporary String.
I am asking because my first approach was to implement Hasher for both UseTree and UseSegment. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree format and realized that using the format is about the same from performance point of view.
I don't think there will be a need to implement the Hasher trait on any of our custom types to resolve this issue. My assumption here was that the itertools functions we were hoping to leverage utilized some hash-based data structure for the deduplication feature we needed, and that in turn would require that the items (UseTree in our case) implemented the Hash trait (side note, looks like they do leverage a hash set internally 😉).
This is indeed the case, as highlighted in the the itertools docs for the unique/unique_by functions. If you were to try to just use unique()
on the current code, you'd presumably get errors complaining about unbound constraints, because our UseTree does not have a Hash implementation.
Your initial proposal works because you're allocating an intermediate String representation for each UseTree (which we have to do again later for the actual emitting of the formatted code) and that satisfies the itertools constraints because String does have a Hash impl. We don't want to do this though because it is unnecessary, and we should simply ensure that we have a Hash impl.
One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works
Sorry, but that's a little too off topic for a code review within rustfmt 😄 Hopefully the above info helps somewhat, but otherwise I'd refer you to materials like the Book, the stdlib docs (https://doc.rust-lang.org/std/index.html), or even the rustc source if you really want to drill into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed response! The reference to stdlib also help as I realized I missed the finish
method that emits the hash calculation result.
I would much rather implement Hash on UseTree so that we can just use unique() directly instead of taking the hit on calling format.
I have now change unique_by
to unique
using your suggested code.
@@ -0,0 +1,6 @@ | |||
use crate::lexer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some test files that have multiple groups, and perhaps some that have comments too? Maybe something like this:
use crate::lexer::{tokens::TokenData /* foo */ };
use crate::lexer::{tokens::TokenData};
It might also be useful to have some duplicates but with varying visibility levels (we may not handle this well with the other variants either FWIW, but would be good to check while we're working on this one)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add some test files that have multiple groups, and perhaps some that have comments too?
I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.
For example, the for the following input:
use crate::foo;
use crate::foo::self;
use crate::foo::self /* foo 1 */;
use crate::foo::{self /* foo 2 */ };
use crate::foo::{tokens::TokenData};
use crate::foo::{tokens::TokenData /* foo 3 */ };
use crate::foo::{self, tokens::TokenData /* foo 4 */};
use crate::bar::self /* bar 1 */;
use crate::bar::self;
use crate::foobar::self; /* foobar 1 */
use crate::foobar::self;
"Crate" granularity produce the following output:
use crate::bar; /* bar 1 */;
use crate::foo; /* foo 1 */;
use crate::foobar; /* foobar 1 */
use crate::{
bar, foo,
foo::{self /* foo 2 */, tokens::TokenData},
foobar,
};
Comments foo 3
and foo 4
disappeared and there is an extra ;
after comments bar 1
and foo 1
.
"Module" granularity produce the following output:
use crate::bar; /* bar 1 */;
use crate::foo; /* foo 1 */;
use crate::foo::tokens::TokenData;
use crate::foo::{self /* foo 2 */, self};
use crate::foobar; /* foobar 1 */
use crate::{bar, foo, foobar};
In addition to the same issues as for "Crate", foo::{...}
includes self
twice: one with foo 2
comment and one without a comment.
"Item" granularity produce the following output:
use crate::bar; /* bar 1 */;
use crate::foo;
use crate::foo::tokens::TokenData;
use crate::foo::{self /* foo 2 */};
use crate::foobar; /* foobar 1 */
Comment foo 1
is missing because in the original code crate::foo::self
is before crate::foo::self /* foo 1 */
. The bar 1
comment does appear, since crate::bar::self
is before crate::foobar::self; /* foobar 1 */
. As above, foo 3
and foo 4
comments disappeared and there is an extra ;
after comments bar 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as there are some issues, especially related to comments
wow there sure is 🙃
We've always had comment issues with this option (which was named merge_imports
prior to a recent rename/restructure) so I'm not too surprised by some of those, but the emission of invalid code in this type of case is particularly concerning:
use rustc_ast::ast /* abc */;
I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.
Good call 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Submitted PR #4759 to fix the extra semicolon issue.
I will also try to resolve the issue where in some cases if the same import module appears twice - once with a comment and once without a comment, it appears twice in the formatted output. E.g. in the above output for Crate
, 'create::barand
crate:::foo` appear twice.
Update:
The reason some import items appear twice for Crate
and Module
granularity, is because merge_use_trees()
don't try to merge items with comments and just push them to the output:
rustfmt/src/formatting/imports.rs
Lines 170 to 172 in a9876e8
if use_tree.has_comment() || use_tree.attrs.is_some() { | |
result.push(use_tree); | |
continue; |
This is probably a patch done because imports comments are not handled well, but it ignores duplicates.
There is a relatively simple way to remove these duplicates by ignoring all non-commented flattened items that have a duplicate with comment. However, because duplicates with and without comments are not expected to be common in practice, and because no issue was opened by users against this problem, it seems that it is not worth trying to add this enhancement.
I believe this backport is predicated on #3795 being backported first. At least the only portion of the tests that are failing have to do with the import order for raw identifiers. We could backport this and update the tests to follow the 1.x sorting behavior. @calebcartwright Thoughts? |
That works for me @ytmimi, or at least we could have multiple versions of these tests that add the corresponding Version={One/Two} to account for how the ordering is applied with those raw identifiers |
* Fix for issue 4725 - dedup Item imports_granularity (2nd version) * Use unique() instead of unique_by()
Suggested fix for issue #4725.
The PR includes also creation of
tests/imports
folder and moving all theimport*.rs
test cases to this folder.Note that the
issue-4725-imports_granularity_default.rs
test file is a copy of a test file from #4716 (for Module granularity). If both PRs are merged then only one copy of the file is required, possibly renaming it toimports_granularity_default.rs
. Also in this case the #4716 tests files should be moved under theimports
folder. (If #4716 is merged first I can rebase this PR and do these changes here.)