Skip to content

Dedup imports_granularity = "Item" #4737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

davidBar-On
Copy link
Contributor

Suggested fix for issue #4725.

The PR includes also creation of tests/imports folder and moving all the import*.rs test cases to this folder.

Note that the issue-4725-imports_granularity_default.rs test file is a copy of a test file from #4716 (for Module granularity). If both PRs are merged then only one copy of the file is required, possibly renaming it to imports_granularity_default.rs. Also in this case the #4716 tests files should be moved under the imports folder. (If #4716 is merged first I can rebase this PR and do these changes here.)

@calebcartwright
Copy link
Member

Note that Vec's dedup only removes consecutive duplicates. As mentioned #4725 the collection of UseTrees hasn't been sorted at this point, and there's no guarantee that we won't have non-consecutive duplicates.

The resolution for #4725 will require a different approach.

@davidBar-On
Copy link
Contributor Author

Note that Vec's dedup only removes consecutive duplicates. As mentioned #4725 the collection of UseTrees hasn't been sorted at this point, and there's no guarantee that we won't have non-consecutive duplicates.

Not sure I understand the comment. Two dedup were added:

  1. dedup in flatten_use_trees which I believe this what you were referring to. This is for the case when reorder_imports is not set, as you mentioned in Deduplication doesn't occur with `imports_granularity = "Item" #4725, and removes only cosecutive duplicates in the original code. This case is tested by issue-4725-imports_granularity_item-StdExternalCrate-no-reorder.rs.

  2. dedup in rewrite_reorderable_or_regroupable_items that removes duplicates after the sort. This case is tested by issue-4725-imports_granularity_item.rs.

@calebcartwright
Copy link
Member

To reiterate what I mentioned in #4725:

We need to dedupe regardless of the value of reorder_imports

When imports_granularity is set to Item (or any other variant, but the other variants don't have this issue) we need to remove duplicates. Also, creating a flattened data structure with duplicates, then running dedupe against the unsorted collection to only capture consecutive duplicates, then (potentially) sorting and running dedupe would be unacceptably inefficient.

Also note that it's possible for group_imports and imports_granularity to both be set, with reorder_imports being false, but in even in this deduping is still necessary and we're still dealing with an unsorted collection.

The contents of this proposed test file is actually a good picture of why the proposed solution does not truly address the problem either, despite the inefficiency.

If I were going to work on this then I think I might try reaching for utilizing a HashSet (although think that would require a custom Hash impl) or try to reuse some of the existing code that performs the dedupe with the other granularity variants

@davidBar-On
Copy link
Contributor Author

The contents of this proposed test file is actually a good picture of why the proposed solution does not truly address the problem either, despite the inefficiency.

I believe I now understand my misunderstanding. Just to make sure - in the following input code, the duplicate use crate::lexer should be removed, although the imports are not sorted. Is this correct?

// rustfmt-imports_granularity: Item
// rustfmt-reorder_imports: false
// rustfmt-group_imports: StdExternalCrate
use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer;

On the other hand, with rustfmt-group_imports: Preserve (instead of StdExternalCrate) the duplicate should not be removed and the output should be the same as the input. Is this correct?

@calebcartwright
Copy link
Member

Just to make sure - in the following input code, the duplicate use crate::lexer should be removed, although the imports are not sorted. Is this correct

Yup, you've got it! imports_granularity requires that either reorder_imports is true and/or that group_imports is something other than the default Preserve, otherwise overriding the default value of imports_granularity is a no-op.

On the other hand, with rustfmt-group_imports: Preserve (instead of StdExternalCrate) the duplicate should not be removed and the output should be the same as the input. Is this correct?

Correct, provided reorder_imports is still false

@davidBar-On davidBar-On force-pushed the issue-4725-dedup-Item-imports_granularity branch from 07d8cca to 57c2dd8 Compare March 10, 2021 07:16
@davidBar-On
Copy link
Contributor Author

Changed. Hopefully it is o.k. now. Ended up using Itertools unique_by() with the output of fmt::Display for UseTree as a key (HashSet does not keep the original order).

@calebcartwright
Copy link
Member

Ended up using Itertools

That's a good idea 👍

(HashSet does not keep the original order)

Yup I'm aware, but that doesn't mean that one could not use a hashset as part of the process to accept a vec of use trees and produce a deduped and flattend one. I'm not familiar with the itertools internals, but I'd wager a good amount that they are internally leveraging something like a hashset or hashmap for this.

Copy link
Member

@calebcartwright calebcartwright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I like the itertools idea, should minimize refactoring and avoid re-inventing any wheels. Think we can make this a little cleaner though, few comments/requests inline below

@@ -197,6 +201,7 @@ pub(crate) fn flatten_use_trees(use_trees: Vec<UseTree>) -> Vec<UseTree> {
}
tree
})
.unique_by(|ut| format!("{}", ut))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather implement Hash on UseTree so that we can just use unique() directly instead of taking the hit on calling format.

We're really just dealing with the path so I think this would suffice (along with the default Hash derive on UseSegment):

impl Hash for UseTree {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.path.hash(state);
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I try to implement this, I would like to make sure fully understand the implementation. My main question is what is default Hash derive on UseSegment? How and were it is implemented?

I am asking because my first approach was to implement Hasher for both UseTree and UseSegment. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree format and realized that using the format is about the same from performance point of view.

One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works. Taking as an example the hash function you suggested above. First, it doesn't have a return value, so where is the hash value stored and how it is used? Second, through the iterations I assume that several numeric hash valuses are calculated for the non-List UseSegments. If this is the case, it is not clear to me if and how these hash values are "combined" to form the hash of the initial UseTree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main question is what is default Hash derive on UseSegment? How and were it is implemented?

I was referring to the Hash trait, and that we should able to leverage the derive attribute for UseSegment instead of having to manually implement the Hash trait ourselves. I'm not sure we'll be able to leverage derive on UseTree because that only works when all the fields themselves have a Hash impl, and I know that not all of the AST nodes do (feel free to check/test though!). Once there's a Hash impl for both UseTree and UseSegment then you can use unique() directly, instead of formatting to a temporary String.

I am asking because my first approach was to implement Hasher for both UseTree and UseSegment. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree format and realized that using the format is about the same from performance point of view.

I don't think there will be a need to implement the Hasher trait on any of our custom types to resolve this issue. My assumption here was that the itertools functions we were hoping to leverage utilized some hash-based data structure for the deduplication feature we needed, and that in turn would require that the items (UseTree in our case) implemented the Hash trait (side note, looks like they do leverage a hash set internally 😉).

This is indeed the case, as highlighted in the the itertools docs for the unique/unique_by functions. If you were to try to just use unique() on the current code, you'd presumably get errors complaining about unbound constraints, because our UseTree does not have a Hash implementation.

Your initial proposal works because you're allocating an intermediate String representation for each UseTree (which we have to do again later for the actual emitting of the formatted code) and that satisfies the itertools constraints because String does have a Hash impl. We don't want to do this though because it is unnecessary, and we should simply ensure that we have a Hash impl.

One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works

Sorry, but that's a little too off topic for a code review within rustfmt 😄 Hopefully the above info helps somewhat, but otherwise I'd refer you to materials like the Book, the stdlib docs (https://doc.rust-lang.org/std/index.html), or even the rustc source if you really want to drill into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed response! The reference to stdlib also help as I realized I missed the finish method that emits the hash calculation result.

I would much rather implement Hash on UseTree so that we can just use unique() directly instead of taking the hit on calling format.

I have now change unique_by to unique using your suggested code.

@@ -0,0 +1,6 @@
use crate::lexer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some test files that have multiple groups, and perhaps some that have comments too? Maybe something like this:

use crate::lexer::{tokens::TokenData /* foo */ };
use crate::lexer::{tokens::TokenData};

It might also be useful to have some duplicates but with varying visibility levels (we may not handle this well with the other variants either FWIW, but would be good to check while we're working on this one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some test files that have multiple groups, and perhaps some that have comments too?

I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.

For example, the for the following input:

use crate::foo;
use crate::foo::self;
use crate::foo::self /* foo 1 */;
use crate::foo::{self /* foo 2 */ };
use crate::foo::{tokens::TokenData};
use crate::foo::{tokens::TokenData /* foo 3 */ };
use crate::foo::{self, tokens::TokenData /* foo 4 */};
use crate::bar::self /* bar 1 */;
use crate::bar::self;
use crate::foobar::self; /* foobar 1 */
use crate::foobar::self;

"Crate" granularity produce the following output:

use crate::bar; /* bar 1 */;
use crate::foo; /* foo 1 */;
use crate::foobar; /* foobar 1 */
use crate::{
    bar, foo,
    foo::{self /* foo 2 */, tokens::TokenData},
    foobar,
};

Comments foo 3 and foo 4 disappeared and there is an extra ; after comments bar 1 and foo 1.

"Module" granularity produce the following output:

use crate::bar; /* bar 1 */;
use crate::foo; /* foo 1 */;
use crate::foo::tokens::TokenData;
use crate::foo::{self /* foo 2 */, self};
use crate::foobar; /* foobar 1 */
use crate::{bar, foo, foobar};

In addition to the same issues as for "Crate", foo::{...} includes self twice: one with foo 2 comment and one without a comment.

"Item" granularity produce the following output:

use crate::bar; /* bar 1 */;
use crate::foo;
use crate::foo::tokens::TokenData;
use crate::foo::{self /* foo 2 */};
use crate::foobar; /* foobar 1 */

Comment foo 1 is missing because in the original code crate::foo::self is before crate::foo::self /* foo 1 */. The bar 1 comment does appear, since crate::bar::self is before crate::foobar::self; /* foobar 1 */. As above, foo 3 and foo 4 comments disappeared and there is an extra ; after comments bar 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as there are some issues, especially related to comments

wow there sure is 🙃

We've always had comment issues with this option (which was named merge_imports prior to a recent rename/restructure) so I'm not too surprised by some of those, but the emission of invalid code in this type of case is particularly concerning:

use rustc_ast::ast /* abc */;

I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.

Good call 👍

Copy link
Contributor Author

@davidBar-On davidBar-On Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitted PR #4759 to fix the extra semicolon issue.

I will also try to resolve the issue where in some cases if the same import module appears twice - once with a comment and once without a comment, it appears twice in the formatted output. E.g. in the above output for Crate, 'create::barandcrate:::foo` appear twice.

Update:
The reason some import items appear twice for Crate and Module granularity, is because merge_use_trees() don't try to merge items with comments and just push them to the output:

if use_tree.has_comment() || use_tree.attrs.is_some() {
result.push(use_tree);
continue;

This is probably a patch done because imports comments are not handled well, but it ignores duplicates.

There is a relatively simple way to remove these duplicates by ignoring all non-commented flattened items that have a duplicate with comment. However, because duplicates with and without comments are not expected to be common in practice, and because no issue was opened by users against this problem, it seems that it is not worth trying to add this enhancement.

@ytmimi
Copy link
Contributor

ytmimi commented Apr 3, 2022

I believe this backport is predicated on #3795 being backported first. At least the only portion of the tests that are failing have to do with the import order for raw identifiers. We could backport this and update the tests to follow the 1.x sorting behavior. @calebcartwright Thoughts?

@calebcartwright
Copy link
Member

We could backport this and update the tests to follow the 1.x sorting behavior. @calebcartwright Thoughts?

That works for me @ytmimi, or at least we could have multiple versions of these tests that add the corresponding Version={One/Two} to account for how the ordering is applied with those raw identifiers

davidlattimore pushed a commit to davidlattimore/rustfmt that referenced this pull request Jun 11, 2022
* Fix for issue 4725 - dedup Item imports_granularity (2nd version)

* Use unique() instead of unique_by()
calebcartwright pushed a commit that referenced this pull request Jun 12, 2022
* Fix for issue 4725 - dedup Item imports_granularity (2nd version)

* Use unique() instead of unique_by()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants