Dedup `imports_granularity = "Item"` #4737

davidBar-On · 2021-03-04T16:27:30Z

Suggested fix for issue #4725.

The PR includes also creation of tests/imports folder and moving all the import*.rs test cases to this folder.

Note that the issue-4725-imports_granularity_default.rs test file is a copy of a test file from #4716 (for Module granularity). If both PRs are merged then only one copy of the file is required, possibly renaming it to imports_granularity_default.rs. Also in this case the #4716 tests files should be moved under the imports folder. (If #4716 is merged first I can rebase this PR and do these changes here.)

calebcartwright · 2021-03-05T04:32:21Z

Note that Vec's dedup only removes consecutive duplicates. As mentioned #4725 the collection of UseTrees hasn't been sorted at this point, and there's no guarantee that we won't have non-consecutive duplicates.

The resolution for #4725 will require a different approach.

davidBar-On · 2021-03-05T07:59:40Z

Note that Vec's dedup only removes consecutive duplicates. As mentioned #4725 the collection of UseTrees hasn't been sorted at this point, and there's no guarantee that we won't have non-consecutive duplicates.

Not sure I understand the comment. Two dedup were added:

dedup in flatten_use_trees which I believe this what you were referring to. This is for the case when reorder_imports is not set, as you mentioned in Deduplication doesn't occur with `imports_granularity = "Item" #4725, and removes only cosecutive duplicates in the original code. This case is tested by issue-4725-imports_granularity_item-StdExternalCrate-no-reorder.rs.
dedup in rewrite_reorderable_or_regroupable_items that removes duplicates after the sort. This case is tested by issue-4725-imports_granularity_item.rs.

calebcartwright · 2021-03-07T22:40:29Z

To reiterate what I mentioned in #4725:

We need to dedupe regardless of the value of reorder_imports

When imports_granularity is set to Item (or any other variant, but the other variants don't have this issue) we need to remove duplicates. Also, creating a flattened data structure with duplicates, then running dedupe against the unsorted collection to only capture consecutive duplicates, then (potentially) sorting and running dedupe would be unacceptably inefficient.

Also note that it's possible for group_imports and imports_granularity to both be set, with reorder_imports being false, but in even in this deduping is still necessary and we're still dealing with an unsorted collection.

The contents of this proposed test file is actually a good picture of why the proposed solution does not truly address the problem either, despite the inefficiency.

If I were going to work on this then I think I might try reaching for utilizing a HashSet (although think that would require a custom Hash impl) or try to reuse some of the existing code that performs the dedupe with the other granularity variants

davidBar-On · 2021-03-08T11:42:55Z

The contents of this proposed test file is actually a good picture of why the proposed solution does not truly address the problem either, despite the inefficiency.

I believe I now understand my misunderstanding. Just to make sure - in the following input code, the duplicate use crate::lexer should be removed, although the imports are not sorted. Is this correct?

// rustfmt-imports_granularity: Item
// rustfmt-reorder_imports: false
// rustfmt-group_imports: StdExternalCrate
use crate::lexer;
use crate::lexer::tokens::TokenData;
use crate::lexer;

On the other hand, with rustfmt-group_imports: Preserve (instead of StdExternalCrate) the duplicate should not be removed and the output should be the same as the input. Is this correct?

calebcartwright · 2021-03-09T02:56:02Z

Just to make sure - in the following input code, the duplicate use crate::lexer should be removed, although the imports are not sorted. Is this correct

Yup, you've got it! imports_granularity requires that either reorder_imports is true and/or that group_imports is something other than the default Preserve, otherwise overriding the default value of imports_granularity is a no-op.

On the other hand, with rustfmt-group_imports: Preserve (instead of StdExternalCrate) the duplicate should not be removed and the output should be the same as the input. Is this correct?

Correct, provided reorder_imports is still false

davidBar-On · 2021-03-10T07:42:13Z

Changed. Hopefully it is o.k. now. Ended up using Itertools unique_by() with the output of fmt::Display for UseTree as a key (HashSet does not keep the original order).

calebcartwright · 2021-03-12T03:01:07Z

Ended up using Itertools

That's a good idea 👍

(HashSet does not keep the original order)

Yup I'm aware, but that doesn't mean that one could not use a hashset as part of the process to accept a vec of use trees and produce a deduped and flattend one. I'm not familiar with the itertools internals, but I'd wager a good amount that they are internally leveraging something like a hashset or hashmap for this.

calebcartwright

Thanks! I like the itertools idea, should minimize refactoring and avoid re-inventing any wheels. Think we can make this a little cleaner though, few comments/requests inline below

calebcartwright · 2021-03-12T03:04:38Z

src/formatting/imports.rs

@@ -197,6 +201,7 @@ pub(crate) fn flatten_use_trees(use_trees: Vec<UseTree>) -> Vec<UseTree> {
            }
            tree
        })
+        .unique_by(|ut| format!("{}", ut))


I would much rather implement Hash on UseTree so that we can just use unique() directly instead of taking the hit on calling format.

We're really just dealing with the path so I think this would suffice (along with the default Hash derive on UseSegment):

impl Hash for UseTree { fn hash<H: Hasher>(&self, state: &mut H) { self.path.hash(state); } }

Before I try to implement this, I would like to make sure fully understand the implementation. My main question is what is default Hash derive on UseSegment? How and were it is implemented?

I am asking because my first approach was to implement Hasher for both UseTree and UseSegment. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree format and realized that using the format is about the same from performance point of view.

One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works. Taking as an example the hash function you suggested above. First, it doesn't have a return value, so where is the hash value stored and how it is used? Second, through the iterations I assume that several numeric hash valuses are calculated for the non-List UseSegments. If this is the case, it is not clear to me if and how these hash values are "combined" to form the hash of the initial UseTree.

My main question is what is default Hash derive on UseSegment? How and were it is implemented?

I was referring to the Hash trait, and that we should able to leverage the derive attribute for UseSegment instead of having to manually implement the Hash trait ourselves. I'm not sure we'll be able to leverage derive on UseTree because that only works when all the fields themselves have a Hash impl, and I know that not all of the AST nodes do (feel free to check/test though!). Once there's a Hash impl for both UseTree and UseSegment then you can use unique() directly, instead of formatting to a temporary String.

I am asking because my first approach was to implement Hasher for both UseTree and UseSegment. This is by generating the list of the segments` text and then hashing the generated text. I ended up with something very similar to the UseTree format and realized that using the format is about the same from performance point of view.

I don't think there will be a need to implement the Hasher trait on any of our custom types to resolve this issue. My assumption here was that the itertools functions we were hoping to leverage utilized some hash-based data structure for the deduplication feature we needed, and that in turn would require that the items (UseTree in our case) implemented the Hash trait (side note, looks like they do leverage a hash set internally 😉).

This is indeed the case, as highlighted in the the itertools docs for the unique/unique_by functions. If you were to try to just use unique() on the current code, you'd presumably get errors complaining about unbound constraints, because our UseTree does not have a Hash implementation.

Your initial proposal works because you're allocating an intermediate String representation for each UseTree (which we have to do again later for the actual emitting of the formatted code) and that satisfies the itertools constraints because String does have a Hash impl. We don't want to do this though because it is unnecessary, and we should simply ensure that we have a Hash impl.

One more question for my information (assuming that the answer is short). I didn't fully understand how the the hash function works

Sorry, but that's a little too off topic for a code review within rustfmt 😄 Hopefully the above info helps somewhat, but otherwise I'd refer you to materials like the Book, the stdlib docs (https://doc.rust-lang.org/std/index.html), or even the rustc source if you really want to drill into it.

Thanks for the detailed response! The reference to stdlib also help as I realized I missed the finish method that emits the hash calculation result.

I would much rather implement Hash on UseTree so that we can just use unique() directly instead of taking the hit on calling format.

I have now change unique_by to unique using your suggested code.

calebcartwright · 2021-03-12T03:06:25Z

tests/source/imports/imports_granularity_default-with-dups.rs

@@ -0,0 +1,6 @@
+use crate::lexer;


Could you add some test files that have multiple groups, and perhaps some that have comments too? Maybe something like this:

use crate::lexer::{tokens::TokenData /* foo */ }; use crate::lexer::{tokens::TokenData};

It might also be useful to have some duplicates but with varying visibility levels (we may not handle this well with the other variants either FWIW, but would be good to check while we're working on this one)

Could you add some test files that have multiple groups, and perhaps some that have comments too?

I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.

For example, the for the following input:

use crate::foo; use crate::foo::self; use crate::foo::self /* foo 1 */; use crate::foo::{self /* foo 2 */ }; use crate::foo::{tokens::TokenData}; use crate::foo::{tokens::TokenData /* foo 3 */ }; use crate::foo::{self, tokens::TokenData /* foo 4 */}; use crate::bar::self /* bar 1 */; use crate::bar::self; use crate::foobar::self; /* foobar 1 */ use crate::foobar::self;

"Crate" granularity produce the following output:

use crate::bar; /* bar 1 */; use crate::foo; /* foo 1 */; use crate::foobar; /* foobar 1 */ use crate::{ bar, foo, foo::{self /* foo 2 */, tokens::TokenData}, foobar, };

Comments foo 3 and foo 4 disappeared and there is an extra ; after comments bar 1 and foo 1.

"Module" granularity produce the following output:

use crate::bar; /* bar 1 */; use crate::foo; /* foo 1 */; use crate::foo::tokens::TokenData; use crate::foo::{self /* foo 2 */, self}; use crate::foobar; /* foobar 1 */ use crate::{bar, foo, foobar};

In addition to the same issues as for "Crate", foo::{...} includes self twice: one with foo 2 comment and one without a comment.

"Item" granularity produce the following output:

use crate::bar; /* bar 1 */; use crate::foo; use crate::foo::tokens::TokenData; use crate::foo::{self /* foo 2 */}; use crate::foobar; /* foobar 1 */

Comment foo 1 is missing because in the original code crate::foo::self is before crate::foo::self /* foo 1 */. The bar 1 comment does appear, since crate::bar::self is before crate::foobar::self; /* foobar 1 */. As above, foo 3 and foo 4 comments disappeared and there is an extra ; after comments bar 1.

as there are some issues, especially related to comments

wow there sure is 🙃

We've always had comment issues with this option (which was named merge_imports prior to a recent rename/restructure) so I'm not too surprised by some of those, but the emission of invalid code in this type of case is particularly concerning:

use rustc_ast::ast /* abc */;

I didn't add such test files as there are some issues, especially related to comments, so maybe it is better to discuss first (here or in a new issue) which of these issues should be resolved, whether there are cases where the original code should be used, what should be the expected results, etc.

Good call 👍

Submitted PR #4759 to fix the extra semicolon issue.

I will also try to resolve the issue where in some cases if the same import module appears twice - once with a comment and once without a comment, it appears twice in the formatted output. E.g. in the above output for Crate, 'create::barandcrate:::foo` appear twice.

Update:
The reason some import items appear twice for Crate and Module granularity, is because merge_use_trees() don't try to merge items with comments and just push them to the output:

rustfmt/src/formatting/imports.rs

Lines 170 to 172 in a9876e8

if use_tree.has_comment() || use_tree.attrs.is_some() {

result.push(use_tree);

continue;

This is probably a patch done because imports comments are not handled well, but it ignores duplicates.

There is a relatively simple way to remove these duplicates by ignoring all non-commented flattened items that have a duplicate with comment. However, because duplicates with and without comments are not expected to be common in practice, and because no issue was opened by users against this problem, it seems that it is not worth trying to add this enhancement.

ytmimi · 2022-04-03T15:22:14Z

I believe this backport is predicated on #3795 being backported first. At least the only portion of the tests that are failing have to do with the import order for raw identifiers. We could backport this and update the tests to follow the 1.x sorting behavior. @calebcartwright Thoughts?

calebcartwright · 2022-06-10T23:33:25Z

We could backport this and update the tests to follow the 1.x sorting behavior. @calebcartwright Thoughts?

That works for me @ytmimi, or at least we could have multiple versions of these tests that add the corresponding Version={One/Two} to account for how the ordering is applied with those raw identifiers

* Fix for issue 4725 - dedup Item imports_granularity (2nd version) * Use unique() instead of unique_by()

calebcartwright added the pr-not-reviewed label Mar 5, 2021

calebcartwright mentioned this pull request Mar 5, 2021

Deduplication doesn't occur with `imports_granularity = "Item" #4725

Closed

Fix for issue 4725 - dedup Item imports_granularity (2nd version)

57c2dd8

davidBar-On force-pushed the issue-4725-dedup-Item-imports_granularity branch from 07d8cca to 57c2dd8 Compare March 10, 2021 07:16

calebcartwright reviewed Mar 12, 2021

View reviewed changes

calebcartwright added pr-waiting-on-author and removed pr-not-reviewed labels Mar 12, 2021

Use unique() instead of unique_by()

bd045d8

calebcartwright merged commit 6b415bd into rust-lang:master Mar 16, 2021

davidBar-On mentioned this pull request Mar 17, 2021

Fix issue with extra semicolon when import comment precedes semicolon #4759

Merged

karyon added backport-triage and removed pr-waiting-on-author labels Oct 26, 2021

davidlattimore pushed a commit to davidlattimore/rustfmt that referenced this pull request Jun 11, 2022

Dedup imports_granularity = "Item" (rust-lang#4737)

18c2a8e

* Fix for issue 4725 - dedup Item imports_granularity (2nd version) * Use unique() instead of unique_by()

calebcartwright pushed a commit that referenced this pull request Jun 12, 2022

Dedup imports_granularity = "Item" (#4737)

7d34cfa

* Fix for issue 4725 - dedup Item imports_granularity (2nd version) * Use unique() instead of unique_by()

calebcartwright added 1x-backport:completed and removed backport-triage labels Jun 12, 2022

	if use_tree.has_comment() \|\| use_tree.attrs.is_some() {
	result.push(use_tree);
	continue;

Dedup imports_granularity = "Item" #4737

Dedup imports_granularity = "Item" #4737

Uh oh!

Conversation

davidBar-On commented Mar 4, 2021

Uh oh!

calebcartwright commented Mar 5, 2021

Uh oh!

davidBar-On commented Mar 5, 2021

Uh oh!

calebcartwright commented Mar 7, 2021

Uh oh!

davidBar-On commented Mar 8, 2021

Uh oh!

calebcartwright commented Mar 9, 2021

Uh oh!

davidBar-On commented Mar 10, 2021

Uh oh!

calebcartwright commented Mar 12, 2021

Uh oh!

calebcartwright left a comment

Choose a reason for hiding this comment

Uh oh!

calebcartwright Mar 12, 2021

Choose a reason for hiding this comment

Uh oh!

davidBar-On Mar 12, 2021

Choose a reason for hiding this comment

Uh oh!

calebcartwright Mar 13, 2021

Choose a reason for hiding this comment

Uh oh!

davidBar-On Mar 15, 2021

Choose a reason for hiding this comment

Uh oh!

calebcartwright Mar 12, 2021

Choose a reason for hiding this comment

Uh oh!

davidBar-On Mar 15, 2021

Choose a reason for hiding this comment

Uh oh!

calebcartwright Mar 16, 2021

Choose a reason for hiding this comment

Uh oh!

davidBar-On Mar 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ytmimi commented Apr 3, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

calebcartwright commented Jun 10, 2022

Uh oh!

Uh oh!

Dedup `imports_granularity = "Item"` #4737

Dedup `imports_granularity = "Item"` #4737

davidBar-On Mar 17, 2021 •

edited

Loading

ytmimi commented Apr 3, 2022 •

edited

Loading