Skip to content

Add a Rewrite trait #107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 23, 2015
Merged

Add a Rewrite trait #107

merged 4 commits into from
Jun 23, 2015

Conversation

cassiersg
Copy link
Contributor

This PR is just a proposal of a new trait, and is primarily intended for discussion. It's a draft of implementation of an idea I wanted to share and discuss. It's just an idea, that will maybe not well work.

Because it might change a lot of code, I prefer to discuss it before a complete implementation.

Motivation

We have a recurrent pattern in the code: rewrite some code with a given offset and width. (It's very common in expr.rs, for example).

I wanted to abstract "something that can be rewritten into offset and width" when working on list formatting. Format a list (of args, types parameters...) is also something useful. With the current lists module, it needs some recurrent code, essentially : rewrite all items of the list within a maximal width, but the last one within the given width.

I found a lot of these patterns when I tried to format types (ast::Ty).

Design

The idea is quite simple: add a new trait Rewrite to represent the

pub trait Rewrite {
    fn rewrite(&self, width: usize, offset: usize) -> Option<(String, Option<Comment>)>;
}

rewrite returns an Option to express the fact that it's sometime not possible to not excess the given width. (Should it be a Result instead ?) The first element of the tuple (String) is the code formatted, and the second is a possible comment. Getting the comment separated of the code gives more degrees of freedom to the caller : it can put the comment on an own line before the code, of at the end of the line. The enum Comment is intended to be able to make a difference between a doc comment and another kind of comment (we will eventually replace bloc comments with line comments), but it's maybe not useful and a simple string would be sufficient.

The current implementation of this trait to format lists is only used by a place-holder for the write_list function.

At the very end, all rewrite_* functions could be impls of Rewrite.

When it comes to implement and use this trait, things are getting wrong : it makes sense to implement it for ast::Expr, and using rewrite_expr as the rewrite method, but it doesn't work since rewrite_expr is itself a method of FmtVisitor.

I see two solutions to this problem : in the longterm, we might have a self-sufficient AST, and FmtVisitor would not more be needed. Currently, it's not possible though. The second solution is to create structs containing the item we want to format and a ref to the FmtVisitor. I don't know if this approach would be sustainable if we use it extensively. There is maybe another solution. (?)

I tried also an alternative, which is a more ad-hoc solution : pass the rewriting function as argument of rewrite_list. It looked me a bit "hackish" and less extensible, but maybe it's only because i'm not used to that.

So, please let me know what you think about it.

About the list formatting algorithm, and tuple formatting, the changes are not very relevant, it's just an example of impl of Rewrite. The tuple formatting is a good example of use case for Rewrite : the same algorithm may be used for tuple litterals and for tuple types.

@nrc
Copy link
Member

nrc commented Jun 15, 2015

I have been thinking that the current architecture - a billion methods on one big context object is very non-optimal and I would love to improve it. There might well be a place for a Rewrite trait there. I wonder if we could have two traits - one for rewriting that returns Strings and one for rewriting which modifies a mutable buffer? I haven't really thought that through - it might be silly.

Some details on this proposal:

  • I think Option rather than Result is right - the 'error' case is expected in some cases and there is no error value, so I think Option is better
  • I'm not sure about returning the comment too. What is the motivation for this? It seems like unnecessary flexibility at first blush, and at the same time like it might not be flexible enough - I could imagine that if you want to break it up like this you might want pre- and post-comments plus some way to denote white space, etc.
  • The long term plan is to use our own visitor, but probably not our own AST. But that is some way off.
  • Creating more structs as well as the AST makes me feel bad - it seems like a lot of transforming data for not much pay off. Would it be reasonable to just call the trait method from the visit method?


pub enum Comment {
Simple(String),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you envisage this enum growing in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least adding a DocComment and InnerDocComment. We need to know it if we handle any reformating of the comments.

@nrc
Copy link
Member

nrc commented Jun 15, 2015

@marcusklaas thoughts?

@cassiersg
Copy link
Contributor Author

  • I don't see much use cases for rewriting into a mutable buffer, why should we need it ? I had thought the goal was to move as much as possible from visit_* (and thus the shared changes) to rewrite_*, which only return Strings. (We might use something like Ropes for performance, but it's the same idea).
  • Comments: The way we handle it depends of how much semantic we have for them. My initial idea was they are semantically "pinned" to an element of the AST (thereafter "item"). They may also live alone, not attached to an element. There is also a distinction between comment types : standard comments and doc comments essentially (maybe others), but no distinction between inline (/* */) and line (//) comments, given there is no semantic distinction between them.
    To format these, there would be two possibilities : at the end of the line (and so the line should end with the item that's commented), or on the line before the commented item (and the item should be alone on the next line).
    All of this depends on convention used in the code, and I'm not aware of such explicit convention and I don't know well the big rust codebases like rustc and servo.
    We could also just use lexical position, and avoid semantic issues, but it would prevent useful things like moving an end-of-line comment after a long line to the previous line.
    The implementation of any of this can be done in a somewhat hacky way while we don't have comments in the AST. When enventually (let's hope it) comments will be in the AST, it will probably lead semantics in a clear direction.
  • I'm also not in favor of reconstructing a lot of structs. However there are some cases where we need a reference to the codemap to rewrite, and since it is not included in the elments of the AST, we need to handle it another way. With the Rewrite trait, the simplest might be to pass it as argument of rewrite, but I'm not having a good feeling with that. Do you see another solution ? (Or think that's right ?).

@nrc
Copy link
Member

nrc commented Jun 15, 2015

btw, cc #103 where there was also some discussion of re-architecting.

@nrc
Copy link
Member

nrc commented Jun 15, 2015

Comments

There is no formal statement on how comments work with the AST. Currently doc comments are in the AST and regular comments are not. But this is an implementation detail and subject to change (long term, I hope all comments will be in the AST, but who knows(it's not entirely clear how that would work)). So, currently comments are 'between' AST nodes. That makes it kind of hard to decide where they should be semantically attached. E.g.:

struct Foo; // A comment about Foo

// A comment about Bar
struct Bar;

At the moment there are some slightly odd heuristics about working this out in lists and we guess or fail else where.

I think that at the moment we have to be pretty crude with comments. We can probably have better heuristics, but we're not going to be perfect. I think that is OK - comments are kind of an edge case and people won't expect us to get them perfect (as long as we don't drop any, which we do in a few places atm).

@nrc
Copy link
Member

nrc commented Jun 15, 2015

re the Codemap, we should probably keep some kind of context struct which includes a reference to the Codemap. I expect we will want other info in there in the future. That should be one of the arguments to rewrite, I think.

@cassiersg
Copy link
Contributor Author

I think the discussion in #103 is somewhat orhogonal to the discussion here. We're only speaking of inline mode here. Still two points:

  • Go to block mode from rewrite should be possible as long as we have sufficient informations in the conext struct you suggested.
  • Performance : the rewrite doesn't make it worse than it's now, but we can use a more adapted data strucutre for the return, to avoid massive copies of Strings (a Rope ?).

@cassiersg
Copy link
Contributor Author

Comments

While the parsing is based on add heurisics, maybe we should define clearly what's the goal of these heuristics, ie what meaning and relation to the AST we will have after the comments have been parsed. But a precise definition of this is though not needed to implement Rewrite.
The question here is if we want rewrite to return only a string, with comments already included in it, or if we want to return some comments outside of the string.
If we include comments in the string, the parsing heuristics will be very simplified, but we will not be able to do any "smart" operation with comments: we can only leave them where they are in the text, keeping newline before/after as they are. The implementation would be something like: "while rewriting, for each pair of syntaxically following nodes in the AST, detect if there is a comment and copy it".
If we have a separate comment information, we can for example transform

struct AAAAAAAAAAAAAAA; // a long comment, the line goes over 100 chars
fn a(x: X /* comment */, y: Y /* comment */, z: Z)

to

// a long comment, the line goes over 100 chars
struct AAAAAAAAAAAAAAA;
fn a(x: X, // comment
     y: Y, // comment
     z: Z)

with some not so complicated heuristics. I think it's worth the cost, while it will not be perfect.

Another fact the pushes me in this direction is that we already use this kind of pattern in write_list : we have a separated string for comments.

The exact form of the comment information can be anything, from a simple string, to a big struct,

@marcusklaas
Copy link
Contributor

Interesting PR! I definitely believe that pulling logic from the Visitor is the way to go. The idea of returning comments separately to the caller also appeals to me. Doing transformations as in your last example would be very nice!

I do have some concerns regarding the implementation, however. Currently, most comments are pulled from the 'space' between the current span and the last span. I don't see how we could do this without an context. Also, what do with items with multiple comments? For example:

// Comment 1
fn some_func() /* Comment 2 */ {
    ...
}

The Rewrite trait is nice, I think. But we should only use it in cases where it fits naturally, e.g., actual items and expressions. For more abstract structures it may be better to just have a dedicated function, such as write_list.

@cassiersg
Copy link
Contributor Author

I think the context strcture suggested by @nrc is a good solution for that. It could include a Span going from the end of the preceding AST element to the following one. EDIT: In fact, not a good idea, see my next comment.

The handling of multiple comments is an interesting question. They could be represented as a variant for Comment: Composed(Vec<Comment>). How they should be formatted is however not the question here, I think.

Do you have a reason for which you want to avoid implementing Rewrite on lists ?
I was thinking we should implement Rewrite even on abstract structures. For example, we could have a simple Parentized<T: Rewrite> { item: T} struct, that could be used to rewrite lots of things, like tuples, arg lists, arithmetic operations (and also type parameters lists, arrays... with a parameter to use other enclosing delimiters). A less trivial example is the case of things that look like function definitions, which appears in fn types (pure function pointers) and generic types (often for closures: Fn(x: X) -> A).
If we implement this logic as functions, I think we will have to write a lot of trivial impls of Rewrite. If we have an function into_delimiters<T: Rewrite>(item: T, delimiters:...), it will be complicated to rewrite a tuple : we will need to make something like type TupleLikeList = Vec<Expr> and then implement Rewrite on this type, to be able to use into_delimiters on it.
In brief, we will have a problem when composing such "abstract" functions.

I will update this PR to implement things we discussed, but it may take a few days.

@cassiersg
Copy link
Contributor Author

I changed the content of the PR, removing the impl for list (which was not very representative and a bit messy) in favor of a impl for ast::Expr. This is a only a reacrchitecturing of the code, the result should not have changed.

The diff, ignoring whitespaces (there is a lot of reindentation) is essentially made of trivial fixes. The biggest difference being that all rewrite_* functions now return an Option<String>.

I think this shows that the rewrite_* approach is at least not worst than the previous one.

In this commit, I completely avoided the tricky part of the question: comments.

@cassiersg
Copy link
Contributor Author

Comments

Upon reflection, I think I've been wrong with that. It's probably better to leave comments outside of the result of rewrite. This is motivated by two facts:

  • While we don't have comments in AST, it's very difficult to retrieve comments surrounding an element having only the element itself.
  • The case of write_list (needing comments associated elements, with no more context) will not be prevalent.

So, I suggest to keeping the current approach of comments, taking convention for rewrite:

  • Comments that are inside the span of an element should be included in the string returned by rewrite.
  • Comments outside this span are handled by the parent element.

We could though to have a type representing a comment, which could implement Rewrite. This would allow to keep a uniform formatting of comments while leaving room for ad-hoc heuristics in the parsing and placement.

@cassiersg
Copy link
Contributor Author

I'm a bit irresolute how to make this PR move forward: should I make a big move to the use of Rewriter in items.rs, types.rs, imports.rs (and maybe lists.rs). On one side, this would show clearly if this concept works well. On the other side, doing it well would involve lots of changes in the formatting algorithms to take into account the width, where we currently ignore it. This would make a huge PR.
Do you think the design looks well now, and what changes should be made ? Is there a need for more implementation to see how it behaves ?
The big improvement by @marcusklaas in #93 shows well that there is no problem with comment handling decoupled for element rewriting, so I removed all stuff with comments here. That makes the design much simpler.

@nrc
Copy link
Member

nrc commented Jun 18, 2015

@cassiersg sorry I haven't had a chance to look at the revised PR. I'll try and do so over the weekend. I would advise against a big all over change. Is it possible to do it incrementally - just try it on one kind of item and see how it feels? I would hold off on doing anything major before I get a chance to read over the current stuff.

@cassiersg
Copy link
Contributor Author

@nrc All proposed changes may be made incrementally, and they are very simple for a basic "lifting only" implementation, which doesn't change anything in the logic of the code. However, in many cases, this will lead to a code that doesn't take advantage of the possibility to return None, and some functions will not respect the width constraint, because they didn't had such thing implemented.

So, I would prefer an approach where we implement Rewrite when we write new code, and when we update existing code to respect the contract of Rewrite. This kind of update is much more complex since it touches the behavior of the code.

I implemented Rewrite for the ast::Expr, replacing rewrite_expr, and it works well, seems natural. I think it will play nice with ast::Ty (I was working on that when I thought to this design), and it will probably benefit to things like LifetimeDef, generics...

Take you time to review this PR and if you feel you need other implementations to evaluate the benefits and drawbacks, don't hesitate to tell it to me.

@marcusklaas
Copy link
Contributor

I have thought a bit about this. I think returning an (optional) String is not an optimal strategy from a performance perspective: expressions with deep syntax trees can cause many unnecessary copies. A leaf in a tree of height n is copied n times!

I now quite like the solution that was proposed by @nrc here. If I understand correctly, we would give expression rewriters a generic buffer which it can append. This could be either a String, a temporary buffer, or a ChangeSet, the final buffer. To the rewriting function, it makes no difference which it receives. Most of the time, we just want to append the ChangeSet, so we can just pass that along to the child expressions. In this case, we do no unnecessary copies. It still gives us the freedom to try formatting a subtree and discarding it if we don't like it. This would incur an extra copy at that level.

A rewrite function's signature could look something like this:

fn rewrite_expr<B: Buf>(buf: &mut B, ex: ast::Expr, codemap: &Codemap, width: usize, offset: usize) -> Option<()>

where Buf is a trait containing a single append method, implemented by String and ChangeSet. The CodeMap we would need to map span to strings, to capture comments, for example.

What do you think?

@cassiersg
Copy link
Contributor Author

I totally agree with you that the current proposal is very bad in terms of performance. My idea was first to make something simple with String (poor performance, but no problems/questions), and then replace String with a more efficient data structure, that has the same interface as String (so the change could be painless).
The data structure may be a tree of strings (using Cow or another mechanism to keep refs to the codemap). That would make to amortized cost of appending O(1), essentially just push of pointer into a Vec, and avoid multiple copies.

I think @nrc has also thought to use better datastructures to represent strings, as his strings.rs crate is a dependency of rustfmt.

The option you suggest is interesting, and is probably not far of the optimum performance while we don't make too much try/discard. I'm not clear how we do such discards: does it need to allocate a new Buf (and then copy it to another Buf), or is there some way to take a mark to a point in the buffer and remove everything after that mark ?

I think there will be lots of places in the code where we might want to discard some formatting already done: when we need to rewrite some sub-items when we rewrite an item, if the formatting of the second sub-item fails, we must cancel the first.

Another point is out-of-order reformatting: how would fit the current rewriting of lists (major place where we don't rewrite things in a "streaming" way) ? Would that also need to allocate new buffers ?

If all these things need new buffers, we won't probably gain a lot of performance over returning String, since these are omnipresent. All these concerns are valid mainly in deeply nested contexts. In more linear context such as blocks, where we append in-order, without the need of "discard point" between each instruction, a buffer is probably well suited.

If there isn't major perfomance problems, I think it is simpler to work with code that returns something than with code that modifies a shared buffer, especially when discard is needed.

As with most performance-related questions, it's hard to evaluate the exact impact without significant benchmarks, but i don't know if the code is sufficiently mature to make that.

Does it make sense or I am missing a point here ?

@nrc
Copy link
Member

nrc commented Jun 22, 2015

@cassiersg I think you are on point with all your questions/assumptions, and this is why we have the two different modes of operation - buffers vs returning strings. Furthermore, currently, whenever we are in buffer mode, formatting is infallible, so we never need to 'undo' formatting. Earlier, I had a version which did do modification, and that is why there is a rope in strings.rs. I'm not sure exactly how to make that efficient, but I think it can be done.

Some(result)
}

fn rewrite_call(context: &RewriteContext,
callee: &ast::Expr,
args: &[ptr::P<ast::Expr>],
width: usize,
offset: usize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the indentation here (this should make us fail a test, I wonder why travis passes?)

@nrc
Copy link
Member

nrc commented Jun 22, 2015

@cassiersg I'd like to keep thinking about the long term and an efficient solution for the buffered case, but this PR looks good as is (modulo the small comments), so I'd like to land it. Thanks for working on it!

@cassiersg
Copy link
Contributor Author

@nrc I fixed style issues, and added a try_opt! macro.
To compile on last nightly, it needs two feature gates changes and the update of strings.rs.
Thanks for the review.

#[macro_export]
macro_rules! try_opt_collect {
($expr:expr) => ({
let mut res = Vec::with_capacity($expr.size_hint().0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could using $expr twice have unintended consequences? When it alters mutable state, for example? It think we're evaluating the expression twice anyway, which is not optimal.

Maybe we should let e = $expr; first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a thing is already in the std libaries, by the way. See the first implementor of: http://doc.rust-lang.org/std/iter/trait.FromIterator.html.

let vec: Vec<_> = try_opt_collect!(option_iter)

should be equivalent to

let vec: Vec<_> = try_opt!(option_iter.collect())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't found this impl of IntoIterator for Option. That's exactly what I need, thanks for the pointer.

nrc added a commit that referenced this pull request Jun 23, 2015
@nrc nrc merged commit 58df466 into rust-lang:master Jun 23, 2015
@cassiersg cassiersg deleted the rewrite branch June 23, 2015 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants