Skip to content

Improve allocation of escape_into #655

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 24, 2020

Conversation

hhirtz
Copy link
Contributor

@hhirtz hhirtz commented Mar 24, 2020

In regex-syntax,
Make escape_into allocate at most twice (all characters are escaped)
instead of more if used on an unallocated String. This way, escape
doesn't have to allocate.

In regex-syntax,
Make escape_into allocate at most twice (all characters are escaped)
instead of more if used on an unallocated String.  This way, escape
doesn't have to allocate.
@hhirtz
Copy link
Contributor Author

hhirtz commented Mar 24, 2020

Sorry for the second push, added syntax: to the commit title.

Copy link
Member

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BurntSushi BurntSushi merged commit 3ff6ae1 into rust-lang:master Mar 24, 2020
@hhirtz hhirtz deleted the escape-reserve branch March 24, 2020 12:09
@hhirtz hhirtz restored the escape-reserve branch March 24, 2020 12:09
@hhirtz hhirtz deleted the escape-reserve branch March 24, 2020 12:10
@pickfire
Copy link

pickfire commented Apr 8, 2020

@hhirtz doesn't it still needs to be allocated later from escape_into since escape uses it, why should we allocate twice?

@hhirtz
Copy link
Contributor Author

hhirtz commented Apr 8, 2020

hi @pickfire, I don't really understand the first part of your question, but the patch here made escape_into performs less allocations on average.

For example, if you called escape_into with String::new, you'd likely end up doing more allocations than necessary, because String allocates on push when there's not enough space. There's an example here: https://doc.rust-lang.org/std/string/struct.String.html#representation.

Since s = String::with_capacity(cap) is semantically the same as s = String::new(); s.reserve(cap), I removed the with_capacity in escape.

Also escape_into(text, &mut buf) writes between text.len() (no character needs an escape) and text.len() * 2 (all characters are 1-byte codepoints, and they all need to be escaped*) characters in buf. Here it reserves text.len() bytes in buf because we know it needs at least text.len() more bytes in buf.

*Though I don't know any regex meta character that uses more than 1 byte in UTF-8.

@pickfire
Copy link

pickfire commented Apr 8, 2020

@hhirtz Ah, thanks for the explaination, now I understand it. By the way, if we count first for exactly how many allocations we need, will allocating / reserving once be faster than potentially allocating multiple times?

@hhirtz
Copy link
Contributor Author

hhirtz commented Apr 8, 2020

Yes, because in most cases reallocation is performed by allocating a larger buffer and copying the previous one into it. If you know how much you want to push, it's better to call reserve before.

@BurntSushi
Copy link
Member

will allocating / reserving once be faster than potentially allocating multiple times?

Generally, yes. This is generally referred to as "amortizing allocation" and it greatly influences the structure of a lot of performance critical code. The entire architecture of the regex crate is designed around doing just that. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants