Skip to content

adding proof of context-sensitivy of raw string literals #16001

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 27, 2014

Conversation

Gankra
Copy link
Contributor

@Gankra Gankra commented Jul 26, 2014

Stumbled across this and thought it would be cool to prove. I've never used Ogden's Lemma before, but I'm pretty sure I used it right. The pumping lemma definitely doesn't seem sufficient for the job. In particular, when using the pumping lemma, you can always just pump one of the quotes, and it's fine. Ogden's Lemma lets you effectively force the pumper to use certain characters in the string.

@cmr

@emberian
Copy link
Member

Awesome, one more item from my todo list gone! I spent a lot of time trying to use the pumping lemma and came to the same conclusion.

@emberian
Copy link
Member

Looks good to me, could someone with better chops than I look at this? @pnkfelix @nikomatsakis @zwarich ?

@@ -1,7 +1,8 @@
Rust's lexical grammar is not context-free. Raw string literals are the source
of the problem. Informally, a raw string literal is an `r`, followed by `N`
hashes (where N can be zero), a quote, any characters, then a quote followed
by `N` hashes. This grammar describes this as best possible:
by `N` hashes. Critically, `N` consecutive hashes *cannot* occur between the quotes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this line mean? It sounds to me like it's saying you can't say r#"#"#, but I just tested and that parses just fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think this text contains a misunderstanding of the end-delimiter of a raw-string.

As @kballard said, r#"#"# is fine.

I imagine the actual problematic string literal will be something like r####" "### "####, where there is a leading " before the series of repeated octothorpes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I reviewed the reasoning, and I think it all still holds together if you just change the definition of s in the manner I suggested above, namely:

`s = r#^{p+2}""#^{p+1}"#^{p+2}`

e.g. for `p = 2`: `s = r####""###"####`

Then that will actually produce an invalid string literal after pumping.

Update: Well, maybe that's a little too optimistic on my part. The problem is that the input may decompose into some choice for v and x where v or x includes the " that occurs within the string literal. I am still working through the case analysis here. There may be a different choice for s that is easier to work with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, this problem holds for my original proof. Given r####"###"#### with the given marking, one can take v= "# x=epsilon, z=epsilon, and the string becomes r####"#"###"#### which is valid even in my incorrect scheme.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticing this flaw, it strikes me that Ogden's Lemma isn't powerful enough either. The high level of non-structure allowed in between the quotes makes this tricky.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 26, 2014

Regexes to the rescue. New proof has landed.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 26, 2014

Shout-outs to http://cs.stackexchange.com/a/437/14863 for inspiration, and also letting me avoid even trying to understand Parikh's theorem.


Neither `v` nor `x` can contain a `"` or `r`, as the number of these characters
in any string in `R'` is fixed. So `v` and `x` contain only hashes.
Consequently, Of the three sequences of hashes, `v` and `x` combined
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: lowercase "Of" to "of" here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More than happy to have that be the biggest complaint you have by the time you've reached that part of the proof :)

Fixed.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 27, 2014

@pnkfelix all squashed and accounted for

bors added a commit that referenced this pull request Jul 27, 2014
Stumbled across this and thought it would be cool to prove. I've never used Ogden's Lemma before, but I'm pretty sure I used it right. The pumping lemma definitely doesn't seem sufficient for the job. In particular, when using the pumping lemma, you can always just pump one of the quotes, and it's fine. Ogden's Lemma lets you effectively force the pumper to use certain characters in the string.

@cmr
@bors bors closed this Jul 27, 2014
@bors bors merged commit 6444b5e into rust-lang:master Jul 27, 2014
@Gankra Gankra deleted the rawstrings-proof branch August 18, 2014 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants