Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay literal unescaping #118699

Closed

Conversation

nnethercote
Copy link
Contributor

r? @ghost

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 7, 2023
@rustbot
Copy link
Collaborator

rustbot commented Dec 7, 2023

These commits modify the Cargo.lock file. Unintentional changes to Cargo.lock can be introduced when switching branches and rebasing PRs.

If this was unintentional then you should revert the changes before this PR is merged.
Otherwise, you can ignore this comment.

@nnethercote nnethercote marked this pull request as draft December 7, 2023 08:10
@rust-log-analyzer

This comment has been minimized.

@bors

This comment was marked as resolved.

@nnethercote nnethercote force-pushed the delay-literal-unescaping branch 2 times, most recently from 536b970 to 8f12cac Compare December 8, 2023 06:40
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Dec 9, 2023
…e1-dead

Unescaping cleanups

Minor improvements I found while working on rust-lang#118699.

r? `@fee1-dead`
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Dec 9, 2023
Rollup merge of rust-lang#118734 - nnethercote:literal-cleanups, r=fee1-dead

Unescaping cleanups

Minor improvements I found while working on rust-lang#118699.

r? `@fee1-dead`
@nnethercote nnethercote force-pushed the delay-literal-unescaping branch from 8f12cac to 9f7951b Compare December 11, 2023 06:38
@rust-log-analyzer

This comment has been minimized.

@nnethercote nnethercote force-pushed the delay-literal-unescaping branch from 9f7951b to 4314dff Compare December 11, 2023 23:50
@rust-log-analyzer

This comment has been minimized.

@nnethercote nnethercote force-pushed the delay-literal-unescaping branch from 4314dff to 13e9c76 Compare December 12, 2023 03:43
@rust-log-analyzer

This comment has been minimized.

@nnethercote nnethercote force-pushed the delay-literal-unescaping branch from 13e9c76 to a748526 Compare December 12, 2023 04:20
@rust-log-analyzer

This comment has been minimized.

@nnethercote
Copy link
Contributor Author

nnethercote commented Dec 12, 2023

@rust-lang/lang: This PR proposes a backward-compatible language change.

It changes some errors related to literals from syntactic to semantic. In the compiler, this is all the errors in the EscapeError type. These all relate to char/byte/string literals, and mostly relate to escape sequences, plus a couple of others (e.g. having more or less than one char/byte in a char/byte string literal, or having a NUL char in a C string literal.)

Here is example code showing how things change. This covers some of the relevant errors, but not all; the missing ones would be affected in the same way.

 fn main() {
    '';            //~ error: empty character literal
    b'ab';         //~ error: character literal may only contain one codepoint
    "\a";          //~ error: unknown character escape: `a`
    b"\xzz";       //~ error: invalid character in numeric character escape
    "\u20";        //~ error: incorrect unicode escape sequence
    c"\u{999999}"; //~ error: invalid unicode character escape
}

sink! {
    '';             // was an error, now allowed
    b'ab';          // was an error, now allowed
    "\a";           // was an error, now allowed
    b"\xzz";        // was an error, now allowed
    "\u20";         // was an error, now allowed
    c"\u{999999}";  // was an error, now allowed
};  

#[cfg(FALSE)]
fn configured_out() {
    '';             // was an error, now allowed
    b'ab';          // was an error, now allowed
    "\a";           // was an error, now allowed
    b"\xzz";        // was an error, now allowed
    "\u20";         // was an error, now allowed
    c"\u{999999}";  // was an error, now allowed
}

This means a macro can assign meaning to arbitrary escape sequences, such as foo!("\a\b\c").

This change is consistent with a general trend of delayed literal checking:

  • Floating point literals have had this delayed checking behaviour for a long time (forever?)
  • This PR did a similar thing for numeric literals.
  • The NUL char checking for C string literals is implemented in this semantic (delayed) fashion.

cc @m-ou-se

@nnethercote nnethercote added the I-lang-nominated Nominated for discussion during a lang team meeting. label Dec 12, 2023
@petrochenkov petrochenkov self-assigned this Dec 12, 2023
@scottmcm
Copy link
Member

scottmcm commented Dec 12, 2023

Some parts of this 💯 agreed. For example, "\u{999999}" working lexically seems entirely appropriate, along the lines of the numeric literal things we decided before.

I do wonder, though, whether we want to leave some of the lexical space intentionally reserved here. The above says

This means a macro can assign meaning to arbitrary escape sequences,

But that restricts the possibilities of new lexically-meaningful escape sequences in future.

Imagine, say, that we wanted to one day allow something like "hello \{foo("world")}!". Supporting that is a non-breaking change today, since \{ is a lexical error, but if I understand correctly it'd be a breaking change after doing this, since it'd change 1 token to 3 tokens.

Relatedly, 'ab' working lexically seems fine, I guess, but it's not obvious to me that '' should. That just not being a token seems fine? I don't see how there's an "unescaping" complication with it like there is for some of the other cases here. The regex using + instead of * doesn't seem like a "semantic" check to me.

And we've discussed things like using ''a as a variant of lifetimes before. (IIRC it was to give two different scopes of lifetimes for the in-band lifetimes discussions.) So even if we don't have any plans to actually use that today, it's not clear to me that we should have it start tokenizing now, vs staying a lexical error. Letting it potentially be a single token in future, without needing a breaking change, rather than the two it would be in this PR (if I'm understanding right) could potentially be valuable.

To try to make another example, I'm definitely happy with "\u{111111}" and '\u{111111}' and "\u{11_1111}". But I'd be very skeptical of lexically allowing, say, "\u{11'1111}" or particularly '\u{11'1111}', even though we could.


TL/DR: I agree with delayed semantic checking, but my instinct is that in-the-straight-forward-lexing-regex restrictions are different from semantic checks.

@bors

This comment was marked as resolved.

@nnethercote
Copy link
Contributor Author

For completeness, here is a program exhibiting all the relevant literal errors. The comments describe the current behaviour.

fn main() {
    // ---- SYNTACTIC (LEXING) ERRORS ----

    '';             // ZeroChars

    'ab';           // MoreThanOneChar

    "\x\\";         // InvalidCharInHexEscape + LoneSlash (a weird one)

    "\a";           // InvalidEscape

    "a^Mb";         // BareCarriageReturn
    r"a^Mb";        // BareCarriageReturnInRawString

    '   ';          // (TAB) EscapeOnlyChar

    "\x1";          // TooShortHexEscape

    "\xz";          // InvalidCharInHexEscape

    "\xff";         // OutOfRangeHexEscape

    "\u1234";       // NoBraceInUnicodeEscape

    "\u{xyz}";      // InvalidCharInUnicodeEscape

    "\u{}";         // EmptyUnicodeEscape

    "\u{1234";      // UnclosedUnicodeEscape

    "\u{_123}";     // LeadingUnderscoreUnicodeEscape

    "\u{1234567}";  // OverlongUnicodeEscape

    "\u{dfff}";     // LoneSurrogateUnicodeEscape

    "\u{123456}";   // OutOfRangeUnicodeEscape

    b"\u{1234}";    // UnicodeEscapeInByte

    b"🦀";          // NonAsciiCharInByte

    // ---- SEMANTIC (DELAYED) ERRORS ----

    "abc"xyz;       // InvalidSuffix

    1xyz;           // InvalidIntSuffix
    1u20;           // InvalidIntSuffix

    1.0xyz;         // InvalidFloatSuffix
    1.0f20;         // InvalidFloatSuffix

    0b10f32;        // NonDecimalFloat
    0o10f32;        // NonDecimalFloat

    0x100000000000000000000000000000000; // IntTooLarge

    c"a \0 b";      // NulInCStr 
}

This PR would make all of them delayed, i.e. compiling this program with -Zparse-only would produce no errors.

I'm going to argue for one of two outcomes here.

  • Accept this PR. It makes for a consistent story: all literal checking is delayed. But it does preclude (or at least make more difficult) the addition of some hypothetical future escape syntax.
  • Reject this PR, but change NulInCStr to a lexer error. It's a weird exception at the moment, being the only string literal error that is delayed. I'd be happy to implement that change in a separate PR. (In fact, it's already implemented as the "Detect NulInCStr error earlier" commit in this PR). I think there is still time for this because C string literals haven't reached the release channel yet.

I don't want to end up in a situation where some string literal errors are lexer-level and some are delayed (e.g. \u{999999}). I think that's ugly from both a language design POV and an implementation POV. The numeric literals are already like that, which is unfortunate, but let's not make it worse.

@nnethercote nnethercote force-pushed the delay-literal-unescaping branch from a748526 to f9e3632 Compare December 12, 2023 22:25
@rust-log-analyzer

This comment has been minimized.

@scottmcm
Copy link
Member

I think that's ugly from both a language design POV and an implementation POV.

I'm not convinced by this everywhere.

If I look at https://spec.ferrocene.dev/lexical-elements.html#character-literals for example, that core of

CharacterLiteral ::=
    ' CharacterContent '

seems entirely reasonable, and changing that to

CharacterLiteral ::=
    ' CharacterContent+ '

(or to *) doesn't seem like that big of a difference to me. If anything, it might be easier to not have the repeat there, since that's one fewer thing that needs to be specified later, since it's lexically one codepoint automatically.

Whereas there are things like

A UnicodeEscape starts with a \u{ literal, followed by 1 to 6 instances of a HexadecimalDigit, inclusive, followed by a } character. It can represent any Unicode codepoint between U+00000 and U+10FFFF, inclusive, except Unicode surrogate codepoints, which exist between the range of U+D800 and U+DFFF, inclusive.

where I do agree with this PR that having it be just either \\u\{\x{1,6}\} or \\u\{\x+\} from a lexical sense would be a nice simplification, moving the USV requirements out to a semantic check instead.

@nnethercote
Copy link
Contributor Author

nnethercote commented Dec 12, 2023

I do agree with this PR that having it be just either \\u\{\x{1,6}\} or \\u\{\x+\} from a lexical sense would be a nice simplification, moving the USV requirements out to a semantic check instead.

It would simplify the lexical specification, at the cost of requiring an additional semantic check later. Overall that's less simple, IMO.

Likewise, from the implementation POV, it's nice if all the char/byte/string literal checking is in one place, be it the lexer, or HIR lowering. Having some checks in the lexer and others in HIR lowering is worse.

@nnethercote nnethercote marked this pull request as ready for review December 14, 2023 00:06
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 16, 2023
…, r=fee1-dead

More unescaping cleanups

More minor improvements I found while working on rust-lang#118699.

r? `@fee1-dead`
@bors

This comment was marked as resolved.

@nnethercote nnethercote force-pushed the delay-literal-unescaping branch 2 times, most recently from 1e49114 to 96e5189 Compare December 18, 2023 05:04
@rust-log-analyzer

This comment has been minimized.

@bors

This comment was marked as resolved.

By making it an `EscapeError` instead of a `LitError`. This makes it
more like the other errors produced during unescaping.

NOTE: this means these errors are issued earlier, before expansion,
which changes behaviour. The next commit will delay issue of this error
and others, reverting the behaviour change for this particular error.

One nice thing about this: the old approach had some code in
`report_lit_error` to calculate the span of the nul char from a range.
This code used a hardwired `+2` to account for the `c"` at the start of
a C string literal, but this should have changed to a `+3` for raw C
string literals to account for the `cr"`, which meant that the caret in
`cr"` nul error messages was one short of where it should have been. The
new approach doesn't need any of this and avoids the off-by-one error.
It's a more logical spot for it, and will be a big help for the next
commit.

Doing this creates a new dependency from `rustc_ast_lowering` on
`rustc_parse`, but `rustc_ast_lowering` is clearly higher up the crate
graph, so this isn't a big deal.

One thing in favour of this change, is that two fluent labels were
duplicated across `rustc_session` and `rustc_parse`:
`invalid_literal_suffix` and `parse_not_supported`. This duplication is
now gone, so that's nice evidence that this is a reasonable change.
Currently string literals are unescaped twice.

- Once during lexing in `cook_quoted`/`cook_c_string`/`cook_common`.
  This one just checks for errors.

- Again in `LitKind::from_token_lit`, which is called when lowering AST
  to HIR, and also in a few other places during expansion. This one
  actually constructs the unescaped string. It also has error checking
  code, but that part of the code is actually dead (and has several
  bugs) because the check during lexing catches all errors!

This commit removes the error-check-only unescaping during lexing, and
fixes up `LitKind::from_token_lit` so it properly does both checking and
construction. This is a backwards-compatible language change: some
programs now compile that previously did not. For example, it is now
possible for macros to consume "invalid" string literals like "\a\b\c".
This is a continuation of a trend of delaying semantic error checking of
literals to after expansion:
- rust-lang#102944 did this for some cases for numeric literals
- The detection of NUL chars in C string literals is already delayed in
  this way.

Notes about test changes:

- `ignore-block-help.rs`: this requires a parse error for the test to
  work. The error used was an unescaping error, which is now delayed to
  after parsing. So the commit changes it to an "unterminated character
  literal" error which still occurs during parsing.

- `tests/ui/lexer/error-stage.rs`: this shows the newly allowed cases,
  due to delayed literal unescaping.

- Several tests had unescaping errors combined with unterminated literal
  errors. The former are now delayed but the latter remain as lexing
  errors. So the unterminated literal part needed to be split into a
  separate test file otherwise compilation would end before the other
  errors were reported.

- issue-62913.rs: The structure and output changed a bit. Issue rust-lang#62913
  was about an ICE due to an unterminated string literal, so the new
  version should be good enough.

- literals-are-validated-before-expansion.rs: this tests exactly the
  behaviour that has been changed, and so was removed

- A couple of other test produce the same errors, just in a different
  order.
@nnethercote nnethercote force-pushed the delay-literal-unescaping branch from 96e5189 to ee19d52 Compare December 19, 2023 11:46
@rust-log-analyzer
Copy link
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
GITHUB_ENV=/home/runner/work/_temp/_runner_file_commands/set_env_3a5a031f-968a-478b-b431-02e86fc1bc96
GITHUB_EVENT_NAME=pull_request
GITHUB_EVENT_PATH=/home/runner/work/_temp/_github_workflow/event.json
GITHUB_GRAPHQL_URL=https://api.github.com/graphql
GITHUB_HEAD_REF=delay-literal-unescaping
GITHUB_JOB=pr
GITHUB_PATH=/home/runner/work/_temp/_runner_file_commands/add_path_3a5a031f-968a-478b-b431-02e86fc1bc96
GITHUB_REF=refs/pull/118699/merge
GITHUB_REF_NAME=118699/merge
GITHUB_REF_PROTECTED=false
---
    Checking tracing v0.1.40
    Checking ra-ap-rustc_abi v0.21.0
    Checking rustc-dependencies v0.0.0 (/checkout/src/tools/rust-analyzer/crates/rustc-dependencies)
    Checking parser v0.0.0 (/checkout/src/tools/rust-analyzer/crates/parser)
error[E0004]: non-exhaustive patterns: `EscapeError::NulInCStr` not covered
    |
343 |     match error {
343 |     match error {
    |           ^^^^^ pattern `EscapeError::NulInCStr` not covered
    |
note: `EscapeError` defined here
   --> /checkout/compiler/rustc_lexer/src/unescape.rs:16:1
16  | pub enum EscapeError {
    | ^^^^^^^^^^^^^^^^^^^^
...
63  |     NulInCStr,
63  |     NulInCStr,
    |     --------- not covered
    = note: the matched value is of type `EscapeError`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern or an explicit pattern as shown
    |
375 ~         EscapeError::MultipleSkippedLinesWarning => "",
376 ~         EscapeError::NulInCStr => todo!(),

   Compiling chalk-derive v0.95.0
    Checking rust-analyzer-salsa v0.17.0-pre.4
For more information about this error, try `rustc --explain E0004`.

@traviscross
Copy link
Contributor

@rustbot labels -I-lang-nominated

We discussed this in the T-lang triage meeting on 2023-12-20. There were parts of this that the members liked (e.g. allowing an arbitrary string of 9s), and there were parts of this the members did not (e.g. allowing '').

Our impression was that @nnethercote would prefer to go fully in one direction or the other here. In our discussion, we were somewhat open to and curious about the direction of going the other way and rejecting syntactically more of those things that are semantically invalid. @nnethercote, what are your thoughts on the desirability and feasibility of this?

Alternatively, if we were to go in the direction of relaxing the syntactic rules, we were interested in perhaps breaking these out. @nnethercote, is this something you'd consider doing, and if so, what categories might you propose?

Please renominate this for us with your answers.

@rustbot rustbot removed the I-lang-nominated Nominated for discussion during a lang team meeting. label Dec 21, 2023
@bors
Copy link
Contributor

bors commented Dec 22, 2023

☔ The latest upstream changes (presumably #119097) made this pull request unmergeable. Please resolve the merge conflicts.

@nnethercote
Copy link
Contributor Author

nnethercote commented Dec 22, 2023

Our impression was that @nnethercote would prefer to go fully in one direction or the other here. In our discussion, we were somewhat open to and curious about the direction of going the other way and rejecting syntactically more of those things that are semantically invalid. @nnethercote, what are your thoughts on the desirability and feasibility of this?

I'm going to switch terminology from syntactic/semantic to lex-time/post-expansion, because I think the definition of what's syntactic vs. semantic is open to interpretation, and what's important is when the check occurs, not which conceptual category each check belongs to.

In general, you can't move post-expansion checks to lex-time, because that could break code that currently compiles. The one exception is NulInCStr, because that hasn't yet stabilized. I am advocating for moving that earlier in #119172 and this Zulip thread, at least in the short term, for consistency with all the other byte/char/string literal checks. That needs to be done before stabilization occurs.

Moving a check from lex-time to post-expansion can be done at any time, because it's a backwards compatible change.

Alternatively, if we were to go in the direction of relaxing the syntactic rules, we were interested in perhaps breaking these out. @nnethercote, is this something you'd consider doing, and if so, what categories might you propose?

Do you mean break them into categories? I don't particularly want to do that, because it would encourage a conclusion where some checks are done at lex-time and some post-expansion. (The attraction of this PR is that it moves all the checks to post-expansion, for maximum consistency.) But if you put a gun to my head I'd look at this comment above and come up with:

  • Wrong length: ZeroChars, MoreThanOneChar
  • Disallowed char: BareCarriageReturn, BareCarriageReturnInRawString, NonAsciiCharInByte, EscapeOnlyChar
  • Invalid escape: all the others

@traviscross
Copy link
Contributor

traviscross commented Dec 27, 2023

@nnethercote:

In general, you can't move post-expansion checks to lex-time, because that could break code that currently compiles.

We do have an edition coming up. If we could do it, what things would we want to change to achieve consistency?

@traviscross traviscross added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Jan 6, 2024
@petrochenkov
Copy link
Contributor

petrochenkov commented Jan 12, 2024

There's a category of artificial errors, like "this can compile just right, but we don't want to accept it to avoid confusing human readers".
I think errors like BareCarriageReturn(InRawString) or some of EscapeOnlyChar are in this category.

If we want them reported at all, then we want them to be reported in lexer because human eyes look at code regardless of whether it's in main, sink! or #[cfg(FALSE)].

Edit: errors like NonAsciiCharInByte are in the same category as well, we can compile them just fine, except the shouldn't be errors at all, IMO (is rust-lang/rfcs#3349 implemented yet?).

@petrochenkov
Copy link
Contributor

petrochenkov commented Jan 12, 2024

I don't think moving all errors either to post-expansion, or to lexer is a goal.
Having both points gives us enough flexibility to both report human-oriented (#118699 (comment)) or future-proofing errors early, and to report "clearly semantic" errors at a later point.

The NulInCStr is in the "clearly semantic" category because it doesn't depend on how exactly the NUL is written lexically - \0, or \x00, or \u{0} - it's the lowered "semantic string content" that matters here.
It's different from e.g. bare carriage return errors, because those depend on how exactly the symbol is written lexically - \r - ok, \x0D - ok, bare - not ok.

@petrochenkov
Copy link
Contributor

There's also a question of when a "not a literal" turns into a "literal with escaping problems", which is relevant to cases like ''.
This is purely a question of error recovery strategy in the lexer, so perhaps the error reporting should also stay in the lexer for cases like this.

@petrochenkov
Copy link
Contributor

Basically, I'm currently on the position that what we have now is fine, but we should probably remove some artificial errors.

@petrochenkov petrochenkov added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 12, 2024
@nnethercote
Copy link
Contributor Author

nnethercote commented Jan 12, 2024

is rust-lang/rfcs#3349 implemented yet?).

I have a partial implementation of RFC3349. There is no PR yet. It's blocked on two things:

  • The decision on whether to delay literal unescaping checks or not (i.e. this PR).
  • Some issues with the RFC itself, as described here.

I know I wrote this PR, but I think I will close it. Why?

  • It's getting in the way of RFC3349.
  • This eager vs delayed checking discussion is prone to hair-splitting, e.g. about the exact meaning of "syntactic" vs "semantic", and what is or isn't "human-oriented". The kind of thing that can drag on and block useful progress.
  • I'm motivated by (a) what is useful to users, and (b) what simplifies the implementation. For (a), eager checking is working pretty well right now and I haven't seen any compelling reasons to delay checking so far. For (b), and doing all the checks early simplifies the implementation. (It's also easier to explain.)
  • If we later find a compelling reason to delay checking, we can do so.

@apiraino apiraino removed the S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). label Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-lang Relevant to the language team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants