-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove language-level UB for non-UTF-8 str #71033
Comments
Would this mean that the following stops being UB? use std::mem;
fn main() {
// Being valid utf8 is still a safety invariant of `str`.
// As any method using `str` may depend on this invariant,
// it would still be UB to use `str::from_utf8_unchecked`
// or `str::as_bytes` here.
let s: &str = unsafe { std::mem::transmute(b"\xff\xff" as &[u8]) };
let bytes: &[u8] = unsafe { std::mem::transmute(s) };
assert_eq!(bytes, &[0xff, 0xff][..]);
} |
@lcnr yes exactly.
To be more precise, it would be library UB -- the way the methods are implemented right now, there is actually no language-level UB and nothing that Miri could possibly find, but the library is permitted to change in the future in ways that would make this language UB. |
Dear community and language team. As Ralf notes, there are no compelling reasons to keep this Undefined Behavior (UB) at the level of the abstract machine in terms of the validity invariant of Instead, we can make this a library invariant, and leave it as "library UB" or "unspecified behavior". Indeed, this is probably what we always meant by the note in the reference. I hereby propose that we accept that new definition: @rfcbot merge |
Team member @Centril has proposed to merge this. The next step is review by the rest of the tagged team members: No concerns currently listed. Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
This leads to an interesting question: what can miri find? It would be great to figure out if there is something like this that miri does detect, even if miri stops checking |
Note that Miri does not check behind references, so while |
I scrolled over
|
This seems entirely reasonable to me. If you never call any of I wonder if we might be able, in the future, to carefully exclude a few of cc @BurntSushi: Would this change potentially simplify |
Yes I think that is definitely possible. It is somewhat similar to, for example, how we promise that |
Shouldn't this be an RFC, rather than just an issue on the Rust tracker? |
The RFC format feels excessive as a matter of "filling in the forms" (I don't know that it would add anything). The design choices here aren't too complicated and it doesn't feel like there's any opposition thus far to the change here, given that LLVM et. al cannot exploit this. The FCP should provide the same chance of community input here I feel given the usual coverage in TWiR etc. |
I think the main point is that all the library's routines are available as public routines to be invoked on any |
I think this change makes a lot of sense. I've always kind of wondered why Also, are there any changes the
Possibly, but it's tricky business. For example, one might reasonably exclude substring searching from UB if neither the needle nor the haystack were valid UTF-8. Then bstr would still need to roll UTF-8 decoding to implement its own version of |
This change sounds reasonable to me from the point of view of how the language is defined. I don’t expect it will have much or any practical impact for users of the language. |
I assume this would also be a stepping-stone to allowing
|
Hmm, actually, we talked about this in the lang meeting today and Josh checked and UTF-8 encoding units do have a niche that However, this proposed change would prevent us from doing that in future, as I'd be allowed to make a slice full of I do strongly agree that the cross-byte UTF-8 well-formedness should not be in the validity invariant. |
@scottmcm Hmm, so I haven't tried it yet but this should work today: #[rustc_layout_scalar_valid_range_end(0xf7)]
struct Utf8EncodingUnit(u8); I'm not sure if we can take advantage of this in any way, but it might be possible to force miri into validating it without any special-casing (just based on the validity range). |
Making even that guarantee would presumably make any use of EDIT: I guess it would fix the second example in the docs to be non-UB, but not being able to pass the |
Miri validates |
Can that niche ever be useful? |
Correct. Those are
(For historical anecdote: Unicode is limited to U+0000..=U+10FFFF because of UTF-16, but the original design of UTF-8 supported up to U+7FFFFFFF (31 bits) with leading bytes |
@SimonSapin That's not the only thing valid ranges do, we also expose them to LLVM. |
I suppose LLVM could then eliminate a branch like |
Actually, @RalfJung, I realize I'm a bit uncertain about how validity invariants interact with borrows -- if I have a local variable that |
Perhaps better to move that off line =) |
@nikomatsakis I think your question is tracked at rust-lang/unsafe-code-guidelines#84 :) |
🔔 This is now entering its final comment period, as per the review above. 🔔 |
Would be great to make extra sure people don't mistake those two definitions of UB and don't start recklessly working with non-utf-8 data in string thinking it's not UB anymore. |
A minor point regarding terminology: Aren't these "UTF8 encoding units" called (UTF-8) code units in established Unicode parlance? Which, incidentally, is probably an argument in favor of adding a named type as it would simply reify an existing concept rather than introducing an ad-hoc one! |
EDIT: Skip my post and read the next one
|
Yes, “code unit” is established in Unicode but the code unit for UTF-8 is The formal definitions for this start at https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf#G7404 |
FCP was originally requested for changing the validity invariant of #[rustc_layout_scalar_valid_range_end(0xf7)]
struct Utf8Byte(u8); (the name of that type is still up for bikeshedding) I am a bit confused now about what this FCP is actually deciding, if/when it completes. |
Mmm, then maybe it makes sense to also remove language-level UB for non Unicode scalar |
Rust does exploit the limited values that can be stored in a Here, returning |
@RalfJung I think there’s consensus that such a definition for Process-wise, my understanding is that an FCP finishing means accepting a proposal as it was when that FCP was proposed. If new information or consensus emerges later (especially if it’s after some votes) and a team member feels the original proposal should not be accepted, they should file a concern or cancel the FCP. (And potentially propose a new FCP for a different proposal.) |
If there's no potential optimizations to apply it seems better to completely remove the validity invariant as proposed, to allow temporarily using the backing buffer from Alternatively, exposing |
I agree with @SimonSapin. I would not bother with the "encoding units" myself, which seems like an extra layer of complexity that doesn't add much of practical value. |
The final comment period, with a disposition to merge, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. The RFC will be merged soon. |
So the FCP that passed means the decision was that |
I believe so. |
This is the Rust-side issue for rust-lang/reference#792 just so that we can use fcpbot. The change description follows.
Ever since Rust 1.0, the reference said that a non-UTF-8
str
causes immediate UB. In terms of today's terminology, that means thatstr
has a validity invariant of being valid UTF-8.However, that seems unnecessary: the compiler does not actually exploit this, nor is there any clear way it could exploit this. Making UTF-8 a library-level safety invariant is more than enough for everything
str
does. Most likely, it was made a validity invariant because we had not yet properly teased apart those two concepts when the document was initially written.This is also the conclusion that the UCG WG arrived at in rust-lang/unsafe-code-guidelines#78.
I therefore propose we remove the UTF-8 clause from the language spec, so that
str
will have the same validity invariant as[u8]
.The text was updated successfully, but these errors were encountered: