-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make InvalidCharError
support UTF-8
#135
base: master
Are you sure you want to change the base?
Conversation
0b41da3
to
8369090
Compare
Currently the `InvalidCharError` error returns a `u8` if an invalid hex character is found. If that invalid character happened to be a `char` that encodes as UTF-8 using more than one byte (i.e. not ASCII) then this means the `u8` returned is not the same as the invalid character in the input string. Add support for invalid hex characters that fall outside of a single ASCII byte. Because there are so many edge cases, this is a best effort. I'm not claiming to have gotten this right for all input strings. We default to just returning the byte. Currently a 4-byte encoded character that falls on an odd position in the hex string is not correctly handled but this patch is definitely an improvement none the less.
8369090
to
ce28e9b
Compare
BTW I did not read up on the intricacies of UTF-8 encoding, I probably should have. |
Oh and I spent waaay to long on this today. Hope it was worth it. |
let hih = (hi as char).to_digit(16).ok_or((hi, true))?; | ||
let loh = (lo as char).to_digit(16).ok_or((lo, false))?; | ||
/// Returns the valid byte or a bool indicating error for `hi` or `lo`. | ||
fn hex_digits_to_byte(hi: u8, lo: u8) -> Result<u8, bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ce28e9b:
I find the word "digit" when used to apply to a number, when we are working with a character encoding that contains actual digits, is extremely confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amusingly I don't even know what hi
and lo
are from just looking at this diff and I worked on this code yesterday for hours. Maybe 'symbol' instead of 'digit' or perhaps hex_ascii_value_to_byte
but the perhaps to_byte
is actually better described by 'decode' because the two hex symbols encode a byte.
Perhaps: decode_hex_ascii_value_pair
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, ascii_value_pair
sounds good to me.
And yeah, hi
and lo
are not great... they are obviously the two nybbles of the byte, but are they ASCII values or numbers? Not easy to tell when we are using u8
for both.
let hi = byte >> 4; | ||
let lo = byte & 0x0f; | ||
|
||
let hi = char::from_digit(hi.into(), 16).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ce28e9b:
This function is so badly named that I consider it a bug in stdlib. We should just do this manually by indexing into a static array containing "0123456789abcdef"
. (I am not interested in filing bugs against stdlib to be told that "this is not Haskell" and I shouldn't complain that about words being used wrong. Let's just forget this function and move on with our lives.)
This would also have clearer semantics regarding capitalization, and to the extent there is any performance difference at all, likely to be faster.
} | ||
} | ||
|
||
#[cfg(feature = "std")] | ||
impl std::error::Error for InvalidCharError {} | ||
|
||
/// Invalid ASCII encoding of a hex digit. | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct InvalidDigitError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ce28e9b:
BTW, here you use "digit" correctly to refer to a numeral character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...though the structure is still a little confusing since it actually holds two digits. But that's fine.
|
||
impl InvalidDigitError { | ||
/// Returns the invalid byte. | ||
pub fn invalid_byte(&self) -> u8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ce28e9b:
Ok, here you have a error which holds two digits, which are supposed to comprise a byte but don't. You have a method called invalid_byte
which returns one of the two digits. I think this name is confusing. It should be called invalid_digit
.
rich_error.unwrap_or(InvalidCharError { | ||
invalid: if self.is_hi { self.hi.into() } else { self.lo.into() }, | ||
pos: self.pos, | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ce28e9b:
This logic is really confusing. It seems like if the original "bad digit" was an ASCII byte then that byte can just be promoted to char
and we're done. If not, we need to read more. So you could have an initial if self.invalid_byte() < 128 { return InvalidCharError { invalid: self.invalid_byte().into(), pos: self.pos() }
. (And ofc self.invalid_byte
should be renamed, see above.) So you can do an early return in this case and save yourself at least one level of nesting. (And also fix the bug here self.pos
is wrong and you actually meant to return self.pos()
.)
// Try to map the UTF-8 values in `v` to an invalid character. | ||
fn error(v: &[u8], pos: usize) -> Option<InvalidCharError> { | ||
core::str::from_utf8(v) | ||
.map(|s| InvalidCharError { invalid: s.chars().next().unwrap(), pos }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ce28e9b:
It seems like this function will return None
in the case that you give it bad UTF-8 (for example, a non-ASCII character immediately followed by an ASCII character, which never happens in UTF-8 because it would make it ambiguous how to scan backward from the ASCII character to the nearest character boundary.)
But then below you seem to interpret None
as "promote one of the bad digits in self
to a char
". Which, aside from being logically wrong, might not even result in a valid character.
Done first pass on ce28e9b |
Thanks for the review man. Seems this will not only take a lot of dev time but also a lot of review time - is it definitely worth doing? Or should we consider a bandaid solution like an error message that says "invalid character is |
I think it'd be totally fine to say "invalid character I wouldn't call it a "bandaid solution" to accurately describe the error condition without implementing a UTF-8 parser for the sake of providing more context (which might not even be valuable BTW because it assumes that the input came from a UTF-8 string that the user can see in a decoded form). |
For reference, I attempted something along these lines in #124 - we discussed and ended up closing the PR. |
Right -- I complained about exposing an API that returns But for purposes of implementing |
Currently the
InvalidCharError
error returns au8
if an invalid hex character is found. If that invalid character happened to be achar
that encodes as UTF-8 using more than one byte (i.e. not ASCII) then this means theu8
returned is not the same as the invalid character in the input string.Add support for invalid hex characters that fall outside of a single ASCII byte.
Because there are so many edge cases, this is a best effort. I'm not claiming to have gotten this right for all input strings. We default to just returning the byte.
Currently a 4-byte encoded character that falls on an odd position in the hex string is not correctly handled but this patch is definitely an improvement none the less.
Fix: #100