TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542
Comments
Indeed, windows-1252 is not exactly the same as latin-1. How is it that the |
The problem is that the latin1 decoder is used for the Windows-1252 code page. Those encodings are not equivalent, especially for bytes 0x80-0x9F. These are not in use for latin1 / ISO 8859-1, but are defined for Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout). The latin1 decoder just decodes each byte to the Unicode code point with the same number (e.g. 0x65 is decoded to U+0065). Therefore, the latin1 decoder can't be used to decode Windows-1252.
I think the best way to keep the decoder optimization is to add the encoding `iso-8859-1`, and to remap all the labels in lib/internal/encodings.js that currently map to windows-1252 to it, except cp1252, windows-1252 and x-cp1252. I don't know whether the current infrastructure supports that encoding, though. cp819 should be equivalent to iso-8859-1 (https://web.archive.org/web/20150531085023/http://www-01.ibm.com/software/globalization/cp/cp00819.html).
Current mappings to windows-1252 in encodings.js:
|
Using the
|
TextEncoder in general has many bugs related to not matching the relevant web standard. See this issue I opened some time ago which hasn't gotten any response: #40091 |
At a glance, #40091 and this one look like good first issues (or maybe more like good second issues?), so I've marked them as such. |
@joyeecheung where can i locate all |
|
@KunalKumar-1 that's right, I hadn't read the standard yet when I wrote that comment. To adhere to the standard, a correct Windows-1252 decoder should be used in all those cases. Using the latin1 decoder from simdutf will result in incorrect decoding when characters in the 0x80-0x9F range are present. |
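To illustrate why a latin1-style decoder goes wrong in that range, here is a minimal sketch that contrasts the two behaviors. The helper names `decodeLatin1` and `decodeWindows1252` and the table `WINDOWS_1252_HIGH` are hypothetical and written for this example only; they are not Node.js or simdutf APIs. The table is assumed to follow the windows-1252 index of the Encoding Standard, in which the five unassigned bytes map to the C1 control with the same number.

```javascript
// Code points for bytes 0x80..0x9F under windows-1252 (assumed to follow
// the Encoding Standard index; unassigned bytes map to themselves).
const WINDOWS_1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Latin1-style decoding: each byte becomes the code point with the same number.
function decodeLatin1(bytes) {
  let out = '';
  for (const b of bytes) out += String.fromCharCode(b);
  return out;
}

// Windows-1252 decoding: identical, except bytes 0x80..0x9F go through the table.
function decodeWindows1252(bytes) {
  let out = '';
  for (const b of bytes) {
    const cp = b >= 0x80 && b <= 0x9f ? WINDOWS_1252_HIGH[b - 0x80] : b;
    out += String.fromCharCode(cp);
  }
  return out;
}

const input = Uint8Array.of(0x65, 0x92, 0xe9);
console.log([...decodeLatin1(input)].map((c) => c.codePointAt(0)));      // [ 101, 146, 233 ]
console.log([...decodeWindows1252(input)].map((c) => c.codePointAt(0))); // [ 101, 8217, 233 ]
```

Only the middle byte differs between the two outputs, which is why the bug goes unnoticed for text that stays outside 0x80-0x9F.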
According to MDN, latin1 is the same as windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings. Is this incorrect? |
Windows-1252 is a superset of Latin1/ISO-8859-1, just like it's a superset of ASCII, which may explain the MDN page. Or: if you assume a text contains no invalid characters, you can run ASCII and Latin1/ISO-8859-1 through a Windows-1252 decoder to get Unicode. |
It is correct on the web. On the web, all character encodings produce answers for all input byte values. In older RFCs and ISO standards documents, like the one originally used to specify latin1, it was acceptable to leave the behavior for certain bytes unspecified. ISO/IEC 8859-1 in particular did not specify what happened for bytes 0x00-0x1F or bytes 0x7F-0x9F.
When web browsers started implementing this, they needed to decide what to do if a document declared itself latin1 but contained a byte such as 0x82. That was undefined behavior according to the latin1 spec. They chose to interpret such bytes as windows-1252 does. So on the web, if you specify latin1 as your encoding, you get the same behavior as windows-1252.
This was formalized in the Encoding Standard, so that for any software that adheres to the Encoding Standard, "latin1" is a label for the "windows-1252" encoding. (Other such labels are
This is why MDN documents windows-1252 = latin1, because that is true on the web and in any software adhering to the Encoding Standard.
Where this gets confusing is that some code that uses latin1 in the name of functions (e.g.
So, if you are trying to implement a web standards-compatible latin1 decoder, you need to be very careful about which functions named "latin1" you use from various libraries. Many are not Encoding Standard-compatible. Instead, it is best to choose functions named "windows1252". |
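The distinction between the label and the function name can be checked directly in Node.js. The following is a small sketch, assuming a Node.js build with ICU so that `TextDecoder` accepts the `latin1` label. The choice of `Buffer`'s `'latin1'` encoding as the contrasting example is mine, not necessarily the one the comment above had in mind.

```javascript
// Per the Encoding Standard, "latin1" is only a label; the decoder reports
// the name of the encoding that the label resolves to.
const labelDecoder = new TextDecoder('latin1');
console.log(labelDecoder.encoding); // "windows-1252"

// Node's Buffer 'latin1' encoding is a plain byte-to-code-point mapping,
// so it is not the Encoding Standard's "latin1".
const viaBuffer = Buffer.from([0x92]).toString('latin1');
console.log(viaBuffer.codePointAt(0)); // 146
```

So the same word, "latin1", names a windows-1252 decoder in one API and an identity mapping in another.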
Version
v23.6.0, v22.13.0
Platform
Subsystem
No response
What steps will reproduce the bug?
Byte 146 (0x92) is now decoded as code point 146, not 8217 (U+2019). This still worked in v23.3.0 and v22.12.0, so it might be related to the fix for #56219.
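The original snippet is not preserved here, so the following is only a sketch of what such a reproduction might look like, assuming the standard `TextDecoder` global. The variable names are illustrative.

```javascript
// Decode the single byte 0x92 (146) using the windows-1252 label.
const reproBytes = new Uint8Array([0x92]);
const reproText = new TextDecoder('windows-1252').decode(reproBytes);

// Per the report: 8217 (U+2019) is expected; affected versions give 146.
console.log(reproText.codePointAt(0));
```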
How often does it reproduce? Is there a required condition?
It always fails
What is the expected behavior? Why is that the expected behavior?
https://web.archive.org/web/20151027124421/https://msdn.microsoft.com/en-us/library/cc195054.aspx shows that 0x92 should indeed be U+2019.
What do you see instead?
something else
Additional information
No response