Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

Open
unilynx opened this issue Jan 10, 2025 · 11 comments
Open

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

unilynx opened this issue Jan 10, 2025 · 11 comments
Labels
confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. good first issue Issues that are suitable for first-time contributors. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. v23.x v23.x Issues that can be reproduced on v23.x or PRs targeting the v23.x-staging branch.

Comments

@unilynx
Copy link

unilynx commented Jan 10, 2025

Version

v23.6.0, v22.13.0

Platform

Darwin xxx 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:00:32 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6030 arm64

(but no problems reproducing it on Linux amd64 or arm64/)

Subsystem

No response

What steps will reproduce the bug?

const decoded = new TextDecoder("Windows-1252").decode(new Uint8Array([146])).charCodeAt(0);;
console[decoded === 8217 ? 'error' : 'log'](`Expected 8217 got ${decoded}`);

146 is now decoded as 146, not 8217. This still worked in v23.3.0 and v22.12.0 - might be related to the fix for #56219 ?

How often does it reproduce? Is there a required condition?

It always fails

What is the expected behavior? Why is that the expected behavior?

https://web.archive.org/web/20151027124421/https://msdn.microsoft.com/en-us/library/cc195054.aspx shows 0x92 should indeed be 0x2019

What do you see instead?

something else

Additional information

No response

@targos
Copy link
Member

targos commented Jan 10, 2025

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

@targos targos added confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. v23.x v23.x Issues that can be reproduced on v23.x or PRs targeting the v23.x-staging branch. labels Jan 10, 2025
@blexrob
Copy link

blexrob commented Jan 10, 2025

The problem is the latin1 decoder is used for the Windows-1252 codepage. Those encodings are not equivalent, especially for characters 0x80-0x9F. These are not in use for latin1 / ISO 8859-1, but defined for Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout).

The latin1 decoder just decodes a number to the unicode codepoint with the same number (eg. 0x65 is decoded to \u0065, and 0x92 to \u0092). But Windows-1252 differs from ISO 8859-1 in the 0x80-0x9F range, eg. 0x92 maps to \u2019.

Therefore, the latin1 decoder can't be used to decode Windows-1252.

I think the best way to keep the decoder optimization is to add the encoding `iso-8859-1, and all the mappings in lib/internal/encodings.js that currently map to windows-1252 should be mapped to that - except cp1252, windows-1252 and x-cp1252. But, I don't know if the current infrastructure supports that encoding though.

cp819 should be equivalent to iso-8859-1 (https://web.archive.org/web/20150531085023/http://www-01.ibm.com/software/globalization/cp/cp00819.html)

Current mappings to windows-1252 in encodings.js:

  ['ansi_x3.4-1968', 'windows-1252'],
  ['ascii', 'windows-1252'],
  ['cp1252', 'windows-1252'],
  ['cp819', 'windows-1252'],
  ['csisolatin1', 'windows-1252'],
  ['ibm819', 'windows-1252'],
  ['iso-8859-1', 'windows-1252'],
  ['iso-ir-100', 'windows-1252'],
  ['iso8859-1', 'windows-1252'],
  ['iso88591', 'windows-1252'],
  ['iso_8859-1', 'windows-1252'],
  ['iso_8859-1:1987', 'windows-1252'],
  ['l1', 'windows-1252'],
  ['latin1', 'windows-1252'],
  ['us-ascii', 'windows-1252'],
  ['windows-1252', 'windows-1252'],
  ['x-cp1252', 'windows-1252'],

@blexrob
Copy link

blexrob commented Jan 10, 2025

Using the { stream: true } option in decode() permanently disables the latin1 fast path for a TextDecoder, so the following workaround is possible:

const decoder = new TextDecodere(encoding);
if (["cp1252", "windows-1252", "x-cp1252"].includes(encoding.toLowerCase())) 
  decoder.decode(Buffer.from([]), { stream: true });
const decoded = decoder.decode(todecode);

@domenic
Copy link
Contributor

domenic commented Jan 10, 2025

TextEncoder in general has many bugs related to not matching the relevant web standard. See this issue I opened some time ago which hasn't gotten any response: #40091

@joyeecheung joyeecheung added the good first issue Issues that are suitable for first-time contributors. label Jan 10, 2025
@joyeecheung
Copy link
Member

From a glance #40091 and this look like good first issues (or maybe more like good second issues?), marked it as such.

@KunalKumar-1
Copy link
Contributor

KunalKumar-1 commented Jan 12, 2025

@joyeecheung where can i locate all windows-1252 mappings or the relevent files to correct it?

@KunalKumar-1
Copy link
Contributor

The problem is the latin1 decoder is used for the Windows-1252 codepage. Those encodings are not equivalent, especially for characters 0x80-0x9F. These are not in use for latin1 / ISO 8859-1, but defined for Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout).

The latin1 decoder just decodes a number to the unicode codepoint with the same number (eg. 0x65 is decoded to \u0065, and 0x92 to \u0092). But Windows-1252 differs from ISO 8859-1 in the 0x80-0x9F range, eg. 0x92 maps to \u2019.

Therefore, the latin1 decoder can't be used to decode Windows-1252.

I think the best way to keep the decoder optimization is to add the encoding `iso-8859-1, and all the mappings in lib/internal/encodings.js that currently map to windows-1252 should be mapped to that - except cp1252, windows-1252 and x-cp1252. But, I don't know if the current infrastructure supports that encoding though.

cp819 should be equivalent to iso-8859-1 (https://web.archive.org/web/20150531085023/http://www-01.ibm.com/software/globalization/cp/cp00819.html)

Current mappings to windows-1252 in encodings.js:

  ['ansi_x3.4-1968', 'windows-1252'],
  ['ascii', 'windows-1252'],
  ['cp1252', 'windows-1252'],
  ['cp819', 'windows-1252'],
  ['csisolatin1', 'windows-1252'],
  ['ibm819', 'windows-1252'],
  ['iso-8859-1', 'windows-1252'],
  ['iso-ir-100', 'windows-1252'],
  ['iso8859-1', 'windows-1252'],
  ['iso88591', 'windows-1252'],
  ['iso_8859-1', 'windows-1252'],
  ['iso_8859-1:1987', 'windows-1252'],
  ['l1', 'windows-1252'],
  ['latin1', 'windows-1252'],
  ['us-ascii', 'windows-1252'],
  ['windows-1252', 'windows-1252'],
  ['x-cp1252', 'windows-1252'],

windows-1252 should adhere the standard mentioned in here ?
https://encoding.spec.whatwg.org/#windows-1252

@blexrob
Copy link

blexrob commented Jan 13, 2025

@KunalKumar-1 that's right, I hadn't read the standard yet when I wrote that comment. To adhere to the standard, a correct Windows-1252 decoder should be used in all those cases. Using the latin1 decoder from simdutf will result in incorrect decoding when characters in the 0x80-0x9F range are present.

@anonrig
Copy link
Member

anonrig commented Jan 15, 2025

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

According to MDN, latin1 is as same as windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings. Is this incorrect?

@unilynx
Copy link
Author

unilynx commented Jan 15, 2025

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

According to MDN, latin1 is as same as windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings. Is this incorrect?

Windows-1252 is a superset of Latin1/ISO-8859-1, just like it's a superset of ascii, which may explain the MDN Page.

Or: If you assume a text contains no invalid characters, you can run ASCII and Latin1/ISO-8859-1 through a Windows-1252 decoder to get Unicode

@domenic
Copy link
Contributor

domenic commented Jan 16, 2025

According to MDN, latin1 is as same as windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings. Is this incorrect?

It is correct on the web.

On the web, all character encodings produce answers for all input byte values.

In older RFCs and ISO standards documents, like the one that was used to originally specify latin1, it was acceptable to just not specify what happened for certain bytes. ISO/IEC 8859-1 in particular did not specify what happened for bytes 0x00-1xFF or bytes 0x7F-0x9F.

When web browsers started implementing this, they needed to decide what to do if a document contained a byte with value 0x82 or something, while declaring itself latin1. That was undefined behavior according to the latin1 spec. They chose to interpret such bytes as windows-1252 does. So on the web, if you specify latin1 as your encoding, you get the same behavior as windows-1252. This was formalized in the Encoding Standard, so that for any software that adheres to the Encoding Standard, "latin1" is a label for the "windows-1252" encoding. (Other such labels are ascii, ibm819, iso88591, etc.)

This is why MDN documents windows-1252 = latin1, because that is true on the web and in any software adhering to the Encoding Standard.

Where this gets confusing is that some code that uses latin1 in the name of functions (e.g. DecodeLatin1) does not adhere to the Encoding Standard. Instead they adhere to the original standard that defines latin1 (ISO/IEC 8859-1), and then make a different choice for the undefined behavior for bytes 0x00-1xFF or bytes 0x7F-0x9F.

So, if you are trying to implement a web standards-compatible latin1 decoder, you need to be very careful about what functions named "latin1" you use from various libraries. Many are not Encoding Standard-compatible. Instead, it is best to choose functions named "windows1252".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed-bug Issues with confirmed bugs. encoding Issues and PRs related to the TextEncoder and TextDecoder APIs. good first issue Issues that are suitable for first-time contributors. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. v23.x v23.x Issues that can be reproduced on v23.x or PRs targeting the v23.x-staging branch.
Projects
None yet
Development

No branches or pull requests

7 participants