TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

unilynx · 2025-01-10T09:23:56Z

Version

v23.6.0, v22.13.0

Platform

Darwin xxx 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:00:32 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T6030 arm64

(but no problems reproducing it on Linux amd64 or arm64/)

Subsystem

No response

What steps will reproduce the bug?

const decoded = new TextDecoder("Windows-1252").decode(new Uint8Array([146])).charCodeAt(0);;
console[decoded === 8217 ? 'error' : 'log'](`Expected 8217 got ${decoded}`);

146 is now decoded as 146, not 8217. This still worked in v23.3.0 and v22.12.0 - might be related to the fix for #56219 ?

How often does it reproduce? Is there a required condition?

It always fails

What is the expected behavior? Why is that the expected behavior?

https://web.archive.org/web/20151027124421/https://msdn.microsoft.com/en-us/library/cc195054.aspx shows 0x92 should indeed be 0x2019

What do you see instead?

something else

Additional information

No response

The text was updated successfully, but these errors were encountered:

targos · 2025-01-10T09:35:56Z

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

blexrob · 2025-01-10T10:00:13Z

The problem is the latin1 decoder is used for the Windows-1252 codepage. Those encodings are not equivalent, especially for characters 0x80-0x9F. These are not in use for latin1 / ISO 8859-1, but defined for Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout).

The latin1 decoder just decodes a number to the unicode codepoint with the same number (eg. 0x65 is decoded to \u0065, and 0x92 to \u0092). But Windows-1252 differs from ISO 8859-1 in the 0x80-0x9F range, eg. 0x92 maps to \u2019.

Therefore, the latin1 decoder can't be used to decode Windows-1252.

I think the best way to keep the decoder optimization is to add the encoding `iso-8859-1, and all the mappings in lib/internal/encodings.js that currently map to windows-1252 should be mapped to that - except cp1252, windows-1252 and x-cp1252. But, I don't know if the current infrastructure supports that encoding though.

cp819 should be equivalent to iso-8859-1 (https://web.archive.org/web/20150531085023/http://www-01.ibm.com/software/globalization/cp/cp00819.html)

Current mappings to windows-1252 in encodings.js:

  ['ansi_x3.4-1968', 'windows-1252'],
  ['ascii', 'windows-1252'],
  ['cp1252', 'windows-1252'],
  ['cp819', 'windows-1252'],
  ['csisolatin1', 'windows-1252'],
  ['ibm819', 'windows-1252'],
  ['iso-8859-1', 'windows-1252'],
  ['iso-ir-100', 'windows-1252'],
  ['iso8859-1', 'windows-1252'],
  ['iso88591', 'windows-1252'],
  ['iso_8859-1', 'windows-1252'],
  ['iso_8859-1:1987', 'windows-1252'],
  ['l1', 'windows-1252'],
  ['latin1', 'windows-1252'],
  ['us-ascii', 'windows-1252'],
  ['windows-1252', 'windows-1252'],
  ['x-cp1252', 'windows-1252'],

blexrob · 2025-01-10T10:13:07Z

Using the { stream: true } option in decode() permanently disables the latin1 fast path for a TextDecoder, so the following workaround is possible:

const decoder = new TextDecodere(encoding);
if (["cp1252", "windows-1252", "x-cp1252"].includes(encoding.toLowerCase())) 
  decoder.decode(Buffer.from([]), { stream: true });
const decoded = decoder.decode(todecode);

domenic · 2025-01-10T10:33:27Z

TextEncoder in general has many bugs related to not matching the relevant web standard. See this issue I opened some time ago which hasn't gotten any response: #40091

joyeecheung · 2025-01-10T17:24:26Z

From a glance #40091 and this look like good first issues (or maybe more like good second issues?), marked it as such.

KunalKumar-1 · 2025-01-12T19:46:30Z

@joyeecheung where can i locate all windows-1252 mappings or the relevent files to correct it?

KunalKumar-1 · 2025-01-12T20:05:13Z

The problem is the latin1 decoder is used for the Windows-1252 codepage. Those encodings are not equivalent, especially for characters 0x80-0x9F. These are not in use for latin1 / ISO 8859-1, but defined for Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252#Codepage_layout).

The latin1 decoder just decodes a number to the unicode codepoint with the same number (eg. 0x65 is decoded to \u0065, and 0x92 to \u0092). But Windows-1252 differs from ISO 8859-1 in the 0x80-0x9F range, eg. 0x92 maps to \u2019.

Therefore, the latin1 decoder can't be used to decode Windows-1252.

I think the best way to keep the decoder optimization is to add the encoding `iso-8859-1, and all the mappings in lib/internal/encodings.js that currently map to windows-1252 should be mapped to that - except cp1252, windows-1252 and x-cp1252. But, I don't know if the current infrastructure supports that encoding though.

cp819 should be equivalent to iso-8859-1 (https://web.archive.org/web/20150531085023/http://www-01.ibm.com/software/globalization/cp/cp00819.html)

Current mappings to windows-1252 in encodings.js:
  ['ansi_x3.4-1968', 'windows-1252'],
  ['ascii', 'windows-1252'],
  ['cp1252', 'windows-1252'],
  ['cp819', 'windows-1252'],
  ['csisolatin1', 'windows-1252'],
  ['ibm819', 'windows-1252'],
  ['iso-8859-1', 'windows-1252'],
  ['iso-ir-100', 'windows-1252'],
  ['iso8859-1', 'windows-1252'],
  ['iso88591', 'windows-1252'],
  ['iso_8859-1', 'windows-1252'],
  ['iso_8859-1:1987', 'windows-1252'],
  ['l1', 'windows-1252'],
  ['latin1', 'windows-1252'],
  ['us-ascii', 'windows-1252'],
  ['windows-1252', 'windows-1252'],
  ['x-cp1252', 'windows-1252'],

windows-1252 should adhere the standard mentioned in here ?
https://encoding.spec.whatwg.org/#windows-1252

blexrob · 2025-01-13T08:02:50Z

@KunalKumar-1 that's right, I hadn't read the standard yet when I wrote that comment. To adhere to the standard, a correct Windows-1252 decoder should be used in all those cases. Using the latin1 decoder from simdutf will result in incorrect decoding when characters in the 0x80-0x9F range are present.

anonrig · 2025-01-15T20:32:20Z

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

According to MDN, latin1 is as same as windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings. Is this incorrect?

unilynx · 2025-01-15T21:04:21Z

Indeed, windows-1252 is not exactly the same as latin-1. How is it that the DecodeLatin1 is called for that encoding? @anonrig

According to MDN, latin1 is as same as windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings. Is this incorrect?

Windows-1252 is a superset of Latin1/ISO-8859-1, just like it's a superset of ascii, which may explain the MDN Page.

Or: If you assume a text contains no invalid characters, you can run ASCII and Latin1/ISO-8859-1 through a Windows-1252 decoder to get Unicode

domenic · 2025-01-16T03:43:01Z

According to MDN, latin1 is as same as windows-1252: https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings. Is this incorrect?

It is correct on the web.

On the web, all character encodings produce answers for all input byte values.

In older RFCs and ISO standards documents, like the one that was used to originally specify latin1, it was acceptable to just not specify what happened for certain bytes. ISO/IEC 8859-1 in particular did not specify what happened for bytes 0x00-1xFF or bytes 0x7F-0x9F.

When web browsers started implementing this, they needed to decide what to do if a document contained a byte with value 0x82 or something, while declaring itself latin1. That was undefined behavior according to the latin1 spec. They chose to interpret such bytes as windows-1252 does. So on the web, if you specify latin1 as your encoding, you get the same behavior as windows-1252. This was formalized in the Encoding Standard, so that for any software that adheres to the Encoding Standard, "latin1" is a label for the "windows-1252" encoding. (Other such labels are ascii, ibm819, iso88591, etc.)

This is why MDN documents windows-1252 = latin1, because that is true on the web and in any software adhering to the Encoding Standard.

Where this gets confusing is that some code that uses latin1 in the name of functions (e.g. DecodeLatin1) does not adhere to the Encoding Standard. Instead they adhere to the original standard that defines latin1 (ISO/IEC 8859-1), and then make a different choice for the undefined behavior for bytes 0x00-1xFF or bytes 0x7F-0x9F.

So, if you are trying to implement a web standards-compatible latin1 decoder, you need to be very careful about what functions named "latin1" you use from various libraries. Many are not Encoding Standard-compatible. Instead, it is best to choose functions named "windows1252".

joyeecheung added the good first issue Issues that are suitable for first-time contributors. label Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

unilynx commented Jan 10, 2025

targos commented Jan 10, 2025

blexrob commented Jan 10, 2025 •

edited

Loading

blexrob commented Jan 10, 2025

domenic commented Jan 10, 2025

joyeecheung commented Jan 10, 2025

KunalKumar-1 commented Jan 12, 2025 •

edited

Loading

KunalKumar-1 commented Jan 12, 2025

blexrob commented Jan 13, 2025

anonrig commented Jan 15, 2025

unilynx commented Jan 15, 2025

domenic commented Jan 16, 2025 •

edited

Loading

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

TextDecoder incorrectly decodes 0x92 for Windows-1252 #56542

Comments

unilynx commented Jan 10, 2025

Version

Platform

Subsystem

What steps will reproduce the bug?

How often does it reproduce? Is there a required condition?

What is the expected behavior? Why is that the expected behavior?

What do you see instead?

Additional information

targos commented Jan 10, 2025

blexrob commented Jan 10, 2025 • edited Loading

blexrob commented Jan 10, 2025

domenic commented Jan 10, 2025

joyeecheung commented Jan 10, 2025

KunalKumar-1 commented Jan 12, 2025 • edited Loading

KunalKumar-1 commented Jan 12, 2025

blexrob commented Jan 13, 2025

anonrig commented Jan 15, 2025

unilynx commented Jan 15, 2025

domenic commented Jan 16, 2025 • edited Loading

blexrob commented Jan 10, 2025 •

edited

Loading

KunalKumar-1 commented Jan 12, 2025 •

edited

Loading

domenic commented Jan 16, 2025 •

edited

Loading