remove BOM? #432

Closed
Pomax opened this issue Mar 29, 2018 · 13 comments · Fixed by #463

Comments

@Pomax

Pomax commented Mar 29, 2018

https://github.com/eligrey/FileSaver.js/blob/master/src/FileSaver.js#L69 notes that the auto_bom function "prepend[s] BOM for UTF-8 XML and text/* types (including HTML)", but UTF8-encoded documents don't need a byte ordering mark, since UTF8 does not consist of "ordered bytes" like UTF16/32, but is a byte-aligned bit sequence instead, with the same ordering on all systems.
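
To make the byte-order point concrete, here is a minimal sketch that encodes the same character both ways. `utf16Bytes` is a hypothetical helper written only for this illustration; `TextEncoder` and `DataView` are standard.

```javascript
// UTF-8 is defined as a byte sequence, so the output is the same on every platform.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00e9")); // "é" -> C3 A9

// UTF-16 is defined in 16-bit code units, so serializing it means picking a byte order.
// utf16Bytes is a hypothetical helper, not part of any library.
function utf16Bytes(str, littleEndian) {
  const view = new DataView(new ArrayBuffer(str.length * 2));
  for (let i = 0; i < str.length; i++) {
    view.setUint16(i * 2, str.charCodeAt(i), littleEndian);
  }
  return Array.from(new Uint8Array(view.buffer));
}

const utf16le = utf16Bytes("\u00e9", true);  // E9 00
const utf16be = utf16Bytes("\u00e9", false); // 00 E9
```

The two UTF-16 results differ only in byte order, which is the ambiguity a byte order mark resolves. The UTF-8 result has no such variant.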

@wadjeroudi

True, there shouldn't be a BOM for utf-8 documents, but that's not a big deal because you can set the flag to no_autobom when saving.

saveAs(blob, 'file.txt', true);
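
As a rough sketch of what that third argument toggles, the following works on raw bytes rather than a Blob so it can run anywhere. `wantsBom` and `prepareBytes` are hypothetical names, and the pattern is only an approximation of the "UTF-8 XML and text/* types" rule quoted in the opening comment, not a copy of the library's code.

```javascript
const UTF8_BOM = [0xef, 0xbb, 0xbf];

// Approximation of the "UTF-8 XML and text/* types" rule; the exact pattern is an assumption.
function wantsBom(mimeType) {
  return /^\s*(?:text\/\S*|application\/xml|\S*\/\S*\+xml)\s*;.*charset\s*=\s*utf-8/i.test(mimeType);
}

// Hypothetical stand-in for the step saveAs performs before writing the file.
function prepareBytes(bytes, mimeType, noAutoBom = false) {
  if (noAutoBom || !wantsBom(mimeType)) return Uint8Array.from(bytes);
  return Uint8Array.from([...UTF8_BOM, ...bytes]);
}

const text = new TextEncoder().encode("hello");
const byDefault = prepareBytes(text, "text/plain;charset=utf-8");      // BOM prepended
const optedOut = prepareBytes(text, "text/plain;charset=utf-8", true); // bytes unchanged
const binary = prepareBytes(text, "application/octet-stream");         // bytes unchanged
```

Under these assumptions, callers who never pass the third argument get three extra bytes on every matching text type.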

@Pomax
Author

Pomax commented May 10, 2018

flags should be reserved for overriding the expected default behaviour, so no BOM should be written unless explicitly told to do so, as per Unicode's recommendation.

@jdhines

jdhines commented Jun 11, 2018

Just ran into this where we're designing a replacement system that creates a text file for a downstream process, and now that process is bombing on the file due to it reading it as utf-8-bom instead of utf-8.
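
Where the downstream consumer can be changed, one possible defence is to drop a leading BOM before parsing. This is a minimal sketch; `stripUtf8Bom` is a hypothetical helper, and it assumes the consumer reads the file as raw bytes.

```javascript
// Hypothetical helper: returns the input without a leading EF BB BF, if present.
function stripUtf8Bom(bytes) {
  const hasBom =
    bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
  return hasBom ? bytes.subarray(3) : bytes;
}

const withBom = Uint8Array.from([0xef, 0xbb, 0xbf, 0x68, 0x69]); // BOM + "hi"
const clean = stripUtf8Bom(withBom);
const decoded = new TextDecoder("utf-8").decode(clean);
```

This only helps when the consumer is under your control; it does not change what gets written to disk.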

@mvasilkov

I second this, adding BOM should be opt-in.

@jimmywarting
Collaborator

> as per Unicode's recommendation

On what grounds? Source?

@Pomax
Author

Pomax commented Sep 21, 2018

http://www.unicode.org/versions/Unicode10.0.0/ch02.pdf page 40,

> [...] Use of a BOM is neither required nor recommended for UTF-8, but may be encountered [...]

A statement that has been in effect since 2003 with the introduction of Unicode 4.0 (http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf, p. 33), because that's when UTF8 was brought in line with UTF16 through https://tools.ietf.org/html/rfc3629, and became an official encoding scheme for Unicode data.

http://unicode.org/faq/utf_bom.html#utf8-3 gives the less formal explanation that the Byte Order Mark has no meaning for UTF8 because UTF8 isn't a byte-ordered encoding. Systems that read/write Unicode content using the UTF8 scheme must all do so in the exact same way, irrespective of their Endian-ness. But of course, the FAQ is not the authority, the spec is.

As a BOM for UTF-8 is formally neither required nor recommended, writing one by default is essentially a bug because it contravenes the spec. Thankfully, it's an easy-to-fix bug, too: flip the default, spin a new major release for that single change (because it's a breaking change) and everyone wins.

@eligrey
Owner

eligrey commented Sep 21, 2018

It's for browser charset sniffing because no browsers ever implemented support for the charset mime parameter in blobs, so all text/plain;charset=UTF-8 blobs are saved as ASCII by the browser without a BOM on Windows.

If a blob being saved loads as a new tab instead of a download, it will not display properly unless the charset is sniffed through the BOM.

@eligrey
Owner

eligrey commented Sep 21, 2018

> As a BOM for UTF-8 is formally neither required nor recommended, writing one by default is essentially a bug because it contravenes the spec

@Pomax The problem here is that this isn't a BOM for UTF-8, it's a UTF-8 BOM for ASCII→UTF-8 coalescation as the charset parameter is ignored in Windows.

Try the following code:

location.href = URL.createObjectURL(new Blob(["①"], {type:"text/plain;charset=UTF-8"}));

The auto-BOM code is a workaround for an OS bug.
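
The sniffing behaviour described here can be sketched roughly as follows. `sniffCharset` and `decodeSniffed` are hypothetical, and Latin-1 is used as a simple stand-in for whichever legacy charset the platform would actually fall back to; real browsers use more elaborate rules.

```javascript
// Hypothetical sniffing rule: trust a UTF-8 BOM, otherwise assume a legacy single-byte charset.
function sniffCharset(bytes) {
  const hasBom = bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
  return hasBom ? "utf-8" : "legacy";
}

// Decode according to the sniffed charset. Latin-1 stands in for the legacy charset (assumption).
function decodeSniffed(bytes) {
  if (sniffCharset(bytes) === "utf-8") {
    return new TextDecoder("utf-8").decode(bytes); // TextDecoder drops the BOM by default
  }
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

const payload = new TextEncoder().encode("\u2460"); // "①" -> E2 91 A0
const plain = decodeSniffed(payload); // each byte read as its own character
const marked = decodeSniffed(Uint8Array.from([0xef, 0xbb, 0xbf, ...payload])); // read as UTF-8
```

Without the BOM the three UTF-8 bytes are read as three separate characters, which is the garbled display described above. With the BOM the text round-trips.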

@eligrey
Owner

eligrey commented Sep 21, 2018

I should probably change the behavior to only apply this mitigation on Windows user agents, and provide a global config option to disable the behavior unless opted-in.
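
One possible reading of that proposal, as a minimal sketch: the mitigation applies only when it has been explicitly enabled and the user agent looks like Windows. The `config` object, its `autoBom` field, and `shouldAutoBom` are all hypothetical names, not the library's API.

```javascript
// Hypothetical global config; the name and shape are assumptions.
const config = { autoBom: false };

// Hypothetical check: apply the BOM mitigation only when enabled and the UA looks like Windows.
function shouldAutoBom(userAgent, cfg = config) {
  return cfg.autoBom === true && /Windows/i.test(userAgent);
}

const winUA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
const macUA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)";

const offByDefault = shouldAutoBom(winUA);                    // disabled unless opted in
const optedInWindows = shouldAutoBom(winUA, { autoBom: true });
const optedInMac = shouldAutoBom(macUA, { autoBom: true });   // non-Windows UA, no mitigation
```

User-agent matching is brittle, so this is only an illustration of the decision, not a recommended detection method.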

@Pomax
Author

Pomax commented Sep 21, 2018

I don't see auto_bom() used outside of saveAs, in which case the location.href example you give seems the wrong example: this function is invoked to save a file to the user's device, not to open documents in the browser, so as a download, the associated content type can simply always be application/octet-stream and the charset will be irrelevant.

edit: @jimmywarting also makes a good point about the content change invalidating any digests that the user might run for their content in parallel.

@jimmywarting
Collaborator

jimmywarting commented Sep 21, 2018

If we automatically add a BOM, then we are changing the content of the source they are trying to save. A hash sum of the file wouldn't be the same as what you are downloading.

Isn't the BOM only necessary when viewing it in a new tab?
...if a[download] works properly and doesn't open a new tab
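
The hash concern can be illustrated with a small sketch. FNV-1a (32-bit) is used here purely as a dependency-free stand-in for a real digest such as SHA-256; it is not what anyone would use for integrity checks, and `fnv1a` is a helper written for this example.

```javascript
// FNV-1a (32-bit): a small, non-cryptographic stand-in for a real digest.
function fnv1a(bytes) {
  let hash = 0x811c9dc5;
  for (const b of bytes) {
    hash ^= b;
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits unsigned
  }
  return hash;
}

const original = new TextEncoder().encode("hello");
const saved = Uint8Array.from([0xef, 0xbb, 0xbf, ...original]); // what an auto-BOM save would write

const digestBefore = fnv1a(original);
const digestAfter = fnv1a(saved);
const digestsMatch = digestBefore === digestAfter;
```

Any digest computed over the in-memory content is being compared against a file that starts with three extra bytes, so a check like this fails even though the text looks identical.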

@eligrey
Owner

eligrey commented Sep 21, 2018

@jimmywarting Correct. Also thanks for the latest PR!

@jimmywarting
Collaborator

jimmywarting commented Sep 21, 2018

I've begun to think that noAutoBom should be reversed too.
It kinda feels like it does some unexpected things.

People might wonder "what the heck is a BOM?", ignore it, and just use the first two arguments.

a change like this should be a major version update
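
A reversed, opt-in third argument could be sketched along these lines. `normalizeOptions` and the `autoBom` option name are hypothetical and chosen for illustration; the sketch also assumes that a legacy boolean should keep its old "no auto BOM" meaning so existing callers do not silently change behaviour.

```javascript
// Hypothetical normalizer for the third argument; names are assumptions, not the library's API.
function normalizeOptions(opts) {
  if (opts == null) return { autoBom: false };               // new default: no BOM
  if (typeof opts === "boolean") return { autoBom: !opts };  // legacy no_auto_bom flag
  return { autoBom: opts.autoBom === true };                 // explicit opt-in
}

const twoArgs = normalizeOptions();                     // caller used only blob and filename
const legacyTrue = normalizeOptions(true);              // old "no auto BOM" call
const explicit = normalizeOptions({ autoBom: true });   // new opt-in call
```

With this shape, someone who only ever passes the first two arguments gets the file written byte for byte, which is the behaviour argued for in this thread.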
