Fix handling with "xml:" prefixed namespace (#208)
I found that parsing XHTML documents like the one below has failed since v3.3.3:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>XHTML Document</title>
  </head>
  <body>
    <h1>XHTML Document</h1>
    <p xml:lang="ja" lang="ja">この段落は日本語です。</p>
  </body>
</html>
```

The [XML namespace spec][spec] is a little ambiguous, but the document
above is valid according to an [article published by the W3C][article].

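I fixed the parsing algorithm by binding the `xml` prefix to
`http://www.w3.org/XML/1998/namespace` in the parser's initial namespace
table. Can you review it?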
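
The idea behind the fix can be sketched outside the parser. The block below is a minimal illustration, not REXML's actual code: `NamespaceTable`, `declare`, and `resolve` are hypothetical names chosen for this example. It assumes a prefix table that starts with `xml` already bound, so an `xml:lang` attribute resolves without any `xmlns:xml` declaration, while an undeclared prefix is still rejected.

```ruby
# Hypothetical sketch, not REXML's implementation: a prefix table that is
# seeded with the reserved "xml" prefix before any document is read.
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"

class NamespaceTable
  def initialize
    # "xml" is bound up front, so documents never have to declare it.
    @bindings = {"xml" => XML_NAMESPACE}
  end

  # Record an xmlns:prefix="uri" declaration.
  def declare(prefix, uri)
    if prefix == "xml" && uri != XML_NAMESPACE
      raise ArgumentError,
            "The 'xml' prefix must not be bound to any other namespace"
    end
    @bindings[prefix] = uri
  end

  # Resolve the prefix of a qualified attribute name such as "xml:lang".
  # Unprefixed attribute names have no prefix to look up, so nil is returned.
  def resolve(qualified_name)
    prefix, local_part = qualified_name.split(":", 2)
    return nil if local_part.nil?

    @bindings.fetch(prefix) do
      raise ArgumentError, "Undeclared namespace prefix '#{prefix}'"
    end
  end
end
```

With this table, `NamespaceTable.new.resolve("xml:lang")` returns the reserved namespace URI, whereas a table that starts empty would treat `xml` as an undeclared prefix.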

As an aside, the `<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
lang="en">` style of language declaration is often used in XHTML files
included in EPUB files, because the [sample EPUB files][samples] provided
by the IDPF, the former EPUB spec authority, use that style.

[spec]: https://www.w3.org/TR/REC-xml-names/#defaulting
[article]: https://www.w3.org/International/questions/qa-html-language-declarations#attributes
[samples]: https://github.com/IDPF/epub3-samples
KitaitiMakoto authored Sep 29, 2024
1 parent 2e1cd64 commit 78f8712
Showing 2 changed files with 38 additions and 2 deletions.
5 changes: 3 additions & 2 deletions lib/rexml/parsers/baseparser.rb
@@ -156,6 +156,7 @@ module Private
default_entities.each do |term|
DEFAULT_ENTITIES_PATTERNS[term] = /&#{term};/
end
+ XML_PREFIXED_NAMESPACE = "http://www.w3.org/XML/1998/namespace"
end
private_constant :Private

@@ -185,7 +186,7 @@ def stream=( source )
@tags = []
@stack = []
@entities = []
- @namespaces = {}
+ @namespaces = {"xml" => Private::XML_PREFIXED_NAMESPACE}
@namespaces_restore_stack = []
end

@@ -790,7 +791,7 @@ def parse_attributes(prefixes)
@source.match(/\s*/um, true)
if prefix == "xmlns"
if local_part == "xml"
- if value != "http://www.w3.org/XML/1998/namespace"
+ if value != Private::XML_PREFIXED_NAMESPACE
msg = "The 'xml' prefix must not be bound to any other namespace "+
"(http://www.w3.org/TR/REC-xml-names/#ns-decl)"
raise REXML::ParseException.new( msg, @source, self )
35 changes: 35 additions & 0 deletions test/parser/test_base_parser.rb
@@ -23,5 +23,40 @@ def test_large_xml
parser.position < xml.bytesize
end
end

def test_attribute_prefixed_by_xml
xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>XHTML Document</title>
</head>
<body>
<h1>XHTML Document</h1>
<p xml:lang="ja" lang="ja">この段落は日本語です。</p>
</body>
</html>
XML

parser = REXML::Parsers::BaseParser.new(xml)
5.times {parser.pull}

html = parser.pull
assert_equal([:start_element,
"html",
{"xmlns" => "http://www.w3.org/1999/xhtml",
"xml:lang" => "en",
"lang" => "en"}],
html)

15.times {parser.pull}

p = parser.pull
assert_equal([:start_element,
"p",
{"xml:lang" => "ja", "lang" => "ja"}],
p)
end
end
end
