fix: set resolve_entities=False in partition_xml (#3088)
### Summary

Closes #3078. Sets `resolve_entities=False` when parsing XML with `lxml`
in `partition_xml`, so entity references are left unexpanded and
entity-defined text cannot be dynamically injected into the document.
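
A minimal sketch of the behavior this change targets, assuming an internal entity declared in the document's own DTD. The sample document and the `root_text` helper are hypothetical and are not part of `partition_xml`; only `etree.iterparse` and its `resolve_entities` keyword come from `lxml`.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample document: an internal entity declared in the DTD.
XML = b"""<?xml version="1.0"?>
<!DOCTYPE root [<!ENTITY injected "text from the DTD">]>
<root>&injected;</root>"""


def root_text(xml: bytes, resolve_entities: bool) -> str:
    """Stream the document and return the root element's leading text."""
    text = ""
    events = etree.iterparse(
        BytesIO(xml), events=("end",), resolve_entities=resolve_entities
    )
    for _, element in events:
        if element.tag == "root":
            # With entities unresolved, the reference stays a separate
            # entity node, so no expanded text is attached to the element.
            text = element.text or ""
    return text


if __name__ == "__main__":
    print(repr(root_text(XML, resolve_entities=True)))
    print(repr(root_text(XML, resolve_entities=False)))
```

With `resolve_entities=True` the entity's replacement text shows up as element text; with `resolve_entities=False` it should not, which is the injection path this commit closes.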

### Testing

`pytest test_unstructured/partition/test_xml.py` continues to pass with
the update.
MthwRobinson authored May 23, 2024
1 parent 9b83330 commit 171b5df
Showing 3 changed files with 5 additions and 3 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.14.3-dev1
+## 0.14.3-dev2

### Enhancements

@@ -8,6 +8,8 @@

### Fixes

+**Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
+to avoid text being dynamically injected into the XML document.
* Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call.

## 0.14.2
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.14.3-dev1" # pragma: no cover
+__version__ = "0.14.3-dev2" # pragma: no cover
2 changes: 1 addition & 1 deletion unstructured/partition/xml.py
@@ -51,7 +51,7 @@ def _get_leaf_elements(
"""Parse the XML tree in a memory efficient manner if possible."""
element_stack = []

-element_iterator = etree.iterparse(file, events=("start", "end"))
+element_iterator = etree.iterparse(file, events=("start", "end"), resolve_entities=False)
# NOTE(alan) If xml_path is used for filtering, I've yet to find a good way to stream
# elements through in a memory efficient way, so we bite the bullet and load it all into
# memory.