Initial Text Parsing Goals and Requirements - notes on parsing ‘in order’
1. Output text in correct logical order. This means keeping text together that flows between columns, and that occasionally breaks column layout.
1. The streams that carry text elements in a PDF (encoded, binary, and usually compressed) do not have to appear in logical order (i.e. human reading order) within the file. Also, a word, sentence, or paragraph *can* show up as one text element, but in our PDFs they rarely do. Instead we have streams of text in which each text/String element usually holds only a single character, with positioning, kerning, and similar information attached to it. These elements don't even need to be encoded the same way as one another, which is one reason that having large, well-tested libraries for PDF text extraction is essential.
2. We are in luck that the text elements in our PDFs do seem to ‘show up’ in very close to the correct order, so that parsers that *don’t* attempt to do too much positional magic get things mostly right (those that *do* try to capture layout, though, are likely to mangle columns -- reading the page straight across left to right -- since they have no way to understand where column breaks are supposed to appear).
1. Text ordering is a different kind of problem in different sections of the underlying documents.
1. DIRECTORY The most common place in these PDFs for text elements to appear out of order seems to be the second page, the Directory of City Officials. At least in the time span for which we have PDFs, this is *always* on page 2 and is *always* a symmetrical two-column layout. Because the linearity of the PDF elements is likely to break down here, and because the layout is simple and consistent, I expect to parse this page one column at a time using ‘positional’ parsing within those columns (i.e. telling the parser to work in one defined page region at a time, and reconstructing text units across lines based only on their coordinate positions and not on their flow within the PDF document).
2. MAIN BODY But that approach is unlikely to work within the main part of the document (from the 3rd page of each PDF through to the end). These pages are mainly **but not entirely** in a three-column layout. And “linearity” is mostly preserved, so that parsing text out of the PDF in the order in which it appears does almost always produce text that flows correctly in columnar order.
1. The main place where this breaks down is where the 3-column layout is occasionally broken for an item -- for example a particular ordinance -- to run at the full page width. When this happens it seems to be read, in the ‘linear’ approach, as part of the first column. This will cause the text from this item to appear out-of-order, usually within the flow of text pertaining to another item. The result can be especially confusing when the ‘full width’ item spans over a page break.
2. At this point I am still looking for a sufficiently reliable heuristic to identify the lines of text from these ‘full width’ items, so that we can either ‘fix’ them at the parsing stage or at least flag them for restoration at the next stage. For example, is there a consistent pattern in the length of parsed-out lines? Is there consistently a graphical element (like a horizontal rule) on the page that we can locate and use to spot these? The answer may even be a specific pattern in the underlying text, in which case we might rectify these at the next stage.
3. But without foreknowledge of where the ‘full width’ items appear, I think we would have even more trouble coping with them if we attempted to “force” column-at-a-time parsing for this part of the document.
4. Fortunately, the three-column arrangement ONLY seems to be broken in order to run a particular item at full width. This should be much easier to work with than a wildly irregular pattern of sidebars or call-outs.
1. INDEX The index section at the end of each City Record issue returns to a single-column layout. If the indexes always started after a page break this would be a non-issue. Because they can begin part-way down the last page of enacted ordinances, this is likely to pose the same challenge as do the full-width items that appear within the regular flow of the main body section.
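To make two of the ideas above concrete (region-at-a-time ‘positional’ parsing for the Directory page, and a line-width test as one candidate heuristic for spotting ‘full width’ items), here is a minimal Python sketch. It is not tied to any particular library: it assumes the text has already been extracted as line fragments with coordinates. The Fragment shape, the function names, the 600-point page width, and the 1.5 slack factor are all hypothetical placeholders that would need tuning against the real PDFs.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One extracted line of text with its box (hypothetical shape)."""
    x0: float  # left edge, in points
    x1: float  # right edge, in points
    y: float   # baseline; PDF coordinates, so larger y is higher on the page
    text: str


def read_region(fragments: List[Fragment], left: float, right: float) -> List[str]:
    """Text of the fragments that start inside [left, right), top to bottom."""
    inside = [f for f in fragments if left <= f.x0 < right]
    inside.sort(key=lambda f: (-f.y, f.x0))  # stream order is ignored entirely
    return [f.text for f in inside]


def read_two_columns(fragments: List[Fragment], page_width: float) -> List[str]:
    """Directory-style page: all of the left column, then all of the right."""
    mid = page_width / 2
    return read_region(fragments, 0, mid) + read_region(fragments, mid, page_width)


def flag_full_width(fragments: List[Fragment], page_width: float,
                    n_columns: int = 3, slack: float = 1.5) -> List[Fragment]:
    """Fragments wider than `slack` column widths: candidate full-width lines."""
    limit = slack * page_width / n_columns
    return [f for f in fragments if f.x1 - f.x0 > limit]


# Made-up fragments, deliberately listed out of reading order.
page = [
    Fragment(350, 550, 700, "Clerk"),
    Fragment(50, 250, 700, "Mayor"),
    Fragment(350, 550, 680, "Auditor"),
    Fragment(50, 250, 680, "Council"),
]
print(read_two_columns(page, page_width=600))
# prints ['Mayor', 'Council', 'Clerk', 'Auditor']
```

Note that read_region assigns a fragment to a column by its left edge only, so a full-width line would land in the first column, which is the same failure described for the main body. The width test in flag_full_width is only one candidate for catching those lines before they are merged into a column; whether line width is a reliable enough signal in these PDFs is still an open question.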
And some things are just plain weird. PDFMiner seems more prone to these than other libraries I’ve experimented with (notably PDFBox), I think because it always applies at least some kind of ‘layout engine’ approach. On occasion it seems to get the x,y position of an element ‘wrong’ in an unpredictable way, where a more naive ‘read it in order’ approach doesn’t.
[[An alternative test will be to try a very *strongly* positional approach and hope that white-space in the resulting text is useful for downstream processing]]
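A rough sketch of what that *strongly* positional test might look like: drop each fragment onto a monospace character grid so that horizontal gaps on the page become runs of spaces, then let downstream code split on those runs. Everything here is an assumption for illustration: the (x, y, text) tuples, the 6-point character width, the 12-point line height, and the function names are made up, and real pages would need these values measured or estimated.

```python
import re
from typing import List, Tuple


def render_grid(fragments: List[Tuple[float, float, str]],
                char_width: float = 6.0, line_height: float = 12.0) -> str:
    """Place (x, y, text) fragments on a character grid, top of page first."""
    rows = {}
    for x, y, text in fragments:
        row = round(y / line_height)   # fragments on one baseline share a row
        col = round(x / char_width)    # horizontal position as a column index
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in sorted(rows, reverse=True):  # larger y is higher on the page
        line = ""
        for col, text in sorted(rows[row]):
            if line and len(line) >= col:
                line += " "            # overlap: keep at least one space
            else:
                line = line.ljust(col)  # pad with spaces up to the column
            line += text
        lines.append(line)
    return "\n".join(lines)


def split_cells(line: str) -> List[str]:
    """Downstream use of the white-space: split on runs of 2+ spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


grid = render_grid([(120, 120, "Clerk"), (0, 120, "Mayor"), (0, 108, "Council")])
print(grid)
print(split_cells(grid.split("\n")[0]))
```

Empty rows between widely spaced lines are not emitted in this sketch; if vertical gaps turn out to matter for downstream processing, blank lines could be added for the missing row numbers.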