Commit

Merge pull request #303 from VikParuchuri/dev
Fix bug, improve quality
VikParuchuri authored Oct 18, 2024
2 parents ea845fd + 41ce77b commit 361f9b5
Showing 6 changed files with 219 additions and 197 deletions.
14 changes: 14 additions & 0 deletions README.md
@@ -147,6 +147,20 @@ METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert

Note that the env variables above are specific to this script, and cannot be set in `local.env`.


## Use from python

See the `convert_single_pdf` function for additional arguments that can be passed.

```python
from marker.convert import convert_single_pdf
from marker.models import load_all_models

fpath = "FILEPATH"
model_lst = load_all_models()
full_text, images, out_meta = convert_single_pdf(fpath, model_lst)
```

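The three return values in the snippet above can be written to disk in a few lines. The following is a minimal sketch, not part of this commit: `save_conversion` and the output file names are hypothetical, and it assumes `images` maps flat file names to objects with a PIL-style `save(path)` method and that `out_meta` is JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_conversion(
    full_text: str,
    images: Dict[str, Any],
    out_meta: Dict[str, Any],
    out_dir: str,
) -> Path:
    """Write markdown, metadata, and images to out_dir (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Markdown text goes into a single file; the name is an arbitrary choice.
    md_path = out / "output.md"
    md_path.write_text(full_text, encoding="utf-8")

    # Metadata is assumed to be a plain JSON-serializable dict.
    meta_path = out / "output_meta.json"
    meta_path.write_text(json.dumps(out_meta, indent=2), encoding="utf-8")

    # Assumption: each value exposes save(path), as PIL images do.
    for name, image in images.items():
        image.save(out / name)

    return md_path
```

It would be called as `save_conversion(full_text, images, out_meta, "out")` after the conversion call shown above.
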
# Output format

The output will be a markdown file, but there will also be a metadata json file that gives information about the conversion process. It has these fields:
11 changes: 6 additions & 5 deletions marker/cleaners/headings.py
@@ -115,11 +115,12 @@ def infer_heading_levels(pages: List[Page], height_tol=.99):
                 continue

             block_heights = [l.height for l in block.lines] # Account for rotation
-            avg_height = sum(block_heights) / len(block_heights)
-            for idx, (min_height, max_height) in enumerate(heading_ranges):
-                if avg_height >= min_height * height_tol:
-                    block.heading_level = idx + 1
-                    break
+            if len(block_heights) > 0:
+                avg_height = sum(block_heights) / len(block_heights)
+                for idx, (min_height, max_height) in enumerate(heading_ranges):
+                    if avg_height >= min_height * height_tol:
+                        block.heading_level = idx + 1
+                        break

             if block.heading_level is None:
                 block.heading_level = settings.HEADING_DEFAULT_LEVEL
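
The change in this hunk guards the average against a block with no lines, which would otherwise divide by zero. A standalone sketch of the same logic follows. `pick_heading_level` and its `default_level` parameter are hypothetical stand-ins (the latter for `settings.HEADING_DEFAULT_LEVEL`), and `heading_ranges` is assumed to be ordered from the tallest bucket to the shortest.

```python
from typing import List, Optional, Tuple


def pick_heading_level(
    line_heights: List[float],
    heading_ranges: List[Tuple[float, float]],
    height_tol: float = 0.99,
    default_level: int = 2,  # hypothetical stand-in for settings.HEADING_DEFAULT_LEVEL
) -> int:
    """Return a 1-based heading level for one block, given its line heights."""
    level: Optional[int] = None

    # Only average when there is at least one line; an empty block would
    # otherwise raise ZeroDivisionError.
    if line_heights:
        avg_height = sum(line_heights) / len(line_heights)
        for idx, (min_height, _max_height) in enumerate(heading_ranges):
            # The first bucket whose (tolerance-scaled) minimum is met wins.
            if avg_height >= min_height * height_tol:
                level = idx + 1
                break

    # Blocks that matched no bucket, or had no lines, get the default level.
    return default_level if level is None else level
```

With this shape, an empty block falls through to the default level instead of failing.
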
8 changes: 7 additions & 1 deletion marker/layout/layout.py
@@ -63,12 +63,18 @@ def annotate_block_types(pages: List[Page]):
                 if min_dist_idx is None or dist < min_dist:
                     min_dist = dist
                     min_dist_idx = j
+                for line in block2.lines:
+                    dist = block2.distance(line.bbox)
+                    if min_dist_idx is None or dist < min_dist:
+                        min_dist = dist
+                        min_dist_idx = j

             if min_dist_idx is not None:
                 block.block_type = page.blocks[min_dist_idx].block_type

         for i, block in enumerate(page.blocks):
             if block.block_type is None:
-                block.block_type = "text"
+                block.block_type = "Text"

         # Merge blocks together, preserving pdf order
         curr_layout_idx = None
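
The hunk above gives an untyped block the type of its nearest neighbour and falls back to `"Text"` when nothing was assigned. The sketch below illustrates that idea in isolation and is not the repository's code: `Block`, `center_distance`, and `fill_block_types` are hypothetical, the distance is a simple center-to-center Euclidean measure rather than marker's own `distance` method, and only blocks that already had a type are considered as neighbours.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Block:
    bbox: BBox
    block_type: Optional[str] = None


def center_distance(a: BBox, b: BBox) -> float:
    """Euclidean distance between box centers (an assumed metric)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def fill_block_types(blocks: List[Block], default: str = "Text") -> None:
    """Label untyped blocks in place from their nearest typed neighbour."""
    # Snapshot the originally typed blocks so inferred labels do not propagate.
    typed = [b for b in blocks if b.block_type is not None]
    for block in blocks:
        if block.block_type is not None:
            continue
        if typed:
            nearest = min(typed, key=lambda t: center_distance(block.bbox, t.bbox))
            block.block_type = nearest.block_type
        else:
            # No typed neighbour exists on the page: use the fallback label.
            block.block_type = default
```

The capitalized `"Text"` default mirrors the label casing introduced by this commit.
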