Question about cluster comparison function #1059

rhavyyz · 2025-02-25T22:15:34Z


I'm working on document layout analysis and was looking for techniques to resolve overlapping or competing layout classes in the same region. I started looking into the approach you used to tackle this issue and found the function below.

I'm curious how you came up with this approach. The area/confidence base cases seem pretty reasonable, but the LIST_ITEM vs. TEXT and CODE vs. everything else special cases made me curious.

Also, how did you come up with the threshold values? Empirical tests? If so, what criteria did you use to pick the best ones?

    def _should_prefer_cluster(
        self, candidate: Cluster, other: Cluster, params: dict
    ) -> bool:
        """Determine if candidate cluster should be preferred over other cluster based on rules.
        Returns True if candidate should be preferred, False if not."""

        # Rule 1: LIST_ITEM vs TEXT
        if (
            candidate.label == DocItemLabel.LIST_ITEM
            and other.label == DocItemLabel.TEXT
        ):
            # Check if areas are similar (within 20% of each other)
            area_ratio = candidate.bbox.area() / other.bbox.area()
            area_similarity = abs(1 - area_ratio) < 0.2
            if area_similarity:
                return True

        # Rule 2: CODE vs others
        if candidate.label == DocItemLabel.CODE:
            # Calculate how much of the other cluster is contained within the CODE cluster
            overlap = other.bbox.intersection_area_with(candidate.bbox)
            containment = overlap / other.bbox.area()
            if containment > 0.8:  # other is 80% contained within CODE
                return True

        # If no label-based rules matched, fall back to area/confidence thresholds
        area_ratio = candidate.bbox.area() / other.bbox.area()
        conf_diff = other.confidence - candidate.confidence
        
        if (
            area_ratio <= params["area_threshold"]
            and conf_diff > params["conf_threshold"]
        ):
            return False

        return True  # Default to keeping candidate if no rules triggered rejection
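
To make my reading of the rules concrete, here is a minimal standalone sketch that mirrors the function above on toy boxes. `Box`, `Cluster`, `should_prefer`, the string labels, and the `params` values are all stand-ins I made up for illustration; they are not the library's types or its actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box; a hypothetical stand-in for the library's bbox type."""
    l: float
    t: float
    r: float
    b: float

    def area(self) -> float:
        return max(0.0, self.r - self.l) * max(0.0, self.b - self.t)

    def intersection_area_with(self, other: "Box") -> float:
        w = min(self.r, other.r) - max(self.l, other.l)
        h = min(self.b, other.b) - max(self.t, other.t)
        return max(0.0, w) * max(0.0, h)


@dataclass
class Cluster:
    label: str  # plain strings here instead of DocItemLabel
    bbox: Box
    confidence: float


def should_prefer(candidate: Cluster, other: Cluster, params: dict) -> bool:
    # Rule 1: a list item wins over text when the areas differ by less than 20%
    if candidate.label == "list_item" and other.label == "text":
        if abs(1 - candidate.bbox.area() / other.bbox.area()) < 0.2:
            return True
    # Rule 2: code wins when more than 80% of the other box lies inside it
    if candidate.label == "code":
        overlap = other.bbox.intersection_area_with(candidate.bbox)
        if overlap / other.bbox.area() > 0.8:
            return True
    # Fallback: reject only a not-much-larger, clearly less confident candidate
    area_ratio = candidate.bbox.area() / other.bbox.area()
    conf_diff = other.confidence - candidate.confidence
    if area_ratio <= params["area_threshold"] and conf_diff > params["conf_threshold"]:
        return False
    return True


# Placeholder thresholds, chosen only for this example
params = {"area_threshold": 1.3, "conf_threshold": 0.05}

text = Cluster("text", Box(0, 0, 100, 20), 0.95)
item = Cluster("list_item", Box(0, 0, 100, 22), 0.60)
code = Cluster("code", Box(0, 0, 100, 100), 0.50)
inner = Cluster("text", Box(10, 10, 90, 30), 0.99)
weak = Cluster("text", Box(0, 0, 100, 20), 0.50)
big_weak = Cluster("text", Box(0, 0, 100, 40), 0.50)

item_over_text = should_prefer(item, text, params)      # Rule 1 fires
text_over_item = should_prefer(text, item, params)      # falls through to default
code_over_inner = should_prefer(code, inner, params)    # Rule 2 fires
weak_over_text = should_prefer(weak, text, params)      # fallback rejects
big_over_text = should_prefer(big_weak, text, params)   # area ratio 2.0 > 1.3

print(item_over_text, text_over_item, code_over_inner, weak_over_text, big_over_text)
# True True True False True
```

If I read it right, the low-confidence list item beats the high-confidence text box through Rule 1, and a candidate twice the size of the other box is kept even though it is much less confident. In this sketch both `should_prefer(item, text, params)` and `should_prefer(text, item, params)` return True, so the outcome seems to depend on which cluster is passed as the candidate.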
