Skip to content

fix: correct table annotation offsets under vertical padding#1

Draft
jimbofreedman wants to merge 1 commit into
masterfrom
fix/table-annotation-offsets-issue-92
Draft

fix: correct table annotation offsets under vertical padding#1
jimbofreedman wants to merge 1 commit into
masterfrom
fix/table-annotation-offsets-issue-92

Conversation

@jimbofreedman

Copy link
Copy Markdown

Fork PR for the inscriptis table-annotation bug currently worked around by spotship/mavis#1519 (Sentry: MAVIS-TW).

What

TableCell.get_annotations (multi-line path) built line_break_pos from the post-padded self.line_width (which contains zero-length entries for top and bottom padding) and then indexed annotation_lines with no + self.vertical_padding. Because no was already a padded line index, adding vertical_padding on top double-counted the top padding.

Symptoms:

valign Pre-fix behaviour
bottom (cell shorter than row) IndexError: list index out of range when annotations spanned multiple content lines — this is what mavis hit in prod
middle (cell shorter than row) Silent: annotations placed on bottom-padding lines below the content; text[start:end] returned whitespace or text from the next column. This is the symptom reported by upstream issue weblyzard/inscriptis#92 (open since Feb 2025).
top Unaffected (vertical_padding == 0).

Fix

Track the pre-padding row count as _content_height in normalize_blocks, then scan only the content-width slice of line_width when assigning annotations to lines, so no is a content-line index 0..rows-1. annotation_lines[no + top_pad] then correctly references the output line containing the content.

Tests

tests/test_table_annotation_vertical_padding.py (new) covers:

All 4 tests fail against master, pass on this branch. Full upstream suite: 68 passed (excluding test_web_service.py which needs fastapi, unrelated).

Upstream

We may also open this as a PR against weblyzard/inscriptis once reviewed here — flagging that decision separately.

…rd#92)

The multi-line path in `TableCell.get_annotations` built `line_break_pos`
from the post-padded `self.line_width` (which contains zero-length entries
for top and/or bottom padding) and then indexed `annotation_lines` by
`no + self.vertical_padding`. Because `no` was already a padded line index,
adding `vertical_padding` on top double-counted the top padding:

* `valign=bottom` cells shorter than their row could write past the end of
  `annotation_lines`, raising `IndexError` on the second or later content
  line's annotations.
* `valign=middle` cells silently shifted annotations onto the bottom-
  padding lines below the actual content, so `text[start:end]` returned
  whitespace or text from the next column.

Fix: scan only the content-width slice of `line_width` (using a new
`_content_height` tracked in `normalize_blocks`) so `no` is a content-line
index 0..rows-1, and the destination `annotation_lines[no + top_pad]` then
correctly references the output line containing the content.

Regression tests in tests/test_table_annotation_vertical_padding.py cover
the `IndexError` crash and the silent middle-/bottom-align offset bugs.
@jimbofreedman

jimbofreedman commented May 19, 2026

Copy link
Copy Markdown
Author

image

@helena-skowronska

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incorrect table-cell annotation line mapping when vertical padding is introduced by vertical-align: middle / bottom, preventing both IndexError crashes and silent misplacement of annotation spans onto padding lines (upstream issue weblyzard#92 / production crash scenario described in the PR).

Changes:

  • Track the pre-padding content height for each TableCell during normalize_blocks().
  • In TableCell.get_annotations() (multi-line path), compute line-break positions from content widths only, then offset destination lines by the top padding.
  • Add a dedicated regression test module covering bottom/middle vertical padding cases and the upstream issue weblyzard#92 structure.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/inscriptis/model/table.py Fixes multi-line annotation-to-line assignment by distinguishing content lines from padding lines.
tests/test_table_annotation_vertical_padding.py Adds regression tests for bottom/middle vertical padding annotation offsets and the prior crash case.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# on, then offset the destination by the top padding to land on the
# correct output line.
top_pad = self.vertical_padding
content_widths = self.line_width[top_pad : top_pad + self._content_height]

text = result["text"]
li_slices = [text[s:e] for s, e, t in result["label"] if t == "li"]
for item in ("item1", "item2", "item5", "item6", "item7", "item8", "item9", "item10"):
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants