fix: correct table annotation offsets under vertical padding#1
Draft
jimbofreedman wants to merge 1 commit into
Draft
fix: correct table annotation offsets under vertical padding#1jimbofreedman wants to merge 1 commit into
jimbofreedman wants to merge 1 commit into
Conversation
…rd#92) The multi-line path in `TableCell.get_annotations` built `line_break_pos` from the post-padded `self.line_width` (which contains zero-length entries for top and/or bottom padding) and then indexed `annotation_lines` by `no + self.vertical_padding`. Because `no` was already a padded line index, adding `vertical_padding` on top double-counted the top padding: * `valign=bottom` cells shorter than their row could write past the end of `annotation_lines`, raising `IndexError` on the second or later content line's annotations. * `valign=middle` cells silently shifted annotations onto the bottom- padding lines below the actual content, so `text[start:end]` returned whitespace or text from the next column. Fix: scan only the content-width slice of `line_width` (using a new `_content_height` tracked in `normalize_blocks`) so `no` is a content-line index 0..rows-1, and the destination `annotation_lines[no + top_pad]` then correctly references the output line containing the content. Regression tests in tests/test_table_annotation_vertical_padding.py cover the `IndexError` crash and the silent middle-/bottom-align offset bugs.
Author
There was a problem hiding this comment.
Pull request overview
This PR fixes incorrect table-cell annotation line mapping when vertical padding is introduced by vertical-align: middle / bottom, preventing both IndexError crashes and silent misplacement of annotation spans onto padding lines (upstream issue weblyzard#92 / production crash scenario described in the PR).
Changes:
- Track the pre-padding content height for each
TableCellduringnormalize_blocks(). - In
TableCell.get_annotations()(multi-line path), compute line-break positions from content widths only, then offset destination lines by the top padding. - Add a dedicated regression test module covering bottom/middle vertical padding cases and the upstream issue weblyzard#92 structure.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/inscriptis/model/table.py |
Fixes multi-line annotation-to-line assignment by distinguishing content lines from padding lines. |
tests/test_table_annotation_vertical_padding.py |
Adds regression tests for bottom/middle vertical padding annotation offsets and the prior crash case. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # on, then offset the destination by the top padding to land on the | ||
| # correct output line. | ||
| top_pad = self.vertical_padding | ||
| content_widths = self.line_width[top_pad : top_pad + self._content_height] |
|
|
||
| text = result["text"] | ||
| li_slices = [text[s:e] for s, e, t in result["label"] if t == "li"] | ||
| for item in ("item1", "item2", "item5", "item6", "item7", "item8", "item9", "item10"): |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Fork PR for the inscriptis table-annotation bug currently worked around by spotship/mavis#1519 (Sentry: MAVIS-TW).
What
TableCell.get_annotations(multi-line path) builtline_break_posfrom the post-paddedself.line_width(which contains zero-length entries for top and bottom padding) and then indexedannotation_lineswithno + self.vertical_padding. Becausenowas already a padded line index, addingvertical_paddingon top double-counted the top padding.Symptoms:
bottom(cell shorter than row)IndexError: list index out of rangewhen annotations spanned multiple content lines — this is what mavis hit in prodmiddle(cell shorter than row)text[start:end]returned whitespace or text from the next column. This is the symptom reported by upstream issue weblyzard/inscriptis#92 (open since Feb 2025).topvertical_padding == 0).Fix
Track the pre-padding row count as
_content_heightinnormalize_blocks, then scan only the content-width slice ofline_widthwhen assigning annotations to lines, sonois a content-line index0..rows-1.annotation_lines[no + top_pad]then correctly references the output line containing the content.Tests
tests/test_table_annotation_vertical_padding.py(new) covers:IndexErrorcrash undervalign=bottomwith multi-line contentvalign=middleoffset shift onto padding linesvalign=bottomannotation on the last content lineAll 4 tests fail against
master, pass on this branch. Full upstream suite: 68 passed (excludingtest_web_service.pywhich needsfastapi, unrelated).Upstream
We may also open this as a PR against
weblyzard/inscriptisonce reviewed here — flagging that decision separately.