pdf: Improve text with characters outside embedded font limits#30512

Merged
QuLogic merged 4 commits into matplotlib:text-overhaul from
QuLogic:pdf-text-subsets
Sep 26, 2025

@QuLogic
Member

@QuLogic QuLogic commented Sep 4, 2025

PR summary

For character codes outside the embedded font limits (256 for type 3 and 65536 for type 42), we output them as XObjects instead of using text commands. But there is nothing in the PDF spec that requires any specific encoding like this.

Since we now support subsetting all fonts before embedding, split each font into groups based on the maximum character code (e.g., 256-entry groups for type 3), then switch text strings to a different font subset and re-map character codes to it when necessary.
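
As a rough illustration of the block-based split described above, here is a minimal sketch. It is not the actual `CharacterTracker` code; the function names `split_into_subsets` and `runs_by_subset` are hypothetical, and it assumes the subset index is simply the character code divided by the font type's limit.

```python
from itertools import groupby


def split_into_subsets(charcodes, subset_size=256):
    """Map each character code to (subset_index, code_within_subset).

    Hypothetical helper: assumes a plain block-by-block split, i.e. the
    subset is chosen by integer division of the original character code.
    """
    return {code: divmod(code, subset_size) for code in set(charcodes)}


def runs_by_subset(text, subset_size=256):
    """Yield (subset_index, remapped_codes) for each run of same-subset text.

    Each run corresponds to one font switch in the output stream.
    """
    for subset_index, group in groupby(
            text, key=lambda ch: ord(ch) // subset_size):
        yield subset_index, [ord(ch) % subset_size for ch in group]


mapping = split_into_subsets(map(ord, "A\u00e9\u0100"))
for code, (subset, remapped) in sorted(mapping.items()):
    print(f"U+{code:04X} -> subset {subset}, code {remapped}")

for subset, codes in runs_by_subset("a\u0100\u0101b"):
    print(subset, codes)
```

With a 256-entry limit (type 3), U+0100 lands in subset 1 as code 0; with a 65536-entry limit (type 42), only non-BMP characters such as the emoji leave subset 0.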

This means all text is true text (albeit with some strange encoding), and we no longer need any XObjects for glyphs. For users of non-English text, this means it will become selectable and copyable again.

There are 3 steps to achieve this change:

  1. Track both character codes and glyphs in CharacterTracker. This class takes care of splitting characters into subsets that fit the desired PDF font type limits. -> moved to pdf/ps: Track full character map in CharacterTracker #30566
  2. Output each used font block as a separate subsetted font. Also change the subset prefix to use the glyph indices, which are unique, unlike the character codes. -> first commit here
  3. Generate a ToUnicode dictionary for the subset font. We already did this for type 42 fonts, but the implementation was incorrect because it didn't handle non-BMP characters. For type 3, ToUnicode support was added in PDF 1.2, and we produce PDF 1.4, so it is available; without it, there is a fallback to the glyph names, but it is inconsistent and probably depends on the original font having the right names. -> second commit here
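
To make the non-BMP issue in step 3 concrete, here is a small sketch of how a ToUnicode CMap entry could be formed. The helper names `to_unicode_hex` and `bfchar_line` are hypothetical and this is not the code from this PR; it only assumes the standard `bfchar` entry form of a source code followed by a UTF-16BE destination string.

```python
def to_unicode_hex(codepoint):
    """Return the UTF-16BE encoding of a code point as uppercase hex.

    BMP characters give 4 hex digits; non-BMP characters give a
    surrogate pair (8 hex digits), which a fixed 16-bit (UCS-2 style)
    encoding cannot represent.
    """
    return chr(codepoint).encode("utf-16-be").hex().upper()


def bfchar_line(subset_code, codepoint, code_width=2):
    """Format one bfchar entry: <code in subset font> <UTF-16BE text>.

    code_width is the number of hex digits for the source code
    (assumed here: 2 for a single-byte font, 4 for a two-byte font).
    """
    return f"<{subset_code:0{code_width}X}> <{to_unicode_hex(codepoint)}>"


print(bfchar_line(0x41, ord("A")))               # BMP character
print(bfchar_line(0x00, 0x1F600))                # emoji, surrogate pair
print(bfchar_line(0xF600, 0x1F600, code_width=4))
```

The second entry shows a remapped code in a later subset still pointing back at the original emoji, which is what makes the text copyable even though the in-font encoding is arbitrary.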

In the future, we may wish to extend the implementation in CharacterTracker to "compress" the character map it produces (e.g., if you use 255 characters, each from a different 256-sized block, with type 3, you get 255 fonts, but we could compress that to a single font). I tried to avoid hard-coding any assumptions that the mapping is block-by-block, but it is possible that something slipped through, so I do not want to spend too much time on that right now.
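
One possible shape for that future "compression", purely as a sketch: assign codes in order of first use instead of by block. The function name `compress_into_subsets` is hypothetical, and a real implementation might want extra constraints (for example, reserving certain codes) that are ignored here.

```python
def compress_into_subsets(charcodes, subset_size=256):
    """Assign characters to subsets sequentially, in order of first use.

    Hypothetical alternative to a block-by-block split: the n-th distinct
    character gets (n // subset_size, n % subset_size).
    """
    mapping = {}
    for code in charcodes:
        if code not in mapping:
            mapping[code] = divmod(len(mapping), subset_size)
    return mapping


# 255 characters, each from a different 256-sized block.
codes = [block * 256 for block in range(255)]

block_subsets = {code // 256 for code in codes}
packed_subsets = {subset for subset, _ in compress_into_subsets(codes).values()}
print(len(block_subsets), "fonts block-by-block")
print(len(packed_subsets), "font when packed")
```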

Formerly, with multi_font_type3.pdf (after adding the emoji to the test), copying the text in evince would produce:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !”#$%&’()*+,-./:;¡=¿?@[“]ˆ˙‘—–˝˜
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
in between!

and with multi_font_type42.pdf:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ
ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ
ǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ
ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿ
ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ
ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ
ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ
in between!

and now we get for both type 3 and 42:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ
ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ
ǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ
ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿ
ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ
ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ
ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏
in between!

Note how in the third line for type 3:

  1. the quotes are 'curly' instead of straight quotes
  2. the chevrons <> are inverted exclamation/question marks ¡¿
  3. the backslash \ is a curly opening double quote
  4. the caret ^, underscore _, and tilde ~ are {circumflex, dot, tilde} accents/smaller glyphs ˆ˙˜
  5. the braces {} are em-dash and curly quotes —˝
  6. the pipe | is en-dash

Everything from the seventh to second-last line is missing in type 3 since it's outside of the 256 limit, and all the emoji are missing from type 42 since that's outside the 65536 limit.

This depends on #30520, #30335, #30566, and #30567.

@anntzer
Contributor

anntzer commented Sep 4, 2025

This is great and would also allow getting rid of _get_pdf_charprocs. I'll try to have a look at #30335 to start...

@anntzer
Contributor

anntzer commented Sep 5, 2025

The first two commits (the loop merge and the Type3 encoding change) seem independent from the rest (even from the switch to glyph index tracking) and could be merged first via a separate PR? (I can probably approve them right away.)
I still need to properly review the next one (charmap tracking) but that can also come next by itself?

@QuLogic
Member Author

QuLogic commented Sep 6, 2025

I split the type3 encoding to #30520, but the loop merge has conflicts with the glyph index change.

@QuLogic
Member Author

QuLogic commented Sep 16, 2025

> The first two commits (the loop merge and the Type3 encoding change) seem independent from the rest (even from the switch to glyph index tracking) and could be merged first via a separate PR? (I can probably approve them right away.)

Split the loop merge as well.

For character codes outside the embedded font limits (256 for type 3 and
65536 for type 42), we output them as XObjects instead of using text
commands. But there is nothing in the PDF spec that requires any
specific encoding like this.

Since we now support subsetting all fonts before embedding, split each
font into groups based on the maximum character code (e.g., 256-entry
groups for type 3), then switch text strings to a different font subset
and re-map character codes to it when necessary.

This means all text is true text (albeit with some strange encoding),
and we no longer need any XObjects for glyphs. For users of non-English
text, this means it will become selectable and copyable again.

Fixes matplotlib#21797

For Type 3 fonts, add a `ToUnicode` mapping (which was added in PDF
1.2), and for Type 42 fonts, correct the Unicode encoding, which should
be UTF-16BE, not UCS2.

These characters are outside the BMP and should test subset splitting
for type 42 output in PDF.

@QuLogic
Member Author

QuLogic commented Sep 26, 2025

Rebased without images (moved to text-overhaul-figures branch) so that it can be merged.

@QuLogic QuLogic merged commit a1ed4ef into matplotlib:text-overhaul Sep 26, 2025
34 of 35 checks passed
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in Font and text overhaul Sep 26, 2025
@QuLogic QuLogic deleted the pdf-text-subsets branch September 26, 2025 01:49
wavebyrd pushed a commit to wavebyrd/matplotlib that referenced this pull request Mar 13, 2026
pdf: Improve text with characters outside embedded font limits
QuLogic added a commit to QuLogic/matplotlib that referenced this pull request Apr 10, 2026
This includes images changes for the following pull requests / commits:

* [Fix center of rotation with
  rotation_mode='anchor'](matplotlib#29199)
  (c44db77)
* [Remove ttconv backwards-compatibility
  code](matplotlib#30145)
  (8caff88)
* [Remove kerning_factor from
  tests](matplotlib#29816)
  (7b4d725)
* [Set text hinting to
  defaults](matplotlib#29816)
  (8255ae2)
* [Update FreeType to
  2.13.3](matplotlib#29816)
  (89c054d)
* [Implement text shaping with
  libraqm](matplotlib#30000)
  (b0ded3a,
  9813523)
* [Add language parameter to Text
  objects](matplotlib#29794)
  (7ce8eae)
* [Fix auto-sized glyphs with BaKoMa
  fonts](matplotlib#29936)
  (3ba2c13)
* [pdf: Improve text with characters outside embedded font
  limits](matplotlib#30512)
  (b70fb88,
  6cedcf7)
* [Prepare `CharacterTracker` for advanced font
  features](matplotlib#30608)
  (8274e17,
  70dc388,
  df670cf,
  ed5e074)
* [Add font feature API to
  Text](matplotlib#29695)
  (972a688)
* [Fix spacing in r"$\max
  f$"](matplotlib#30715)
  (4a99a83)
* [Implement libraqm for vector
  outputs](matplotlib#30607)
  (bd17cd4)
* [Drop the FT2Font intermediate
  buffer](matplotlib#30059)
  (9d7d7b4)
* [Rasterize dvi files without
  dvipng](matplotlib#30039)
  (7627118)
* [Update bundled FreeType and HarfBuzz
  libraries](matplotlib#30938)
  (a161658,
  9619bcc)
* [Fix positioning of wide mathtext
  accents](matplotlib#31069)
  (c2fa7ba)
* [Refactor RendererAgg.draw_{mathtext,text,tex} to use same base
  algorithm](matplotlib#31085)
  (931bcf3)
* [Implement TeX's fraction and script
  alignment](matplotlib#31046)
  (94ff452,
  4bfa0f9,
  1cd8510)
* [Fix confusion between text height and ascent in metrics
  calculations](matplotlib#31107)
  (60f2310)
* [mathtext: Fetch quad width & axis height from font
  metrics](matplotlib#31110)
  (692df3f,
  383028b)
* [mathtext: add mathnormal and distinguish between normal and italic
  family](matplotlib#31121)
  (a6913f3)
* [ENH: Ignore empty text for
  tightbbox](matplotlib#31285)
  (d772043)
* [Drop axis_artist tickdir image compat, due to text-overhaul
  merge](matplotlib#31281)
  (2057583)
* [text: Use font metrics to determine line
  heights](matplotlib#31291)
  (3ab6a27,
  d961462,
  97f4943)
* [ps/pdf: Override font height metrics to support AFM
  files](matplotlib#31371)
  (e0913d4)
* [TST: Cleanup back-compat code in tests touched by text
  overhaul](matplotlib#31295)
  (7c33379)
* [TST: Set tests touched by text overhaul to mpl20
  style](matplotlib#31300)
  (41c4d8d)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Math fonts (Type 3) incorrectly embedded in PDF?

3 participants