pdf: Improve text with characters outside embedded font limits by QuLogic · Pull Request #30512 · matplotlib/matplotlib

QuLogic · 2025-09-04T05:50:08Z

PR summary

For character codes outside the embedded font limits (256 for type 3 and 65536 for type 42), we output them as XObjects instead of using text commands. But there is nothing in the PDF spec that requires any specific encoding like this.

Since we now support subsetting all fonts before embedding, split each font into groups based on the maximum character code (e.g., 256-entry groups for type 3), then switch text strings to a different font subset and re-map character codes to it when necessary.

This means all text is true text (albeit with some strange encoding), and we no longer need any XObjects for glyphs. For users of non-English text, this means it will become selectable and copyable again.
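As a rough illustration of the splitting described above, the sketch below groups a string into runs that share a font subset and re-maps each character to a code local to that subset. It assumes a plain block-by-block split by character code; the function name `chunk_by_subset` is hypothetical and is not the `CharacterTracker` API.

```python
def chunk_by_subset(text, block_size=256):
    """Split text into runs of (subset_index, local_codes).

    Hypothetical helper: the subset index is the character code divided by
    block_size, and the local code is the remainder, so every local code
    fits within the font type's limit (256 for type 3, 65536 for type 42).
    """
    runs = []
    for char in text:
        subset, code = divmod(ord(char), block_size)
        if runs and runs[-1][0] == subset:
            # Same font subset as the previous character: extend the run.
            runs[-1][1].append(code)
        else:
            # A different subset: the output would switch fonts here.
            runs.append((subset, [code]))
    return runs


# 'Ā' (U+0100) is the first character past the type 3 limit.
print(chunk_by_subset("aĀb"))
```

Each run would correspond to one font switch followed by one text string in the PDF content stream.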

There are 3 steps to achieve this change:

1. Track both character codes and glyphs in CharacterTracker. This class takes care of splitting characters into subsets that fit the desired PDF font type limits. -> moved to pdf/ps: Track full character map in CharacterTracker #30566
2. Output each used font block as a separate subsetted font. Also change the subset prefix to use the glyph indices, which are unique, unlike the character codes. -> first commit here
3. Generate a ToUnicode dictionary for the subset font. We already did this for type 42 fonts, but the implementation was incorrect, as it did not handle non-BMP characters correctly. For type 3, ToUnicode support was added in PDF 1.2 and we produce 1.4, so it is available; the existing fallback to glyph names is inconsistent and probably depends on the original font having the right names. -> second commit here
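To make the non-BMP point concrete, here is a minimal sketch of how a code point is written as UTF-16BE hex for a ToUnicode `bfchar` entry. The helper names are hypothetical and this is not the code from the PR; it only shows why a fixed two-byte (UCS-2) encoding cannot represent characters such as emoji, which need a surrogate pair.

```python
def to_unicode_hex(codepoint):
    """Return the UTF-16BE encoding of a code point as uppercase hex."""
    return chr(codepoint).encode("utf-16-be").hex().upper()


def bfchar_entry(char_code, codepoint, code_width=2):
    """Format one '<source code> <Unicode value>' mapping line.

    code_width is the number of hex digits for the source character code
    (2 for single-byte codes; an assumption made for this sketch).
    """
    return f"<{char_code:0{code_width}X}> <{to_unicode_hex(codepoint)}>"


print(to_unicode_hex(0x41))          # BMP character: one 16-bit unit
print(to_unicode_hex(0x1F600))       # non-BMP character: surrogate pair
print(bfchar_entry(0x00, 0x1F600))
```

A BMP character produces four hex digits, while a non-BMP character produces eight, which is the case a UCS-2 writer gets wrong.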

In the future, we may wish to extend the implementation in CharacterTracker to "compress" the character map it produces. For example, if you use 255 characters with type 3, each from a different 256-sized block, you get 255 fonts, but we could compress that to a single font. I tried to avoid hard-coding any assumptions that the mapping is block-by-block, but it is possible that something slipped through, so I do not want to spend too much time on that right now.
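One possible shape for such a compression, sketched under the assumption that codes may be assigned in order of first use rather than by original character code. The function `compress_charmap` is hypothetical and does not exist in Matplotlib.

```python
def compress_charmap(text, block_size=256):
    """Assign each distinct character the next free (subset, code) slot.

    Hypothetical sketch: slots are filled in order of first use, so the
    number of subsets depends only on how many distinct characters are
    used, not on where they sit in Unicode.
    """
    mapping = {}
    for char in text:
        if char not in mapping:
            mapping[char] = divmod(len(mapping), block_size)
    return mapping


# 255 characters, each taken from a different 256-sized block.
sparse = "".join(chr(0x10000 + i * 256) for i in range(255))
naive_subsets = {ord(c) // 256 for c in sparse}
compressed_subsets = {subset for subset, _ in compress_charmap(sparse).values()}
print(len(naive_subsets), len(compressed_subsets))
```

With the block-by-block split this text needs 255 subsets, while the first-use assignment fits it into one.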

Formerly, with multi_font_type3.pdf (after adding the emoji to the test), copying the text in evince would produce:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !”#$%&’()*+,-./:;¡=¿?@[“]ˆ˙‘—–˝˜
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
in between!

and with multi_font_type42.pdf:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ
ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ
ǀǁǂǃǄǅǆǇǈǉǊǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ
ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰǱǲǳǴǵǶǷǸǹǺǻǼǽǾǿ
ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ
ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ
ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ
in between!

and now we get for both type 3 and 42:

There are basic characters
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz
0123456789 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
and accented characters
ÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğ
ĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿ
ŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞş
ŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſ
ƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟ
ƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿ
ǀǁǂǃǄǅǆǇǈǉǊǋǌǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟ
ǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰǱǲǳǴǵǶǷǸǹǺǻǼǽǾǿ
ȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟ
ȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿ
ɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏ😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏
in between!

Note how in the third line for type 3:

* the quotes are 'curly' instead of straight quotes
* the chevrons <> are inverted exclamation/question marks ¡¿
* the backslash \ is a curly opening double quote “
* the caret ^, underscore _, and tilde ~ are {circumflex, dot, tilde} accents/smaller glyphs ˆ˙˜
* the braces {} are em-dash and curly quotes —˝
* the pipe | is en-dash –

Everything from the seventh to second-last line is missing in type 3 since it's outside of the 256 limit, and all the emoji are missing from type 42 since that's outside the 65536 limit.
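The missing ranges follow directly from the character codes. The small check below uses the limits quoted in the PR summary; the helper name `out_of_range` is hypothetical.

```python
# Character code limits quoted in the PR summary, keyed by font type.
LIMITS = {3: 256, 42: 65536}


def out_of_range(text, fonttype):
    """Return the characters whose code does not fit the font type limit."""
    limit = LIMITS[fonttype]
    return [char for char in text if ord(char) >= limit]


sample = "AÿĀɏ😀"
print(out_of_range(sample, 3))   # characters past the type 3 limit
print(out_of_range(sample, 42))  # characters past the type 42 limit
```

Characters up to `ÿ` (U+00FF) fit in type 3, `Ā` (U+0100) onwards does not, and only the emoji exceed the type 42 limit.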

~~This depends on #30520, #30335, #30566, and #30567.~~

PR checklist

"closes #0000" is in the body of the PR description to link the related issue
new and changed code is tested
[n/a] Plotting related features are demonstrated in an example
New Features and API Changes are noted with a directive and release note
Documentation complies with general and docstring guidelines

anntzer · 2025-09-04T09:16:30Z

This is great and would also allow getting rid of _get_pdf_charprocs. I'll try to have a look at #30335 to start...

anntzer · 2025-09-05T09:01:13Z

The first two commits (the loop merge and the Type3 encoding change) seem independent from the rest (even from the switch to glyph index tracking) and could be merged first via a separate PR? (I can probably approve them right away.)
I still need to properly review the next one (charmap tracking) but that can also come next by itself?

QuLogic · 2025-09-06T00:04:07Z

I split the type3 encoding to #30520, but the loop merge has conflicts with the glyph index change.

lib/matplotlib/backends/_backend_pdf_ps.py

QuLogic · 2025-09-16T06:59:52Z

> The first two commits (the loop merge and the Type3 encoding change) seem independent from the rest (even from the switch to glyph index tracking) and could be merged first via a separate PR? (I can probably approve them right away.)

Split the loop merge as well.

lib/matplotlib/backends/backend_pdf.py

For character codes outside the embedded font limits (256 for type 3 and 65536 for type 42), we output them as XObjects instead of using text commands. But there is nothing in the PDF spec that requires any specific encoding like this. Since we now support subsetting all fonts before embedding, split each font into groups based on the maximum character code (e.g., 256-entry groups for type 3), then switch text strings to a different font subset and re-map character codes to it when necessary. This means all text is true text (albeit with some strange encoding), and we no longer need any XObjects for glyphs. For users of non-English text, this means it will become selectable and copyable again. Fixes matplotlib#21797

QuLogic · 2025-09-26T00:09:52Z

Rebased without images (moved to text-overhaul-figures branch) so that it can be merged.

pdf: Improve text with characters outside embedded font limits

This includes images changes for the following pull requests / commits:

* [Fix center of rotation with rotation_mode='anchor'](matplotlib#29199) (c44db77)
* [Remove ttconv backwards-compatibility code](matplotlib#30145) (8caff88)
* [Remove kerning_factor from tests](matplotlib#29816) (7b4d725)
* [Set text hinting to defaults](matplotlib#29816) (8255ae2)
* [Update FreeType to 2.13.3](matplotlib#29816) (89c054d)
* [Implement text shaping with libraqm](matplotlib#30000) (b0ded3a, 9813523)
* [Add language parameter to Text objects](matplotlib#29794) (7ce8eae)
* [Fix auto-sized glyphs with BaKoMa fonts](matplotlib#29936) (3ba2c13)
* [pdf: Improve text with characters outside embedded font limits](matplotlib#30512) (b70fb88, 6cedcf7)
* [Prepare `CharacterTracker` for advanced font features](matplotlib#30608) (8274e17, 70dc388, df670cf, ed5e074)
* [Add font feature API to Text](matplotlib#29695) (972a688)
* [Fix spacing in r"$\max f$"](matplotlib#30715) (4a99a83)
* [Implement libraqm for vector outputs](matplotlib#30607) (bd17cd4)
* [Drop the FT2Font intermediate buffer](matplotlib#30059) (9d7d7b4)
* [Rasterize dvi files without dvipng](matplotlib#30039) (7627118)
* [Update bundled FreeType and HarfBuzz libraries](matplotlib#30938) (a161658, 9619bcc)
* [Fix positioning of wide mathtext accents](matplotlib#31069) (c2fa7ba)
* [Refactor RendererAgg.draw_{mathtext,text,tex} to use same base algorithm](matplotlib#31085) (931bcf3)
* [Implement TeX's fraction and script alignment](matplotlib#31046) (94ff452, 4bfa0f9, 1cd8510)
* [Fix confusion between text height and ascent in metrics calculations](matplotlib#31107) (60f2310)
* [mathtext: Fetch quad width & axis height from font metrics](matplotlib#31110) (692df3f, 383028b)
* [mathtext: add mathnormal and distinguish between normal and italic family](matplotlib#31121) (a6913f3)
* [ENH: Ignore empty text for tightbbox](matplotlib#31285) (d772043)
* [Drop axis_artist tickdir image compat, due to text-overhaul merge](matplotlib#31281) (2057583)
* [text: Use font metrics to determine line heights](matplotlib#31291) (3ab6a27, d961462, 97f4943)
* [ps/pdf: Override font height metrics to support AFM files](matplotlib#31371) (e0913d4)
* [TST: Cleanup back-compat code in tests touched by text overhaul](matplotlib#31295) (7c33379)
* [TST: Set tests touched by text overhaul to mpl20 style](matplotlib#31300) (41c4d8d)

QuLogic added this to the v3.11.0 milestone Sep 4, 2025

QuLogic added this to Font and text overhaul Sep 4, 2025

QuLogic added the status: waiting for other PR label Sep 4, 2025

github-project-automation bot moved this to Waiting for other PR in Font and text overhaul Sep 4, 2025

github-actions bot added topic: text backend: ps backend: pdf backend: svg backend: cairo topic: text/mathtext labels Sep 4, 2025

QuLogic force-pushed the pdf-text-subsets branch from 7ffffb5 to 3fc92f4 Compare September 4, 2025 06:06

QuLogic mentioned this pull request Sep 4, 2025

TST: Remove redundant font tests #30513

Merged

1 task

QuLogic mentioned this pull request Sep 4, 2025

Use glyph indices for font tracking in vector formats #30335

Merged

1 task

QuLogic mentioned this pull request Sep 6, 2025

pdf: Simplify Type 3 font character encoding #30520

Merged

1 task

github-actions bot added the status: needs rebase label Sep 8, 2025

QuLogic force-pushed the pdf-text-subsets branch from 3fc92f4 to 275fb16 Compare September 16, 2025 05:46

github-actions bot removed the status: needs rebase label Sep 16, 2025

anntzer reviewed Sep 16, 2025

lib/matplotlib/backends/_backend_pdf_ps.py (outdated)

anntzer reviewed Sep 16, 2025

lib/matplotlib/backends/_backend_pdf_ps.py

This was referenced Sep 16, 2025

pdf/ps: Track full character map in CharacterTracker #30566

Merged

pdf: Merge loops for single byte text chunk output #30567

Merged

QuLogic force-pushed the pdf-text-subsets branch from 275fb16 to 60f3a4f Compare September 16, 2025 06:59

anntzer reviewed Sep 16, 2025

lib/matplotlib/backends/backend_pdf.py (outdated)

anntzer reviewed Sep 16, 2025

lib/matplotlib/backends/backend_pdf.py

QuLogic force-pushed the pdf-text-subsets branch from 60f3a4f to af3ea7f Compare September 17, 2025 01:38

github-actions bot removed the topic: text label Sep 17, 2025

github-actions bot removed backend: svg backend: cairo topic: text/mathtext labels Sep 17, 2025

QuLogic linked an issue Sep 17, 2025 that may be closed by this pull request

[Bug]: Math fonts (Type 3) incorrectly embedded in PDF? #21797

Closed

QuLogic force-pushed the pdf-text-subsets branch from af3ea7f to e86ca1e Compare September 19, 2025 07:01

QuLogic removed the status: waiting for other PR label Sep 19, 2025

QuLogic moved this from Waiting for other PR to Ready for Review in Font and text overhaul Sep 19, 2025

QuLogic force-pushed the pdf-text-subsets branch from e86ca1e to ad319c7 Compare September 19, 2025 07:34

QuLogic marked this pull request as ready for review September 19, 2025 07:36

github-actions bot added the status: needs rebase label Sep 20, 2025

QuLogic force-pushed the pdf-text-subsets branch from ad319c7 to cf9aff6 Compare September 22, 2025 21:20

github-actions bot removed the status: needs rebase label Sep 22, 2025

tacaswell approved these changes Sep 25, 2025

anntzer approved these changes Sep 25, 2025

QuLogic added 4 commits September 25, 2025 19:05

pdf: Correct Unicode mapping for out-of-range font chunks

1c4af68

For Type 3 fonts, add a `ToUnicode` mapping (which was added in PDF 1.2), and for Type 42 fonts, correct the Unicode encoding, which should be UTF-16BE, not UCS2.

TST: Add emoji to multi-font text

6cedcf7

These characters are outside the BMP and should test subset splitting for type 42 output in PDF.

DOC: Add a release note for PDF font embedding fixes

c908bbf

QuLogic force-pushed the pdf-text-subsets branch from cf9aff6 to c908bbf Compare September 26, 2025 00:08

QuLogic merged commit a1ed4ef into matplotlib:text-overhaul Sep 26, 2025
34 of 35 checks passed

github-project-automation bot moved this from Ready for Review to Done in Font and text overhaul Sep 26, 2025

QuLogic deleted the pdf-text-subsets branch September 26, 2025 01:49

QuLogic mentioned this pull request Oct 3, 2025

Remove forced fallback from FT2Font::load_char #30627

Merged

1 task

wavebyrd pushed a commit to wavebyrd/matplotlib that referenced this pull request Mar 13, 2026

Merge pull request matplotlib#30512 from QuLogic/pdf-text-subsets

2ad9be1

pdf: Improve text with characters outside embedded font limits

QuLogic mentioned this pull request Apr 10, 2026

Font and text overhaul #30161

Merged

4 tasks

QuLogic merged 4 commits into matplotlib:text-overhaul from QuLogic:pdf-text-subsets
