Commit 9ad8a1b

[3.14] gh-134837: Correct and improve base85 documentation for base64 module (GH-145843) (GH-149743)
(cherry picked from commit e667d62)
Co-authored-by: David Huggins-Daines <dhd@ecolingui.ca>
1 parent 76ecef8 commit 9ad8a1b

2 files changed

Lines changed: 81 additions & 43 deletions

Doc/library/base64.rst

Lines changed: 59 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -16,8 +16,10 @@
1616
This module provides functions for encoding binary data to printable
1717
ASCII characters and decoding such encodings back to binary data.
1818
This includes the :ref:`encodings specified in <base64-rfc-4648>`
19-
:rfc:`4648` (Base64, Base32 and Base16)
20-
and the non-standard :ref:`Base85 encodings <base64-base-85>`.
19+
:rfc:`4648` (Base64, Base32 and Base16), the :ref:`Base85 encoding
20+
<base64-base-85>` specified in `PDF 2.0
21+
<https://pdfa.org/resource/iso-32000-2/>`_, and non-standard variants
22+
of Base85 used elsewhere.
2123

2224
There are two interfaces provided by this module. The modern interface
2325
supports encoding :term:`bytes-like objects <bytes-like object>` to ASCII
@@ -189,19 +191,28 @@ POST request.
189191
Base85 Encodings
190192
-----------------
191193

192-
Base85 encoding is not formally specified but rather a de facto standard,
193-
thus different systems perform the encoding differently.
194+
Base85 encoding is a family of algorithms which represent four bytes
195+
using five ASCII characters. Originally implemented in the Unix
196+
``btoa(1)`` utility, a version of it was later adopted by Adobe in the
197+
PostScript language and is standardized in PDF 2.0 (ISO 32000-2).
198+
This version, in both its ``btoa`` and PDF variants, is implemented by
199+
:func:`a85encode`.
194200

195-
The :func:`a85encode` and :func:`b85encode` functions in this module are two implementations of
196-
the de facto standard. You should call the function with the Base85
197-
implementation used by the software you intend to work with.
201+
A separate version, using a different output character set, was
202+
defined as an April Fools' joke in :rfc:`1924` but is now used by Git
203+
and other software. This version is implemented by :func:`b85encode`.
198204

199-
The two functions present in this module differ in how they handle the following:
205+
Finally, a third version, using yet another output character set
206+
designed for safe inclusion in programming language strings, is
207+
defined by ZeroMQ and implemented here by :func:`z85encode`.
200208

201-
* Whether to include enclosing ``<~`` and ``~>`` markers
202-
* Whether to include newline characters
203-
* The set of ASCII characters used for encoding
204-
* Handling of null bytes
209+
The functions present in this module differ in how they handle the following:
210+
211+
* Whether to include and expect enclosing ``<~`` and ``~>`` markers.
212+
* Whether to fold the output into multiple lines.
213+
* The set of ASCII characters used for encoding.
214+
* Compact encodings of sequences of spaces and null bytes.
215+
* The encoding of zero-padding bytes applied to the input.
205216

206217
Refer to the documentation of the individual functions for more information.
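
Reviewer note: the "four bytes to five ASCII characters" description above can be checked with a small sketch. The sample bytes are taken from the ZeroMQ Z85 test vector and are only an illustration; the expected strings in the comments were worked out by hand from the two alphabets.

```python
import base64

# One 4-byte group (the first half of the ZeroMQ Z85 test vector).
data = bytes([0x86, 0x4F, 0xD2, 0x6F])

a85 = base64.a85encode(data)  # Ascii85 alphabet ('!' through 'u')
b85 = base64.b85encode(data)  # RFC 1924 alphabet, as used by Git

# Same five base-85 digits, different output alphabets.
print(a85)  # b'L/669'
print(b85)  # b'hELLO'

# Only Ascii85 has a compact form for an all-zero group.
zero_a85 = base64.a85encode(b"\x00" * 4)  # b'z'
zero_b85 = base64.b85encode(b"\x00" * 4)  # b'00000'
```

Because the alphabets differ, data written by one function has to be decoded with its matching counterpart.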
207218

@@ -212,17 +223,22 @@ Refer to the documentation of the individual functions for more information.
212223

213224
*foldspaces* is an optional flag that uses the special short sequence 'y'
214225
instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This
215-
feature is not supported by the "standard" Ascii85 encoding.
226+
feature is not supported by the standard encoding used in PDF.
216227

217228
*wrapcol* controls whether the output should have newline (``b'\n'``)
218229
characters added to it. If this is non-zero, each output line will be
219230
at most this many characters long, excluding the trailing newline.
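
Reviewer note: a minimal sketch of *foldspaces* and *wrapcol* as described above. The inputs and the column width of 10 are arbitrary choices for illustration.

```python
import base64

# foldspaces: a group of four spaces collapses to the single byte b'y'.
spaces = b" " * 4
folded = base64.a85encode(spaces, foldspaces=True)  # b'y'
unfolded = base64.a85encode(spaces)                 # ordinary 5-character group

# wrapcol: split the output into lines of at most 10 characters.
data = b"x" * 20                                    # 5 groups -> 25 characters
wrapped = base64.a85encode(data, wrapcol=10)
lines = wrapped.split(b"\n")
print([len(line) for line in lines])                # [10, 10, 5]

# The default ignorechars of a85decode skips the inserted newlines.
restored = base64.a85decode(wrapped)
```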
220231

221-
*pad* controls whether the input is padded to a multiple of 4
222-
before encoding. Note that the ``btoa`` implementation always pads.
232+
*pad* controls whether zero-padding applied to the end of the input
233+
is fully retained in the output encoding, as done by ``btoa``,
234+
producing an exact multiple of 5 bytes of output. This is not part
235+
of the standard encoding used in PDF, as it does not preserve the
236+
length of the data.
223237

224-
*adobe* controls whether the encoded byte sequence is framed with ``<~``
225-
and ``~>``, which is used by the Adobe implementation.
238+
*adobe* controls whether the encoded byte sequence is framed with
239+
``<~`` and ``~>``, as in a PostScript base-85 string literal. Note
240+
that while ASCII85Decode streams in PDF documents *must* be
241+
terminated with ``~>``, they *must not* use a leading ``<~``.
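
Reviewer note: the effect of *pad* and *adobe* can be sketched as follows. The three-byte input is arbitrary, and slicing off the leading marker to get a PDF-style stream body is an assumption made for illustration, not something this module does for you.

```python
import base64

data = b"abc"  # 3 bytes, so one zero byte of padding is applied internally

short = base64.a85encode(data)             # partial group: 4 characters
padded = base64.a85encode(data, pad=True)  # padding retained: 5 characters

# Decoding the padded form returns the padding byte as part of the data.
from_short = base64.a85decode(short)       # b'abc'
from_padded = base64.a85decode(padded)     # b'abc\x00'

# adobe=True frames the output as a PostScript base-85 string literal.
framed = base64.a85encode(data, adobe=True)

# Assumption: for a PDF ASCII85Decode stream, drop the leading b'<~'
# and keep the trailing b'~>'.
pdf_body = framed[2:]
```

With `pad=True` the length of the original data is lost, so the application has to carry it separately if it matters.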
226242

227243
.. versionadded:: 3.4
228244

@@ -234,10 +250,12 @@ Refer to the documentation of the individual functions for more information.
234250

235251
*foldspaces* is a flag that specifies whether the 'y' short sequence
236252
should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20).
237-
This feature is not supported by the "standard" Ascii85 encoding.
253+
This feature is not supported by the standard Ascii85 encoding used in
254+
PDF and PostScript.
238255

239-
*adobe* controls whether the input sequence is in Adobe Ascii85 format
240-
(i.e. is framed with <~ and ~>).
256+
*adobe* controls whether the ``<~`` and ``~>`` markers are
257+
present. While the leading ``<~`` is not required, the input must
258+
end with ``~>``, or a :exc:`ValueError` is raised.
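
Reviewer note: a short sketch of the *adobe* decoding rules stated above. The payload is arbitrary.

```python
import base64

payload = base64.a85encode(b"hello")

with_both = b"<~" + payload + b"~>"  # PostScript-style string literal
trailer_only = payload + b"~>"       # PDF-style stream body

decoded_both = base64.a85decode(with_both, adobe=True)
decoded_trailer = base64.a85decode(trailer_only, adobe=True)

# Without the trailing b'~>' the decoder refuses the input.
try:
    base64.a85decode(payload, adobe=True)
except ValueError as exc:
    missing_trailer_error = exc
else:
    missing_trailer_error = None
```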
241259

242260
*ignorechars* should be a byte string containing characters to ignore
243261
from the input. This should only contain whitespace characters, and by
@@ -251,35 +269,40 @@ Refer to the documentation of the individual functions for more information.
251269
Encode the :term:`bytes-like object` *b* using base85 (as used in e.g.
252270
git-style binary diffs) and return the encoded :class:`bytes`.
253271

254-
If *pad* is true, the input is padded with ``b'\0'`` so its length is a
255-
multiple of 4 bytes before encoding.
272+
The input is padded with ``b'\0'`` so its length is a multiple of 4
273+
bytes before encoding. If *pad* is true, all the resulting
274+
characters are retained in the output, which will always be a
275+
multiple of 5 bytes, and thus the length of the data may not be
276+
preserved on decoding.
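
Reviewer note: the length caveat for *pad* can be seen in a minimal sketch with an arbitrary three-byte input.

```python
import base64

data = b"abc"

plain = base64.b85encode(data)             # 4 characters, length preserved
padded = base64.b85encode(data, pad=True)  # 5 characters, padding retained

from_plain = base64.b85decode(plain)       # b'abc'
from_padded = base64.b85decode(padded)     # b'abc\x00'
```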
256277

257278
.. versionadded:: 3.4
258279

259280

260281
.. function:: b85decode(b)
261282

262283
Decode the base85-encoded :term:`bytes-like object` or ASCII string *b* and
263-
return the decoded :class:`bytes`. Padding is implicitly removed, if
264-
necessary.
284+
return the decoded :class:`bytes`.
265285

266286
.. versionadded:: 3.4
267287

268288

269289
.. function:: z85encode(s)
270290

271291
Encode the :term:`bytes-like object` *s* using Z85 (as used in ZeroMQ)
272-
and return the encoded :class:`bytes`. See `Z85 specification
273-
<https://rfc.zeromq.org/spec/32/>`_ for more information.
292+
and return the encoded :class:`bytes`.
293+
294+
The `ZeroMQ specification <https://rfc.zeromq.org/spec/32/>`_
295+
requires the length of Z85-encoded data to be a multiple of 5
296+
bytes. To produce compliant data frames, you must pad the input
297+
data to this function to a multiple of 4 bytes.
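
Reviewer note: one possible way to satisfy the multiple-of-4 requirement is sketched below. The helper `pad_for_z85` is hypothetical, and zero-byte padding is an assumption; the ZeroMQ specification leaves the padding scheme to the application, which must also track the original length if it needs to strip the padding again.

```python
import base64


def pad_for_z85(data: bytes) -> bytes:
    """Hypothetical helper: zero-pad data to a multiple of 4 bytes."""
    return data + b"\x00" * (-len(data) % 4)


frame = pad_for_z85(b"hello")  # 5 bytes -> 8 bytes

# z85encode was added in Python 3.13, so guard the call on older versions.
if hasattr(base64, "z85encode"):
    encoded = base64.z85encode(frame)
    print(len(encoded))        # 10, a multiple of 5
else:
    encoded = None
```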
274298

275299
.. versionadded:: 3.13
276300

277301

278302
.. function:: z85decode(s)
279303

280304
Decode the Z85-encoded :term:`bytes-like object` or ASCII string *s* and
281-
return the decoded :class:`bytes`. See `Z85 specification
282-
<https://rfc.zeromq.org/spec/32/>`_ for more information.
305+
return the decoded :class:`bytes`.
283306

284307
.. versionadded:: 3.13
285308

@@ -352,3 +375,11 @@ recommended to review the security section for any code deployed to production.
352375
Section 5.2, "Base64 Content-Transfer-Encoding," provides the definition of the
353376
base64 encoding.
354377

378+
`ISO 32000-2 Portable document format - Part 2: PDF 2.0 <https://pdfa.org/resource/iso-32000-2/>`_
379+
Section 7.4.3, "ASCII85Decode Filter," provides the definition
380+
of the Ascii85 encoding used in PDF and PostScript, including
381+
the output character set and the details of data length preservation
382+
using zero-padding and partial output groups.
383+
384+
`ZeroMQ RFC 32/Z85 <https://rfc.zeromq.org/spec/32/>`_
385+
The "Formal Specification" section provides the character set used in Z85.

Lib/base64.py

Lines changed: 22 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -325,17 +325,20 @@ def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False):
325325
326326
foldspaces is an optional flag that uses the special short sequence 'y'
327327
instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This
328-
feature is not supported by the "standard" Adobe encoding.
328+
feature is not supported by the standard encoding used in PDF.
329329
330-
wrapcol controls whether the output should have newline (b'\\n') characters
331-
added to it. If this is non-zero, each output line will be at most this
332-
many characters long, excluding the trailing newline.
330+
If wrapcol is non-zero, insert newline (b'\\n') characters so that
331+
each output line is at most wrapcol characters long.
333332
334-
pad controls whether the input is padded to a multiple of 4 before
335-
encoding. Note that the btoa implementation always pads.
333+
pad controls whether zero-padding applied to the end of the input
334+
is fully retained in the output encoding, as done by btoa,
335+
producing an exact multiple of 5 bytes of output.
336+
337+
adobe controls whether the encoded byte sequence is framed with <~
338+
and ~>, as in a PostScript base-85 string literal. Note that
339+
while ASCII85Decode streams in PDF documents must be terminated
340+
with ~>, they must not use a leading <~.
336341
337-
adobe controls whether the encoded byte sequence is framed with <~ and ~>,
338-
which is used by the Adobe implementation.
339342
"""
340343
global _a85chars, _a85chars2
341344
# Delay the initialization of tables to not waste memory
@@ -364,12 +367,14 @@ def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False):
364367
def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'):
365368
"""Decode the Ascii85 encoded bytes-like object or ASCII string b.
366369
367-
foldspaces is a flag that specifies whether the 'y' short sequence should be
368-
accepted as shorthand for 4 consecutive spaces (ASCII 0x20). This feature is
369-
not supported by the "standard" Adobe encoding.
370+
foldspaces is a flag that specifies whether the 'y' short sequence
371+
should be accepted as shorthand for 4 consecutive spaces (ASCII
372+
0x20). This feature is not supported by the standard Ascii85
373+
encoding used in PDF and PostScript.
370374
371-
adobe controls whether the input sequence is in Adobe Ascii85 format (i.e.
372-
is framed with <~ and ~>).
375+
adobe controls whether the <~ and ~> markers are present. While
376+
the leading <~ is not required, the input must end with ~>, or a
377+
ValueError is raised.
373378
374379
ignorechars should be a byte string containing characters to ignore from the
375380
input. This should only contain whitespace characters, and by default
@@ -442,8 +447,10 @@ def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v'):
442447
def b85encode(b, pad=False):
443448
"""Encode bytes-like object b in base85 format and return a bytes object.
444449
445-
If pad is true, the input is padded with b'\\0' so its length is a multiple of
446-
4 bytes before encoding.
450+
The input is padded with b'\\0' so its length is a multiple of 4
451+
bytes before encoding. If pad is true, all the resulting
452+
characters are retained in the output, which will always be a
453+
multiple of 5 bytes.
447454
"""
448455
global _b85chars, _b85chars2
449456
# Delay the initialization of tables to not waste memory
