Commit 5667d2b

Improved base codecs
1 parent c24ab65 commit 5667d2b

12 files changed

Lines changed: 359 additions & 226 deletions

README.md

Lines changed: 31 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -211,14 +211,33 @@ o
211211

212212
## :page_with_curl: List of codecs
213213

214-
#### BaseXX
215-
216-
- [X] `baseN`: see [base encodings](https://python-codext.readthedocs.io/en/latest/enc/base.html) (incl [z]base32, 36, 45, 58, 62, 63, 64, [z]85, 91, 100, 122)
214+
#### [BaseXX](https://python-codext.readthedocs.io/en/latest/enc/base.html)
215+
216+
- [X] `base1`: useless, but for the sake of completeness
217+
- [X] `base2`: simple conversion to binary (with a variant with a reversed alphabet)
218+
- [X] `base3`: conversion to ternary (with a variant with a reversed alphabet)
219+
- [X] `base4`: conversion to quaternary (with a variant with a reversed alphabet)
220+
- [X] `base8`: simple conversion to octal (with a variant with a reversed alphabet)
221+
- [X] `base10`: simple conversion to decimal
222+
- [X] `base16`: simple conversion to hexadecimal (with a variant holding an alphabet with digits and letters inverted)
223+
- [X] `base26`: conversion to alphabet letters
224+
- [X] `base32`: classical conversion according to the RFC4648 with all its variants ([zbase32](https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt), extended hexadecimal, [geohash](https://en.wikipedia.org/wiki/Geohash), [Crockford](https://www.crockford.com/base32.html))
225+
- [X] `base36`: [Base36](https://en.wikipedia.org/wiki/Base36) conversion to letters and digits (with a variant inverting both groups)
226+
- [X] `base45`: [Base45](https://datatracker.ietf.org/doc/html/draft-faltstrom-base45-04.txt) DRAFT algorithm (with a variant inverting letters and digits)
227+
- [X] `base58`: multiple versions of [Base58](https://en.bitcoinwiki.org/wiki/Base58) (bitcoin, flickr, ripple)
228+
- [X] `base62`: [Base62](https://en.wikipedia.org/wiki/Base62) conversion to lower- and uppercase letters and digits (with a variant with letters and digits inverted)
229+
- [X] `base63`: similar to `base62` with the "`_`" added
230+
- [X] `base64`: classical conversion according to RFC4648 with its variant URL (or *file*) (it also holds a variant with letters and digits inverted)
231+
- [X] `base67`: custom conversion using some more special characters (also with a variant with letters and digits inverted)
232+
- [X] `base85`: all variants of Base85 ([Ascii85](https://fr.wikipedia.org/wiki/Ascii85), [z85](https://rfc.zeromq.org/spec/32), [Adobe](https://dencode.com/string/ascii85), [(x)btoa](https://dencode.com/string/ascii85), [RFC1924](https://datatracker.ietf.org/doc/html/rfc1924), [XML](https://datatracker.ietf.org/doc/html/draft-kwiatkowski-base85-for-xml-00))
233+
- [X] `base91`: [Base91](http://base91.sourceforge.net) custom conversion
234+
- [X] `base100` (or *emoji*): [Base100](https://github.com/AdamNiederer/base100) custom conversion
235+
- [X] `base122`: [Base122](http://blog.kevinalbs.com/base122) custom conversion
217236
- [X] `base-genericN`: see [base encodings](https://python-codext.readthedocs.io/en/latest/enc/base.html) ; supports any possible base
218237

219238
This category also contains `ascii85`, `adobe`, `[x]btoa`, `zeromq` with the `base85` codec.
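
For comparison, a few of the encodings listed above also exist in Python's standard `base64` module. The sketch below uses only the standard library, not codext, so it says nothing about codext's codec names or variant handling; it only illustrates what Base16/32/64/85 output looks like for a small input.

```python
import base64

data = b"test"

# RFC 4648 encodings available in the standard library
hex_form = base64.b16encode(data)                 # b"74657374"
b32_form = base64.b32encode(data)                 # b"ORSXG5A="
b64_form = base64.b64encode(data)                 # b"dGVzdA=="
# URL-safe variant: "-" and "_" replace "+" and "/"
url_form = base64.urlsafe_b64encode(b"\xfb\xff")  # b"-_8="

# Two Base85 flavours: Ascii85 and the alphabet used by git-style binary diffs
a85_form = base64.a85encode(data)
b85_form = base64.b85encode(data)

# Every encoding must decode back to the original bytes
pairs = [
    (hex_form, base64.b16decode),
    (b32_form, base64.b32decode),
    (b64_form, base64.b64decode),
    (a85_form, base64.a85decode),
    (b85_form, base64.b85decode),
]
for encoded, decoder in pairs:
    assert decoder(encoded) == data
```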
220239

221-
#### Binary
240+
#### [Binary](https://python-codext.readthedocs.io/en/latest/enc/binary.html)
222241

223242
- [X] `baudot`: supports CCITT-1, CCITT-2, EU/FR, ITA1, ITA2, MTK-2 (Python3 only), UK, ...
224243
- [X] `baudot-spaced`: variant of `baudot` ; groups of 5 bits are whitespace-separated
@@ -232,17 +251,17 @@ This category also contains `ascii85`, `adobe`, `[x]btoa`, `zeromq` with the `ba
232251
- [X] `manchester-inverted`: variant of `manchester` ; XORes each bit of the input with `10`
233252
- [X] `rotateN`: rotates characters by the specified number of bits (*N* belongs to [1, 7] ; Python 3 only)
234253

235-
#### Common
254+
#### [Common](https://python-codext.readthedocs.io/en/latest/enc/common.html)
236255

237256
- [X] `a1z26`: keeps words whitespace-separated and uses a custom character separator
238257
- [X] `cases`: set of case-related encodings (including camel-, kebab-, lower-, pascal-, upper-, snake- and swap-case, slugify, capitalize, title)
239-
- [X] `dummy`: set of simple encodings (including replace, reverse, word-reverse, substitute and strip-spaces)
258+
- [X] `dummy`: set of simple encodings (including integer, replace, reverse, word-reverse, substitute and strip-spaces)
240259
- [X] `octal`: dummy octal conversion (converts to 3-digits groups)
241260
- [X] `octal-spaced`: variant of `octal` ; dummy octal conversion, handling whitespace separators
242261
- [X] `ordinal`: dummy character ordinals conversion (converts to 3-digits groups)
243262
- [X] `ordinal-spaced`: variant of `ordinal` ; dummy character ordinals conversion, handling whitespace separators
244263

245-
#### Compression
264+
#### [Compression](https://python-codext.readthedocs.io/en/latest/enc/compressions.html)
246265

247266
- [X] `gzip`: standard Gzip compression/decompression
248267
- [X] `lz77`: compresses the given data with the algorithm of Lempel and Ziv of 1977
@@ -253,7 +272,7 @@ This category also contains `ascii85`, `adobe`, `[x]btoa`, `zeromq` with the `ba
253272

254273
> :warning: Compression functions are of course definitely **NOT** encoding functions ; they are implemented for leveraging the `.encode(...)` API from `codecs`.
255274
256-
#### Cryptography
275+
#### [Cryptography](https://python-codext.readthedocs.io/en/latest/enc/crypto.html)
257276

258277
- [X] `affine`: aka Affine Cipher
259278
- [X] `atbash`: aka Atbash Cipher
@@ -268,7 +287,7 @@ This category also contains `ascii85`, `adobe`, `[x]btoa`, `zeromq` with the `ba
268287

269288
> :warning: Crypto functions are of course definitely **NOT** encoding functions ; they are implemented for leveraging the `.encode(...)` API from `codecs`.
270289
271-
#### Hashing
290+
#### [Hashing](https://python-codext.readthedocs.io/en/latest/enc/hashing.html)
272291

273292
- [X] `blake`: includes BLAKE2b and BLAKE2s (Python 3 only ; relies on `hashlib`)
274293
- [X] `checksums`: includes Adler32 and CRC32 (relies on `zlib`)
@@ -279,7 +298,7 @@ This category also contains `ascii85`, `adobe`, `[x]btoa`, `zeromq` with the `ba
279298

280299
> :warning: Hash functions are of course definitely **NOT** encoding functions ; they are implemented for convenience with the `.encode(...)` API from `codecs` and useful for chaining codecs.
281300
282-
#### Languages
301+
#### [Languages](https://python-codext.readthedocs.io/en/latest/enc/languages.html)
283302

284303
- [X] `braille`: well-known braille language (Python 3 only)
285304
- [X] `ipsum`: aka lorem ipsum
@@ -293,15 +312,15 @@ This category also contains `ascii85`, `adobe`, `[x]btoa`, `zeromq` with the `ba
293312
- [X] `tap`: converts text to tap/knock code, commonly used by prisoners
294313
- [X] `tomtom`: similar to `morse`, using slashes and backslashes
295314

296-
#### Others
315+
#### [Others](https://python-codext.readthedocs.io/en/latest/enc/others.html)
297316

298317
- [X] `dna`: implements the 8 rules of DNA sequences (N belongs to [1,8])
299318
- [X] `html`: implements entities according to [this reference](https://dev.w3.org/html5/html-author/charref)
300319
- [X] `letter-indices`: encodes consonants and/or vowels with their corresponding indices
301320
- [X] `markdown`: unidirectional encoding from Markdown to HTML
302321
- [X] `url`: aka URL encoding
303322

304-
#### Steganography
323+
#### [Steganography](https://python-codext.readthedocs.io/en/latest/enc/stegano.html)
305324

306325
- [X] `hexagram`: uses Base64 and encodes the result to a charset of [I Ching hexagrams](https://en.wikipedia.org/wiki/Hexagram_%28I_Ching%29) (as implemented [here](https://github.com/qntm/hexagram-encode))
307326
- [X] `klopf`: aka Klopf code ; Polybius square with trivial alphabetical distribution
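
The warnings in this README note that compression, crypto and hash functions are exposed through the `.encode(...)` API from `codecs` even though they are not encodings. As a minimal sketch of how an arbitrary string transformation can be plugged into that API, the example below registers a toy codec with the standard library. The codec name `reversedemo` and the helper functions are hypothetical and are not part of codext.

```python
import codecs


def _reverse(text, errors="strict"):
    # Codec functions return a tuple: (output, number of input items consumed)
    return text[::-1], len(text)


def _search(name):
    # The codecs machinery calls this with a normalized (lowercased) name
    if name == "reversedemo":
        return codecs.CodecInfo(name="reversedemo", encode=_reverse, decode=_reverse)
    return None  # let other search functions handle unknown names


codecs.register(_search)

encoded = codecs.encode("hello", "reversedemo")
decoded = codecs.decode(encoded, "reversedemo")
print(encoded, decoded)
```

Because the search function returns `None` for names it does not know, it coexists with any other registered codecs.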

codext/base/__init__.py

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
22
from argparse import ArgumentParser, RawTextHelpFormatter
33
from types import MethodType
44

5-
from .ascii85 import *
65
from .base45 import *
76
from .base85 import *
87
from .base91 import *

codext/base/_base.py

Lines changed: 51 additions & 29 deletions
@@ -10,9 +10,13 @@
1010
from types import FunctionType, MethodType
1111

1212
from ..__common__ import *
13+
from ..__common__ import _set_exc
1314
from ..__info__ import __version__
1415

1516

17+
_set_exc("BaseError")
18+
_set_exc("BaseEncodeError")
19+
_set_exc("BaseDecodeError")
1620
"""
1721
Curve fitting:
1822
@@ -44,18 +48,7 @@
4448
[ 0.02827357 0.00510124 -0.99999984 0.01536941]
4549
"""
4650
EXPANSION_FACTOR = lambda base: 0.02827357 / (base**0.00510124-0.99999984) + 0.01536941
47-
48-
49-
class BaseError(ValueError):
50-
pass
51-
52-
53-
class BaseDecodeError(BaseError):
54-
pass
55-
56-
57-
class BaseEncodeError(BaseError):
58-
pass
51+
SIZE_LIMIT = 1024 * 1024 * 1024
5952

6053

6154
def _generate_charset(n):
@@ -95,22 +88,27 @@ def _get_charset(charset, p=""):
9588
except KeyError:
9689
pass
9790
# or handle [p]arameter as a pattern
98-
default, n = None, None
91+
default, n, best = None, None, None
9992
for pattern, cset in charset.items():
10093
n = len(cset)
101-
if pattern == "":
94+
if re.match(pattern, ""):
10295
default = cset
10396
continue
104-
if re.match(pattern, p):
105-
return cset
97+
m = re.match(pattern, p)
98+
if m: # find the longest match from the patterns
99+
s, e = m.span()
100+
if e - s > len(best or ""):
101+
best = pattern
102+
if best:
103+
return charset[best]
106104
# special case: the given [p]arameter can be the charset itself if it has the right length
107105
p = re.sub(r"^[-_]+", "", p)
108106
if len(p) == n:
109107
return p
110108
# or simply rely on key ''
111109
if default is not None:
112110
return default
113-
raise ValueError("Bad charset descriptor")
111+
raise ValueError("Bad charset descriptor ('%s')" % p)
114112

115113

116114
# generic base en/decoding functions
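
The `_get_charset` hunk above replaces "return the first pattern that matches" with "return the pattern with the longest match". A minimal standalone sketch of that idea follows. The function name `pick_charset` and the sample patterns are hypothetical, and unlike the diff it tracks the best match length in its own variable.

```python
import re


def pick_charset(charsets, param):
    """Return the charset whose pattern matches the longest prefix of param."""
    default, best, best_len = None, None, 0
    for pattern, cset in charsets.items():
        if re.match(pattern, ""):
            # A pattern that matches the empty string acts as the default
            default = cset
            continue
        m = re.match(pattern, param)
        if m and m.end() - m.start() > best_len:
            best, best_len = cset, m.end() - m.start()
    if best is not None:
        return best
    if default is not None:
        return default
    raise ValueError("Bad charset descriptor ('%s')" % param)


# Hypothetical pattern table, for illustration only
charsets = {
    r"": "0123456789abcdef",
    r"[-_]inv": "abcdef0123456789",
    r"[-_]inv(erted)?[-_]upper": "ABCDEF0123456789",
}

print(pick_charset(charsets, "-inverted-upper"))
```

With first-match semantics, `-inverted-upper` would have been captured by the shorter `[-_]inv` pattern; the longest-match rule selects the more specific entry instead.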
@@ -123,6 +121,12 @@ def base_encode(input, charset, errors="strict", exc=BaseEncodeError):
123121
:param exc: exception to be raised in case of error
124122
"""
125123
i, n, r = input if isinstance(input, integer_types) else s2i(input), len(charset), ""
124+
if n == 1:
125+
if i > SIZE_LIMIT:
126+
raise InputSizeLimitError("Input exceeded size limit")
127+
return i * charset[0]
128+
if n == 10:
129+
return str(i) if charset == digits else "".join(charset[int(x)] for x in str(i))
126130
while i > 0:
127131
i, c = divmod(i, n)
128132
r = charset[c] + r
@@ -138,11 +142,15 @@ def base_decode(input, charset, errors="strict", exc=BaseDecodeError):
138142
:param exc: exception to be raised in case of error
139143
"""
140144
i, n, dec = 0, len(charset), lambda n: base_encode(n, [chr(x) for x in range(256)], errors, exc)
145+
if n == 1:
146+
return i2s(len(input))
147+
if n == 10:
148+
return i2s(int(input)) if charset == digits else "".join(str(charset.index(c)) for c in input)
141149
for k, c in enumerate(input):
142150
try:
143151
i = i * n + charset.index(c)
144152
except ValueError:
145-
handle_error("base", errors, exc, decode=True)(c, k, dec(i))
153+
handle_error("base", errors, exc, decode=True)(c, k, dec(i), "base%d" % n)
146154
return dec(i)
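
The two hunks above add shortcuts for a one-character alphabet (unary) next to the generic divide-and-accumulate loop. The following sketch shows the same structure in isolation. It is an approximation under stated assumptions: `int.from_bytes` and `int.to_bytes` stand in for codext's `s2i` and `i2s` helpers (which are not shown in this diff), a plain `ValueError` replaces the dedicated exceptions, and the size limit is lowered to keep the demo cheap.

```python
SIZE_LIMIT = 1024 * 1024  # assumption: far smaller than the limit in the diff


def encode(data, charset):
    """Bytes -> integer -> digits taken from charset."""
    i, n, out = int.from_bytes(data, "big"), len(charset), ""
    if n == 1:
        # Unary: the integer value becomes the number of repeated symbols
        if i > SIZE_LIMIT:
            raise ValueError("Input exceeded size limit")
        return charset[0] * i
    while i > 0:
        i, c = divmod(i, n)
        out = charset[c] + out
    return out


def decode(text, charset):
    """Digits taken from charset -> integer -> bytes."""
    n = len(charset)
    if n == 1:
        i = len(text)
    else:
        i = 0
        for pos, ch in enumerate(text):
            try:
                i = i * n + charset.index(ch)
            except ValueError:
                msg = "'base%d' codec can't decode character '%s' in position %d"
                raise ValueError(msg % (n, ch, pos)) from None
    return i.to_bytes((i.bit_length() + 7) // 8, "big")


print(encode(b"A", "0123456789"))  # the byte 0x41 as a decimal number
print(len(encode(b"A", "*")))      # unary output length equals the integer value
print(decode(encode(b"hi", "01"), "01"))
```

Since the input is treated as one big integer, leading zero bytes do not survive a round trip in this sketch, and the unary form grows with the numeric value of the input rather than its length, which is why a size limit is needed at all.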
147155

148156

@@ -162,15 +170,19 @@ def base(charset, pattern, pow2=False, encode_template=base_encode, decode_templ
162170
raise BaseError("Bad charset ; {} is not a power of 2".format(n))
163171

164172
def encode(param="", *args):
165-
a = _get_charset(charset, param)
173+
a = _get_charset(charset, args[0] if len(args) > 0 and args[0] else param)
166174
def _encode(input, errors="strict"):
175+
if len(input) == 0:
176+
return "", 0
167177
return encode_template(input, a, errors), len(input)
168178
return _encode
169179

170180
def decode(param="", *args):
171-
a = _get_charset(charset, param)
181+
a = _get_charset(charset, args[0] if len(args) > 0 and args[0] else param)
172182
sl, sc = "\n" not in a, "\n" not in a and not "\r" in a
173183
def _decode(input, errors="strict"):
184+
if len(input) == 0:
185+
return "", 0
174186
input = _stripl(input, sc, sl)
175187
return decode_template(input, a, errors), len(input)
176188
return _decode
@@ -205,10 +217,14 @@ def _decode(input, errors="strict"):
205217
expansion_factor=lambda f, n: (EXPANSION_FACTOR(int(n.split("-")[0][4:])), .05))
206218

207219

208-
def main(n, ref=None, alt=None, inv=True):
220+
def main(n, ref=None, alt=None, inv=True, swap=True):
209221
base = str(n) + ("-" + alt.lstrip("-") if alt else "")
210222
src = "The data are encoded as described for the base%(base)s alphabet in %(reference)s.\n" % \
211223
{'base': base, 'reference': "\n" + ref if len(ref) > 20 else ref} if ref else ""
224+
text = "%(source)sWhen decoding, the input may contain newlines in addition to the bytes of the formal base" \
225+
"%(base)s alphabet. Use --ignore-garbage to attempt to recover from any other non-alphabet bytes in the" \
226+
" encoded stream." % {'base': base, 'source': src}
227+
text = "\n".join(x for x in wrap(text, 74))
212228
descr = """Usage: base%(base)s [OPTION]... [FILE]
213229
Base%(base)s encode or decode FILE, or standard input, to standard output.
214230
@@ -217,20 +233,19 @@ def main(n, ref=None, alt=None, inv=True):
217233
Mandatory arguments to long options are mandatory for short options too.
218234
-d, --decode decode data
219235
-i, --ignore-garbage when decoding, ignore non-alphabet characters
220-
%(inv)s -w, --wrap=COLS wrap encoded lines after COLS character (default 76).
236+
%(inv)s%(swap)s -w, --wrap=COLS wrap encoded lines after COLS character (default 76).
221237
Use 0 to disable line wrapping
222238
223239
--help display this help and exit
224240
--version output version information and exit
225241
226-
%(source)sWhen decoding, the input may contain newlines in addition to the bytes of
227-
the formal base%(base)s alphabet. Use --ignore-garbage to attempt to recover
228-
from any other non-alphabet bytes in the encoded stream.
242+
%(text)s
229243
230244
Report base%(base)s translation bugs to <https://github.com/dhondta/python-codext/issues/new>
231245
Full documentation at: <https://python-codext.readthedocs.io/en/latest/enc/base.html>
232-
""" % {'base': base, 'source': src,
233-
'inv': ["", " -I, --invert invert charsets from the base alphabet (e.g. lower- and uppercase)\n"][inv]}
246+
""" % {'base': base, 'text': text,
247+
'inv': ["", " -I, --invert invert charsets from the base alphabet (e.g. digits and letters)\n"][inv],
248+
'swap': ["", " -s, --swapcase swap the case\n"][swap]}
234249

235250
def _main():
236251
p = ArgumentParser(description=descr, formatter_class=RawTextHelpFormatter, add_help=False)
@@ -240,6 +255,8 @@ def _main():
240255
p.add_argument("-i", "--ignore-garbage", action="store_true")
241256
if inv:
242257
p.add_argument("-I", "--invert", action="store_true")
258+
if swap:
259+
p.add_argument("-s", "--swapcase", action="store_true")
243260
p.add_argument("-w", "--wrap", type=int, default=76)
244261
p.add_argument("--help", action="help")
245262
p.add_argument("--version", action="version")
@@ -249,14 +266,19 @@ def _main():
249266
args.wrap = 0
250267
args.invert = getattr(args, "invert", False)
251268
c, f = _input(args.file), [encode, decode][args.decode]
252-
c = c.rstrip("\r\n") if isinstance(c, str) else c.rstrip(b"\r\n")
269+
if swap and args.decode:
270+
c = codecs.decode(c, "swapcase")
271+
c = b(c).rstrip(b"\r\n")
253272
try:
254273
c = f(c, "base" + base + ["", "-inv"][getattr(args, "invert", False)],
255274
["strict", "ignore"][args.ignore_garbage])
256275
except Exception as err:
257276
print("%sbase%s: invalid input" % (getattr(err, "output", ""), base))
258277
return 1
259-
for l in (wrap(ensure_str(c), args.wrap) if args.wrap > 0 else [ensure_str(c)]):
278+
c = ensure_str(c)
279+
if swap and not args.decode:
280+
c = codecs.encode(c, "swapcase")
281+
for l in (wrap(c, args.wrap) if args.wrap > 0 else [c]):
260282
print(l)
261283
return 0
262284
return _main
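
The `main` changes above add an optional `-s/--swapcase` switch and apply the case swap to the input before decoding and to the output after encoding. The sketch below reproduces that ordering with the standard library only. It is not the codext implementation: Base64 from the `base64` module stands in for the codec, `build_parser` and `run` are hypothetical names, and the swap is keyed on the parsed flag, which is an assumption about the intended behaviour.

```python
import argparse
import base64


def build_parser(inv=True, swap=True):
    # Options are only registered when the corresponding feature is enabled
    p = argparse.ArgumentParser(add_help=False)
    p.add_argument("-d", "--decode", action="store_true")
    if inv:
        p.add_argument("-I", "--invert", action="store_true")
    if swap:
        p.add_argument("-s", "--swapcase", action="store_true")
    p.add_argument("-w", "--wrap", type=int, default=76)
    return p


def run(argv, text, swap=True):
    args = build_parser(swap=swap).parse_args(argv)
    # getattr covers the case where the option was never registered
    swapcase = getattr(args, "swapcase", False)
    if args.decode:
        if swapcase:
            text = text.swapcase()  # undo the swap before decoding
        return base64.b64decode(text).decode()
    out = base64.b64encode(text.encode()).decode()
    return out.swapcase() if swapcase else out  # swap after encoding


print(run([], "test"))
print(run(["-s"], "test"))
print(run(["-d", "-s"], run(["-s"], "test")))
```

Swapping on opposite sides of the codec keeps the two directions symmetric, so an encoded and swapped string decodes back to the original text.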

codext/base/_base2n.py

Lines changed: 9 additions & 16 deletions
@@ -5,23 +5,16 @@
55
from math import ceil, log
66

77
from ..__common__ import *
8-
from ._base import base, _get_charset, BaseError
8+
from ..__common__ import _set_exc
9+
from ._base import base, _get_charset
910

1011

1112
_bin = lambda x: bin(x if isinstance(x, int) else ord(x))
1213

1314

1415
# base en/decoding functions for N a power of 2
15-
class Base2NError(BaseError):
16-
pass
17-
18-
19-
class Base2NDecodeError(BaseError):
20-
pass
21-
22-
23-
class Base2NEncodeError(BaseError):
24-
pass
16+
_set_exc("Base2NDecodeError")
17+
_set_exc("Base2NEncodeError")
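
This commit replaces hand-written exception classes with calls to `_set_exc`, whose body is not part of this diff. As a purely hypothetical sketch of how such a helper could work, the function below creates an exception class by name and publishes it through `builtins`, so that later code can refer to it without an import. The name `set_exc`, its signature and its behaviour are assumptions, not the codext implementation.

```python
import builtins


def set_exc(name, base=ValueError):
    """Create exception class `name` once and expose it as a builtin (hypothetical)."""
    if not hasattr(builtins, name):
        exc = type(name, (base,), {"__module__": "builtins"})
        setattr(builtins, name, exc)
    return getattr(builtins, name)


set_exc("DemoBaseError")
set_exc("DemoBaseDecodeError", base=builtins.DemoBaseError)

try:
    # Resolved through builtins, no import needed
    raise DemoBaseDecodeError("bad input")  # noqa: F821
except ValueError as err:
    print(type(err).__name__, err)
```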
2518

2619

2720
def base2n(charset, pattern=None, name=None, **kwargs):
@@ -35,13 +28,12 @@ def base2n(charset, pattern=None, name=None, **kwargs):
3528
base(charset, pattern, True, base2n_encode, base2n_decode, name, **kwargs)
3629

3730

38-
def base2n_encode(string, charset, errors="strict", exc=Base2NEncodeError):
31+
def base2n_encode(string, charset, errors="strict"):
3932
""" 8-bits characters to base-N encoding for N a power of 2.
4033
4134
:param string: string to be encoded
4235
:param charset: base-N characters set
4336
:param errors: errors handling marker
44-
:param exc: exception to be raised in case of error
4537
"""
4638
bs, r, n = "", "", len(charset)
4739
# find the number of bits for the given character set and the quantum
@@ -66,13 +58,12 @@ def base2n_encode(string, charset, errors="strict", exc=Base2NEncodeError):
6658
return r + int(l / nb_out - len(r)) * "="
6759

6860

69-
def base2n_decode(string, charset, errors="strict", exc=Base2NDecodeError):
61+
def base2n_decode(string, charset, errors="strict"):
7062
""" Base-N to 8-bits characters decoding for N a power of 2.
7163
7264
:param string: string to be decoded
7365
:param charset: base-N characters set
7466
:param errors: errors handling marker
75-
:param exc: exception to be raised in case of error
7667
"""
7768
bs, r, n = "", "", len(charset)
7869
# particular case: for hex, ensure the right case in the charset ; note that this way, if mixed cases are used, it
@@ -95,7 +86,9 @@ def base2n_decode(string, charset, errors="strict", exc=Base2NDecodeError):
9586
bs += ("{:0>%d}" % nb_in).format(_bin(charset.index(c))[2:])
9687
except ValueError:
9788
if errors == "strict":
98-
raise exc("'base' codec can't decode character '{}' in position {}".format(c, i))
89+
e = Base2NDecodeError("'base%d' codec can't decode character '%s' in position %d" % (n, c, i))
90+
e.__cause__ = e # block exceptions chaining
91+
raise e
9992
elif errors == "replace":
10093
bs += "0" * nb_in
10194
elif errors == "ignore":
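
The hunk above dispatches on the `errors` mode and sets `e.__cause__ = e` to keep the internal `ValueError` out of the traceback. A minimal Python 3 sketch of the same pattern follows, using `raise ... from None` as a swapped-in way to suppress exception chaining. `DemoDecodeError` and `indices` are hypothetical names, and the fallback branch for unknown modes is an assumption since the original hunk is cut off.

```python
class DemoDecodeError(ValueError):
    """Hypothetical stand-in for Base2NDecodeError."""


def indices(text, charset, errors="strict"):
    """Map each character to its index in charset, honouring the errors mode."""
    out = []
    for pos, ch in enumerate(text):
        try:
            out.append(charset.index(ch))
        except ValueError:
            if errors == "strict":
                msg = "'base%d' codec can't decode character '%s' in position %d"
                # "from None" hides the internal ValueError from the traceback
                raise DemoDecodeError(msg % (len(charset), ch, pos)) from None
            elif errors == "replace":
                out.append(0)  # substitute the zero digit, like the zero bits in the diff
            elif errors == "ignore":
                continue       # drop the offending character
            else:
                raise ValueError("Unsupported error handling '%s'" % errors)
    return out


print(indices("a?b", "ab", errors="replace"))
print(indices("a?b", "ab", errors="ignore"))
```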
