Commit 2d92041

This patch changes the way the string .encode() method works slightly
and introduces a new method .decode().

The major change is that strg.encode() will no longer try to convert Unicode returns from the codec into a string, but instead pass along the Unicode object as-is. The same is now true for all other codec return types. The underlying C APIs were changed accordingly.

Note that even though this does have the potential of breaking existing code, the chances are low, since conversion from Unicode previously took place using the default encoding, which is normally set to ASCII, rendering this auto-conversion mechanism useless for most Unicode encodings.

The good news is that you can now use .encode() and .decode() with much greater ease and that the door was opened for better accessibility of the builtin codecs.

As a demonstration of the new feature, the patch includes a few new codecs which allow string to string encoding and decoding (rot13, hex, zip, uu, base64).

Written by Marc-Andre Lemburg. Copyright assigned to the PSF.
1 parent 2e0a654 commit 2d92041
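
The behaviour described in the commit message (the codec's return value is passed along as-is, whatever its type) can be illustrated with a short sketch. It assumes a Python 3 interpreter, where these transforms are reached through codecs.encode() and codecs.decode() rather than through the string methods this patch targets; the variable names are illustrative only.

```python
import codecs

# rot13 is a text-to-text transform: str in, str out.
scrambled = codecs.encode("Hello", "rot13")
restored = codecs.decode(scrambled, "rot13")

# hex, base64 and zlib are bytes-to-bytes transforms.
payload = b"abc"
as_hex = codecs.encode(payload, "hex")
as_b64 = codecs.encode(payload, "base64")
as_zip = codecs.encode(payload, "zlib")

# Each transform round-trips back to the original payload.
round_trips = {
    name: codecs.decode(encoded, name)
    for name, encoded in (("hex", as_hex), ("base64", as_b64), ("zlib", as_zip))
}

print(scrambled, as_hex, as_b64)
```

Note that the result type depends on the codec, not on the caller: no conversion back to a fixed string type takes place.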

11 files changed, +585 -30 lines changed


Doc/api/api.tex

Lines changed: 21 additions & 7 deletions
@@ -2326,30 +2326,44 @@ \subsection{String Objects \label{stringObjects}}
 int size,
 const char *encoding,
 const char *errors}
-Create a string object by decoding \var{size} bytes of the encoded
-buffer \var{s}. \var{encoding} and \var{errors} have the same meaning
+Creates an object by decoding \var{size} bytes of the encoded
+buffer \var{s} using the codec registered
+for \var{encoding}. \var{encoding} and \var{errors} have the same meaning
 as the parameters of the same name in the unicode() builtin
 function. The codec to be used is looked up using the Python codec
 registry. Returns \NULL{} in case an exception was raised by the
 codec.
 \end{cfuncdesc}
 
-\begin{cfuncdesc}{PyObject*}{PyString_Encode}{const Py_UNICODE *s,
+\begin{cfuncdesc}{PyObject*}{PyString_AsDecodedObject}{PyObject *str,
+const char *encoding,
+const char *errors}
+Decodes a string object by passing it to the codec registered
+for \var{encoding} and returns the result as Python
+object. \var{encoding} and \var{errors} have the same meaning as the
+parameters of the same name in the string .encode() method. The codec
+to be used is looked up using the Python codec registry. Returns
+\NULL{} in case an exception was raised by the codec.
+\end{cfuncdesc}
+
+\begin{cfuncdesc}{PyObject*}{PyString_Encode}{const char *s,
 int size,
 const char *encoding,
 const char *errors}
-Encodes the \ctype{Py_UNICODE} buffer of the given size and returns a
-Python string object. \var{encoding} and \var{errors} have the same
+Encodes the \ctype{char} buffer of the given size by passing it to
+the codec registered for \var{encoding} and returns a Python object.
+\var{encoding} and \var{errors} have the same
 meaning as the parameters of the same name in the string .encode()
 method. The codec to be used is looked up using the Python codec
 registry. Returns \NULL{} in case an exception was raised by the
 codec.
 \end{cfuncdesc}
 
-\begin{cfuncdesc}{PyObject*}{PyString_AsEncodedString}{PyObject *unicode,
+\begin{cfuncdesc}{PyObject*}{PyString_AsEncodedObject}{PyObject *str,
 const char *encoding,
 const char *errors}
-Encodes a string object and returns the result as Python string
+Encodes a string object using the codec registered
+for \var{encoding} and returns the result as Python
 object. \var{encoding} and \var{errors} have the same meaning as the
 parameters of the same name in the string .encode() method. The codec
 to be used is looked up using the Python codec registry. Returns

Include/stringobject.h

Lines changed: 40 additions & 3 deletions
@@ -78,7 +78,7 @@ extern DL_IMPORT(void) _Py_ReleaseInternedStrings(void);
 
 /* --- Generic Codecs ----------------------------------------------------- */
 
-/* Create a string object by decoding the encoded string s of the
+/* Create an object by decoding the encoded string s of the
    given size. */
 
 extern DL_IMPORT(PyObject*) PyString_Decode(
@@ -89,7 +89,7 @@ extern DL_IMPORT(PyObject*) PyString_Decode(
     );
 
 /* Encodes a char buffer of the given size and returns a
-   Python string object. */
+   Python object. */
 
 extern DL_IMPORT(PyObject*) PyString_Encode(
     const char *s,              /* string char buffer */
@@ -98,15 +98,52 @@ extern DL_IMPORT(PyObject*) PyString_Encode(
     const char *errors          /* error handling */
     );
 
-/* Encodes a string object and returns the result as Python string
+/* Encodes a string object and returns the result as Python
    object. */
 
+extern DL_IMPORT(PyObject*) PyString_AsEncodedObject(
+    PyObject *str,              /* string object */
+    const char *encoding,       /* encoding */
+    const char *errors          /* error handling */
+    );
+
+/* Encodes a string object and returns the result as Python string
+   object.
+
+   If the codec returns an Unicode object, the object is converted
+   back to a string using the default encoding.
+
+   DEPRECATED - use PyString_AsEncodedObject() instead. */
+
 extern DL_IMPORT(PyObject*) PyString_AsEncodedString(
     PyObject *str,              /* string object */
     const char *encoding,       /* encoding */
     const char *errors          /* error handling */
     );
 
+/* Decodes a string object and returns the result as Python
+   object. */
+
+extern DL_IMPORT(PyObject*) PyString_AsDecodedObject(
+    PyObject *str,              /* string object */
+    const char *encoding,       /* encoding */
+    const char *errors          /* error handling */
+    );
+
+/* Decodes a string object and returns the result as Python string
+   object.
+
+   If the codec returns an Unicode object, the object is converted
+   back to a string using the default encoding.
+
+   DEPRECATED - use PyString_AsDecodedObject() instead. */
+
+extern DL_IMPORT(PyObject*) PyString_AsDecodedString(
+    PyObject *str,              /* string object */
+    const char *encoding,       /* encoding */
+    const char *errors          /* error handling */
+    );
+
 /* Provides access to the internal data buffer and size of a string
    object or the default encoded version of an Unicode object. Passing
    NULL as *len parameter will force the string buffer to be

Lib/UserString.py

Lines changed: 8 additions & 0 deletions
@@ -72,6 +72,14 @@ def capitalize(self): return self.__class__(self.data.capitalize())
     def center(self, width): return self.__class__(self.data.center(width))
     def count(self, sub, start=0, end=sys.maxint):
         return self.data.count(sub, start, end)
+    def decode(self, encoding=None, errors=None): # XXX improve this?
+        if encoding:
+            if errors:
+                return self.__class__(self.data.decode(encoding, errors))
+            else:
+                return self.__class__(self.data.decode(encoding))
+        else:
+            return self.__class__(self.data.decode())
     def encode(self, encoding=None, errors=None): # XXX improve this?
         if encoding:
             if errors:
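
The "XXX improve this?" comment in the hunk above suggests the nested branches could be tightened. One possible shape, shown here only as a sketch, forwards just the arguments that were supplied. BytesWrapper is a hypothetical stand-in for UserString, written for Python 3 where only bytes objects have a decode() method, and unlike the patch it returns the decoded str directly instead of re-wrapping it.

```python
class BytesWrapper:
    """Hypothetical wrapper around a bytes payload (not part of the patch)."""

    def __init__(self, data):
        self.data = data

    def decode(self, encoding=None, errors=None):
        # Collect only the arguments the caller actually provided, so the
        # wrapped object's own defaults apply to everything else.
        args = []
        if encoding:
            args.append(encoding)
            if errors:
                args.append(errors)
        return self.data.decode(*args)


default_result = BytesWrapper(b"caf\xc3\xa9").decode()
replaced_result = BytesWrapper(b"\xff").decode("ascii", "replace")
```

As in the patch, an errors value given without an encoding is ignored.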

Lib/encodings/aliases.py

Lines changed: 9 additions & 0 deletions
@@ -79,4 +79,13 @@
     'tis260': 'tactis',
     'sjis': 'shift_jis',
 
+    # Content transfer/compression encodings
+    'rot13': 'rot_13',
+    'base64': 'base64_codec',
+    'base_64': 'base64_codec',
+    'zlib': 'zlib_codec',
+    'zip': 'zlib_codec',
+    'hex': 'hex_codec',
+    'uu': 'uu_codec',
+
 }
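
To make the role of these alias entries concrete, here is a simplified sketch of how a requested codec name might be resolved to a module name. The resolve() helper and its normalization rule (lowercase, hyphens to underscores) are assumptions made for illustration; the real lookup lives in the encodings package and does more than this.

```python
# Alias entries copied from the hunk above.
ALIASES = {
    "rot13": "rot_13",
    "base64": "base64_codec",
    "base_64": "base64_codec",
    "zlib": "zlib_codec",
    "zip": "zlib_codec",
    "hex": "hex_codec",
    "uu": "uu_codec",
}


def resolve(name):
    """Hypothetical helper: map a user-supplied codec name to a module name."""
    normalized = name.lower().replace("-", "_")
    # Names without an alias entry are assumed to be module names already.
    return ALIASES.get(normalized, normalized)


resolved = [resolve(n) for n in ("zip", "Base-64", "ROT-13", "hex_codec")]
```

Under this assumption, a spelling such as 'rot-13' needs no alias entry of its own because normalization already yields the module name rot_13.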

Lib/encodings/base64_codec.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+""" Python 'base64_codec' Codec - base64 content transfer encoding
+
+    Unlike most of the other codecs which target Unicode, this codec
+    will return Python string objects for both encode and decode.
+
+    Written by Marc-Andre Lemburg (mal@lemburg.com).
+
+"""
+import codecs, base64
+
+### Codec APIs
+
+def base64_encode(input,errors='strict'):
+
+    """ Encodes the object input and returns a tuple (output
+        object, length consumed).
+
+        errors defines the error handling to apply. It defaults to
+        'strict' handling which is the only currently supported
+        error handling for this codec.
+
+    """
+    assert errors == 'strict'
+    output = base64.encodestring(input)
+    return (output, len(input))
+
+def base64_decode(input,errors='strict'):
+
+    """ Decodes the object input and returns a tuple (output
+        object, length consumed).
+
+        input must be an object which provides the bf_getreadbuf
+        buffer slot. Python strings, buffer objects and memory
+        mapped files are examples of objects providing this slot.
+
+        errors defines the error handling to apply. It defaults to
+        'strict' handling which is the only currently supported
+        error handling for this codec.
+
+    """
+    assert errors == 'strict'
+    output = base64.decodestring(input)
+    return (output, len(input))
+
+class Codec(codecs.Codec):
+
+    encode = base64_encode
+    decode = base64_decode
+
+class StreamWriter(Codec,codecs.StreamWriter):
+    pass
+
+class StreamReader(Codec,codecs.StreamReader):
+    pass
+
+### encodings module API
+
+def getregentry():
+
+    return (base64_encode,base64_decode,StreamReader,StreamWriter)
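
The codec contract used in this file, where each function returns a tuple of (output object, length consumed), can be tried out with a small Python 3 sketch. It is an adaptation rather than the patch's code: base64.encodestring() and base64.decodestring() were removed in Python 3.9, so encodebytes() and decodebytes() are used instead, and the assert is replaced by an explicit ValueError as one possible alternative.

```python
import base64


def base64_encode(data, errors="strict"):
    """Return (encoded bytes, number of input bytes consumed)."""
    if errors != "strict":
        raise ValueError("only 'strict' error handling is supported")
    return base64.encodebytes(data), len(data)


def base64_decode(data, errors="strict"):
    """Return (decoded bytes, number of input bytes consumed)."""
    if errors != "strict":
        raise ValueError("only 'strict' error handling is supported")
    return base64.decodebytes(data), len(data)


encoded, consumed = base64_encode(b"hello")
decoded, _ = base64_decode(encoded)
```

The second tuple element reports how much of the input was used, which is always the whole input for this stateless codec.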

Lib/encodings/hex_codec.py

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+""" Python 'hex_codec' Codec - 2-digit hex content transfer encoding
+
+    Unlike most of the other codecs which target Unicode, this codec
+    will return Python string objects for both encode and decode.
+
+    Written by Marc-Andre Lemburg (mal@lemburg.com).
+
+"""
+import codecs, binascii
+
+### Codec APIs
+
+def hex_encode(input,errors='strict'):
+
+    """ Encodes the object input and returns a tuple (output
+        object, length consumed).
+
+        errors defines the error handling to apply. It defaults to
+        'strict' handling which is the only currently supported
+        error handling for this codec.
+
+    """
+    assert errors == 'strict'
+    output = binascii.b2a_hex(input)
+    return (output, len(input))
+
+def hex_decode(input,errors='strict'):
+
+    """ Decodes the object input and returns a tuple (output
+        object, length consumed).
+
+        input must be an object which provides the bf_getreadbuf
+        buffer slot. Python strings, buffer objects and memory
+        mapped files are examples of objects providing this slot.
+
+        errors defines the error handling to apply. It defaults to
+        'strict' handling which is the only currently supported
+        error handling for this codec.
+
+    """
+    assert errors == 'strict'
+    output = binascii.a2b_hex(input)
+    return (output, len(input))
+
+class Codec(codecs.Codec):
+
+    encode = hex_encode
+    decode = hex_decode
+
+class StreamWriter(Codec,codecs.StreamWriter):
+    pass
+
+class StreamReader(Codec,codecs.StreamReader):
+    pass
+
+### encodings module API
+
+def getregentry():
+
+    return (hex_encode,hex_decode,StreamReader,StreamWriter)
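
Since 'strict' is the only error mode this codec supports, malformed input surfaces as an exception from binascii rather than being repaired. The Python 3 sketch below shows both the normal path and that failure path; the function names mirror the patch, but the bodies are simplified and omit the errors parameter.

```python
import binascii


def hex_encode(data):
    """Return (two hex digits per input byte, bytes consumed)."""
    return binascii.b2a_hex(data), len(data)


def hex_decode(data):
    """Return (decoded bytes, bytes consumed); raises on malformed hex."""
    return binascii.a2b_hex(data), len(data)


encoded, _ = hex_encode(b"\x00\xffA")
decoded, _ = hex_decode(encoded)

# An odd number of hex digits cannot be decoded.
try:
    hex_decode(b"abc")
    failure = None
except binascii.Error as exc:
    failure = exc
```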

Lib/encodings/rot_13.py

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+#!/usr/local/bin/python2.1
+""" Python Character Mapping Codec for ROT13.
+
+    See http://ucsub.colorado.edu/~kominek/rot13/ for details.
+
+    Written by Marc-Andre Lemburg (mal@lemburg.com).
+
+"""#"
+
+import codecs
+
+### Codec APIs
+
+class Codec(codecs.Codec):
+
+    def encode(self,input,errors='strict'):
+
+        return codecs.charmap_encode(input,errors,encoding_map)
+
+    def decode(self,input,errors='strict'):
+
+        return codecs.charmap_decode(input,errors,decoding_map)
+
+class StreamWriter(Codec,codecs.StreamWriter):
+    pass
+
+class StreamReader(Codec,codecs.StreamReader):
+    pass
+
+### encodings module API
+
+def getregentry():
+
+    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)
+
+### Decoding Map
+
+decoding_map = codecs.make_identity_dict(range(256))
+decoding_map.update({
+    0x0041: 0x004e,
+    0x0042: 0x004f,
+    0x0043: 0x0050,
+    0x0044: 0x0051,
+    0x0045: 0x0052,
+    0x0046: 0x0053,
+    0x0047: 0x0054,
+    0x0048: 0x0055,
+    0x0049: 0x0056,
+    0x004a: 0x0057,
+    0x004b: 0x0058,
+    0x004c: 0x0059,
+    0x004d: 0x005a,
+    0x004e: 0x0041,
+    0x004f: 0x0042,
+    0x0050: 0x0043,
+    0x0051: 0x0044,
+    0x0052: 0x0045,
+    0x0053: 0x0046,
+    0x0054: 0x0047,
+    0x0055: 0x0048,
+    0x0056: 0x0049,
+    0x0057: 0x004a,
+    0x0058: 0x004b,
+    0x0059: 0x004c,
+    0x005a: 0x004d,
+    0x0061: 0x006e,
+    0x0062: 0x006f,
+    0x0063: 0x0070,
+    0x0064: 0x0071,
+    0x0065: 0x0072,
+    0x0066: 0x0073,
+    0x0067: 0x0074,
+    0x0068: 0x0075,
+    0x0069: 0x0076,
+    0x006a: 0x0077,
+    0x006b: 0x0078,
+    0x006c: 0x0079,
+    0x006d: 0x007a,
+    0x006e: 0x0061,
+    0x006f: 0x0062,
+    0x0070: 0x0063,
+    0x0071: 0x0064,
+    0x0072: 0x0065,
+    0x0073: 0x0066,
+    0x0074: 0x0067,
+    0x0075: 0x0068,
+    0x0076: 0x0069,
+    0x0077: 0x006a,
+    0x0078: 0x006b,
+    0x0079: 0x006c,
+    0x007a: 0x006d,
+})
+
+### Encoding Map
+
+encoding_map = {}
+for k,v in decoding_map.items():
+    encoding_map[v] = k
+
+### Filter API
+
+def rot13(infile, outfile):
+    outfile.write(infile.read().encode('rot-13'))
+
+if __name__ == '__main__':
+    import sys
+    rot13(sys.stdin, sys.stdout)
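
The 52 hand-written entries in the decoding map follow one rule: each ASCII letter maps to the letter 13 positions further along its own alphabet, wrapping around. As a sketch of that rule (not the patch's implementation), the same table can be generated in Python 3 and applied with str.translate(); build_rot13_map() and ROT13 are illustrative names.

```python
import codecs
import string


def build_rot13_map():
    """Build the letter-to-letter table as {code point: code point}."""
    mapping = {}
    for alphabet in (string.ascii_uppercase, string.ascii_lowercase):
        for index, char in enumerate(alphabet):
            # Rotate by 13 within the same 26-letter alphabet.
            mapping[ord(char)] = ord(alphabet[(index + 13) % 26])
    return mapping


ROT13 = build_rot13_map()


def rot13(text):
    """Apply ROT13; characters outside A-Z and a-z pass through unchanged."""
    return text.translate(ROT13)


message = rot13("Hello, World!")
matches_builtin = message == codecs.encode("Hello, World!", "rot13")
```

Because 13 is half of 26, the mapping is its own inverse, which is why the patch can derive encoding_map by simply swapping the keys and values of decoding_map.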
