Commit 1a19ab7

Allow unicode formulas on Py2 in limited circumstances.
Apparently our disallowing unicode formula strings on Py2 was being rather annoying for people using `from __future__ import unicode_literals`. Start allowing them in limited circumstances. Fixes pydatagh-53.
1 parent 870d680 commit 1a19ab7

4 files changed: 53 additions & 4 deletions

doc/changes.rst

Lines changed: 8 additions & 0 deletions
@@ -6,6 +6,14 @@ Changes
 v0.4.1
 ------
 
+New features:
+
+* On Python 2, accept ``unicode`` strings containing only ASCII
+  characters as valid formula descriptions in
+  the high-level formula API (:func:`dmatrix` and friends). This is
+  intended as a convenience for people using Python 2 with ``from
+  __future__ import unicode_literals``. (See :ref:`py2-versus-py3`.)
+
 Bug fixes:
 
 * Accept ``long`` as a valid integer type in the new

doc/py2-versus-py3.rst

Lines changed: 15 additions & 2 deletions
@@ -1,3 +1,5 @@
+.. _py2-versus-py3:
+
 Python 2 versus Python 3
 ========================
 
@@ -6,11 +8,11 @@ Python 2 versus Python 3
 The biggest difference between Python 2 and Python 3 is in their
 string handling, and this is particularly relevant to Patsy since
 it parses user input. We follow a simple rule: input to Patsy
-should always be of type `str`. That means that on Python 2, you
+should always be of type ``str``. That means that on Python 2, you
 should pass byte-strings (not unicode), and on Python 3, you should
 pass unicode strings (not byte-strings). Similarly, when Patsy
 passes text back (e.g. :attr:`DesignInfo.column_names`), it's always
-in the form of a `str`.
+in the form of a ``str``.
 
 In addition to this being the most convenient for users (you never
 need to use any b"weird" u"prefixes" when writing a formula string),
@@ -20,3 +22,14 @@ byte-strings, and that's the only form of input accepted by the
 :mod:`tokenize` module. On the other hand, Python 3's tokenizer and
 parser use unicode, and since Patsy processes Python code, it has
 to follow suit.
+
+There is one exception to this rule: on Python 2, as a convenience for
+those using ``from __future__ import unicode_literals``, the
+high-level API functions :func:`dmatrix`, :func:`dmatrices`,
+:func:`incr_dbuilders`, and :func:`incr_dbuilder` do accept
+``unicode`` strings -- BUT these unicode string objects are still
+required to contain only ASCII characters; if they contain any
+non-ASCII characters then an error will be raised. If you really need
+non-ASCII in your formulas, then you should consider upgrading to
+Python 3. Low-level APIs like :meth:`ModelDesc.from_formula` continue
+to insist on ``str`` objects only.

patsy/highlevel.py

Lines changed: 13 additions & 0 deletions
@@ -13,6 +13,7 @@
 # ModelDesign doesn't work -- need to work with the builder set
 # want to be able to return either a matrix or a pandas dataframe
 
+import six
 import numpy as np
 from patsy import PatsyError
 from patsy.design_info import DesignMatrix, DesignInfo
@@ -45,6 +46,18 @@ def _try_incr_builders(formula_like, data_iter_maker, eval_env,
             raise PatsyError("bad value from %r.__patsy_get_model_desc__"
                              % (formula_like,))
         # fallthrough
+    if not six.PY3 and isinstance(formula_like, unicode):
+        # Included for the convenience of people who are using py2 with
+        # __future__.unicode_literals.
+        try:
+            formula_like = formula_like.encode("ascii")
+        except UnicodeEncodeError:
+            raise PatsyError(
+                "On Python 2, formula strings must be either 'str' objects, "
+                "or else 'unicode' objects containing only ascii "
+                "characters. You passed a unicode string with non-ascii "
+                "characters. I'm afraid you'll have to either switch to "
+                "ascii-only, or else upgrade to Python 3.")
     if isinstance(formula_like, str):
         formula_like = ModelDesc.from_formula(formula_like)
         # fallthrough
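
The coercion step in this hunk can be tried in isolation. The following is a minimal sketch, not Patsy's code: `FormulaError` and `ascii_bytes_or_error` are hypothetical names introduced here for illustration, and the sketch omits the `six.PY3` guard so that it runs on either Python version. It only shows that encoding to ASCII succeeds for ASCII-only text and fails otherwise.

```python
class FormulaError(Exception):
    """Hypothetical stand-in for patsy.PatsyError (assumption, not Patsy's class)."""


def ascii_bytes_or_error(text):
    """Encode text as ASCII bytes, or raise FormulaError.

    On Python 2 the returned byte-string is the native ``str`` type,
    which is the type the commit converts unicode formulas into.
    """
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        raise FormulaError(
            "formula contains non-ASCII characters: %r" % (text,))


# ASCII-only text encodes cleanly.
assert ascii_bytes_or_error(u"y ~ x") == b"y ~ x"

# Non-ASCII text (here U+00E9, e with acute accent) is rejected.
try:
    ascii_bytes_or_error(u"\u00e9 ~ x")
except FormulaError:
    rejected = True
else:
    rejected = False
assert rejected
```

In the commit itself this conversion is applied only when `six.PY3` is false, so Python 3 formulas are left untouched.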

patsy/test_highlevel.py

Lines changed: 17 additions & 2 deletions
@@ -6,6 +6,7 @@
 
 import sys
 import __future__
+import six
 import numpy as np
 from nose.tools import assert_raises
 from patsy import PatsyError
@@ -74,7 +75,7 @@ def t(formula_like, data, depth,
     depth += 1
     def data_iter_maker():
         return iter([data])
-    if (isinstance(formula_like, (str, ModelDesc, DesignInfo))
+    if (isinstance(formula_like, six.string_types + (ModelDesc, DesignInfo))
         or (isinstance(formula_like, tuple)
             and isinstance(formula_like[0], DesignInfo))
         or hasattr(formula_like, "__patsy_get_model_desc__")):
@@ -258,7 +259,21 @@ def __patsy_get_model_desc__(self, data):
     t("x + y", {"y": [1, 2], "x": [3, 4]}, 0,
       True,
       [[1, 3, 1], [1, 4, 2]], ["Intercept", "x", "y"])
-
+
+    # unicode objects on py2 (must be ascii only)
+    if not six.PY3:
+        # ascii is fine
+        t(unicode("y ~ x"),
+          {"y": [1, 2], "x": [3, 4]}, 0,
+          True,
+          [[1, 3], [1, 4]], ["Intercept", "x"],
+          [[1], [2]], ["y"])
+        # non-ascii is not (even if this would be valid on py3 with its less
+        # restrict variable naming rules)
+        eacute = "\xc3\xa9".decode("utf-8")
+        assert isinstance(eacute, unicode)
+        assert_raises(PatsyError, dmatrix, eacute, data={eacute: [1, 2]})
+
     # ModelDesc
     desc = ModelDesc([], [Term([LookupFactor("x")])])
     t(desc, {"x": [1.5, 2.5, 3.5]}, 0,
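
The test helper's widened `isinstance` check relies on concatenating tuples of types. Below is a rough sketch of that pattern, under the assumption that `six.string_types` is `(basestring,)` on Python 2 and `(str,)` on Python 3. To stay self-contained it rebuilds that tuple by hand rather than importing `six`, and `ModelDescStub`, `DesignInfoStub` and `looks_like_formula` are hypothetical placeholders, not Patsy names.

```python
import sys

# Hand-rolled equivalent of six.string_types (assumption: same values as six).
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821  (exists only on Python 2)


class ModelDescStub(object):
    """Hypothetical placeholder for patsy.ModelDesc."""


class DesignInfoStub(object):
    """Hypothetical placeholder for patsy.DesignInfo."""


# Tuple concatenation yields one flat tuple that isinstance() accepts.
ACCEPTED_TYPES = string_types + (ModelDescStub, DesignInfoStub)


def looks_like_formula(obj):
    """Return True if obj is a string or one of the stub description types."""
    return isinstance(obj, ACCEPTED_TYPES)


assert looks_like_formula(u"y ~ x")
assert looks_like_formula(ModelDescStub())
assert not looks_like_formula(42)
```

On Python 2, `basestring` covers both `str` and `unicode`, which is why the helper now also recognises the `unicode("y ~ x")` test case.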
