fix hashing string-casting error #21187

Merged
merged 10 commits into from
Jun 21, 2018
7 changes: 2 additions & 5 deletions pandas/_libs/hashing.pyx
@@ -8,8 +8,7 @@ import numpy as np
 from numpy cimport ndarray, uint8_t, uint32_t, uint64_t

 from util cimport _checknull
-from cpython cimport (PyString_Check,
-                      PyBytes_Check,
+from cpython cimport (PyBytes_Check,
                       PyUnicode_Check)
 from libc.stdlib cimport malloc, free

@@ -62,9 +61,7 @@ def hash_object_array(ndarray[object] arr, object key, object encoding='utf8'):
     cdef list datas = []
     for i in range(n):
         val = arr[i]
-        if PyString_Check(val):
-            data = <bytes>val.encode(encoding)
-        elif PyBytes_Check(val):
+        if PyBytes_Check(val):
             data = <bytes>val
         elif PyUnicode_Check(val):
             data = <bytes>val.encode(encoding)
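For readers following the hashing.pyx change: on Python 2, `PyString_Check` matches the byte-string `str` type, and calling `.encode()` on a byte string first decodes it implicitly with the default (ASCII) codec, which fails on non-ASCII bytes. The sketch below is a pure-Python 3 analogue of the remaining branches, not the Cython code itself. The names `to_hash_bytes` and `py2_style_encode` are hypothetical, and the `TypeError` fallback is an assumption made for illustration.

```python
def to_hash_bytes(val, encoding="utf8"):
    """Return the bytes that would be hashed for a single value."""
    if isinstance(val, bytes):
        # Already bytes: pass through untouched (mirrors the PyBytes_Check branch).
        return val
    if isinstance(val, str):
        # Text: encode with the requested encoding (mirrors PyUnicode_Check).
        return val.encode(encoding)
    # Assumption: this sketch simply rejects anything else.
    raise TypeError("expected bytes or str, got {}".format(type(val).__name__))


def py2_style_encode(raw, encoding="utf8"):
    """Emulate Python 2's str.encode: implicit ASCII decode, then encode."""
    return raw.decode("ascii").encode(encoding)


raw = "San Sebastián".encode("utf8")

# The remaining branches leave non-ASCII bytes alone and encode text once.
assert to_hash_bytes(raw) == raw
assert to_hash_bytes("San Sebastián") == raw

# The removed branch's behaviour fails on the same input.
try:
    py2_style_encode(raw)
except UnicodeDecodeError:
    removed_branch_failed = True
else:
    removed_branch_failed = False
assert removed_branch_failed
```

Checking for bytes first means a Python 2 byte string never reaches an `.encode()` call, which is the effect of dropping the `PyString_Check` branch.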
30 changes: 30 additions & 0 deletions pandas/tests/series/test_repr.py
@@ -5,6 +5,7 @@

 import sys

+import pytest
 import numpy as np
 import pandas as pd

@@ -202,6 +203,35 @@ def test_latex_repr(self):

 class TestCategoricalRepr(object):

+    @pytest.mark.skipif(compat.PY3, reason="Decoding failure only in PY2")
Contributor

We need a test for PY3 as well that uses utf8 as the encoding.

Member Author

Sure, couldn't hurt.
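A rough idea of what the Python 3 side of such a test might assert, written without pandas so it stays dependency-free: a plain list stands in for the `Categorical`/`Series` objects, and the function name `check_repr_unicode_py3` is hypothetical. On Python 3 there is no default encoding to switch, so the check only confirms that rendering 61 non-ASCII items yields text that encodes cleanly as utf8.

```python
import sys


class County(object):
    name = "San Sebastián"
    state = "PR"

    def __str__(self):
        return self.name + ", " + self.state

    # Assumption: repr and str render identically in this stand-in class.
    __repr__ = __str__


def check_repr_unicode_py3():
    # Python 3 always reports utf-8 as the default encoding.
    assert sys.getdefaultencoding() == "utf-8"

    items = [County() for _ in range(61)]
    rendered = repr(items)

    # Rendering must produce text containing every item, with no decode step.
    assert isinstance(rendered, str)
    assert rendered.count("San Sebastián, PR") == 61

    # The text must survive an explicit utf8 round trip.
    assert rendered.encode("utf8").decode("utf8") == rendered
    return rendered


rendered = check_repr_unicode_py3()
```

In the real test the list would be replaced by the `ser` object built from `pd.Categorical`, with `repr(ser)` and `str(ser)` called directly.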

+    def test_categorical_repr_unicode(self):
+        # GH#21002 if len(index) > 60, sys.getdefaultencoding()=='ascii',
+        # and we are working in PY2, then rendering a Categorical could raise
+        # UnicodeDecodeError by trying to decode when it shouldn't
+        from pandas.core.base import StringMixin
Contributor

This can be imported at the top.

+
+        class County(StringMixin):
+            name = u'San Sebastián'
+            state = u'PR'
+
+            def __unicode__(self):
+                return self.name + u', ' + self.state
+
+        cat = pd.Categorical([County() for n in range(61)])
+        idx = pd.Index(cat)
+        ser = idx.to_series()
+
+        # set sys.defaultencoding to ascii, then change it back after the test
+        enc = sys.getdefaultencoding()
+        reload(sys)  # noqa:F821
+        sys.setdefaultencoding('ascii')
Contributor

There is a context manager for this, I think.

Member Author

There is a context manager for locale in pd.util.testing. Can that be used here or do you have something else in mind? (I agree it would be prettier)

Contributor

Yes, please use that.

Member Author

Looks like tm.set_locale doesn't change sys.getdefaultencoding(). I could make a new contextmanager specifically for this (which I guess would be a no-op in py3?)
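The context manager floated in this thread could look roughly like the sketch below. The name `set_defaultencoding` is hypothetical here, and the Python 2 branch relies on the same `reload(sys)` trick the test already uses. On Python 3 the manager does nothing, since `sys.setdefaultencoding` does not exist there.

```python
import sys
from contextlib import contextmanager


@contextmanager
def set_defaultencoding(encoding):
    """Temporarily change sys.getdefaultencoding() on PY2; no-op on PY3."""
    if sys.version_info[0] >= 3:
        # Python 3 has no sys.setdefaultencoding, so there is nothing to set.
        yield
        return

    orig = sys.getdefaultencoding()
    reload(sys)  # noqa: F821  (PY2 only: re-exposes sys.setdefaultencoding)
    sys.setdefaultencoding(encoding)
    try:
        yield
    finally:
        # Restore the original encoding even if the body raised.
        sys.setdefaultencoding(orig)


# On Python 3 the default encoding is unchanged inside and after the block.
with set_defaultencoding("ascii"):
    inside = sys.getdefaultencoding()
after = sys.getdefaultencoding()
```

With a helper like this, the `try`/`finally` in the test would reduce to calling `repr(ser)` and `str(ser)` inside a `with set_defaultencoding('ascii'):` block.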

+        try:
+            repr(ser)
+            str(ser)
+        finally:
+            # restore encoding
+            sys.setdefaultencoding(enc)
+
     def test_categorical_repr(self):
         a = Series(Categorical([1, 2, 3, 4]))
         exp = u("0 1\n1 2\n2 3\n3 4\n" +