BUG: unicode characters when reading JSON lines #15149
@@ -959,6 +959,25 @@ def test_read_jsonl(self):
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
assert_frame_equal(result, expected)

# GH15132: ascii input of non-ascii unicode characters for Python 2
make this a separately named test
expected = DataFrame([[u"foo\u201d", "bar"], ["foo", "bar"]],
                     columns=['a', 'b'])
assert_frame_equal(result, expected)
add the same exact test for PY3
assert_frame_equal(result, expected)

# simulate string
json = ('{"a": "foo\xe2\x80\x9d", "b": "bar"}\n'
this should work in PY3 as well
Thanks for the feedback, @jreback.
To make this work for PY3, is it OK to include # -*- coding: utf-8 -*- at the top of the pandas/io/tests/json/test_pandas.py file? If so, I can simply use the unicode character directly in the json string.
# -*- coding: utf-8 -*-
...
json = '{"a": "foo”", "b": "bar"}\n{"a": "foo", "b": "bar"}\n' # note the ” character
If not, I would suggest using a bytes string that is utf-8 decoded if PY3:
json = b'{"a": "foo\xe2\x80\x9d", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'
if compat.PY3:
# PY3 will not interpret the ascii-equivalent \xe2\x80\x9d correctly without decoding.
json = json.decode('utf-8')
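For reference, the two options above can be compared in a small standalone sketch. It assumes Python 3 and uses only the standard library `json` module rather than pandas; the variable names (`direct`, `raw`, `decoded`, `records`) are illustrative and do not come from the test file. On Python 3 the coding line is not needed because source files are UTF-8 by default; it only matters on Python 2.

```python
# -*- coding: utf-8 -*-
import json

# Option 1: write the character directly in the source
# (this is the case that needs the coding line on Python 2).
direct = '{"a": "foo”", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'

# Option 2: spell out the UTF-8 bytes and decode them explicitly.
raw = b'{"a": "foo\xe2\x80\x9d", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'
decoded = raw.decode('utf-8')

# Both spellings should produce the same text.
same_text = (direct == decoded)

# Parse one record per line, as a JSON lines reader would.
records = [json.loads(line) for line in decoded.splitlines() if line]
```

Either spelling should give the test the same input string, so the choice is mostly about readability of the test source.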
yep, you can include the coding (we do it in some files already)
Fixes UnicodeDecodeError bug when reading JSON lines input with Ascii decoder, which is often the default setting in Python 2.7. Avoids issues with mixing unicode and ascii strings. xref pandas-dev#15132
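A rough illustration of the failure mode described in this commit message, written for Python 3 with the standard library only. This is not the pandas implementation: `read_json_lines` is a hypothetical helper, and the ASCII decode is made explicit here, whereas on Python 2 it happens implicitly when byte strings and unicode strings are mixed.

```python
import json

# UTF-8 bytes containing a right double quotation mark (U+201D).
raw = b'{"a": "foo\xe2\x80\x9d", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'


def read_json_lines(data, encoding="utf-8"):
    """Hypothetical helper: decode bytes, then parse one JSON object per line."""
    text = data.decode(encoding)
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Decoding with the ASCII codec fails on the non-ASCII bytes.
try:
    read_json_lines(raw, encoding="ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding as UTF-8 before parsing succeeds.
rows = read_json_lines(raw)
```

The point of the sketch is only that the decoding step has to use an encoding that covers the input; how pandas arranges this internally is not shown in this conversation.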
fd5184f to e117889
Current coverage is 85.54% (diff: 100%)
@@             master   #15149   diff @@
==========================================
  Files           145      145
  Lines         51404    51404
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43975    43976     +1
+ Misses         7429     7428     -1
  Partials          0        0
lgtm. ping on green.
@jreback: all green here.
thanks!
Fixes UnicodeDecodeError bug when reading JSON lines input with Ascii decoder, which is often the default setting in Python 2.7. Avoids issues with mixing unicode and ascii strings.
closes pandas-dev#15132
Author: Rouz Azari
Closes pandas-dev#15149 from rouzazari/GH_15132_json_lines_with_unicode_chars_py2 and squashes the following commits:
e117889 [Rouz Azari] BUG: unicode characters when reading JSON lines
Fixes UnicodeDecodeError bug when reading JSON lines input with Ascii decoder, which is often the default setting in Python 2.7. Avoids issues with mixing unicode and ascii strings.
xref #15132
git diff upstream/master | flake8 --diff