BUG: unicode characters when reading JSON lines #15149

rouzazari
Contributor

Fixes a UnicodeDecodeError when reading JSON lines input with the ASCII decoder, which is often the default in Python 2.7. Avoids issues with mixing unicode and ASCII strings.

xref #15132
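
The failure mode described here can be reproduced outside pandas. The sketch below is plain Python 3 using only built-ins, not the code path this PR patches, and its variable names are illustrative assumptions. It shows that the UTF-8 bytes for ” (U+201D) cannot be decoded with the ASCII codec, which is the decode Python 2 attempts implicitly when str and unicode values are mixed.

```python
# Illustrative sketch only: mimics the decode step, not pandas internals.
# UTF-8 encoded JSON line; \xe2\x80\x9d is the three-byte encoding of U+201D.
raw = b'{"a": "foo\xe2\x80\x9d", "b": "bar"}'

try:
    raw.decode("ascii")          # ASCII cannot represent bytes >= 0x80
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("utf-8")

print(ascii_failed)
print("\u201d" in text)
```

Decoding explicitly as UTF-8 before any string concatenation avoids relying on the interpreter's default codec.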

@jorisvandenbossche jorisvandenbossche added Bug IO JSON read_json, to_json, json_normalize labels Jan 18, 2017
@@ -959,6 +959,25 @@ def test_read_jsonl(self):
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
assert_frame_equal(result, expected)

# GH15132: ascii input of non-ascii unicode characters for Python 2
Contributor

make this a separately named test

expected = DataFrame([[u"foo\u201d", "bar"], ["foo", "bar"]],
columns=['a', 'b'])
assert_frame_equal(result, expected)

Contributor

add the same exact test for PY3

assert_frame_equal(result, expected)

# simulate string
json = ('{"a": "foo\xe2\x80\x9d", "b": "bar"}\n'
Contributor

this should work in PY3 as well

Contributor Author

Thanks for the feedback, @jreback.

To make this work for PY3, is it OK to include # -*- coding: utf-8 -*- at the top of the pandas/io/tests/json/test_pandas.py file? If so, I can simply use the unicode character directly in the json string.

# -*- coding: utf-8 -*-
...
json = '{"a": "foo”", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'  # note the ” character

If not, I would suggest using a bytes string that is utf-8 decoded if PY3:

json = b'{"a": "foo\xe2\x80\x9d", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'
if compat.PY3:
    # In PY3 a str literal would read \xe2\x80\x9d as three separate code points, so decode the bytes as UTF-8.
    json = json.decode('utf-8')
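
As a rough check that the two options above describe the same input, the following Python 3 sketch decodes the bytes literal and compares it with the same text written using a `\u201d` escape, then parses both records with the standard library. `parse_json_lines` is a hypothetical helper introduced for illustration; it is not a pandas function and does not reproduce what `read_json(..., lines=True)` does internally.

```python
import json

# Option 2 from the comment above: a bytes literal, decoded explicitly.
raw = b'{"a": "foo\xe2\x80\x9d", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'
text = raw.decode("utf-8")

# The same content as a text literal, using an escape for U+201D.
same_text = '{"a": "foo\u201d", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'


def parse_json_lines(s):
    """Hypothetical helper: parse one JSON document per non-empty line."""
    return [json.loads(line) for line in s.split("\n") if line.strip()]


rows = parse_json_lines(text)

print(text == same_text)
print(rows)
```

Either spelling yields the same two records, so the choice between the coding header and the decoded bytes literal is about source-file readability rather than behavior.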

Contributor

@jreback jreback Jan 18, 2017

yep, you can include the coding (we do it in some files already)

@rouzazari rouzazari force-pushed the GH_15132_json_lines_with_unicode_chars_py2 branch from fd5184f to e117889 Compare January 18, 2017 19:42
@codecov-io

codecov-io commented Jan 18, 2017

Current coverage is 85.54% (diff: 100%)

Merging #15149 into master will increase coverage by <.01%

@@             master     #15149   diff @@
==========================================
  Files           145        145          
  Lines         51404      51404          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43975      43976     +1   
+ Misses         7429       7428     -1   
  Partials          0          0          

Powered by Codecov. Last update 8e13da2...e117889

@jreback jreback added this to the 0.20.0 milestone Jan 18, 2017
@jreback
Contributor

jreback commented Jan 18, 2017

lgtm. ping on green.

@rouzazari
Contributor Author

@jreback: all green here.

@jreback
Contributor

jreback commented Jan 19, 2017

thanks!

@jreback jreback closed this in 7d5c354 Jan 19, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Fixes UnicodeDecodeError bug when reading JSON lines input with Ascii
decoder, which is often the default setting in Python 2.7. Avoids
issues with mixing unicode and ascii strings.

closes pandas-dev#15132

Author: Rouz Azari <[email protected]>

Closes pandas-dev#15149 from rouzazari/GH_15132_json_lines_with_unicode_chars_py2 and squashes the following commits:

e117889 [Rouz Azari] BUG: unicode characters when reading JSON lines
Development

Successfully merging this pull request may close these issues.

pd.read_json(file, lines=True) does not work if json has quotes inside it