Commit 7d5c354

rouzazari authored and jreback committed
BUG: unicode characters when reading JSON lines
Fixes UnicodeDecodeError bug when reading JSON lines input with the ASCII decoder, which is often the default setting in Python 2.7. Avoids issues with mixing unicode and ASCII strings.

closes #15132

Author: Rouz Azari <[email protected]>

Closes #15149 from rouzazari/GH_15132_json_lines_with_unicode_chars_py2 and squashes the following commits:

e117889 [Rouz Azari] BUG: unicode characters when reading JSON lines
1 parent 684c4d5 commit 7d5c354
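
The failure described in the commit message comes from Python 2 implicitly decoding a byte string with the ASCII codec whenever it is concatenated or joined with a `unicode` literal. The sketch below is not part of the commit; it reproduces the same decoding step explicitly so that it also runs on Python 3. The variable names are illustrative assumptions.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch, not code from the commit.
# A JSON record containing U+201D (RIGHT DOUBLE QUOTATION MARK), encoded as
# UTF-8 bytes, which is what a Python 2 `str` read from a file would hold.
utf8_bytes = u'{"a": "foo\u201d"}'.encode('utf-8')

# Python 2 performs this ASCII decode implicitly when the bytes are mixed
# with a unicode literal such as u'['. Here it is done explicitly.
try:
    utf8_bytes.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# Keeping every operand the same type avoids the implicit ASCII decode;
# the result can then be decoded once with the correct codec.
joined = b'[' + utf8_bytes + b']'
decoded = joined.decode('utf-8')
```

Dropping the `u` prefixes in the patched line has the same effect on Python 2: the literals stay byte strings, so no implicit ASCII decode is triggered.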

3 files changed: +22 -1 lines changed

doc/source/whatsnew/v0.20.0.txt

Lines changed: 1 addition & 0 deletions
@@ -436,3 +436,4 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)
 
 - Bug in ``Series.dt.round`` inconsistent behaviour on NAT's with different arguments (:issue:`14940`)
+- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)

pandas/io/json.py

Lines changed: 1 addition & 1 deletion
@@ -274,7 +274,7 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
         # If given a json lines file, we break the string into lines, add
         # commas and put it in a json list to make a valid json object.
         lines = list(StringIO(json.strip()))
-        json = u'[' + u','.join(lines) + u']'
+        json = '[' + ','.join(lines) + ']'
 
     obj = None
     if typ == 'frame':
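
The comment in the hunk above describes the approach: split the input into lines, join them with commas, and wrap the result in brackets so that it parses as a single JSON array. A minimal standalone sketch of that idea follows, using the standard-library `json` module rather than the parser pandas uses. The helper name `json_lines_to_array` is hypothetical, and unlike the patched line it strips each line and skips blank ones.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch, not code from the commit.
import json
from io import StringIO


def json_lines_to_array(text):
    """Wrap newline-delimited JSON records in one JSON array string.

    Hypothetical helper: mirrors the join-with-commas idea from the diff,
    but additionally strips each line and drops empty ones.
    """
    lines = [line.strip() for line in StringIO(text.strip()) if line.strip()]
    return '[' + ','.join(lines) + ']'


# Two records, the first containing U+201D (RIGHT DOUBLE QUOTATION MARK).
raw = u'{"a": "foo\u201d", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'
records = json.loads(json_lines_to_array(raw))
```

On Python 3 all string literals are text, so the mixing problem fixed by this commit does not arise; the sketch only shows the array-wrapping step.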

pandas/io/tests/json/test_pandas.py

Lines changed: 20 additions & 0 deletions
@@ -1,3 +1,4 @@
+# -*- coding: utf-8 -*-
 # pylint: disable-msg=W0612,E1101
 import nose
 from pandas.compat import range, lrange, StringIO, OrderedDict
@@ -960,6 +961,25 @@ def test_read_jsonl(self):
         expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
         assert_frame_equal(result, expected)
 
+    def test_read_jsonl_unicode_chars(self):
+        # GH15132: non-ascii unicode characters
+        # \u201d == RIGHT DOUBLE QUOTATION MARK
+
+        # simulate file handle
+        json = '{"a": "foo”", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'
+        json = StringIO(json)
+        result = read_json(json, lines=True)
+        expected = DataFrame([[u"foo\u201d", "bar"], ["foo", "bar"]],
+                             columns=['a', 'b'])
+        assert_frame_equal(result, expected)
+
+        # simulate string
+        json = '{"a": "foo”", "b": "bar"}\n{"a": "foo", "b": "bar"}\n'
+        result = read_json(json, lines=True)
+        expected = DataFrame([[u"foo\u201d", "bar"], ["foo", "bar"]],
+                             columns=['a', 'b'])
+        assert_frame_equal(result, expected)
+
     def test_to_jsonl(self):
         # GH9180
         df = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
