Commit fd5184f

BUG: unicode characters when reading JSON lines
Fixes a UnicodeDecodeError raised when reading JSON lines input under the ASCII codec, which is often the default encoding in Python 2.7. Avoids mixing unicode and ASCII byte strings when the lines are joined. xref #15132
1 parent 362e78d commit fd5184f
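
The failure mode described above can be reproduced in outline. This is a minimal sketch, not code from the commit: it assumes Python 3 and emulates Python 2's implicit coercion (decoding a byte string with the default ASCII codec when it is combined with a unicode string). The helper name `implicit_py2_coercion` is hypothetical.

```python
# UTF-8 encoded bytes, as Python 2 would hold them in a plain ``str``.
# \xe2\x80\x9d is U+201D RIGHT DOUBLE QUOTATION MARK.
raw_line = b'{"a": "foo\xe2\x80\x9d", "b": "bar"}'


def implicit_py2_coercion(byte_string):
    """Hypothetical helper: emulate the implicit decode Python 2 performs
    when a unicode literal such as u'[' is concatenated with a byte string.
    The default codec is assumed to be ASCII here."""
    return byte_string.decode("ascii")


try:
    implicit_py2_coercion(raw_line)
    failed = False
except UnicodeDecodeError:
    # The non-ASCII bytes cannot be decoded by the ASCII codec.
    failed = True

# Decoding explicitly as UTF-8 succeeds and yields the intended character.
decoded = raw_line.decode("utf-8")
```

Keeping every operand a byte string, as the one-line change in `pandas/io/json.py` does, means no implicit decode is triggered during the join.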

3 files changed, +21 -1 lines changed

doc/source/whatsnew/v0.20.0.txt

Lines changed: 1 addition & 0 deletions
@@ -384,3 +384,4 @@ Bug Fixes
 - Bug in ``pd.read_csv()`` for the C engine where ``usecols`` were being indexed incorrectly with ``parse_dates`` (:issue:`14792`)
 
 - Bug in ``Series.dt.round`` inconsistent behaviour on NAT's with different arguments (:issue:`14940`)
+- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)

pandas/io/json.py

Lines changed: 1 addition & 1 deletion
@@ -274,7 +274,7 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
         # If given a json lines file, we break the string into lines, add
         # commas and put it in a json list to make a valid json object.
         lines = list(StringIO(json.strip()))
-        json = u'[' + u','.join(lines) + u']'
+        json = '[' + ','.join(lines) + ']'
 
     obj = None
     if typ == 'frame':
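
The comment in this hunk describes turning JSON lines into a single JSON list. A minimal standalone sketch of that idea, assuming Python 3 and the standard-library `json` module rather than pandas' own parser; the function name `json_lines_to_records` is hypothetical:

```python
import json
from io import StringIO


def json_lines_to_records(text):
    """Hypothetical helper: wrap newline-delimited JSON objects in a list.

    Mirrors the approach in the hunk: split the input into lines, join
    them with commas, and surround the result with square brackets.
    """
    lines = list(StringIO(text.strip()))
    # Each line except the last keeps its trailing newline; JSON treats
    # that as insignificant whitespace between array elements.
    wrapped = "[" + ",".join(lines) + "]"
    return json.loads(wrapped)


records = json_lines_to_records(
    '{"a": "foo\u201d", "b": "bar"}\n'
    '{"a": "foo", "b": "bar"}\n'
)
```

In Python 3 all operands here are text strings, so the mixed-type concatenation that the commit removes cannot occur.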

pandas/io/tests/json/test_pandas.py

Lines changed: 19 additions & 0 deletions
@@ -959,6 +959,25 @@ def test_read_jsonl(self):
         expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
         assert_frame_equal(result, expected)
 
+        # GH15132: ascii input of non-ascii unicode characters for Python 2
+        # \xe2\x80\x9d == \u201d == RIGHT DOUBLE QUOTATION MARK
+        if compat.PY2:
+            # simulate file handle
+            json = StringIO('{"a": "foo\xe2\x80\x9d", "b": "bar"}\n'
+                            '{"a": "foo", "b": "bar"}\n')
+            result = read_json(json, lines=True)
+            expected = DataFrame([[u"foo\u201d", "bar"], ["foo", "bar"]],
+                                 columns=['a', 'b'])
+            assert_frame_equal(result, expected)
+
+            # simulate string
+            json = ('{"a": "foo\xe2\x80\x9d", "b": "bar"}\n'
+                    '{"a": "foo", "b": "bar"}\n')
+            result = read_json(json, lines=True)
+            expected = DataFrame([[u"foo\u201d", "bar"], ["foo", "bar"]],
+                                 columns=['a', 'b'])
+            assert_frame_equal(result, expected)
+
     def test_to_jsonl(self):
         # GH9180
         df = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
