BUG: fix nested meta path bug (GH 27220) #27667

Closed
wants to merge 13 commits into from
Changes from 7 commits
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.25.1.rst
@@ -107,7 +107,7 @@ MultiIndex
 I/O
 ^^^

--
+- Fix bug in :meth:`io.json.json_normalize` when nested meta paths with a nested record path. (:issue:`27220`)
Member

Move to 0.25.2 at this point

Contributor

Just FYI, I think we're being conservative with 0.25.2 backports. Just 3.8 compat and fixes for regressions in 0.25.

Member

Sounds good - @yanglinlee let's go 1.0.0 instead then

Contributor Author

Sounds good. This is not a very urgent issue. Thanks for the reviews~

 -
 -

26 changes: 22 additions & 4 deletions pandas/io/json/_normalize.py
@@ -288,12 +288,13 @@ def _recursive_extract(data, path, seen_meta, level=0):
         if len(path) > 1:
             for obj in data:
                 for val, key in zip(meta, meta_keys):
-                    if level + 1 == len(val):
-                        seen_meta[key] = _pull_field(obj, val[-1])
+                    # Pull value for all the keys in case meta path and
+                    # record path are on two branches
+                    seen_meta[key] = _pull_field(obj, val[0])

                 _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
         else:
-            for obj in data:
+            for ind, obj in enumerate(data):
                 recs = _pull_field(obj, path[0])
                 recs = [
                     nested_to_record(r, sep=sep, max_level=max_level)
@@ -305,8 +306,24 @@ def _recursive_extract(data, path, seen_meta, level=0):
                 # For repeating the metadata later
                 lengths.append(len(recs))
                 for val, key in zip(meta, meta_keys):
+                    # Extract the value of the key when the level
+                    # is at the meta path end
                     if level + 1 > len(val):
                         meta_val = seen_meta[key]
+                        meta_vals[key].append(meta_val)
+                    # Extract the value of the key from seen_meta when
Contributor

Can you put a blank line before these comments (and below and above)? They are easier to read if they are paragraph-like.

Contributor Author

Definitely. Will update this tomorrow.

+                    # meta path and record path are on two branches
+                    elif seen_meta:
+                        meta_val = seen_meta[key]
+                        meta_vals[key] += [
+                            # The list case
+                            meta_val[ind][val[level]]
+                            if isinstance(meta_val, list)
+                            # The dict case
+                            else meta_val[val[level]]
+                        ]
+                    # At top level, seen_meta is empty, pull from data
+                    # directly and raise KeyError if not found
                     else:
                         try:
                             meta_val = _pull_field(obj, val[level:])
@@ -319,7 +336,8 @@ def _recursive_extract(data, path, seen_meta, level=0):
                                     "errors='ignore' as key "
                                     "{err} is not always present".format(err=e)
                                 )
-                    meta_vals[key].append(meta_val)
+                        meta_vals[key].append(meta_val)
+
                 records.extend(recs)

     _recursive_extract(data, record_path, {}, level=0)
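
The case this change targets, a record path and a meta path that sit on different branches of the same object, can be illustrated with a small pure-Python sketch. This is not the pandas implementation: `pull_field`, `flatten_with_meta`, and the `sample` data are hypothetical names introduced here, and the sketch assumes every meta path is resolved from the top-level object. Real `json_normalize` also supports meta at intermediate levels, `errors="ignore"`, and record flattening, none of which is modelled.

```python
def pull_field(obj, path):
    """Follow a list of keys down into nested dicts."""
    for key in path:
        obj = obj[key]
    return obj


def flatten_with_meta(data, record_path, meta_paths):
    """Return one row per record, each followed by its top-level metadata."""
    rows = []
    for top in data:
        # Resolve every meta path against the top-level object, so it does
        # not matter that "info" and "counties" sit on different branches.
        meta = [pull_field(top, path) for path in meta_paths]

        # Walk the record path one key at a time, flattening as we go.
        # Iterating a string in the last step yields its characters.
        level = [top]
        for key in record_path:
            level = [item for obj in level for item in obj[key]]

        rows.extend([record] + meta for record in level)
    return rows


# Hypothetical sample shaped like the data used in the test below.
sample = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [{"name": "Dade"}, {"name": "Broward"}],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [{"name": "Summit"}],
    },
]

rows = flatten_with_meta(
    sample,
    record_path=["counties", "name"],
    meta_paths=[["state"], ["shortname"], ["info", "governor"]],
)
print(rows[0])    # ['D', 'Florida', 'FL', 'Rick Scott']
print(len(rows))  # 17
```

Because `name` is a string, the final step of the record path produces one row per character. The test added in this PR expects the same per-character rows.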
25 changes: 25 additions & 0 deletions pandas/tests/io/json/test_normalize.py
@@ -1,4 +1,5 @@
 import json
+import sys

 import numpy as np
 import pytest
@@ -287,6 +288,30 @@ def test_shallow_nested(self):
         expected = DataFrame(ex_data, columns=result.columns)
         tm.assert_frame_equal(result, expected)

+    @pytest.mark.skipif(sys.version_info < (3, 6), reason="drop support for 3.5 soon")
+    def test_nested_meta_path_with_nested_record_path(self, state_data):
+        # GH 27220
+        result = json_normalize(
+            state_data,
+            ["counties", "name"],
+            ["state", "shortname", ["info", "governor"]],
+            errors="ignore",
+        )
+        ex_data = {
Member

Instead of using a dict, can you construct this with literal values? If you use a list of lists, you can circumvent the 3.5 ordering issues with a dict.

If you can simplify the expectation, that would help a lot as well, so as not to write out the same values 21 times.

Contributor Author

Thanks for the suggestion. It is indeed better. I simply copied other test cases. I will change accordingly.

+            0: [
+                i
+                for word in ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"]
+                for i in word
+            ],
+            "state": ["Florida"] * 21 + ["Ohio"] * 14,
+            "shortname": ["FL"] * 21 + ["OH"] * 14,
+            "info.governor": ["Rick Scott"] * 21 + ["John Kasich"] * 14,
+        }
+        expected = DataFrame(
+            ex_data, columns=[0, "state", "shortname", "info.governor"]
+        )
+        tm.assert_frame_equal(result, expected)
+
     def test_meta_name_conflict(self):
         data = [
             {
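
The review suggestion above (build the expectation from a list of lists, and avoid hand-counting 21 and 14 repeats) could look roughly like the sketch below. It is one possible reading of the suggestion, not the code that was eventually committed. The names `names_by_state`, `expected_rows`, and `columns` are hypothetical; the county names and metadata values are taken from the test in this diff.

```python
# Each entry pairs a state's county names with that state's metadata row.
names_by_state = [
    (["Dade", "Broward", "Palm Beach"], ["Florida", "FL", "Rick Scott"]),
    (["Summit", "Cuyahoga"], ["Ohio", "OH", "John Kasich"]),
]

# One row per character of each county name, followed by the state's
# metadata. Row order is explicit, so no dict ordering is involved.
expected_rows = [
    [char] + meta
    for names, meta in names_by_state
    for name in names
    for char in name
]

columns = [0, "state", "shortname", "info.governor"]

print(len(expected_rows))  # 35
print(expected_rows[0])    # ['D', 'Florida', 'FL', 'Rick Scott']
```

In the test, `DataFrame(expected_rows, columns=columns)` would then replace the dict-based `ex_data`, and the repeat counts follow from the names themselves instead of being written out.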