BUG: can't concatenate DataFrame with Series with duplicate keys #33805


Merged (3 commits) on May 1, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -665,6 +665,7 @@ Reshaping
- Bug in :meth:`concat` where when passing a non-dict mapping as ``objs`` would raise a ``TypeError`` (:issue:`32863`)
- :meth:`DataFrame.agg` now provides a more descriptive ``SpecificationError`` message when attempting to aggregate a non-existent column (:issue:`32755`)
- Bug in :meth:`DataFrame.unstack` when MultiIndexed columns and MultiIndexed rows were used (:issue:`32624`, :issue:`24729` and :issue:`28306`)
- Bug in :func:`concat` was not allowing for concatenation of ``DataFrame`` and ``Series`` with duplicate keys (:issue:`33654`)


Sparse
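
For context, a minimal reproduction of the reported behaviour (GH 33654), reusing the data from the new test further down; before this change the call failed instead of returning a frame with a column MultiIndex:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
s1 = pd.Series([7, 8, 9], name="c")
s2 = pd.Series([10, 11, 12], name="d")

# "f" appears twice in keys; with this fix the call succeeds and the
# columns become ("e", "a"), ("e", "b"), ("f", "c"), ("f", "d")
result = pd.concat([df, s1, s2], axis=1, keys=["e", "f", "f"])
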
8 changes: 4 additions & 4 deletions pandas/core/reshape/concat.py
@@ -619,10 +619,10 @@ def _make_concat_multiindex(indexes, keys, levels=None, names=None) -> MultiIndex
for hlevel, level in zip(zipped, levels):
    to_concat = []
    for key, index in zip(hlevel, indexes):
-       try:
-           i = level.get_loc(key)
-       except KeyError as err:
-           raise ValueError(f"Key {key} not in level {level}") from err
+       mask = level == key
+       if not any(mask):

Contributor

use mask.any()

Contributor

it’s a bit more idiomatic to use .get_indexer here which handles duplicates - if u can make that work

Member Author

@jreback thanks for the review. I've looked into that, but it seems it errors with duplicates:

>>> pd.Index(['a', 'b']).get_indexer(['a'])                                 
array([0])
>>> pd.Index(['a', 'b', 'b']).get_indexer(['a'])
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-10-0a1bc1d27a1b> in <module>
----> 1 pd.Index(['a', 'b', 'b']).get_indexer(['a'])

~/pandas-dev/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   2926 
   2927         if not self.is_unique:
-> 2928             raise InvalidIndexError(
   2929                 "Reindexing only valid with uniquely valued Index objects"
   2930             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Should I open an issue for get_indexer first and make sure that deals with duplicates before coming back to this one?

+           raise ValueError(f"Key {key} not in level {level}")
+       i = np.nonzero(level == key)[0][0]

        to_concat.append(np.repeat(i, len(index)))
    codes_list.append(np.concatenate(to_concat))
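
A small sketch of why the lookup changed (on a toy level with duplicate entries; the get_loc return value shown is what a monotonic, non-unique Index gives): once the level holds duplicates, Index.get_loc no longer returns a single integer, whereas the boolean mask always yields a first position that np.repeat can use:

>>> import numpy as np
>>> import pandas as pd
>>> level = pd.Index(["e", "f", "f"])   # level built from duplicate keys
>>> level.get_loc("f")                  # a slice (or mask), not a scalar
slice(1, 3, None)
>>> np.nonzero(level == "f")[0][0]      # first position of the key
1
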
15 changes: 15 additions & 0 deletions pandas/tests/reshape/test_concat.py
@@ -2802,3 +2802,18 @@ def test_concat_multiindex_datetime_object_index():
)
result = concat([s, s2], axis=1)
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("keys", [["e", "f", "f"], ["f", "e", "f"]])
def test_duplicate_keys(keys):
# GH 33654
df = DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
s1 = Series([7, 8, 9], name="c")
s2 = Series([10, 11, 12], name="d")
result = concat([df, s1, s2], axis=1, keys=keys)
expected_values = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
expected_columns = pd.MultiIndex.from_tuples(
[(keys[0], "a"), (keys[0], "b"), (keys[1], "c"), (keys[2], "d")]
)
expected = DataFrame(expected_values, columns=expected_columns)
tm.assert_frame_equal(result, expected)
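
For reference, a small sketch of the layout this test pins down (same df, s1, s2 as above, with keys=["e", "f", "f"]): the duplicated key simply labels each Series block on the outer level of the column MultiIndex:

>>> result = concat([df, s1, s2], axis=1, keys=["e", "f", "f"])
>>> list(result.columns)
[('e', 'a'), ('e', 'b'), ('f', 'c'), ('f', 'd')]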