Make DataFrame arithmetic ops with 2D arrays behave like numpy analogues #23000

Merged 14 commits on Oct 7, 2018
Changes from 5 commits
25 changes: 22 additions & 3 deletions pandas/core/ops.py
@@ -1799,14 +1799,33 @@ def to_series(right):
             right = to_series(right)

         elif right.ndim == 2:
-            if left.shape != right.shape:
+            if right.shape == left.shape:
+                right = left._constructor(right, index=left.index,
+                                          columns=left.columns)
+
+            elif right.shape[0] == left.shape[0] and right.shape[1] == 1:
+                # Broadcast across columns
+                try:
+                    right = np.broadcast_to(right, left.shape)
+                except AttributeError:
+                    # numpy < 1.10.0
+                    right = np.tile(right, (1, left.shape[1]))
+
+                right = left._constructor(right,
+                                          index=left.index,
+                                          columns=left.columns)
+                # TODO: Double-check this doesn't make copies
Contributor

is this relevant?

Member Author

It matters for performance: if the answer is that it does make copies, then yes. At least with a sufficiently new numpy, we're passing a view to left._constructor.

a = np.arange(3)
b = a.reshape(3, 1)
c = np.broadcast_to(b, (3, 2))
d = c.copy()

df = pd.DataFrame(c)
df2 = pd.DataFrame(d)

>>> df.values.base is a  # <-- the concern is that this comes back False
True

>>> df2.values.base is d
True

In this example it's OK. I left the comment as a reminder to do a more thorough check. Are you confident this is always OK?
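A complementary check to comparing `.base` is `np.shares_memory`, which asks directly whether two arrays overlap. The sketch below is numpy-only and illustrative: it contrasts the `np.broadcast_to` path with the `np.tile` fallback from the diff, and makes no claim about what `left._constructor` does with the array afterwards, which is the open question behind the TODO.

```python
import numpy as np

a = np.arange(3)
b = a.reshape(3, 1)                   # column-like view of a, shape (3, 1)

# np.broadcast_to returns a read-only view; the broadcast axis gets
# stride 0, so no data is duplicated.
viewed = np.broadcast_to(b, (3, 2))

# np.tile (the fallback used for numpy < 1.10.0) builds a new array.
tiled = np.tile(b, (1, 2))

shares_view = np.shares_memory(a, viewed)
shares_tile = np.shares_memory(a, tiled)
print(shares_view, shares_tile, viewed.strides, viewed.flags.writeable)
```

If the frame built from `viewed` still shares memory with `a`, no copy was made on that path; the `np.tile` path always pays for one copy up front.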


+            elif right.shape[1] == left.shape[1] and right.shape[0] == 1:
+                # Broadcast along rows
+                right = to_series(right[0, :])
+
+            else:
                 raise ValueError("Unable to coerce to DataFrame, shape "
                                  "must be {req_shape}: given {given_shape}"
                                  .format(req_shape=left.shape,
                                          given_shape=right.shape))

-            right = left._constructor(right, index=left.index,
-                                      columns=left.columns)
         elif right.ndim > 2:
             raise ValueError('Unable to coerce to Series/DataFrame, dim '
                              'must be <= 2: {dim}'.format(dim=right.shape))
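The shape dispatch in this hunk can be sketched in plain numpy. `coerce_2d_operand` is a hypothetical helper written for illustration, not a pandas function. It leaves out the DataFrame wrapping and index/column alignment, and it uses `np.broadcast_to` for the row-like case where the diff goes through `to_series(right[0, :])`.

```python
import numpy as np


def coerce_2d_operand(right, left_shape):
    """Hypothetical helper: expand a 2D array to left_shape or raise."""
    nrows, ncols = left_shape
    if right.shape == left_shape:
        # Same shape: use as-is.
        return right
    if right.shape == (nrows, 1):
        # Column-like: repeat the single column across all columns.
        return np.broadcast_to(right, left_shape)
    if right.shape == (1, ncols):
        # Row-like: repeat the single row down all rows.
        return np.broadcast_to(right, left_shape)
    raise ValueError("Unable to coerce to DataFrame, shape "
                     "must be {req_shape}: given {given_shape}"
                     .format(req_shape=left_shape, given_shape=right.shape))


left = np.arange(6).reshape(3, 2)
collike = coerce_2d_operand(np.array([[10], [20], [30]]), left.shape)
rowlike = coerce_2d_operand(np.array([[100, 200]]), left.shape)
print(left + collike)
print(left + rowlike)
```

Any other 2D shape, for example `(2, 2)` against a `(3, 2)` frame, falls through to the `ValueError`.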
38 changes: 38 additions & 0 deletions pandas/tests/frame/test_arithmetic.py
@@ -99,6 +99,44 @@ def test_df_flex_cmp_constant_return_types_empty(self, opname):
 # Arithmetic

 class TestFrameFlexArithmetic(object):
+    # TODO: tests for other arithmetic ops
+    def test_df_add_2d_array_rowlike_broadcasts(self):
+        # GH#
+        arr = np.arange(6).reshape(3, 2)
+        df = pd.DataFrame(arr, columns=[True, False], index=['A', 'B', 'C'])
+
+        rowlike = arr[[1], :]  # shape --> (1, ncols)
+        expected = pd.DataFrame([[2, 4],
Contributor

Can you assert on the shape of rowlike here (like your comment, but more explicit)?

Member Author

Good idea; I had messed up one of them.
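A minimal sketch of what explicit shape assertions could look like for both operands. The indexing `arr[:, [1]]` for the column-like slice is an assumption inferred from the `(nrows, 1)` comment in the diff, and the arithmetic assumes a pandas version that includes this change.

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr, columns=[True, False], index=['A', 'B', 'C'])

rowlike = arr[[1], :]   # list index on axis 0 keeps the result 2D
collike = arr[:, [1]]   # list index on axis 1 keeps the result 2D

# Explicit versions of the "shape -->" comments.
assert rowlike.shape == (1, df.shape[1])
assert collike.shape == (df.shape[0], 1)

row_result = df + rowlike
col_result = df + collike
print(row_result)
print(col_result)
```

With the assertions in place, a slice that accidentally selects a row where a column was intended fails immediately instead of silently testing the wrong case.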

+                                 [4, 6],
+                                 [6, 8]],
+                                columns=df.columns, index=df.index,
+                                # specify dtype explicitly to avoid failing
+                                # on 32bit builds
+                                dtype=arr.dtype)
+        result = df + rowlike
+        tm.assert_frame_equal(result, expected)
+        result = rowlike + df
+        tm.assert_frame_equal(result, expected)

+    # TODO: tests for other arithmetic ops
+    def test_df_add_2d_array_collike_broadcasts(self):
+        # GH#
+        arr = np.arange(6).reshape(3, 2)
+        df = pd.DataFrame(arr, columns=[True, False], index=['A', 'B', 'C'])
+
+        collike = arr[:, [1]]  # shape --> (nrows, 1)
+        expected = pd.DataFrame([[1, 2],
+                                 [5, 6],
+                                 [9, 10]],
+                                columns=df.columns, index=df.index,
+                                # specify dtype explicitly to avoid failing
+                                # on 32bit builds
+                                dtype=arr.dtype)
+        result = df + collike
+        tm.assert_frame_equal(result, expected)
+        result = collike + df
+        tm.assert_frame_equal(result, expected)

Contributor

Do we have sufficient coverage for a broadcast op with a non-homogeneous frame?

Member Author

It's pretty scattered. Within this module specifically, it's pretty bare.
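As a rough illustration of what such coverage might exercise, here is a sketch of broadcasting against a mixed-dtype frame. The frame, column names, and values are made up for the example and are not taken from the pandas test suite; it assumes a pandas version that includes this change.

```python
import numpy as np
import pandas as pd

# Mixed-dtype ("non-homogeneous") frame: one integer and one float column.
df = pd.DataFrame({'ints': [1, 2, 3], 'floats': [0.5, 1.5, 2.5]},
                  index=['A', 'B', 'C'])

collike = np.array([[10], [20], [30]])   # shape (nrows, 1)
rowlike = np.array([[100, 200]])         # shape (1, ncols)

col_result = df + collike   # each row gets its own offset
row_result = df + rowlike   # each column gets its own offset
print(col_result)
print(row_result)
```

A real test would probably also pin down the resulting dtypes per column, which is where a mixed frame can differ from the homogeneous case.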

     def test_df_add_td64_columnwise(self):
         # GH#22534 Check that column-wise addition broadcasts correctly
         dti = pd.date_range('2016-01-01', periods=10)