ENH: Add more dt property/method support for ArrowDtype(timestamp) #52503
Conversation
@phofl do you prefer backporting this too?
No real opinion here. I think the string equivalent is used more widely, so it was better to backport the split PR.
pandas/core/arrays/arrow/array.py
Outdated
@property
def _dt_is_quarter_start(self):
    is_correct_month = pc.is_in(pc.month(self._pa_array), pa.array([1, 4, 7, 10]))
    is_first_day = pc.equal(pc.day(self._pa_array), 1)
    return type(self)(pc.and_(is_correct_month, is_first_day))
Could you use pc.floor_temporal? E.g.:
return pc.equal(pc.floor_temporal(self._pa_array, unit='quarter'), self._pa_array)
For is_quarter_start, even if the date has nonzero hour, minute, etc. components, we want to return True if the month and day are correct, so unfortunately we don't want those components floored or ceiled away, e.g.:
In [15]: ser
Out[15]:
0 2023-11-30 03:00:00
1 2023-01-01 03:00:00
2 2023-03-31 03:00:00
3 NaT
dtype: datetime64[ns]
In [16]: ser.dt.is_quarter_start
Out[16]:
0 False
1 True
2 False
3 False
dtype: bool
Got it. How about:
pc.equal(pc.floor_temporal(self._pa_array, unit='quarter'), pc.floor_temporal(self._pa_array, unit='day'))
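The difference between the two suggestions can be sketched in plain Python on the sample timestamps from the thread. This is a standard-library analogue, not the PyArrow code path: the helper names `floor_to_day`, `floor_to_quarter`, `is_quarter_start_raw`, and `is_quarter_start` are hypothetical and only mimic what flooring to `unit='day'` and `unit='quarter'` is expected to do.

```python
from datetime import datetime


def floor_to_day(ts: datetime) -> datetime:
    """Drop the time-of-day components (stand-in for flooring to unit='day')."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def floor_to_quarter(ts: datetime) -> datetime:
    """Midnight on the first day of the quarter that contains ts."""
    first_month = 3 * ((ts.month - 1) // 3) + 1
    return datetime(ts.year, first_month, 1)


def is_quarter_start_raw(ts: datetime) -> bool:
    # First suggestion: compare the quarter floor with the raw timestamp.
    return floor_to_quarter(ts) == ts


def is_quarter_start(ts: datetime) -> bool:
    # Second suggestion: floor both sides, so the time of day is ignored.
    return floor_to_quarter(ts) == floor_to_day(ts)


samples = [
    datetime(2023, 11, 30, 3),
    datetime(2023, 1, 1, 3),
    datetime(2023, 3, 31, 3),
]
raw_result = [is_quarter_start_raw(ts) for ts in samples]
fixed_result = [is_quarter_start(ts) for ts in samples]
print(raw_result)    # [False, False, False]
print(fixed_result)  # [False, True, False]
```

Comparing against the raw timestamp misses 2023-01-01 03:00:00 because of the 03:00 component, while flooring both sides to the day recovers the expected True.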
pandas/core/arrays/arrow/array.py
Outdated
@property
def _dt_is_quarter_end(self):
    is_correct_month = pc.is_in(pc.month(self._pa_array), pa.array([3, 6, 9, 12]))
    plus_one_day = pc.add(self._pa_array, pa.scalar(datetime.timedelta(days=1)))
    is_first_day = pc.equal(pc.day(plus_one_day), 1)
    return type(self)(pc.and_(is_correct_month, is_first_day))
As above but with pc.ceil_temporal.
pandas/core/arrays/arrow/array.py
Outdated
@property
def _dt_days_in_month(self):
    pa_array = self._pa_array
    if self.dtype.pyarrow_dtype.unit != "ns":
        pa_array = self._pa_array.cast(
            pa.timestamp("ns", tz=self.dtype.pyarrow_dtype.tz)
        )
    pa_array_int = pa_array.cast(pa.int64())
    if self._hasna:
        pa_array_int = pa_array_int.fill_null(iNaT)
    np_result = get_date_field(pa_array_int.to_numpy(), "dim")
    mask = np_result == -1
    return type(self)(pa.array(np_result, mask=mask))
This might make sense to implement upstream. Otherwise this might work:
pc.days_between(pc.floor_temporal(self._pa_array, unit='month'), pc.ceil_temporal(self._pa_array, unit='month'))
Nice! This works perfectly
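The idea behind that one-liner, counting the days between the start of the month and the start of the next one, can be illustrated with the standard library alone. This is a sketch under assumptions: `floor_to_month`, `next_month_start`, and `days_in_month` are hypothetical helpers, and `next_month_start` always rounds up to the following month, even for a timestamp sitting exactly at midnight on the first. How pc.ceil_temporal treats a value that already lies on a boundary is not discussed in this thread, so that case is worth testing separately.

```python
from datetime import datetime


def floor_to_month(ts: datetime) -> datetime:
    """Midnight on the first day of the month that contains ts."""
    return datetime(ts.year, ts.month, 1)


def next_month_start(ts: datetime) -> datetime:
    """Midnight on the first day of the following month (always rounds up)."""
    if ts.month == 12:
        return datetime(ts.year + 1, 1, 1)
    return datetime(ts.year, ts.month + 1, 1)


def days_in_month(ts: datetime) -> int:
    # Whole days between the two month boundaries.
    return (next_month_start(ts) - floor_to_month(ts)).days


samples = [
    datetime(2023, 11, 30, 3),      # November
    datetime(2023, 2, 14, 12),      # February, non-leap year
    datetime(2024, 2, 29, 23, 59),  # February, leap year
]
result = [days_in_month(ts) for ts in samples]
print(result)  # [30, 28, 29]
```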
pandas/core/arrays/arrow/array.py
Outdated
if ambiguous != "raise":
    raise NotImplementedError(f"{ambiguous=} is not supported")
if nonexistent != "raise":
    raise NotImplementedError(f"{nonexistent=} is not supported")
Are these checks really required? Do we actually need ambiguous/nonexistent?
pandas/core/arrays/arrow/array.py
Outdated
if tz is None:
    result = self._pa_array.cast(pa.timestamp(current_unit, "UTC")).cast(
        pa.timestamp(current_unit)
    )
else:
    pa_tz = str(tz)
    result = self._pa_array.cast(pa.timestamp(current_unit, pa_tz))
You could probably reduce this to just:
- if tz is None:
-     result = self._pa_array.cast(pa.timestamp(current_unit, "UTC")).cast(
-         pa.timestamp(current_unit)
-     )
- else:
-     pa_tz = str(tz)
-     result = self._pa_array.cast(pa.timestamp(current_unit, pa_tz))
+ result = self._pa_array.cast(pa.timestamp(current_unit, pa_tz))
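For readers unfamiliar with what the two branches of the original snippet are meant to produce, here is a standard-library analogue. It is only a sketch: `tz_convert` is a hypothetical helper operating on a single tz-aware `datetime`, `eastern` is a fixed-offset stand-in rather than a real IANA zone, and nothing here exercises Arrow's cast semantics.

```python
from datetime import datetime, timedelta, timezone


def tz_convert(ts: datetime, tz):
    """Re-express a tz-aware timestamp in another zone.

    With tz=None the value is shifted to UTC and the zone is dropped,
    mirroring the first branch of the snippet under review.
    """
    if tz is None:
        return ts.astimezone(timezone.utc).replace(tzinfo=None)
    return ts.astimezone(tz)


# Fixed UTC-5 offset used purely for illustration.
eastern = timezone(timedelta(hours=-5))

ts = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
converted = tz_convert(ts, eastern)
naive_utc = tz_convert(converted, None)

print(converted)  # 2023-01-01 07:00:00-05:00
print(naive_utc)  # 2023-01-01 12:00:00
```

The conversion changes only the wall-clock representation: `converted` and `ts` still denote the same instant.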
pandas/core/arrays/arrow/array.py
Outdated
def _dt_is_quarter_end(self):
    result = pc.equal(
        pc.ceil_temporal(self._pa_array, unit="quarter"),
        pc.ceil_temporal(self._pa_array, unit="day"),
Would this have to be:
- pc.ceil_temporal(self._pa_array, unit="day"),
+ pc.floor_temporal(self._pa_array, unit="day"),
Hmm, so I just noticed pc.ceil_temporal(self._pa_array, unit="quarter") will return the start of the new quarter:
(Pdb) self._pa_array
<pyarrow.lib.ChunkedArray object at 0x14764cb30>
[
[
2023-11-30 03:00:00.000000,
2023-01-01 03:00:00.000000,
2023-03-31 03:00:00.000000,
null
]
]
(Pdb) pc.ceil_temporal(self._pa_array, unit="quarter")
<pyarrow.lib.ChunkedArray object at 0x1478e1860>
[
[
2024-01-01 00:00:00.000000,
2023-04-01 00:00:00.000000,
2023-04-01 00:00:00.000000,
null
]
]
(Pdb) pc.floor_temporal(self._pa_array, unit="day")
<pyarrow.lib.ChunkedArray object at 0x1478e1950>
[
[
2023-11-30 00:00:00.000000,
2023-01-01 00:00:00.000000,
2023-03-31 00:00:00.000000,
null
]
]
so I would need to check if pc.floor_temporal(self._pa_array, unit="day") + 1 day == pc.ceil_temporal(self._pa_array, unit="quarter")?
Indeed! So something like:
return pc.equal(pc.days_between(pc.floor_temporal(array, unit="day"), pc.ceil_temporal(array, unit="quarter")), 1)
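The "exactly one day before the next quarter starts" check can be traced in plain Python on the same three timestamps shown in the debugger output. This is a hedged sketch: `floor_to_day`, `ceil_to_quarter`, and `is_quarter_end` are hypothetical stand-ins, and `ceil_to_quarter` assumes that a timestamp lying exactly on a quarter boundary is left unchanged.

```python
from datetime import datetime


def floor_to_day(ts: datetime) -> datetime:
    """Drop the time-of-day components."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def ceil_to_quarter(ts: datetime) -> datetime:
    """Round up to the next quarter start.

    Assumption: a value already at midnight on a quarter start maps to itself.
    """
    first_month = 3 * ((ts.month - 1) // 3) + 1
    quarter_start = datetime(ts.year, first_month, 1)
    if ts == quarter_start:
        return ts
    next_month = first_month + 3
    if next_month > 12:
        return datetime(ts.year + 1, 1, 1)
    return datetime(ts.year, next_month, 1)


def is_quarter_end(ts: datetime) -> bool:
    # True when the next quarter begins exactly one day after ts's date.
    return (ceil_to_quarter(ts) - floor_to_day(ts)).days == 1


samples = [
    datetime(2023, 11, 30, 3),
    datetime(2023, 1, 1, 3),
    datetime(2023, 3, 31, 3),
]
result = [is_quarter_end(ts) for ts in samples]
print(result)  # [False, False, True]
```

Only 2023-03-31 is one day away from the next quarter start (2023-04-01), so it is the only sample flagged as a quarter end.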
Thanks for all the reviews @rok!
Going to merge this in. Can address any follow-ups if needed.
…andas-dev#52503)
* Add more properties & attributes
* Add issue number
* Add xfails
* Simplify days_in_month
* Add tz_convert
* Undo quarter
* Add another issue
* simplify is_quarter
* undo test
* simplify
* fix is_quarter_end
* Address is_month_end
* Remove unused
ArrowTemporalProperties object has no attribute day_name #52388