Skip to content

ENH: Add more dt property/method support for ArrowDtype(timestamp) #52503

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Apr 21, 2023

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Apr 6, 2023

@mroeschke mroeschke added Datetime Datetime data dtype Arrow pyarrow functionality labels Apr 6, 2023
@mroeschke mroeschke added this to the 2.1 milestone Apr 6, 2023
@mroeschke
Copy link
Member Author

@phofl do you prefer backporting this too?

@phofl
Copy link
Member

phofl commented Apr 11, 2023

No real opinion here, I think the string equivalent is used more widely so was better to back port the split pr

Comment on lines 2121 to 2125
@property
def _dt_is_quarter_start(self):
is_correct_month = pc.is_in(pc.month(self._pa_array), pa.array([1, 4, 7, 10]))
is_first_day = pc.equal(pc.day(self._pa_array), 1)
return type(self)(pc.and_(is_correct_month, is_first_day))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use pc.floor_temporal? E.g:

return pc.equal(pc.floor_temporal(self._pa_array, unit='quarter'), self._pa_array)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For is_quarter_start, even if the date has non zero hour, minute, etc components we want to return true if the month and day are correct, so unfortunately we don't want the flooring/ceiling of those components e.g.

In [15]: ser
Out[15]: 
0   2023-11-30 03:00:00
1   2023-01-01 03:00:00
2   2023-03-31 03:00:00
3                   NaT
dtype: datetime64[ns]

In [16]: ser.dt.is_quarter_start
Out[16]: 
0    False
1     True
2    False
3    False
dtype: bool

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. How about:

pc.equal(pc.floor_temporal(self._pa_array, unit='quarter'), pc.floor_temporal(self._pa_array, unit='day'))

Comment on lines 2127 to 2132
@property
def _dt_is_quarter_end(self):
is_correct_month = pc.is_in(pc.month(self._pa_array), pa.array([3, 6, 9, 12]))
plus_one_day = pc.add(self._pa_array, pa.scalar(datetime.timedelta(days=1)))
is_first_day = pc.equal(pc.day(plus_one_day), 1)
return type(self)(pc.and_(is_correct_month, is_first_day))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above but with pc.ceil_temporal.

Comment on lines 2134 to 2146
@property
def _dt_days_in_month(self):
pa_array = self._pa_array
if self.dtype.pyarrow_dtype.unit != "ns":
pa_array = self._pa_array.cast(
pa.timestamp("ns", tz=self.dtype.pyarrow_dtype.tz)
)
pa_array_int = pa_array.cast(pa.int64())
if self._hasna:
pa_array_int = pa_array_int.fill_null(iNaT)
np_result = get_date_field(pa_array_int.to_numpy(), "dim")
mask = np_result == -1
return type(self)(pa.array(np_result, mask=mask))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might make sense to implement upstream. Otherwise this might work:

pc.days_between(pc.floor_temporal(self._pa_array, unit='month'), pc.ceil_temporal(self._pa_array, unit='month'))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This works perfectly

Comment on lines 2298 to 2301
if ambiguous != "raise":
raise NotImplementedError(f"{ambiguous=} is not supported")
if nonexistent != "raise":
raise NotImplementedError(f"{nonexistent=} is not supported")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these checks really required? Do we actually need ambiguous/nonexistent?

Comment on lines 2303 to 2309
if tz is None:
result = self._pa_array.cast(pa.timestamp(current_unit, "UTC")).cast(
pa.timestamp(current_unit)
)
else:
pa_tz = str(tz)
result = self._pa_array.cast(pa.timestamp(current_unit, pa_tz))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably reduce this to just:

Suggested change
if tz is None:
result = self._pa_array.cast(pa.timestamp(current_unit, "UTC")).cast(
pa.timestamp(current_unit)
)
else:
pa_tz = str(tz)
result = self._pa_array.cast(pa.timestamp(current_unit, pa_tz))
result = self._pa_array.cast(pa.timestamp(current_unit, pa_tz))

def _dt_is_quarter_end(self):
result = pc.equal(
pc.ceil_temporal(self._pa_array, unit="quarter"),
pc.ceil_temporal(self._pa_array, unit="day"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this have to be:

Suggested change
pc.ceil_temporal(self._pa_array, unit="day"),
pc.floor_temporal(self._pa_array, unit="day"),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so I just noticed pc.ceil_temporal(self._pa_array, unit="quarter") will return the start of the new quarter

(Pdb) self._pa_array
<pyarrow.lib.ChunkedArray object at 0x14764cb30>
[
  [
    2023-11-30 03:00:00.000000,
    2023-01-01 03:00:00.000000,
    2023-03-31 03:00:00.000000,
    null
  ]
]
(Pdb) pc.ceil_temporal(self._pa_array, unit="quarter")
<pyarrow.lib.ChunkedArray object at 0x1478e1860>
[
  [
    2024-01-01 00:00:00.000000,
    2023-04-01 00:00:00.000000,
    2023-04-01 00:00:00.000000,
    null
  ]
]
(Pdb) pc.floor_temporal(self._pa_array, unit="day")
<pyarrow.lib.ChunkedArray object at 0x1478e1950>
[
  [
    2023-11-30 00:00:00.000000,
    2023-01-01 00:00:00.000000,
    2023-03-31 00:00:00.000000,
    null
  ]
]

so I would need to check if pc.floor_temporal(self._pa_array, unit="day") + 1 day = pc.ceil_temporal(self._pa_array, unit="quarter")?

Copy link
Contributor

@rok rok Apr 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! So something like:

return pc.equal(pc.days_between(pc.floor_temporal(array, unit="day"), pc.ceil_temporal(array, unit="quarter")), 1)

@mroeschke
Copy link
Member Author

Thanks for all the reviews @rok!

@mroeschke
Copy link
Member Author

Going to merge this in. Can address any follow ups if needed

@mroeschke mroeschke merged commit 367d24f into pandas-dev:main Apr 21, 2023
@mroeschke mroeschke deleted the enh/arrow/more_dt branch April 21, 2023 17:59
Rylie-W pushed a commit to Rylie-W/pandas that referenced this pull request May 19, 2023
…andas-dev#52503)

* Add more properties & attributes

* Add issue number

* Add xfails

* Simplify days_in_month

* Add tz_convert

* Undo quarter

* Add another issue

* simplify is_quarter

* undo test

* simplify

* fix is_quarter_end

* Address is_month_end

* Remove unused
@rhshadrach rhshadrach mentioned this pull request Aug 15, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: ArrowTemporalProperties object has no attribute day_name BUG: tz_convert not implemented for arrow timestamps
3 participants