PDEP-1: Purpose and guidelines for pandas enhancement proposals #47444
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -15,10 +15,53 @@ fundamental changes to the project that are likely to take months or | |||||
years of developer time. Smaller-scoped items will continue to be | ||||||
tracked on our [issue tracker](https://github.com/pandas-dev/pandas/issues). | ||||||
|
||||||
See [Roadmap evolution](#roadmap-evolution) for proposing | ||||||
changes to this document. | ||||||
The roadmap is defined as a set of major enhancement proposals named PDEPs. | ||||||
For more information about PDEPs, and how to submit one, please refer to | ||||||
[PDEP-1](/pdeps/accepted/0001-puropose-and-guidelines.html). | ||||||
|
||||||
## Extensibility | ||||||
## PDEPs | ||||||
|
||||||
### PDEPs under discussion | ||||||
|
||||||
{% for pdep in pdeps.under_discussion -%} | ||||||
- [{{ pdep.title }}]({{ pdep.url }}) | ||||||
{% else %} | ||||||
There are currently no PDEPs under discussion | ||||||
{% endfor %} | ||||||
|
||||||
### Accepted PDEPs | ||||||
|
||||||
{% for pdep in pdeps.accepted -%} | ||||||
- [{{ pdep.title }}]({{ pdep.url }}) | ||||||
{% else %} | ||||||
There are currently no accepted PDEPs | ||||||
{% endfor %} | ||||||
|
||||||
### Rejected PDEPs | ||||||
|
||||||
{% for pdep in pdeps.rejected -%} | ||||||
- [{{ pdep.title }}]({{ pdep.url }}) | ||||||
{% else %} | ||||||
There are currently no rejected PDEPs | ||||||
{% endfor %} | ||||||
|
||||||
### Implemented PDEPs | ||||||
|
||||||
{% for pdep in pdeps.implemented -%} | ||||||
- [{{ pdep.title }}]({{ pdep.url }}) | ||||||
{% else %} | ||||||
There are currently no implemented PDEPs | ||||||
{% endfor %} | ||||||
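
As a rough illustration of what each of the four template blocks above does, the following Python sketch mimics a single `{% for %}` / `{% else %}` / `{% endfor %}` block: it emits one Markdown link per PDEP, or the fallback sentence when the list is empty. The function name, the dictionary keys and the sample URL are hypothetical and are not taken from the pandas website code. Python's own `for`/`else` has different semantics from the template's, which is why the sketch uses an explicit emptiness check.

```python
def render_pdep_section(pdeps, empty_message):
    """Mimic one `{% for %}...{% else %}...{% endfor %}` block of the template.

    `pdeps` is assumed to be a list of dicts with "title" and "url" keys
    (a hypothetical shape chosen for this sketch).
    """
    # In the template, the `else` branch renders only when the sequence is empty.
    if not pdeps:
        return empty_message
    # Otherwise, one Markdown list item with a link per PDEP.
    return "\n".join(f"- [{p['title']}]({p['url']})" for p in pdeps)


# Hypothetical sample data, for illustration only.
accepted = [{"title": "PDEP-1: Purpose and guidelines", "url": "/pdeps/0001.html"}]
print(render_pdep_section(accepted, "There are currently no accepted PDEPs"))
print(render_pdep_section([], "There are currently no rejected PDEPs"))
```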
|
||||||
## Roadmap points pending a PDEP | ||||||
|
||||||
<div class="alert alert-warning" role="alert"> | ||||||
pandas is in the process of moving roadmap points to PDEPs (implemented in | ||||||
June 2022). During the transition, some roadmap points will exist as PDEPs, | ||||||
Review comment (suggested change): to match the PDEP created date? But probably August.
Reply: yeah, switch to August :->
||||||
while others will exist as sections below. | ||||||
</div> | ||||||
|
||||||
### Extensibility | ||||||
|
||||||
Pandas `extending.extension-types` allow | ||||||
for extending NumPy types with custom data types and array storage. | ||||||
|
@@ -33,7 +76,7 @@ library, making their behavior more consistent with the handling of | |||||
NumPy arrays. We'll do this by cleaning up pandas' internals and | ||||||
adding new methods to the extension array interface. | ||||||
|
||||||
## String data type | ||||||
### String data type | ||||||
|
||||||
Currently, pandas stores text data in an `object`-dtype NumPy array. | ||||||
The current implementation has two primary drawbacks: First, `object` | ||||||
|
@@ -54,7 +97,7 @@ work, we may need to implement certain operations expected by pandas | |||||
users (for example, the algorithm used in `Series.str.upper`). That work | ||||||
may be done outside of pandas. | ||||||
|
||||||
## Apache Arrow interoperability | ||||||
### Apache Arrow interoperability | ||||||
|
||||||
[Apache Arrow](https://arrow.apache.org) is a cross-language development | ||||||
platform for in-memory data. The Arrow logical types are closely aligned | ||||||
|
@@ -65,7 +108,7 @@ data types within pandas. This will let us take advantage of its I/O | |||||
capabilities and provide for better interoperability with other | ||||||
languages and libraries using Arrow. | ||||||
|
||||||
## Block manager rewrite | ||||||
### Block manager rewrite | ||||||
|
||||||
We'd like to replace pandas' current internal data structures (a | ||||||
collection of 1 or 2-D arrays) with a simpler collection of 1-D arrays. | ||||||
|
@@ -92,7 +135,7 @@ See [these design | |||||
documents](https://dev.pandas.io/pandas2/internal-architecture.html#removal-of-blockmanager-new-dataframe-internals) | ||||||
for more. | ||||||
|
||||||
## Decoupling of indexing and internals | ||||||
### Decoupling of indexing and internals | ||||||
|
||||||
The code for getting and setting values in pandas' data structures | ||||||
needs refactoring. In particular, we must clearly separate code that | ||||||
|
@@ -107,7 +150,7 @@ Indexing is a complicated API with many subtleties. This refactor will | |||||
require care and attention. More details are discussed at | ||||||
<https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code> | ||||||
|
||||||
## Numba-accelerated operations | ||||||
### Numba-accelerated operations | ||||||
|
||||||
[Numba](https://numba.pydata.org) is a JIT compiler for Python code. | ||||||
We'd like to provide ways for users to apply their own Numba-jitted | ||||||
|
@@ -119,7 +162,7 @@ window contexts). This will improve the performance of | |||||
user-defined-functions in these operations by staying within compiled | ||||||
code. | ||||||
|
||||||
## Documentation improvements | ||||||
### Documentation improvements | ||||||
|
||||||
We'd like to improve the content, structure, and presentation of the | ||||||
pandas documentation. Some specific goals include | ||||||
|
@@ -134,7 +177,7 @@ pandas documentation. Some specific goals include | |||||
subsections of the documentation to make navigation and finding | ||||||
content easier. | ||||||
|
||||||
## Performance monitoring | ||||||
### Performance monitoring | ||||||
|
||||||
Pandas uses [airspeed velocity](https://asv.readthedocs.io/en/stable/) | ||||||
to monitor for performance regressions. ASV itself is a fabulous tool, | ||||||
|
@@ -154,29 +197,3 @@ We'd like to fund improvements and maintenance of these tools to | |||||
<https://pyperf.readthedocs.io/en/latest/system.html> | ||||||
- Build a GitHub bot to request ASV runs *before* a PR is merged. | ||||||
Currently, the benchmarks are only run nightly. | ||||||
|
||||||
## Roadmap Evolution | ||||||
|
||||||
Pandas continues to evolve. The direction is primarily determined by | ||||||
community interest. Everyone is welcome to review existing items on the | ||||||
roadmap and to propose a new item. | ||||||
|
||||||
Each item on the roadmap should be a short summary of a larger design | ||||||
proposal. The proposal should include | ||||||
|
||||||
1. Short summary of the changes, which would be appropriate for | ||||||
inclusion in the roadmap if accepted. | ||||||
2. Motivation for the changes. | ||||||
3. An explanation of why the change is in scope for pandas. | ||||||
4. Detailed design: Preferably with example-usage (even if not | ||||||
implemented yet) and API documentation | ||||||
5. API Change: Any API changes that may result from the proposal. | ||||||
|
||||||
That proposal may then be submitted as a GitHub issue, where the pandas | ||||||
maintainers can review and comment on the design. The [pandas mailing | ||||||
list](https://mail.python.org/mailman/listinfo/pandas-dev) should be | ||||||
notified of the proposal. | ||||||
|
||||||
When there's agreement that an implementation would be welcome, the | ||||||
roadmap should be updated to include the summary and a link to the | ||||||
discussion issue. |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,110 @@ | ||||||
# PDEP-1: Purpose and guidelines | ||||||
|
||||||
- Date: 21 June 2022 | ||||||
Review comment: Maybe "Created" instead of "Date"? Since this will typically be the date when the process is started, and not when it was e.g. accepted, "Created" might denote that more correctly (and both NEPs and PEPs seem to use that).
||||||
- Status: Accepted | ||||||
- Discussion: [#47444](https://github.com/pandas-dev/pandas/pull/47444) | ||||||
- Author: [Marc Garcia](https://github.com/datapythonista) | ||||||
|
||||||
## PDEP definition, purpose and scope | ||||||
|
||||||
A PDEP (pandas enhancement proposal) is a proposal to a **major** change in | ||||||
Review comment (suggested change): not fully sure, but "proposal to" sounds like it should be followed by a verb.
||||||
pandas, in a similar way as a Python [PEP](https://peps.python.org/pep-0001/) | ||||||
or a NumPy [NEP](https://numpy.org/neps/nep-0000.html). | ||||||
|
||||||
Bug fixes and conceptually minor changes (e.g. adding a parameter to a function) | ||||||
are out of the scope of PDEPs. A PDEP should be used for changes that are not | ||||||
immediate and not obvious, and are expected to require a significant amount of | ||||||
discussion and require detailed documentation before being implemented. | ||||||
|
||||||
PDEPs are appropriate for user-facing changes, internal changes and organizational | ||||||
discussions. Examples of topics worth a PDEP could include moving a module from | ||||||
pandas to a separate repository, a refactoring of the pandas block manager or | ||||||
a proposal of a new code of conduct. | ||||||
|
||||||
## PDEP guidelines | ||||||
|
||||||
### Target audience | ||||||
|
||||||
A PDEP is a public document available to anyone, but the main stakeholders to | ||||||
consider when writing a PDEP are: | ||||||
|
||||||
- The core development team, who will have the final decision on whether a PDEP | ||||||
is approved or not | ||||||
- Developers of pandas and other related projects, and experienced users. Their | ||||||
Review comment (suggested change): might be a bit more generic wording, as "developer" sounds more code-centric?
||||||
feedback is highly encouraged and appreciated, to make sure all points of view | ||||||
are taken into consideration | ||||||
- The wider pandas community, in particular users, who may or may not have feedback | ||||||
on the proposal, but should know and be able to understand the future direction of | ||||||
the project | ||||||
|
||||||
### PDEP authors | ||||||
|
||||||
Anyone can propose a PDEP, but in most cases developers of pandas itself and related | ||||||
projects are expected to author PDEPs. If you are unsure if you should be opening | ||||||
an issue or creating a PDEP, it's probably safe to start by | ||||||
[opening an issue](https://github.com/pandas-dev/pandas/issues/new/choose), which can | ||||||
be eventually moved to a PDEP. | ||||||
|
||||||
### Workflow | ||||||
|
||||||
#### Submitting a PDEP | ||||||
|
||||||
Proposing a PDEP is done by creating a PR adding a new file to `web/pdeps/accepted/`. | ||||||
The file is a markdown file; you can use `web/pdeps/accepted/0001.md` as a reference | ||||||
Review comment: I feel like it would be convenient to have an actual blank template PDEP file to be filled out. It might also be nice to have a script that generates the next PDEP number for you (I anticipate it might be hard to find the next PDEP number if a lot of PDEPs are submitted in the future).
Reply: I'd personally leave that for later, once we need it. Those sound like good ideas to me, but I wouldn't implement them initially; I would wait to see how things work first. If we merge something like one PDEP per month, I think checking the last PDEP number before merging is easier than a system to autogenerate them. And for a template, I'd wait to have a few actual PDEPs before deciding if it helps, or if PDEPs are too different from one another. Does it make sense to you to start simple and iterate later as we have more experience?
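
For reference, the numbering helper suggested in this thread could be as small as the Python sketch below. It is hypothetical and not part of this PR: it assumes PDEP files start with a four-digit number, as in the `0001.md` name mentioned in the text, and the function name and sample file names are made up for illustration.

```python
import re


def next_pdep_number(filenames):
    """Return the next free PDEP number as a zero-padded four-digit string.

    Assumes (hypothetically) that every PDEP file name starts with its
    four-digit number, e.g. "0001.md"; other files are ignored.
    """
    numbers = []
    for name in filenames:
        match = re.match(r"(\d{4})", name)
        if match:
            numbers.append(int(match.group(1)))
    # With no numbered files yet, numbering starts at 0001.
    return f"{max(numbers, default=0) + 1:04d}"


# Hypothetical directory listing, for illustration only.
print(next_pdep_number(["0001.md", "0002-some-proposal.md", "README.md"]))
```

In practice the list of file names would come from the PDEP directories in the repository; as the reply notes, checking the last number by hand may be simpler while few PDEPs exist.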
||||||
for the expected format. | ||||||
|
||||||
By default, we expect a PDEP will be accepted, so the PR of a PDEP should be done | ||||||
Review comment: Do we want to also have a public comment period on PDEP's like tensorflow has with their RFC's? We should probably also clarify when voting happens (define what proportion of core team members need to consider a PDEP ready before voting like @mroeschke stated) and how long core team members have to vote.
Reply: Happy to discuss and hear other opinions. But to me personally, I wouldn't have a voting period or deadline. I see PDEPs more like ongoing discussions, with feedback and updates, than just a voting process. Also, I assume tensorflow core devs are mainly google employees, so imposing a deadline makes more sense, as they are supposed to be working on the project X hours. But for a mostly volunteer developed project like pandas, I'd leave it more open. But again, I'd start by seeing how things work, and if we see PDEPs keep open for too long, and seems like having a timeframe should help, surely worth giving it a try.
Reply: I agree, just having a PDEP is akin to a new process, so having a process for the PDEP is maybe like introducing another meta process at the same time. So in response to this comment and others related to the process, I agree with @datapythonista that to begin with we just discuss whether we want to use PDEPs, what we want out of a PDEP, and roughly what it should contain and iterate on the gaps over time. The only caveat here, is that once someone has spent effort preparing a PDEP and it is approved that we don't have "blockers" implementing, so that does affect the approval process to some degree.
Reply: i agree - we need a voting process here - we could emulate that of NEP which is pretty simple
Reply: I think what Marc proposed above is to defer a discussion about the decision process (basically our governance model) for a subsequent discussion. (which sounds good to me to keep the discussion manageable)
Review comment: i am not sure this is true - sure a lot will easily be accepted but some might be controversial and ultimately not accepted
Reply: yeah. maybe just qualify this by changing before this text we have...
so the PEP is part of the detailed documentation and an issue should perhaps be the initial part of the discussion. (just as I would normally expect say a bug fix PR to have an associated issue) and we also have
I think that maybe issues should always be opened before submitting a PEP, either as a specific issue or an existing issue concluding that a PEP is required. Maybe we also need to ensure that If we have some form of gate for opening a PEP, then most should be accepted.
Reply: We could also just leave out the "By default, we expect a PDEP will be accepted"? To me it doesn't seem to add much, rather than a potentially wrong/confusing message (in general PDEPs are for topics that are not trivial and thus will not always be accepted)
||||||
in the `accepted` directory and contain `Status: Accepted`. If a PDEP is finally | ||||||
Review comment: We could also default to start a PDEP in a "draft" status (or "under discussion", I see there is a section for those)?
Reply: Sounds good to me. Regarding merging PDEPs as draft, feels like it could make discussions more difficult. Like where to have the discussion once the PR is merged. Via new PRs? Or in a second PR, not being able to comment in parts not in the diff. But I guess in some cases it can be useful, maybe a PR with different parts, and merging as a Draft once the first part has been discussed, and before working on other parts. I'll update the document if there are no objections to using Under discussion status (or Draft if people prefer this name) until PR is Accepted. And to consider merging PRs before they are accepted when it's useful.
Reply: IMO I think the current guidance stated here is clear: as long as a PDEP is an open PR, it should be considered "under discussion", and a draft PDEP can be a draft PR for example. I agree with @datapythonista that merging intermediate states of a PDEP would fragment the discussion and there shouldn't be more than one page/PR/location where a PDEP can be discussed.
Reply: We will probably have to see how it goes in practice. For many PDEPs a single PR with proposal, discussion, acceptance all in one will quite likely be sufficient. As some references: numpy seems to already merge NEP PRs if they are still under discussion (eg numpy/numpy#18456), but they also have substantial discussion on the mailing list. Similarly in Python, PEPs are merged quickly, but also there the discussion is mostly happening on mailing list or discourse. That's of course different with the current way in pandas where most discussion typically happens on GitHub.
||||||
rejected, its status and directory will be updated by the core team before merging, | ||||||
once the decision is made. Please make sure you select the option | ||||||
`Allow edits and access to secrets by maintainers` when opening the PR. | ||||||
|
||||||
#### Accepted PDEP | ||||||
|
||||||
A PDEP will be accepted by the core development team, and decisions will be made | ||||||
Review comment: Maybe reword "will" to something else? This language makes it feel as if all PDEP's will be accepted.
||||||
based on the [pandas governance document](https://github.com/pandas-dev/pandas-governance/blob/master/governance.md). | ||||||
|
||||||
Once a PDEP is accepted, contributions toward implementing it can be made with an | ||||||
open-ended completion timeline. pandas is developed by a mix of volunteers and | ||||||
developers paid from different sources, so development priorities are difficult to | ||||||
understand or forecast. Companies, institutions or individuals interested in seeing a | ||||||
PDEP implemented, or in seeing progress on the pandas roadmap in general, | ||||||
can check how to help on the [contributing page](/contribute.html). | ||||||
|
||||||
#### Implemented PDEP | ||||||
|
||||||
Once a PDEP is implemented and available in the main branch of pandas, its | ||||||
|
||||||
status will be changed to implemented, so that it is clear the PDEP | ||||||
is no longer part of the roadmap and future plans, but a change that has already | ||||||
happened. The first pandas version in which the PDEP implementation is | ||||||
available will also be included in the PDEP. | ||||||
|
||||||
#### Rejected PDEP | ||||||
|
||||||
A PDEP can be rejected when the final decision is that its implementation is | ||||||
not in the best interests of the project. They are as useful as accepted | ||||||
Review comment: The word "accepted" seems a bit strange here, since this is about PDEPs that are not accepted?
Reply: This is to contrast rejected PDEPs against accepted PDEPs; i.e. "even rejected PDEPs are useful". I think the wording is okay, but maybe "Rejected PDEPs are just as useful as accepted PDEPs..." would make it more clear?
Reply: Ah, I missed the first "as" in the sentence, so I read it as "they are useful as accepted PDEP", instead of "they are as useful as accepted PDEPs", hence the confusion.
||||||
PDEPs, since the discussions are worth having and they record decisions about | ||||||
changes to pandas. They will be merged with `Status: Rejected`, so | ||||||
there is visibility on what was discussed and what the outcome of the | ||||||
discussion was. A PDEP can be rejected for different reasons, for example a good idea | ||||||
that isn't backward-compatible, where the breaking changes aren't considered worth | ||||||
implementing. | ||||||
|
||||||
#### Invalid PDEP | ||||||
|
||||||
For submitted PDEPs that do not contain proper documentation, are out of scope, or | ||||||
are not useful to the community for any other reason, the PR will be closed after | ||||||
discussion with the author, instead of merging them as rejected. This is to not | ||||||
add noise to the list of rejected PDEPs, which should contain documentation as | ||||||
good as an accepted PDEP, but where the final decision was to not implement the changes. | ||||||
|
||||||
## Evolution of this PDEP | ||||||
|
||||||
While most PDEPs aren't expected to change after being accepted, in some cases, like this | ||||||
PDEP, they will be updated to reflect their latest version if things evolve. A log | ||||||
Review comment: We should probably clarify when a PDEP can be revised vs. when a new PDEP has to be submitted superseding the old PDEP.
||||||
of the summary of the changes will be kept to make it easier to see if any change | ||||||
has happened. | ||||||
|
||||||
### PDEP-1 History | ||||||
|
||||||
- 21 June 2022: Initial version |
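
To make the expected file format more concrete, here is a minimal Python sketch that reads the metadata list at the top of a PDEP file such as this one (`- Date:`, `- Status:`, `- Discussion:`, `- Author:`) and checks the status. It is only an illustration under assumptions: the function names are hypothetical, and the status spellings "Under discussion" and "Implemented" are guesses based on the roadmap section titles; only `Status: Accepted` and `Status: Rejected` appear verbatim in the text.

```python
import re

# Assumed status spellings; only "Accepted" and "Rejected" appear verbatim in PDEP-1.
ALLOWED_STATUSES = {"Under discussion", "Accepted", "Rejected", "Implemented"}


def parse_pdep_header(text):
    """Collect the `- Key: value` lines that follow the PDEP title."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^- (\w+): (.+)$", line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
        elif fields and line.startswith("#"):
            # The first heading after the metadata block ends the header.
            break
    return fields


def has_valid_status(fields):
    """Check that a status is present and is one of the assumed values."""
    return fields.get("Status") in ALLOWED_STATUSES


# Header copied from the PDEP-1 text in this diff.
SAMPLE = """# PDEP-1: Purpose and guidelines

- Date: 21 June 2022
- Status: Accepted
- Discussion: [#47444](https://github.com/pandas-dev/pandas/pull/47444)
- Author: [Marc Garcia](https://github.com/datapythonista)

## PDEP definition, purpose and scope
"""

header = parse_pdep_header(SAMPLE)
print(header["Status"], has_valid_status(header))
```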