-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
Moving to PyArrow dtypes by default #61618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Moving users to pyarrow types by default in 3.0 would be insane because #32265 has not been addressed. Getting that out of the way was the point of PDEP 16, which you and Simon are oddly hostile toward. |
Since I helped draft PDEP-10, I would like a world where the Arrow type system with NA semantics would be the only pandas type system. Secondarily, I would like a world where pandas has the Arrow type system with NA semantics and the legacy (NaN) Numpy type system which is completely independent from the Arrow type system (e.g. a user cannot mix the two in any way) I agree with Brock that null semantics (NA vs NaN) must inevitably be discussed with adopting a new type system. I've also generally been concerned about the growing complexity of PDEP-14 with the configurability of "storage" and NA semantics (like having to define a comparison hierarchy #60639). While I understand that we've been very cautious about compatibility with existing types, I don't think this is maintainable or clearer for users in the long run. My ideal roadplan would be:
|
So with this deprecation enforced, NaN in a constructor, setitem, or csv would be treated as distinct from pd.NA? If so, I’m on board but I expect that to be a painful deprecation. |
To be clear, I'm not hostile towards PDEP-16 at all. I think it's important for pandas to have clear and simple missing value handling, and while incomplete, I think the PDEP and the discussions have been very useful and insightful. Amd I really appreciate that work. I just don't see PDEP-16 as a blocker for moving to pyarrow, even less if implemented at the same time. And also, I wouldn't spend time with our own nullable dtypes, I would implement PDEP-16 only for pyarrow types. I couldn't agree more on the points Matt mentions for pandas 3.x. Personally I would change the default earlier. Sounds like pandas 4.x is mostly about showing a warning to users until they manually change the default, which I personally wouldn't do. But that's a minor point, I like the general idea. |
Correct. Yeah I am optimistic that most of the deprecation would hopefully go into
There is a lot of references about type systems in that PDEP that I think would warrant some re-imagining given which type systems are favored. As mentioned before (unfortunately) I think type systems and missing value semantics need to be discussed together |
I created a separate issue #61620 for the option mentioned in the description and in Matt's roadmap, since I think that's somehow independent, and no blocked by PDEP-15, by this issue, or by nothing else that I know.
I fully agree with this. But I'm not sure I fully understand why PDEP-16 must be a blocker for defaulting to PyArrow types. For users already using PyArrow, they'll have to follow the deprecation transition if they are using For users not yet using PyArrow, I do understand that it's better to force the move when the PyArrow dtypes behave as we think they should behave. I'm not convinced this should be a blocker, and even less if the deprecation of the special treatment of |
@mroeschke a couple of questions:
|
I have always understood the ArrowExtensionArray and ArrowDtype to be experimental, no PDEP and no roadmap item, for the purpose of evaluating PyArrow types in pandas to potentially eventually use as a backend for pandas nullable dtypes. So I can sort of understand why the ArrowDtypes are no longer pure and have allowed pandas semantics to creep into the API. As experimental dtypes why do they need any deprecation at all? Where do we promote these types as pandas recommended types? |
just to be clear about my previous statement, there is a roadmap item
but I've never interpreted this to cover adopting the current ArrowDtype system throughout pandas |
@mroeschke given the current level of funding and interest for the development of the pandas nullable dtypes and the proportion of core devs that now appear to favor embracing the ArrowDtype instead, I fear that this may be, at this time, the most pragmatic approach in keeping pandas development moving forward as it does seem to have slowed of late. I'm not necessarily comfortable deprecating so much prior effort but then the same could have been said about Panel many years ago and I'm not sure anyone misses it today. If the community wants nullable dtypes by default, they may be less interested in the implementation details or even to some extent the performance. If the situation changes and there is more funding and contributions in the future and we have released a Windows 8 in the meantime then we could perhaps bring back the pandas nullable types. |
The goal of this and PDEP-13 were pretty aligned; prefer PyArrow to build out our default type system where applicable, and fill in the gaps using whatever we have as a fallback. That PDEP conversation stalled; not sure if its worth reviving or if this issue is going to tackle a smaller subset of the problem, but in any case I definitely support this |
@datapythonista I just would like some agreement that that defaulting to PyArrow types also matches PDEP-16's proposal to (only) NA semantics for this type as well when making the change for a consistent story, but I suppose they don't need to be done at the same time
@simonjayhawkins yes, it purely uses
It does not, and ideally it won't. When interacting with other parts of Cython in pandas, e.g.
This is a valid point; technically it shouldn't require deprecation. While our docs don't necessarily state a recommended type, anecdotally, it has felt like in past year or two there's been quite a number of conference talks, blogs posts, books that have "celebrated" the newer Arrow types in pandas. Although attention != usage, it may still warrant some care if changing behavior IMO.
An alternative to completely deprecating and removing the pandas NumPy nullable types is to spin them off into their own repository & package and treat them like any other 3rd party |
we have a roadmap item:
the roadmap also states
so we could perhaps have a separate issue on this (perhaps culminating in a PDEP to clear this off that list) to address any points to make the spin off easier? |
In the absence of a competing proposal, I think you should proceed assuming that PDEP-16 is going to be accepted someday. A stale PDEP cannot be allowed to hold up development in other areas, especially if you have the bandwidth and resources to move forward on this. |
just to be clear, Int, Float, Boolean and pd.NA variants of StringArray only?
does that include numpy object type? |
I'm going to throw out another (possibly crazy) idea based on these comments:
The idea is as follows:
We only do bug fixes to If they want "modern" pandas, they fully buy into Then we don't have to worry about a transition path that WE maintain. The transition is up to the users. They can take working code with |
Generally curious but why do we feel the need to remove the extension types? If anything, couldn't we just update them to use PyArrow behind the scenes? I realize we use the term "numpy_nullable" throughout, but conceptually there's nothing different between what they and their corresponding Arrow types do. The Arrow implementation is just going to be more lightweight, so why not update those in place and not require users to change their code? |
All proposals and discussion points so far seem very interesting. Personally I think it's great we are having this conversation. Based on the feedback above, what makes most sense to me is:
This wouldn't take us to a perfect situation, but it makes the transition path very easy for both users and us, and we can start focussing on pandas backed by Arrow, while gathering feedback from users quite fast without creating them any inconvenient (other that adding a line of code setting the option). After getting the feedback we will be in a better position to decide on deprecating the legacy types, moving them to a separate repo, maintaining them forever... |
Yes and I'd say all variants of
I would say yes. |
3.0 is already huge (CoW, Strings) and way, way late. Let's not risk adding another year to that lateness. |
Deprecating that behavior is relatively easy (though I kind of expect @jorisvandenbossche to chime in that we're missing something important). The hard part is the step after that telling users they have to change all their existing As to "huge effort", off the top of my head, some tasks that would be required in order to make pyarrow dtypes the default include: a) making object, category, interval, period dtypes Just Work |
Thanks @jbrockmendel, this is great feedback. Do you think implementing the global option (without defaulting to arrow or fixing the above points) is reasonable for 3.0 (releasing it soon)? For numpy nullable types, does that same list apply? Or just the work on missing value semantics is needed to consider them ready to be the default? |
Everything except the performance would need to be done regardless of the default. |
Trying to catch up with all discussions .. but already two quick points of feedback:
I agree with @jbrockmendel here. Personally I would certainly not consider any of this for 3.0. There is still lots of work to do before we can even consider making the pyarrow dtypes the default (right now users cannot even try to this out fully), which I think will easily take at least a year.
Personally, I would prefer that we go for the model chosen for the As another example of this, I think we should create a variant of |
Thanks @jorisvandenbossche
This aligns with my expectation of the future direction of pandas. Do we have any umbrella discussion anywhere that could be used as the basis of a PDEP so this is on the roadmap. (note: I'm not in any way volunteering to take that on) I don't think PDEP-16 covers the using of pyarrow as a backend only and hiding the functionality behind our pandas nullable types as it is focused on the missing value semantics. Now, I appreciate that is part of that plan, but is not the plan as a roadmap item. But to be fair, ArrowDType was introduced without a PDEP or IMO a roadmap item that covered what was implemented. As said above, these types have been promoted at conferences etc and are potentially being actively used. This is now IMO pulling the development team in two different directions. For instance, I am very frustrated at the ArrowDType for string being available to users as it has caused so much confusion even within the team. IMO we don't need this as "experimental" since we have the pandas nullable dtype available. However, I also appreciate that for evaluating an experimental dtype system (not just individual dtypes) we should perhaps include it. So I am conflicted. Now we are where we are. And we probably now have as many (or more) core-devs that appear to favor the ArrowDtype system. So managing that is now likely as difficult as the technical discussions themselves. @mroeschke please don't take this as criticism as the work you have done is great (for the whole project, the maintenance and not just the Arrow work) and any issues that have arisen are systematic failures. As @WillAyd pointed out the PDEP process needs to be able to allow to make decisions without every minute detail ironed out before the PDEP is approved. |
Thanks all for the feedback. Personally feels like while opening so many discussions have been a bit chaotic (sorry about that as I started or reopened most), I think it's been very productive. I'd like to know how do you feel about trying to unify all the type system topics into a single PDEP that broadly defines a roadmap for the next steps and final goal. Maybe I'm too optimistic, but I think there is agreement in most points, and it's just the details that need to be decided. Some of the things that I think could go into the PDEP and there is mostly agreement:
Does people feel like it make sense to write a PDEP with the general picture of this roadmap (that should superseed this issue, PDEP-15 and probably others)? Does anyone want to lead the effort to get this PDEP done? |
None taken :) . Yes, the development of these types preceded our agreed-upon approval process for larger pandas enhancements.
FWIW, the early model of |
There have been some discussions about this before, but as far as I know there is not a plan or any decision made.
My understanding is (feel free to disagree):
@jbrockmendel commented this, and I think many others share this point of view, based on past interactions:
It would be interesting to know why exactly, I guess it's mainly because of two main reasons:
I don't know the exact state of the PyArrow types, and how often users will face problems if using them instead of the original ones. From my perception, there aren't any major efforts to make them better at this point. So, I'm unsure if the situation in that regard will be very different if we make the PyArrow types the default ones tomorrow, or if we make them the default ones in two years.
My understanding is that the only person who is paid consistently to work on pandas is Matt, and he's doing an amazing job at keeping the project going, reviewing most of the PRs, keeping the CI in good shape... But I don't think not him not anyone else if being able to put hours into developing new things as it used to be. For reference, this is the GitHub chart of pandas activity (commits) since pandas 2.0:
So, in my opinion, the existing problems with PyArrow will start to be addressed significantly, whenever they become the default ones.
So, in my opinion our two main options are:
Of course not all users are ready for pandas 3.0 with Arrow types. They can surely pin to
pandas=2
until pandas 3 is more mature and they made the required changes to their code. We can surely add a flagpandas.options.mode.use_arrow = False
that reverts the new default to the old status quo. So users can actually move to pandas 3.0 but stay with the old types until we (pandas) and them are ready to get into the new default types. The transition from Python 2 to 3 (which is surely an example of what not to do) took more than 10 years. I don't think in our case we need as much. And if there is interest (aka money) we can also support the pandas 2 series while needed.multiple major release cycles
, so I assume something like 3 at a rate of one major release every 2 years).It will be great to know what are other people's thoughts and ideal plans, and see what makes more sense. But to me personally, based on the above information, it doesn't sound more insane to move to PyArrow in pandas 3, than to move in pandas 6.
The text was updated successfully, but these errors were encountered: