Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specification for computing the cumulative sum #653

Merged
merged 6 commits into from
Dec 14, 2023

Conversation

kgryte
Copy link
Contributor

@kgryte kgryte commented Jul 13, 2023

This PR

  • resolves Add support for computing the cumulative sum to the standard #597 by adding a specification for the cumulative sum.
  • avoids naming concerns by using the more verbose name cumulative_sum. This follows recent convention in the dataframe API standard (ref: Add some cumulative Column operations dataframe-api#193).
  • follows dtype conventions established in the specifications for sum, whereby, if no dtype is specified, the data type of the output array must have a range of values at least as large as the default data type of the same kind.
  • follows axis conventions whereby, for one-dimensional arrays, providing an axis is optional; however, if the input array has more than one dimension, providing an axis is required. This deviates from NumPy where not providing an axis computes a cumulative sum over the flattened array and follows PyTorch where providing an axis is required. The solution here attempts to thread the needle between the two, siding with convention established in take where convenience is given greater weight for one-dimensional arrays.
  • includes an include_initial keyword argument to control whether the initial value (by convention, the additive identity) is included in the output. The include_initial naming convention takes inspiration from NumPy#6044 which proposed include_identity, but this PR uses _initial to align with a future extension in which an initial kwarg specifies a custom initial value. By default, include_initial=False, which matches the default behavior across the ecosystem. As of this PR, only TensorFlow supports excluding/including the starting summation value.
  • handling of special values (e.g., NaN and infinities) must follow special cases as defined in add.

Questions

  • Should axis be fully optional, as in NumPy?
    • Update: during the consortium meeting on 13 July 2023, consensus was to use take convention, where NumPy's flattening behavior is not supported.
  • Bikeshed: alternative naming conventions for startpoint and endpoint? Or are we okay following specification precedent?
    • Update: during the consortium meeting on 13 July 2023, concerns were raised regarding the inclusion of endpoint, especially when endpoint=False can be achieved by slicing off the last element. Omission of the last element is a common use case (e.g., when computing offsets), but it's explicit support in the API signature did not seem warranted. Further, concerns were raised regarding startpoint/endpoint naming conventions in that they lacked obviousness. This PR has been updated to use include_initial with potential future alignment with future extension in which initial kwarg support is added to cumulative_sum.

@kgryte kgryte added the API extension Adds new functions or objects to the API. label Jul 13, 2023
@kgryte kgryte added this to the v2023 milestone Jul 13, 2023
Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kgryte!

My initial comments:

  • I look at cumulative_sum like a reduction, which should closely follow what sum and other reductions do for its main parameters.

  • The use cases about "prepending a value" are tricky because of the output becoming keyword-dependent. As I read it, that is what both startpoint and endpoint do, right?

    • It kinda breaks our policy regarding output shapes depending on keywords, so I think we need an explicit okay for this from JAX & co.
    • Most of the discussion was specifically about prepending zero, which is startpoint. I'm less convinced by the inclusion of endpoint.
  • initial is indeed useful to leave out here, because if we want to add it we should do that uniformly for all reductions.

startpoint: bool
boolean indicating whether to include the initial value as the first value in the output. Default: ``False``.
endpoint: bool
boolean indicating whether to include the total sum as the last value in the output. Default: ``True``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will note that initial= would be a useful extension to allow starting with something other than 0, especially together with startpoint=True. (It might even be necessary in case an identity isn't defined: such as for timedelta's were there startpoint has to be a datetime.)

Mainly mentioning it in case anyone is inspired about startpoint/endpoint naming, I don't mind adding such parameters (not sure how hard it is to add to numpy, but I don't expect it is hard).

@adityagoel4512
Copy link

Is there a reason this PR stalled? I think cumulative sums are quite pervasive and would be a good candidate for standardisation.

@rgommers
Copy link
Member

No reason beyond limited bandwidth. This function will definitely make it into the next release.

@kgryte
Copy link
Contributor Author

kgryte commented Sep 19, 2023

Update: I added guidance concerning what happens when include_initial is True and yet x is an empty array. My sense is that the returned array should also be empty, but others may have differing opinions. (related comment on NumPy mailing list: https://mail.python.org/archives/list/numpy-discussion@python.org/message/P5ZTUQ3L4Y2EKXVETYWWZLDOE2TGJBPJ/)

@kgryte kgryte added the Needs Review Pull request which needs review. label Sep 19, 2023
@kgryte
Copy link
Contributor Author

kgryte commented Oct 19, 2023

Update: based on feedback during Consortium discussions, I've updated the guidance concerning what should happen when include_initial is True and the array is empty.

The guidance now states that, no matter the axis size (>=0), the output array shape must follow the same rule: namely, the size of the axis along which to compute the cumulative sum must be N+1, with the first element equal to the initial value. Hence, if the input array x is empty and include_initial is True, then the size of the corresponding dimension in the output array must be 1.

This change ensures that the output array shape follows consistent rules and is not dependent on the input array shape.

@kgryte
Copy link
Contributor Author

kgryte commented Nov 6, 2023

@seberg Would you be willing to do another round of review? Would be good to get another set of eyes on this before we merge. Thanks!

@seberg
Copy link
Contributor

seberg commented Nov 30, 2023

FWIW, I am happy with include_initial and all the other things were general things shared with normal sum anyway.

@kgryte kgryte removed the Needs Review Pull request which needs review. label Dec 14, 2023
@kgryte
Copy link
Contributor Author

kgryte commented Dec 14, 2023

As this PR has been discussed previously and no objections were raised, will go ahead and merge. Any further updates or clarifications can be addressed in follow-ups PRs.

@kgryte kgryte merged commit 1fd3429 into data-apis:main Dec 14, 2023
@kgryte kgryte deleted the cumulative-sum branch December 14, 2023 05:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API extension Adds new functions or objects to the API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for computing the cumulative sum to the standard
5 participants