Add specification for computing the cumulative sum #653

kgryte · 2023-07-13T06:48:03Z

This PR

resolves Add support for computing the cumulative sum to the standard #597 by adding a specification for the cumulative sum.
avoids naming concerns by using the more verbose name cumulative_sum. This follows recent convention in the dataframe API standard (ref: Add some cumulative Column operations dataframe-api#193).
follows dtype conventions established in the specifications for sum, whereby, if no dtype is specified, the data type of the output array must have a range of values at least as large as the default data type of the same kind.
follows axis conventions whereby, for one-dimensional arrays, providing an axis is optional; however, if the input array has more than one dimension, providing an axis is required. This deviates from NumPy where not providing an axis computes a cumulative sum over the flattened array and follows PyTorch where providing an axis is required. The solution here attempts to thread the needle between the two, siding with convention established in take where convenience is given greater weight for one-dimensional arrays.
includes an include_initial keyword argument to control whether the initial value (by convention, the additive identity) is included in the output. The include_initial naming convention takes inspiration from NumPy#6044 which proposed include_identity, but this PR uses _initial to align with a future extension in which an initial kwarg specifies a custom initial value. By default, include_initial=False, which matches the default behavior across the ecosystem. As of this PR, only TensorFlow supports excluding/including the starting summation value.
handling of special values (e.g., NaN and infinities) must follow special cases as defined in add.

Questions

Should axis be fully optional, as in NumPy?
- Update: during the consortium meeting on 13 July 2023, consensus was to use take convention, where NumPy's flattening behavior is not supported.
Bikeshed: alternative naming conventions for startpoint and endpoint? Or are we okay following specification precedent?
- Update: during the consortium meeting on 13 July 2023, concerns were raised regarding the inclusion of endpoint, especially when endpoint=False can be achieved by slicing off the last element. Omission of the last element is a common use case (e.g., when computing offsets), but it's explicit support in the API signature did not seem warranted. Further, concerns were raised regarding startpoint/endpoint naming conventions in that they lacked obviousness. This PR has been updated to use include_initial with potential future alignment with future extension in which initial kwarg support is added to cumulative_sum.

rgommers

Thanks @kgryte!

My initial comments:

I look at cumulative_sum like a reduction, which should closely follow what sum and other reductions do for its main parameters.
The use cases about "prepending a value" are tricky because of the output becoming keyword-dependent. As I read it, that is what both startpoint and endpoint do, right?
- It kinda breaks our policy regarding output shapes depending on keywords, so I think we need an explicit okay for this from JAX & co.
- Most of the discussion was specifically about prepending zero, which is startpoint. I'm less convinced by the inclusion of endpoint.
initial is indeed useful to leave out here, because if we want to add it we should do that uniformly for all reductions.

src/array_api_stubs/_draft/statistical_functions.py

seberg · 2023-07-13T18:02:26Z

src/array_api_stubs/_draft/statistical_functions.py

+    startpoint: bool
+        boolean indicating whether to include the initial value as the first value in the output. Default: ``False``.
+    endpoint: bool
+        boolean indicating whether to include the total sum as the last value in the output. Default: ``True``.


I will note that initial= would be a useful extension to allow starting with something other than 0, especially together with startpoint=True. (It might even be necessary in case an identity isn't defined: such as for timedelta's were there startpoint has to be a datetime.)

Mainly mentioning it in case anyone is inspired about startpoint/endpoint naming, I don't mind adding such parameters (not sure how hard it is to add to numpy, but I don't expect it is hard).

src/array_api_stubs/_draft/statistical_functions.py

adityagoel4512 · 2023-09-14T19:02:12Z

Is there a reason this PR stalled? I think cumulative sums are quite pervasive and would be a good candidate for standardisation.

rgommers · 2023-09-15T10:56:32Z

No reason beyond limited bandwidth. This function will definitely make it into the next release.

kgryte · 2023-09-19T21:25:02Z

Update: I added guidance concerning what happens when include_initial is True and yet x is an empty array. My sense is that the returned array should also be empty, but others may have differing opinions. (related comment on NumPy mailing list: https://mail.python.org/archives/list/numpy-discussion@python.org/message/P5ZTUQ3L4Y2EKXVETYWWZLDOE2TGJBPJ/)

kgryte · 2023-10-19T01:23:10Z

Update: based on feedback during Consortium discussions, I've updated the guidance concerning what should happen when include_initial is True and the array is empty.

The guidance now states that, no matter the axis size (>=0), the output array shape must follow the same rule: namely, the size of the axis along which to compute the cumulative sum must be N+1, with the first element equal to the initial value. Hence, if the input array x is empty and include_initial is True, then the size of the corresponding dimension in the output array must be 1.

This change ensures that the output array shape follows consistent rules and is not dependent on the input array shape.

kgryte · 2023-11-06T04:12:25Z

@seberg Would you be willing to do another round of review? Would be good to get another set of eyes on this before we merge. Thanks!

seberg · 2023-11-30T18:07:26Z

FWIW, I am happy with include_initial and all the other things were general things shared with normal sum anyway.

kgryte · 2023-12-14T04:59:56Z

As this PR has been discussed previously and no objections were raised, will go ahead and merge. Any further updates or clarifications can be addressed in follow-ups PRs.

Add specification for computing the cumulative sum

ae4bc1c

kgryte added the API extension Adds new functions or objects to the API. label Jul 13, 2023

kgryte added this to the v2023 milestone Jul 13, 2023

rgommers reviewed Jul 13, 2023

View reviewed changes

src/array_api_stubs/_draft/statistical_functions.py Show resolved Hide resolved

seberg reviewed Jul 13, 2023

View reviewed changes

tomwhite mentioned this pull request Jul 25, 2023

Add scan / prefix sum primitive cubed-dev/cubed#277

Open

Rename kwargs and remove endpoint

2e0b224

kgryte added 2 commits September 19, 2023 14:12

Merge branch 'main' into cumulative-sum

948a649

Add guidance for empty arrays

8083e5b

kgryte added the Needs Review Pull request which needs review. label Sep 19, 2023

kgryte added 2 commits September 21, 2023 01:40

Fix formatting

3228d4c

Ensure consistent rules for when include_initial is True

756fc92

kgryte mentioned this pull request Nov 6, 2023

Tracking issue for the 2023 revision of the array API specification #643

Closed

17 tasks

kgryte removed the Needs Review Pull request which needs review. label Dec 14, 2023

kgryte merged commit 1fd3429 into data-apis:main Dec 14, 2023

kgryte deleted the cumulative-sum branch December 14, 2023 05:01

asmeurer mentioned this pull request Apr 22, 2024

RFC: disallow 0-D inputs in cumulative_sum #797

Closed

ogrisel mentioned this pull request Oct 8, 2024

Add array API support for _weighted_percentile scikit-learn/scikit-learn#29431

Draft

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add specification for computing the cumulative sum #653

Add specification for computing the cumulative sum #653

kgryte commented Jul 13, 2023 •

edited

Loading

rgommers left a comment

seberg Jul 13, 2023

adityagoel4512 commented Sep 14, 2023

rgommers commented Sep 15, 2023

kgryte commented Sep 19, 2023

kgryte commented Oct 19, 2023 •

edited

Loading

kgryte commented Nov 6, 2023

seberg commented Nov 30, 2023

kgryte commented Dec 14, 2023

Add specification for computing the cumulative sum #653

Add specification for computing the cumulative sum #653

Conversation

kgryte commented Jul 13, 2023 • edited Loading

Questions

rgommers left a comment

Choose a reason for hiding this comment

seberg Jul 13, 2023

Choose a reason for hiding this comment

adityagoel4512 commented Sep 14, 2023

rgommers commented Sep 15, 2023

kgryte commented Sep 19, 2023

kgryte commented Oct 19, 2023 • edited Loading

kgryte commented Nov 6, 2023

seberg commented Nov 30, 2023

kgryte commented Dec 14, 2023

kgryte commented Jul 13, 2023 •

edited

Loading

kgryte commented Oct 19, 2023 •

edited

Loading