Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Py_HashDouble() function #2

Closed
4 tasks done
vstinner opened this issue Dec 5, 2023 · 55 comments
Closed
4 tasks done

Add Py_HashDouble() function #2

vstinner opened this issue Dec 5, 2023 · 55 comments

Comments

@vstinner
Copy link

vstinner commented Dec 5, 2023

  • PR: gh-111545: Add Py_HashDouble() function python/cpython#112449

  • API: int Py_HashDouble(double value, Py_hash_t *result)

    • Set *result to the hash value and return 1 on success.
    • Set *result to 0 and return 0 if the hash value cannot be calculated. For example, if value is not-a-number (NaN).
    • result must not be NULL
  • Stable ABI: compatible

  • Not added to the limited C API yet

UPDATE: I updated the API documation to address @zooba's suggestion (make the API more generic, be less specific about NaN case).

Voters:

Since I proposed this API and I'm part of the C API Working Group, I will not vote on this decision.

@zooba
Copy link

zooba commented Dec 5, 2023

I'm unclear why this needs to be a public API rather than going through a PyObject. What can you do with a primitive hash value for a primitive value?

Assuming that it does need a dedicated, public API, I prefer the return values as described in this post (and not the other variants that were tried), though I would prefer if they were described as "returns 1 if result is valid; otherwise 0" and says elsewhere that NaN cannot be hashed. It should also clearly say that it does not set a Python exception (and if it did, then I think the result should be negative).

@vstinner
Copy link
Author

vstinner commented Dec 5, 2023

I'm unclear why this needs to be a public API rather than going through a PyObject

The rationale is explained in the issue. numpy currently uses the _Py_HashDouble() function: numpy/numpy#25035

float_arrtype_hash() calls Npy_HashDouble() with (double)PyArrayScalar_VAL(obj, @name@) value:

/~https://github.com/numpy/numpy/blob/742920aef552b27806094f8ed6462d3cf1744785/numpy/_core/src/multiarray/scalartypes.c.src#L3629-L3633

I don't see how existing PyObject_Hash() API can be used for that.

Assuming that it does need a dedicated, public API, I prefer the return values as described in this post (and not the other variants that were tried), though I would prefer if they were described as "returns 1 if result is valid; otherwise 0" and says elsewhere that NaN cannot be hashed.

Isn't the proposed API?

Technically, the function returns 0 for not-a-number and it can be used as the hash value for some use cases. Hash conflict can lead to inefficient hash table but it's usually correct. If you want to mimick Python hash(float), you should call Py_HashPointer(obj) if the C double is not-a-number.

It should also clearly say that it does not set a Python exception (and if it did, then I think the result should be negative).

The function cannot fail: it cannot return 0 and it cannot set an exception. It can fix the doc to make it more explicit if needed.

@zooba
Copy link

zooba commented Dec 5, 2023

numpy currently uses the _Py_HashDouble() function

So the rationale is: "libraries that implement their own numeric type that needs to compare equal to the built-in float type may use this to implement their tp_hash function".

I don't see how existing PyObject_Hash() API can be used for that.

The object_arrtype_hash function a few lines below your link does it. If you need an identical hash to an existing Python type, then you convert the value to that type and then call PyObject_Hash.

Isn't the proposed API?

Yes, but the way you describe it isn't a good way to document it. The return value should reflect the validity of the result or post conditions, rather than the values of the arguments. That way we retain freedom to change the reasons it may fail without having to change the documentation (and follow a deprecation process).

It's the difference between:

  • "returns 0 when you pass NaN" (bad - too specific)
  • "returns 0 when it couldn't calculate a hash value" (good)

And in the latter case, we can also document that we can't calculate hash values for NaN. But that's disconnected from the definition of the result, so if it one day turns out we can't calculate a hash for some other value then we can start returning an error without having to create a new function. That may be unlikely in this particular case, but the general principle applies elsewhere, and the docs here should be consistent with the other places where it is more likely that over-specifying the function could get us into trouble.

Again, it's just grammar. But in a specification, which is how our documentation is used, the grammar matters.

I can fix the doc to make it more explicit if needed.

Yes. Fixing the doc is what I was asking for. Don't assume that users will read our source code.

@vstinner
Copy link
Author

vstinner commented Dec 5, 2023

"returns 0 when it couldn't calculate a hash value" (good)

In Python <= 3.9, hash(float("nan")) was just 0. It's not as if Python "cannot calculate" a hash value. Or that the hash value 0 will put your computer on fire. When proposed API return 0, it really means "not-a-number": as a hint that you might want to use a different hash value if you want.

In Python <= 3.9, all NaN float had a same hash value, even if they were different Python objects:

$ python3.9
Python 3.9.18 (main, Aug 28 2023, 00:00:00) 
>>> import math
>>> x=math.inf-math.inf
>>> y=math.inf-math.inf
>>> x is y
False
>>> hash(x), hash(y)
(0, 0)

So if these Python float objects are used as dict keys for example, hash collisions were more likely. The Python dict type is fine with hash collision, since NaN is not equal to NaN:

>>> x == y
False
>>> d={x: 1, y: 2}
>>> len(d)
2
>>> d[x], d[y]
(1, 2)
>>> d
{nan: 1, nan: 2}

Python 3.10 modified hash(float) by using object.__hash__(obj) for NaN floats to reduce hash collisions.

But maybe for some other usages, you might want to use something else.

There are 3 categories of float values:

  • finite: return (more or less) hash(int(value)) (in an optimized way) -- it can be equal to 0. IEEE 754 sub-normals values fall into this category as well.
  • infinite: return sys.hash_info.inf or -sys.hash_info.inf.
  • NaN: return 0.

Note: maybe we should have this discussion on python/cpython#112449 PR or python/cpython#111545 issue.

@vstinner
Copy link
Author

vstinner commented Dec 5, 2023

If proposed API is confusing, an alternative is Py_hash_t Py_HashDouble(double value, int *is_nan). It might be less efficient if we accept NULL for is_nan. But such API may be less confusing.

Or it can be just Py_hash_t Py_HashDouble(double value) and the caller would have to call isnan(value) explicitly. It's less efficient, since Py_HashDouble() already calls isnan(value).

@zooba
Copy link

zooba commented Dec 7, 2023

The proposed API isn't confusing, I just disagree with how you describe it.

We don't say "if <parameter> contains a specific value, the return value is 0". We say "if the result cannot be calculated, the return value is 0. One reason the result may not be calculated is if <parameter> contains a specific value."

Do you see the difference between those two statements?

@vstinner
Copy link
Author

vstinner commented Dec 7, 2023

The proposed API isn't confusing, I just disagree with how you describe it.

I thought that the vote was restricted on the API.

If you have concerns on the implementation (doc, tests, whatever), I just suggest to move the discussion on the PR instead. I'm sure that the doc can be enhanced :-)

@zooba
Copy link

zooba commented Dec 7, 2023

I have a concern with your API design.

That concern is that you designed: "when the input is NaN, returns 0".

I can tell this is your design because of the documentation.

I think your design should be: "when it fails to generate a hash, returns 0".

I'll be able to tell that you've fixed the design when the documentation is changed.

This is all about API design. The documentation you've written is how I can tell what you've designed.

Once it's released, we can't just change the documentation, because it will change the definition of the API in a subtly incompatible way. It doesn't affect the current implementation, but it could in the future, so I want to address it now so that we don't run into issues in the future.

Please don't treat this process as just a checkbox before you do whatever you want. We're trying to collaborate to produce a consistent, reliable, forward-compatible API that doesn't get us into trouble later on. If you think I'm just arbitrarily impeding you, feel free to make a complaint to the SC and get me removed from the WG - we have a process for this now.

@vstinner
Copy link
Author

vstinner commented Dec 8, 2023

Well, I was thinking aloud since I'm not used to the C API Working Group. In short, I proposed to update the PR to address your concern, and then come back to the decision. I didn't propose to make a decision and change the API afterwards.

I can tell this is your design because of the documentation.

It's kind to refer to this API as "my design", but it is not mine 😁. It was a collective work on issue python/cpython#111545. I only tried to lead the work to have a full change (implementation, doc, tests) addressing all concerns. I prefer to rely on other IEEE 754 experts as Mark Dickinshon who is the latest author who authored a major redesign of all hash() functions for "numbers (int, float, complex, Decimal) to have consistent hash values for all stdlib number types.

My concern is more that the PR was already approved twice, and now we are discussing further changes. It's a new situation to me, I don't know if the C API Working Group should take a decision, and then the PR is updated. Or the opposite: the WG suggests change, the PR is updated and maybe I wait to new approval, and then the WG makes a decision.


That being said, I applied your documentation suggestion in the description of this issue: #2 (comment) I'm fine with it, it's reasonable to make the API more generic, and less specific about "NaN" (only mention it as "one possible case" of "cannot calculate the hash value").

First, I expected "the API" to be int Py_HashDouble(double value, Py_hash_t *result) and nothing else. But I agree that the description of the parameters, output, error cases, etc. are part of the API. Well, that's also why I added it to this issue :-) So the API is the function definition and its documentation.

@zooba
Copy link

zooba commented Dec 8, 2023

With the clarified return value semantics, I'm happy with this design.

@vstinner
Copy link
Author

@gvanrossum, @iritkatriel, @encukou: so, what do you think of this API? Tell me if you need more details.

@iritkatriel
Copy link
Member

LGTM. I left some comments on documentation on the PR.

@gvanrossum
Copy link

I am not reviewing the PR (no time) but I am okay with the name and spec of the new API, given that there is a real need (e.g. Numpy).

@encukou
Copy link
Collaborator

encukou commented Dec 12, 2023

LGTM too; the new documentation is much better. Thank you!

@vstinner
Copy link
Author

We agreed on int Py_HashDouble(double value, Py_hash_t *result), but Serhiy Storchaka and Mark Dickinson prefer the following API: python/cpython#112449 (comment)

        Py_hash_t
        hash_double(PyObject *obj, double value)
        {
            if (Py_IS_NAN(value)) {
                return Py_HashPointer(obj);
            }
            else {
                return Py_HashDouble(value);
            }
        }

How should I handle this situation? Enforce C API Working Group decision? Reopen the discussion here? I respect Serhiy and Mark's opinion on IEEE 754 float and hashing since they are expert on this topic.

@encukou
Copy link
Collaborator

encukou commented Dec 14, 2023

Try to come to an agreement. If it doesn't happen, the working group should decide.
I'd be happy to be convinced by IEEE experts. But in this case I don't know what users would pass in as PyObject *obj, given that API is, AFAIK, for cases where you don't have a PyDouble object.

@zooba
Copy link

zooba commented Dec 14, 2023

Yeah, if you've already got a PyObject * for the value, then just use PyObject_Hash on it.

This API is supposedly for situations where you have a float/double that is not yet a PyObject, and you don't want to construct one until you know it's necessary (and you've implemented a native collection type that can do hash and equality comparisons without needing a real PyObject - in other words, this is already a very specialised API that most people will never use).

@vstinner
Copy link
Author

The Py_hash_t Py_HashDouble(double value) API is being discussed at: python/cpython#112449 (comment)

@vstinner
Copy link
Author

vstinner commented Dec 16, 2023

@gvanrossum @iritkatriel @zooba @encukou: I'm sorry, I would like to invite you to reconsider this decision! Mark Dickinson and Serhiy Storchaka would prefer Py_hash_t Py_HashDouble(double value) API:

Mark is the author of the unified hash implementation of all numbers in Python (int, float, complex, Decimal). He has a good background on mathematics and IEEE 754 floating point numbers. Serhiy also seems to have a good background in mathematics and floating points. They understand concepts such as rounding, NaN, ULP and other things better than me :-)

I propose to change the API to: Py_hash_t Py_HashDouble(double value)

  • If value is positive infinity, return sys.hash_info.inf.
  • If value is negative infinity, return -sys.hash_info.inf.
  • If value is not-a-number (NaN), return sys.hash_info.nan (0).
  • Otherwise, return the hash value of the finite value number.

This function cannot fail.

Second vote:

See PR #113115 which implements this API.

3 weeks ago, I ran a micro-benchmark on Py_HashDouble():

  • API int Py_HashDouble(double v, Py_hash_t *result): 13.2 ns +- 0.0 ns
  • API Py_hash_t Py_HashDouble(double v): 12.3 ns +- 0.0 ns

So this API is a little bit faster: 13.2 ns => 12.3 ns (-0.9 ns).

@gvanrossum
Copy link

Okay on the simpler API.

@encukou
Copy link
Collaborator

encukou commented Dec 18, 2023

-1 for that simpler API. IMO, it's too easy to misuse if you're not an expert floats. See my comment on the issue.

@gvanrossum
Copy link

gvanrossum commented Dec 18, 2023 via email

@gvanrossum
Copy link

Random thought: Maybe hash() of a NaN should hash based on the bits of the NaN rather than on the id() of the object? Then Py_HashDouble(double) could do the right thing for NaNs. Or is __eq__ for NaN also tied to the object id()?

@zooba
Copy link

zooba commented Dec 20, 2023

I think someone had a scenario where they wanted to keep a lot of "distinct" NaNs as keys in a dict without facing all the hash collisions.

As a scenario, it seems a bit of a stretch, but it shouldn't really matter for this case. All that id() is used for is randomisation, which means we could simply say that Py_HashDouble(NaN) returns some fixed/specific value (as currently proposed) and the caller will need to randomise it if they care about hash collisions. __eq__ for NaN should always return false, even for itself1 and so the actual hash value is irrelevant, provided it is unlikely to be the same for two "distinct" NaNs.

Footnotes

  1. Though I wouldn't be surprised if we did an is check first for performance, to save even looking up __eq__.

@gvanrossum
Copy link

So I think the desire here is to have an API that works given just the double, even if that double contains a NaN. The code contains this comment on the current strategy for NaN (by Raymond):

   NaNs hash with a pointer hash.  Having distinct hash values prevents
   catastrophic pileups from distinct NaN instances which used to always
   have the same hash value but would compare unequal.

For good measure, here's the PR: python/cpython#25493, and the issue: python/cpython#87641 (bpo: https://bugs.python.org/issue43475). The issue discussion goes on forever but only seems to consider the choice between a constant hash value for all NaNs and a a hash based on hash(id(x)) (it also considers fixing NaN equality, but not seriously). Basing the hash on all bits of the value doesn't seem to be considered.

Anyway, I like the idea of returning some fixed value if the input is a NaN, and letting the caller sort out using the object id hash, if they care. (I imagine that this function is used in numpy to hash float/double values that don't have a corresponding object yet, e.g. when calculating the hash of an array? In that case requiring an object to be passed in isn't really helpful.)

@ngoldbaum
Copy link

ngoldbaum commented Dec 21, 2023

I imagine that this function is used in numpy to hash float/double values that don't have a corresponding object yet, e.g. when calculating the hash of an array? In that case requiring an object to be passed in isn't really helpful.

Sorry to chime in from the peanut gallery. I'm a numpy developer and happened to come across this thread this evening.

Numpy has a concept of scalar types that are distinct from a zero-dimensional array. One big difference from arrays, which are mutable, is that scalars are immutable and hashable, and wrap a single C data value. Numpy currently uses _Py_HashDouble to implement tp_hash for the float and complex scalar types.

@gvanrossum
Copy link

Could you link to the code? I’m curious what object is passed in (especially in the complex case).

@mdickinson
Copy link

mdickinson commented Dec 21, 2023

@gvanrossum

Basing the hash on all bits of the value doesn't seem to be considered.

It doesn't solve the problem in python/cpython#87641: in the problematic situation we have a large number of NaN objects, all of which have the same NaN bit pattern (the bit pattern of np.nan) but different ids. So a hash based on the NaN bit pattern would still end up generating large numbers of hash table collisions.

There's a mass of compromises here, and no really good solution. I think the "right" solution is to regard all NaNs as equal for the purposes of dict/set containment (or perhaps as a refinement to regard NaNs with the same bit pattern as equal), so that a dict or set can only have a single NaN in it. But even leaving backwards compatibility concerns aside, I don't see how we special-case NaNs in the general-purpose dict/set/hash machinery without introducing new special methods and some sort of third "for use in containment checks" equality relation that's distinct from both == and is. (I suppose we could actually special-case the float type, but where does that leave things like Decimal?) Any general-purpose solution seems too heavyweight to solve a problem that's really only caused by one value in one type.

@gvanrossum
Copy link

So @mdickinson do you agree that we should just promote something with the same (object*, double) signature?

@mdickinson
Copy link

do you agree that we should just promote something with the same (object*, double) signature?

Sure, that seems fine to me. I'd also be perfectly happy with the latest version of @vstinner's proposal (as in this comment); for me, the dangers of returning 0 for NaNs aren't big enough to disqualify that proposal.

@gvanrossum
Copy link

It’s just more effort to adapt existing code that needs compatibility with Python numeric values. And that’s why this API exists.

@encukou
Copy link
Collaborator

encukou commented Jan 5, 2024

That gets me thinking that _Py_HashDouble was... fine as it was. It exists to provide what CPython does, the way CPython does it; and if its signature or implementation ever changes, all of its callers should at least look at their calls to see if they're still valid.

Why not give it the modern name, PyUnstable_HashDouble, and let PyFloat_Type->tp_hash be the API for general use?
(And if we do that, then per PEP 689 the underscored name should stay as an alias – so NumPy doesn't need any code change.)

The downside is that this doesn't help the API graduate into the stable ABI – but the conversations I'd like to have for that would look quite far off the mark here. (mostly: What should this do on implementations without IEEE floats?)

@vstinner
Copy link
Author

vstinner commented Jan 9, 2024

let PyFloat_Type->tp_hash be the API for general use?

PyFloat_Type->tp_hash() takes a Python float object, whereas the proposed API takes a C double. The proposed API is to implement a compatible hash function to float-like types which don't inherit from Python float type.

@vstinner
Copy link
Author

vstinner commented Jan 9, 2024

(mostly: What should this do on implementations without IEEE floats?)

Building Python requires IEEE 754 (with not-a-number) since Python 3.11: https://docs.python.org/dev/using/configure.html

@vstinner
Copy link
Author

vstinner commented Jan 9, 2024

I propose to change the API to: Py_hash_t Py_HashDouble(double value)

Votes: 4 people are in favor of this API (me included), Petr is against it.

Then another (object*, double) API was discussed. Guido: I'm now confused if you are ok with Py_hash_t Py_HashDouble(double value) or not.

Also, PEP 731 doesn't mention specific guidelines for votes. Should I have to reach 100% majority (no "veto" vote)? Or is 4 "+1" and 1 "-1" means that the "group" accepts the proposed API?

@gvanrossum
Copy link

I think the most helpful thing for current users is to pass both a double and an object, so I am now in favor of that. I think I explained my reasoning above.

@encukou
Copy link
Collaborator

encukou commented Jan 10, 2024

I guess I don't fully understand the (double, object) proposal, in the sense that I couldn't write good docs for it: what is that object?
AFAIK it's only been explained in terms of the implementation: if the double is not hashable, Py_HashDouble will return a hash of the object.
But what should you pass in? It seems it's meant to be some “container” that holds the float value (like a Python float, or a NumPy scalar/matrix). Is that right?
That would mean float values can only be hashed if they're in a PyObject* “container”. Is that a reasonable limitation of Python's float hashing algorithm?


Also, PEP 731 doesn't mention specific guidelines for votes. Should I have to reach 100% majority (no "veto" vote)? Or is 4 "+1" and 1 "-1" means that the "group" accepts the proposed API?

Either way, the original proposal here would win. I think our first draft of a formal voting process isn't really working in this issue.


Building Python requires IEEE 754 (with not-a-number) since Python 3.11

Yup, but that's CPython. I would like stable ABI to be usable across implementations (particularly because far-future versions of CPython essentially are alternate implementations). But as I said, that's a whole other conversation.

@vstinner
Copy link
Author

@gvanrossum:

I think the most helpful thing for current users is to pass both a double and an object, so I am now in favor of that.

Hum, it seems like we are going into circles since that was my first proposed API, Petr and Mark Dickinson didn't like it so I proposed other APIs.

So far, I proposed 3 APIs as 3 PR.

API A (double+obj)

Py_hash_t PyHash_Double(double value, PyObject *obj) => python/cpython#112095

  • PyHash_Double(value, obj) returns _Py_HashPointer(obj) if value is not-a-number.
  • PyHash_Double(value, NULL) is valid and returns _PyHASH_NAN (0) if value is not-a-number

@mdickinson (comment) and @encukou (comment) didn't like the API. @encukou asked for:

int PyHash_Double(double value, Py_hash_t *result)
// -> 1 for non-NaN (*result is set to the hash)
// -> 0 for NaN (*result is set to 0)

@serhiy-storchaka liked (comment) this API.

Later, Guido proposed again such API and so is in favor of it. Mark wrote "Sure, that seems fine to me".

Updates

There were discussions about functions which "cannot fail": capi-workgroup/api-evolution#43

There were also discussions on the function name: PyHash_Double() vs Py_HashDouble().

Benchmarks were also run on the different APIs.

I added Py_hash_t Py_HashPointer(const void *ptr) and I wrote tests on the existing PyHash_GetFuncDef() API.

API B (return int)

int Py_HashDouble(double value, Py_hash_t *result) => python/cpython#112449

  • Return 0 if value is not-a-number (and set *result to 0).
  • Set *result to the hash value and return 1 otherwise.

@mdickinson and @encukou approved the PR. So I created this issue for the C API Working Group and shortly the Working Group approved this API as well.

@serhiy-storchaka suggested a different API: Py_hash_t hash_double(PyObject *obj, double value) (more or less API 1).

API C (double => Py_hash_t)

Some people were not fully satistified with API B, so I proposed a new one!

Py_hash_t Py_HashDouble(double value) => python/cpython#113115

@serhiy-storchaka approved it. @mdickinson prefers this API.

@encukou dislikes this API: he sees it as being too error-prone.

Ranking

Let me try to summarize people preferences on these API by raking them.

APIs:

  • API A: Py_hash_t PyHash_Double(double value, PyObject *obj)
  • API B: int Py_HashDouble(double value, Py_hash_t *result)
  • API C: Py_hash_t Py_HashDouble(double value)

Ranking:

Attempt to list supporters per API (+2 for first preferred API, +1 for second preferred, no vote for 3rd preferred):

I'm not sure that I collected correctly preferences of each person involved in the discussions.

It seems like Irit and Steve only looked at API B and then API C proposed here, they didn't express their opinion about API A.

Variants

For API A, I proposed to allow passing NULL for the object, which makes the API a "superset" of API C.

Petr considers that API C is error-prone since developers don't read the documentation and so can be affected by high hash collision if there are many not-a-number stored in a hash table / dict. Others people don't see it as a blocker issue.

For API C, example of recipe for numpy:

Py_hash_t hash_double(double value, PyObject *obj)
{
    if (Py_IS_NAN(value)) {
        return Py_HashPointer(obj);
    }
    else {
        return Py_HashDouble(value);
    }
}

@gvanrossum
Copy link

It seems I hadn't actually explained my reason for changing my preference back to API A properly.

The evidence is in this comment from a numpy maintainer: #2 (comment)

This shows how the object is used (the object passed in may be a larger object whose hash is being computed, as in the example of the hash of a numpy complex number).

My point is that only API A makes it simple for numpy and others like it to switch from using an internal API to the new public API using a simply #define. All other proposals would require them to maintain two separate variants of their hashing code, one for 3.13 and one for earlier versions.

@zooba
Copy link

zooba commented Jan 10, 2024

If API A is clearly said to be "for implementing tp_hash on a float-like type", that the object parameter should be the instance, and that the object parameter is allowed to be NULL (in which case the result when it was necessary is either 0 or an identifiable error code distinct from any other failure reason), I'm okay with that one.

I'm still only barely convinced that this is worth adding a public API, even an unstable one. It's still going to be totally valid to construct an actual float from the value and get its hash, so we have a slow+compatible option already. If the perf is important, copying the function in its entirety and keeping it updated in new versions should be worth the effort. (Though I admit this is mostly about setting a boundary - we do not yet have a problem of making public C API for every single micro-function just for someone who wants both speed and compatibility, but I worry that may become more of a problem.)

@gvanrossum
Copy link

If the perf is important, copying the function in its entirety and keeping it updated in new versions should be worth the effort.

This doesn't sound right to me. Perf is definitely important for numpy when doing this for an array (which they do; complex was just the simplest example). The numpy team shouldn't be required to watch the code in that particular file for changes. The API is for sure more stable than the implementation, so they can't get a compiler warning for this -- they'd have to implement a test that dynamically compares the value of the hashes they compute to the hashes computed by the CPython runtime for a wide variety of floats, and that seems just the wrong way about it when we can just export the proper API. Also note that we effectively already export the proper API (albeit as a private function) -- this problem is almost entirely of our own making since we decided we want to make it hard for 3rd parties to use private functions.

@serhiy-storchaka
Copy link

How to compute the hash of multicomponent value? You call Py_HashDouble() for every component and combine results. If several components are NaNs, API A will give you the same hash for them. Combining a hash with itself can give more predicable result if you use something simple (like add or xor). You also repeatedly calculate the hash of the pointer.

API B and API C allow to stop when you find a NaN and return Py_HashPointer(obj). With API B you check the result of the call, with API C you check for NaN before the call.

@ngoldbaum
Copy link

Perf is definitely important for numpy when doing this for an array (which they do; complex was just the simplest example).

For what it's worth, for numpy and this API specifically, we would only be calling this function if someone needs the hash of a float or complex scalar, e.g. if they make one a dict key. Arrays don't have hashes because they are mutable.

@gvanrossum
Copy link

Ah, together those are a compelling argument not to care so much about either perf or convenience for numpy. So then I'm okay with all three APIs.

@encukou
Copy link
Collaborator

encukou commented Jan 11, 2024

a compelling argument not to care so much about either perf or convenience for numpy

Indeed. If either of those was important, my preferences would be very different :)

For a summary, rather than focus on the history of the proposals, I'd rather focus on the proposals themselves -- specifically, the remaining reasons people have to not prefer the API.
It's hard to judge, as people's understanding and opinions evolve. Maybe it's something like:

  • API A (double, PyObject *obj) -> Py_hash_t
    • assumes a container -- a loss in generality.
  • API B (double, Py_hash_t *result) -> int
    • is slower.
  • API C (double) -> Py_hash_t
    • hides a potentially dangerous edge case.

Anything missing?

If API B's slowness is an issue, I'll be happy to change my preferences (after trying a faster implementation).

@mdickinson
Copy link

@encukou

  • assumes a container -- a loss in generality

I think there may be a misunderstanding here. In the NumPy use-case, there's a scalar value (e.g., of type np.float32) and we want to compute a hash for it in a way that's consistent with equality (e.g., equality with Python floats). That scalar value is a Python object, containing a raw C float. For API A, NumPy would pass that float (cast to double) as the first argument, and the wrapper object as the second argument. I don't think there's any container here (except in the same sense that a Python float can be regarded as a container for a single C double).

@mdickinson
Copy link

Apologies; after rereading, the misunderstanding is mine.

except in the same sense that a Python float can be regarded as a container for a single C double

... and that's exactly the sense of container that you meant. Mea culpa.

@vstinner
Copy link
Author

IMO the 3 discussed APIs are fine, there are no huge difference between them. Well, some people have opinions about one specific aspect of these APIs: report error, performance, etc.

Now, how can we take a decision to move on?

@gvanrossum
Copy link

Now, how can we take a decision to move on?

My first choice is to keep the existing private API and close this issue. My second choice is to rename the existing API to remove the leading underscore without semantic changes (i.e., API A).

@vstinner
Copy link
Author

Multiple APIs were discussed in the same issue and it's uneasy for me to follow which API is being referred to, so I created a new issue to add a new Py_hash_t Py_HashDouble(double, PyObject *obj) function: #10

@pitrou
Copy link

pitrou commented Feb 6, 2024

Similarly to this, it would be nice to expose a public Py_hash_t Py_HashBytes(const void*, Py_ssize_t) function. This is required to make buffer-like objects hashable in a way that's consistent with Python's bytes and memoryview objects.

@vstinner
Copy link
Author

vstinner commented Feb 6, 2024

Similarly to this, it would be nice to expose a public Py_hash_t Py_HashBytes(const void*, Py_ssize_t) function.

I suggest to open a new issue for this.

@vstinner
Copy link
Author

Apparently, the C API Working Group failed to take a decision on this API.

On January 31st, @gvanrossum wrote on the companion issue #10:

I'm taking an executive decision and closing this issue for lack of progress.

So I also close this issue. numpy will have to continue using the private _Py_HashDouble() function for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants