functools.cache should support non-hashable function arguments via a key arg. #101016
To my eyes, the example looks like an anti-pattern. Casting an ndarray to a tuple is a slow and memory-intensive operation. It loops over the entire array and builds a Python object (likely a float) from every array element. Next, the cache logic has to loop over the entire tuple, performing a pure-Python equality test for each element to verify a cache hit. Once the cached value is returned, all of that work is thrown away and has to happen again on the next call.
If it was a tuple to begin with, you'd still have to iterate the whole thing to generate the hash key used to store it in the dictionary. It's less efficient because you need to make a copy of the numpy array, but you could do better with something like:

```python
@functools.cache(key=lambda arr, value: (hash(arr.data.tobytes()), value))
def expensive_func(arr: np.ndarray, value: int) -> np.ndarray:
    ...
```

Or, depending on your exact use case, maybe `@functools.cache(key=lambda arr, value: (arr.shape, value))` is sufficient, or in some extreme cases even only using a subset of the arguments. The point is more that this is something that would be useful to expose to the user, and it's up to them to use it in a sensible way. I used a numpy array because it's a well-known object, but this would generalize to any type. Maybe the documentation should make it clear that this can be slow or lead to hash collisions if the key value isn't sensible, but that's also already true given the …
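The `key=` parameter proposed above does not exist in today's `functools`. As a rough sketch of the requested behavior, it could be emulated with a small wrapper decorator; `cache_with_key` is a hypothetical name, and the plain-dict cache here is not thread-safe:

```python
import functools

def cache_with_key(key):
    """Hypothetical stand-in for the proposed functools.cache(key=...):
    cache on key(*args, **kwargs) instead of the raw arguments."""
    def decorator(func):
        cached = {}
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            k = key(*args, **kwargs)
            if k not in cached:
                cached[k] = func(*args, **kwargs)
            return cached[k]
        return wrapper
    return decorator

calls = 0

@cache_with_key(key=lambda data, value: (tuple(data), value))
def expensive_func(data, value):   # data may be an unhashable list
    global calls
    calls += 1
    return [x * value for x in data]

print(expensive_func([1, 2, 3], 10))  # computed
print(expensive_func([1, 2, 3], 10))  # cache hit, no recomputation
```

Unlike `lru_cache`, this sketch has no locking and no size bound, which is part of why exposing such a hook safely is more work than it appears.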
Do you have a better way of caching the result of an expensive computation where the arguments are a numpy array, in a thread-safe way, efficiently, without much boilerplate? Even better if it doesn't involve too much change to the caller or callee.
Framing a problem that way pushes you away from the common, cheap solution. Generally, it is an anti-pattern to use a large dataset as a dictionary key to look up a computation result. The norm I've seen over the years is for data to be labeled or tagged in some way, which lends itself naturally to caching.

If forced into the awkward position of using a dataset as a lookup key, I would make the operation explicit with three lines of simple Python: a dict.get, followed by a None test, and conditionally followed by a function call. Note it is much more memory efficient (and likely faster) to replace …

That said, I would seek to avoid using a mutable dataset as a key. We don't usually see people doing this with dicts, sets, or database keys, so why would they do it with an lru_cache?
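The explicit three-line pattern described above (a dict.get, a None test, then a conditional call) might look like the following sketch; `results`, `analyze`, and the tag-based keying are assumed names standing in for labeled data:

```python
results = {}   # explicit cache, keyed by a cheap label rather than the dataset

def analyze(tag, dataset):
    """Expensive computation, cached explicitly under a cheap key (the tag)."""
    value = results.get(tag)          # 1. dict.get
    if value is None:                 # 2. None test
        value = sum(dataset)          # 3. conditional call (stand-in computation)
        results[tag] = value
    return value

print(analyze("run-1", [1, 2, 3]))   # computes and stores under "run-1"
print(analyze("run-1", [1, 2, 3]))   # cheap dict hit on the tag
```

Because the tag, not the dataset, is the key, the cache never needs to hash or compare the dataset at all, which is the point of labeling data up front.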
Given that this is in the core library, and even if accepted it wouldn't be usable until the next release makes it to my production environment (both of which are out of my control), I already implemented it as a dictionary lookup. My first attempt was with …

It's also worth noting that numpy arrays aren't necessarily big. We often use them for trivial things like single (x, y) point coordinates. In my case, turning the numpy array into a tuple and hashing it is ~10,000 times faster than the computation done on it.
This is exactly what I'm asking to be able to do without having to write my own caching/dict, and without having to change the arguments to the function. If this is accepted and there is an example in the documentation, then something like …
When it comes to the standard library, design decisions are made for the world of future users, not for one specific circumstance of an already-solved problem. The proposal is to open the door for all users to feed non-hashable arguments to a cached function. Do we need to open a door that most people shouldn't go through? Also keep in mind that a lot of people find …

If we open this door, I suspect it will cause many more problems than it solves. It would take a good deal of thought and care to use this in a correct and performant way.
I can totally see this being useful for a …
I appreciate the suggestion and fully appreciate your use case, but I'm going to decline. Mostly, I would like to keep the API simple, avoid feature creep, and not try to be all things to all people. Also, I hesitate to go against the long-standing Python norm that mappings don't provide automatic support for mutable keys — this has kept the average user out of trouble and made experienced users be explicit about the conversion.

ISTM that your use case is atypical. The inputs are short ((x, y) coordinates), so the cost of a tuple conversion is small and the additional memory use can be ignored. The wrapped function is 10,000x slower than conversion and hashing. Presumably, the cache hit rate is high and the added cost of a cache miss can be ignored.

For other users, assessing whether to use a cache key-function is more challenging and treacherous. For example, … While the key function in your single use case is simple, it would be much harder for nested data, JSON for example. And your arity-two example departs from the existing norm that all key functions are of arity one.

It sounds like your particular problem was already solved, and solved easily. The only thing missing was the aspiration to force this into the standard library.

Side note: It might be worthwhile to submit a NumPy feature request for opt-in hashability. Already, you can set the …
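The opt-in hashability idea (hash only data that has been frozen) can be sketched without NumPy using the stdlib `array` module as a stand-in for an ndarray buffer. `Frozen` is a hypothetical class invented for this sketch, not a NumPy API:

```python
import array

class Frozen:
    """Hypothetical opt-in hashable view over data treated as read-only.
    Stands in for the idea of hashing a read-only ndarray by content."""
    def __init__(self, data):
        self._data = array.array('d', data)   # contiguous buffer of doubles

    def __hash__(self):
        return hash(self._data.tobytes())     # content-based hash of the buffer

    def __eq__(self, other):
        return isinstance(other, Frozen) and self._data == other._data

a = Frozen([1.0, 2.0])
b = Frozen([1.0, 2.0])
print(hash(a) == hash(b), a == b)   # equal content: equal hash, compares equal
```

The same shape of idea applied to NumPy would hash `arr.tobytes()` only after the array is marked read-only, so the "mutable key" hazard is opted out of explicitly rather than silently ignored.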
I forgot to mention the double or triple maintenance (D.R.Y.) issue that arose while experimenting with this proposal. For a key function to be robust, it would need to faithfully mirror the underlying function signature in every detail. That relationship would have to be maintained over time, updating all three places to keep them in sync. It can be viewed as a triple maintenance problem because the parameters have to be stated three times: once in the main function, again in the key function's lambda parameters, and again in the lambda function body.

Also, signatures with optional arguments would be challenging to model in a key function. For example, Python's … ISTM that the proposal falls apart as the function signatures grow more complex.
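The triple statement of the parameter list can be made concrete with a small sketch (all names here are hypothetical): any change to the function's signature must be mirrored in two more places inside the key lambda, and the defaults must match exactly or identical calls produce different cache keys.

```python
# The same parameter list appears three times:
#   1) in the function definition,
#   2) in the key lambda's parameter list,
#   3) in the key lambda's body.
key = lambda data, scale=1, offset=0: (tuple(data), scale, offset)   # (2) and (3)

def compute(data, scale=1, offset=0):                                # (1)
    return [x * scale + offset for x in data]

# Optional arguments are where it bites: key([1, 2]) and
# key([1, 2], 1, 0) only collapse to the same cache entry because the
# lambda's defaults were kept identical to compute's defaults.
print(key([1, 2]) == key([1, 2], 1, 0))
```

If `compute` later gained a keyword-only argument or a changed default, the lambda would silently start producing wrong or fragmented cache keys until it was updated in lockstep.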
Marking read-only numpy arrays as hashable sounds like a good feature request. That does force a change for the caller though.

That does bring up the point that something along these lines, where you're storing the (long enough) hash as the key instead of the value itself, could be more efficient and support normal objects like lists without the problem of mutability. Unclear if that is a good thing or not, but it would remove a caveat to the current cache function.

An alternative approach would be to allow replacing `_make_key`. The DRY issue wouldn't be such a big deal if it could delegate to the normal `_make_key`:

```python
def np_to_hash(v):
    return hash(v.tobytes()) if isinstance(v, np.ndarray) else v

def np_key(args, kwargs, typed):
    return functools.make_key(
        tuple(map(np_to_hash, args)),
        {k: np_to_hash(v) for k, v in kwargs.items()},
        typed)

@cache(make_key=np_key)
def slow_func(...)
```

At least this is pretty general and could be reused, but it's getting pretty verbose, so it's of unclear value. I'd certainly prefer something like:

```python
class NumpyCache(functools.cache):
    def _make_key(self, args, kwargs, typed):
        return super()._make_key(...)
```

But that would require a pretty big change to the existing cache function to be possible, and is also of unclear value. Thanks for exploring this idea with me.
@tewalds You're welcome. Please keep up the good work at DeepMind. I'm a big fan and read all of the papers coming out of that research.
Say you have a function that takes an np.ndarray argument and is expensive but called with the same values repeatedly. Ideally you'd wrap it in the functools.cache (or lru_cache) decorator and call it done, but np.ndarray is not hashable, so instead you get an exception. Currently you'd have to create your own cache dict, or maybe wrap your function in such a way that the args are hashable, which adds a lot of boilerplate.
Instead I'd like to see something like: …
Looking at the implementation of lru_cache, there's already a _make_key function: https://github.com/python/cpython/blob/3.11/Lib/functools.py#L448 . I'm mainly proposing letting you supply that function yourself. The default would remain as-is, requiring hashable args.
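For reference, the helper already behaves like a default key function. A quick look at what it produces; note this relies on `functools._make_key`, a private CPython detail that is not a public API and may change:

```python
import functools

# _make_key flattens (args, kwargs, typed) into a single hashable key.
k1 = functools._make_key((1, 2), {}, False)
k2 = functools._make_key((1, 2), {}, False)
print(hash(k1) == hash(k2))   # equal call signatures hash identically

# A single argument of a "fast" type (int, str) is returned as-is:
print(functools._make_key((5,), {}, False))   # 5
```

The proposal in this issue amounts to making that hook user-replaceable, with the current behavior as the default.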