
functools.cache should support non-hashable function arguments via a key arg. #101016

Closed
tewalds opened this issue Jan 13, 2023 · 11 comments
Labels: type-feature A feature request or enhancement

@tewalds

tewalds commented Jan 13, 2023

Say you have a function:

def expensive_func(arr: np.ndarray, value: int) -> np.ndarray:
  ...

that is expensive but called with the same values repeatedly. Ideally you'd wrap it in the functools.cache (or lru_cache) decorator and call it done, but np.ndarray is not hashable, so instead you get a TypeError.

Currently you'd have to create your own cache dict:

_CACHE = {}
def expensive_func(arr: np.ndarray, value: int) -> np.ndarray:
  key = (tuple(arr), value)
  if key not in _CACHE:
    _CACHE[key] = ...
  return _CACHE[key]

or maybe wrap your function in such a way that the args are hashable, which adds a lot of boilerplate.
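For illustration, such a wrapper might look like the sketch below. `cache_with_key` is a hypothetical helper, not a functools API, and a plain list stands in for the unhashable ndarray:

```python
import functools

def cache_with_key(key_func):
    """Hypothetical helper: cache on key_func(*args) instead of the raw,
    possibly unhashable, arguments."""
    def decorator(func):
        cached = {}
        @functools.wraps(func)
        def wrapper(*args):
            k = key_func(*args)
            if k not in cached:
                cached[k] = func(*args)
            return cached[k]
        return wrapper
    return decorator

CALLS = []

# A plain list stands in for the unhashable np.ndarray.
@cache_with_key(lambda arr, value: (tuple(arr), value))
def expensive_func(arr, value):
    CALLS.append(1)  # records real invocations, to show cache hits
    return sum(arr) * value
```

This works, but it is exactly the per-function boilerplate the proposal would eliminate.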

Instead I'd like to see something like:

@functools.cache(key=lambda arr, value: (tuple(arr), value))
def expensive_func(arr: np.ndarray, value: int) -> np.ndarray:
   ...

Looking at the implementation of lru_cache there's already a _make_key function: https://github.com/python/cpython/blob/3.11/Lib/functools.py#L448 . I'm mainly proposing letting you supply that function yourself. The default would remain as is, requiring hashable args.
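Pending such a feature, the proposed key= behavior can be approximated today on top of the existing lru_cache. This is a rough sketch: the `lru_cache_with_key` name is hypothetical, and the closure trick for passing the real arguments through is not thread-safe as written:

```python
import functools

def lru_cache_with_key(key, maxsize=128):
    """Hypothetical sketch of the proposed API: the LRU cache sees only
    key(*args, **kwargs), while the real arguments are smuggled through
    a closure. Not thread-safe as written."""
    def decorator(func):
        current = {}
        @functools.lru_cache(maxsize=maxsize)
        def by_key(k):
            args, kwargs = current['call']
            return func(*args, **kwargs)
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            current['call'] = (args, kwargs)
            return by_key(key(*args, **kwargs))
        return wrapper
    return decorator

CALLS = []

# A plain list stands in for the unhashable np.ndarray.
@lru_cache_with_key(key=lambda arr, value: (tuple(arr), value))
def expensive_func(arr, value):
    CALLS.append(1)  # records real invocations, to show cache hits
    return sum(arr) * value
```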

@tewalds tewalds added the type-feature A feature request or enhancement label Jan 13, 2023
@rhettinger
Contributor

To my eyes, the example looks like an anti-pattern. Casting an ndarray to a tuple is a slow and memory-intensive operation. It loops over the entire array and builds a Python object (likely a float) from every array element. Next, the cache logic has to loop over the entire tuple, performing a pure-Python equality test for each element to verify a cache hit. Once the cached value is returned, all of that work is thrown away and has to happen again on the next call.

@tewalds
Author

tewalds commented Jan 13, 2023

If it was a tuple to begin with you'd still have to iterate the whole thing to generate the hash key used to store it in the dictionary. It's less efficient because you need to make a copy of the numpy array, but you could do better with something like:

  @functools.cache(key=lambda arr, value: (hash(arr.data.tobytes()), value))
  def expensive_func(arr: np.ndarray, value: int) -> np.ndarray:
     ...

or depending on your exact use case, maybe:

  @functools.cache(key=lambda arr, value: (arr.shape, value))

is sufficient, or in some extreme cases, even only using a subset of the arguments. The point is more that this is something that would be useful to expose to the user, and it's up to them to use it in a sensible way. I used a numpy array because it's a well known object, but this would generalize to any type. Maybe the documentation should make it clear that this can be slow or lead to hash collisions if the key value isn't sensible, but that's also already true given the typed argument.

@tewalds
Author

tewalds commented Jan 13, 2023

Do you have a better way of caching the result of an expensive computation whose arguments include a numpy array, in a thread-safe way, efficiently, without much boilerplate? Even better if it doesn't involve too much change to the caller or callee.

@rhettinger
Contributor

Do you have a better way of caching the result of an expensive computation whose arguments include a numpy array, ...

Framing a problem that way pushes you away from the common, cheap solution. Generally, it is an anti-pattern to use a large dataset as a dictionary key to look up a computation result:

I want to use a 800Mb genome sequence as a key to look up a mutation search result ...
I want to use a SQL database as a key to look up a slow SQL query result ...
I want to use the contents of my directory as a key to lookup a directory analysis result ...
I want to use a giant matrix as a key to lookup an eigenvalue computation ...

The norm I've seen over the years is for data to be labeled or tagged in some way:

mutation_result['fruit_fly_assay_59284a']
query_result['employee.db.snapshot.2013.01.14']
eigenvalues['image_4812.jpg']

This lends itself naturally to caching:

from functools import cache    
import numpy as np
import time

datasets = {
    'north': np.random.random(10),
    'south': np.random.random(10),
    'east': np.random.random(10),
    'west': np.random.random(10),
}

@cache
def summarize(region):
    dataset = datasets[region]
    print('!')
    time.sleep(1)
    return dataset.sum()

for region in ['north', 'south', 'north', 'east', 'south',
               'east', 'west', 'south', 'north', 'west']:
    print(summarize(region))

If forced into the awkward position of using a dataset as a lookup key, I would make the operation explicit with three lines of simple Python (a dict.get, followed by a None test, and conditionally followed by a function call):

    import numpy as np
    import time

    def slow(arr):
        print('!')
        time.sleep(1)
        return arr.sum()

    a1 = np.random.random(10)
    a2 = np.random.random(10)
    a3 = np.random.random(10)
    a4 = np.random.random(10)

    saved = {}
    for arr in [a1, a2, a1, a3, a3, a1, a4, a3, a1]:
        result = saved.get(t := tuple(arr))
        if result is None:
            result = saved[t] = slow(arr)
        print(result)

Note it is much more memory efficient (and likely faster) to replace tuple(arr) with sha256(arr.tobytes()). If the data is immutable and unique, then id(arr) would suffice.
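The sha256 variant might look like this; the stdlib array module stands in for np.ndarray here, since both expose .tobytes():

```python
import hashlib
from array import array

# A stdlib array stands in for np.ndarray; both provide .tobytes().
arr = array('d', range(1000))

# tuple(arr) would build 1000 Python float objects and keep them alive
# in the cache; a SHA-256 digest is a fixed 32 bytes with cheap,
# constant-size equality checks on every lookup.
key = hashlib.sha256(arr.tobytes()).digest()
```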

That said, I would seek to avoid using a mutable dataset as a key. We don't usually see people doing this with dicts, sets, or database keys, so why would they do it with an @cache decorator (which is the same as a dict lookup but with pretty wrapping)?

@tewalds
Author

tewalds commented Jan 14, 2023

Given that this is in the core library, even if accepted it wouldn't be usable until the next release makes it to my production environment (both of which are out of my control), so I already implemented it as a dictionary lookup. My first attempt was with @cache, and the code reviewers suggested that as well, so I'm certainly not the only one who thinks my particular case makes sense for it. It's quite possible that there is a refactor that would make the caching redundant, but in this particular case that refactor is likely quite big and invasive due to the (internal) parallelization framework we're using. Writing the caching manually is certainly doable and not that hard.

It's also worth noting that numpy arrays aren't necessarily big. We often use them for trivial things like single (x, y) point coordinates. In my case turning the numpy array into a tuple and hashing it is ~10,000 times faster than the computation done on it.

Note it is much more memory efficient (and likely faster) to replace tuple(arr) with sha256(arr.tobytes()).

This is exactly what I'm asking to be able to do without having to write my own caching/dict, and without having to change the arguments to the function. If this is accepted and there is an example in the documentation, then something like sha256(arr.tobytes()) would make sense to show how to use it in a fast and memory efficient way.

@rhettinger
Contributor

The @cache decorator is a thin wrapper around a dictionary lookup, so what is special about caching that would warrant a key-function but not sets and dicts? What differentiates this issue from a more generic, "sets and dicts should support non-hashable function arguments via a key arg"? Why do you think sets and dicts do not support non-hashable arguments?

In my case, ...

When it comes to the standard library, design decisions are made for the world of future users, not for one specific circumstance of an already solved problem. The proposal is to open the door for all users to feed non-hashable arguments to a cached function. Do we need to open a door that most people shouldn't go through?

Also keep in mind that a lot of people find lambda to be gross (I am not one of them) and would want a stand-alone function. At that point the terseness is lost and a person may have been better-off with some other design.

If we open this door, I suspect it will cause many more problems than it solves. It would take a good deal of thought and care to use this in a correct and performant way.

@tewalds
Author

tewalds commented Jan 14, 2023

I can totally see this being useful for a dict or set, but one big difference is that you can subclass dict pretty trivially to achieve this for your specific case by overriding a couple of dunder methods. You can't subclass cache since it's a function, and the relevant functionality, _make_key, isn't available to override. You therefore need to either fork it or reimplement (or drop) the rest of the functionality (thread safety, LRU eviction, stats, etc.), which is also error-prone. In some sense, I'm asking to expose cache's __getitem__ and __setitem__.
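The dict-subclass route can be sketched in a few lines. `KeyedDict` is a hypothetical name, and a plain list stands in for the ndarray:

```python
class KeyedDict(dict):
    """Hypothetical dict subclass that runs every key through a
    conversion function before lookup, e.g. list/ndarray -> tuple."""
    def __init__(self, key=tuple):
        super().__init__()
        self._key = key

    def __getitem__(self, k):
        return super().__getitem__(self._key(k))

    def __setitem__(self, k, v):
        super().__setitem__(self._key(k), v)

    def __contains__(self, k):
        return super().__contains__(self._key(k))

d = KeyedDict()
d[[1, 2]] = 'cached'   # the unhashable list is converted to a tuple key
```

No equivalent override point exists for the function returned by cache/lru_cache.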

@rhettinger
Contributor

rhettinger commented Jan 16, 2023

I appreciate the suggestion and fully appreciate your use case, but I'm going to decline. Mostly, I would like to keep the API simple, avoid feature creep, and not try to be all things to all people. Also, I hesitate to go against the long-standing Python norm that mappings don't provide automatic support for mutable keys; this has kept the average user out of trouble and made experienced users be explicit about the conversion.

ISTM that your use case is atypical. The inputs are short ((x, y) coordinates), so the cost of a tuple conversion is small and the additional memory use can be ignored. The wrapped function is ~10,000x slower than conversion and hashing. Presumably, the cache hit rate is high and the added cost of a cache miss can be ignored.

For other users, assessing whether to use a cache key-function is more challenging and treacherous. For example, dot = cache(key=lambda v1, v2: (tuple(v1), tuple(v2)))(np.ndarray.dot). Even for a cache hit, this would be slower than calling the function directly. The cached keys would eat much more memory than the underlying function call, and unlike the function call's arguments, they would persist until the cache is cleared. A user would be better off not being baited into going down this path.

While the key function in your single use case is simple, it would be much harder for nested data, JSON for example. And your arity-two example departs from the existing norm that all key functions are of arity one.

It sounds like your particular problem was already solved and solved easily. The only thing missing was the aspiration to force this into the standard @cache decorator. That may have been a good thing in that one situation, but for others I think it would be an attractive nuisance.

Side note: It might be worthwhile to submit a NumPy feature request for opt-in hashability. Already, you can set the writeable attribute of an array to False. It would not be difficult to then have the array be hashable. That would serve users like yourself who think it is a good idea to use datasets as dict keys.
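Lacking such a NumPy feature, the opt-in idea can be sketched generically: snapshot the buffer as bytes and hash that. `HashableArray` is a hypothetical name, and the stdlib array module stands in for np.ndarray (with NumPy you would first set arr.flags.writeable = False):

```python
from array import array

class HashableArray:
    """Sketch of opt-in hashability: snapshot the underlying buffer as
    bytes and hash that, so later mutation cannot desync the hash."""
    def __init__(self, data):
        self._bytes = bytes(data)      # defensive copy of the buffer
        self._hash = hash(self._bytes)

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return isinstance(other, HashableArray) and self._bytes == other._bytes

a = HashableArray(array('d', [1.0, 2.0]))
b = HashableArray(array('d', [1.0, 2.0]))
```

With the writeable flag approach, NumPy could skip the defensive copy entirely.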

@rhettinger
Contributor

I forgot to mention the double or triple maintenance D.R.Y. issue that arose while experimenting with this proposal. For a key function to be robust, it would need to faithfully mirror the underlying function signature in every detail. That relationship would have to be maintained over time, updating all three to keep them in sync. It can be viewed as a triple maintenance problem because the parameters have to be stated three times: once in the main function signature, again in the key function's lambda parameters, and again in the lambda function body.

@lru_cache(maxsize=2048,
           typed=True,
           key=(lambda a, b=None, overwrite_a=False, check_finite=True, homogeneous_eigvals=False:
                (tuple(a), None if b is None else tuple(b), overwrite_a, check_finite, homogeneous_eigvals)))
def eigvals(a, b=None, overwrite_a=False, check_finite=True, homogeneous_eigvals=False):
    ...

Also, signatures with optional arguments would be challenging to model in a key function. For example, Python's range() builtin or dict.get() behave differently when the number of arguments changes. NumPy has these as well; see numpy.var for example.

ISTM that the proposal falls apart as the function signatures grow more complex.

@tewalds
Author

tewalds commented Jan 16, 2023

Marking read-only numpy arrays as hashable sounds like a good feature request. That does force a change for the caller though.

That does bring up the point that something along these lines, where you store a (long enough) hash as the key instead of the value itself, could be more efficient and support normal objects like lists without the problem of mutability. Unclear if that is a good thing or not, but it would remove a caveat to the current cache function.
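A minimal sketch of that hash-as-key idea; `DigestCache` is a hypothetical name, and digest collisions are theoretically possible but negligible with SHA-256:

```python
import hashlib

class DigestCache:
    """Hypothetical sketch: key the cache on a content digest, so the
    large (possibly mutable) argument is never retained as a key."""
    def __init__(self):
        self._store = {}

    def get_or_compute(self, data, func):
        # bytes(data) works for anything exposing the buffer protocol,
        # or a list of small ints; with NumPy use arr.tobytes().
        k = hashlib.sha256(bytes(data)).digest()
        if k not in self._store:
            self._store[k] = func(data)
        return self._store[k]

cache = DigestCache()
calls = []

def summarize(data):
    calls.append(1)  # records real invocations, to show cache hits
    return sum(data)

result = cache.get_or_compute([1, 2, 3], summarize)
```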

An alternative approach would be to allow replacing the dict with your own subclass. Then you could do the extension in a general way in the dict's __getitem__ and __setitem__. It wouldn't be as concise or know the argument names though, so it doesn't seem like a great idea.

The DRY issue wouldn't be such a big deal if it could delegate to the normal make_key function. It could do something like:

def np_to_hash(v):
  return hash(v.tobytes()) if isinstance(v, np.ndarray) else v

def np_key(args, kwargs, typed):
  return functools.make_key(tuple(map(np_to_hash, args)), {k: np_to_hash(v) for k, v in kwargs.items()}, typed)

@cache(make_key=np_key)
def slow_func(...):
  ...

At least this is pretty general and could be reused, but it's getting pretty verbose, so it's of unclear value. I'd certainly prefer something like:

class NumpyCache(functools.cache):

  def _make_key(self, args, kwargs, typed):
    return super()._make_key(...)

But that would require a pretty big change to the existing cache function to be possible, and is also of unclear value.

Thanks for exploring this idea with me.

@rhettinger
Contributor

rhettinger commented Jan 17, 2023

Thanks for exploring this idea with me.

@tewalds You're welcome. Please keep-up the good work at DeepMind. I'm a big fan and read all of the papers coming out of that research.
