Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle empty memoryviews of bytearrays when (de)serializing #6576

Merged
merged 7 commits into from
Jun 23, 2022

Conversation

quasiben
Copy link
Member

Closes #6542

  • Tests added / passed
  • Passes pre-commit run --all-files

@jakirkham can you review and take over if anything else is needed ?

cc @adamklein

@github-actions
Copy link
Contributor

github-actions bot commented Jun 14, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   7h 21m 38s ⏱️ + 3m 11s
  2 881 tests +  2    2 797 ✔️ +  4    83 💤  - 1  1  - 1 
21 344 runs  +14  20 376 ✔️ +17  967 💤  - 1  1  - 2 

For more details on these failures, see this check.

Results for commit fbe4dc0. ± Comparison against base commit 96c13a5.

♻️ This comment has been updated with latest results.

@quasiben
Copy link
Member Author

Merging EOD unless there are other comments

@jakirkham
Copy link
Member

Sorry Ben I'll try to look, but it may not happen right away. Have some other high priority things that needed to be resolved first

@adamklein
Copy link
Contributor

LGTM, thx @quasiben

@jakirkham
Copy link
Member

jakirkham commented Jun 16, 2022

One issue that may come up here is the handling of arrays with multiple dimensions, one of which is 0. For example

In [1]: import numpy
   ...: 
   ...: from distributed.protocol import deserialize, serialize

In [2]: mv = numpy.ones((2, 0), dtype="f8").data

In [3]: deserialize(*serialize(mv))

Side note: This doesn't require NumPy. One could use mv = memoryview(b"123456").cast("B", (2, 3))[:0].

Current Behavior (error):
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 deserialize(*serialize(mv))

File ~/miniconda/envs/dask/lib/python3.9/site-packages/distributed/protocol/serialize.py:416, in deserialize(header, frames, deserializers)
    411     raise TypeError(
    412         "Data serialized with %s but only able to deserialize "
    413         "data with %s" % (name, str(list(deserializers)))
    414     )
    415 dumps, loads, wants_context = families[name]
--> 416 return loads(header, frames)

File ~/miniconda/envs/dask/lib/python3.9/site-packages/distributed/protocol/serialize.py:58, in dask_loads(header, frames)
     56 typ = pickle.loads(header["type-serialized"])
     57 loads = dask_deserialize.dispatch(typ)
---> 58 return loads(header["sub-header"], frames)

File ~/miniconda/envs/dask/lib/python3.9/site-packages/distributed/protocol/serialize.py:786, in _deserialize_memoryview(header, frames)
    784 else:
    785     out = memoryview(b"".join(frames))
--> 786 out = out.cast(header["format"], header["shape"])
    787 return out

TypeError: memoryview: cannot cast view with zeros in shape or strides

This would not be reconstructed correctly during deserialization. How should we handle this case?

  1. Raise an error during serialization
  2. Warn during serialization
  3. Do nothing
  4. ?

With this change, we are doing 3. Though maybe we should be doing one of the others instead (silent failures can be difficult to debug).

Edit: Updated the example above to show how serialization comes into play (and how it errors currently).

@quasiben
Copy link
Member Author

quasiben commented Jun 16, 2022

Without this change we are still silently doing nothing.

  1. seems like something we should do until we have a solution. We could also warn in the deserialization if out.shape != header['shape''

Also, is there a usecase where passing empty memoryviews of numpy arrays comes up ? At the moment, I'm assuming not because I don't think Dask has correctly handled this case in the past. Is this right ?

@jakirkham
Copy link
Member

jakirkham commented Jun 17, 2022

Not exactly. Without this change we are raising an error, it is just indiscriminate in nature. We could narrow it for example. We could also move it from deserialize to serialize. Tried to clarify this in the example above.

Think @adamklein would be better able to answer those questions than I. Being a user of this functionality (serialization of memoryviews), what would you expect to happen when serializing the data in the case above?

This case is different enough from the `array` test that it is worth
handling on its own.
It is not possible to correctly deserialize an empty `memoryview` with
multiple dimensions using builtin Python functionality alone. So raise
when trying to serialize such an object to provide this information the
user as close to the error producing object as possible (and not when it
is deserialized somewhere else). Should help the user find this and
either flatten the `memoryview` themselves, switching to using NumPy
where this is handled correctly, or do something else.
Unlike NumPy arrays, this works with `memoryview`s. So check whether it
is trivial or not this way.
This was done for us before with `cast`. However it is not handled in
this branch. So do that ourselves.
Comment on lines +775 to +776
if not obj and obj.ndim > 1:
raise ValueError("Cannot serialize empty non-1-D `memoryview`")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we can't properly reconstruct these during deserialization, just raise here. This should be as close as possible to user code. So should hopefully make it clearer what data triggered this error.

Comment on lines +606 to +611
def test_ser_empty_nd_memoryview():
mv = memoryview(b"12").cast("B", (1, 2))[:0]

# serialize empty `memoryview`
with pytest.raises(TypeError):
serialize(mv, on_error="raise")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test the error case

@jakirkham
Copy link
Member

Ok pushed some changes above. Please let me know what you think :)

out = out.cast(header["format"], header["shape"])
else:
out = out.cast(header["format"])
assert out.shape == header["shape"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this assertion necessary ? I'm thinking about what happens if a user gets this ? Is the intent to inform the user something in serializing/deserializing has gone awry ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the .cast effectively does this check for us. However it is not handled in the else case so added an assert

If this assertion raises, it means something has gone horribly wrong. For example we expected a non-empty memoryview, but got an empty one. IOW data was lost in communication somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based off offline discussion, it sounds like this is ok. Though happy to discuss further if needed

@quasiben
Copy link
Member Author

Added a single comment but planning to merge tomorrow before Friday's release cc @jsignell

@jakirkham jakirkham merged commit bc04d0e into dask:main Jun 23, 2022
@jakirkham
Copy link
Member

Thanks Ben! 🙏

@quasiben quasiben deleted the fix-empty-memoryviews branch June 23, 2022 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

_deserialize_memoryview bug
3 participants