Handle empty memoryviews of bytearrays when (de)serializing #6576

quasiben · 2022-06-14T12:58:15Z

Tests added / passed
Passes pre-commit run --all-files

@jakirkham can you review and take over if anything else is needed ?

github-actions · 2022-06-14T13:44:20Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±  0       15 suites ±0 7h 21m 38s ⏱️ + 3m 11s
  2 881 tests +  2   2 797 ✔️ +  4   83 💤 - 1 1 ❌ - 1
21 344 runs +14 20 376 ✔️ +17 967 💤 - 1 1 ❌ - 2

For more details on these failures, see this check.

Results for commit fbe4dc0. ± Comparison against base commit 96c13a5.

♻️ This comment has been updated with latest results.

quasiben · 2022-06-15T15:09:12Z

Merging EOD unless there are other comments

jakirkham · 2022-06-15T19:50:31Z

Sorry Ben I'll try to look, but it may not happen right away. Have some other high priority things that needed to be resolved first

adamklein · 2022-06-16T02:09:49Z

LGTM, thx @quasiben

jakirkham · 2022-06-16T05:52:22Z

One issue that may come up here is the handling of arrays with multiple dimensions, one of which is 0. For example

In [1]: import numpy
   ...: 
   ...: from distributed.protocol import deserialize, serialize

In [2]: mv = numpy.ones((2, 0), dtype="f8").data

In [3]: deserialize(*serialize(mv))

Side note: This doesn't require NumPy. One could use mv = memoryview(b"123456").cast("B", (2, 3))[:0].

Current Behavior (error):

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 deserialize(*serialize(mv))

File ~/miniconda/envs/dask/lib/python3.9/site-packages/distributed/protocol/serialize.py:416, in deserialize(header, frames, deserializers)
    411     raise TypeError(
    412         "Data serialized with %s but only able to deserialize "
    413         "data with %s" % (name, str(list(deserializers)))
    414     )
    415 dumps, loads, wants_context = families[name]
--> 416 return loads(header, frames)

File ~/miniconda/envs/dask/lib/python3.9/site-packages/distributed/protocol/serialize.py:58, in dask_loads(header, frames)
     56 typ = pickle.loads(header["type-serialized"])
     57 loads = dask_deserialize.dispatch(typ)
---> 58 return loads(header["sub-header"], frames)

File ~/miniconda/envs/dask/lib/python3.9/site-packages/distributed/protocol/serialize.py:786, in _deserialize_memoryview(header, frames)
    784 else:
    785     out = memoryview(b"".join(frames))
--> 786 out = out.cast(header["format"], header["shape"])
    787 return out

TypeError: memoryview: cannot cast view with zeros in shape or strides

This would not be reconstructed correctly during deserialization. How should we handle this case?

Raise an error during serialization
Warn during serialization
Do nothing
?

With this change, we are doing 3. Though maybe we should be doing one of the others instead (silent failures can be difficult to debug).

Edit: Updated the example above to show how serialization comes into play (and how it errors currently).

quasiben · 2022-06-16T13:22:41Z

Without this change we are still silently doing nothing.

seems like something we should do until we have a solution. We could also warn in the deserialization if out.shape != header['shape''

Also, is there a usecase where passing empty memoryviews of numpy arrays comes up ? At the moment, I'm assuming not because I don't think Dask has correctly handled this case in the past. Is this right ?

jakirkham · 2022-06-17T05:03:46Z

Not exactly. Without this change we are raising an error, it is just indiscriminate in nature. We could narrow it for example. We could also move it from deserialize to serialize. Tried to clarify this in the example above.

Think @adamklein would be better able to answer those questions than I. Being a user of this functionality (serialization of memoryviews), what would you expect to happen when serializing the data in the case above?

distributed/protocol/tests/test_serialize.py

This case is different enough from the `array` test that it is worth handling on its own.

It is not possible to correctly deserialize an empty `memoryview` with multiple dimensions using builtin Python functionality alone. So raise when trying to serialize such an object to provide this information the user as close to the error producing object as possible (and not when it is deserialized somewhere else). Should help the user find this and either flatten the `memoryview` themselves, switching to using NumPy where this is handled correctly, or do something else.

Unlike NumPy arrays, this works with `memoryview`s. So check whether it is trivial or not this way.

This was done for us before with `cast`. However it is not handled in this branch. So do that ourselves.

jakirkham · 2022-06-22T04:23:25Z

distributed/protocol/serialize.py

+    if not obj and obj.ndim > 1:
+        raise ValueError("Cannot serialize empty non-1-D `memoryview`")


Since we can't properly reconstruct these during deserialization, just raise here. This should be as close as possible to user code. So should hopefully make it clearer what data triggered this error.

jakirkham · 2022-06-22T04:39:42Z

distributed/protocol/tests/test_serialize.py

+def test_ser_empty_nd_memoryview():
+    mv = memoryview(b"12").cast("B", (1, 2))[:0]
+
+    # serialize empty `memoryview`
+    with pytest.raises(TypeError):
+        serialize(mv, on_error="raise")


Test the error case

jakirkham · 2022-06-22T04:40:44Z

Ok pushed some changes above. Please let me know what you think :)

quasiben · 2022-06-22T19:48:09Z

distributed/protocol/serialize.py

+        out = out.cast(header["format"], header["shape"])
+    else:
+        out = out.cast(header["format"])
+        assert out.shape == header["shape"]


Is this assertion necessary ? I'm thinking about what happens if a user gets this ? Is the intent to inform the user something in serializing/deserializing has gone awry ?

Yeah the .cast effectively does this check for us. However it is not handled in the else case so added an assert

If this assertion raises, it means something has gone horribly wrong. For example we expected a non-empty memoryview, but got an empty one. IOW data was lost in communication somewhere.

Based off offline discussion, it sounds like this is ok. Though happy to discuss further if needed

quasiben · 2022-06-22T19:49:40Z

Added a single comment but planning to merge tomorrow before Friday's release cc @jsignell

jakirkham · 2022-06-23T07:21:44Z

Thanks Ben! 🙏

handle empty memoryviews of bytearrays when (de)serializing

c9185ae

jakirkham reviewed Jun 22, 2022

View reviewed changes

distributed/protocol/tests/test_serialize.py Outdated Show resolved Hide resolved

jakirkham added 4 commits June 21, 2022 20:43

Break out empty memoryview test

0f52ff5

This case is different enough from the `array` test that it is worth handling on its own.

Test empty memoryview with multiple dimensions

5b318e5

Check truthy value of memoryview

77ab6fc

Unlike NumPy arrays, this works with `memoryview`s. So check whether it is trivial or not this way.

jakirkham force-pushed the fix-empty-memoryviews branch from bed24db to 77ab6fc Compare June 22, 2022 04:22

jakirkham added 2 commits June 21, 2022 21:23

Update comment for memoryviews

85c3e9f

assert memoryview's shape is as expected

fbe4dc0

This was done for us before with `cast`. However it is not handled in this branch. So do that ourselves.

jakirkham reviewed Jun 22, 2022

View reviewed changes

quasiben commented Jun 22, 2022

View reviewed changes

jakirkham merged commit bc04d0e into dask:main Jun 23, 2022

quasiben deleted the fix-empty-memoryviews branch June 23, 2022 12:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle empty memoryviews of bytearrays when (de)serializing #6576

Handle empty memoryviews of bytearrays when (de)serializing #6576

quasiben commented Jun 14, 2022

github-actions bot commented Jun 14, 2022 •

edited

Loading

quasiben commented Jun 15, 2022

jakirkham commented Jun 15, 2022

adamklein commented Jun 16, 2022

jakirkham commented Jun 16, 2022 •

edited

Loading

quasiben commented Jun 16, 2022 •

edited

Loading

jakirkham commented Jun 17, 2022 •

edited

Loading

jakirkham Jun 22, 2022

jakirkham Jun 22, 2022

jakirkham commented Jun 22, 2022

quasiben Jun 22, 2022

jakirkham Jun 22, 2022

jakirkham Jun 23, 2022

quasiben commented Jun 22, 2022

jakirkham commented Jun 23, 2022

		if not obj and obj.ndim > 1:
		raise ValueError("Cannot serialize empty non-1-D `memoryview`")

Handle empty memoryviews of bytearrays when (de)serializing #6576

Handle empty memoryviews of bytearrays when (de)serializing #6576

Conversation

quasiben commented Jun 14, 2022

github-actions bot commented Jun 14, 2022 • edited Loading

Unit Test Results

quasiben commented Jun 15, 2022

jakirkham commented Jun 15, 2022

adamklein commented Jun 16, 2022

jakirkham commented Jun 16, 2022 • edited Loading

quasiben commented Jun 16, 2022 • edited Loading

jakirkham commented Jun 17, 2022 • edited Loading

jakirkham Jun 22, 2022

Choose a reason for hiding this comment

jakirkham Jun 22, 2022

Choose a reason for hiding this comment

jakirkham commented Jun 22, 2022

quasiben Jun 22, 2022

Choose a reason for hiding this comment

jakirkham Jun 22, 2022

Choose a reason for hiding this comment

jakirkham Jun 23, 2022

Choose a reason for hiding this comment

quasiben commented Jun 22, 2022

jakirkham commented Jun 23, 2022

github-actions bot commented Jun 14, 2022 •

edited

Loading

jakirkham commented Jun 16, 2022 •

edited

Loading

quasiben commented Jun 16, 2022 •

edited

Loading

jakirkham commented Jun 17, 2022 •

edited

Loading