Add min, max and clamp arithmetic ops #2298
Conversation
def min(left, right):
    return _arithm_op("min", left, right)


def max(left, right):
    return _arithm_op("max", left, right)


def clamp(value, lo, hi):
    return _arithm_op("clamp", value, lo, hi)
mzient
Sep 24, 2020
Contributor
Maybe they should land in dali.fn? Both the usage mode and lowercase naming would play better with dali.fn.
Besides the module:
Suggested change:
- def min(left, right):
-     return _arithm_op("min", left, right)
- def max(left, right):
-     return _arithm_op("max", left, right)
- def clamp(value, lo, hi):
-     return _arithm_op("clamp", value, lo, hi)
+ def min(left, right):
+     """Fills the output with minima of corresponding values in ``left`` and ``right``."""
+     return _arithm_op("min", left, right)
+ def max(left, right):
+     """Fills the output with maxima of corresponding values in ``left`` and ``right``."""
+     return _arithm_op("max", left, right)
+ def clamp(value, lo, hi):
+     """Produces a tensor of values from ``value`` clamped to the range ``lo``..``hi``."""
+     return _arithm_op("clamp", value, lo, hi)
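For illustration, the scalar behaviour those docstrings describe can be sketched in plain Python. This is not the DALI implementation (the real functions build expression nodes via `_arithm_op` and operate element-wise on tensors); the `scalar_*` names are made up for this sketch:

```python
# Plain-Python sketch of the scalar semantics behind the proposed ops.

def scalar_min(left, right):
    # Elementwise minimum for a single pair of values.
    return left if left < right else right

def scalar_max(left, right):
    # Elementwise maximum for a single pair of values.
    return left if left > right else right

def scalar_clamp(value, lo, hi):
    # Clamp value into the closed range [lo, hi].
    return scalar_max(lo, scalar_min(value, hi))

print(scalar_clamp(5, 0, 3))   # value above the range -> hi
print(scalar_clamp(-2, 0, 3))  # value below the range -> lo
print(scalar_clamp(1, 0, 3))   # value inside the range -> unchanged
```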
@@ -115,8 +115,8 @@ DLL_PUBLIC DALIDataType PropagateTypes(ExprNode &expr, const workspace_t<Backend
   }
   auto &func = dynamic_cast<ExprFunc &>(expr);
   int subexpression_count = func.GetSubexpressionCount();
- DALI_ENFORCE(subexpression_count == 1 || subexpression_count == 2,
-              "Only unary and binary expressions are supported");
+ DALI_ENFORCE(1 <= subexpression_count && subexpression_count <= kMaxArity,
JanuszL
Sep 24, 2020
Contributor
Suggested change:
- DALI_ENFORCE(1 <= subexpression_count && subexpression_count <= kMaxArity,
+ DALI_ENFORCE(0 < subexpression_count && subexpression_count <= kMaxArity,
To be consistent with L340.
klecki
Oct 5, 2020
Author
Contributor
Done
@@ -193,8 +193,8 @@ DLL_PUBLIC inline const TensorListShape<> &PropagateShapes(ExprNode &expr,
   }
   auto &func = dynamic_cast<ExprFunc &>(expr);
   int subexpression_count = expr.GetSubexpressionCount();
- DALI_ENFORCE(subexpression_count == 1 || subexpression_count == 2,
-              "Only unary and binary expressions are supported");
+ DALI_ENFORCE(1 <= subexpression_count && subexpression_count <= kMaxArity,
JanuszL
Sep 24, 2020
Contributor
As above
klecki
Oct 5, 2020
Author
Contributor
Done
  } else {
    DALI_FAIL("Expression cannot have three scalar operands");
  }
), DALI_FAIL("No suitable type found");); // NOLINT(whitespace/parens)
JanuszL
Sep 24, 2020
Contributor
Could you print type as well and tell which argument is the faulty one?
.../math/expressions/expression_factory_instances/expression_impl_factory.h (outdated; resolved)
auto v_ = static_cast<result_t<T, Min, Max>>(v);
auto lo_ = static_cast<result_t<T, Min, Max>>(lo);
auto hi_ = static_cast<result_t<T, Min, Max>>(hi);
auto lo_clamp_ = v_ <= lo_ ? lo_ : v_;
return lo_clamp_ >= hi_ ? hi_ : lo_clamp_;
mzient
Sep 25, 2020
Contributor
Suggested change:
- auto v_ = static_cast<result_t<T, Min, Max>>(v);
- auto lo_ = static_cast<result_t<T, Min, Max>>(lo);
- auto hi_ = static_cast<result_t<T, Min, Max>>(hi);
- auto lo_clamp_ = v_ <= lo_ ? lo_ : v_;
- return lo_clamp_ >= hi_ ? hi_ : lo_clamp_;
+ return clamp<result_t<T, Min, Max>>(v, lo, hi);
dali/core/math_util.h
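A quick plain-Python sanity check (with Python `float` standing in for `result_t<T, Min, Max>`, and `clamp` a stand-in for the helper in dali/core/math_util.h) suggests the two forms agree for valid ranges:

```python
def clamp(v, lo, hi):
    # Single-call form; a stand-in for the clamp helper from math_util.h.
    return hi if v >= hi else (lo if v <= lo else v)

def clamp_manual(v, lo, hi):
    # The original step-by-step form from the reviewed code.
    v_, lo_, hi_ = float(v), float(lo), float(hi)
    lo_clamp = lo_ if v_ <= lo_ else v_   # max(v, lo)
    return hi_ if lo_clamp >= hi_ else lo_clamp  # min(..., hi)

for v in (-5, 0, 2, 7):
    assert clamp_manual(v, 0, 4) == clamp(float(v), 0.0, 4.0)
print("forms agree")
```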
klecki
Oct 2, 2020
Author
Contributor
done
CUDA_CALL(cudaEventElapsedTime(&time, start, end));
std::cerr << "Elapsed Time: " << time << " s\n";

// time *= (1e+6f / kIters); // convert to nanoseconds / 100 samples
JanuszL
Sep 30, 2020
Contributor
?
klecki
Oct 2, 2020
Author
Contributor
Will remove all of the profiling before posting final version.
@@ -195,6 +258,18 @@ using ExprImplGpuCT = ExprImplGPUInvoke<InvokerBinOp<op, Result, Left, Right, fa
template <ArithmeticOp op, typename Result, typename Left, typename Right>
using ExprImplGpuTC = ExprImplGPUInvoke<InvokerBinOp<op, Result, Left, Right, true, false>>;

// template <ArithmeticOp op, typename Result, typename First, typename Second, typename Third,
JanuszL
Sep 30, 2020
Contributor
?
Force-pushed from 2a3a457 to 2d2ecac
!build
CI MESSAGE: [1680060]: BUILD STARTED
!build
CI MESSAGE: [1680100]: BUILD STARTED
import nvidia.dali.ops
# Fully circular imports don't work. We need to import _arithm_op late and
# replace this trampoline function.
setattr(sys.modules[__name__], "_arithm_op", nvidia.dali.ops._arithm_op)
mzient
Oct 6, 2020
Contributor
I find the following simpler - but it's a matter of taste, I guess.
Suggested change:
- setattr(sys.modules[__name__], "_arithm_op", nvidia.dali.ops._arithm_op)
+ global _arithm_op
+ _arithm_op = nvidia.dali.ops._arithm_op
klecki
Oct 6, 2020
Author
Contributor
I just copied your code from data_node.py :)
numpy_in = get_numpy_input(dali_in, kinds[i], dali_in.dtype.type, target_type if target_type is not None else dali_in.dtype.type)
inputs.append(numpy_in)
out = as_cpu(pipe_out[arity]).at(sample_id)
return tuple(inputs) + (out,)
mzient
Oct 6, 2020
Contributor
More Pythonic? ;)
Suggested change:
- return tuple(inputs) + (out,)
+ return (*inputs, out)
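Both spellings build the same tuple; a minimal check:

```python
inputs = [1, 2, 3]
out = 4

legacy = tuple(inputs) + (out,)  # original form: concatenate two tuples
pythonic = (*inputs, out)        # suggested form: unpack into one literal

assert legacy == pythonic == (1, 2, 3, 4)
print(pythonic)
```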
auto output = static_cast<Result *>(tile.output);
const void *first = tile.args[0];
const void *second = tile.args[1];
const void *third = tile.args[2];
mzient
Oct 6, 2020
Contributor
Suggested change:
- auto output = static_cast<Result *>(tile.output);
- const void *first = tile.args[0];
- const void *second = tile.args[1];
- const void *third = tile.args[2];
+ auto *__restrict__ output = static_cast<Result *>(tile.output);
+ const void *__restrict__ first = tile.args[0];
+ const void *__restrict__ second = tile.args[1];
+ const void *__restrict__ third = tile.args[2];
Just today I saw adding __restrict__ outperform caching in shared memory (in another kernel). The speedup was 1.7x
CI MESSAGE: [1680100]: BUILD FAILED
 * based on `as_ptr`
 */
template <bool as_ptr, typename T>
DALI_HOST_DEV std::enable_if_t<!as_ptr, T> Pass(const void* ptr, DALIDataType type_id) {
mzient
Oct 6, 2020
Contributor
Suggested change:
- DALI_HOST_DEV std::enable_if_t<!as_ptr, T> Pass(const void* ptr, DALIDataType type_id) {
+ DALI_HOST_DEV std::enable_if_t<!as_ptr, T> Pass(const void *__restrict__ ptr, DALIDataType type_id) {
template <bool as_ptr, typename T>
DALI_HOST_DEV std::enable_if_t<!as_ptr, T> Pass(const void* ptr, DALIDataType type_id) {
  TYPE_SWITCH(type_id, type2id, AccessType, ARITHMETIC_ALLOWED_TYPES, (
    const auto *access = reinterpret_cast<const AccessType*>(ptr);
mzient
Oct 6, 2020
Contributor
Suggested change:
- const auto *access = reinterpret_cast<const AccessType*>(ptr);
+ const auto *__restrict__ access = reinterpret_cast<const AccessType*>(ptr);
}

template <typename T>
DALI_HOST_DEV T Access(const T* ptr, int64_t idx) {
mzient
Oct 6, 2020
Contributor
Suggested change:
- DALI_HOST_DEV T Access(const T* ptr, int64_t idx) {
+ DALI_HOST_DEV T Access(const T* __restrict__ ptr, int64_t idx) {
}

template <typename T>
DALI_HOST_DEV T Access(const void* ptr, int64_t idx, DALIDataType type_id) {
mzient
Oct 6, 2020
Contributor
Suggested change:
- DALI_HOST_DEV T Access(const void* ptr, int64_t idx, DALIDataType type_id) {
+ DALI_HOST_DEV T Access(const void* __restrict__ ptr, int64_t idx, DALIDataType type_id) {
I think there could be some gain from doing this:
You can try restrict, but it's optional.
@@ -0,0 +1,57 @@
// Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- // Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ // Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
klecki
Oct 7, 2020
Author
Contributor
done
  }
}
DALIDataType result = BinaryTypePromotion(types[0], types[1]);
if (types.size() > 2) {
jantonguirao
Oct 7, 2020
Contributor
this if is redundant
klecki
Oct 7, 2020
Author
Contributor
Yes, removed 👍
DALI_HOST_DEV static constexpr result_t<L, R> impl(L l, R r) {
  auto l_ = static_cast<result_t<L, R>>(l);
  auto r_ = static_cast<result_t<L, R>>(r);
  return l_ <= r_ ? l_ : r_;
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- return l_ <= r_ ? l_ : r_;
+ return l_ < r_ ? l_ : r_;
This would work as well, right?
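For a quick intuition: on ties the two predicates return different operands, but since both operands were already cast to the common result type, the returned values compare equal either way. A Python sketch (names invented for this sketch; note that for floating point the picked operand can still differ bitwise, e.g. `-0.0` vs `0.0`, and NaN handling differs from `std::fmin`):

```python
def min_le(l, r):
    # Original form: keeps the left operand on ties.
    return l if l <= r else r

def min_lt(l, r):
    # Suggested form: keeps the right operand on ties.
    return l if l < r else r

for l, r in [(1, 2), (2, 1), (3, 3), (-1.5, 0.0), (0.0, 0.0)]:
    assert min_le(l, r) == min_lt(l, r)
print("equal results on all sampled pairs")
```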
klecki
Oct 7, 2020
Author
Contributor
Yes, will change
@@ -0,0 +1,24 @@
// Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- // Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ // Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
klecki
Oct 7, 2020
Author
Contributor
done
@@ -0,0 +1,25 @@
// Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- // Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+ // Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
klecki
Oct 7, 2020
Author
Contributor
done
.. note::
  Type promotion is commutative.

For more than two arguments, the resulting type is calculated as reduction from left to right
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- For more than two arguments, the resulting type is calculated as reduction from left to right
+ For more than two arguments, the resulting type is calculated as a reduction from left to right
klecki
Oct 7, 2020
Author
Contributor
done
For more than two arguments, the resulting type is calculated as reduction from left to right
- first calculating the result of operating on first two arguments, next between that intermediate
result and thirs argument and so on, untill we have only the result type left.
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- result and thirs argument and so on, untill we have only the result type left.
+ result and the third argument and so on, until we have only the result type left.
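The left-to-right reduction the documentation describes can be sketched with a toy promotion table (the table and function names here are illustrative; DALI's actual `BinaryTypePromotion` rules are richer, covering signed/unsigned and float widths):

```python
from functools import reduce

# Toy promotion lattice; the higher rank wins. Stand-in for BinaryTypePromotion.
RANK = {"bool": 0, "int32": 1, "float32": 2}

def binary_promote(a, b):
    return a if RANK[a] >= RANK[b] else b

def promote_types(types):
    # Left-to-right reduction: promote(promote(types[0], types[1]), types[2]), ...
    return reduce(binary_promote, types)

print(promote_types(["bool", "int32", "float32"]))  # -> float32
print(promote_types(["int32", "bool"]))             # -> int32
```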
klecki
Oct 7, 2020
Author
Contributor
done
Similarly to arithmetic expressions, one can use selected mathematical functions in the Pipeline
graph definition. They also accept :class:`nvidia.dali.pipeline.DataNode`,
:meth:`nvidia.dali.types.Constant` or regular Python value of type ``bool``, ``int``, or ``float``
as arguments. At least one of the inputs must be output of other DALI Operator.
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- as arguments. At least one of the inputs must be output of other DALI Operator.
+ as arguments. At least one of the inputs must be the output of other DALI Operator.
klecki
Oct 7, 2020
Author
Contributor
done
from invoking other operators. Full documentation can be found in the dedicated documentation
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- from invoking other operators. Full documentation can be found in the dedicated documentation
+ from invoking other operators. Full documentation can be found in the dedicated section of the documentation
klecki
Oct 7, 2020
Author
Contributor
Done
from invoking other operators. Full documentation can be found in the dedicated documentation
for :ref:`mathematical expressions`.
jantonguirao
Oct 7, 2020
Contributor
Suggested change:
- for :ref:`mathematical expressions`.
+ :ref:`mathematical expressions`.
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
.../math/expressions/expression_factory_instances/expression_impl_factory.h (resolved)
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1682623]: BUILD STARTED
The same behaviour for invalid ranges
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
!build
CI MESSAGE: [1682793]: BUILD STARTED
CI MESSAGE: [1682623]: BUILD FAILED
CI MESSAGE: [1682793]: BUILD PASSED
Why do we need this PR?
Adds min(a, b), max(a, b) and clamp(v, lo, hi) as arithmetic ops.

What happened in this PR?
Added support for ternary operators and their type/value switching.

TODO (Arithm Ops):
- Decide where to put the named arithm op functions in the DALI Python package.
- Nosetest that doesn't end in L1; probably should limit that.

JIRA TASK: [DALI-1628]