i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u #247

Open · wants to merge 1 commit into master from Maratyszcza:nearest-sat
Conversation

@Maratyszcza
Contributor

Maratyszcza commented Jun 6, 2020

Introduction

This PR adds a form of floating-point-to-integer conversion that rounds to nearest (ties to even), complementing the existing instructions, which round towards zero. This operation is natively supported in SSE2 and ARMv8 NEON, and can be efficiently simulated with native instructions on ARMv7 NEON.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86 processors with SSE2 instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp, wasm_splat_f32(0x1.0p+31f)
• CMPUNORDPS xmm_y, xmm_y
      • CMPLEPS xmm_tmp, xmm_x
      • ANDNPS xmm_y, xmm_x
• CVTPS2DQ xmm_y, xmm_y
      • PXOR xmm_y, xmm_tmp
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to
      • MOVAPS xmm_tmp0, wasm_splat_f32(0x1.0p+31f)
      • MOVAPS xmm_tmp1, xmm_x
      • CMPNLTPS xmm_tmp1, xmm_tmp0
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp2, xmm_tmp0
      • ANDPS xmm_tmp0, xmm_tmp1
      • SUBPS xmm_y, xmm_tmp0
      • PSLLD xmm_tmp1, 31
      • CMPLEPS xmm_tmp2, xmm_y
      • CVTPS2DQ xmm_y, xmm_y
      • PXOR xmm_tmp1, xmm_y
      • PXOR xmm_y, xmm_y
      • PCMPGTD xmm_y, xmm_x
      • POR xmm_tmp1, xmm_tmp2
      • PANDN xmm_y, xmm_tmp1
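The signed sequence above can be written with SSE2 intrinsics roughly as follows (a sketch, assuming the default round-to-nearest-even MXCSR mode; the function name is mine, not from the PR). The NaN mask zeroes NaN lanes before the conversion, and the XOR at the end flips the 0x80000000 produced by CVTPS2DQ for lanes >= 2^31 into INT32_MAX:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sketch of the signed SSE2 lowering using intrinsics. */
static __m128i nearest_sat_f32x4_s_sse2(__m128 x) {
    const __m128 two_p31 = _mm_set1_ps(0x1.0p+31f);
    __m128 nan_mask = _mm_cmpunord_ps(x, x);      /* ~0 where x is NaN */
    __m128 ovf_mask = _mm_cmple_ps(two_p31, x);   /* ~0 where x >= 2^31 */
    __m128 clean    = _mm_andnot_ps(nan_mask, x); /* NaN lanes -> 0.0f */
    __m128i conv    = _mm_cvtps_epi32(clean);     /* CVTPS2DQ: nearest-even,
                                                     out-of-range -> 0x80000000 */
    /* XOR with all-ones on overflow lanes: 0x80000000 -> 0x7FFFFFFF */
    return _mm_xor_si128(conv, _mm_castps_si128(ovf_mask));
}
```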

ARM64 processors

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) is lowered to FCVTNS Vy.4S, Vx.4S
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) is lowered to FCVTNU Vy.4S, Vx.4S

ARM processors with ARMv8 (32-bit) instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) is lowered to VCVTN.S32.F32 Qy, Qx
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) is lowered to VCVTN.U32.F32 Qy, Qx

ARM processors with ARMv7 (32-bit) instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to
      • VMOV.I32 Qtmp, 0x80000000
      • VMOV.F32 Qy, 0x4B000000
      • VBSL Qtmp, Qx, Qy
      • VADD.F32 Qy, Qx, Qtmp
      • VSUB.F32 Qy, Qy, Qtmp
      • VACLT.F32 Qtmp, Qx, Qtmp
      • VBSL Qtmp, Qy, Qx
      • VCVT.S32.F32 Qy, Qtmp
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to
      • VMOV.I32 Qtmp, 0x4B000000
      • VADD.F32 Qy, Qx, Qtmp
      • VSUB.F32 Qy, Qy, Qtmp
      • VCLT.U32 Qtmp, Qx, Qtmp
      • VBSL Qtmp, Qy, Qx
      • VCVT.U32.F32 Qy, Qtmp
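The VADD/VSUB pairs in both ARMv7 sequences use the classic magic-number trick: adding and then subtracting 8388608.0f (2^23, bit pattern 0x4B000000) pushes the fraction bits through a rounding step, which rounds any input with magnitude below 2^23 to the nearest integer, ties to even. A portable C sketch of the idea (assumes IEEE 754 binary32 and the default rounding mode; `volatile` keeps the compiler from folding the add/sub away):

```c
/* Magic-number round-to-nearest-even for |x| < 2^23, as done by the
   VADD.F32/VSUB.F32 pair above. The signed variant gives the magic
   constant the sign of x (the VBSL with 0x80000000), modeled here
   with a branch. */
static float round_nearest_even(float x) {
    const float magic = 8388608.0f;  /* 2^23, bits 0x4B000000 */
    volatile float t = (x < 0.0f) ? x - magic : x + magic;
    return (x < 0.0f) ? t + magic : t - magic;
}
```

Inputs with magnitude at or above 2^23 are already integral, which is why both sequences select the original value for those lanes (VACLT/VCLT followed by VBSL).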
@Maratyszcza Maratyszcza changed the title [WIP] i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u Jun 7, 2020
@Maratyszcza Maratyszcza force-pushed the Maratyszcza:nearest-sat branch from 1d420d4 to b6e39b9 Jun 8, 2020
@dtig
Member

dtig commented Jun 9, 2020

These instructions are widely useful, and I agree that these are hard to emulate without the operations being explicitly exposed. The suboptimal mapping on Intel hardware has been contentious in the past for the conversion and other operations, but we have included them as there's no good way to emulate them; explicitly asking @arunetm and other Intel folks for opinions here.

@zeux
Contributor

zeux commented Jun 10, 2020

Just noting that in all my high performance kernels that needed f32->i32 conversion, I had to stay away from the "native" Wasm instructions due to the large overhead.

I have three kernels that need this instruction; they run at 2 GB/s, 3.6 GB/s and 2.6 GB/s when using a fast emulation. When using the "native" instruction on latest v8, I get 1.25 GB/s, 2.4 GB/s and 2.1 GB/s - very significant and noticeable penalty, and that's given that the "native" code doesn't perform the rounding that's required and I get for free in the emulated version, so the real perf delta is larger. As usual, these aren't microbenchmarks, and the conversion is merely part of the computational chain.

The instructions proposed here would perhaps help a bit in that at least my emulation can be tested vs a native rounding instruction and I'd expect these to perform similarly to existing variants, but I'm expecting these to be similarly not-useful for performance sensitive code unless it's impossible to implement the algorithm without these.

That's not really an objection to adding these, as these instructions aren't worse than what we already have, merely an observation. In the examples linked I believe the expectation is that the lowering is much more optimal than the one proposed (because of the differences in handling saturation/NaNs).

@zeux
Contributor

zeux commented Jun 10, 2020

(on a less pessimistic note, if we decide to go ahead with these I'd be happy to contribute the kernels above as benchmarks for perf evaluation, we could compare "manual" rounding (adding 0.5 with the proper sign and using truncate), "assisted" rounding (using new fp32 rounding + truncate), proposed direct rounding and the fast emulation)

@arunetm
Collaborator

arunetm commented Jun 10, 2020

I think the current state of the spec makes it too risky to include these. The x86 mappings are concerning for these instructions, which have the largest gap in instruction count (16 and 7). We already have significant asymmetry in the spec in terms of the cost of op implementations on x86. I am afraid that including these significantly increases the risk of hiding perf penalties/regressions on one popular platform vs. others, limiting their usability for developers and moving away from the spec's goals.
The instructions look useful from a convenience standpoint, but the use-case perf benefits and tradeoffs seem unclear. Even if included, the perf cost on x86 may force users to rely on emulations, as @zeux pointed out, negating any benefits. Thanks @zeux for sharing the info.
I suggest not including these in the current spec and reconsidering them post-MVP given their usefulness.

@Maratyszcza
Contributor Author

Maratyszcza commented Jun 10, 2020

@arunetm please note that the x86 lowering differs by only one instruction (CVTTPS2DQ -> CVTPS2DQ) from the lowering of the i32x4.trunc_sat_f32x4_s and i32x4.nearest_sat_f32x4_u instructions already in the spec. The other instructions are needed to handle the difference between the out-of-bounds behavior of x86 conversion instructions (return INT32_MIN) and WAsm conversion instructions (saturate between INT32_MIN and INT32_MAX and convert NaN to 0), and to simulate the floating-point -> unsigned integer conversion missing on pre-AVX512 x86.

Without i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u in the spec, developers would implement them as i32x4.trunc_sat_f32x4_s(f32x4.nearest(x)) and i32x4.trunc_sat_f32x4_u(f32x4.nearest(x)), which is strictly worse on all platforms, but particularly so on pre-SSE4 x86.

@arunetm
Collaborator

arunetm commented Jun 10, 2020

lowering of i32x4.trunc_sat_f32x4_s and i32x4.nearest_sat_f32x4_u instructions already in the spec.

Did you mean i32x4.trunc_sat_f32x4_s & i32x4.trunc_sat_f32x4_u here? Unfortunately, these ops are highly expensive to implement on x86, and we have open issues discussing their tradeoffs (#173). We need to clearly understand their real-world implications regarding perf cliffs before including the trunc instructions. We may be compounding the problem by adding new ops that rely on them. Given that we cannot assume broad availability of AVX-512 instructions, the extra cost of handling out-of-bounds values will make it even worse.

Agree that not including these will force developers to choose workarounds that may not be ideal on certain platforms. IMO, it's a better tradeoff than letting them be vulnerable to hidden perf cliffs on certain common platforms when they rely on SIMD anticipating consistent performance gains. Also, runtimes always have the option of adding platform-specific optimizations in these cases, where developer expectations will not be broken by the spec and only enhanced by implementers.
