i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u #247

Open · wants to merge 1 commit into master from Maratyszcza:nearest-sat
Conversation

@Maratyszcza
Contributor

Maratyszcza commented Jun 6, 2020

Introduction

This PR adds a form of floating-point-to-integer conversion that rounds to nearest (ties to even), complementing the existing instructions, which round towards zero. This operation is natively supported in SSE2 and ARMv8 NEON, and can be efficiently simulated with native instructions on ARMv7 NEON.

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86 processors with SSE2 instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp, wasm_splat_f32(0x1.0p+31f)
• CMPUNORDPS xmm_y, xmm_y
      • CMPLEPS xmm_tmp, xmm_x
      • ANDNPS xmm_y, xmm_x
• CVTPS2DQ xmm_y, xmm_y
      • PXOR xmm_y, xmm_tmp
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to
      • MOVAPS xmm_tmp0, wasm_splat_f32(0x1.0p+31f)
      • MOVAPS xmm_tmp1, xmm_x
      • CMPNLTPS xmm_tmp1, xmm_tmp0
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp2, xmm_tmp0
      • ANDPS xmm_tmp0, xmm_tmp1
      • SUBPS xmm_y, xmm_tmp0
      • PSLLD xmm_tmp1, 31
      • CMPLEPS xmm_tmp2, xmm_y
      • CVTPS2DQ xmm_y, xmm_y
      • PXOR xmm_tmp1, xmm_y
      • PXOR xmm_y, xmm_y
      • PCMPGTD xmm_y, xmm_x
      • POR xmm_tmp1, xmm_tmp2
      • PANDN xmm_y, xmm_tmp1
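The signed sequence above can be written with SSE2 intrinsics roughly as follows (a sketch, assuming the default round-to-nearest-even MXCSR mode; the function name is mine, not from the PR). The NaN mask zeroes NaN lanes before the conversion, and the XOR at the end flips the 0x80000000 produced by CVTPS2DQ for lanes >= 2^31 into INT32_MAX:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sketch of the signed SSE2 lowering using intrinsics. */
static __m128i nearest_sat_f32x4_s_sse2(__m128 x) {
    const __m128 two_p31 = _mm_set1_ps(0x1.0p+31f);
    __m128 nan_mask = _mm_cmpunord_ps(x, x);      /* ~0 where x is NaN */
    __m128 ovf_mask = _mm_cmple_ps(two_p31, x);   /* ~0 where x >= 2^31 */
    __m128 clean    = _mm_andnot_ps(nan_mask, x); /* NaN lanes -> 0.0f */
    __m128i conv    = _mm_cvtps_epi32(clean);     /* CVTPS2DQ: nearest-even,
                                                     out-of-range -> 0x80000000 */
    /* XOR with all-ones on overflow lanes: 0x80000000 -> 0x7FFFFFFF */
    return _mm_xor_si128(conv, _mm_castps_si128(ovf_mask));
}
```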

ARM64 processors

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) is lowered to FCVTNS Vy.4S, Vx.4S
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) is lowered to FCVTNU Vy.4S, Vx.4S

ARM processors with ARMv8 (32-bit) instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) is lowered to VCVTN.S32.F32 Qy, Qx
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) is lowered to VCVTN.U32.F32 Qy, Qx

ARM processors with ARMv7 (32-bit) instruction set

  • i32x4.nearest_sat_f32x4_s
    • y = i32x4.nearest_sat_f32x4_s(x) (y is NOT x) is lowered to
      • VMOV.I32 Qtmp, 0x80000000
      • VMOV.F32 Qy, 0x4B000000
      • VBSL Qtmp, Qx, Qy
      • VADD.F32 Qy, Qx, Qtmp
      • VSUB.F32 Qy, Qy, Qtmp
      • VACLT.F32 Qtmp, Qx, Qtmp
      • VBSL Qtmp, Qy, Qx
      • VCVT.S32.F32 Qy, Qtmp
  • i32x4.nearest_sat_f32x4_u
    • y = i32x4.nearest_sat_f32x4_u(x) (y is NOT x) is lowered to
      • VMOV.I32 Qtmp, 0x4B000000
      • VADD.F32 Qy, Qx, Qtmp
      • VSUB.F32 Qy, Qy, Qtmp
      • VCLT.U32 Qtmp, Qx, Qtmp
      • VBSL Qtmp, Qy, Qx
      • VCVT.U32.F32 Qy, Qtmp
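The VADD/VSUB pairs in both ARMv7 sequences use the classic magic-number trick: adding and then subtracting 8388608.0f (2^23, bit pattern 0x4B000000) pushes the fraction bits through a rounding step, which rounds any input with magnitude below 2^23 to the nearest integer, ties to even. A portable C sketch of the idea (assumes IEEE 754 binary32 and the default rounding mode; `volatile` keeps the compiler from folding the add/sub away):

```c
/* Magic-number round-to-nearest-even for |x| < 2^23, as done by the
   VADD.F32/VSUB.F32 pair above. The signed variant gives the magic
   constant the sign of x (the VBSL with 0x80000000), modeled here
   with a branch. */
static float round_nearest_even(float x) {
    const float magic = 8388608.0f;  /* 2^23, bits 0x4B000000 */
    volatile float t = (x < 0.0f) ? x - magic : x + magic;
    return (x < 0.0f) ? t + magic : t - magic;
}
```

Inputs with magnitude at or above 2^23 are already integral, which is why both sequences select the original value for those lanes (VACLT/VCLT followed by VBSL).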
@Maratyszcza Maratyszcza changed the title [WIP] i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u Jun 7, 2020
@Maratyszcza Maratyszcza force-pushed the Maratyszcza:nearest-sat branch from 1d420d4 to b6e39b9 Jun 8, 2020
@dtig
Member

dtig commented Jun 9, 2020

These instructions are widely useful, and I agree that these are hard to emulate without the operations being explicitly exposed. The suboptimal mapping on Intel hardware has been contentious in the past for the conversion and other operations, but we have included them as there's no good way to emulate them; explicitly asking @arunetm and other Intel folks for opinions here.

@zeux
Contributor

zeux commented Jun 10, 2020

Just noting that in all my high performance kernels that needed f32->i32 conversion, I had to stay away from the "native" Wasm instructions due to the large overhead.

I have three kernels that need this instruction; they run at 2 GB/s, 3.6 GB/s and 2.6 GB/s when using a fast emulation. When using the "native" instruction on latest v8, I get 1.25 GB/s, 2.4 GB/s and 2.1 GB/s - very significant and noticeable penalty, and that's given that the "native" code doesn't perform the rounding that's required and I get for free in the emulated version, so the real perf delta is larger. As usual, these aren't microbenchmarks, and the conversion is merely part of the computational chain.

The instructions proposed here would perhaps help a bit in that at least my emulation can be tested vs a native rounding instruction and I'd expect these to perform similarly to existing variants, but I'm expecting these to be similarly not-useful for performance sensitive code unless it's impossible to implement the algorithm without these.

That's not really an objection to adding these, as these instructions aren't worse than what we already have, merely an observation. In the examples linked I believe the expectation is that the lowering is much more optimal than the one proposed (because of the differences in handling saturation/NaNs).

@zeux
Contributor

zeux commented Jun 10, 2020

(on a less pessimistic note, if we decide to go ahead with these I'd be happy to contribute the kernels above as benchmarks for perf evaluation, we could compare "manual" rounding (adding 0.5 with the proper sign and using truncate), "assisted" rounding (using new fp32 rounding + truncate), proposed direct rounding and the fast emulation)

@arunetm
Collaborator

arunetm commented Jun 10, 2020

I think the current state of the spec makes it too risky to include these. The x86 mappings are concerning for these instructions, which have the largest gap in instruction count (16 and 7). We already have significant asymmetry in the spec in terms of the cost of op implementations on x86. I am afraid that including these significantly increases the risk of hiding perf penalties/regressions on one popular platform vs. others, limiting their usability for developers and moving away from the spec's goals.
The instructions look useful from a convenience standpoint, but the use-case perf benefits and tradeoffs seem unclear. Even if included, the perf cost on x86 may force users to rely on emulations, as @zeux pointed out, negating any benefits. Thanks @zeux for sharing the info.
I suggest not including these in the current spec and reconsidering them post-MVP given their usefulness.

@Maratyszcza
Contributor Author

Maratyszcza commented Jun 10, 2020

@arunetm please note that the x86 lowering differs by only one instruction (CVTTPS2DQ -> CVTPS2DQ) from the lowering of the i32x4.trunc_sat_f32x4_s and i32x4.nearest_sat_f32x4_u instructions already in the spec. The other instructions are needed to handle the difference between the out-of-bounds behavior of x86 conversion instructions (return INT32_MIN) and WAsm conversion instructions (saturate between INT32_MIN and INT32_MAX and convert NaN to 0), and to simulate the floating-point -> unsigned integer conversion missing on pre-AVX512 x86.

Without i32x4.nearest_sat_f32x4_s and i32x4.nearest_sat_f32x4_u in the spec, developers would implement them as i32x4.trunc_sat_f32x4_s(f32x4.nearest(x)) and i32x4.trunc_sat_f32x4_u(f32x4.nearest(x)), which is strictly worse on all platforms, but particularly so on pre-SSE4 x86.

@arunetm
Collaborator

arunetm commented Jun 10, 2020

lowering of i32x4.trunc_sat_f32x4_s and i32x4.nearest_sat_f32x4_u instructions already in the spec.

Did you mean i32x4.trunc_sat_f32x4_s & i32x4.trunc_sat_f32x4_u here? Unfortunately, these ops are highly expensive to implement on x86, and we have open issues discussing their tradeoffs (#173). We need to clearly understand their real-world implications regarding perf cliffs before including the trunc instructions. We may be compounding the problem by adding new ops that rely on them. Given that we cannot assume broad availability of AVX-512 instructions, the extra cost of handling out-of-bounds values will make it even worse.

Agree that not including these will force developers to choose workarounds that may not be ideal on certain platforms. IMO, it's a better tradeoff than letting them be vulnerable to hidden perf cliffs on certain common platforms when they rely on SIMD anticipating consistent performance gains. Also, runtimes always have the option of adding platform-specific optimizations in these cases, where developer expectations will not be broken by the spec and only enhanced by implementers.
