
Support for saturating doubling multiply add #2050

Ryo-not-rio opened this issue Apr 3, 2024 · 9 comments
@Ryo-not-rio

There are saturating doubling multiply-add instructions in NEON and SVE - vqdmla, svqdmla - which are equivalent to hn::SaturatedAdd(hn::Mul(2, hn::Mul(a_real, b_imag)), hn::Mul(2, hn::Mul(a_imag, b_real))). In testing, we have seen a significant performance decrease using Highway compared to NEON, and Highway compared to SVE. I was wondering if it would be possible to add a similar op to Highway so we are able to leverage these qdmla instructions.
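For reference, a scalar model of the per-lane semantics being requested (my own sketch for illustration, not Highway or vendor code; the function name is hypothetical) - both the doubled product and the accumulation saturate to the i32 range:

```cpp
#include <cstdint>

// Scalar model of one lane of a saturating doubling multiply-add:
// acc + sat32(2 * a * b), with a saturating accumulate as well.
int32_t SatDoublingMulAdd(int32_t acc, int16_t a, int16_t b) {
  int64_t prod = 2 * static_cast<int64_t>(a) * static_cast<int64_t>(b);
  if (prod > 2147483647LL) prod = 2147483647LL;    // saturate doubled product
  if (prod < -2147483648LL) prod = -2147483648LL;
  int64_t sum = static_cast<int64_t>(acc) + prod;
  if (sum > 2147483647LL) sum = 2147483647LL;      // saturating accumulate
  if (sum < -2147483648LL) sum = -2147483648LL;
  return static_cast<int32_t>(sum);
}
```

Note the only input pair whose doubled product overflows i32 is a = b = -32768, which is exactly why the instruction saturates rather than wraps.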

@jan-wassenberg
Member

It sounds reasonable to want access to this instruction. I'm curious about the purpose of the extra 2x multiply in their definition.

Because NEON differs from SVE in that it takes the upper/lower half, vs even/odd lanes, I think we'd want to define the op like ReorderWidenMulAccumulate, which is allowed to return any lane order. This is fine for dot products but I wonder if it's still OK for your use case, or whether we have some other idea how to bridge the gap between NEON and SVE?
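To make the lane-order difference concrete, here is a scalar sketch (my own illustration, not library code) of how the two ISAs split an 8-lane i16 input when widening into two 4-lane i32 halves:

```cpp
#include <array>
#include <cstdint>

// NEON's *_high variants consume the lower/upper half of the input,
// whereas SVE's b/t variants consume the even/odd lanes. Both schemes
// cover every input lane exactly once, but in different per-half orders.
using Vec8I16 = std::array<int16_t, 8>;
using HalfI32 = std::array<int32_t, 4>;

HalfI32 NeonLower(const Vec8I16& v) { return {v[0], v[1], v[2], v[3]}; }  // vqdmlal
HalfI32 NeonUpper(const Vec8I16& v) { return {v[4], v[5], v[6], v[7]}; }  // vqdmlal_high
HalfI32 SveBottom(const Vec8I16& v) { return {v[0], v[2], v[4], v[6]}; }  // svqdmlalb
HalfI32 SveTop(const Vec8I16& v)    { return {v[1], v[3], v[5], v[7]}; }  // svqdmlalt
```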

@Ryo-not-rio
Author

The use case we've tested is a vector multiplication, so the order needs to be preserved. You raise a good point on the difference between NEON and SVE. We could potentially return two vectors - the lower and upper halves of the result. This sounds like a good compromise, since I'd imagine you'd need the whole vector for most use cases anyway.

@jan-wassenberg
Member

Makes sense. ReorderWidenMulAccumulate also returns a second value using an output param, so there is precedent.
So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?

I'm still curious why the 2x factor is in the instruction - is it also for Q1.x fixed-point (similar to MulFixedPoint15)?

@Ryo-not-rio
Author


> So the proposed op would call both vqdmlal and vqdmlal_high on NEON, and svqdmlalb + svqdmlalt on SVE2?

Yes, that is correct. I'm still trying to find out why there is a 2x factor myself; I'll give an update once I have an answer.

@Ryo-not-rio
Author

So it looks like the doubling is just a side effect of the way the multiplication is done. The multiplication is done on Q0.15 values, and expressing the product in Q0.31 format (with saturation) requires doubling the raw integer product - which is equivalent to a doubling saturating multiply-add of two 16-bit integers.
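A small numeric sketch of that explanation (my own illustration; saturation is omitted, so the one overflowing input pair a = b = -32768 is out of scope): the raw integer product of two Q0.15 values is a Q0.30 number, and the doubling is exactly the left shift that re-expresses it in Q0.31.

```cpp
#include <cstdint>

// Interpret raw integers as Q0.15 / Q0.31 fixed point.
double FromQ15(int16_t x) { return x / 32768.0; }       // divide by 2^15
double FromQ31(int32_t x) { return x / 2147483648.0; }  // divide by 2^31

// Doubling multiply: 2*a*b turns the Q0.30 raw product into Q0.31.
// (The real instruction also saturates, which only matters when
// a == b == -32768.)
int32_t DoublingMulQ15(int16_t a, int16_t b) {
  return static_cast<int32_t>(2 * static_cast<int64_t>(a) * b);
}
```

For example, 0.5 in Q0.15 is 16384; DoublingMulQ15(16384, 16384) gives 536870912, which as Q0.31 is exactly 0.25, matching the real-valued product 0.5 * 0.5.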

@jan-wassenberg
Member

Got it, thanks. In that case adding FixedPoint to the op name may be helpful.

@johnplatts
Contributor

AVX3_DL is also capable of carrying out the saturating doubling multiply-add, using the _mm*_dpwssds_epi32 intrinsics.

Here is how the vqdmla op can be implemented on AVX3_DL for I32 vectors that have 1 to 4 lanes:

template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b,
                    Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;

  // Duplicate each i16 lane so the pairwise dot product below computes
  // b[i]*c[i] + b[i]*c[i] = 2*b[i]*c[i] for each i32 lane.
  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b),
                                          ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c),
                                          ResizeBitCast(dt_i16, c));

  // VPDPWSSDS: i16 pair dot product accumulated into i32 with saturation.
  return VI32{_mm_dpwssds_epi32(a.raw, dup_b.raw, dup_c.raw)};
}

On PPC8/PPC9/PPC10, the vqdmla op can be implemented using vec_msums:

template <class VI32, HWY_IF_V_SIZE_LE_V(VI32, 16)>
HWY_API VI32 Vqdmla(VI32 a, Vec<Rebind<int16_t, DFromV<VI32>>> b,
                    Vec<Rebind<int16_t, DFromV<VI32>>> c) {
  const Rebind<int16_t, DFromV<VI32>> di16;
  const Twice<decltype(di16)> dt_i16;

  // Same lane-duplication trick as above: each i32 lane receives
  // b[i]*c[i] + b[i]*c[i] = 2*b[i]*c[i].
  const auto dup_b = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, b),
                                          ResizeBitCast(dt_i16, b));
  const auto dup_c = InterleaveWholeLower(dt_i16, ResizeBitCast(dt_i16, c),
                                          ResizeBitCast(dt_i16, c));

  // vec_msums: saturating sum of the two i16 products plus the accumulator.
  return VI32{vec_msums(dup_b.raw, dup_c.raw, a.raw)};
}

@jan-wassenberg
Member

Very nice!
@Ryo-not-rio FYI John's solution defines the op in terms of NEON/x86, so for SVE we have two extra Zip. Does that work for you?

@Ryo-not-rio
Author

@jan-wassenberg yes, that is perfect! I've added a comment on the PR requesting support for 8-bit and 16-bit elements as well.
