Support for complex arithmetics #2047

Open
Ryo-not-rio opened this issue Apr 3, 2024 · 9 comments

Comments

@Ryo-not-rio

Hi,

I would like to propose the addition of complex arithmetic instructions to Highway. This would allow us to take advantage of the SVE complex arithmetic instructions (svcadd, svcmla and svcdot), improving the performance of complex arithmetic on Arm. I imagine the difficulty would be the need to implement and maintain equivalent functions for x86 and NEON, where these instructions do not exist natively.

@jan-wassenberg
Member

We are happy to maintain contributed functions. Assuming only SVE supports these instructions natively, it is actually pretty easy to implement a fallback for other platforms because it can be done just once, without repeating for each platform, by putting it in generic_ops-inl.h.

One general principle is that we want the code to be reasonably efficient on all platforms. I wonder whether it would be better, if we did not have the SVE instructions, to organize complex numbers into two regs re and im, rather than in odd/even lanes of one vector?

Let's imagine an app willing to have a special case for SVE, and a second codepath for other platforms. Would this be faster than if we always used the odd/even layout for complex numbers? If so, it sounds like an #if might be a better fit; if not, then a single function with either an SVE or an emulated implementation sounds reasonable.
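To make the two layouts under discussion concrete, here is a minimal scalar sketch of a complex multiply-accumulate in each of them. It is plain C++ rather than Highway code, and the names `MulAccInterleaved`, `MulAccSplit`, `SplitComplex` and `Deinterleave` are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved (odd/even) layout: z[2*i] is the real part of element i,
// z[2*i + 1] its imaginary part. Computes acc += x * y per complex element.
void MulAccInterleaved(const std::vector<float>& x,
                       const std::vector<float>& y,
                       std::vector<float>& acc) {
  assert(x.size() == y.size() && x.size() == acc.size());
  for (std::size_t i = 0; i + 1 < x.size(); i += 2) {
    const float xr = x[i], xi = x[i + 1];
    const float yr = y[i], yi = y[i + 1];
    acc[i] += xr * yr - xi * yi;      // real part
    acc[i + 1] += xr * yi + xi * yr;  // imaginary part
  }
}

// Split layout: real and imaginary parts live in separate arrays, which
// corresponds to keeping them in two separate registers (re and im).
struct SplitComplex {
  std::vector<float> re;
  std::vector<float> im;
};

// Converts the interleaved layout into the split layout.
SplitComplex Deinterleave(const std::vector<float>& z) {
  SplitComplex s;
  for (std::size_t i = 0; i + 1 < z.size(); i += 2) {
    s.re.push_back(z[i]);
    s.im.push_back(z[i + 1]);
  }
  return s;
}

// Same computation in the split layout. Every line is a plain lane-wise
// multiply/add, so no shuffles are needed inside the loop.
void MulAccSplit(const SplitComplex& x, const SplitComplex& y,
                 SplitComplex& acc) {
  assert(x.re.size() == y.re.size() && x.re.size() == acc.re.size());
  for (std::size_t i = 0; i < x.re.size(); ++i) {
    acc.re[i] += x.re[i] * y.re[i] - x.im[i] * y.im[i];
    acc.im[i] += x.re[i] * y.im[i] + x.im[i] * y.re[i];
  }
}
```

The split version pays for de-interleaving once up front, after which the inner loop is free of lane permutations. Whether that wins on a given target is exactly the open question here.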

@Ryo-not-rio
Author

I see your point. We did indeed find that de-interleaving the complex numbers first was faster for Highway on NEON & SVE. I'm not sure about the x86 side of things, though. Even if this is the case, it would be nice to be able to access the SVE instructions from Highway, since they seem to perform significantly better. Either way, it sounds like x86 needs further investigation.

@johnplatts
Contributor

johnplatts commented Apr 3, 2024

> Hi,
>
> I would like to propose the addition of complex arithmetic instructions to Highway. This would allow us to take advantage of the SVE complex arithmetic instructions (svcadd, svcmla and svcdot), improving the performance of complex arithmetic on Arm. I imagine the difficulty would be the need to implement and maintain equivalent functions for x86 and NEON, where these instructions do not exist natively.

F32 AddSub(a, b) is equivalent to SVE svcadd_f32_x(svptrue_f32(), a, Reverse2(d, b), 90) and SSSE3 _mm_addsub_ps(a.raw, b.raw).

The F16/F32/F64 AddSub op should be re-implemented using svcadd on SVE targets as svcadd is more efficient than the default AddSub implementation in generic_ops-inl.h on SVE targets.

F16/F32/F64 MulAddSub(a, b, c) should be re-implemented as MulAdd(a, b, AddSub(Set(DFromV<decltype(c)>(), -0.0), c)) on SVE targets (which allows the MulAddSub to be carried out using a svcadd op followed by a svmad op).
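The identities above can be checked lane by lane with a small scalar model. The sketch below is plain C++, not Highway code. The `*Model` functions are hypothetical stand-ins that assume AddSub subtracts in even lanes and adds in odd lanes, that Reverse2 swaps adjacent lanes, and that svcadd with rotation 90 computes a + i*b on (re, im) pairs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Four float lanes, read as two interleaved complex numbers: re, im, re, im.
using Vec = std::array<float, 4>;

// Stand-in for AddSub: a - b in even lanes, a + b in odd lanes.
Vec AddSubModel(const Vec& a, const Vec& b) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = a[i] - b[i];
    r[i + 1] = a[i + 1] + b[i + 1];
  }
  return r;
}

// Stand-in for Reverse2: swaps each pair of adjacent lanes.
Vec Reverse2Model(const Vec& v) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = v[i + 1];
    r[i + 1] = v[i];
  }
  return r;
}

// Stand-in for svcadd with rotation 90: a + i*b for each complex number,
// i.e. re = a.re - b.im and im = a.im + b.re.
Vec ComplexAddRot90Model(const Vec& a, const Vec& b) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = a[i] - b[i + 1];
    r[i + 1] = a[i + 1] + b[i];
  }
  return r;
}

// Stand-in for MulAdd: a * b + c in every lane.
Vec MulAddModel(const Vec& a, const Vec& b, const Vec& c) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] * b[i] + c[i];
  return r;
}

// Stand-in for MulAddSub: a * b - c in even lanes, a * b + c in odd lanes.
Vec MulAddSubModel(const Vec& a, const Vec& b, const Vec& c) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i += 2) {
    r[i] = a[i] * b[i] - c[i];
    r[i + 1] = a[i + 1] * b[i + 1] + c[i + 1];
  }
  return r;
}

// Checks both identities on one set of sample values.
void CheckAddSubIdentities() {
  const Vec a{1.0f, 2.0f, 3.0f, 4.0f};
  const Vec b{10.0f, 20.0f, 30.0f, 40.0f};
  const Vec c{5.0f, 6.0f, 7.0f, 8.0f};
  const Vec neg_zero{-0.0f, -0.0f, -0.0f, -0.0f};
  // AddSub(a, b) == svcadd(a, Reverse2(b), 90)
  assert(AddSubModel(a, b) == ComplexAddRot90Model(a, Reverse2Model(b)));
  // MulAddSub(a, b, c) == MulAdd(a, b, AddSub(-0.0, c))
  assert(MulAddSubModel(a, b, c) ==
         MulAddModel(a, b, AddSubModel(neg_zero, c)));
}
```

Calling `CheckAddSubIdentities()` from a `main` exercises both equalities. The sample values are small integers so that every intermediate result is exact in float.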

SVE svcmla_f32_x(svptrue_f32(), a, b, c, 0) is equivalent to MulAdd(DupEven(b), c, a).

SVE svcmla_f32_x(svptrue_f32(), a, b, c, 90) is equivalent to NegMulAdd(DupOdd(b), Reverse2(d, AddSub(Set(DFromV<decltype(b)>(), -0.0), c)), a).

SVE svcmla_f32_x(svptrue_f32(), a, b, c, 180) is equivalent to NegMulAdd(DupEven(b), c, a).

SVE svcmla_f32_x(svptrue_f32(), a, b, c, 270) is equivalent to MulAdd(DupOdd(b), Reverse2(d, AddSub(Set(DFromV<decltype(b)>(), -0.0), c)), a).
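For reference when checking equivalences like these, the four rotations can be written as a scalar model. The sketch below is plain C++ with the hypothetical name `ComplexMlaModel`. It assumes the Arm FCMLA definition: the first source's real part (rotations 0 and 180) or imaginary part (rotations 90 and 270) is duplicated and multiplied by the second source rotated by the given angle.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Four float lanes, read as two interleaved complex numbers: re, im, re, im.
using Vec = std::array<float, 4>;

// Scalar model of a complex multiply-accumulate with rotation, where acc is
// the accumulator and rot is one of 0, 90, 180 or 270.
Vec ComplexMlaModel(const Vec& acc, const Vec& b, const Vec& c, int rot) {
  Vec r = acc;
  for (std::size_t i = 0; i < r.size(); i += 2) {
    const float br = b[i], bi = b[i + 1];
    const float cr = c[i], ci = c[i + 1];
    switch (rot) {
      case 0:    // acc += b.re * c
        r[i] += br * cr;
        r[i + 1] += br * ci;
        break;
      case 90:   // acc += b.im * (i * c)
        r[i] -= bi * ci;
        r[i + 1] += bi * cr;
        break;
      case 180:  // acc -= b.re * c
        r[i] -= br * cr;
        r[i + 1] -= br * ci;
        break;
      case 270:  // acc -= b.im * (i * c)
        r[i] += bi * ci;
        r[i + 1] -= bi * cr;
        break;
      default:
        assert(false && "rot must be 0, 90, 180 or 270");
    }
  }
  return r;
}

// A full complex multiply-accumulate acc + b * c is rotation 0 followed by
// rotation 90.
void CheckFullMultiplyAccumulate() {
  const Vec acc{1.0f, 1.0f, 1.0f, 1.0f};
  const Vec b{1.0f, 2.0f, 3.0f, 4.0f};
  const Vec c{5.0f, 6.0f, 7.0f, 8.0f};
  const Vec r = ComplexMlaModel(ComplexMlaModel(acc, b, c, 0), b, c, 90);
  // (1+2i)(5+6i) = -7+16i and (3+4i)(7+8i) = -11+52i, plus acc = 1+1i.
  assert((r == Vec{-6.0f, 17.0f, -10.0f, 53.0f}));
}
```

Any candidate expression built from DupEven, DupOdd, Reverse2 and AddSub can be compared against this model one lane at a time before relying on it.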

@jan-wassenberg
Member

Thanks @johnplatts for pointing out that we can already target svcadd with existing (Mul)AddSub.
@Ryo-not-rio, how close does that get us to what you had in mind?

@johnplatts
Contributor

johnplatts commented Apr 3, 2024

> Thanks @johnplatts for pointing out that we can already target svcadd with existing (Mul)AddSub.

I have re-implemented AddSub and MulAddSub on SVE using svcadd in pull request #2054.

@Ryo-not-rio
Author

It's good to know that svcadd is already being used in Highway!
I think we're still missing a direct link to the svcmla instructions. Even when there are equivalent ways of writing things in Highway, we've seen a performance hit due to the extra instructions required.
For example, svcmla_f32_m(pg, acc0, vec_a, vec_b, 90); requires an extra reverse instruction in Highway.

@jan-wassenberg
Member

hm. It seems that the CMLA instruction is 'exotic' in the sense that other ISAs do not provide such an instruction. Do you have any suggestion on how we could handle that without performance cliffs in one ISA?

@johnplatts
Contributor

> hm. It seems that the CMLA instruction is 'exotic' in the sense that other ISAs do not provide such an instruction. Do you have any suggestion on how we could handle that without performance cliffs in one ISA?

Here is a link to a generic implementation of the ComplexAddRot90/270 ops (equivalent to SVE svcadd_*_x) and ComplexMulAdd[Rot90/180/270] (equivalent to SVE svcmla_*_x): https://godbolt.org/z/1zn949a5f

There are also NEON intrinsics, vcaddq_rot90/270_f16/f32/f64 (equivalent to SVE svcadd_*_x) and vcmlaq[_rot90/180/270]_f16/f32/f64 (equivalent to SVE svcmla_*_x), available with the FCMA extension (the FCADD and FCMLA instructions) on Armv8.3 or later.

The generic implementation of the ComplexAdd/ComplexMulAdd ops linked above is efficient on most SIMD targets, including SSSE3/SSE4/AVX2/AVX3/NEON.

SSSE3/SSE4/AVX2/AVX3 have AddSub instructions for F32/F64 vectors of 32 bytes or smaller, which help improve the performance of the ComplexAdd/ComplexMulAdd ops.

@jan-wassenberg
Member

Thanks, those implementations look good to me! Are we proposing to add those as new ops, with single-instruction implementations for SVE?

That seems fine provided we are confident that apps would want to use those ops as defined. One remaining concern I have (because I am not familiar with complex arithmetic): are there perhaps other equivalent ways of implementing the desired formulas that would be more efficient than these generic implementations when run on non-SVE targets?
