Make Lanes constexpr on arm-sve target #1593

Open · fbarbari opened this issue Jul 24, 2023 · 14 comments

@fbarbari

I have this "prologue" at the beginning of a function:

HWY_ATTR bool foo(const double *data, const uint8_t *mask) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;

  static_assert(
      hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag),
      "The data SIMD vector type and the mask SIMD vector type must have the same number of lanes");

  // ...
}

This code compiles fine on x86-256, but on arm-sve it fails with:

error: static assertion expression is not an integral constant expression
          hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag)

I think it would be useful to have Lanes be a constexpr function so that one can express constraints through static_asserts.

I am using clang 16.0.6 on an ARM A64FX.

@johnplatts
Contributor

Lanes(double_vector_tag) returns the actual number of lanes in Vec<decltype(double_vector_tag)>, and the result of Lanes(double_vector_tag) can differ from MaxLanes(double_vector_tag) on targets that use scalable vectors such as SVE or RVV.

Here is how the above code should be corrected:

HWY_ATTR bool foo(const double *data, const uint8_t *mask) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;

  static_assert(
      hn::MaxLanes(double_vector_tag) == hn::MaxLanes(mask_vector_tag),
      "The data SIMD vector type and the mask SIMD vector type must have the same number of lanes");

  // ...
}

hn::MaxLanes(double_vector_tag) will return the same result as hn::Lanes(double_vector_tag) on targets that use fixed-size vectors, such as x86/NEON/PPC/WASM/EMU128/SCALAR.

hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag) should be true even on the HWY_SVE and HWY_RVV targets: double_vector_tag.Pow2() is 0, mask_vector_tag.Pow2() is -3, and the hn::Lanes implementation on SVE/RVV takes the tag's Pow2() into account.

hn::Lanes(mask_vector_tag) will return the same result as hn::Lanes(hn::ScalableTag<uint8_t>()) / 8 on all targets except SCALAR, including SVE and RVV.
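
For illustration, here is a minimal sketch (PrintLaneCounts is just an illustrative name) that makes these relationships observable under static dispatch:

#include <stdio.h>
#include "hwy/highway.h"

HWY_ATTR void PrintLaneCounts() {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;                         // Pow2() == 0
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;  // Pow2() == -3
  // Rebind keeps the lane count, so the first two values always match;
  // on non-SCALAR targets the third equals them as well.
  printf("Lanes(double)=%zu Lanes(mask)=%zu full-u8-vector/8=%zu\n",
         hn::Lanes(double_vector_tag), hn::Lanes(mask_vector_tag),
         hn::Lanes(hn::ScalableTag<uint8_t>()) / 8);
}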

@jan-wassenberg
Member

Thanks @johnplatts, I agree with your comments.
@fbarbari, it is true that Lanes() is constexpr on x86 (and also NEON), but as John notes, this should not be relied upon. A further alternative could be to change the static_assert to a debug-build-only runtime assert such as HWY_DASSERT.
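
A minimal sketch of that alternative, replacing the static_assert from the original snippet (HWY_DASSERT compiles away outside debug builds):

HWY_ATTR bool foo(const double *data, const uint8_t *mask) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;

  // Runtime check instead of a compile-time one; this also works on
  // scalable targets (SVE/RVV) where Lanes is not constexpr.
  HWY_DASSERT(hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag));

  // ...
}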

Finally, we could consider adding a HWY_SVE_512 target which corresponds to the A64FX. Then, Lanes could be constexpr. We should have good evidence that this is worthwhile in general, though: it costs compile time and binary size for everyone (except perhaps if it is opt-in).

@fbarbari
Author

Thank you both for the quick responses. In the end I used the MaxLanes function in the static_assert.
Regarding HWY_SVE_512, I didn't want to be the cause of a breaking change in the highway API :)
Isn't it already opt-in when using static dispatch (in the sense that you get only what can run on the CPU you are compiling on)?

@jan-wassenberg
Member

Sounds good :)

Regarding HWY_SVE_512, I didn't want to be the cause of a breaking change in the highway API :)

No worries, this would not be a breaking change. A new target means we recompile the code one more time for that target, with corresponding binary size increase (not major).

you get only what can run on the CPU you are compiling on

It's a bit more subtle. Static dispatch only uses the best target enabled via compiler flags. That may or may not be -march=native. And the fixed-size targets such as HWY_SVE_256 require a bit more still (signaling that we know the vector length via __ARM_FEATURE_SVE_BITS).
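
For illustration, a sketch of that signaling using the ACLE macros (the actual Highway target-selection logic is more involved):

// Compiled with e.g. -msve-vector-bits=256, the vector length is known at
// compile time and a fixed-size target such as HWY_SVE_256 becomes possible.
#if defined(__ARM_FEATURE_SVE_BITS) && (__ARM_FEATURE_SVE_BITS == 256)
// Vector length fixed at 256 bits.
#elif defined(__ARM_FEATURE_SVE)
// SVE is available, but the vector length is only known at runtime.
#endif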

But before we get into that, the question remains: would you see a significant benefit from a target that knows/hard-codes the vector size? You can see examples of the optimizations we can do at occurrences of HWY_SVE_256 in the code.

@fbarbari
Author

would you see a significant benefit from a target that knows/hard-codes the vector size?

This is a hard question, because I don't have the data to back up my answer right now. I will continue working with what highway offers for now, and come back to you once I have a reasonable example.

For this reason, I will leave this issue open, if that's OK with you.

@jan-wassenberg
Member

Sure, makes sense. BTW I'm curious what kind of project you are working on?

@fbarbari
Author

I'm working on an HPC computational chemistry application within the EUPEX project.

@jan-wassenberg
Member

Cool, thanks for sharing. We're happy to support your work, please do not hesitate to ask questions or raise issues.

@fbarbari
Author

fbarbari commented Oct 5, 2023

Hello again, sorry for the late reply.
I am posting this here instead of opening a new issue because I think it may be related.

I tried this simple dot product code on a Fujitsu A64FX and noticed that, for SIZE <= 8, the results differ. After some investigation, I found out that the reason is that MaxLanes is 32 while an A64FX supports up to 8 double lanes.
I could fix my code by replacing hn::ScalableTag<double> with hn::FixedTag<double, 8> guarded by ARM64-specific macros (see the sketch below), but, as far as I understood, this would not be the recommended approach.
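
A sketch of that workaround (the guard shown here is one possible choice of ARM64-specific macros, and DataTag is an illustrative name):

namespace hn = hwy::HWY_NAMESPACE;

// Pin the tag to 8 doubles (512 bits) when targeting SVE on ARM64,
// assuming A64FX-style 512-bit vectors; otherwise stay scalable.
#if HWY_ARCH_ARM_A64 && (HWY_TARGET == HWY_SVE)
using DataTag = hn::FixedTag<double, 8>;
#else
using DataTag = hn::ScalableTag<double>;
#endif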

Would this problem be solved by a new HWY_SVE_512 target?

@jan-wassenberg
Member

No worries :)
First, you may be interested in an existing implementation of dot product in hwy/contrib/dot, which includes unrolling.

I think the issue here is using MaxLanes. That can be ok if we want an upper bound, but the actual number of lanes is Lanes(d). This can be non-constexpr, but in this context that's fine. Usually we have a local variable size_t N = Lanes(d), and the loop is over [0, N). Would that work for you?
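
For concreteness, a sketch of that pattern (DotSketch and the scalar remainder loop are illustrative, not taken from the dot library):

HWY_ATTR double DotSketch(const double* a, const double* b, size_t count) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> d;
  const size_t N = hn::Lanes(d);  // runtime value on SVE/RVV; fine here
  auto sum = hn::Zero(d);
  size_t i = 0;
  for (; i + N <= count; i += N) {  // whole vectors
    sum = hn::MulAdd(hn::LoadU(d, a + i), hn::LoadU(d, b + i), sum);
  }
  double result = hn::GetLane(hn::SumOfLanes(d, sum));
  for (; i < count; ++i) result += a[i] * b[i];  // scalar remainder
  return result;
}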

If you really require a constexpr N, then yes, we would have to add an SVE_512 target; that is doable. We can also consider whether Fujitsu's upcoming Monaka chip supports SVE2, though it is still quite far in the future.

@fbarbari
Author

fbarbari commented Oct 6, 2023

you may be interested in an existing implementation of dot product in hwy/contrib/dot, which includes unrolling.

Thank you very much, I never checked the contrib sections.

That can be ok if we want an upper bound, but the actual number of lanes is Lanes(d).

Sorry, my bad. From a previous comment by @jan-wassenberg, I understood that

... Lanes() is constexpr on x86 (and also NEON) but this should not be relied upon ...

therefore I stopped using Lanes and replaced it with MaxLanes in my code. I thought that MaxLanes returned the maximum number of available/usable lanes on the current variable-width-vector-capable CPU, while it actually returns the maximum supported by the ISA.

Usually we have a local variable size_t N = Lanes(d), and the loop is over [0, N). Would that work for you?

Yes, of course that would work, but the best solution would be to have the SVE_512 target available, allowing us to prepare, at compile time, many constants and masks that depend on the number of lanes.

@jan-wassenberg
Member

You're welcome :) Please let us know if the dot library can be improved for your use case.

I thought that MaxLanes returned the maximum number of available/usable lanes on the current variable-width-vector-capable CPU, while it actually returns the maximum supported by the ISA.

That's right :)

I understand you want to pre-bake masks. As input to the decision of whether to create an SVE_512 target now, would you be able to gather some evidence of the speedup versus runtime init, for example by running on Graviton3 (SVE_256)?

@fbarbari
Author

Hello again, sorry for the delay.

would you be able to gather some evidence about the speedup vs runtime init for example by running on Graviton3 (SVE_256)?

Sure. I have access to an AWS Graviton3 instance which I can use for this. What would you suggest as a benchmark for the initialization?

@jan-wassenberg
Member

Sounds good :) I figured we might use your existing(?) benchmark to measure two versions of code:

  • prebaked masks, assuming a fixed/known vector size
  • VL-independent computation of the masks at runtime, right before they are used.

It is quite possible that mask init might be able to 'hide' behind other latencies, or fill unused pipeline slots.
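
For the second version, the runtime mask init might look like this sketch (names are illustrative; FirstN computes the mask right before it is used):

HWY_ATTR void MaskedTailSketch(const double* in, size_t count, double* out) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> d;
  const size_t N = hn::Lanes(d);
  const size_t done = count - (count % N);
  // VL-independent mask computed at runtime; on a fixed-size target this
  // could instead be a prebaked compile-time constant.
  const auto mask = hn::FirstN(d, count - done);
  const auto v = hn::MaskedLoad(mask, d, in + done);
  hn::BlendedStore(v, mask, d, out + done);
}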
