Make Lanes constexpr on arm-sve target #1593

Open · fbarbari opened this issue Jul 24, 2023 · 14 comments

@fbarbari

I have this "prologue" at the beginning of a function:

HWY_ATTR bool foo(const double *data, const uint8_t *mask) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;

  static_assert(
      hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag),
      "The data SIMD vector type and the mask SIMD vector type must have the same number of lanes");

  // ...
}

This code compiles fine on x86-256, but on arm-sve it fails with:

error: static assertion expression is not an integral constant expression
          hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag)

I think it would be useful to have Lanes be a constexpr function so that one can express constraints through static_asserts.

I am using clang 16.0.6 on an ARM A64FX.

@johnplatts
Contributor

Lanes(double_vector_tag) returns the actual number of lanes in Vec<decltype(double_vector_tag)>, and the result of Lanes(double_vector_tag) can differ from MaxLanes(double_vector_tag) on targets that use scalable vectors such as SVE or RVV.

Here is how the above code should be corrected:

HWY_ATTR bool foo(const double *data, const uint8_t *mask) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;

  static_assert(
      hn::MaxLanes(double_vector_tag) == hn::MaxLanes(mask_vector_tag),
      "The data SIMD vector type and the mask SIMD vector type must have the same number of lanes");

  // ...
}

hn::MaxLanes(double_vector_tag) will return the same result as hn::Lanes(double_vector_tag) on targets that use fixed-size vectors, such as x86/NEON/PPC/WASM/EMU128/SCALAR.

hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag) should be true even on the HWY_SVE and HWY_RVV targets: double_vector_tag.Pow2() is 0, mask_vector_tag.Pow2() is -3, and the hn::Lanes implementation on SVE/RVV takes the tag's Pow2() into account.

hn::Lanes(mask_vector_tag) will return the same result as hn::Lanes(hn::ScalableTag<uint8_t>()) / 8 on all targets except SCALAR, including SVE and RVV.
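
For illustration, here is a minimal sketch (PrintLaneCounts is just an illustrative name) that makes these relationships observable under static dispatch:

#include <stdio.h>
#include "hwy/highway.h"

HWY_ATTR void PrintLaneCounts() {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;                         // Pow2() == 0
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;  // Pow2() == -3
  // Rebind keeps the lane count, so the first two values always match;
  // on non-SCALAR targets the third equals them as well.
  printf("Lanes(double)=%zu Lanes(mask)=%zu full-u8-vector/8=%zu\n",
         hn::Lanes(double_vector_tag), hn::Lanes(mask_vector_tag),
         hn::Lanes(hn::ScalableTag<uint8_t>()) / 8);
}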

@jan-wassenberg
Member

Thanks @johnplatts, I agree with your comments.
@fbarbari, it is true that Lanes() is constexpr on x86 (and also NEON), but as John notes, this should not be relied upon. A further alternative could be to change the static_assert to a debug-build-only runtime assert such as HWY_DASSERT.
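
A minimal sketch of that alternative, replacing the static_assert from the original snippet (HWY_DASSERT compiles away outside debug builds):

HWY_ATTR bool foo(const double *data, const uint8_t *mask) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> double_vector_tag;
  const hn::Rebind<uint8_t, decltype(double_vector_tag)> mask_vector_tag;

  // Runtime check instead of a compile-time one; this also works on
  // scalable targets (SVE/RVV) where Lanes is not constexpr.
  HWY_DASSERT(hn::Lanes(double_vector_tag) == hn::Lanes(mask_vector_tag));

  // ...
}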

Finally, we could consider adding a HWY_SVE_512 target which corresponds to the A64FX. Then, Lanes could be constexpr. We should have good evidence that this is worthwhile in general, though: it costs compile time and binary size for everyone (except perhaps if it is opt-in).

@fbarbari
Author

Thank you both for the quick responses. In the end I used the MaxLanes function in the static_assert.
Regarding HWY_SVE_512, I didn't want to be the cause of a breaking change in the highway API :)
Isn't it already opt-in when using static dispatch (in the sense that you get only what can run on the CPU you are compiling on)?

@jan-wassenberg
Member

Sounds good :)

Regarding HWY_SVE_512, I didn't want to be the cause of a breaking change in the highway API :)

No worries, this would not be a breaking change. A new target means we recompile the code one more time for that target, with corresponding binary size increase (not major).

you get only what can run on the CPU you are compiling on

It's a bit more subtle. Static dispatch only uses the best target enabled via compiler flags. That may or may not be -march=native. And the fixed-size targets such as HWY_SVE_256 require a bit more still (signaling that we know the vector length via __ARM_FEATURE_SVE_BITS).
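
For illustration, a sketch of that signaling using the ACLE macros (the actual Highway target-selection logic is more involved):

// Compiled with e.g. -msve-vector-bits=256, the vector length is known at
// compile time and a fixed-size target such as HWY_SVE_256 becomes possible.
#if defined(__ARM_FEATURE_SVE_BITS) && (__ARM_FEATURE_SVE_BITS == 256)
// Vector length fixed at 256 bits.
#elif defined(__ARM_FEATURE_SVE)
// SVE is available, but the vector length is only known at runtime.
#endif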

But before we get into that, the question remains: would you see a significant benefit from a target that knows/hard-codes the vector size? You can see examples of the optimizations we can do at occurrences of HWY_SVE_256 in the code.

@fbarbari
Author

would you see a significant benefit from a target that knows/hard-codes the vector size?

This is a hard question, because I don't have the data to back up my answer right now. I will continue working with what highway offers for now, and come back to you once I have a reasonable example.

For this reason, I will leave this issue open, if that's OK with you.

@jan-wassenberg
Member

Sure, makes sense. BTW I'm curious what kind of project you are working on?

@fbarbari
Author

I'm working on an HPC computational chemistry application within the EUPEX project.

@jan-wassenberg
Member

Cool, thanks for sharing. We're happy to support your work, please do not hesitate to ask questions or raise issues.

@fbarbari
Author

fbarbari commented Oct 5, 2023

Hello again, sorry for the late reply.
I am posting this here instead of opening a new issue because I think it may be related.

I tried this simple dot product code on a Fujitsu A64FX and noticed that, for SIZE <= 8, the results differ. After some investigation, I found out that the reason is that MaxLanes is 32 while an A64FX supports up to 8 double lanes.
I could fix my code by replacing hn::ScalableTag<double> with hn::FixedTag<double, 8> guarded by ARM64-specific macros (see the sketch below), but, as far as I understood, this would not be the recommended approach.
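
A sketch of that workaround (the guard shown here is one possible choice of ARM64-specific macros, and DataTag is an illustrative name):

namespace hn = hwy::HWY_NAMESPACE;

// Pin the tag to 8 doubles (512 bits) when targeting SVE on ARM64,
// assuming A64FX-style 512-bit vectors; otherwise stay scalable.
#if HWY_ARCH_ARM_A64 && (HWY_TARGET == HWY_SVE)
using DataTag = hn::FixedTag<double, 8>;
#else
using DataTag = hn::ScalableTag<double>;
#endif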

Would this problem be solved by a new HWY_SVE_512 target?

@jan-wassenberg
Member

No worries :)
First, you may be interested in an existing implementation of dot product in hwy/contrib/dot, which includes unrolling.

I think the issue here is using MaxLanes. That can be ok if we want an upper bound, but the actual number of lanes is Lanes(d). This can be non-constexpr, but in this context that's fine. Usually we have a local variable size_t N = Lanes(d), and the loop is over [0, N). Would that work for you?
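
For concreteness, a sketch of that pattern (DotSketch and the scalar remainder loop are illustrative, not taken from the dot library):

HWY_ATTR double DotSketch(const double* a, const double* b, size_t count) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> d;
  const size_t N = hn::Lanes(d);  // runtime value on SVE/RVV; fine here
  auto sum = hn::Zero(d);
  size_t i = 0;
  for (; i + N <= count; i += N) {  // whole vectors
    sum = hn::MulAdd(hn::LoadU(d, a + i), hn::LoadU(d, b + i), sum);
  }
  double result = hn::GetLane(hn::SumOfLanes(d, sum));
  for (; i < count; ++i) result += a[i] * b[i];  // scalar remainder
  return result;
}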

If you really require a constexpr N, then yes, we would have to add an SVE_512 target; that is doable. We can also consider whether Fujitsu's upcoming Monaka chip supports SVE2, though it is still quite far in the future.

@fbarbari
Author

fbarbari commented Oct 6, 2023

you may be interested in an existing implementation of dot product in hwy/contrib/dot, which includes unrolling.

Thank you very much, I never checked the contrib sections.

That can be ok if we want an upper bound, but the actual number of lanes is Lanes(d).

Sorry, my bad. From a previous comment by @jan-wassenberg, I understood that

... Lanes() is constexpr on x86 (and also NEON) but this should not be relied upon ...

therefore I stopped using Lanes and replaced it with MaxLanes in my code. I thought that MaxLanes returned the maximum number of available/usable lanes on the current variable-width-vector-capable CPU, while it actually returns the maximum supported by the ISA.

Usually we have a local variable size_t N = Lanes(d), and the loop is over [0, N). Would that work for you?

Yes, of course that would work, but the best solution would be to have the SVE_512 target available, allowing us to prepare, at compile time, many constants and masks that depend on the number of lanes.

@jan-wassenberg
Member

You're welcome :) Please let us know if the dot library can be improved for your use case.

I thought that MaxLanes returned the maximum number of available/usable lanes on the current variable-width-vector-capable CPU, while it actually returns the maximum supported by the ISA.

That's right :)

I understand you want to pre-bake masks. As input to the decision of whether to create an SVE_512 target now, would you be able to gather some evidence of the speedup versus runtime init, for example by running on Graviton3 (SVE_256)?

@fbarbari
Author

Hello again, sorry for the delay.

would you be able to gather some evidence about the speedup vs runtime init for example by running on Graviton3 (SVE_256)?

Sure. I have access to an AWS Graviton3 instance which I can use for this. What would you suggest as a benchmark for the initialization?

@jan-wassenberg
Member

Sounds good :) I figured we might use your existing(?) benchmark to measure two versions of code:

  • prebaked masks, assuming a fixed/known vector size
  • VL-independent computation of the masks at runtime, right before they are used.

It is quite possible that mask init might be able to 'hide' behind other latencies, or fill unused pipeline slots.
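
For the second version, the runtime mask init might look like this sketch (names are illustrative; FirstN computes the mask right before it is used):

HWY_ATTR void MaskedTailSketch(const double* in, size_t count, double* out) {
  namespace hn = hwy::HWY_NAMESPACE;
  const hn::ScalableTag<double> d;
  const size_t N = hn::Lanes(d);
  const size_t done = count - (count % N);
  // VL-independent mask computed at runtime; on a fixed-size target this
  // could instead be a prebaked compile-time constant.
  const auto mask = hn::FirstN(d, count - done);
  const auto v = hn::MaskedLoad(mask, d, in + done);
  hn::BlendedStore(v, mask, d, out + done);
}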
