Improving dynamic dispatch for multiple targets for x86-64/AArch64/PPC64 #1782

Open
johnplatts opened this issue Sep 26, 2023 · 9 comments

@johnplatts
Contributor

There are some dynamic dispatch scenarios that require compiling the same C++ source files more than once (but with different C++ flags for some of the compilation phases), such as x86-64 with MSVC if AVX2/AVX3 targets are enabled, AArch64 if SVE/SVE2 targets are enabled, or PPC if PPC8/PPC9/PPC10 targets are enabled.

Here are the compilation phases for multi-phase compilation with MSVC on x86-64:

  • Compilation phase 1
    • compile with '-DHWY_WANT_SSSE3 -DHWY_WANT_SSE4' enabled
    • compile without '-arch:AVX2' or '-arch:AVX512' flags
    • compile with AVX2/AVX3 targets disabled
  • Compilation phase 2 (if AVX2 is enabled)
    • compile with only AVX2 targets enabled
    • compile with '-arch:AVX2' flag
  • Compilation phase 3 (if AVX3 is enabled)
    • compile with only AVX3 targets enabled
    • compile with '-arch:AVX512' flag
  • Compilation phase 4
    • compile with all supported targets enabled (including AVX2/AVX3 targets)
    • compile without '-arch:AVX2' or '-arch:AVX512' flags
    • Dynamic dispatch code is compiled in this phase
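The MSVC phase list above can also be written down as data, which may help when generating build rules from it. This is only a sketch: the `Phase` struct and the `MsvcPhases` function are hypothetical names and not part of Highway, and the flag strings are copied from the list rather than from any build script.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one compilation phase (not a Highway type).
struct Phase {
  std::string defines;     // extra preprocessor defines, empty if none
  std::string arch_flag;   // extra MSVC architecture flag, empty if none
  std::string targets;     // which targets are enabled in this phase
  bool compiles_dispatch;  // true only for the phase with the dispatch code
};

// Returns the MSVC x86-64 phases in the order listed above.
std::vector<Phase> MsvcPhases(bool avx2_enabled, bool avx3_enabled) {
  std::vector<Phase> phases;
  // Phase 1: no -arch flag, AVX2/AVX3 targets disabled.
  phases.push_back({"-DHWY_WANT_SSSE3 -DHWY_WANT_SSE4", "",
                    "all except AVX2/AVX3", false});
  // Phase 2: only if AVX2 is enabled.
  if (avx2_enabled) phases.push_back({"", "-arch:AVX2", "AVX2 only", false});
  // Phase 3: only if AVX3 is enabled.
  if (avx3_enabled) phases.push_back({"", "-arch:AVX512", "AVX3 only", false});
  // Phase 4: no -arch flag, all targets enabled, dispatch code compiled here.
  phases.push_back({"", "", "all supported", true});
  return phases;
}
```

With both AVX2 and AVX3 enabled this yields four entries, and only the last one is marked as compiling the dynamic dispatch code.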

Here are the compilation phases for multi-phase compilation for AArch64 with SVE/SVE2 enabled:

  • Compilation phase 1
    • compile without '-march=armv8-a+sve' or '-march=armv8-a+sve2'
    • compile with SVE/SVE2 targets disabled
  • Compilation phase 2 (if SVE targets are enabled)
    • compile with '-march=armv8-a+sve' option
    • compile with only SVE targets (but not SVE2 targets) enabled
  • Compilation phase 3 (if SVE2 targets are enabled)
    • compile with '-march=armv8-a+sve2' option
    • compile with only SVE2 targets enabled
  • Compilation phase 4
    • compile with all supported targets enabled (including SVE/SVE2 targets)
    • compile without '-march=armv8-a+sve' or '-march=armv8-a+sve2' flags
    • Dynamic dispatch code is compiled in this phase

Here are the compilation phases for multi-phase compilation for PPC64:

  • Compilation phase 1 (if baseline target does not support POWER8 vector instructions)
    • compile with '-mcpu=powerpc64' on big-endian PPC or '-mcpu=powerpc64le' on little-endian PPC
    • compile with only SCALAR or EMU128 target enabled
  • Compilation phase 2 (if PPC8 target is enabled)
    • compile with '-mcpu=power8' option
    • compile with only PPC8 target enabled
  • Compilation phase 3 (if PPC9 target is enabled)
    • compile with '-mcpu=power9' option
    • compile with only PPC9 target enabled
  • Compilation phase 4 (if PPC10 target is enabled)
    • compile with '-mcpu=power10' option
    • compile with only PPC10 target enabled
  • Compilation phase 5
    • compile with all supported targets enabled (including PPC8/PPC9/PPC10)
    • compile with the '-mcpu=' option of the baseline target
    • Dynamic dispatch code is compiled in this phase

There are real-world use cases for dynamic dispatch with multi-phase compilation, including improved performance on PPC9/PPC10/AArch64.

@johnplatts
Contributor Author

Here is an example of Highway dynamic dispatch code updated to support multi-phase compilation (compiled more than once with different compiler options for the different compilation phases):

// Generates code for every target that this compiler can support.
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "example.cpp"  // this file
#include <hwy/foreach_target.h>  // must come before highway.h
#include <hwy/highway.h>

HWY_BEFORE_NAMESPACE();
namespace project {
namespace HWY_NAMESPACE {  // required: unique per target

// Can skip hn:: prefixes if already inside hwy::HWY_NAMESPACE.
namespace hn = hwy::HWY_NAMESPACE;

using T = float;

void MulAddLoop(const T* HWY_RESTRICT mul_array,
                const T* HWY_RESTRICT add_array,
                const size_t size, T* HWY_RESTRICT x_array);

#if HWY_IN_PER_TARGET_PHASE
void MulAddLoop(const T* HWY_RESTRICT mul_array,
                const T* HWY_RESTRICT add_array,
                const size_t size, T* HWY_RESTRICT x_array) {
  const hn::ScalableTag<T> d;
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto mul = hn::Load(d, mul_array + i);
    const auto add = hn::Load(d, add_array + i);
    auto x = hn::Load(d, x_array + i);
    x = hn::MulAdd(mul, x, add);
    hn::Store(x, d, x_array + i);
  }
}
#endif

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

// The table of pointers to the various implementations in HWY_NAMESPACE must
// be compiled only once (foreach_target #includes this file multiple times).
// HWY_ONCE is true for only one of these 'compilation passes'.
#if HWY_ONCE && HWY_IN_DYN_DISPATCH_PHASE

namespace project {

// This macro declares a static array used for dynamic dispatch.
HWY_EXPORT(MulAddLoop);

void CallMulAddLoop(const float* HWY_RESTRICT mul_array,
                    const float* HWY_RESTRICT add_array,
                    const size_t size, float* HWY_RESTRICT x_array) {
  // This must reside outside of HWY_NAMESPACE because it references (calls the
  // appropriate one from) the per-target implementations there.
  // For static dispatch, use HWY_STATIC_DISPATCH.
  return HWY_DYNAMIC_DISPATCH(MulAddLoop)(mul_array, add_array, size, x_array);
}

}  // namespace project
#endif  // HWY_ONCE && HWY_IN_DYN_DISPATCH_PHASE

Here is a link to the above example on Compiler Explorer that shows the above code compiled with different options for HWY_IN_PER_TARGET_PHASE/HWY_IN_DYN_DISPATCH_PHASE: https://gcc.godbolt.org/z/63xTfh1bj

@jan-wassenberg
Member

Nice, I understand we want to compile with differing compile flags. This makes sense for MSVC;
one could argue that clang/gcc supersede MSVC even on Windows, but certainly MSVC is still being used.
Even for clang/gcc we still have the situation that currently it's not possible to generate both SVE2 and SVE code, or RVV and scalar, or NEON vs NEON_WITHOUT_AES.
My understanding is that this has actually been fixed for SVE in clang-16, but my distro doesn't have that package yet.

It seems reasonable to support something like this, at least as a stopgap. But one very important constraint:
can we ensure that old code with the new headers still compiles?
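One way the compatibility constraint could be met is to give both phase macros a default, so that code built in a single pass behaves as before. The sketch below is an assumption, not Highway's implementation: the thread does not say whether or where `HWY_IN_PER_TARGET_PHASE` and `HWY_IN_DYN_DISPATCH_PHASE` are defined, and `Twice`/`CallTwice` are hypothetical stand-ins for a per-target function and its dispatching wrapper.

```cpp
// Assumed defaults: if the build system defines neither macro, both phases
// are active, which corresponds to ordinary single-pass compilation.
#ifndef HWY_IN_PER_TARGET_PHASE
#define HWY_IN_PER_TARGET_PHASE 1
#endif
#ifndef HWY_IN_DYN_DISPATCH_PHASE
#define HWY_IN_DYN_DISPATCH_PHASE 1
#endif

namespace project {

// Always declared, so the dispatch phase can reference it.
int Twice(int x);

#if HWY_IN_PER_TARGET_PHASE
// Stand-in for a per-target implementation.
int Twice(int x) { return 2 * x; }
#endif

#if HWY_IN_DYN_DISPATCH_PHASE
// Stand-in for the wrapper that would use HWY_DYNAMIC_DISPATCH.
int CallTwice(int x) { return Twice(x); }
#endif

}  // namespace project
```

A multi-phase build would then pass, for example, `-DHWY_IN_PER_TARGET_PHASE=1 -DHWY_IN_DYN_DISPATCH_PHASE=0` for the per-target phases and the opposite values for the final phase, while existing builds pass nothing and get both.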

copybara-service bot pushed a commit that referenced this issue Nov 3, 2023
copybara-service bot pushed a commit that referenced this issue Nov 3, 2023
copybara-service bot pushed a commit that referenced this issue Nov 7, 2023
copybara-service bot pushed a commit that referenced this issue Nov 7, 2023
@Pflugshaupt

Pflugshaupt commented Dec 15, 2023

Is it possible to do dynamic dispatch across all targets in one step in Visual Studio when compiling with clang-cl, or does it have the same restrictions as the MSVC compiler when it comes to VEX code, and thus require multiple compilation phases as described above?

@jan-wassenberg
Member

Hi @Pflugshaupt , we differentiate between HWY_COMPILER_MSVC and HWY_COMPILER_CLANGCL. I believe runtime dispatch would work with the latter, independently of whether invoked via Visual Studio or not.

@Pflugshaupt

Thank you for your time and quick answer. It made me keep trying and I was able to find the true problem. I can confirm things work fine in general with Visual Studio driving clang-cl. But there appears to be an issue with templates.

The problems I am seeing come from using templates to stay DRY and to avoid branches inside loops. It appears Visual Studio insists on always creating instantiations for templates, even if they are fully inlined. Often these would be removed during linking, but in this special case they don't even compile.
Combined with the multiple includes by the dynamic dispatch logic and the changing compiler flags, this seems to lead to disaster, as there seems to be a mix-up of namespaces, templates and compiler flags :(. Clearly this was not designed for changing compiler flags inside the same compile unit.

I keep getting "always_inline function 'Load' requires target feature 'ssse3' but would be inlined into function (..) that is compiled without support for ssse3", as soon as I use templates inside the HWY_NAMESPACE inside my own namespace and instantiate these from other functions inside the same namespace. The kind of template I'm using should be 100% inlined. These are just shortcuts for writing less code.

Maybe I'll find some magical compiler trick to get rid of the instantiation, but if not, I'd probably still have to split everything into multiple compile units. And then I might as well not use clang-cl.

@Pflugshaupt

Update: Just got it to work thanks to this: https://stackoverflow.com/questions/71720201/why-does-msvc-compiler-put-template-instantiation-binaries-in-assembly

However, my solution (MSVC 2022 + clang-cl) so far is somewhat inelegant and seems to defy logic. It requires:

  • putting the template definition inside an anonymous namespace
  • declaring the template beforehand, outside the anonymous namespace
  • specifying the namespace of the template explicitly in calls, so that they do not include the anonymous namespace

This seems to get rid of the troublesome instantiations as long as the template is only used in the same compile unit. Hopefully there's a simpler way.

@jan-wassenberg
Member

Hm, the "requires target feature" error usually means we are missing a pragma. It is important for all of your SIMD-using code to be between HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE: these set up a pragma that covers all 'functions' between them. Also, any lambda requires an extra HWY_ATTR before its opening { because lambdas do not count as 'functions'.

Is it possible that this could be an easier solution to the problem?
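The lambda case can be illustrated without Highway. The sketch below is not Highway code: `DEMO_ATTR` is a hypothetical macro that stands in for the role HWY_ATTR plays, and it is only enabled for clang on x86-64 (which includes clang-cl); elsewhere a scalar fallback is used. Removing `DEMO_ATTR` from the lambda on clang should reproduce a "requires target feature 'ssse3'" error of the kind quoted earlier, because `_mm_abs_epi32` is an always_inline SSSE3 intrinsic.

```cpp
#include <cstdint>

#if defined(__clang__) && (defined(__x86_64__) || defined(_M_X64))
#include <tmmintrin.h>  // SSSE3 intrinsics
// Hypothetical stand-in for HWY_ATTR: enables SSSE3 for one function.
#define DEMO_ATTR __attribute__((target("ssse3")))
#define DEMO_HAVE_SSSE3 1
#else
#define DEMO_ATTR
#define DEMO_HAVE_SSSE3 0
#endif

namespace demo {

// Template helper that takes a lambda, similar to a DRY-style wrapper.
template <class Func>
void ForEach4(std::int32_t* values, const Func& func) {
  func(values);
}

// Replaces four values with their absolute values.
// Note: a real program must verify CPU support before running the SSSE3
// path; this sketch assumes an SSSE3-capable CPU.
void Abs4(std::int32_t* values) {
  // The attribute goes after the parameter list, before the opening brace.
  ForEach4(values, [](std::int32_t* p) DEMO_ATTR {
#if DEMO_HAVE_SSSE3
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), _mm_abs_epi32(v));
#else
    for (int i = 0; i < 4; ++i) p[i] = p[i] < 0 ? -p[i] : p[i];
#endif
  });
}

}  // namespace demo
```

Only the lambda needs the attribute here, since it is the function into which the intrinsic is inlined; `ForEach4` and `Abs4` merely call it.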

@Pflugshaupt

Wow - thanks heaps! That was it! I was aware of HWY_BEFORE_NAMESPACE/HWY_AFTER_NAMESPACE, but I was mixing lambdas and templates with lambda arguments to get as DRY as possible. Adding HWY_ATTR to all lambdas has fixed the issues I was seeing on Visual Studio + clang-cl.

Looking at the docs again I see that there's a HWY_ATTR in the Transform1 example on the main readme (which is similar to what I'm doing) and I unfortunately missed that. Hopefully this conversation helps someone else in the future.

Even before this change, things already compiled fine on macOS without HWY_ATTR.

@jan-wassenberg
Member

Nice, glad to hear that was it :)


3 participants