
Any chance for other language bindings? #1738

Open
oshaboy opened this issue Sep 11, 2023 · 11 comments

@oshaboy commented Sep 11, 2023

C, Rust, Go, and Ada bindings would be nice to have. I understand that templates are important for the library to work, but a lot of programmers who could benefit from a portable SIMD library simply can't use it.

Is this a planned feature already?

@jan-wassenberg (Member)

Hi @oshaboy , it's an interesting topic. I personally do not know Rust, Go and Ada very well.

What seems very feasible is to have C bindings for larger groups of code, for example VQSort(). As non-inlined functions, those could be called from any language, after compiling as C++. That would be a very easy update.
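A minimal sketch of that pattern, with a hypothetical wrapper name and std::sort standing in for VQSort so the snippet is self-contained:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical non-inlined C binding of the kind described above: compiled
// as C++, callable from any language with a C FFI. std::sort stands in for
// a call into VQSort so the sketch is self-contained.
extern "C" void bound_sort_i64(int64_t* keys, size_t n) {
  std::sort(keys, keys + n);  // a real binding would forward to hwy::VQSort
}
```

A Rust, Go, or Ada caller would declare `bound_sort_i64` through its normal C FFI; no templates cross the boundary.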

What seems harder is to expose individual ops (e.g. LoadDup128) to other libraries: it's very important that they be inlined. One could imagine a Highway-like library for those languages that shares the same op names, with adaptations to the language requirements. I'd be happy to discuss that with anyone interested in working on that.

@oshaboy (Author) commented Sep 12, 2023

Yeah, after taking a look at the header file, I can see it's not something an #ifdef __cplusplus can solve.

As for the other programming languages: Rust and Ada can inline across modules, while with Go you're at the mercy of the compiler. Of course, with all of these you'd have to rewrite so much of the inline code that the result would be a Ship of Theseus of the original library.

@jan-wassenberg (Member)

Yes indeed :)

Ship of Theseus is fine by me. The main value-add of Highway is 1) shielding user code from compiler bugs 2) finding a useful subset of all instruction sets, and filling in the gaps where required.

(1) would have to be repeated/maintained for other languages, but adopting the Highway ops and polyfills (2) would save the authors of new ships/libraries a lot of work.
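As a sketch of what adopting the ops and polyfills means: one op name, implemented natively where the ISA has a matching instruction and emulated where it doesn't. AverageRound is an actual Highway op name; the feature macro and scalar body here are illustrative stand-ins.

```cpp
#include <cstdint>

// Stand-in feature macro; a real library would derive this from the target.
#define HAS_NATIVE_AVG 0

// Rounded average (a + b + 1) / 2; integer promotion avoids overflow.
inline uint8_t AverageRound(uint8_t a, uint8_t b) {
#if HAS_NATIVE_AVG
  return native_avg(a, b);  // e.g. SSE2 _mm_avg_epu8 or NEON vrhadd_u8
#else
  return static_cast<uint8_t>((a + b + 1) >> 1);  // portable polyfill
#endif
}
```

The value of reusing Highway's op set is that this "useful subset plus gap-filling" work has already been done once per op.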

@Asc2011 commented Mar 10, 2024

I have a Nim version of the single-threaded vectorized quicksort from 2022. It's really nice, thanks, and I'll publish it soon, maybe alongside a Python module.
greets, Andreas

@jan-wassenberg (Member)

Cool :) Do you mean a variant/port of vqsort?
We're happy to link to it from our readme if you'd like to send a pull request.

@Asc2011 commented Mar 11, 2024

It's based on the original C++ version from the "Fast and Robust" repo, so it does not yet include all the nice additions (the 8/16/64-bit-wide types) you have made recently. It's also AVX2-only, and since I don't have access to an ARM machine it will stay that way, unless an M3 MacBook Air drops from heaven ;). Realistically, though, others will be able to help out on this issue. My primary concern is to show that Nim can reach 80-100% of C++ performance as soon as we embrace SIMD intrinsics, and thus commit to saving energy :)
As soon as the missing types are sortable, I'll do a Python version.
greets, Andreas

@jan-wassenberg (Member)

Got it. FYI we also substantially changed the pivot sampling step since then, which can make a big difference in perf/robustness.

Cool goal 👍 I'm curious about the Nim plans for intrinsics, is there a writeup/design doc for it?

@Asc2011 commented Mar 13, 2024

> FYI we also substantially changed the pivot sampling step

Thank you, that's good to know. I'll see if I can find it in your implementation.

> I'm curious about the Nim plans for intrinsics, is there a writeup/design doc for it?

There is the nimsimd module, which basically wraps the intrinsics. It's used by some projects. I wish it would enter Nim's standard library, and sooner or later that will happen.
I'm not aware of any plans by the compiler group to integrate more SIMD concepts; at the moment, obviously not. I'm looking at SIMDe ('SIMD everywhere'), which follows goals similar to Highway's. I was not aware of Highway, and in terms of quality, I'd guess it would be a nice fit for any standard library. Nim's data types and algorithms need an upgrade to fit into the 21st century, and I believe intrinsics and atomics are the building blocks to get there.
I've also translated a vectorized LRU cache, the 'multi-step in-vector LRU' by H. Inoue, which you're surely aware of. Aside from VQSort, it's one of the most beautiful SIMD concepts I've seen in years. It proves that even in the oldest areas of CS, where one would not expect any sensations, amazing things can be achieved using in-register ops in combination with fine-grained spin locks.
At the moment the core Nim devs stay away from optimizations and focus on improving compilation times, knowing that C-compiler SIMD optimizations will come anyway. At the same time, there are many Nim users who need to use, or simply prefer, older or non-optimizing compilers, and that's why I believe we should integrate as many SIMD concepts as possible into the data structures and algorithms of the language itself.

regards, Andreas

@jan-wassenberg (Member)

Thanks for the pointer to the vectorized LRU cache, I hadn't seen it yet! (The author's name does ring a bell, I believe they also worked on sorting.)

SIMDe has a different focus: given code written for a particular set of nonportable intrinsics, try to make it run on other platforms. This can be useful for existing codebases, but I'd argue that we want something else for forward-looking languages/new codebases. For example, slightly bending the contract (Reorder) in ReorderWidenMulAccumulate allows us to implement the op efficiently on all platforms, whereas the SIMDe approach would be more expensive because it must faithfully match the platform's quirks.

I agree it would be super useful to get SIMD into the language. For example, in C++ we resort to pragmas to enable codegen for a particular target, but this could be done more elegantly if integrated.
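A sketch of the attribute/pragma mechanism being referred to, using GCC/clang's per-function target attribute plus runtime dispatch (the function names are illustrative; syntax is GCC/clang-specific):

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__)
// The attribute tells the compiler it may emit AVX2 instructions for this
// one function, even if the translation unit is built for baseline x86.
__attribute__((target("avx2")))
static float SumAvx2(const float* v, size_t n) {
  float s = 0.0f;  // the compiler may auto-vectorize this loop with AVX2
  for (size_t i = 0; i < n; ++i) s += v[i];
  return s;
}
#endif

float Sum(const float* v, size_t n) {
#if defined(__x86_64__) || defined(__i386__)
  // Runtime check guards the call so it never faults on older CPUs.
  if (__builtin_cpu_supports("avx2")) return SumAvx2(v, n);
#endif
  float s = 0.0f;  // portable fallback path
  for (size_t i = 0; i < n; ++i) s += v[i];
  return s;
}
```

The pain point in C++ is that this attribute does not compose with templates, which is why Highway needs pragmas instead; a language with SIMD integrated could make the target change part of normal instantiation.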

I would strongly advocate a 'vector length agnostic' model, where users can only ask for <= 128 bit, OR [1/8, 1/1] of the native length, which must be queried at runtime. This allows SVE/RVV support, and also avoids poor codegen from users asking for 1024 (or worse: 384) bit vectors. Would be happy to discuss with anyone who wants to start a proposal for Nim.
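The vector-length-agnostic model sketched in plain scalar C++: the lane count is a runtime query (here a placeholder standing in for something like Highway's Lanes(d) or SVE's svcntw()), and user code is written against that value rather than a hard-coded width:

```cpp
#include <cstddef>

// Placeholder for a runtime vector-length query; in the model described
// above this is NOT a compile-time constant the user may assume.
size_t Lanes() { return 4; }

// Adds 1 to every element, written only against the queried length.
void AddOne(float* data, size_t n) {
  const size_t N = Lanes();
  size_t i = 0;
  for (; i + N <= n; i += N) {
    // Stand-in for one full-vector load/add/store.
    for (size_t j = 0; j < N; ++j) data[i + j] += 1.0f;
  }
  for (; i < n; ++i) data[i] += 1.0f;  // remainder (masked or scalar tail)
}
```

The same source then runs unchanged whether the hardware vector is 128-bit NEON or a wide SVE/RVV implementation.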

@Asc2011 commented Mar 18, 2024

> Thanks for the pointer to the vectorized LRU cache

Once you have rewritten your caches and you get the first energy bill, consider sending some flowers and cake to your colleagues at IBM Research Tokyo :)

> SIMDe has a different focus ...

Yeah, agreed; so it's gonna be Highway.

> I agree it would be super useful to get SIMD into the language.

Well, here I'm afraid of C++'s post-traumatic template disorder :) Thinking of 'getting it into the language', I tried a DSL using basic overloads and Nim templates; it is flexible and works, but the performance penalty is prohibitive. One could use a macro to turn the templates/overloads back into intrinsic calls, or use Nim's powerful macros (they are type-checked, so no PTTD there) from the beginning. Looking at the discussions on SO, I gather that (when things go wrong) compiler transformations need to be considered. What are the compiler makers' plans regarding user code containing intrinsics? Do they regard it as their task to optimize things/algorithms here?
From my perspective as a developer, looking at a community that works with micro-devices, often uses MinGW or TCC, and has a culture of doing things on hardware with restricted resources, I'd rather imagine replacing intrinsics with ASM, and in the process getting rid of the need to do Godbolt analysis, or of any other layer in between that has decided to 'optimize' the code.

> ... 'vector length agnostic' model

That will be (and in the case of the latest ARM SIMD already is) a challenge in itself. Such runtime requirements exclude the compiler makers and demand somewhat new strategies (in the area of supercomputing this has been done, so one could learn from them). But this is not my primary concern, since even AVX-3 won't show up in commodity machines soon. We'll see whether Intel's AVX-10, with variable runtime register widths, will be advertised as a consumer or a server technique; I expect server.
Vectorized QS and LRU achieve huge gains through NEON/AVX2 alone. One should do the math to find out how many coal plants could be turned off if everybody used them.

@jan-wassenberg (Member)

To clarify, the main bit that is required in the language is the "compiler and programmer are allowed to call this intrinsic", i.e. setting the target attributes. The way this is done in clang unfortunately does not compose with templates. It would be nice if we could have something like template specialization, but in C++ what is required is that PLUS pragmas that change the target attributes. The rest can hopefully be a library, though I am not familiar with Nim macros.

> What are the compiler makers' plans regarding user code containing intrinsics? Do they regard it as their task to optimize things/algorithms here?

Clang is quite enthusiastic about optimizing intrinsics. It's definitely not a 1:1 mapping. Often this is actually helpful; I have seen it basically rename memory (actually registers), thus optimizing out a permutation, which is very cool and would be very difficult to do by hand.

> I'd rather imagine replacing intrinsics with ASM

This can be done, but I remember a colleague's comment that what should have been a 20-minute patch took almost a day in assembly.

> in the area of supercomputing it has been done, so one could learn from them

Unfortunately the solutions I have seen usually assume that the HPC cluster is running a certain known CPU, so they tell the compiler to hard-code a certain SVE vector size.

> AVX-3 won't show up in commodity machines soon

Aren't there consumer Zen4 machines?

> Intel's AVX-10 with variable runtime register widths

AVX10 is actually static, in the sense that you generate either 128-bit, 256-bit, or 512-bit instructions. It is just that the CPU advertises which of those would raise faults, i.e. should not be used.
