norden.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
Moin! This is the Mastodon instance for northern lights, chatterboxes, and everything in between. Follow the lighthouse.


#simd


New blog series: @folkertdev shows how we use SIMD in the zlib-rs project.

SIMD is crucial to good performance, but learning how to use it can be daunting. In this series we'll show concrete examples of using SIMD in a real-world project.

Part 1 explains how the compiler already uses SIMD for us, how to evaluate whether it's doing a good job, and how to switch to a better-optimized version when the current CPU supports it.

tweedegolf.nl/en/blog/153/simd

@trifectatech

tweedegolf.nl: SIMD in zlib-rs (part 1): Autovectorization and target features - Blog - Tweede golf: "I'm fascinated by the creative use of SIMD instructions. When you first learn about SIMD, it is clear that doing more multiplications in a single instruction is useful for speeding up matrix multi ..."

While implementing complex numbers for #simd I tripped over failures wrt. negative zero. After multiple re-readings of C23 Annex G and considering the meaning of infinite infinities on a 2D plane (with zeros simply being their inverse) I believe #C and #CPlusPlus should ignore the sign of zeros and infinities in their x+iy representations of complex numbers. compiler-explorer.com/z/YavE4M provides some motivation.
Am I missing something?

compiler-explorer.com: Compiler Explorer (C++): `int main() { using C = std::complex<double>; std::cout << C() * -C() << '\n'; std::cout << 0. * -C() << '\n'; }`

Ah, the classic tale of a coder thinking #SIMD would make their code fly 🚀, only to discover it trips over its own feet 👟. Our hero's memory seems as patchy as their #benchmarks, but fear not, the valuable lesson here is clear: #optimization is just a synonym for #headache. 🤦‍♂️
genna.win/blog/convolution-sim #coding #woes #lessons #HackerNews #ngated

genna.win: Performance optimization, and how to do it wrong | Just wing it: "Optimization is hard. And sometimes, the compiler makes it even harder."

Hey friends!
For folks interested in #RISCV, and especially #RVV, here's some information on the #tenstorrent in-house-designed CPU!

At a high level: the vector unit is 2x256-bit, with full RVV 1.0 support as well as a fair few of the optional extensions on top of RVV 1.0!

Phoronix article here: phoronix.com/news/LLVM-20-Tens

LLVM patches here: github.com/llvm/llvm-project/p

One Pager: cdn.sanity.io/files/jpb4ed5r/p

www.phoronix.com: LLVM Merges Support For The Tenstorrent TT-Ascalon-D8 RISC-V CPU

#simd

there's this trick i randomly found a few years ago and i've been wondering if there's a name for it or if other people have done this before

```
for enforcing floating point determinism with realigned buffers

if we have
x x x 0 1 2 3 4 5 6 7 x x x

where x is the identity for my operation, and our operation is commutative (not necessarily associative)

then adding x padding doesn't affect the result as long as we do a tree reduction at the end

e.g.

accumulate in register: v = 0+4 1+5 2+6 3+7

tree reduction step 0: (0+4)+(2+6) (1+5)+(3+7)
tree reduction step 1: ((0+4)+(2+6)) + ((1+5)+(3+7))

if we add padding (e.g., by realigning the buffer and using a masked load)

accumulate in register: v = x+1+5 x+2+6 x+3+7 0+4+x

tree reduction step 0: (1+5)+(3+7) (0+4)+(2+6)
tree reduction step 1: ((1+5)+(3+7)) + ((0+4)+(2+6))

commuting the elements shows us that this is the exact same result as the previous one, so the bit pattern of the final result is unaffected (modulo signed zero, nan, etc)
```

Yesterday, one year ago... (Still wondering how many people actually have read or tried out any of these)

mastodon.thi.ng/@toxi/11134859

Mastodon Glitch Edition: Karsten Schmidt (@toxi@mastodon.thi.ng):

#HowToThing #Epilogue #LongRead: After 66 days of addressing 30 wildly varied use cases and building ~20 new example projects of varying complexity to illustrate how #ThingUmbrella libraries can be used & combined, I'm taking a break to concentrate on other important thi.ngs...

With this overall selection I tried shining a light on common architectural patterns, but also some underexposed, yet interesting niche topics. Since there were many different techniques involved, it's natural not everything resonated with everyone. That's fine! Though, my hope always is that readers take an interest in a wide range of topics, and so many of these new examples were purposefully multi-faceted and hopefully provided insights for at least some parts, plus (in)directly communicated a core essence of the larger project: only individual packages (or small clusters) are designed & optimized for a set of particular use cases. At large, though, thi.ng explicitly does NOT offer any such guidance or even opinion. All I can offer are possibilities, nudges and cross-references, how these constructs & techniques can be (and have been) useful and/or the theory underpinning them.

For some topics, thi.ng libs provide multiple approaches to achieve certain goals. This again is by design (not lack of it!) and stems from hard-learned experience, showing that many (esp. larger) projects highly benefit from more nuanced (sometimes conflicting) approaches compared to popular de facto "catch-all" framework solutions. To avid users (incl. myself) this approach has become a somewhat unique offering and advantage, yet in itself seems to be the hardest and most confusing aspect of the entire project to communicate to newcomers.

So seeing this list of new projects together, to me really is a celebration (and confirmation/testament) of the overall #BottomUpDesign #ThingUmbrella approach (which I've been building on since ~2006): from the wide spectrum/flexibility of use cases, the expressiveness, concision, the data-first approach, the undogmatic mix of complementary paradigms, the separation of concerns, no hidden magic state, only minimal build tooling requirements (a bundler is optional, but recommended for tree shaking, no more) — these are all aspects I think are key to building better (incl. more maintainable & reason-able) software. IMO they are worth embracing & exposing more people to and this is what I've partially attempted to do with this series of posts...

ICYMI, here's a summary of the 10 most recent posts (full list in the https://thi.ng/umbrella readme). Many of those examples have more comments than code...

021: Iterative animated polygon subdivision & heat map viz https://mastodon.thi.ng/@toxi/111221943333023306
022: Quasi-random voronoi lattice generator https://mastodon.thi.ng/@toxi/111244412425832657
023: Tag-based Jaccard similarity ranking using bitfields https://mastodon.thi.ng/@toxi/111256960928934577
024: 2.5D hidden line visualization of DEM files https://mastodon.thi.ng/@toxi/111269505611983570
025: Transforming & plotting 10k data points using SIMD https://mastodon.thi.ng/@toxi/111283262419126958
026: Shader meta-programming to generate 16 animated function plots https://mastodon.thi.ng/@toxi/111295842650216136
027: Flocking sim w/ neighborhood queries to visualize proximity https://mastodon.thi.ng/@toxi/111308439597090930
028: Randomized, space-filling, nested 2D grid layout generator https://mastodon.thi.ng/@toxi/111324566926701431
029: Forth-like DSL & livecoding playground for 2D geometry https://mastodon.thi.ng/@toxi/111335025037332972
030: Procedural text generation via custom DSL & parse grammar https://mastodon.thi.ng/@toxi/111347074558293056
#ThingUmbrella #OpenSource #TypeScript #JavaScript #Tutorial
Replied in thread

@Methylzero I had an idea last year around adding an extension to use the #FP16 FPUs as 10 bit int pipelines to save a cycle on IFMAs and I16ADD over the int16 MAC/add instructions, but they were seen as too niche (even for x86)

There was already precedent on this sort of thing (avx512 IFMA did this for the FP64 pipes)

Idea was saving a cycle (3.5 instead of 4.5) and saving some power (but not dealing with the extra 6 bits of a normal int16)

I feel like I’m missing something with how the overlap of RV extensions is done, namely the implied support of types in #RVV.

Ex: Given a chip that supports rv64gfdv, does that imply that both the scalar and vector units *must* support FP32 and FP64?

Or would an implementation that has FP64 on the scalar units only, but FP32 on both the scalar and the vector units, be valid?

Namely the passage I’m uncertain of is github.com/riscv/riscv-v-spec/

Clang/LLVM friends, trying to understand *why* Clang (18) doesn't see through what seems to me like an obvious optimization.

#compiler_explorer link here, explanation of what I don't understand follows:
godbolt.org/z/j8WqsMjb6

Going through Hacker's Delight and doing some of the dirt-simple exercises, I dumped the assembly for Chapter 1 exercise 2, a "loop that goes from 1 to 0xFFFFFFFF". (changed to not fault in CE)

(continues in next post, but putting hashtags here)

godbolt.org: Compiler Explorer is an interactive online compiler which shows the assembly output of compiled C++, Rust, Go (and many more) code.