#multithreading

#FluidX3D #CFD v3.2 is out! I've implemented the much-requested #GPU summation for object force/torque; it's ~20x faster than #CPU #multithreading. 🖖😋
Horizontal sum in #OpenCL was a nice exercise: first a local-memory reduction, then a hardware-supported atomic floating-point add in VRAM, all in a single-stage kernel. Hammering atomics isn't too bad, as each of the ~10-340 workgroups dispatched at a time does only a single atomic add.
Also improved volumetric #raytracing!
github.com/ProjectPhysX/FluidX
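
For readers curious what that looks like in practice, here's a minimal sketch of the same general pattern (not FluidX3D's actual code) using pyopencl: each workgroup reduces its values in local memory, then one work-item per group atomically adds the partial sum to a global accumulator. Core OpenCL C lacks a float atomic add, so this sketch emulates it with a compare-exchange loop; the hardware-supported add mentioned above would replace that loop where the float-atomics extension is available.

import numpy as np
import pyopencl as cl  # assumed available; pip install pyopencl

KERNEL = r"""
// Portable atomic float add: compare-exchange on the bit pattern.
inline void atomic_add_f(volatile __global float* addr, const float val) {
    union { unsigned int u; float f; } old_v, new_v;
    do {
        old_v.f = *addr;
        new_v.f = old_v.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int*)addr,
                            old_v.u, new_v.u) != old_v.u);
}

__kernel void sum_values(__global const float* x, const unsigned int n,
                         __global float* result, __local float* scratch) {
    const unsigned int gid = get_global_id(0);
    const unsigned int lid = get_local_id(0);
    scratch[lid] = (gid < n) ? x[gid] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    // Tree reduction in local memory.
    for (unsigned int s = get_local_size(0) / 2u; s > 0u; s >>= 1u) {
        if (lid < s) scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Only one atomic add per workgroup, so contention stays low.
    if (lid == 0u) atomic_add_f(result, scratch[0]);
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, KERNEL).build()

n, wg = 1 << 20, 256
x = np.random.rand(n).astype(np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
out = np.zeros(1, dtype=np.float32)
out_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=out)

gsize = (n + wg - 1) // wg * wg  # round global size up to a multiple of wg
prog.sum_values(queue, (gsize,), (wg,),
                x_buf, np.uint32(n), out_buf, cl.LocalMemory(4 * wg))
cl.enqueue_copy(queue, out, out_buf)
print(out[0], x.sum())  # should agree up to float32 rounding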

Remember when I mentioned we had ported our #fire propagation #cellularAutomaton from #Python to #Julia, gaining performance and the ability to parallelize more easily and efficiently?

A couple of days ago we had to run another big batch of simulations, and while things progressed well at the beginning, we saw the parallel threads apparently hanging one by one until the whole process sat there doing who knows what.

Our initial suspicion was that we had hit some weird #JuliaLang issue with #multithreading, which seemed to be confirmed by some posts we found on the Julia forums. We tried the workarounds suggested there, to no avail. We tried a different number of threads, which only made the hang occur at a different completion percentage. We tried restarting the simulations, skipping the ones already done. It always got stuck at the same place (for the same number of threads).

So, what was the problem?

1/n

Multithreaded CLI developers: let your users configure the number of threads.

Entire classes of use cases are hiding behind that option, ones that will make your life easier as a dev -- and threads=1 is usually not hard to add.

One example: if your multithreaded tool processes a single file significantly faster when I force it to use a single thread and parallelize it externally with parallel --pipepart --block instead, then either:

  1. you might decide to implement sharding of the physical file's I/O yourself, or

  2. you might consciously decide to not develop it, and leave that complexity to parallel (which is fine!)

But if your tool has no threads=N option, I have no workaround.

A configurable thread count lets me optimize in the meantime (or instead) -- a minimal sketch of such a flag follows.
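
Here's what that flag can look like, sketched in Python with argparse and ThreadPoolExecutor (the tool name and the chunk-counting worker are purely illustrative, not from any real tool):

import argparse
import os
import sys
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk: bytes) -> int:
    # Stand-in for the tool's real per-chunk work.
    return chunk.count(b"\n")

def main() -> None:
    parser = argparse.ArgumentParser(description="toy multithreaded filter")
    parser.add_argument("--threads", type=int, default=os.cpu_count() or 1,
                        help="worker threads; 1 disables internal parallelism")
    args = parser.parse_args()

    # Read stdin in 1 MiB chunks until EOF.
    chunks = iter(lambda: sys.stdin.buffer.read(1 << 20), b"")
    if args.threads == 1:
        total = sum(process_chunk(c) for c in chunks)  # no pool overhead at all
    else:
        with ThreadPoolExecutor(max_workers=args.threads) as pool:
            total = sum(pool.map(process_chunk, chunks))
    print(total)

if __name__ == "__main__":
    main()

With threads=1 exposed, the external-parallelism route becomes something like parallel --pipepart -a bigfile --block 64M "mytool --threads 1", where mytool is the hypothetical script above.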

Have you ever programmed a human computer? Watching 30 people walk around the room exchanging information between RAM addresses and CPU registers, while human CPUs execute operations on the clock, is a very special experience*.

This week I learned more than in a ~year of self-study, thanks to the 16th Advanced Scientific Programming in #Python aspp.school
We covered version control, packaging, testing, debugging, computer architecture, some #numpy and #pandas fu, programming patterns (aka what goes into a class and what doesn't), big-O to understand how various operations scale and how to find the fastest one for a given data type and size, and an intro to #multithreading and #multiprocessing 🍭
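
In case anyone wants a taste of that big-O point, here's a toy Python comparison (my own example, not course material): the same membership test is O(n) on a list but O(1) on a set, so the right container depends on the data size.

import timeit

n = 1_000_000
data_list = list(range(n))
data_set = set(data_list)

for container in (data_list, data_set):
    # Time 100 lookups of the worst-case element (last in the list).
    t = timeit.timeit(lambda: n - 1 in container, number=100)
    print(f"{type(container).__name__:>4}: {t * 10:.3f} ms per lookup")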

A personal highlight for me was pair programming. I never thought writing code with a buddy would be so much fun, but I learned a lot from my buddies and now I don't want to go back to writing code alone 😅

Very indebted to the teachers and organizers (aspp.school/wiki/faculty); if you ever meet one of these people, please buy them a drink for what they have done for the code-karma state of the universe.

*our human computer didn't manage to execute the simplest sorting algorithm and the CPUs started to sweat; we experienced what happens when the code is ambiguous and imprecise 😱🫨

Leslie Lamport, of LaTeX fame, is a very accomplished mathematician and computer scientist who received a Turing Award for “fundamental contributions to the theory and practice of distributed and concurrent systems”. He just published a draft of his new book:

"A science of concurrent programs"

lamport.azurewebsites.net/tla/

True to his pedagogic approach to everything he does, "The book assumes only that you know the math one learns before entering a university." Even the appendices are fantastic. I can only wish I'll remain this lucid at 82.

At what point does setting more threads for OpenBLAS actually help?

For example, I have an SVD operation in #RStats on largish matrices (6000 rows and 6000 columns, used to compute an inverse), where the default BLAS on Ubuntu takes ~20 min.

OpenBLAS with 1 or 4 threads takes ~2 min (a 10x speedup!). With 4 threads I can see additional cores being used, but the overall time is the same as with 1 thread.

Is there some magic size where using more threads for SVD will actually help?
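
Not an answer, but a way to measure it: here's a minimal sketch in Python/NumPy (swapping languages purely for illustration; the OpenBLAS layer underneath is the same one R calls) that uses the threadpoolctl package to cap the BLAS thread count per run and time the SVD:

import time
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

rng = np.random.default_rng(0)
a = rng.standard_normal((6000, 6000))  # shrink this for a quick test

for n_threads in (1, 2, 4, 8):
    # Limit only the BLAS pool, leaving other thread pools untouched.
    with threadpool_limits(limits=n_threads, user_api="blas"):
        t0 = time.perf_counter()
        np.linalg.svd(a)
        print(f"{n_threads} threads: {time.perf_counter() - t0:.1f} s")

This assumes NumPy is linked against OpenBLAS (np.show_config() will tell you).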