Krzysztof Dmowski @xaphanpl

**Tweede golf** @tweedegolf@fosstodon.org · May 27

SIMD blog series: @folkertdev shows examples of using SIMD in the zlib-rs project.

Part 2 explains what to do when the compiler is not capable of using the SIMD capabilities of modern CPUs effectively. We end up with a basic, but very effective, example of a custom SIMD implementation beating the compiler.

https://tweedegolf.nl/en/blog/155/simd-in-zlib-rs-part-2-compare256

@trifectatech

tweedegolf.nl · SIMD in zlib-rs (part 2): compare256 - Blog - Tweede golf · In part 1 of the "SIMD in zlib-rs" series, we've seen that, with a bit of nudging, autovectorization can produce optimal code for some problems. But that does not always work: with SIMD clever pr ...

#rustlang #datacompression #simd
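
For readers who have not seen the problem: a `compare256`-style routine returns how many leading bytes of two 256-byte blocks are equal. The sketch below is not the zlib-rs code and uses no SIMD intrinsics; it only contrasts a byte-at-a-time loop with a portable word-at-a-time variant to show the "compare a chunk, then locate the first mismatch" shape. The function names and the helper `load_u64` are made up for illustration.

```rust
/// Reference version: walk both blocks one byte at a time.
pub fn compare256_scalar(a: &[u8; 256], b: &[u8; 256]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// Hypothetical helper: load 8 bytes as a little-endian u64.
fn load_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    u64::from_le_bytes(buf)
}

/// Word-at-a-time version: compare 8 bytes per step, then find the
/// first differing byte inside the word that mismatched.
pub fn compare256_words(a: &[u8; 256], b: &[u8; 256]) -> usize {
    for (i, (ca, cb)) in a.chunks_exact(8).zip(b.chunks_exact(8)).enumerate() {
        let diff = load_u64(ca) ^ load_u64(cb);
        if diff != 0 {
            // With a little-endian load, the lowest set bit of `diff`
            // lies in the first byte that differs.
            return i * 8 + (diff.trailing_zeros() / 8) as usize;
        }
    }
    256
}
```

A real SIMD version would do the same thing with wider registers: a vector byte-compare, a mask extraction, and a trailing-zero count on the mask.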

**nietras** @nietras@mastodon.social · May 9

New blog post "Sep 0.10.0 - 21 GB/s CSV Parsing Using SIMD on AMD 9950X"

Sep #performance from 7 GB/s to 21 GB/s over last two years
#csharp #SIMD and #x64 assembly on #dotnet 9.0
Tweaks and new #AVX512-to-256 parser
Lots of benchmarks

https://nietras.com/2025/05/09/sep-0-10-0/

**Tweede golf** @tweedegolf@fosstodon.org · Apr 15

New blog series: @folkertdev shows how we use SIMD in the zlib-rs project.

SIMD is crucial to good performance, but learning how to use it can be daunting. In this series we'll show concrete examples of using SIMD in a real world project.

Part 1 explains how the compiler already uses SIMD for us, how to evaluate whether it's doing a good job, and how to use a more optimal version when the current CPU supports it.

https://tweedegolf.nl/en/blog/153/simd-in-zlib-rs-part-1-autovectorization-and-target-features

@trifectatech

tweedegolf.nl · SIMD in zlib-rs (part 1): Autovectorization and target features - Blog - Tweede golf · I'm fascinated by the creative use of SIMD instructions. When you first learn about SIMD, it is clear that doing more multiplications in a single instruction is useful for speeding up matrix multi ...

#rustlang #datacompression #simd
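
A minimal sketch of the "use a better version when the current CPU supports it" idea in Rust, assuming an x86 target for the fast path. It relies on the standard `#[target_feature]` attribute and the `is_x86_feature_detected!` macro. The byte-sum workload and all function names are placeholders, not code from zlib-rs.

```rust
/// Portable version; the compiler is free to autovectorize this loop
/// with whatever instruction set the build targets by default.
fn byte_sum_generic(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

/// Same source, but compiled with AVX2 enabled for this one function,
/// so the autovectorizer may use 256-bit registers here.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn byte_sum_avx2(data: &[u8]) -> u64 {
    data.iter().map(|&b| b as u64).sum()
}

/// Picks the AVX2 build at runtime when the CPU reports support for it.
pub fn byte_sum(data: &[u8]) -> u64 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified on the line above.
            return unsafe { byte_sum_avx2(data) };
        }
    }
    byte_sum_generic(data)
}
```

Whether the AVX2 build is actually faster has to be checked by reading the generated assembly or benchmarking, which is the evaluation step the post mentions.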

**mkretz** @mkretz@floss.social · Apr 3

While implementing complex numbers for #simd I tripped over failures wrt. negative zero. After multiple re-readings of C23 Annex G and considering the meaning of infinite infinities on a 2D plane (with zeros simply being their inverse) I believe #C and #CPlusPlus should ignore the sign of zeros and infinities in their x+iy representations of complex numbers. https://compiler-explorer.com/z/YavE4MnMj provides some motivation.
Am I missing something?

40% No, ignore signs: r=0 or r=∞ => θ is indeterminate
40% Yes, the 8 different 0s and ∞s tell me something
20% What are you talking about?

compiler-explorer.com · Compiler Explorer - C++ · `int main() { using C = std::complex<double>; std::cout << C() * -C() << '\n'; std::cout << 0. * -C() << '\n'; }`
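
To make the sign-of-zero issue concrete without a C++ toolchain, here is a small Rust sketch using the textbook formulas on `(re, im)` pairs. It is an assumption that this mirrors what a given `std::complex` implementation does internally. The point is only that a full complex multiply and a scalar-times-complex multiply of the "same" zeros end up with different zero signs.

```rust
/// A complex number as a (re, im) pair; a stand-in for std::complex<double>.
type C = (f64, f64);

/// Textbook complex multiply: (a + bi)(c + di).
pub fn cmul(x: C, y: C) -> C {
    (x.0 * y.0 - x.1 * y.1, x.0 * y.1 + x.1 * y.0)
}

/// Real scalar times complex, done component by component.
pub fn scale(s: f64, z: C) -> C {
    (s * z.0, s * z.1)
}

/// Negation flips the sign bit of both parts, including on zeros.
pub fn cneg(z: C) -> C {
    (-z.0, -z.1)
}
```

With IEEE 754 arithmetic, `cmul((0.0, 0.0), cneg((0.0, 0.0)))` yields `(+0, -0)` while `scale(0.0, cneg((0.0, 0.0)))` yields `(-0, -0)`. Both are "zero", yet they differ in the sign of the real part.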

**Ayan Shafqat** @ashafq@hachyderm.io · Mar 18

Forget the AI hype - FFT is the real unsung hero of computing...

The Fast Fourier Transform (FFT) is everywhere: multiplying large numbers, audio and video compression, high-frequency trading, weather prediction - you name it. It’s also the foundation of other key transforms: DCT for image compression, MDCT for audio compression, MFCC for machine learning, and more.

FFT is the most underrated algorithm of the 20th and 21st century — change my mind.

The first time I saw the Fourier Matrix and finally understood the Cooley-Tukey FFT, I was hooked. There’s something beautiful and elegant about its tree-like structure. Someday, I will probably write about what happens when you unravel FFT's recursion, and how it is related to the `rbit` instruction on ARM CPUs. And sometimes, I just sit at my computer and code away to make FFT run faster. It's relaxing...

Here’s one of my little achievements: A 4-point complex-to-complex FFT in just **11** AVX2 instructions. By itself, a 4-point FFT isn’t much, but as a kernel, it helps build higher-order FFTs with blazing efficiency.

The full demo implementation is on GitHub; it computes a 256-point FFT in under 1 microsecond on 12th-gen Intel processors.

https://gist.github.com/ashafq/eef8ef391fb58be85b325c259ce591e3

#signalprocessing #programming #simd
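
For context, a scalar sketch of what a 4-point complex FFT kernel computes, plus the bit-reversal index permutation that falls out of unrolling the radix-2 recursion. This is not the 11-instruction AVX2 kernel from the gist. The names are illustrative, and the forward sign convention (multiplying by -i) is an assumption.

```rust
/// A complex number as a (re, im) pair.
type Complex = (f32, f32);

fn add(a: Complex, b: Complex) -> Complex { (a.0 + b.0, a.1 + b.1) }
fn sub(a: Complex, b: Complex) -> Complex { (a.0 - b.0, a.1 - b.1) }

/// Multiplying by -i needs no multiplication: (re, im) -> (im, -re).
fn mul_neg_i(z: Complex) -> Complex { (z.1, -z.0) }

/// 4-point forward DFT as two radix-2 butterfly stages.
pub fn fft4(x: [Complex; 4]) -> [Complex; 4] {
    let a = add(x[0], x[2]);
    let b = sub(x[0], x[2]);
    let c = add(x[1], x[3]);
    let d = mul_neg_i(sub(x[1], x[3]));
    [add(a, c), add(b, d), sub(a, c), sub(b, d)]
}

/// Bit-reversed index for an FFT of size 2^bits (1 <= bits <= 32).
/// `reverse_bits` is the portable spelling; on AArch64 compilers
/// typically lower it to the `rbit` instruction.
pub fn bit_reverse(i: u32, bits: u32) -> u32 {
    i.reverse_bits() >> (32 - bits)
}
```

For an 8-point transform, `bit_reverse` maps indices 0..8 to 0, 4, 2, 6, 1, 5, 3, 7, which is the input order an in-place decimation-in-time FFT expects.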

**Ayan Shafqat** @ashafq@hachyderm.io · Mar 18

SIMD and IIR filters are like oil and water, hard to mix! But with some clever math tricks, we can make IIR filters parallel utilizing SIMD instructions. Check out my new (or not so new) post!

https://shafq.at/vectorizing-iir-filters.html

Ayan Shafqat · Feb 12 · Vectorizing IIR Filters: What are you Recursing? · Disclaimer: This article took quite a while to prepare. Although I’ve made every effort to fact-check and ensure the accuracy of the content, there may still be errors. If you notice any mistakes, please feel free to reach out and let me know! I like writing programs that run …

#signalprocessing #C #programming
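
One generic way to see why the recursion is the obstacle, and how algebra can loosen it: for a first-order filter y[n] = a·y[n-1] + x[n], four consecutive outputs can each be written in terms of the last known output and the inputs alone. The sketch below uses that block expansion. It is a common textbook trick and only an assumption about the kind of math the linked article covers. The filter form and function names are illustrative.

```rust
/// Reference: y[n] = a * y[n-1] + x[n], one sample at a time.
pub fn iir1_scalar(a: f32, x: &[f32]) -> Vec<f32> {
    let mut y_prev = 0.0f32;
    x.iter()
        .map(|&xn| {
            y_prev = a * y_prev + xn;
            y_prev
        })
        .collect()
}

/// Block form: expand the recursion over 4 samples so that y0..y3
/// depend only on `y_prev` and the inputs, not on each other.
pub fn iir1_blocked(a: f32, x: &[f32]) -> Vec<f32> {
    let (a2, a3, a4) = (a * a, a * a * a, a * a * a * a);
    let mut y = Vec::with_capacity(x.len());
    let mut y_prev = 0.0f32;
    let mut chunks = x.chunks_exact(4);
    for c in &mut chunks {
        let y0 = a * y_prev + c[0];
        let y1 = a2 * y_prev + a * c[0] + c[1];
        let y2 = a3 * y_prev + a2 * c[0] + a * c[1] + c[2];
        let y3 = a4 * y_prev + a3 * c[0] + a2 * c[1] + a * c[2] + c[3];
        y.extend_from_slice(&[y0, y1, y2, y3]);
        y_prev = y3;
    }
    // Leftover samples (fewer than 4) fall back to the plain recursion.
    for &xn in chunks.remainder() {
        y_prev = a * y_prev + xn;
        y.push(y_prev);
    }
    y
}
```

The four expressions inside the loop are independent of one another, so they map onto the lanes of one SIMD register; the price is extra multiplies and slightly different rounding than the sample-by-sample loop.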

**Несерьёзный Выдумщик** @grumb@idealists.su · Mar 6

Andrew Krapivin's discovery about hash tables and tiny pointers?
Purely hypothetically, it may well be relevant, but only in pure, bare computer science theory.
In practice there are plenty of implementation nuances that come down to optimizations for specific hardware platforms.

For example, there are #SwissTable hash tables, known since 2018; #Golang recently switched to them (as of version 1.24). And #Rust managed to switch to SwissTable before Go did.

The Google SwissTable and Facebook F14 hash tables are roughly the same; one is just a variant of the other.

The optimization idea revolves around using #SIMD instructions to find occupied slots and check the key. And in the overwhelming majority of cases, a single check of one block of eight elements is enough.

You still have to experiment many times with implementation variants of any idea from pure computer science, and see how it maps onto a hardware platform such as x86-64.

There is memory prefetching, and RAM access works by loading a whole cache line into the CPU, even when reading just a single value of a couple of bytes.
The previous point is not only about cache misses but also about "data locality", which can both improve performance and lead to false sharing when a data structure is used from multiple threads.
The virtual memory page size must also be taken into account, to reduce "pressure" on the TLB and avoid TLB misses.

For example, heavily loaded systems are additionally tuned to use huge pages; this applies to everyone who uses the trendy #DPDK on its own or together with something like #Seastar:

Those who chose not the original #Kafka but its higher-performance counterpart #RedPanda.
Those who use the higher-performance #ScyllaDB instead of Apache #Cassandra.

Bare computer science theory is all well and good, but practice is disgustingly down-to-earth. A straight linear scan over a small array turns out to be faster than using a binary search tree. And it does not matter at all which kind, red-black or AVL.

This is not a matter of being retrograde or of challenging a 40-year-old theory :)

#software #SoftwareDevelop #программирование #разработка #programming @russian_mastodon @ru @Russia
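
A rough sketch of the "check a whole block of eight slots at once" idea, done with plain 64-bit integer tricks (SWAR) instead of real SIMD instructions so it runs anywhere. It assumes a SwissTable-like layout in which each slot has a one-byte control tag derived from the hash. The names `match_tag` and `matching_slots` are invented here and do not come from any of the libraries mentioned.

```rust
const ONES: u64 = 0x0101_0101_0101_0101;
const LOW7: u64 = 0x7F7F_7F7F_7F7F_7F7F;
const HIGH: u64 = 0x8080_8080_8080_8080;

/// Returns a mask with bit 7 set in every byte position whose control
/// byte equals `tag`. All eight slots are tested with a few integer ops.
pub fn match_tag(group: [u8; 8], tag: u8) -> u64 {
    // Bytes equal to `tag` become zero after the XOR.
    let x = u64::from_le_bytes(group) ^ (ONES * tag as u64);
    // Exact zero-byte test: adding 0x7F to the low 7 bits sets bit 7
    // for any non-zero low part, and never carries into the next byte.
    let t = (x & LOW7) + LOW7;
    !(t | x) & HIGH
}

/// Expands the mask into slot indices, lowest slot first.
pub fn matching_slots(mut mask: u64) -> Vec<usize> {
    let mut slots = Vec::new();
    while mask != 0 {
        slots.push((mask.trailing_zeros() / 8) as usize);
        mask &= mask - 1; // clear the lowest set bit
    }
    slots
}
```

Only the slots reported here need a full key comparison, which is why a lookup usually touches a single group and a single cache line.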

**IT News** @itnewsbot@schleuss.online · Dec 23, 2024

Faster Integer Division with Floating Point - Multiplication on a common microcontroller is easy. But division is much more diff... - https://hackaday.com/2024/12/22/faster-integer-division-with-floating-point/ #softwaredevelopment #softwarehacks #optimization #assembly #avx-512 #x86_64 #simd #x86

Hackaday · Dec 23, 2024 · Faster Integer Division With Floating Point · Multiplication on a common microcontroller is easy. But division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can run between 9 and 15 cycles.…
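
The general idea can be sketched in a few lines, assuming 8-bit unsigned operands (an assumption chosen so the result can be checked exhaustively; the linked article may target other widths and uses AVX-512). Convert to float, divide, truncate. `div_u8_via_f32` and `div_u8_slices` are made-up names.

```rust
/// Integer division of two u8 values done through f32.
/// For 8-bit inputs the float quotient never rounds up to the next
/// integer, so truncating it matches integer division for b != 0.
pub fn div_u8_via_f32(a: u8, b: u8) -> u8 {
    // `as u8` truncates toward zero and saturates; callers must make
    // sure b != 0, since a float divide by zero gives inf or NaN.
    (a as f32 / b as f32) as u8
}

/// Element-wise form; a simple loop like this is a candidate for
/// autovectorization because float division has SIMD instructions.
pub fn div_u8_slices(a: &[u8], b: &[u8], out: &mut [u8]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = div_u8_via_f32(x, y);
    }
}
```

The appeal is that many SIMD instruction sets offer packed floating-point division but no packed integer division, so the detour through floats can process many lanes per instruction.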

**mattst88** @mattst88@fosstodon.org · Dec 18, 2024

I landed some improvements and small optimizations to #pixman's AltiVec code. See https://gitlab.freedesktop.org/pixman/pixman/-/merge_requests/136

It was fun working with a new (to me) instruction set and trying to figure out how to puzzle together the pieces into something that improved the `pix_multiply()` function (which is kind of the core primitive of most fast paths).

I couldn't figure out a way to use the `vec_mradds`/`vmhraddshs` instruction. Maybe you can? (see https://gitlab.freedesktop.org/pixman/pixman/-/merge_requests/136#note_2699795)

GitLab · vmx: Many improvements (!136) · Merge requests · Pixman / pixman · Matt Turner (19): vmx: Remove unnecessary variable; vmx: Remove unpack_565_to_8888() and associated constants; vmx: Remove unpack_128_2x128_16(); vmx: Remove...

#altivec #powerpc #simd
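
For readers unfamiliar with the primitive: a pixel multiply of this kind scales each 8-bit channel by another 8-bit value and divides by 255 with rounding. Below is a scalar sketch of the usual division-free formulation, assuming a component-wise multiply of two packed 32-bit pixels. It illustrates the arithmetic only and is not taken from pixman's vmx code.

```rust
/// Rounded (x * a) / 255 for 8-bit values without a division:
/// add 0x80, then fold the high byte back in before the final shift.
pub fn mul_un8(x: u8, a: u8) -> u8 {
    let t = x as u16 * a as u16 + 0x80;
    ((t + (t >> 8)) >> 8) as u8
}

/// Component-wise multiply of two packed 4x8-bit pixels.
pub fn pix_multiply(p: u32, q: u32) -> u32 {
    let mut out = 0u32;
    for i in 0..4 {
        let shift = i * 8;
        let c = mul_un8((p >> shift) as u8, (q >> shift) as u8);
        out |= (c as u32) << shift;
    }
    out
}
```

A vector implementation performs the same multiply, add and shift steps on 16-bit lanes, which is where instructions that fuse a multiply, a rounding shift and an add become tempting.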

**FCLC** @fclc@mast.hpc.social · Dec 9, 2024 *

Channeling my inner @shafik, assuming a standard, compliant #riscv processor, what kind of float instructions can be executed on the vector unit of a processor that advertises

"RV32IMFDZve64f"

#HPC #IEEE754 #SIMD #RISCV #RVV

https://github.com/riscvarchive/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf

0% 1xFP32, 0xFP64
0% 1xFP32, 1xFP64
100% 2xFP32, 1xFP64
0% 2xFP32, 0xFP64

**mattst88** @mattst88@fosstodon.org · Dec 5, 2024

I fixed an issue in pixman's Altivec code the other day -- https://cgit.freedesktop.org/pixman/commit/?id=207626180d0282bb14a50f2e494174f54ac8a6ce

And in the process, I read through the Altivec docs and discovered that there are vector instructions that pack and unpack between a8r8g8b8 and a1r5g5b5 formats (but nothing for r5g6b5).

Any clues why? Was a1r5g5b5 really common on Mac OS or something? I don't think I've seen a1r5g5b5 used anywhere.

cgit.freedesktop.org · vmx: Fix is_opaque, is_zero, is_transparent functions - pixman - Pixman: The pixel-manipulation library for X and cairo. (mirrored from https://gitlab.freedesktop.org/pixman/pixman)

#powerpc #altivec #simd
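
As a reminder of what the two layouts look like, here is a scalar sketch of converting between a8r8g8b8 (32-bit) and a1r5g5b5 (16-bit). The expansion back to 8 bits uses bit replication, which is one common convention and an assumption here. The AltiVec instructions may widen the fields differently, and the function names are illustrative.

```rust
/// a8r8g8b8 -> a1r5g5b5: keep the top bit of alpha and the top five
/// bits of each colour channel.
pub fn pack_a1r5g5b5(argb: u32) -> u16 {
    let a = (argb >> 31) & 0x1;
    let r = (argb >> 19) & 0x1F;
    let g = (argb >> 11) & 0x1F;
    let b = (argb >> 3) & 0x1F;
    ((a << 15) | (r << 10) | (g << 5) | b) as u16
}

/// a1r5g5b5 -> a8r8g8b8: alpha becomes 0x00 or 0xFF, and each 5-bit
/// channel is widened by replicating its top bits into the low bits.
pub fn unpack_a1r5g5b5(p: u16) -> u32 {
    let p = p as u32;
    let expand5 = |c: u32| (c << 3) | (c >> 2);
    let a = if p & 0x8000 != 0 { 0xFF } else { 0x00 };
    let r = expand5((p >> 10) & 0x1F);
    let g = expand5((p >> 5) & 0x1F);
    let b = expand5(p & 0x1F);
    (a << 24) | (r << 16) | (g << 8) | b
}
```

r5g6b5 differs only in giving green six bits and dropping alpha, so all three channels cannot share one field width; that may or may not be related to why it has no dedicated instruction.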

**FCLC** @fclc@mast.hpc.social · Nov 28, 2024

Hey friends!
For folks interested in #RISCV, and especially #RVV, here's some information on the #tenstorrent CPU designed in-house!

High level, vector is 2x256, full RVV1.0 as well as a fair few of the optional extras to RVV1.0!

Phoronix article here: https://www.phoronix.com/news/LLVM-20-Tenstorrent-Ascalon

LLVM patches here: https://github.com/llvm/llvm-project/pull/115100

One Pager: https://cdn.sanity.io/files/jpb4ed5r/production/6a28f7d59b6d1300fccdbdd394e192a4fd5f54c6.pdf

www.phoronix.com · LLVM Merges Support For The Tenstorrent TT-Ascalon-D8 RISC-V CPU

#HPC #SIMD

**mkretz** @mkretz@floss.social · Nov 23, 2024

C++26 will have data-parallel types (or std::simd as it came to be known; unless we rename it next meeting — don't settle in for the name just yet) #cpp #cplusplus #cpp26 #simd

**Karsten Schmidt** @toxi@mastodon.thi.ng · Nov 4, 2024

Yesterday, one year ago... (Still wondering how many people actually have read or tried out any of these)

https://mastodon.thi.ng/@toxi/111348591236791838

#ThingUmbrella #HowToThing #TypeScript

**Marcin Juszkiewicz** @hrw@society.oftrolls.com · Sep 18, 2024

If you work with SIMD and wonder how it looks on other architectures, VectorCamp has launched a website that helps.

On https://simd.info/ you can look up which intrinsics are available on Arm, Power and x86-64 (RISC-V RVV will be there too), compare them, etc.

There is a search function, tree of operations and links to the official documentation.

simd.info · Home | SIMD.info

#simd #neon #avx

**NLnet Labs** @nlnetlabs@fosstodon.org · Sep 12, 2024

@resingm @ximon18 Meanwhile, it's day 4 and @bal4e is seriously on a mission to make the `domain` zone file parser lightning fast. #DNS #SIMD #rustlang https://github.com/NLnetLabs/domain/pull/388

GitHub · Overhaul parsing from the presentation format by bal-e · Pull Request #388 · NLnetLabs/domain

**NLnet Labs** @nlnetlabs@fosstodon.org · Sep 6, 2024 *

Jeroen Koekkoek, one of our lead developers, has collaborated with @lemire to create a blazingly fast #DNS zone file parser that is now part of our authoritative nameserver NSD.

They have now published a paper outlining how they enhanced parsing throughput using data parallelism, specifically Single Instruction Multiple Data (#SIMD) instructions available on commodity processors. #programming https://www.authorea.com/1222979

**mkretz** @mkretz@floss.social · Aug 29, 2024 *

European #GNURadio Days this week. (It's just a few steps from my regular office at #GSI_Helmholtzzentrum_für_Schwerionenforschung.) This week has a focus on GNURadio 4, which was developed by colleagues at #FAIR/#GSI. I'm happy that I was able to contribute a small part in design and implementation of the new core. And this new core makes use of `stdx::simd` and https://github.com/mattkretz/vir-simd. I will talk about the #SIMD parts later today (1:30 pm CEST) and you can tune in at
https://www.youtube.com/watch?v=8xnPsPdy5AQ

**Librecast** @librecast@chaos.social · Jul 11, 2024

It's a new release of lcrq!

lcrq now makes use of a CPU dispatcher to detect the available SIMD instruction sets at runtime, ensuring that the code runs as fast as possible on the target machine.

Thanks to @nlnet and #NGIAssure for funding this work.

https://codeberg.org/librecast/lcrq/releases/tag/v0.2.0

Codeberg.org · lcrq · C implementation of RFC6330 RaptorQ Codes

#simd #RaptorQ #lcrq
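
lcrq itself is written in C; the following is only a Rust sketch of the general shape of a CPU dispatcher, assuming an x86-64 fast path: probe the CPU once, cache a function pointer, and route every later call through it. The XOR-of-blocks workload and all names are placeholders and not lcrq's API.

```rust
use std::sync::OnceLock;

/// Signature shared by every implementation the dispatcher can pick.
type XorFn = fn(&mut [u8], &[u8]);

/// Portable fallback: dst[i] ^= src[i].
fn xor_generic(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d ^= *s;
    }
}

/// Same loop compiled with AVX2 enabled for this function only.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn xor_avx2(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d ^= *s;
    }
}

/// Safe wrapper; it is only ever installed after AVX2 was detected.
#[cfg(target_arch = "x86_64")]
fn xor_avx2_entry(dst: &mut [u8], src: &[u8]) {
    // SAFETY: `select` returns this function only when AVX2 is present.
    unsafe { xor_avx2(dst, src) }
}

/// Runs the CPU probe and returns the best available implementation.
fn select() -> XorFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return xor_avx2_entry;
        }
    }
    xor_generic
}

/// Public entry point: the probe runs on the first call only.
pub fn xor_block(dst: &mut [u8], src: &[u8]) {
    static IMPL: OnceLock<XorFn> = OnceLock::new();
    let f = *IMPL.get_or_init(select);
    f(dst, src)
}
```

Caching the choice keeps the feature probe out of the hot path, so each call costs one indirect call on top of the actual work.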

**Ivan Enderlin** @hywan@fosstodon.org · Jun 19, 2024

wide, https://github.com/Lokathor/wide.

> [it] has portable "wide" data types that do their best to be SIMD when possible.

> On x86, x86_64, wasm32 and aarch64 neon this is done with explicit intrinsic usage (via safe_arch), and on other architectures this is done by carefully writing functions so that LLVM hopefully does the right thing. When Rust stabilizes more explicit intrinsics then they can go into safe_arch and then they can get used here.

GitHub · Lokathor/wide: A crate to help you go wide. By which I mean use SIMD stuff.

#RustLang #SIMD #DataType
