I decided to share my Arm NEON optimizations for the FFmpeg Cinepak encoder. On Apple Silicon, Raspberry Pi, and other NEON-capable CPUs (32-bit and 64-bit), they give a 250-300% encoding speedup:
SIMD blog series: @folkertdev shows examples of using SIMD in the zlib-rs project.
Part 2 explains what to do when the compiler cannot use the SIMD capabilities of modern CPUs effectively. We end up with a basic but very effective example of a custom SIMD implementation beating the compiler.
https://tweedegolf.nl/en/blog/155/simd-in-zlib-rs-part-2-compare256
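For a feel of the kind of kernel Part 2 ends with, here is a rough C/SSE2 sketch of the compare256 idea (how many of the first 256 bytes of two buffers match before the first mismatch). The real zlib-rs code is in Rust and handles more target features, so treat this only as an illustration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Count how many of the first 256 bytes of a and b are equal, stopping at the
 * first mismatch -- the job of zlib's compare256. Illustrative C/SSE2 sketch;
 * the actual zlib-rs implementation differs in language and detail. */
static uint32_t compare256_sse2(const uint8_t *a, const uint8_t *b)
{
    uint32_t len = 0;
    for (int i = 0; i < 256; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i eq = _mm_cmpeq_epi8(va, vb);
        uint32_t m = (uint32_t)_mm_movemask_epi8(eq);   /* bit i set -> bytes equal */
        if (m != 0xFFFF)
            return len + (uint32_t)__builtin_ctz(~m);   /* index of first mismatch */
        len += 16;
    }
    return 256;
}
```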
New blog series: @folkertdev shows how we use SIMD in the zlib-rs project.
SIMD is crucial to good performance, but learning how to use it can be daunting. In this series we'll show concrete examples of using SIMD in a real-world project.
Part 1 explains how the compiler already uses SIMD for us, how to evaluate whether it's doing a good job, and how to use a more optimal version when the current CPU supports it.
https://tweedegolf.nl/en/blog/153/simd-in-zlib-rs-part-1-autovectorization-and-target-features
While implementing complex numbers for #simd I tripped over failures with respect to negative zero. After multiple re-readings of C23 Annex G, and after considering the meaning of infinite infinities on a 2D plane (with zeros simply being their inverse), I believe #C and #CPlusPlus should ignore the sign of zeros and infinities in their x+iy representations of complex numbers. https://compiler-explorer.com/z/YavE4MnMj provides some motivation.
Am I missing something?
Forget the AI hype - FFT is the real unsung hero of computing...
The Fast Fourier Transform (FFT) is everywhere: multiplying large numbers, audio and video compression, high-frequency trading, weather prediction - you name it. It’s also the foundation of other key transforms: DCT for image compression, MDCT for audio compression, MFCC for machine learning, and more.
FFT is the most underrated algorithm of the 20th and 21st centuries — change my mind.
The first time I saw the Fourier Matrix and finally understood the Cooley-Tukey FFT, I was hooked. There's something beautiful and elegant about its tree-like structure. Someday, I will probably write about what happens when you unravel FFT's recursion, and how it is related to the `rbit` instruction on ARM CPUs. And sometimes, I just sit at my computer and code away to make FFT run faster. It's relaxing...
Here's one of my little achievements: a 4-point complex-to-complex FFT in just **11** AVX2 instructions. By itself, a 4-point FFT isn't much, but as a kernel, it helps build higher-order FFTs with blazing efficiency.
The full demo implementation is on GitHub; it computes a 256-point FFT in under 1 microsecond on 12th-gen Intel processors.
https://gist.github.com/ashafq/eef8ef391fb58be85b325c259ce591e3
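For context on what such a 4-point kernel computes, here is a plain scalar reference of a 4-point forward DFT written straight from the definition. This is only a model of the math, not the AVX2 code from the gist.

```c
#include <complex.h>

/* Reference 4-point forward DFT (e^{-2*pi*i*k*n/4} convention), written out
 * explicitly: only additions, subtractions and multiplications by +/-i remain.
 * A SIMD kernel computes exactly these four sums, just with the real and
 * imaginary parts laid out across vector registers. */
static void fft4(const double complex x[4], double complex X[4])
{
    double complex t0 = x[0] + x[2];   /* even-index butterfly */
    double complex t1 = x[0] - x[2];
    double complex t2 = x[1] + x[3];   /* odd-index butterfly */
    double complex t3 = x[1] - x[3];

    X[0] = t0 + t2;
    X[1] = t1 - I * t3;   /* twiddle W_4^1 = -i */
    X[2] = t0 - t2;
    X[3] = t1 + I * t3;
}
```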
SIMD and IIR filters are like oil and water, hard to mix! But with some clever math tricks, we can parallelize IIR filters using SIMD instructions. Check out my new (or not so new) post!
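One common trick, shown here as a guess at the kind of math the post refers to rather than its exact approach: unroll the recurrence so that every output in a small block depends only on the block's inputs and on state from the previous block. The resulting independent expressions map onto SIMD lanes. A minimal scalar sketch for a first-order filter y[n] = b*x[n] + a*y[n-1]:

```c
#include <stddef.h>

/* First-order IIR, y[n] = b*x[n] + a*y[n-1], unrolled by 4: each output in a
 * block depends only on the block's inputs and on y_prev from the previous
 * block, so the four expressions below are mutually independent and can be
 * evaluated in SIMD lanes. Plain C for clarity; illustration of the general
 * idea, not necessarily the technique in the linked post. */
static void iir1_block4(const float *x, float *y, size_t n,
                        float a, float b, float y_prev)
{
    const float a2 = a * a, a3 = a2 * a, a4 = a2 * a2;

    for (size_t i = 0; i + 4 <= n; i += 4) {
        y[i + 0] = b * x[i]                                            + a  * y_prev;
        y[i + 1] = b * x[i + 1] + a * b * x[i]                         + a2 * y_prev;
        y[i + 2] = b * x[i + 2] + a * b * x[i + 1] + a2 * b * x[i]     + a3 * y_prev;
        y[i + 3] = b * x[i + 3] + a * b * x[i + 2] + a2 * b * x[i + 1]
                                + a3 * b * x[i]                        + a4 * y_prev;
        y_prev = y[i + 3];  /* carry the state into the next block */
    }
    /* A real implementation would also handle the n % 4 tail samples. */
}
```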
Andrew Krapivin's discovery about hash tables and tiny pointers?
Purely hypothetically it may be relevant, but only in pure, bare computer-science theory.
In practice there are plenty of implementation nuances, and they come down to optimizations for specific hardware platforms.
For example, there is #SwissTable, known since 2018; #Golang recently switched to it (as of version 1.24), and #Rust had already switched to SwissTable before that.
Google's SwissTable and Facebook's F14 hash tables are roughly the same; one is just a variant of the other.
The optimization idea revolves around using #SIMD instructions to find occupied slots and check the key, and in the overwhelming majority of cases a single check of one block of eight elements is enough (see the sketch after this post).
You still have to play many times with different implementations of an idea from pure computer science, watching how it maps onto a hardware platform such as x86-64.
There is memory prefetching, and RAM access works by loading an entire cache line into the CPU, even when you only read a single value a couple of bytes in size.
The previous point is not only about cache misses but also about data locality, which can both improve performance and lead to false sharing when the data structure is used from multiple threads.
You also have to account for the virtual-memory page size, to reduce pressure on the TLB and avoid TLB misses.
For example, heavily loaded systems get additional tuning for huge pages; everyone who uses the fashionable #DPDK, either on its own or with something like #Seastar, does this:
Bare computer-science theory is all well and good, but practice is repulsive in how down-to-earth it is. A straight linear scan over a small array turns out to be faster than using a binary search tree, and it does not matter at all which kind: red-black or AVL.
This is not a matter of being retrograde or challenging 40 years of theory :)
#software #SoftwareDevelop #программирование #разработка #programming @russian_mastodon @ru @Russia
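As promised above, a rough sketch of the SIMD group probe that SwissTable-style tables use. The post mentions blocks of eight elements; the sketch below uses a 16-byte SSE2 group for simplicity (group sizes and layouts vary between SwissTable, F14, and their ports), so treat it as an illustration only.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* SwissTable-style group probe: each group has one control byte per slot,
 * holding either an EMPTY/DELETED marker or a 7-bit tag from the key's hash.
 * One compare plus one movemask yields a bitmask of candidate slots whose
 * keys then need a full comparison. Real implementations differ in detail. */
#define CTRL_EMPTY ((uint8_t)0x80)

static uint32_t group_match(const uint8_t ctrl[16], uint8_t h2 /* 7-bit hash tag */)
{
    __m128i group = _mm_loadu_si128((const __m128i *)ctrl);
    __m128i tag   = _mm_set1_epi8((char)h2);
    __m128i eq    = _mm_cmpeq_epi8(group, tag);
    return (uint32_t)_mm_movemask_epi8(eq);  /* bit i set -> slot i may hold the key */
}
```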
Faster Integer Division with Floating Point - Multiplication on a common microcontroller is easy. But division is much more diff... - https://hackaday.com/2024/12/22/faster-integer-division-with-floating-point/ #softwaredevelopment #softwarehacks #optimization #assembly #avx-512 #x86_64 #simd #x86
I landed some improvements and small optimizations to #pixman's AltiVec code. See https://gitlab.freedesktop.org/pixman/pixman/-/merge_requests/136
It was fun working with a new (to me) instruction set and trying to figure out how to puzzle together the pieces into something that improved the `pix_multiply()` function (which is kind of the core primitive of most fast paths).
I couldn't figure out a way to use the `vec_mradds`/`vmhraddshs` instruction. Maybe you can? (see https://gitlab.freedesktop.org/pixman/pixman/-/merge_requests/136#note_2699795)
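For readers unfamiliar with pixman: per channel, `pix_multiply()` boils down to an 8-bit multiply followed by an exact, rounded division by 255. A scalar sketch of that arithmetic (an illustration of the operation, not pixman's actual code):

```c
#include <stdint.h>

/* Rough scalar model of the per-channel work behind pix_multiply(): an 8-bit
 * multiply with exact rounding division by 255, using the classic
 * (t + (t >> 8)) >> 8 trick with t = a*b + 0x80. Illustration only. */
static inline uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}
```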
Channeling my inner @shafik, assuming a standard, compliant #riscv processor, what kind of float instructions can be executed on the vector unit of a processor that advertises
"RV32IMFDZve64f"
#HPC #IEEE754 #SIMD #RISCV #RVV
https://github.com/riscvarchive/riscv-v-spec/releases/download/v1.0/riscv-v-spec-1.0.pdf
I fixed an issue in pixman's Altivec code the other day -- https://cgit.freedesktop.org/pixman/commit/?id=207626180d0282bb14a50f2e494174f54ac8a6ce
And in the process, I read through the Altivec docs and discovered that there are vector instructions that pack and unpack between a8r8g8b8 and a1r5g5b5 formats (but nothing for r5g6b5).
Any clues why? Was a1r5g5b5 really common on Mac OS or something? I don't think I've seen a1r5g5b5 used anywhere.
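For anyone who hasn't met the format: here is what an a1r5g5b5 <-> a8r8g8b8 conversion amounts to in scalar C, assuming the usual replicate-high-bits widening. The AltiVec pack/unpack instructions may handle details like alpha extension differently, so this is only a sketch of the idea.

```c
#include <stdint.h>

/* Unpack one a1r5g5b5 pixel to a8r8g8b8 (assumes 5->8 widening by bit
 * replication and 1-bit alpha expanded to 0 or 255). */
static uint32_t unpack_1555(uint16_t p)
{
    uint32_t a = (p >> 15) ? 0xFFu : 0x00u;
    uint32_t r = (p >> 10) & 0x1F;
    uint32_t g = (p >>  5) & 0x1F;
    uint32_t b =  p        & 0x1F;
    r = (r << 3) | (r >> 2);
    g = (g << 3) | (g >> 2);
    b = (b << 3) | (b >> 2);
    return (a << 24) | (r << 16) | (g << 8) | b;
}

/* Pack one a8r8g8b8 pixel back to a1r5g5b5 by truncating each channel. */
static uint16_t pack_1555(uint32_t argb)
{
    uint16_t a = (uint16_t)((argb >> 31) & 0x1);
    uint16_t r = (uint16_t)((argb >> 19) & 0x1F);
    uint16_t g = (uint16_t)((argb >> 11) & 0x1F);
    uint16_t b = (uint16_t)((argb >>  3) & 0x1F);
    return (uint16_t)((a << 15) | (r << 10) | (g << 5) | b);
}
```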
Hey friends!
For folks interested in #RISCV, and especially #RVV, here's some information on the #tenstorrent in house designed CPU!
At a high level: the vector unit is 2x256-bit, with full RVV 1.0 as well as a fair few of the optional extras on top of RVV 1.0!
Phoronix article here: https://www.phoronix.com/news/LLVM-20-Tenstorrent-Ascalon
LLVM patches here: https://github.com/llvm/llvm-project/pull/115100
One Pager: https://cdn.sanity.io/files/jpb4ed5r/production/6a28f7d59b6d1300fccdbdd394e192a4fd5f54c6.pdf
C++26 will have data-parallel types (or std::simd as it came to be known; unless we rename it next meeting — don't settle in for the name just yet)
#cpp #cplusplus #cpp26 #simd
If you work with SIMD and wonder how it looks on other architectures, VectorCamp has launched a website that helps.
On https://simd.info/ you can look up which intrinsics are available on Arm, Power, and x86-64 (RISC-V RVV will be there too), compare them, and so on.
There is a search function, a tree of operations, and links to the official documentation.
Jeroen Koekkoek, one of our lead developers, has collaborated with @lemire to create a blazingly fast #DNS zone file parser that is now part of our authoritative nameserver NSD.
They have now published a paper outlining how they enhanced parsing throughput using data parallelism, specifically Single Instruction Multiple Data (#SIMD) instructions available on commodity processors. #programming https://www.authorea.com/1222979
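As a generic illustration of the kind of data parallelism involved (not NSD's actual parser), a single SSE2 compare-and-movemask can classify 16 input bytes at a time, for example to locate field and record delimiters; the delimiter set below is just an example.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Classify 16 input bytes at once and return a bitmask of delimiter
 * positions (bit i set -> p[i] is a delimiter). Illustration only; NSD's
 * zone parser uses its own character classes and wider vectors where
 * available. */
static uint32_t delimiter_mask16(const uint8_t *p)
{
    __m128i chunk   = _mm_loadu_si128((const __m128i *)p);
    __m128i newline = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\n'));
    __m128i space   = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(' '));
    __m128i tab     = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\t'));
    __m128i any     = _mm_or_si128(newline, _mm_or_si128(space, tab));
    return (uint32_t)_mm_movemask_epi8(any);
}
```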
European #GNURadio Days this week. (It's just a few steps from my regular office at #GSI_Helmholtzzentrum_für_Schwerionenforschung.) This week has a focus on GNURadio 4, which was developed by colleagues at #FAIR/#GSI. I'm happy that I was able to contribute a small part to the design and implementation of the new core. And this new core makes use of `stdx::simd` and https://github.com/mattkretz/vir-simd. I will talk about the #SIMD parts later today (1:30 pm CEST) and you can tune in at
https://www.youtube.com/watch?v=8xnPsPdy5AQ
It's a new release of lcrq!
lcrq now makes use of a CPU dispatcher to detect the available SIMD instruction sets at runtime, ensuring that the code runs as fast as possible on the target machine.
Thanks to @nlnet and #NGIAssure for funding this work.
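A minimal sketch of what such a runtime dispatcher can look like on x86 with the GCC/Clang CPU-detection builtins; lcrq's actual dispatcher may be structured differently and also covers other instruction sets.

```c
#include <stdio.h>

/* Pick the best available implementation once at startup, then call it
 * through a function pointer. The three encode_* stubs below are placeholders
 * standing in for real scalar/SIMD code paths. */
static void encode_scalar(void) { puts("scalar path"); }
static void encode_ssse3(void)  { puts("SSSE3 path"); }
static void encode_avx2(void)   { puts("AVX2 path"); }

typedef void (*encode_fn)(void);

static encode_fn pick_encoder(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return encode_avx2;
    if (__builtin_cpu_supports("ssse3"))
        return encode_ssse3;
    return encode_scalar;
}

int main(void)
{
    encode_fn encode = pick_encoder();  /* selected once, then reused */
    encode();
    return 0;
}
```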