AVX2
Devc1, unfinished.
Advanced Vector Extensions
AVX (Advanced Vector Extensions) is a set of extensions to the x86 architecture introduced by Intel with the [SandyBridge] micro-architecture. AVX adds 86 instructions to the CPU instruction set and extends the 128-bit XMM registers to 256-bit YMM registers. Each XMM register aliases the lower half of the corresponding YMM register, so XMMx contains the low 128 bits of YMMx. AVX therefore doubles the width of vector memory transfers and of parallel floating-point computations. Used effectively, these extensions can greatly increase the performance of your program.
The SIMD extensions (MMX/SSE/AVX...) can perform packed and scalar operations on multiple data types: single-precision floats (float), double-precision floats (double), bytes, words, DWORDs and QWORDs.
Scalar and Packed computations
Scalar computations
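- In SIMD, a scalar computation (arithmetic, bit manipulation...) acts only on the low value of the SIMD register, and applies only to floating-point values. A scalar computation does not clear the remaining elements of the XMM register: the legacy SSE forms leave the upper elements of the destination unchanged, while the VEX-encoded (AVX) forms copy them from the first source operand and zero bits 128 and above of the YMM register.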
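The scalar behaviour described above can be sketched in C with the SSE intrinsic `_mm_add_ss`. This is only an illustration: the helper name `add_low_only` is hypothetical and not part of any library.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ss, ... */

/* Hypothetical helper: out[0] = a[0] + b[0]; out[1..3] are copied from a. */
void add_low_only(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    __m128 r  = _mm_add_ss(va, vb);  /* scalar add: only the low element is summed */
    _mm_storeu_ps(out, r);           /* upper three elements come from va */
}
```

Calling it with a = {1, 2, 3, 4} and b = {10, 20, 30, 40} sums only the first element; the other three outputs are taken from a unchanged.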
Packed computations
- Packed computations operate in parallel on several values held in one SIMD register (XMM/YMM/ZMM). The number of values processed at once is the register width divided by the element width. For example, you can add 2 QWORDs at once in an XMM register, 4 QWORDs in a YMM register, and 4 DWORDs in an XMM register.
Example: here we use AVX to multiply 8 floats at once. The assembly function Mul8floats multiplies the 8 values in Dest by the 8 values in Src and stores the result in Dest.
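To make the lane arithmetic concrete, here is a small C sketch using SSE2 integer intrinsics on 128-bit XMM registers. The helper names `add4_dwords` and `add2_qwords` are hypothetical, chosen only for this illustration.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: one XMM add handles 128 / 32 = 4 DWORDs. */
void add4_dwords(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Hypothetical helper: the same register holds 128 / 64 = 2 QWORDs. */
void add2_qwords(const int64_t a[2], const int64_t b[2], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```

The register width is the same in both functions; only the element width changes, and with it the number of values handled per instruction.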
; extern void Mul8floats(float* Dest, float* Src)
; Assuming the Microsoft x64 calling convention: RCX = Dest, RDX = Src
Mul8floats:
    vmovups ymm0, [rcx]         ; Load 8 floats from Dest
    vmovups ymm1, [rdx]         ; Load 8 floats from Src
    vmulps  ymm0, ymm0, ymm1    ; Packed multiply of 8 floats in ymm0 by 8 floats in ymm1, result in ymm0
    vmovups [rcx], ymm0         ; Store the result in memory (float* Dest)
    vzeroupper                  ; Clear the upper YMM halves to avoid SSE/AVX transition penalties
    ret
Result on MSVC with intrinsics:
Mul 134217728 values, normal : 261.955000ms , AVX : 34.871000ms
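The intrinsics source behind the timing above is not shown, so the following is only a sketch of what a C intrinsics version might look like, not the code that produced those numbers. It is written for GCC or Clang: the `target("avx")` attribute and `__builtin_cpu_supports` are compiler-specific and would need replacing on MSVC (for example by building with /arch:AVX). The names `mul8floats_avx` and `mul_floats` are hypothetical.

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>

/* Same job as the assembly routine: dest[i] *= src[i] for i in 0..7. */
__attribute__((target("avx")))
static void mul8floats_avx(float *dest, const float *src)
{
    __m256 d = _mm256_loadu_ps(dest);             /* unaligned load of 8 floats */
    __m256 s = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dest, _mm256_mul_ps(d, s));  /* packed multiply, then store */
}

/* Multiplies n floats. Blocks of 8 go through AVX when the CPU reports
   support; any remainder (or everything, without AVX) uses a scalar loop. */
void mul_floats(float *dest, const float *src, size_t n)
{
    size_t i = 0;
    if (__builtin_cpu_supports("avx")) {
        for (; i + 8 <= n; i += 8)
            mul8floats_avx(dest + i, src + i);
    }
    for (; i < n; i++)
        dest[i] *= src[i];
}
```

The runtime check keeps the sketch safe on CPUs without AVX. A real benchmark would more likely assume AVX and call the 8-wide routine directly.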
AVX2
AVX2 expands the AVX instruction set. It extends the SSE vector integer instructions to 256 bits and adds gather support, vector shifts and more.
Integer Arithmetic
To introduce AVX2, we should see how these registers work. Take for example the instruction family VPADDB/W/D/Q:
; Available in AVX
vmulpd ymm0, ymm0, ymm1    ; Floating-point computation
vpaddb xmm0, xmm1, xmm2    ; Add the 16 byte integers in XMM1 to those in XMM2 and store the result in XMM0
; Requires AVX2
vpaddb ymm0, ymm1, ymm2    ; Add the 32 byte integers in YMM1 to those in YMM2 and store the result in YMM0
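For comparison, the 256-bit `vpaddb` form corresponds to the intrinsic `_mm256_add_epi8`. The sketch below assumes GCC or Clang (for the `target("avx2")` attribute and `__builtin_cpu_supports`) and falls back to a plain loop when AVX2 is not reported. The function names `add32_bytes_avx2` and `add32_bytes` are hypothetical.

```c
#include <immintrin.h>  /* AVX/AVX2 intrinsics */
#include <stdint.h>

/* Adds 32 byte lanes at once; each lane wraps on overflow, as vpaddb does. */
__attribute__((target("avx2")))
static void add32_bytes_avx2(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi8(va, vb));
}

/* Uses the AVX2 path when the CPU reports it, a byte loop otherwise. */
void add32_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    if (__builtin_cpu_supports("avx2")) {
        add32_bytes_avx2(a, b, out);
        return;
    }
    for (int i = 0; i < 32; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
}
```

Both paths produce the same wrapped result, so a byte holding 248 plus 10 yields 2 rather than saturating.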