SSE
Streaming SIMD Extensions (SSE)
Introduction
SSE was introduced with the Pentium III and added 70 instructions to the Intel instruction set. SSE instructions can increase data throughput because they are Single Instruction, Multiple Data (SIMD) instructions: a single instruction applies the same operation to several data elements in parallel.
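As a rough illustration of the SIMD idea, the following C sketch adds four pairs of floats with one packed addition. It assumes a GCC or Clang toolchain with the standard `<xmmintrin.h>` intrinsics; the helper name `add4` is made up for this example.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Adds four pairs of floats with a single packed addition (ADDPS).
 * add4 is a hypothetical helper name used only for this illustration. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va  = _mm_loadu_ps(a);      /* unaligned 16-byte load  */
    __m128 vb  = _mm_loadu_ps(b);      /* unaligned 16-byte load  */
    __m128 sum = _mm_add_ps(va, vb);   /* four additions at once  */
    _mm_storeu_ps(out, sum);           /* unaligned 16-byte store */
}
```

Calling `add4(a, b, r)` with two four-element arrays fills `r` with the element-wise sums; a scalar loop would need four separate additions for the same result.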
SSE comes with 8 XMM registers (XMM0-XMM7; 16 in 64-bit mode, XMM0-XMM15), each 128 bits wide. Certain instructions (movaps and movups from SSE, movdqa and movdqu from SSE2, etc.) can load 16 bytes from memory or store 16 bytes to memory in a single operation. SSE and its successors also provide non-temporal hint instructions (such as movntps, movntdq and, from SSE4.1, movntdqa) that let data which is only touched once bypass the caches, so that those accesses do not pollute the small on-chip caches.
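A minimal sketch of a 16-byte-at-a-time copy with a non-temporal store hint is shown below. It assumes GCC or Clang with `<xmmintrin.h>`, uses `_mm_stream_ps` (which maps to MOVNTPS, the packed-float non-temporal store from the original SSE set), and requires the destination to be 16-byte aligned. The function name `stream_copy` is hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/* Copies n floats in 16-byte chunks using non-temporal stores.
 * Assumptions for this sketch: n is a multiple of 4 and dst is 16-byte
 * aligned (MOVNTPS faults on an unaligned address); src may be unaligned. */
void stream_copy(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  /* load 16 bytes              */
        _mm_stream_ps(dst + i, v);         /* store 16 bytes, cache hint */
    }
    _mm_sfence();  /* make the weakly ordered stores visible before returning */
}
```

Whether this is faster than ordinary stores depends on the workload; it tends to pay off only for large buffers that will not be read again soon.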
Since this change added new registers, it is disabled by default as the typical operating system of that time was not yet able to save those registers on a task switch. To support SSE, you will need to implement separate code paths for saving and restoring SSE state (as those instructions will cause an exception on processors that do not support it), and handlers for the new exceptions. After that, you can tell the CPU to enable SSE use in userland tasks.
Checking for SSE
To check for SSE, test whether CPUID.01h:EDX.SSE[bit 25] is set:
mov eax, 0x1
cpuid
test edx, 1<<25
jz .noSSE
;SSE is available
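The same check can be sketched in C. This version assumes GCC or Clang, whose `<cpuid.h>` header provides `__get_cpuid`; the function name `has_sse` is an illustrative choice.

```c
#include <cpuid.h>  /* GCC/Clang helper: __get_cpuid */

/* Returns 1 if CPUID.01h:EDX bit 25 (SSE) is set, 0 otherwise. */
int has_sse(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 1 is not available */

    return (edx >> 25) & 1;       /* isolate the SSE feature bit   */
}
```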
Adding support
In order to allow SSE instructions to be executed without generating a #UD, we need to alter the CR0 and CR4 registers.
- clear the CR0.EM bit (bit 2) [ CR0 &= ~(1 << 2) ]
- set the CR0.MP bit (bit 1) [ CR0 |= (1 << 1) ]
- set the CR4.OSFXSR bit (bit 9) [ CR4 |= (1 << 9) ]
- set the CR4.OSXMMEXCPT bit (bit 10) [ CR4 |= (1 << 10) ]
Here is an asm example:
;now enable SSE and the like
mov eax, cr0
and ax, 0xFFFB ;clear coprocessor emulation CR0.EM
or ax, 0x2 ;set coprocessor monitoring CR0.MP
mov cr0, eax
mov eax, cr4
or ax, 3 << 9 ;set CR4.OSFXSR and CR4.OSXMMEXCPT at the same time
mov cr4, eax
ret
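The bit manipulation itself can also be written as plain C. The sketch below only computes the new register values; actually reading and writing CR0 and CR4 needs ring 0 and is left to the assembly above. The macro and function names are made up for this example.

```c
#include <stdint.h>

/* Bit positions as listed above. */
#define CR0_MP          (1u << 1)   /* monitor coprocessor              */
#define CR0_EM          (1u << 2)   /* x87 emulation, must be cleared   */
#define CR4_OSFXSR      (1u << 9)   /* OS supports FXSAVE/FXRSTOR       */
#define CR4_OSXMMEXCPT  (1u << 10)  /* OS handles SIMD FP exceptions    */

/* Returns the CR0 value to write back: EM cleared, MP set. */
uint32_t cr0_for_sse(uint32_t cr0)
{
    cr0 &= ~CR0_EM;
    cr0 |= CR0_MP;
    return cr0;
}

/* Returns the CR4 value to write back: OSFXSR and OSXMMEXCPT set. */
uint32_t cr4_for_sse(uint32_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}
```

Keeping the computation separate from the privileged register access makes it easy to unit-test outside the kernel.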
FXSAVE and FXRSTOR
FXSAVE and FXRSTOR save and restore the complete SSE, x87 FPU, and MMX state to and from memory. The host needs to allocate 512 bytes of storage and pass a pointer to that memory as the operand of FXSAVE or FXRSTOR. Before using either instruction, make sure to check the CPUID features for the FXSR bit. Also, as with most SSE instructions, the memory operand needs to be 16-byte aligned or a #GP exception will occur. Remember to execute FXSAVE *before* any MXCSR modifications happen, or else the register will most likely get overwritten or set to 0 based on the unknown state of the MXCSR_MASK.
Example usage:
char fxsave_region[512] __attribute__((aligned(16)));
/* FXSAVE writes to memory, so the region is an output operand ("=m") */
asm volatile("fxsave %0" : "=m"(fxsave_region));
or in asm:
segment .code
SaveFloats:
fxsave [SavedFloats]
segment .data
align 16
SavedFloats: TIMES 512 db 0
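Pitfall: with a single static save area like the ones above, only one level of saving is supported, because a second FXSAVE overwrites the previously saved state.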
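As a hedged sketch of how the save area can be inspected, the code below checks the FXSR bit (CPUID.01h:EDX bit 24), executes FXSAVE into an aligned buffer, and reads the MXCSR and MXCSR_MASK fields, which sit at byte offsets 24 and 28 of the FXSAVE image. It assumes GCC or Clang on x86-64; the type and function names are invented for this example.

```c
#include <stdint.h>
#include <string.h>
#include <cpuid.h>

/* 512-byte FXSAVE image; the attribute guarantees 16-byte alignment. */
struct fx_area {
    uint8_t bytes[512];
} __attribute__((aligned(16)));

/* Returns 1 if CPUID.01h:EDX bit 24 (FXSR) is set. */
int has_fxsr(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 24) & 1;
}

/* Saves the current x87/MMX/SSE state and returns the saved MXCSR. */
uint32_t fxsave_read_mxcsr(struct fx_area *area)
{
    uint32_t mxcsr;
    __asm__ volatile("fxsave %0" : "=m"(*area));
    memcpy(&mxcsr, area->bytes + 24, sizeof mxcsr);  /* MXCSR field */
    return mxcsr;
}

/* Returns the MXCSR_MASK field of an already saved image. */
uint32_t fx_area_mxcsr_mask(const struct fx_area *area)
{
    uint32_t mask;
    memcpy(&mask, area->bytes + 28, sizeof mask);    /* MXCSR_MASK field */
    return mask;
}
```

In a kernel, one such area per task (rather than one global buffer) avoids the single-level limitation, since every task then owns its own saved image.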
MXCSR and its helpers LDMXCSR and STMXCSR
The MXCSR register holds all of the masking and flag information for use with SSE floating-point operations. Just like the x87 FPU control word, if you would like to mask certain exceptions from occurring or would like to specify rounding types, MXCSR will need to be modified. Bits 16-31 are reserved and will cause a #GP exception if set. LDMXCSR and STMXCSR load and store the MXCSR register respectively. They both require a 32-bit memory operand. SSE support needs to already be set up before using either of these instructions (CR4.OSFXSR = 1, CR0.EM = 0, and CR0.TS = 0). If bits 7-12 are set, all SSE floating-point exceptions are masked. Bits 0-5 are exception status flags that are set if the corresponding exception has occurred. Bits 13-14 are the RC (Rounding Control) bits. RC:0 = to nearest, RC:1 = down, RC:2 = up, RC:3 = truncate.
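The rounding-control field can be exercised from C with a small sketch like the one below. It assumes GCC or Clang inline assembly on x86-64 (AT&T operand order) and wraps STMXCSR, LDMXCSR and CVTSS2SI directly; all function and macro names are illustrative rather than part of any library.

```c
#include <stdint.h>

#define MXCSR_RC_SHIFT 13
#define MXCSR_RC_MASK  (3u << MXCSR_RC_SHIFT)   /* bits 13-14 */

/* STMXCSR: store MXCSR to a 32-bit memory operand. */
uint32_t read_mxcsr(void)
{
    uint32_t v;
    __asm__ volatile("stmxcsr %0" : "=m"(v));
    return v;
}

/* LDMXCSR: load MXCSR from a 32-bit memory operand. */
void write_mxcsr(uint32_t v)
{
    __asm__ volatile("ldmxcsr %0" : : "m"(v));
}

/* Returns mxcsr with the RC field replaced by rc (0..3). */
uint32_t mxcsr_with_rounding(uint32_t mxcsr, unsigned rc)
{
    return (mxcsr & ~MXCSR_RC_MASK) | ((uint32_t)(rc & 3u) << MXCSR_RC_SHIFT);
}

/* Converts x to an integer under the given rounding mode, then restores
 * the previous MXCSR so the caller's mode is unaffected. */
int convert_with_rounding(float x, unsigned rc)
{
    uint32_t saved = read_mxcsr();
    int result;

    write_mxcsr(mxcsr_with_rounding(saved, rc));
    __asm__ volatile("cvtss2si %1, %0" : "=r"(result) : "x"(x));
    write_mxcsr(saved);
    return result;
}
```

For example, converting 2.5 yields 2 with RC = 0 (nearest, ties to even), 2 with RC = 1 (down) and 3 with RC = 2 (up).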
Updates to SSE
Later processors have added more instructions for different work to be performed on the vector registers. Supporting them with SSE support in place doesn't require any effort on the part of the OS (except for AVX, see below). The actual user of the instructions should however check if those instructions actually exist.
CPUID bits
SSE2
Streaming SIMD Extensions 2 (SSE2)
The bit for SSE2 can be found on CPUID page 1, in EDX bit 26.
SSE3
Streaming SIMD Extensions 3 (SSE3)
The bit for SSE3 can be found on CPUID page 1, in ECX bit 0.
SSSE3
Supplemental Streaming SIMD Extensions 3 (SSSE3)
The bit for SSSE3 can be found on CPUID page 1, in ECX bit 9.
SSE4
Streaming SIMD Extensions 4 (SSE4)
The bit for SSE4.1 can be found on CPUID page 1, in ECX bit 19
The bit for SSE4.2 can be found on CPUID page 1, in ECX bit 20
The bit for SSE4A (an AMD extension) can be found on CPUID page 0x80000001, in ECX bit 6
SSE5
Streaming SIMD Extensions 5 (SSE5)
SSE5 was planned as one unit, but split into several:
XOP
The bit for XOP can be found on CPUID page 0x80000001, in ECX bit 11
FMA4
The bit for FMA4 can be found on CPUID page 0x80000001, in ECX bit 16
CVT16
The bit for CVT16 can be found on CPUID page 1, in ECX bit 29
AVX
The bit for AVX can be found on CPUID page 1, in ECX bit 28
XSAVE
The bit for XSAVE (needed to manage extended processor states) can be found on CPUID page 1, in ECX bit 26
AVX2
The bit for AVX2 can be found on CPUID page 7, 0, in EBX bit 5
AVX-512
The bits for AVX-512 are in CPUID page 0x0D, 0x0, EAX bits 5-7
AVX-512 is split into separate features that can also be detected in CPUID page 7, 0. Basic support is detected by checking the AVX512F bit (AVX-512 Foundation) in CPUID page 7, 0, EBX bit 16. The other AVX-512 features can be checked through the same CPUID function; the bits are listed [here]
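Several of the leaf 1 and leaf 7 bits listed above can be gathered in one pass. The sketch below assumes GCC or Clang with `<cpuid.h>` (`__get_cpuid` and `__get_cpuid_count`); the struct and function names are hypothetical. It reports CPU capabilities only and says nothing about whether the operating system has enabled the corresponding state.

```c
#include <cpuid.h>  /* __get_cpuid, __get_cpuid_count */

/* One flag per feature; the names are illustrative. */
struct simd_features {
    unsigned sse2    : 1;
    unsigned sse3    : 1;
    unsigned ssse3   : 1;
    unsigned sse4_1  : 1;
    unsigned sse4_2  : 1;
    unsigned xsave   : 1;
    unsigned avx     : 1;
    unsigned avx2    : 1;
    unsigned avx512f : 1;
};

struct simd_features detect_simd_features(void)
{
    struct simd_features f = {0};
    unsigned int eax, ebx, ecx, edx;

    /* CPUID page 1: EDX and ECX feature bits. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2   = (edx >> 26) & 1;
        f.sse3   = (ecx >> 0)  & 1;
        f.ssse3  = (ecx >> 9)  & 1;
        f.sse4_1 = (ecx >> 19) & 1;
        f.sse4_2 = (ecx >> 20) & 1;
        f.xsave  = (ecx >> 26) & 1;
        f.avx    = (ecx >> 28) & 1;
    }

    /* CPUID page 7, subleaf 0: EBX feature bits. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.avx2    = (ebx >> 5)  & 1;
        f.avx512f = (ebx >> 16) & 1;
    }

    return f;
}
```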
X86_64
When the X86-64 architecture was introduced, AMD demanded a minimum level of SSE support to simplify OS code. Any system capable of long mode supports at least SSE and SSE2, which means that the kernel does not need to care about the old FPU save code. X86-64 adds 8 SSE registers (xmm8 - xmm15) to the mix. However, you can only access these in 64-bit mode.
AVX
Advanced Vector Extensions (AVX) is a SIMD (Single Instruction, Multiple Data) instruction set introduced by Intel in 2011.
AVX needs to be enabled by the kernel before being used. Forgetting to do this will raise a #UD on the first AVX instruction. Both SSE and CR4.OSXSAVE (bit 18) must be enabled before enabling AVX; without CR4.OSXSAVE, the XGETBV and XSETBV instructions used below will themselves raise a #UD.
AVX is enabled by setting bit 2 of the XCR0 register. Bit 1 of XCR0 must also be set (indicating SSE support).
Here is an example of assembly code enabling AVX after SSE has been enabled (you should check AVX and XSAVE are supported first, see above):
enable_avx:
push rax
push rcx
push rdx
xor rcx, rcx ;Select XCR0 (ECX = 0)
xgetbv ;Load XCR0 register into EDX:EAX (requires CR4.OSXSAVE = 1)
or eax, 7 ;Set AVX, SSE, X87 bits
xsetbv ;Save back to XCR0
pop rdx
pop rcx
pop rax
ret
To enable AVX-512, set the OPMASK (bit 5), ZMM_Hi256 (bit 6), Hi16_ZMM (bit 7) of XCR0. You must ensure that these bits are valid first (see above).
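The XCR0 bit arithmetic, plus a userland check of what the OS has enabled, can be sketched in C as follows. This assumes GCC or Clang inline assembly on x86-64 and relies on CPUID.01h:ECX bit 27 (OSXSAVE), which reflects CR4.OSXSAVE and tells unprivileged code that XGETBV may be executed. The AVX-512 helper also sets the AVX bits, since the AVX-512 state components build on them. All names are illustrative.

```c
#include <stdint.h>
#include <cpuid.h>

#define XCR0_X87        (1ull << 0)
#define XCR0_SSE        (1ull << 1)
#define XCR0_AVX        (1ull << 2)
#define XCR0_OPMASK     (1ull << 5)
#define XCR0_ZMM_HI256  (1ull << 6)
#define XCR0_HI16_ZMM   (1ull << 7)

/* Value a kernel would write with XSETBV to enable AVX. */
uint64_t xcr0_with_avx(uint64_t xcr0)
{
    return xcr0 | XCR0_X87 | XCR0_SSE | XCR0_AVX;
}

/* Value a kernel would write with XSETBV to enable AVX-512. */
uint64_t xcr0_with_avx512(uint64_t xcr0)
{
    return xcr0_with_avx(xcr0) | XCR0_OPMASK | XCR0_ZMM_HI256 | XCR0_HI16_ZMM;
}

/* Reads XCR0 with XGETBV; only call this when OSXSAVE is reported. */
uint64_t read_xcr0(void)
{
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns 1 if the CPU has AVX and the OS has enabled SSE and AVX state. */
int avx_usable(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!((ecx >> 27) & 1) || !((ecx >> 28) & 1))   /* OSXSAVE and AVX */
        return 0;
    return (read_xcr0() & (XCR0_SSE | XCR0_AVX)) == (XCR0_SSE | XCR0_AVX);
}
```

The two `xcr0_with_*` helpers are pure functions, so the privileged XSETBV write stays in the kernel's assembly while the bit logic can be tested anywhere.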
See Also
- MMX
- The optimisation library of 01000101, containing example code
References
- The Wikipedia article on SSE
- The Wikipedia article on AVX-512