Streaming SIMD Extensions (SSE)
SSE was introduced in the Pentium III and added 70 instructions to the Intel instruction set. SSE can increase data throughput through Single Instruction, Multiple Data (SIMD) instructions, which apply the same operation to multiple data elements in parallel.
SSE provides eight 128-bit XMM registers, XMM0-XMM7 (sixteen, XMM0-XMM15, in 64-bit mode). Certain SSE-family instructions (movntdqa, movdqa, movdqu, etc.) can load 16 bytes from memory or store 16 bytes to memory in a single operation; note that movdqa and movdqu arrived with SSE2 and movntdqa with SSE4.1. The SSE family also includes non-temporal hint instructions (movntdqa and movntdq) for data that is accessed only once: they mark the access as non-temporal so that those references do not pollute the small on-chip caches.
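To make the "one instruction, several data elements" idea concrete, here is a minimal sketch in C using the compiler intrinsics from `<xmmintrin.h>`. The function name `add4` is made up for illustration, and the scalar fallback is only there so the example still builds when the compiler is not targeting SSE (detected here through the `__SSE__` macro that GCC and Clang define).

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for four floats. */
void add4(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    __m128 va  = _mm_loadu_ps(a);      /* load 16 bytes, no alignment required */
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);   /* four additions in a single ADDPS */
    _mm_storeu_ps(out, sum);           /* store 16 bytes */
#else
    /* Scalar equivalent: four separate additions. */
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Both branches produce the same result; the SSE branch simply does the four additions with one arithmetic instruction instead of four.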
Because SSE added new registers, it is disabled by default: operating systems of the time could not yet save those registers on a task switch. To support SSE, you need separate code paths for saving and restoring SSE state (the instructions involved raise an exception on processors that do not support them) and handlers for the new exceptions. After that, you can tell the CPU to enable SSE for userland tasks.
Checking for SSE
To check for SSE, CPUID.01h:EDX.SSE[bit 25] needs to be set:
    mov eax, 0x1
    cpuid
    test edx, 1<<25
    jz .noSSE
    ;SSE is available
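The same check can be sketched in C. This is only an illustration: the helper names are hypothetical, and `cpu_has_sse` assumes a GCC or Clang toolchain that ships `<cpuid.h>` with its `__get_cpuid` wrapper. The bit test itself is kept in a separate pure function so it can be exercised on any machine.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID1_EDX_FXSR (1u << 24)  /* FXSAVE/FXRSTOR supported */
#define CPUID1_EDX_SSE  (1u << 25)  /* SSE supported */

/* Pure helpers: decide from the EDX value returned by CPUID leaf 1. */
bool edx_reports_sse(uint32_t edx)
{
    return (edx & CPUID1_EDX_SSE) != 0;
}

bool edx_reports_fxsr(uint32_t edx)
{
    return (edx & CPUID1_EDX_FXSR) != 0;
}

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>  /* GCC/Clang wrapper around the CPUID instruction */

/* Query the running CPU; returns false if leaf 1 is not available. */
bool cpu_has_sse(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_reports_sse(edx);
}
#endif
```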
In order to allow SSE instructions to be executed without generating a #UD, we need to alter the CR0 and CR4 registers.
- clear the CR0.EM bit (bit 2) [ CR0 &= ~(1 << 2) ]
- set the CR0.MP bit (bit 1) [ CR0 |= (1 << 1) ]
- set the CR4.OSFXSR bit (bit 9) [ CR4 |= (1 << 9) ]
- set the CR4.OSXMMEXCPT bit (bit 10) [ CR4 |= (1 << 10) ]
Here is an asm example:
    ;now enable SSE and the like
    mov eax, cr0
    and ax, 0xFFFB      ;clear coprocessor emulation CR0.EM
    or ax, 0x2          ;set coprocessor monitoring CR0.MP
    mov cr0, eax
    mov eax, cr4
    or ax, 3 << 9       ;set CR4.OSFXSR and CR4.OSXMMEXCPT at the same time
    mov cr4, eax
    ret
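For kernels written mostly in C, the bit manipulation can be separated from the privileged register access. The sketch below only computes the new register values; actually reading and writing CR0 and CR4 still needs privileged inline assembly, which is left out here. The function and macro names are hypothetical, and the registers are treated as 32-bit values for simplicity.

```c
#include <stdint.h>

#define CR0_MP          (1u << 1)   /* monitor coprocessor */
#define CR0_EM          (1u << 2)   /* x87 emulation */
#define CR4_OSFXSR      (1u << 9)   /* OS supports FXSAVE/FXRSTOR */
#define CR4_OSXMMEXCPT  (1u << 10)  /* OS handles SIMD FP exceptions */

/* Return the CR0 value with EM cleared and MP set. */
uint32_t cr0_for_sse(uint32_t cr0)
{
    cr0 &= ~CR0_EM;
    cr0 |= CR0_MP;
    return cr0;
}

/* Return the CR4 value with OSFXSR and OSXMMEXCPT set. */
uint32_t cr4_for_sse(uint32_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT;
}
```

All other bits are passed through unchanged, which matters because CR0 and CR4 also control paging, protection and other features.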
FXSAVE and FXRSTOR
FXSAVE and FXRSTOR save and load the complete SSE, x87 FPU, and MMX state to and from memory. The host needs to allocate 512 bytes for the storage and pass a pointer to that memory as the operand of FXSAVE or FXRSTOR. Before using either instruction, check the CPUID features for the FXSR bit. Also, like most SSE instructions, the memory operand needs to be 16-byte aligned or a #GP exception will occur. Remember to execute FXSAVE *before* any MXCSR modifications happen, or else the register will most likely get overwritten or set to 0, depending on the unknown state of the MXCSR_MASK.
    char fxsave_region[512] __attribute__((aligned(16)));
    asm volatile("fxsave %0" : "=m"(fxsave_region));
or in asm:
    segment .code
    SaveFloats:
        fxsave [SavedFloats]

    segment .data
    align 16
    SavedFloats: TIMES 512 db 0
Pitfalls: with a single save area like the ones above, only one level of saving is supported.
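One way around the single-buffer limitation is to give every task its own save area. The following is a sketch under assumptions: `struct task` and its fields are hypothetical, and the field grouping of `struct fxsave_area` is simplified (the instruction and data pointer fields are lumped into one byte array because their format depends on the operating mode). The offsets of MXCSR, MXCSR_MASK and the register blocks follow the legacy 512-byte FXSAVE image.

```c
#include <stddef.h>
#include <stdint.h>

/* Legacy FXSAVE image, 512 bytes, must be 16-byte aligned. */
struct fxsave_area {
    uint16_t fcw;            /* offset 0: x87 control word */
    uint16_t fsw;            /* offset 2: x87 status word */
    uint8_t  ftw;            /* offset 4: abridged x87 tag word */
    uint8_t  reserved0;
    uint16_t fop;            /* offset 6: last x87 opcode */
    uint8_t  fip_fdp[16];    /* offset 8: instruction/data pointers */
    uint32_t mxcsr;          /* offset 24 */
    uint32_t mxcsr_mask;     /* offset 28 */
    uint8_t  st_mm[8][16];   /* offset 32: ST0-7 / MM0-7 */
    uint8_t  xmm[16][16];    /* offset 160: XMM0-15 (8-15 in 64-bit mode only) */
    uint8_t  reserved1[96];  /* offset 416 */
} __attribute__((aligned(16)));

_Static_assert(sizeof(struct fxsave_area) == 512, "FXSAVE image is 512 bytes");

/* Hypothetical task structure: each task carries its own FPU/SSE state. */
struct task {
    int id;
    struct fxsave_area fpu;  /* aligned(16) is inherited from the type */
};
```

On a task switch the kernel would then FXSAVE into the outgoing task's `fpu` member and FXRSTOR from the incoming one, so saved states never overwrite each other.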
MXCSR and its helpers LDMXCSR and STMXCSR
The MXCSR register holds all of the masking and flag information for use with SSE floating-point operations. Just like the x87 FPU control word, MXCSR needs to be modified if you would like to mask certain exceptions or select a rounding mode. LDMXCSR loads MXCSR from memory and STMXCSR stores it to memory; both require a 32-bit memory operand. SSE support needs to already be set up before using either of these instructions (CR4.OSFXSR = 1, CR0.EM = 0, and CR0.TS = 0). The register is laid out as follows:
- Bits 0-5 are exception status flags that are set if the corresponding exception has occurred.
- Bits 7-12 are the exception mask bits. If all of them are set, all SSE floating-point exceptions are masked.
- Bits 13-14 are the RC (Rounding Control) bits. RC:0 = to nearest, RC:1 = down, RC:2 = up, RC:3 = truncate.
- Bits 16-31 are reserved and will cause a #GP exception if set.
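The bit layout described above can be captured in a few C helpers. This is an illustrative sketch: all names are hypothetical, and the helpers only compute values. Loading the result into the register would still be done with LDMXCSR. The value 0x1F80 used in the usage note is the processor's reset default for MXCSR.

```c
#include <stdbool.h>
#include <stdint.h>

#define MXCSR_STATUS_FLAGS  0x003Fu       /* bits 0-5: exception status flags */
#define MXCSR_EXC_MASKS     0x1F80u       /* bits 7-12: exception mask bits */
#define MXCSR_RC_SHIFT      13
#define MXCSR_RC_MASK       (3u << MXCSR_RC_SHIFT)  /* bits 13-14 */
#define MXCSR_RESERVED      0xFFFF0000u   /* bits 16-31: must stay zero */

enum mxcsr_rc {
    MXCSR_RC_NEAREST  = 0,
    MXCSR_RC_DOWN     = 1,
    MXCSR_RC_UP       = 2,
    MXCSR_RC_TRUNCATE = 3
};

/* Replace the rounding-control field, leaving all other bits alone. */
uint32_t mxcsr_set_rounding(uint32_t mxcsr, enum mxcsr_rc rc)
{
    return (mxcsr & ~MXCSR_RC_MASK) | ((uint32_t)rc << MXCSR_RC_SHIFT);
}

/* True if every SSE floating-point exception is masked. */
bool mxcsr_all_masked(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_EXC_MASKS) == MXCSR_EXC_MASKS;
}

/* True if loading this value would fault because a reserved bit is set. */
bool mxcsr_has_reserved_bits(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_RESERVED) != 0;
}
```

For example, `mxcsr_set_rounding(0x1F80, MXCSR_RC_TRUNCATE)` yields 0x7F80: all exceptions stay masked and only the two RC bits change.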
Updates to SSE
Later processors have added more instructions for different work to be performed on the vector registers. Supporting them with SSE support in place doesn't require any effort on the part of the OS. The actual user of the instructions should however check if those instructions actually exist.
Streaming SIMD Extensions 2 (SSE2)
The bit for SSE2 can be found on CPUID page 1, in EDX bit 26.
Streaming SIMD Extensions 3 (SSE3)
The bit for SSE3 can be found on CPUID page 1, in ECX bit 0.
Supplemental Streaming SIMD Extensions 3 (SSSE3)
The bit for SSSE3 can be found on CPUID page 1, in ECX bit 9.
Streaming SIMD Extensions 4 (SSE4)
The bit for SSE4.1 can be found on CPUID page 1, in ECX bit 19.
The bit for SSE4.2 can be found on CPUID page 1, in ECX bit 20.
The bit for SSE4A (AMD only) can be found on CPUID page 0x80000001, in ECX bit 6.
Streaming SIMD Extensions 5 (SSE5)
SSE5 was planned as one unit, but split into several:
The bit for XOP can be found on CPUID page 0x80000001, in ECX bit 11.
The bit for FMA4 can be found on CPUID page 0x80000001, in ECX bit 16.
The bit for CVT16 can be found on CPUID page 1, in ECX bit 29
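The bit for AVX can be found on CPUID page 1, in ECX bit 28. Unlike the other extensions listed here, AVX adds register state (the 256-bit YMM registers), so the OS must enable it through XSAVE and XCR0 before it can be used.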
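The feature bits listed in the sections above can be gathered into one decoding routine. This is a sketch with hypothetical names: it takes the ECX and EDX values returned by CPUID page 1 and the ECX value returned by the extended page 0x80000001 (where the AMD-specific SSE4A, XOP and FMA4 bits are reported), and leaves the actual CPUID calls to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the SSE-family feature bits. */
struct sse_features {
    bool sse2, sse3, ssse3, sse41, sse42;  /* CPUID page 1 */
    bool avx, cvt16;                       /* CPUID page 1 */
    bool sse4a, xop, fma4;                 /* CPUID page 0x80000001 (AMD) */
};

/* Test a single bit of a register value. */
static bool bit_set(uint32_t value, unsigned bit)
{
    return ((value >> bit) & 1u) != 0;
}

struct sse_features decode_sse_features(uint32_t leaf1_ecx,
                                        uint32_t leaf1_edx,
                                        uint32_t ext1_ecx)
{
    struct sse_features f;
    f.sse2  = bit_set(leaf1_edx, 26);
    f.sse3  = bit_set(leaf1_ecx, 0);
    f.ssse3 = bit_set(leaf1_ecx, 9);
    f.sse41 = bit_set(leaf1_ecx, 19);
    f.sse42 = bit_set(leaf1_ecx, 20);
    f.avx   = bit_set(leaf1_ecx, 28);  /* CPU support only, not OS enablement */
    f.cvt16 = bit_set(leaf1_ecx, 29);
    f.sse4a = bit_set(ext1_ecx, 6);
    f.xop   = bit_set(ext1_ecx, 11);
    f.fma4  = bit_set(ext1_ecx, 16);
    return f;
}
```

A program that wants to use one of these extensions would call this once at startup and select its code paths from the result.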
When the x86-64 architecture was introduced, AMD required a minimum level of SSE support to simplify OS code. Any system capable of long mode supports at least SSE and SSE2, which means that the kernel does not need to care about the old FPU save code. x86-64 also adds 8 SSE registers (xmm8 - xmm15) to the mix. However, you can only access these in 64-bit mode.