User:01000101/optlib/

Intro

Welcome! This is the information and source for the OptLib (Optimized Library). I took great care in making the source as portable (to different OS's) as possible, this code should also compile (using GCC) as-is into 32-bit or 64-bit objects. There is no support for non-x86(_64) architectures (yet). This code does not follow any particular standards (as you will see added functions that aren't present in the "standard C libraries"), but I try not to make it "incompatible" with existing setups.

This article is orginized into code blocks. Each header file has a set of C functions. Then those C functions are broken into different sub-versions (SSE/non-SSE). Each header can have either functions (full functions) or wrappers (SSE instruction with a C interface).

There are a few "glue" components that need to be present for these to work. "size_t", "bool", "true/false", "uintx_t", and "asm (volatile)" are the core ones. I'd usually set those up as:

typedef unsigned char  uint8_t;
typedef unsigned short int  uint16_t;
typedef unsigned int  uint32_t;
typedef unsigned long  size_t;
typedef uint8_t bool;
#define true 1
#define false 0
#define asm  __asm__
#define volatile  __volatile__

You will also need to enable the FPU and unmask a few extra bits (SSE). There will be a few C wrappers for a few x86 FPU/SSE instructions as they are powerful on their own and it would be more of a waste to try and emulate their functionality.

The "result" area in some functions give my recorded timings using the RDTSC instruction with a serializing CPUID instrucion before it to make sure it's as accurate as possible. Also, these timings are all recorded in one run (both SSE and non-SSE) so the results of each test are able to be compared (as opposed by different runs). Interrupts are disabled at the time of the tests to remove the possibility of timing a full interrupt handler being processed.

License

This code is released into the Public Domain or into whatever state of pure freedom you can think of. With that said, I take no responsibility for damages done or suitability of this code for any purpose. I really only ask that you give back to the article if you gained something from it to begin with.

string.h

string.h is part of the standard C library. It contains the definition of several functions to manipulate C strings and arrays. See more here: [1]

memclr()

memclr() zeros "m_count" bytes from "m_start".

SSE2 version

This is an SSE2 version due to the MOVDQA (Aligned Double-Quadword Move). In the memclr_sse2 function, we aligned the source memory address (if needed) and copy 64-byte chunks per iteration, then cleanup the left-over bytes.

static const void * memclr_sse2(const void * const m_start, const size_t m_count)
{
    // "i" is our counter of how many bytes we've cleared
    size_t i;

    // find out if "m_start" is aligned on a SSE_XMM_SIZE boundary
    if((size_t)m_start & (SSE_XMM_SIZE - 1))
    {
        i = 0;

        // we need to clear byte-by-byte until "m_start" is aligned on an SSE_XMM_SIZE boundary
        // ... and lets make sure we don't copy 'too' many bytes (i < m_count)
        while(((size_t)m_start + i) & (SSE_XMM_SIZE - 1) && i < m_count)
        {
            asm("stosb;" :: "D"((size_t)m_start + i), "a"(0));
            i++;
        }
    }
    else
    {
        // if "m_start" was aligned, set our count to 0
        i = 0;
    }
 
    asm volatile("pxor %%xmm0,%%xmm0"::); // zero out XMM0
    // clear 64-byte chunks of memory (4 16-byte operations)
    for(; i + 64 <= m_count; i += 64)
    {
        asm volatile(" movdqa %%xmm0, 0(%0);	"    // move 16 bytes from XMM0 to %0 + 0
                     " movdqa %%xmm0, 16(%0);	"
                     " movdqa %%xmm0, 32(%0);	"
                     " movdqa %%xmm0, 48(%0);	"
                     :: "r"((size_t)m_start + i));
    }
 
    // copy the remaining bytes (if any)
    asm(" rep stosb; " :: "a"((size_t)(0)), "D"(((size_t)m_start) + i), "c"(m_count - i));

    // "i" will contain the total amount of bytes that were actually transfered
    i += m_count - i;

    // we return "m_start" + the amount of bytes that were transfered
    return (void *)(((size_t)m_start) + i);
}

non-SSE (standard) version

memclr_std() is the non-SSE version of memclr(). We clear all the D-Words (4-bytes) at a time, and then clear the remaining bytes (<4 bytes) using the STOSx (Store String) instruction combined with the REP prefix.

static const void * memclr_std(const void * const m_start, const size_t m_count)
{
    asm("rep stosl;"::"a"(0),"D"((size_t)m_start),"c"(m_count / 4));
    asm("rep stosb;"::"a"(0),"D"(((size_t)m_start) + ((m_count / 4) * 4)),"c"(m_count - ((m_count / 4) * 4)));

    return (void *)((size_t)m_start + m_count);
}

function stub

This is the "memclr" that everyone will call to clear memory. This function calls other sub-functions if various features are available on the CPU. In this case, if the CPU can use SSE2, we use the SSE2 memclr and if not, we use the standard non-SSE memclr. This way, older CPU's can still use this function, but newer CPUs with SSE support get to use a more optimized version.

const void * memclr(const void * const mem, const size_t count)
{
    // return if there is no count
    if(!count){ return mem; }
    // if the CPU is SSE2 capable, we favor the SSE2 version of this function first 
    else if(cpuid_features.SSE2)
    {
        return memclr_sse2(mem, count);
    }
    // if the CPU didn't support SSE2, then we use the generic standard function
    return memclr_std(mem, count);
}

results

10 sets of 100 16-byte aligned 4KB clears [memclr(buf, 4096)]:

Software	non-SSE	SSE2
BOCHS	1055.00	736.00
QEMU	12953.13	6362.38
VirtualBox OSE	43785.66	19031.96

10 sets of 100 16-byte unaligned 4KB clears [memclr(buf+1, 4096)]:

Software	non-SSE	SSE2
BOCHS	1055.00	881.00
QEMU	36143.71	11549.49
VirtualBox OSE	64734.35	23246.49

strlen()

SSE4.2 version

Experimental: UnTested!

static size_t strlen_sse4_2(const char * const str)
{
   size_t index;

   asm(" mov $-16, %0;                      "
       " pxor %%xmm0, %%xmm0;               "
       ".strlen_4_2_start:                  "
       " add $16, %0;                       "
       " pcmpistri $0x08, (%0,%1), %%xmm0;  "
       " jnz .strlen_4_2_start;             "
       " add %2, %0;                        "
       :"=a"(index):"d"((size_t)str),"c"((size_t)str));
 
    return index;
}

non-SSE (standard) version

static size_t strlen_std(const char * const str)
{
    size_t tmp;

    tmp = (size_t)str;
		
    asm(".strlen_start:"
    "    cmpb $0, 0(%1);"
    "	 jz .strlen_done;"
    "    cmpb $0, 1(%1);"
    "	 jz .strlen_1;"
    "    cmpb $0, 2(%1);"
    "	 jz .strlen_2;"
    "    cmpb $0, 3(%1);"
    "	 jz .strlen_3;"
    "    cmpb $0, 4(%1);"
    "	 jz .strlen_4;"
    "    cmpb $0, 5(%1);"
    "	 jz .strlen_5;"
    "    cmpb $0, 6(%1);"
    "	 jz .strlen_6;"
    "    cmpb $0, 7(%1);"
    "	 jz .strlen_7;"
    // if no 0s were found, add 8 and start again 
    "	 add $8, %1;"
    "	 jmp .strlen_start;"
    // here are all the expanded functions that get called once a 0 is found 
    ".strlen_1:"
    "	 inc %1;"
    "	 jmp .strlen_done;"
    ".strlen_2:"
    "	 add $2, %1;"
    "	 jmp .strlen_done;"
    ".strlen_3:"
    "	 add $3, %1;"
    "	 jmp .strlen_done;"
    ".strlen_4:"
    "	 add $4, %1;"
    "	 jmp .strlen_done;"
    ".strlen_5:"
    "	 add $5, %1;"
    "	 jmp .strlen_done;"
    ".strlen_6:"
    "	 add $6, %1;"
    "	 jmp .strlen_done;"
    ".strlen_7:"
    "	 add $7, %1;"
    // output the current value, and exit. 
    ".strlen_done:"
    "	mov %1, %0;"
    :"=r"(tmp):"0"(tmp));

    return tmp - (size_t)str;
}

function stub

size_t strlen(const char * const str)
{
    // check for SSE 4.2
    if(cpuid_features.SSE4_2)
    {
        return strlen_sse4_2(str);
    }

    return strlen_std(str);
}

math.h

wrappers

PADDB/W/D/Q

PADDB/W/D/Q is an SSE2 instruction that ADDs two packed integers. If PADDB is used, each vector is 16 packed bytes that is added together using a single SIMD instruction. These functions return the result in the DST array (the first one provided). Also, PADDx does not handle addition overflows, it just wraps around.

// add two 16 byte packed integers together
void paddb(uint8_t DST[16], const uint8_t v2[16])
{
    // we need an aligned buffer, just in case
    uint8_t SRC[16] __attribute__((aligned(16)));

    // if SRC is not aligned, we need to copy those values into our temp buffer
    if((size_t)SRC & (SSE_XMM_SIZE - 1))
    { memcpy(SRC, v2, 16); }

    // if SRC is not aligned, PADDB will GPF
    asm(" movdqu (%0), %%xmm1;   "   // move DST into XMM1
        " paddb (%1), %%xmm1;    "   // add DST+SRC into XMM1
        " movdqu %%xmm1, (%0);   "   // store XMM1 into DST
        :: "r"((size_t)DST), "r"((size_t)SRC));
}

// add two 8 word packed integers together
void paddw(uint16_t DST[8], const uint16_t v2[8])
{
    // we need an aligned buffer, just in case
    uint16_t SRC[8] __attribute__((aligned(16)));

    // if SRC is not aligned, we need to copy those values into our temp buffer
    if((size_t)SRC & (SSE_XMM_SIZE - 1))
    { memcpy(SRC, v2, 16); }

    // if SRC is not aligned, PADDB will GPF
    asm(" movdqu (%0), %%xmm1;   "   // move DST into XMM1
        " paddw (%1), %%xmm1;    "   // add DST+SRC into XMM1
        " movdqu %%xmm1, (%0);   "   // store XMM1 into DST
        :: "r"((size_t)DST), "r"((size_t)SRC));
}

// add two 4 double-word packed integers together
void paddd(uint32_t DST[4], const uint32_t v2[4])
{
    // we need an aligned buffer, just in case
    uint32_t SRC[4] __attribute__((aligned(16)));

    // if SRC is not aligned, we need to copy those values into our temp buffer
    if((size_t)SRC & (SSE_XMM_SIZE - 1))
    { memcpy(SRC, v2, 16); }

    // if SRC is not aligned, PADDB will GPF
    asm(" movdqu (%0), %%xmm1;   "   // move DST into XMM1
        " paddd (%1), %%xmm1;    "   // add DST+SRC into XMM1
        " movdqu %%xmm1, (%0);   "   // store XMM1 into DST
        :: "r"((size_t)DST), "r"((size_t)SRC));
}

// add two 2 quad-word packed integers together
void paddq(uint64_t DST[2], const uint64_t v2[2])
{
    // we need an aligned buffer, just in case
    uint64_t SRC[2] __attribute__((aligned(16)));

    // if SRC is not aligned, we need to copy those values into our temp buffer
    if((size_t)SRC & (SSE_XMM_SIZE - 1))
    { memcpy(SRC, v2, 16); }

    // if SRC is not aligned, PADDB will GPF
    asm(" movdqu (%0), %%xmm1;   "   // move DST into XMM1
        " paddq (%1), %%xmm1;    "   // add DST+SRC into XMM1
        " movdqu %%xmm1, (%0);   "   // store XMM1 into DST
        :: "r"((size_t)DST), "r"((size_t)SRC));
}

User:01000101/optlib/

Contents

Intro

License

string.h

memclr()

SSE2 version

non-SSE (standard) version

function stub

results

strlen()

SSE4.2 version

non-SSE (standard) version

function stub

math.h

wrappers

PADDB/W/D/Q

Navigation menu

User:01000101/optlib/

Intro

License

string.h

memclr()

SSE2 version

non-SSE (standard) version

function stub

results

strlen()

SSE4.2 version

non-SSE (standard) version

function stub

math.h

wrappers

PADDB/W/D/Q

Navigation menu

Search