This is version 14, the latest and best.
Overview:

Note that my speed on same-aligned copies (dst and src share the same alignment, but are not word aligned) is faster than uclinux's speed on word-aligned copies, once the copy length is long enough.
Overview of copy length to 64 bytes:

Graphs show copy lengths from 0 to 64 bytes. Click any graph to see it zoomed out to copy lengths of 0 to 256 bytes, or click here to see all graphs at 0-256 bytes.
(Raw data can be found here; raw data sorted into groups by type of alignment, here. Each raw data line is in the form:
dst address % 4, src address % 4, copy length, microseconds (usec) for C memcpy, usec for kernel memcpy, usec for uclinux memcpy, usec for my memcopy, the string "CKUE")
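
For anyone post-processing the raw data, here is a minimal sketch of reading one line in the form described above and deciding which alignment group it belongs to. It assumes the fields are separated by whitespace and/or commas, which the description does not actually specify. The struct, enum, and function names are made up for illustration and are not from the benchmark code.

```c
#include <stdio.h>
#include <string.h>

/* One raw data line. Field names are illustrative only. */
struct sample {
    unsigned dst_mod4;      /* dst address % 4 */
    unsigned src_mod4;      /* src address % 4 */
    unsigned length;        /* copy length in bytes */
    double usec_c;          /* usec for C memcpy */
    double usec_kernel;     /* usec for kernel memcpy */
    double usec_uclinux;    /* usec for uclinux memcpy */
    double usec_mine;       /* usec for my memcopy */
};

enum align_group { WORD_ALIGNED, SAME_ALIGNED, MIS_ALIGNED };

/* Parse one line. ASSUMPTION: fields are separated by spaces, tabs,
 * and/or commas. Returns 1 on success, 0 if the line does not match. */
static int parse_sample(const char *line, struct sample *s)
{
    char tag[8];
    int n = sscanf(line,
                   "%u%*[ ,\t]%u%*[ ,\t]%u%*[ ,\t]"
                   "%lf%*[ ,\t]%lf%*[ ,\t]%lf%*[ ,\t]%lf%*[ ,\t]%7s",
                   &s->dst_mod4, &s->src_mod4, &s->length,
                   &s->usec_c, &s->usec_kernel,
                   &s->usec_uclinux, &s->usec_mine, tag);

    if (n != 8)
        return 0;
    if (strcmp(tag, "CKUE") != 0)       /* every line ends with "CKUE" */
        return 0;
    if (s->dst_mod4 > 3 || s->src_mod4 > 3)
        return 0;
    return 1;
}

/* Group by alignment: both word aligned, same (mis)alignment, or different. */
static enum align_group classify(const struct sample *s)
{
    if (s->dst_mod4 != s->src_mod4)
        return MIS_ALIGNED;
    return s->dst_mod4 == 0 ? WORD_ALIGNED : SAME_ALIGNED;
}
```

Feeding each line of the raw data file through parse_sample() and then classify() should reproduce a grouping by type of alignment, provided the separator assumption holds for the actual files.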
Word aligned copies (dst and src both are word aligned):

Same-aligned copies (dst and src are both not word aligned, but have the same misalignment):



Mis-aligned copies (dst and src have different alignments), in three types: