
Re: Wrapping of Julian's code more or less completed.



Peter,

>I have finished wrapping Julian's Athlon kernel into a .c file using gcc
>inline assembly. It provides all four precisions and does N cleanup. N is
>always read at runtime.
>
>I have not looked at the prefetching, so that stuff is still only
>optimized for 30x30 dgemm, but hopefully it does not make too much of a
>difference.
>
>Please test it thoroughly for speed, since I have a hard time testing it
>properly over my 56k modem.
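
For anyone following along, here is a rough plain-C sketch of what "N cleanup" with a runtime N can look like. It is not Julian's kernel and uses no inline assembly; the function name mm_kernel_sketch, the unroll factor NU = 4, and the A-transposed, column-major layout are all assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unroll factor along N; the real kernel's value is not stated. */
#define NU 4

/*
 * Illustrative only: C (M x N, column-major, leading dim ldc) += A^T * B,
 * where A is K x M (leading dim lda) and B is K x N (leading dim ldb).
 * N is an ordinary runtime argument, so any remainder of N modulo NU
 * is handled by a separate cleanup loop.
 */
static void mm_kernel_sketch(size_t M, size_t N, size_t K,
                             const double *A, size_t lda,
                             const double *B, size_t ldb,
                             double *C, size_t ldc)
{
    size_t nmain = N - (N % NU);   /* columns covered by the unrolled loop */
    size_t i, j, k;

    assert(lda >= K && ldb >= K && ldc >= M);

    /* Main loop: NU columns of C per pass. */
    for (j = 0; j < nmain; j += NU) {
        for (i = 0; i < M; i++) {
            const double *a = A + i * lda;
            const double *b = B + j * ldb;
            double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
            for (k = 0; k < K; k++) {
                c0 += a[k] * b[k];
                c1 += a[k] * b[k + ldb];
                c2 += a[k] * b[k + 2 * ldb];
                c3 += a[k] * b[k + 3 * ldb];
            }
            C[i + j * ldc]       += c0;
            C[i + (j + 1) * ldc] += c1;
            C[i + (j + 2) * ldc] += c2;
            C[i + (j + 3) * ldc] += c3;
        }
    }

    /* N cleanup: the remaining N % NU columns, one at a time. */
    for (; j < N; j++) {
        for (i = 0; i < M; i++) {
            const double *a = A + i * lda;
            const double *b = B + j * ldb;
            double c0 = 0.0;
            for (k = 0; k < K; k++)
                c0 += a[k] * b[k];
            C[i + j * ldc] += c0;
        }
    }
}
```

With N = 5 and NU = 4, the first four columns go through the unrolled loop and the fifth through the cleanup loop.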

Just got some initial results.  I have not yet built it into the full gemm,
but the kernel timing looks very good:

dmm :  995
zmm :  915
smm : 1039
cmm : 1025

So it looks like zmm has taken the biggest hit, which doesn't make a lot of
sense to me.  Julian's nasm kernel is getting 960 for dmm, so I'm thinking I
must have an old .o or something . . .
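
For reference, a small sketch of how a rate like the ones above can be derived from a timed run. It assumes the figures are MFLOPS and uses the usual flop-count conventions (2*M*N*K for real gemm, 8*M*N*K for complex); the helper names gemm_flops and mflops are made up for this example and are not part of any timer.

```c
#include <assert.h>
#include <stddef.h>

/* Nominal flop count for one M x N x K matrix multiply.
 * Assumed convention: 2*M*N*K for real, 8*M*N*K for complex. */
static double gemm_flops(size_t M, size_t N, size_t K, int is_complex)
{
    double mnk = (double)M * (double)N * (double)K;
    return (is_complex ? 8.0 : 2.0) * mnk;
}

/* Convert a flop count and an elapsed time in seconds to MFLOPS. */
static double mflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1.0e6;
}
```

Under these conventions a 30x30x30 real multiply counts as 54000 flops and a complex one as 216000, so the complex kernels must do four times the arithmetic in the same time to post the same rate.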

Anyway, if these numbers hold up for full gemm, this looks plenty good for the
stable to me . . .

Thanks,
Clint