

Hello again!

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Camm,
> I don't know what specific requirements SSE forces on you, but the read/write
> pattern is much better for DDOT-based codes than DAXPY, because DAXPY
> does additional writes, whereas DDOT does additional reads, which are
> cheaper than writes . . .
> Cheers,
> Clint

You are of course right here, and I think, contrary to my earlier
guess, the complex case shows this even more clearly.  But I'm
a bit confused:

sgemvT   prefetcht0   208 MFLOPS  stable
sgemvT   prefetchnta  238 MFLOPS  stable
sgemvN   prefetcht0   217 MFLOPS  fluctuates a bit
sgemvN   prefetchnta  242 MFLOPS  fluctuates a bit

cgemvT   prefetcht0   370 MFLOPS  stable
cgemvT   prefetchnta  383 MFLOPS  stable
cgemvN   prefetcht0   330 MFLOPS  stable
cgemvN   prefetchnta  250 MFLOPS  fluctuates a lot

What appears to be going on here is that the extra writes in the N
case pollute the L1 cache in an erratic fashion.  Apparently
prefetchnta doesn't guarantee that the data is in all levels of
cache, making this disruption more evident.  The single precision
case appears to be entirely RAM bandwidth limited, but then why does
the axpy in the N case do *better* than the T case there?  At least
this seems to indicate that the complex N case could profit from a
ddot implementation, no?
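
To make the two access patterns concrete, here is a minimal C sketch
of the real single precision N case, assuming column-major storage as
in the Fortran BLAS.  The names gemv_n_axpy and gemv_n_dot are made up
for illustration; these are plain scalar loops, not the SSE kernels
timed above.

```c
#include <stddef.h>

/* y = A*x, with A m-by-n, column-major, leading dimension lda.
 * AXPY form: one pass over y per column of A, so y is read and
 * written n times in total. */
void gemv_n_axpy(size_t m, size_t n, const float *a, size_t lda,
                 const float *x, float *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0f;
    for (size_t j = 0; j < n; j++) {
        const float xj = x[j];
        const float *col = a + j * lda;   /* contiguous column j */
        for (size_t i = 0; i < m; i++)
            y[i] += xj * col[i];          /* extra load and store of y */
    }
}

/* DOT form of the same product: each y[i] is accumulated in a local
 * variable and stored once, at the price of reading A along a row,
 * i.e. with stride lda. */
void gemv_n_dot(size_t m, size_t n, const float *a, size_t lda,
                const float *x, float *y)
{
    for (size_t i = 0; i < m; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++)
            acc += a[i + j * lda] * x[j]; /* reads only */
        y[i] = acc;                       /* single store per element */
    }
}
```

Both routines compute the same result; the difference is only in how
often y is written and in how A is walked.  In the T case the dot
form reads the columns of A contiguously, which is why it is the
natural choice there.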

Take care,


Camm Maguire			     			camm@enhanced.com
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah