
Re: [Math-atlas-devel] prefetch ravings

To: atlas-comm@cs.utk.edu, math-atlas-devel@lists.sourceforge.net
Subject: Re: [Math-atlas-devel] prefetch ravings
From: Julian Ruhe <ruheejih@calvados.zrz.tu-berlin.de>
Date: Thu, 25 Oct 2001 14:28:36 +0200
Organization: TU Berlin
References: <200110242053.QAA27449@enterprise.cs.utk.edu>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.5) Gecko/20011011

Hello all,

I can only comment on prefetching on the Athlon. Prefetches must be
placed very carefully there. One reason is that the Athlon can handle
"only" six outstanding prefetches at a time; any further ones are
simply ignored. The second reason is that they sometimes decrease
performance in an unpredictable way (hello Clint!). BTW, the K6 series
can handle only one outstanding prefetch, which makes prefetching
worthless in practice there.
My personal strategy in my dgemm kernel was:
- Make sure that the prefetches do not decrease the performance of the
  "pure" kernel. That is, there should be no (or very little)
  performance difference between the kernel with and without
  prefetching enabled when it runs in a loop on always the same three
  matrices (all in L1 cache).
- Place the prefetches right before the register exchange part of the
  kernel. This is the place where the mul and add pipelines are
  emptied. At least on the Athlon, the first store must wait 4 cycles
  for its data here.
- Unroll the two inner loops completely, as the prefetch instructions
  for a column of B and of A+1 must be separated by at least
  mem_latency+5*(cacheline refill time) cycles.
- Make sure that the prefetch of the next column of B starts
  mem_latency+5*(cacheline refill time) cycles before the column is
  needed. (If one wonders why I need 5 cachelines to prefetch: as the
  blocks are not cacheline aligned, each column of each matrix can
  touch 5 cachelines in the worst case.)
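
The "5 cachelines" figure and the prefetch lead time can be checked
with a little arithmetic. A minimal sketch in C, assuming 64-byte cache
lines, 8-byte doubles and a column of 30 elements (the block size is
only a guess based on the LDC=30 test case, and all function names here
are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines touched by nbytes bytes that start at byte
 * offset `start`, for a cache line of line_size bytes. */
size_t lines_touched(size_t start, size_t nbytes, size_t line_size)
{
    assert(line_size > 0 && nbytes > 0);
    size_t first = start / line_size;
    size_t last  = (start + nbytes - 1) / line_size;
    return last - first + 1;
}

/* Worst case over all element-aligned start offsets inside one line. */
size_t worst_case_lines(size_t nelems, size_t elem_size, size_t line_size)
{
    assert(elem_size > 0);
    size_t worst = 0;
    for (size_t off = 0; off < line_size; off += elem_size) {
        size_t n = lines_touched(off, nelems * elem_size, line_size);
        if (n > worst)
            worst = n;
    }
    return worst;
}

/* Lead time following mem_latency + lines*(cacheline refill time).
 * Both cycle counts are inputs; no real Athlon values are assumed. */
unsigned prefetch_lead_cycles(unsigned mem_latency,
                              unsigned refill_cycles,
                              unsigned lines)
{
    return mem_latency + lines * refill_cycles;
}
```

Under these assumptions worst_case_lines(30, sizeof(double), 64) gives
5, while a column that happens to start on a line boundary touches only
4. prefetch_lead_cycles(100, 20, 5) would give 200 cycles, but both
cycle counts there are placeholders, not measured Athlon values.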

I was not able to implement prefetching of C successfully, although C
is the most critical matrix because it is not copied. The loss when
going from LDC=30 (test) to LDC=M (reality) is ~40 MFLOPS on my Athlon
classic 600. I think this is due to TLB misses, so C would normally be
a good candidate for prefetching.
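
To illustrate why LDC could matter for the TLB, one can count the pages
a block of C touches. A rough sketch, assuming 4 KiB pages, a 30x30
block of doubles in column-major order, and a block that starts on a
page boundary (these are my assumptions, and pages_touched is a
hypothetical helper, not part of ATLAS):

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct pages touched by a rows x cols column-major block
 * with leading dimension ldc, starting at byte offset 0. Columns are
 * visited in increasing address order, so page numbers never decrease
 * and it is enough to remember the last page seen. */
size_t pages_touched(size_t rows, size_t cols, size_t ldc,
                     size_t elem_size, size_t page_size)
{
    assert(page_size > 0 && rows > 0 && ldc >= rows);
    size_t count = 0;
    size_t last_page = 0;
    int seen_any = 0;

    for (size_t j = 0; j < cols; j++) {
        size_t begin = j * ldc * elem_size;           /* first byte of column j */
        size_t end   = begin + rows * elem_size - 1;  /* last byte of column j  */
        for (size_t p = begin / page_size; p <= end / page_size; p++) {
            if (!seen_any || p > last_page) {
                count++;
                last_page = p;
                seen_any = 1;
            }
        }
    }
    return count;
}
```

For ldc = 30 the block is 7200 contiguous bytes and
pages_touched(30, 30, 30, 8, 4096) returns 2. For ldc = 1000
consecutive columns are 8000 bytes apart, so every column lands in a
different page and the count is at least 30, which puts far more
pressure on the TLB.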

Julian
