
Re: prefetch II



Hi Clint!  Funny you posted this today, as I was working a bit on the
SSE l1 myself.

R Clint Whaley <rwhaley@cs.utk.edu> writes:

> Guys,
> 
> The main point here is to highly recommend the technique Julian suggested
> last time: time your code with no prefetch instructions on an in-cache
> timing, and then make sure that in-cache number does not go down as you add
> prefetch.  You then have a pretty good idea that the prefetch will not be
> adding overhead, even if the prefetch is useless.  Of course, you still do
> out-of-cache timings in order to see what prefetch you need . . .
> 

Do you think this would apply to l2 and l1 as well?
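
(For anyone who wants to try the check outside the ATLAS timer, here
is a minimal sketch of the idea in C.  The kernel, the constants, and
the use of gcc's __builtin_prefetch are all my own choices for
illustration, not anything from the ATLAS sources.  Compile once
plain and once with -DUSE_PF; the in-cache mflop rate should not
drop:)

  #include <stdio.h>
  #include <time.h>

  #define N       512      /* small enough that x and y stay L1-resident */
  #define REPS    100000   /* rerun on the same, now-cached, operands    */
  #define PF_DIST 64       /* prefetch distance in elements; tune freely */

  /* axpy-style kernel; prefetch is compiled in only with -DUSE_PF.
   * Prefetching past the end of x is harmless: it is only a hint
   * and cannot fault. */
  static void axpy(int n, double alpha, const double *x, double *y)
  {
      int i;
      for (i = 0; i < n; i++) {
  #ifdef USE_PF
          __builtin_prefetch(x + i + PF_DIST, 0, 3);
  #endif
          y[i] += alpha * x[i];
      }
  }

  int main(void)
  {
      static double x[N], y[N];
      int r;
      clock_t t0, t1;

      for (r = 0; r < N; r++) { x[r] = 1.0; y[r] = 2.0; }
      axpy(N, 0.5, x, y);                  /* warm the cache        */

      t0 = clock();
      for (r = 0; r < REPS; r++)           /* operands never move,  */
          axpy(N, 0.5, x, y);              /* so every rep is       */
      t1 = clock();                        /* an in-cache timing    */

      printf("in-cache mflops: %.1f\n",
             2.0 * N * REPS / 1e6 / ((double)(t1 - t0) / CLOCKS_PER_SEC));
      return 0;
  }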

> The amazing thing to me is that I was too bone-headed to apply this to prefetch,
> since I have used this technique with other stuff (e.g., register prefetch).
> For those wanting to employ it, if you pass moves="" in your ummcase line
> of the kernel timer, the timer will leave all operands in place, and thus
> time again and again on the in-cache data (assuming your nb is small enough
> to keep the operands there) . . .
> 
> I applied this technique to two previously written kernels, with modest results.
> On the ev6, the kernel runs at 94.5% of peak when in cache.  I managed to
> get the prefetched kernel to clock in at 91% of peak when the kernel timer
> flushed 10 times the actual cache size (more like 93% if I just flush the
> cache size).
> 
> However, this "best" case according to the kernel timer only got around 86%
> of peak for the full gemm.  Taking a smaller nb, that the kernel timer claimed
> got roughly 89-90% of peak, got my full gemm up to 88% of peak.  I would now
> take a bow, if Goto's GEMM hadn't been achieving 92-93% of peak for several
> years :)
> 
> That seemed about as far as I could push my ev6 performance, so I returned
> to the site of my former humiliation on the PIII.  If you remember, I had
> a kernel that got great performance according to the kernel timer, but ran
> slower in full gemm than no prefetch at all.
> 
> The new kernel (with all no-overhead prefetch) clocked in at 76% of peak, 
> whereas the generated kernel got a puny 70%.  The full gemm based on the
> generated kernel peaked around 70.6% of peak.  The full gemm based on my
> mighty new kernel peaked at . . . 71% of peak.  Wow, what a difference.
> One second before hurling my laptop across the room, I timed LU, and
> found a 3-10% performance advantage for the new kernel over generated for
> LU, so it seems that the gemm timer is not telling the whole story . . .
> 
> My guess is that CacheEdge helps full gemm out quite a bit, but LU's gemm-calls
> don't always have a good shape for CacheEdge, and then the prefetch can help
> matters (if CacheEdge is rolling, you don't really need prefetch as much, since
> the operands are already L2-contained).
> 
> Anyway, if you mess around with prefetch, do not forget to redo cacheedge.
> When I first timed my new full gemm, it was slower than the old, until
> I adjusted this quantity . . .
> 


What does it mean to 'redo' cacheedge?  This sounds reminiscent of the
issue we dealt with when I was first writing some l2 kernels, when we
found that performance was actually best when we lied to ?cases.dsc
and indicated some large inner loop length to defeat the blocking.  It
does seem that we might need to rethink where we expect the relevant
data to be (in which caches) at each stage of these operations (inside
and outside the kernel), and confine prefetch calls to those that make
sense in that place, leaving the cache-edge blocking mechanism to do
the rest of the cache work.  For example, if we believe the outer
atlas stuff will ensure that certain data is in l2 at kernel call, the
kernel should never use prefetcht2 on that array, and so on.  It also would
be nice to be able to come up with some theory as to the proper
prefetch distance in each circumstance.
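
As a first stab at such a theory (entirely back-of-the-envelope on my
part, with every number below assumed rather than measured): the
prefetch needs to be issued far enough ahead that the line arrives by
the time the kernel reaches it, i.e. latency times consumption rate,
rounded up to whole cache lines:

  /* Rough prefetch-distance estimate; MEM_LAT, BYTES_PER_CYC and LINE
   * are per-machine assumptions to be filled in, not measured values. */
  #define MEM_LAT       100   /* assumed main-memory latency, cycles     */
  #define BYTES_PER_CYC   8   /* assumed stream consumption, bytes/cycle */
  #define LINE           32   /* cache-line size in bytes (PIII)         */

  #define PF_DIST_BYTES (((MEM_LAT * BYTES_PER_CYC + LINE - 1) / LINE) * LINE)
  /* with the numbers above: 800 bytes = 25 lines = 100 doubles ahead    */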

> I still am not happy that the kernel timer does not always predict the
> best NB, and that it seems to be pretty inaccurate when prefetch comes in,
> but it is not clear to me what to do about it . . .
> 
> The kernel timing does not include the low-order overheads (data copy, outer-loop
> costs, movement of C, etc).  My guess is that in the past the use of CacheEdge
> offset these losses so the predicted rates were fairly accurate.  With prefetch,
> you are doing something CacheEdge will do, so you look a lot better in the
> kernel, but in practice the improvement is not so great.  Since you don't
> have CacheEdge offsetting the overheads, your kernel timings wind up higher
> than your full gemm.  Or at least that's one fairly random guess :)
> 

The only thing about this theory is that it would require the
non-kernel overhead to be small, but measurable.  Your 70 -> 70.6
example above doesn't seem to bear this out :-)

Some level 1 struggles:

1) Cannot get any prefetch benefit for N=1000000 on nrm2 or scal on
   torc.  Previously, I had found prefetch to account for roughly half
   the l2 speedup on the PIII.  Completely mystified here.  It seems as
   though the P4 is already doing some kind of 'autoprefetch'; is that
   possible?

2) Rough first-attempt numbers; all of the speedup is due to SSE and
   not prefetch, as far as I can tell:

                      SSE          x0 non-SSE
   in cache:
     double nrm2      1.3 Gflops   550 Mflops
     single nrm2      2-3 Gflops   580 Mflops
     single scal      1.5 Gflops     1 Mflop  (some real weirdness here)
   out of cache:
     double nrm2      450 Mflops   360 Mflops
     single nrm2      880 Mflops   516 Mflops
     single scal      250 Mflops   250 Mflops

3) No way to align 'dot', right?  (See the alignment-peel sketch
   after this list.)

4) In-cache numbers are never seen in real use, right?

5) How do I figure out the max throughput from main memory in the
   absence of cache misses?  (A crude bandwidth probe is sketched
   after this list.)  Is it possible that torc is so fast that it can
   basically process all of cache in l1-type operations in less time
   than it minimally takes to fill it?

6) What l1 speedup is worth the trouble?  What should be attainable?
   What does MKL show?

7) Does anyone know of a good rudimentary reference on assembler
   optimization, one that would let you write down something close to
   optimal the first time instead of tweaking endlessly by trial and
   error?
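
Regarding 3) above, the best I can see is to align one operand and
eat unaligned loads on the other, since the caller controls their
relative alignment.  A sketch in C intrinsics (purely illustrative --
the real kernel would be assembler, and sdot_sse is a name I just
made up):

  #include <xmmintrin.h>
  #include <stdint.h>

  float sdot_sse(int n, const float *x, const float *y)
  {
      float t[4], sum = 0.0f;
      __m128 acc = _mm_setzero_ps();
      int i = 0;

      /* scalar peel until x reaches a 16-byte boundary; y's relative
       * alignment is out of our hands, so it gets movups below */
      while (i < n && ((uintptr_t)(x + i) & 15)) {
          sum += x[i] * y[i];
          i++;
      }

      for (; i + 4 <= n; i += 4)               /* movaps x, movups y */
          acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(x + i),
                                           _mm_loadu_ps(y + i)));

      _mm_storeu_ps(t, acc);                   /* horizontal sum */
      sum += t[0] + t[1] + t[2] + t[3];

      for (; i < n; i++)                       /* scalar cleanup */
          sum += x[i] * y[i];
      return sum;
  }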
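
And regarding 5), the kind of crude STREAM-style probe I have in mind
(my own sketch; the 32MB size is just an assumption meant to dwarf
any L2): time a pure read sweep of a buffer far larger than cache,
then compare the MB/s figure against what the out-of-cache l1
routines actually consume.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define MB (1024 * 1024)
  #define SZ (32 * MB / sizeof(double))   /* ~32MB buffer, >> any L2 */

  int main(void)
  {
      double *a = malloc(SZ * sizeof(double));
      volatile double sink;
      double s = 0.0;
      size_t i;
      clock_t t0, t1;

      if (!a) return 1;
      for (i = 0; i < SZ; i++) a[i] = 1.0;     /* touch every page    */

      t0 = clock();
      for (i = 0; i < SZ; i++) s += a[i];      /* pure streaming read */
      t1 = clock();
      sink = s;                                /* keep the loop live  */
      (void)sink;

      printf("read bandwidth: %.0f MB/s\n",
             (SZ * sizeof(double) / (double)MB) /
             ((double)(t1 - t0) / CLOCKS_PER_SEC));
      free(a);
      return 0;
  }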

Take care,

> Cheers,
> Clint
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah