[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: worth a flier

Hello Clint,

> cool.  Needed to be sure you weren't also including the integer component
> of gemm (loops and pointer arith) in the IPC . . .

Right, no pointer arithmetics is included in IPC.

> Yes.  The other Level 3 BLAS all call the full gemm, not the kernel, and they
> will therefore have the full gemm's access patterns . . .


> I think we need to get some common vocabulary to discuss this.  Gemm needs to
> be written with 3 loops.  If C is MxN, and A and B are MxK and KxN, then
> let us call the loops over a particular dimension M, N, and K loop.  In
> all ATLAS gemm (both kernel and full) implementations K is the innermost loop.
> For the kernel, which is M=N=K=NB, this means that the access of C is only
> 2*NB*NB, while the access of A and B are NB*NB*NB each, so C is a low order
> term on cost.  Also, LAPACK is a big user of the BLAS, and they call gemm
> a lot as a rank-K update (think of M=1000, N=1000, K=NB).  In this
> configuration, the cost of reading and writing C is one of the main costs
> of GEMM, and if you copy, you double it, which is a killer; These two factors
> together (C low-order term, and copy killing perf of rank-NB update) are
> why ATLAS does not guarantee copying C.  ATLAS actually *does* buffer C into
> a contiguous NBxNB workspace when K is sufficiently large . . .
>      (pp 11-13)
> There are two l1 caching strategies that can yield good performance,
> one input matrix block, and a few rows/cols of other input matrix fills cache,
> and all 3 blocks (A, B, C) fill cache (pp. 18).  On the Athlon, ATLAS
> presently gets better performance with the first of these, so C is essentially
> kept in L2 . . .

Ok, I knew you have your reasons for not copying C. Thank you for the explanation!
> >Now something serious: I am not sure if I can adopt our algorithm to C=At*B+C (Fortran column-major) as
> >required by ATLAS. Our algorithm was created for C=A*B+C (C row-major).
> I assume all matrices are row-major?  If so, then in col-major terms, you
> are doing C^T += B^T A^T, so we have a mismatch on access of C and the
> second input matrix.  See below for more details . . .


> OK, just to be sure, when you say scalar products, this is the Level 1 BLAS
> operation DDOT?  If so, this is exactly what ATLAS's generated kernel is
> doing: 5 simultaneous dot products (6 actually gives almost the same
> performance).

Oh, I thought all the time that ATLAS' mm uses the straightforward "split one dot product into five and
add the results" strategy which AMD recommended when we started writing our DGEMM. Can
I ask you how you came to the idea to crunch five dot products simultaneously? Or does
ATLAS find this solution automatically? Does ATLAS do it on all architectures in this way? 

> So, with K as the inner loop, and matrices column major, I chose
> C = A^T B + C as the transpose settings because it means that the innermost
> loop (K) accesses contiguous memory.  I take it you chose B^T (in col-major
> terms) 'cause it puts your 6 elts of the dot product contiguous instead?
> Do you think having to grab the elts of the dot product from
> *A, A[NB], A[NB2], A[NB3], A[NB4], A[NB5] will prove a performance killer
> (with the priviso that the next step of K will use subsequent elts)?

I personally don't think one method is superior to the other. But if so, I can't say
why because it is a hardware question. While I now a little about SDRAMs I don't know
anything about SRAMs and which access pattern they like. I think the performance killer
of ATLAS on Athlon is just the impossibility to create perfectly optimized kernel code from
a higher programming language. One MUST use assembly to achieve near peak results.
BTW, if both methods above are performance killers, there is one solution for it: One can
create a perfect access pattern with a modified copying routine. Who says that the nb x nb
blocks must look like they were cut out of the source matrix? One can bring the elements
of one block into any wanted order to create a perfect access pattern...But this seems
not suitable for ATLAS. 
> NB=30, eh?  ATLAS might be happier with NB=28, since that keeps it a multiple
> of the cacheline boundary; on the other hand, with your prefetching, perhaps
> it will not matter . . .
> >In the case I don't find a solution to adapt the algorithms to ATLAS' requirements, is there a way that I can
> >use C=A*B+C (row major) as kernel?

I think I'll do it with 30, because I use 6 simultaneous dot products (sorry that I
used the word "scalar product" in my last letter. In German we call it "Standard Skalarprodukt"
or "Euklidisches Skalarprodukt" and I sometimes tranfer some German words into English).
If using 28/6 I would waste some cycles (maybe I'll try both)

> I'm afraid not, at least without me rewriting quite a bit of the ATLAS
> internals.  I'm  not sure when I'd have the time to do that . . .
> Even if you don't achieve the full speed of your row-major kernel, you
> can surely improve on ATLAS's paltry 1.4;  Your 1.9 is the upper bound
> maybe, but any improvement to ATLAS is still an improvement . . .

I think At*B in col-major will show a good IPC
> >At the end of this mail you find some results of Alex full DEGMM
> Very good indeed.  The impressive thing is that it looks like it
> would make the Athlon beat the P4 (for ATLAS) clock-for-clock, in a fairly
> decisive manner.  I'm sure you've seen my messages about it, but presently
> the P4 is presently ATLAS's flops king, with a roughly 2Gflop DGEMM on a
> 1.5Ghz P4 . . .

Programming the SSE2 unit is no fun. As long as Intel does not tell us how many
uops each instruction generates, it is impossible to write good code. Once I have hit
P4's peak performance in synthetic code by chance ( 2mul + 2add + 2loads + some int instr./2 cyc), but
I have no explanation why. The same code with LESS integer instructions did not reach peak

Just to make sure that I did not think wrong (row-major vs col-major confuses me everytime).
The code I will write must look like this:




(c+5ldc)=(a+29)*(b+5nb+29)        ; 6 ddots done

(c+6ldc)=(a+0)*(b+6nb)           ; next 6 ddots



Now I see some problems:

(1) can't prefetch "lost" cachelines of C
(2) exchanging c's requires massive pointer arithmetics