# Re: worth a flier

```
Hi Clint,

> >BTW, if both methods above are performance killers, there is one solution for it: One can
> >create a perfect access pattern with a modified copying routine. Who says that the nb x nb
> >blocks must look like they were cut out of the source matrix? One can bring the elements
> >of one block into any wanted order to create a perfect access pattern...But this seems
> >not suitable for ATLAS.
>
> This is what I thought ATLAS already did.  Again, to my pea-brain,
> C += A^T B *is* the optimal access pattern, which is why I copy to it . . .

Often I can't express what I mean precisely enough in English. I did not mean an optimal
access pattern in terms of transposed/not transposed. What I meant is to bring a block,
by copying, into a form that minimizes pointer arithmetic and also reduces integer
register pressure.
Going back to the C=A*B (row major) example, a "standard" copying routine would copy each block of
B in this way (in memory):

b00 .. b0(29) b10 .. b1(29) ...

I would give each block the following shape in memory (suitable for 6 simultaneous dot products):

b00 b01 b02 b03 b04 b05 b10 b11 b12 b13 b14 b15 b20 b21 ... b(29)5 b06 b07 b08 b09 b0(10) b0(11) b16 b17 ...
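To make this concrete, a copying routine producing that shape could look like the
following (my own untested sketch, not ATLAS code; names are made up):

#include <stdio.h>

#define NB 30   /* block size, a multiple of 6 as discussed below */

/* Copy a row-major NB x NB block of B (leading dimension ldb) into
   panels of 6 columns: within each panel, every row contributes its
   6 elements before the next row, so a kernel doing 6 simultaneous
   dot products reads the packed block completely sequentially. */
static void pack_b(const double *B, int ldb, double *dst)
{
    int i, j, jj, k = 0;
    for (j = 0; j < NB; j += 6)          /* one 6-column panel at a time */
        for (i = 0; i < NB; i++)         /* every row of the block ...   */
            for (jj = 0; jj < 6; jj++)   /* ... contributes 6 elements   */
                dst[k++] = B[i * ldb + j + jj];
}

int main(void)
{
    double B[NB * NB], P[NB * NB];
    int i;
    for (i = 0; i < NB * NB; i++) B[i] = (double)i;
    pack_b(B, NB, P);
    /* packed order starts b00..b05, then b10..b15 */
    printf("%g %g %g\n", P[0], P[5], P[6]);   /* prints "0 5 30" */
    return 0;
}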

This has some clear advantages:

(1) The elements of all three matrix blocks are accessed completely sequentially.
(2) Keeping in mind that all offsets must fit in 8 bits on the Athlon, the pointer
arithmetic for offset calculations is minimized (I won't go into detail here).

But I think this strategy is not suitable for ATLAS.

> Right.  How about 36?  It's a multiple of 4 and 6, and should fit in cache.
> Did you guys find that NB > 30 does not stay in cache (and if you did,
> was it using contiguous A & B)?

30 was chosen for theoretical reasons only; we did not test other values.
30 is a multiple of 6, small enough that all data should stay in cache, and
large enough that the two following matrix blocks can be prefetched in time
on PC133-equipped systems.
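For the record, a quick footprint check (my own back-of-the-envelope numbers,
assuming double precision and the Athlon's 64 KB L1 data cache):

    3 blocks * 30 * 30 elements * 8 bytes = 21600 bytes

Even with the two prefetched successor blocks (another 2 * 7200 = 14400 bytes),
the total of 36000 bytes stays well under 64 KB.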

> How about we remove the ambiguity.  Here's essentially the code that ATLAS
> uses, with some loop editing to make it readable:

> Notice that, since I am operating on a column-major C, I choose the
> N-M-K loop ordering, and grab multiple elts of A instead of B.  I think the
> one you describe is M-N-K, with mult elts of B.  I think the one you describe
> is better for row-major, and the one above is better for col-major, but that
> they are exactly equivalent to produce . . .

OK, wonderful! Now I know everything I need to start my work.
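For my own notes, here is how I read the N-M-K ordering you describe: column-major
C += A^T * B, grabbing 6 elements of A per K-iteration against one element of B.
This is a naive compilable sketch, not the actual ATLAS code (which was elided
from this message):

#include <stdio.h>

#define NB 30

/* Column-major storage: C(i,j) = C[i + j*NB]; A^T(i,k) = A[k + i*NB]. */
static void mm_nmk(const double *A, const double *B, double *C)
{
    int i, j, k, u;
    for (j = 0; j < NB; j++)             /* N loop: columns of C        */
        for (i = 0; i < NB; i += 6)      /* M loop: 6 rows of C at once */
            for (k = 0; k < NB; k++) {   /* K loop: dot products        */
                double b = B[k + j * NB];        /* one element of B    */
                for (u = 0; u < 6; u++)          /* six elements of A   */
                    C[i + u + j * NB] += A[k + (i + u) * NB] * b;
            }
}

int main(void)
{
    static double A[NB * NB], B[NB * NB], C[NB * NB], R[NB * NB];
    int i, j, k;
    for (i = 0; i < NB * NB; i++) { A[i] = i % 7; B[i] = i % 5; }
    mm_nmk(A, B, C);
    /* reference result from a plain triple loop */
    for (j = 0; j < NB; j++)
        for (i = 0; i < NB; i++)
            for (k = 0; k < NB; k++)
                R[i + j * NB] += A[k + i * NB] * B[k + j * NB];
    for (i = 0; i < NB * NB; i++)
        if (C[i] != R[i]) { printf("mismatch\n"); return 1; }
    printf("ok\n");
    return 0;
}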

> >(1) can't prefetch "lost" cachelines of C
> >(2) exchanging c's requires massive pointer arithmetics
>
> (2) goes away for sure if you use the algorithm shown above.  I'm not exactly
> sure what you mean by (1), so I don't know about it . . .

(1) In the mm C=A*B (row-major) example I issue "safety" prefetches of cache lines
of A and C, just to make sure that they are really in L1 when I need them and have
not been "lost" to unwanted cache thrashing. If the required 6 elements of C are not
contiguous in memory, it is obviously not possible to prefetch all of them with one
prefetch instruction.
(2) Ok
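To illustrate point (1) — a sketch using the GCC builtin rather than the actual
prefetch instruction I would emit in assembly: 6 contiguous doubles are 48 bytes,
so a single (aligned) 64-byte cache line covers all of them and one prefetch
suffices, while with strided column-major C each of the 6 elements could sit in
its own line.

#include <stdio.h>

static double Cblk[30 * 30];   /* packed, contiguous C block */

int main(void)
{
    /* rw = 1 (prefetch for write), locality = 3 (keep in all cache levels);
       one prefetch covers the 6 contiguous C elements of a kernel step */
    __builtin_prefetch(&Cblk[0], 1, 3);
    printf("%d bytes\n", (int)(6 * sizeof(double)));   /* prints "48 bytes" */
    return 0;
}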

Regards

Julian
```