
Re: [Math-atlas-results] SSE warnings, Band matrix request feature


>>>1) so I take it the level 3 proposal was for an extension of the blas
>>>   spec?
>>Yep, if you have banded and packed routines take a leading dimension and some
>>starting/stopping criteria, you can write level-3 based kernels (which are
>>slightly modified dense kernels for packed), and then use the same recursive
>>algorithms as in dense.  This gives major speedups.
>Interesting.  Too bad about the proposal.  I'm sure it is a lot of
>work, yes?

Yes.  Tragically, it involves changing both the code generator and the
entire recursive BLAS :)

>True.  Is there any way for the banded l2 for example to kick over to
>small loops over level 1 routines with narrow matrices, kind of like
>the small case test in gemm?

Sure.  Since we are talking about replacing Antoine's kernel with your
own, you'd simply have that switch in your kernel . . .
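
For concreteness, here is a minimal sketch of what such a switch inside a
kernel could look like.  None of this is ATLAS code: the function names, the
NARROW_BAND crossover, and the choice of which path serves the narrow case are
all assumptions made for illustration, and the routine is out-of-place
(y = A*x) rather than in-place like a real TBMV.  It assumes a lower
triangular band matrix in the usual BLAS column-major band storage.

```c
/* Lower-triangular band matrix, n x n, with k subdiagonals, stored
 * column-major in BLAS band format:
 *   A(i,j) is at ab[(i - j) + j*lda]  for  j <= i <= min(n-1, j+k),
 * with lda >= k+1.  All names below are hypothetical. */

#define NARROW_BAND 4  /* assumed crossover; would have to be tuned per machine */

/* Column-oriented path: one short axpy (length <= k+1) per column. */
void band_mv_cols(int n, int k, const double *ab, int lda,
                  const double *x, double *y)
{
    int i, j;
    for (j = 0; j < n; j++) {
        int iend = (j + k < n - 1) ? j + k : n - 1;
        for (i = j; i <= iend; i++)
            y[i] += ab[(i - j) + j * lda] * x[j];
    }
}

/* Diagonal-oriented path: at most k+1 long loops, one per stored diagonal. */
void band_mv_diags(int n, int k, const double *ab, int lda,
                   const double *x, double *y)
{
    int d, j;
    for (d = 0; d <= k && d < n; d++)
        for (j = 0; j + d < n; j++)
            y[j + d] += ab[d + j * lda] * x[j];
}

/* y = A*x, choosing a loop ordering from the bandwidth. */
void band_lower_mv(int n, int k, const double *ab, int lda,
                   const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = 0.0;
    if (k < NARROW_BAND)
        band_mv_diags(n, k, ab, lda, x, y);  /* narrow band */
    else
        band_mv_cols(n, k, ab, lda, x, y);   /* wider band */
}
```

Both paths compute the same product; only the loop order differs.  Which one
is actually faster at a given bandwidth is something you'd have to time on
the target machine.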

>You'd mentioned before that atlas treats these as ref blas.  Then why
>this? (sse1 850MHz p3)
>/usr/lib/atlas/xsl2blastst  -R tbmv -n 1000 -F 200
>----------------------------- TBMV ------------------------------
>==== ==== ==== ==== ==== ==== ==== ==== ====== ====== ===== =====
>   0    L    N    N 1000    1    2    1   0.00  158.7  1.00 -----
>   0    L    N    N 1000    1    2    1   0.00   36.6  0.23 PASS 
>1 tests run, 1 passed
>/usr/lib/atlas/xsl2blastst  -R gemv -n 1000 -F 200
>------------------------------- GEMV --------------------------------
>==== == ==== ==== ===== ==== ==== ===== ==== ====== ===== ===== =====
>   0  N 1000 1000   1.0 1000    1   1.0    1   0.03  67.4  1.00 -----
>   0  N 1000 1000   1.0 1000    1   1.0    1   0.01 238.7  3.54 PASS 
>1 tests run, 1 passed

I'm not sure I understand your question.  Looks like the numbers you are
showing merely underline what I said: gemv is optimized, TBMV is not . . .
If you are comparing against the F77 reference BLAS, realize that ATLAS is
using the same loops translated by Antoine into C.  C compilers are often
not as efficient as F77 on unoptimized loops (i.e., comparing equivalent
reference C and F77 implementations will often show F77 kicking butt) . . .

>One other item regarding possible precision.  I've written a
>quadratic program using basically a loop over gemv and ger.  I've run
>with both double and single precision, using both atlas and ref blas.
>And it seems that atlas is losing a lot more precision than ref blas
>for the single float case.  Here are my results (all these vectors
>should be the same to within rounding):

Uh, if I'm supposed to learn something from an output dump, a little more
terseness of output, or more explanation of how to interpret the bushels
of output, might be in order.  Since I'm not planning on doing my Ph.D. on
this topic, I'm kind of blowing off the IO you sent . . .

As far as accuracy, single precision SSE will indeed lose quite a bit
of non-guaranteed accuracy.  The guaranteed accuracy remains the same:
IEEE single precision (32 bit).  On x86 chips, x87 register ops are carried
out in 80 bit for both double and single precision.  The reason this
added accuracy is not guaranteed is that each time you write the register
to memory, you round to 32-bit (64-bit for double) accuracy.

So, any accuracy above 32-bit you don't want to count on: if the algorithm
changes it can change, and if you go to another platform (eg, alpha, sparc)
you won't have it . . .

So, I think this is roughly what you mentioned: SSE gives 32-bit accuracy (IEEE
standard, which is all anything **guarantees**), while x87 uses a mixture
of 80-bit with rounding to give (possibly & unpredictably) more accurate
results . . .
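
A small sketch of the rounding-on-store effect described above.  The function
names are made up for illustration, and double stands in for the 80-bit x87
register purely for portability (it is not the same format).  The volatile
accumulator forces a store to a 32-bit float after every addition, which is
the case where the extra register accuracy is thrown away.

```c
/* Sum with every partial result forced out to a 32-bit float in memory.
 * 'volatile' makes the compiler store and reload the accumulator at each
 * step, so no wider register value survives between additions. */
float sum_store_each_step(const float *x, int n)
{
    volatile float acc = 0.0f;
    int i;
    for (i = 0; i < n; i++)
        acc = acc + x[i];
    return acc;
}

/* Sum in a wider accumulator and round to float only once, at the end.
 * double is an assumed stand-in for the x87 extended-precision register. */
float sum_wide_accumulator(const float *x, int n)
{
    double acc = 0.0;
    int i;
    for (i = 0; i < n; i++)
        acc += x[i];
    return (float)acc;
}
```

With the input {16777216.0f, 1.0f, 1.0f} (that is, 2^24 followed by two ones),
the first routine returns 16777216 because each 1 is lost when the partial sum
is rounded to 32 bit, while the second returns 16777218.  Both answers are
legitimate single precision results; only the first is what you can count on
everywhere.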

Just as an aside, you will observe the same behavior on a P4, comparing SSE2
vs. x87 double precision.  SSE2 goes to 64 bit only, whereas x87 will have
80-bit for register-register ops.  Again, ATLAS uses SSE2 because it does
not change the accuracy which we can guarantee . . .

>Agreed.  Have you looked into writing behind the reading pt at various
>distances, and the different flavours of prefetch?  

Again, I haven't extensively studied the non-unit-stride cases at all.
Even on unit stride, though, I have had very little luck with any type of
prefetch (I at least tried them all) with operations that write to vectors.
Doesn't mean it can't be done, but I have had no real luck (performance
improvements more like 5% than 50%).  I haven't scoped them enough; I am
confident there are ways to make this much better.  For instance, the best
dcopy I was able to produce was roughly 1/2 as fast as the best one I
found on the net (a rather long assembly language monstrosity found, I think,
in some version of the Linux kernel), so it is clear I have not figured out
the proper tricks yet . . .
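
For reference, the kind of thing being discussed looks roughly like the sketch
below: a unit-stride copy that prefetches a fixed distance ahead of the read
point.  This is not the ATLAS dcopy or the kernel routine mentioned above.
The function name, the PF_DIST distance, and the 64-byte line size are all
assumptions, and __builtin_prefetch is a GCC/Clang builtin, so the hint is
compiled out elsewhere.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current
 * read point to prefetch. */
#define PF_DIST 64

/* Unit-stride copy y <- x with software prefetch ahead of the read point. */
void copy_with_prefetch(size_t n, const double *x, double *y)
{
    size_t i;
    for (i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* One hint per 8 doubles (64 bytes, an assumed cache-line size),
         * and only while the target address is still inside the arrays. */
        if ((i & 7) == 0 && i + PF_DIST < n) {
            __builtin_prefetch(&x[i + PF_DIST], 0, 0); /* for read  */
            __builtin_prefetch(&y[i + PF_DIST], 1, 0); /* for write */
        }
#endif
        y[i] = x[i];
    }
}
```

The "different flavours" correspond to the second and third arguments
(read vs. write, and the temporal locality hint); the distance is the part
one would sweep when timing.  Whether any setting helps is exactly the open
question here.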

I have not done a lot of buffering the output as you mentioned for x86,
simply due to the paucity of registers . . .