
*To*: atlas-comm@cs.utk.edu, math-atlas-devel@lists.sourceforge.net
*Subject*: Re: [Math-atlas-results] SSE warnings, Band matrix request feature
*From*: R Clint Whaley <rwhaley@cs.utk.edu>
*Date*: Wed, 17 Oct 2001 00:32:23 -0400 (EDT)

Camm,

>>>1) so I take it the level 3 proposal was for an extension of the blas
>>> spec?
>>
>>Yep, if you have banded and packed routines take a leading dimension and some
>>starting/stopping criteria, you can write level-3 based kernels (which are
>>slightly modified dense kernels for packed), and then use the same recursive
>>algorithms as in dense. This gives major speedups.
>>
>Interesting. Too bad about the proposal. I'm sure it is a lot of
>work, yes?

Yes. Tragically, it involves changing both the code generator and the entire
recursive BLAS :)

>True. Is there any way for the banded l2 for example to kick over to
>small loops over level 1 routines with narrow matrices, kind of like
>the small case test in gemm?

Sure. Since we are talking about replacing Antoine's kernel with your own,
you'd simply have that switch in your kernel . . .

>You'd mentioned before that atlas treats these as ref blas. Then why
>this? (sse1 850MHz p3)
>
>/usr/lib/atlas/xsl2blastst -R tbmv -n 1000 -F 200
>
>----------------------------- TBMV ------------------------------
>TST#  UPLO  TRAN  DIAG     N     K   LDA  INCX    TIME   MFLOP   SpUp   TEST
>====  ====  ====  ====  ====  ====  ====  ====  ======  ======  =====  =====
>   0     L     N     N  1000     1     2     1    0.00   158.7   1.00  -----
>   0     L     N     N  1000     1     2     1    0.00    36.6   0.23   PASS
>
>1 tests run, 1 passed
>
>/usr/lib/atlas/xsl2blastst -R gemv -n 1000 -F 200
>
>------------------------------- GEMV --------------------------------
>TST#  TR     M     N  ALPHA   LDA  INCX   BETA  INCY    TIME  MFLOP   SpUp   TEST
>====  ==  ====  ====  =====  ====  ====  =====  ====  ======  =====  =====  =====
>   0   N  1000  1000    1.0  1000     1    1.0     1    0.03   67.4   1.00  -----
>   0   N  1000  1000    1.0  1000     1    1.0     1    0.01  238.7   3.54   PASS
>
>1 tests run, 1 passed

I'm not sure I understand your question. Looks like the numbers you are
showing merely underline what I said: gemv is optimized, TBMV is not . . .
If you are comparing against the F77 reference BLAS, realize that ATLAS is
using the same loops translated by Antoine into C.
C compilers are often not as efficient as F77 on unoptimized loops (i.e.,
comparing equivalent reference C and F77 implementations will often show
F77 kicking butt) . . .

>One other item regarding possible precision. I've written a
>quadratic program using basically a loop over gemv and ger. I've run
>with both double and single precision, using both atlas and ref blas.
>And it seems that atlas is losing a lot more precision than ref blas
>for the single float case. Here are my results (all these vectors
>should be the same to within rounding):

Uh, if I'm supposed to learn something from an output dump, a little more
terseness of output, or more explanation of how to interpret the bushels of
output might be in order. Since I'm not planning on doing my Ph.D. on this
topic, I'm kind of blowing off the IO you sent . . .

As far as accuracy, single precision SSE will indeed lose quite a bit of
non-guaranteed accuracy. The guaranteed accuracy remains the same: IEEE
single precision (32 bit). x87 ops are promoted to 80 bit on both double
and single precision register ops on x86 chips. The reason this added
accuracy is not guaranteed is that each time you write the register to
memory, you round to 32-bit (64-bit for double) accuracy. So, any accuracy
above 32-bit you don't want to count on: if the algorithm changes it can
change, and if you go to another platform (e.g., alpha, sparc) you won't
have it . . .

So, I think this is roughly what you mentioned: SSE gives 32-bit accuracy
(IEEE standard, which is all anything **guarantees**), while x87 uses a
mixture of 80-bit with rounding to give (possibly & unpredictably) more
accurate results . . .

Just as an aside, you will observe the same behavior on a P4, comparing
SSE2 vs. x87 double precision. SSE2 goes to 64 bit only, whereas x87 will
have 80-bit for register-register ops. Again, ATLAS uses SSE2 because it
does not change the accuracy which we can guarantee . . .

>Agreed.
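A rough illustration of the accuracy point made above: a running sum that is rounded to 32 bits on every store can drift much further than one kept in a wider accumulator and rounded once at the end. This is only a sketch, not ATLAS code. The function names are made up for illustration, `volatile` is used to force the per-step store to memory, and `double` is a portable stand-in for the 80-bit x87 register.

```c
#include <stddef.h>

/* Hypothetical helper: sum n copies of x, writing the running total back
   to a 32-bit float after every add. volatile forces the store, so any
   extra register precision is rounded away at each step. */
float sum_rounded_each_step(float x, size_t n)
{
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc = acc + x;
    return acc;
}

/* Hypothetical helper: same sum, but the running total lives in a wider
   type and is rounded to float only once, on return. double stands in
   for the 80-bit x87 register here (an assumption made for portability). */
float sum_rounded_once(float x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)x;
    return (float)acc;
}
```

Calling both with `x = 0.1f` and `n = 10000000` should show the first result visibly further from one million than the second. Both results are valid IEEE single precision answers; only the first level of accuracy is something you can count on across platforms.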
>Have you looked into writing behind the reading pt at various
>distances, and the different flavours of prefetch?

Again, I haven't extensively studied the non-unit-stride cases at all. Even
on unit stride, though, I have had very little luck with any type of
prefetch (I at least tried them all) with operations that write to vectors.
Doesn't mean it can't be done, but I have had no real luck (performance
improvements more like 5% than 50%). I haven't scoped them enough; I am
confident there are ways to make this much better. For instance, the best
dcopy I was able to produce was roughly 1/2 as fast as the best one I found
on the net (a rather long assembly language monstrosity found, I think, in
some version of the Linux kernel), so it is clear I have not figured out
the proper tricks yet . . .

I have not done a lot of buffering the output as you mentioned for x86,
simply due to the paucity of registers . . .

Cheers,
Clint
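For readers who want to try the kind of experiment discussed above, here is a minimal sketch of a unit-stride copy with software prefetch on the read stream. It is not the ATLAS dcopy and not the Linux kernel routine mentioned. The name `copy_prefetch` and the constants `LINE_ELTS` and `PF_DIST` are hypothetical, and the values are guesses that would need tuning per machine. It relies on the GCC/Clang builtin `__builtin_prefetch` and falls back to a plain copy on other compilers.

```c
#include <stddef.h>

/* Hypothetical tuning knobs (assumptions, not values taken from ATLAS). */
#define LINE_ELTS 8    /* doubles per 64-byte cache line */
#define PF_DIST   64   /* elements ahead of the read point to prefetch */

#if defined(__GNUC__) || defined(__clang__)
/* rw = 0 requests a read prefetch; locality = 0 hints no temporal reuse. */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Unit-stride copy y := x, issuing one prefetch hint per cache line. */
void copy_prefetch(size_t n, const double *x, double *y)
{
    size_t i = 0;

    /* Main loop: one cache line of doubles per iteration. */
    for (; i + LINE_ELTS <= n; i += LINE_ELTS) {
        if (i + PF_DIST < n)            /* stay inside the source array */
            PREFETCH_READ(&x[i + PF_DIST]);
        for (size_t j = 0; j < LINE_ELTS; j++)
            y[i + j] = x[i + j];
    }

    /* Cleanup loop for the remaining n % LINE_ELTS elements. */
    for (; i < n; i++)
        y[i] = x[i];
}
```

Timing this against a plain loop for a few values of `PF_DIST` is the quickest way to see whether prefetch buys anything on a given chip. The prefetch is only a hint, so the copy gives the same result with or without it.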
