
- *To*: camm@enhanced.com
- *Subject*: Re: [Math-atlas-results] SSE warnings, Band matrix request feature
- *From*: R Clint Whaley <rwhaley@cs.utk.edu>
- *Date*: Fri, 12 Oct 2001 20:37:41 -0400 (EDT)
- *Cc*: atlas-comm@cs.utk.edu, math-atlas-devel@lists.sourceforge.net

>1) so I take it the level 3 proposal was for an extension of the blas
> spec?

Yep, if you have banded and packed routines take a leading dimension and some starting/stopping criteria, you can write level-3 based kernels (which are slightly modified dense kernels for packed), and then use the same recursive algorithms as in dense. This gives major speedups.

>2) My comment was that the existing kernels would of course not work.
> Why can't (different) kernels be used with narrow band cases?

What I meant by this is that narrow-band guys are essentially Level 1 ops, which means optimization is not as good as for Level 2 or 3, and that reusing kernels is difficult, because there are no low-order costs you can ignore (think matrix copy for Level 3, vector copy for Level 2) . . .

>a) a[i]*=b[i]; (should be a ?sbmv with k=0)
>b) a[i]+=const.

These would be additional Level 1 ops, not banded or packed, surely? Adding them as additional Level 1 ops would not be hard with the templates already in place . . .

>c) a[i][j]-=b[i]+b[j]-const

What the hell kind of operation is this?

>d) ffts

Kind of a big topic. What about fftw (http://www.fftw.org/)? I've never used it myself (I can't spell fft), but have heard good things about it. Uses some of the same kinds of ideas as ATLAS, as I understand it . . .

By the way, you mentioned optimization of non-unit stride vectors. The speedups to be had are pretty meager, even in the best case (read only). My feeling is that 5% would be heroic. Probably not worth thinking about except in exceptional cases.

For level 1 ops, memory bandwidth is the big constraint most of the time, and prefetch is the only real analgesic. From my limited experience, writing to the vector tends to kill the advantage of prefetch a great deal (I guess the bus is too busy to prefetch), so your big wins come on scalar-output routines like nrm2, iamax, ddot, etc.

Cheers,
Clint
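
For reference, the requested operations (a) through (c) quoted above can be written as plain C loops. This is only a minimal sketch of what each operation computes, not ATLAS code and not a proposed interface: the function names, the use of double precision, and the row-major layout with a leading dimension `lda` for (c) are all assumptions made for illustration.

```c
#include <stddef.h>

/* (a) a[i] *= b[i]: element-wise scaling of a by b.
 * Hypothetical name; not an existing BLAS or ATLAS routine. */
void vec_mul_inplace(size_t n, double *a, const double *b)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= b[i];
}

/* (b) a[i] += c: add a constant to every element.
 * Hypothetical name; not an existing BLAS or ATLAS routine. */
void vec_add_const(size_t n, double *a, double c)
{
    for (size_t i = 0; i < n; i++)
        a[i] += c;
}

/* (c) a[i][j] -= b[i] + b[j] - c on an n x n matrix.
 * Assumes row-major storage with leading dimension lda >= n.
 * Hypothetical name; not an existing BLAS or ATLAS routine. */
void mat_sub_outer_sum(size_t n, double *a, size_t lda,
                       const double *b, double c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * lda + j] -= b[i] + b[j] - c;
}
```

Operations (a) and (b) each read and write one vector in a single pass, so they do O(n) arithmetic on O(n) data, which is the Level 1 profile described in the reply above.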

**Follow-Ups**: **Re: [Math-atlas-results] SSE warnings, Band matrix request feature** (*From:* Camm Maguire <camm@enhanced.com>)
