Re: [Math-atlas-results] SSE warnings, Band matrix request feature
>1) so I take it the level 3 proposal was for an extension of the blas
Yep, if you have banded and packed routines take a leading dimension and some
starting/stopping criteria, you can write level-3 based kernels (which are
slightly modified dense kernels for packed), and then use the same recursive
algorithms as in dense. This gives major speedups.
>2) My comment was that the existing kernels would of course not work.
> Why can't (different) kernels be used with narrow band cases?
What I meant by this is that narrow-band guys are essentially Level 1 ops,
which means optimization is not as good as 2 or 3, and that reusing kernels
is difficult, because there are no low-order costs you can ignore (think
matrix copy for Level 3, vector copy for Level 2) . . .
>a) a[i]*=b[i]; (should be a ?sbmv with k=0)
These would be additional Level 1 ops, not banded or packed, surely? Adding
them as additional Level 1 ops would not be hard with the templates already
in place . . .
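
For concreteness, here is a minimal sketch of the two views of that operation. Both function names (vec_mul, sbmv_k0_ref) are made up for illustration and are not part of the BLAS or ATLAS. The second routine is only a reference for what ?sbmv computes when k=0, assuming column-major band storage where diagonal element i sits at d[i*lda]; it is not the library routine itself.

```c
#include <stddef.h>

/* Hypothetical Level 1 style op: a[i] *= b[i], with positive strides. */
void vec_mul(size_t n, double *a, size_t inca, const double *b, size_t incb)
{
    for (size_t i = 0; i < n; i++)
        a[i * inca] *= b[i * incb];
}

/* Reference for the k=0 case of a symmetric band matrix-vector product:
 * y = alpha*diag(d)*x + beta*y.  With zero off-diagonals the band array
 * holds only the diagonal, so element i is assumed to live at d[i*lda]. */
void sbmv_k0_ref(size_t n, double alpha, const double *d, size_t lda,
                 const double *x, double beta, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * d[i * lda] * x[i] + beta * y[i];
}
```

With alpha=1 and beta=0 the second routine produces the same products as the first, but it writes them to a separate output vector y rather than updating a in place, which is one reason a dedicated Level 1 op may be the more natural fit.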
What the hell kind of operation is this?
Kind of a big topic. What about fftw (http://www.fftw.org/)? I've never
used it myself (I can't spell fft), but have heard good things about it.
Uses some of the same kinds of ideas as ATLAS, as I understand it . . .
By the way, you mentioned optimization of non-unit stride vectors. The
speedups to be had are pretty meager, even in the best case (read only).
My feeling is that 5% would be heroic. Probably not worth thinking about
except in exceptional cases. For Level 1 ops, memory bandwidth is the big
constraint most of the time, and prefetch is the only real analgesic. From
my limited experience, writing to the vector tends to kill the advantage of
prefetch a great deal (I guess the bus is too busy to prefetch), so your
big wins come on scalar-output routines like nrm2, iamax, ddot, etc.
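
To illustrate what prefetch in a scalar-output routine can look like, here is a rough sketch of a unit-stride dot product with software prefetch hints. It is not ATLAS's kernel. The name dot_prefetch, the prefetch distance PF_DIST, and the assumption of 8 doubles per cache line are all guesses for illustration, and the right values would have to be tuned per machine. __builtin_prefetch is a GCC/Clang builtin; on other compilers the hints are simply compiled out.

```c
#include <stddef.h>

/* Assumed prefetch distance in elements; a real kernel would tune this. */
#define PF_DIST 64

/* Dot product of two unit-stride vectors with read-only prefetch hints. */
double dot_prefetch(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Issue one hint per assumed cache line (8 doubles), and only
         * for addresses that are still inside the vectors. */
        if ((i & 7) == 0 && i + PF_DIST < n) {
            __builtin_prefetch(&x[i + PF_DIST], 0, 0); /* 0 = read */
            __builtin_prefetch(&y[i + PF_DIST], 0, 0);
        }
#endif
        sum += x[i] * y[i];
    }
    return sum;
}
```

Both inputs are read only and the result is a single scalar, so nothing is written back to memory to compete with the prefetches.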