Re: SSE Level 3 drop in gemm
R Clint Whaley <firstname.lastname@example.org> writes:
> If you wind up with a kernel, turn it in anyway. Kernels are preferable to
> >2) The xsmmtst always doubles ldc, even with single real precision.
> > This makes it difficult to fully capitalize on the compile-time
> > constant nature of the dimensions (i.e. one must read ldc at runtime
> > if one wants a routine that will pass both the tester and the
> > timer.)
> That's why the macro NB2 exists: it is just NB*2 as a constant . . .
> Is this what you are talking about?
Actually, I was in error here. For some reason, I thought that C was
copied as well to "block-major" storage, but ldc really is a runtime
> >3) I found it useful to also define NB4,MB4, and KB4 in emit_mm.c,
> > for obvious (in the case of SSE) reasons.
> What are these macros? NB*4?
Yes. Currently I define these and KB8... in emit_mm, which I think is
a bit excessive. I can get around it with my (very ugly) cpp
arithmetic hack if necessary. Does anyone know a better way of doing
simple arithmetic in cpp? The result cannot be an expression; it must
be an actual number fit for an assembler string. Currently, I do the
following:
#define P_1008_252 1260
#define P_1008_256 1264
#define XS(a_,b_) P_ ## b_ ## _ ## a_
#define S(a_,b_) XS(a_,b_)
In this particular case, I'm trying to avoid storing lda (for example)
in a register, and rely on its being a compile time constant.
Otherwise, I could do something like "nn(%eax,%ecx,4)".
> >8) Sure would be nice, since a copy is being done anyway, to align
> > data to 16 bytes. Anywhere I can change this locally just to see
> > what it adds to the performance?
> Yep, in ATLAS/include/atlas_misc.h, change
> #define ATL_Cachelen 32
> to
> #define ATL_Cachelen 128
OK. I did this, but it doesn't seem to affect the mmtst.c and fc.c
programs for testing and timing respectively. fc.c already aligns
things quite nicely, but I've added the following ugliness to mmtst.c
(at line 528) so far:
It turns out that alignment helps *a lot* in this case. The kernel is
up to 666MFLOPS, most of the gain over the previous 550 being in
alignment (and its consequent simplifications).
A few other items:
1) Taking a working sgemm and testing with pre=c fails to compile,
because the b1 and bX routines cannot be found. The headers added by
emit_mm define ATLAS_USERMM to b0; how are the others supposed to
link in? I'll temporarily get around this by changing the name of the
routine according to BETA.
2) I currently have a very small, but frustrating kludge in the
kernel. For some reason, calling my assembler with the __asm__
__inline__ (... :::"ax","bx",...); construct does not end up
pushing the registers that fc.c is using, leading to a segfault
unless I add an arbitrary "push %ebx\n\t"/"pop %ebx\n\t" pair
around the kernel.
3) Currently, the search algorithm selects 56 for nb, apparently due
to the size of my cache. How variable is this? I'd like to define
macros that unroll the k loop maximally according to KB. What is a
reasonable upper bound on this? Would having such a variable
unrolling confuse the search process, which reads ku from
> Thanks for all the work,
Camm Maguire email@example.com
"The earth is but one country, and mankind its citizens." -- Baha'u'llah