Re: efficient summing of vector.

To: Peter Soendergaard <soender@cs.utk.edu>
Subject: Re: efficient summing of vector.
From: Camm Maguire <camm@enhanced.com>
Date: 12 Mar 2001 10:26:06 -0500
Cc: R Clint Whaley <rwhaley@cs.utk.edu>, atlas-comm@cs.utk.edu
In-Reply-To: Peter Soendergaard's message of "Mon, 12 Mar 2001 08:14:49 -0500 (EST)"
References: <Pine.LNX.4.10.10103120809230.17034-100000@torc10.cs.utk.edu>

Hi Peter!  

Peter Soendergaard <soender@cs.utk.edu> writes:

> Hi Camm, master in the ways of intel-assembly.
> 

*please*, not true at all!  I'm fishing around in the dark like
 everyone else!

> You once wrote that you shaved an instruction off the way I sum a sse
> register. I use a sequence like this to sum the register in #reg using

If memory serves, that was not in the vector sum at the end of the
k-loop, but in the main block that loops over the 4 columns of A doing
the add-multiply.  There is also a way to make the "C write" step more
efficient, I think, but that's not what I was referring to above.  The
best strategy I've thought of so far for the latter is to combine the
fragments of the various C results (where possible) into the same
registers.  That doubles the effective workload of a given movhlps and
leaves the final 4 (single precision) C answers in a single register
(when the problem specifies that they will be contiguous), so they can
be written out to memory in a single step.

This for example is my windup step for SREAL:

#define z f(t0,0,cx) pc(4,0) pul(5,4) pc(6,1) puh(5,0) pul(7,6)  \
          pa(0,4) puh(7,1) pc(4,2) pa(1,6) ps(68,6,4) ps(238,6,2) pa(4,2) pu(2,0,cx)

i.e.
	"movaps %%xmm4,%%xmm0\n\t"
	"unpcklps %%xmm5,%%xmm4\n\t"
	"movaps %%xmm6,%%xmm1\n\t"
	"unpckhps %%xmm5,%%xmm0\n\t"
	"unpcklps %%xmm7,%%xmm6\n\t"
	"addps %%xmm0,%%xmm4\n\t"
	"unpckhps %%xmm7,%%xmm1\n\t"
	"movaps %%xmm4,%%xmm2\n\t"
	"addps %%xmm1,%%xmm6\n\t"
	"shufps $68,%%xmm6,%%xmm4\n\t"
	"shufps $238,%%xmm6,%%xmm2\n\t"
	"addps %%xmm4,%%xmm2\n\t"
	"movups %%xmm2,(%%ecx)\n\t"
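
For readers who find intrinsics easier to follow than the macro
shorthand, the same data movement can be sketched in C.  This is only
an illustration, not the ATLAS kernel itself: the function name
hsum4_store and its arguments are made up for the example, and the
four inputs stand in for the accumulators held in xmm4..xmm7.  Each
input holds four partial sums of one C entry; the result is the four
totals in one register, stored with a single unaligned write.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: horizontally sums four accumulators and stores
   c[0]=sum(a), c[1]=sum(b), c[2]=sum(g), c[3]=sum(d) in one write. */
static void hsum4_store(float *c, __m128 a, __m128 b, __m128 g, __m128 d)
{
    /* unpcklps/unpckhps interleave the two inputs; adding the halves
       gives [a0+a2, b0+b2, a1+a3, b1+b3]. */
    __m128 ab = _mm_add_ps(_mm_unpacklo_ps(a, b), _mm_unpackhi_ps(a, b));
    /* Same for the second pair: [g0+g2, d0+d2, g1+g3, d1+d3]. */
    __m128 gd = _mm_add_ps(_mm_unpacklo_ps(g, d), _mm_unpackhi_ps(g, d));

    /* shufps $68 takes the low halves: [ab0, ab1, gd0, gd1]. */
    __m128 lo = _mm_shuffle_ps(ab, gd, 68);
    /* shufps $238 takes the high halves: [ab2, ab3, gd2, gd3]. */
    __m128 hi = _mm_shuffle_ps(ab, gd, 238);

    /* One add finishes all four sums; movups writes them out. */
    _mm_storeu_ps(c, _mm_add_ps(lo, hi));
}
```

The point of the arrangement is that every unpack and shuffle works on
two C results at once, so four reductions cost two adds per pair plus
one final add, rather than a separate shuffle/add chain per register.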

Sorry this is so rushed and unclear.  Of course, this needs to be
changed somewhat for the other cases.

Take care,

> xmm7 as scratch, and it seems like a clumsy way to do it. How can it be
> done in 4 instructions?
> 
>         __asm__ __volatile__ ("movhlps " #reg ", %%xmm7\n"\
>     			      "addps " #reg ", %%xmm7\n"\
>     			      "movaps %%xmm7, " #reg "\n"\
>                               "shufps $1, " #reg ", %%xmm7\n"\
>     			      "addss %%xmm7, " #reg "\n"\
> 
> 
> Hope you can help me,
> 
> Cheers,
> Peter
> 
> 
> 

-- 
Camm Maguire			     			camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah
