
- *To*: Peter Soendergaard <soender@cs.utk.edu>
- *Subject*: Re: efficient summing of vector.
- *From*: Camm Maguire <camm@enhanced.com>
- *Date*: 12 Mar 2001 10:26:06 -0500
- *Cc*: R Clint Whaley <rwhaley@cs.utk.edu>, atlas-comm@cs.utk.edu
- *In-Reply-To*: Peter Soendergaard's message of "Mon, 12 Mar 2001 08:14:49 -0500 (EST)"
- *References*: <Pine.LNX.4.10.10103120809230.17034-100000@torc10.cs.utk.edu>

Hi Peter!

Peter Soendergaard <soender@cs.utk.edu> writes:

> Hi Camm, master in the ways of intel-assembly.
>

*please*, not true at all! I'm fishing around in the dark like everyone else!

> You once wrote that you shaved an instruction off the way I sum a sse
> register. I use a sequence like this to sum the register in #reg using

This, if memory serves, was not in the vector sum at the end of the k-loop, but in the main block looping over the 4 columns of A doing the add-multiply.

There is also a way to make the "C write" step more efficient, I think, but that's not what I was referring to above. The best strategy for the latter that I've thought of so far seems to lie in combining the fragments of the various C results (where possible) into the same registers, doubling the effective workload of a given movhlps, and winding up with the final 4 (single precision) C answers (when the problem specifies that they will be contiguous) in a single register, written out to memory in a single step.

This, for example, is my windup step for SREAL:

    #define z f(t0,0,cx) pc(4,0) pul(5,4) pc(6,1) puh(5,0) pul(7,6) \
        pa(0,4) puh(7,1) pc(4,2) pa(1,6) ps(68,6,4) ps(238,6,2) pa(4,2) pu(2,0,cx)

i.e.

    "movaps %%xmm4,%%xmm0\n\t"
    "unpcklps %%xmm5,%%xmm4\n\t"
    "movaps %%xmm6,%%xmm1\n\t"
    "unpckhps %%xmm5,%%xmm0\n\t"
    "unpcklps %%xmm7,%%xmm6\n\t"
    "addps %%xmm0,%%xmm4\n\t"
    "unpckhps %%xmm7,%%xmm1\n\t"
    "movaps %%xmm4,%%xmm2\n\t"
    "addps %%xmm1,%%xmm6\n\t"
    "shufps $68,%%xmm6,%%xmm4\n\t"
    "shufps $238,%%xmm6,%%xmm2\n\t"
    "addps %%xmm4,%%xmm2\n\t"
    "movups %%xmm2,(%%ecx)\n\t"

Sorry this is so rushed and unclear. Of course, this needs to be changed somewhat for the other cases.

Take care,

> xmm7 as scratch, and it seems like a clumsy way to do it. How can it be
> done in 4 instructions?
>
> __asm__ __volatile__ ("movhlps " #reg ", %%xmm7\n"\
>                       "addps " #reg ", %%xmm7\n"\
>                       "movaps %%xmm7, " #reg "\n"\
>                       "shufps $1, " #reg ", %%xmm7\n"\
>                       "addss %%xmm7, " #reg "\n"\
>
>
> Hope you can help me,
>
> Cheers,
> Peter
>
>

--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah

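For anyone tracing the lane movements in the two sequences above, here is a rough scalar model in C. It is only a sketch, not ATLAS code: `vec4`, `windup` and `hsum` are hypothetical names made up for illustration, the helper functions merely imitate the documented lane behaviour of the SSE instructions they are named after, and it assumes that xmm4 to xmm7 each hold the four partial sums of one C element when the windup starts.

```c
/* Scalar stand-in for one 4-lane single-precision SSE register (hypothetical type). */
typedef struct { float v[4]; } vec4;

/* unpcklps src,dst: interleave the low halves -> [d0, s0, d1, s1]. */
static vec4 unpcklps(vec4 dst, vec4 src) {
    vec4 r = {{ dst.v[0], src.v[0], dst.v[1], src.v[1] }};
    return r;
}

/* unpckhps src,dst: interleave the high halves -> [d2, s2, d3, s3]. */
static vec4 unpckhps(vec4 dst, vec4 src) {
    vec4 r = {{ dst.v[2], src.v[2], dst.v[3], src.v[3] }};
    return r;
}

/* shufps $imm,src,dst: lanes 0-1 are picked from dst, lanes 2-3 from src. */
static vec4 shufps(vec4 dst, vec4 src, int imm) {
    vec4 r = {{ dst.v[imm & 3],        dst.v[(imm >> 2) & 3],
                src.v[(imm >> 4) & 3], src.v[(imm >> 6) & 3] }};
    return r;
}

/* addps src,dst: lane-wise add. */
static vec4 addps(vec4 dst, vec4 src) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = dst.v[i] + src.v[i];
    return r;
}

/* movhlps src,dst: high half of src goes to low half of dst; dst's high half is kept. */
static vec4 movhlps(vec4 dst, vec4 src) {
    dst.v[0] = src.v[2];
    dst.v[1] = src.v[3];
    return dst;
}

/* Model of the 13-instruction windup: returns [sum(x4), sum(x5), sum(x6), sum(x7)]. */
static vec4 windup(vec4 x4, vec4 x5, vec4 x6, vec4 x7) {
    vec4 x0, x1, x2;
    x0 = x4;                    /* movaps   xmm4,xmm0      */
    x4 = unpcklps(x4, x5);      /* unpcklps xmm5,xmm4      */
    x1 = x6;                    /* movaps   xmm6,xmm1      */
    x0 = unpckhps(x0, x5);      /* unpckhps xmm5,xmm0      */
    x6 = unpcklps(x6, x7);      /* unpcklps xmm7,xmm6      */
    x4 = addps(x4, x0);         /* addps    xmm0,xmm4      */
    x1 = unpckhps(x1, x7);      /* unpckhps xmm7,xmm1      */
    x2 = x4;                    /* movaps   xmm4,xmm2      */
    x6 = addps(x6, x1);         /* addps    xmm1,xmm6      */
    x4 = shufps(x4, x6, 68);    /* shufps   $68,xmm6,xmm4  */
    x2 = shufps(x2, x6, 238);   /* shufps   $238,xmm6,xmm2 */
    x2 = addps(x2, x4);         /* addps    xmm4,xmm2      */
    return x2;                  /* the movups would store this register */
}

/* Model of the quoted 5-instruction sum of one register; the total ends up in lane 0. */
static float hsum(vec4 reg) {
    vec4 x7 = {{ 0.0f, 0.0f, 0.0f, 0.0f }}; /* scratch; its old contents do not reach lane 0 */
    x7 = movhlps(x7, reg);      /* movhlps reg,xmm7   */
    x7 = addps(x7, reg);        /* addps   reg,xmm7   */
    reg = x7;                   /* movaps  xmm7,reg   */
    x7 = shufps(x7, reg, 1);    /* shufps  $1,reg,xmm7 */
    reg.v[0] += x7.v[0];        /* addss   xmm7,reg   */
    return reg.v[0];
}
```

Calling `windup(a, b, c, d)` on four such registers should give one register holding the four totals in order, which is the point of the combined approach: 12 arithmetic and shuffle instructions plus one store for four results, where summing each register separately with the 5-instruction sequence modelled by `hsum` would take 20 before any store.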
**Follow-Ups**: **Re: efficient summing of vector.** *From:* Peter Soendergaard <soender@cs.utk.edu>

**References**: **efficient summing of vector.** *From:* Peter Soendergaard <soender@cs.utk.edu>
