Re: efficient summing of vector.
I must admit that I am already using your 12-instruction sequence. I could
not read your macro code, so I used objdump --disassemble to read your code.
I was just looking for something efficient for the complex situation,
where the result cannot be written to contiguous memory.
Cheers and thanks,
On 12 Mar 2001, Camm Maguire wrote:
> Hi Peter!
> Peter Soendergaard <firstname.lastname@example.org> writes:
> > Hi Camm, master in the ways of intel-assembly.
> *please*, not true at all! I'm fishing around in the dark like
> everyone else!
> > You once wrote that you shaved an instruction off the way I sum an SSE
> > register. I use a sequence like this to sum the register in #reg using
> This, if memory serves, was not in the vector sum at the end of the
> k-loop, but in the main block looping over the 4 columns of A doing
> the add-multiply. There is also a way to make the "C write" step more
> efficient, I think, but that's not what I was referring to above. The
> best strategy for the latter that I've thought of so far seems to lie
> in combining the fragments of the various C results (where possible)
> into the same registers, doubling the effective workload of a given
> movhlps, and winding up with the final 4 (single precision) C answers,
> (when the problem specifies that they will be contiguous), in a single
> register, and written out to memory in a single step.
> This for example is my windup step for SREAL:
> #define z f(t0,0,cx) pc(4,0) pul(5,4) pc(6,1) puh(5,0) pul(7,6) \
> pa(0,4) puh(7,1) pc(4,2) pa(1,6) ps(68,6,4) ps(238,6,2) pa(4,2) pu(2,0,cx)
> "movaps %%xmm4,%%xmm0\n\t"
> "unpcklps %%xmm5,%%xmm4\n\t"
> "movaps %%xmm6,%%xmm1\n\t"
> "unpckhps %%xmm5,%%xmm0\n\t"
> "unpcklps %%xmm7,%%xmm6\n\t"
> "addps %%xmm0,%%xmm4\n\t"
> "unpckhps %%xmm7,%%xmm1\n\t"
> "movaps %%xmm4,%%xmm2\n\t"
> "addps %%xmm1,%%xmm6\n\t"
> "shufps $68,%%xmm6,%%xmm4\n\t"
> "shufps $238,%%xmm6,%%xmm2\n\t"
> "addps %%xmm4,%%xmm2\n\t"
> "movups %%xmm2,(%ecx)\n\t"
> Sorry this is so rushed and unclear. Of course, this needs to be
> changed somewhat for the other cases.
> Take care,
> > xmm7 as scratch, and it seems like a clumsy way to do it. How can it be
> > done in 4 instructions?
> > __asm__ __volatile__ ("movhlps " #reg ", %%xmm7\n"\
> > "addps " #reg ", %%xmm7\n"\
> > "movaps %%xmm7, " #reg "\n"\
> > "shufps $1, " #reg ", %%xmm7\n"\
> > "addss %%xmm7, " #reg "\n"\
> > Hope you can help me,
> > Cheers,
> > Peter
> Camm Maguire email@example.com
> "The earth is but one country, and mankind its citizens." -- Baha'u'llah
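
For readers who find the macro and AT&T listings above hard to follow, here is a rough sketch of the same two ideas using SSE intrinsics from <xmmintrin.h>. It assumes an x86 compiler with SSE enabled. The function names hsum4 and hsum1 are made up for illustration and do not appear in the thread or in ATLAS. hsum4 mirrors the quoted windup step (two unpack/add pairs, then shufps with the immediates 68 and 238, then a final addps), leaving the four column sums in one register. hsum1 mirrors the quoted single-register sequence (movhlps, addps, shufps $1, addss). The compiler chooses its own registers and copies, so the generated code need not match the hand-written listing instruction for instruction.

```c
#include <xmmintrin.h>

/* Illustrative only: sum each of four accumulators horizontally and
 * return {sum(a), sum(b), sum(c), sum(d)} in a single register. */
static __m128 hsum4(__m128 a, __m128 b, __m128 c, __m128 d)
{
    /* {a0+a2, b0+b2, a1+a3, b1+b3} */
    __m128 ab = _mm_add_ps(_mm_unpacklo_ps(a, b), _mm_unpackhi_ps(a, b));
    /* {c0+c2, d0+d2, c1+c3, d1+d3} */
    __m128 cd = _mm_add_ps(_mm_unpacklo_ps(c, d), _mm_unpackhi_ps(c, d));
    /* shufps $68: low halves -> {a0+a2, b0+b2, c0+c2, d0+d2} */
    __m128 lo = _mm_shuffle_ps(ab, cd, 68);
    /* shufps $238: high halves -> {a1+a3, b1+b3, c1+c3, d1+d3} */
    __m128 hi = _mm_shuffle_ps(ab, cd, 238);
    return _mm_add_ps(lo, hi);
}

/* Illustrative only: horizontal sum of one register, for the case
 * where the four results cannot be stored contiguously. */
static float hsum1(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);     /* {v2, v3, v2, v3}           */
    __m128 s  = _mm_add_ps(v, hi);       /* {v0+v2, v1+v3, ...}        */
    __m128 t  = _mm_shuffle_ps(s, s, 1); /* element 0 becomes v1+v3    */
    return _mm_cvtss_f32(_mm_add_ss(s, t));
}
```

A caller would store the result of hsum4 with _mm_storeu_ps (the counterpart of the movups in the listing) when the four C entries are contiguous, and fall back to four hsum1 calls with scalar stores when they are not.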