
Re: sgemm questions



> 2) Peter's ideas of a) unrolling fully with KB ~ 56, b) 1x4 strategy
>    c) loading C at the beginning rather than at the end  and
>    (shockingly) d) doing no pipelining at all all seem to be wins.  I
>    couldn't believe d) when I saw it, but it's apparently true -- the
>    PIII likes code like load(a) mul(b,a) add(a,c) best.  Apparently,
>    the parallelism between muls and adds mentioned by Doug Aberdeen in
>    his earlier email only appears fully when the intermediary register
>    is the same.  Doug, maybe you can try this and see if you can get
>    better than 0.75 clock?  Or maybe I misunderstand you?

The following sequence gets 0.84 IPC, an improvement over 0.75, and
the best performance to date:

    MULPS(0, 1);
    ADDPS(1, 2);

    MULPS(3, 4);
    ADDPS(4, 5);

    MULPS(6, 7);
    ADDPS(7, 0);

    MULPS(3, 4);
    ADDPS(4, 5);

Note that each MULPS/ADDPS pair has no register dependency on the
adjacent pairs. When adjacent pairs do share registers, the IPC drops
to 0.29, as in the following code:

    MULPS(0, 1);
    ADDPS(1, 2);
    MULPS(2, 3);
    ADDPS(3, 4);
    MULPS(4, 5);
    ADDPS(5, 6);
    MULPS(6, 7);
    ADDPS(7, 0);

There are pipeline stalls all over this one. I don't quite understand
how the first sequence does so well, since there should still be a
stall between each MULPS and its dependent ADDPS. Either the hardware
does something funky, or the out-of-order instruction reordering works
really well in this case.

-- 

-Doug  -- http://beaker.anu.edu.au, Ph:(02) 6279-8608, Fax:(02) 6279-8651
A pessimist is just a realist who has not been proved right... yet.