
Re: sgemm questions

To: Camm Maguire <camm@enhanced.com>
Subject: Re: sgemm questions
From: Peter Soendergaard <soender@cs.utk.edu>
Date: Wed, 22 Nov 2000 12:29:17 -0500 (EST)
cc: R Clint Whaley <rwhaley@cs.utk.edu>, atlas-comm@cs.utk.edu
In-Reply-To: <54k89w9d6s.fsf@intech19.enhanced.com>

> =============================================================================
> Peter's code (sent in email, couldn't get complex to work?):
> =============================================================================

My fault. Complex should work in the next release.

> A few comments:
> 
> 1) I like Peter's idea of using a generator to write C code and then
>    compile, better than my approach of having the cpp preprocessor
>    generate assembly from defined macros.  I'd originally adopted the
>    latter because I couldn't get rid of register thrashing as gcc
>    switched between its asm and mine, but Peter's code generates very
>    clean assembly, and gcc always handles the loop overhead best.  I
>    was further a little concerned about the documentation, which seems
>    to indicate that gcc is free to insert whatever it wishes between
>    asm() calls.  We can currently produce good asm using multiple
>    asm() calls because a) gcc currently doesn't reference the extended
>    registers, and b) if we don't reference the ordinary registers in
>    the asm() explicitly, gcc's optimizer can do a good job of
>    maximizing register use across asm() calls.  If and when gcc
>    starts emitting references to SSE/MMX registers, of course, things
>    will have to change.  

Yes, I was counting on the same thing: gcc never touches the extended
registers, and I never touch the normal registers. This is my first
experience writing gcc inline assembly, so I would welcome any comments
you have on the macros. I am a bit concerned about the macros I have now,
because they do not declare that they use the extended registers, so, as
you say, they will break one day with a newer compiler.
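
To make the clobber concern concrete, here is a minimal sketch of what
declaring the extended registers could look like with a current gcc. It
is not one of the generator's macros: the function name madd4 and its
operand layout are made up for illustration, and the plain C fallback is
only there so the file builds on non-SSE targets.

```c
#include <assert.h>

/* Hypothetical helper (not from the generator): c[i] += a[i] * b[i], i = 0..3.
   The point is the clobber list: it tells gcc that the asm overwrites
   xmm0 and xmm1 and touches memory, so a compiler that allocates SSE
   registers itself will not keep live values in them across the asm. */
static void madd4(const float *a, const float *b, float *c)
{
    assert(a && b && c);
#if defined(__GNUC__) && defined(__SSE__)
    __asm__ volatile(
        "movups (%0), %%xmm0\n\t"      /* xmm0 = a[0..3] (unaligned load)   */
        "movups (%1), %%xmm1\n\t"      /* xmm1 = b[0..3]                    */
        "mulps  %%xmm1, %%xmm0\n\t"    /* xmm0 = a * b                      */
        "movups (%2), %%xmm1\n\t"      /* xmm1 = c[0..3]                    */
        "addps  %%xmm0, %%xmm1\n\t"    /* xmm1 = c + a * b                  */
        "movups %%xmm1, (%2)\n\t"      /* store back to c                   */
        :                              /* no register outputs               */
        : "r"(a), "r"(b), "r"(c)
        : "xmm0", "xmm1", "memory");   /* registers and memory we clobber   */
#else
    /* Portable fallback for targets without SSE. */
    for (int i = 0; i < 4; i++)
        c[i] += a[i] * b[i];
#endif
}
```

Putting the whole sequence in a single asm statement also sidesteps the
question of what gcc may insert between separate asm() calls, at the
cost of giving gcc less freedom to schedule around it.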

> 
> 2) Peter's ideas of a) unrolling fully with KB ~ 56, b) 1x4 strategy
>    c) loading C at the beginning rather than at the end  and
>    (shockingly) d) doing no pipelining at all, all seem to be wins.  I
>    couldn't believe d) when I saw it, but it's apparently true -- the
>    PIII likes code like load(a) mul(b,a) add(a,c) best.  Apparently,
>    the parallelism between muls and adds mentioned by Doug Aberdeen in
>    his earlier email only appears fully when the intermediary register
>    is the same.  Doug, maybe you can try this and see if you can get
>    better than 0.75 clock?  Or maybe I misunderstand you?
> 

The reason things work well with only one intermediate register might be
that as soon as a new load into that register occurs, it is mapped to
another physical register, so you end up using a whole new register. I
don't know how good the PIII is at doing this, or how many physical
registers it actually has.
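
As an illustration of the load/mul/add pattern discussed above, here is
a scalar C model of a 1x4 step that loads C first and reuses a single
temporary. It is only a sketch under assumptions: the name kernel_1x4,
the argument layout (column j of B starting at b + j*ldb) and the use of
a loop instead of full unrolling are all made up here, and the real
kernel works on SSE vectors rather than scalars. A C compiler is free to
reorder this, so it shows the intended instruction order, not what gcc
will emit.

```c
#include <assert.h>

/* Hypothetical scalar model: one row of A (length kb) times four columns
   of B, accumulated into c[0..3]. */
static void kernel_1x4(int kb, const float *a, const float *b, int ldb,
                       float *c)
{
    /* C is loaded at the beginning rather than added in at the end. */
    float c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];
    float t;    /* the single intermediate "register", reused every time */
    int k = 0;

    assert(kb > 0);    /* bottom-tested loop: length 0 is not supported */
    do {
        /* No software pipelining: each group is load, mul, add in order. */
        t = b[0 * ldb + k]; t *= a[k]; c0 += t;
        t = b[1 * ldb + k]; t *= a[k]; c1 += t;
        t = b[2 * ldb + k]; t *= a[k]; c2 += t;
        t = b[3 * ldb + k]; t *= a[k]; c3 += t;
    } while (++k < kb);

    c[0] = c0; c[1] = c1; c[2] = c2; c[3] = c3;
}
```

If the renaming guess is right, every new load into t gets a fresh
physical register, so the four chains do not actually wait on each other
even though they share one architectural name.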

> 3) I noticed the practice of checking the loops at the end, so that the
>    code fails if called with any length = 0.  This seems reasonable,
>    but I thought I'd point it out to ensure that atlas is making the
>    calls accordingly. 
> 
> 4) I really only did three things, and a few minor cleanups, to
>    Peter's code: a) shaved an instruction off the main block of 4
>    multiplies, b) tightened the writing of C, and c) with these, and
>    the elimination of a few extraneous instructions, increased the
>    optimal KB to 60 or 64.   
> 
> 5) Peter, if you'd like to make these changes in your generator, and
>    maintain this code or its equivalent, that would be just fine with
>    me.  You're doing a great job, and atlas is all the better for it!

Thank you. Please send me the changes in some easy-to-read form, and I
will update the code generator. Thanks for your feedback.
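
On point 3 above (loops tested at the end), a small C sketch of the
consequence for callers. The function names are hypothetical and the
body is a plain sum rather than kernel code; it only shows that a
bottom-tested loop runs its body once before checking the count, so a
zero length has to be filtered out by the caller.

```c
#include <assert.h>

/* Hypothetical example of a bottom-tested loop: the body runs before the
   count is checked, so n == 0 would read x[0] and then keep going. */
static float sum_bottom_tested(const float *x, int n)
{
    float s = 0.0f;
    assert(n > 0);           /* precondition the caller must guarantee */
    do {
        s += *x++;
    } while (--n);           /* test at the end of the loop */
    return s;
}

/* Caller-side guard: never enter the bottom-tested loop with length 0. */
static float sum_guarded(const float *x, int n)
{
    return n > 0 ? sum_bottom_tested(x, n) : 0.0f;
}
```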


> 
> 6) I've got a cleanup too, which works but isn't fully optimized, if
>    anyone would like to look at it.
> 

I am working on the k-cleanup now, using an idea of yours that Clint
mentioned: loop over every 4th column of A, because then you can use
aligned loads, since those columns will all be aligned the same way.
Hopefully I can get something working, but it seems to be the toughest
case to get good performance from.
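
One way to read the alignment argument, written out as a small C sketch.
The helper name col_misalign is made up, and the sketch assumes 4-byte
floats, 16-byte SSE alignment and column-major storage with leading
dimension lda. Columns j and j+4 start 4*lda*4 = 16*lda bytes apart,
which is a multiple of 16, so they sit at the same offset from a 16-byte
boundary whatever lda is.

```c
#include <stddef.h>

/* Hypothetical helper: offset in bytes of the start of column j from the
   preceding 16-byte boundary, for a column-major float matrix whose first
   element is at byte address base and whose leading dimension is lda.
   Assumes sizeof(float) == 4. */
static size_t col_misalign(size_t base, size_t lda, size_t j)
{
    return (base + j * lda * sizeof(float)) % 16;
}
```

For example, with base = 0 and lda = 7, columns 0 through 3 have offsets
0, 12, 8 and 4, and the pattern repeats from column 4 on, so each group
of every 4th column can be handled with one fixed shift and aligned
loads.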

Cheers,

Peter 

> Take care,
> 
> 
> -- 
> Camm Maguire			     			camm@enhanced.com
> ==========================================================================
> "The earth is but one country, and mankind its citizens."  --  Baha'u'llah
>
