
Re: Abysmal performance with new gccs


I got access to gcc 2.95 on the Athlon.  This allowed me to scope the
assembler being produced by gcc 3.0/2.96-80 and compare it directly with
that being produced by 2.95.  The main difference is in the scheduling.

What ATLAS usually does is stick all the loads at the top of the loop, and
then do all the computations, letting the compiler schedule the loads and
stores as it sees fit.  2.95 changed this to a mixed load/compute loop
automatically.  2.96-80 and 3.0 do not; they leave the loop as all loads,
and then all computations.
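
To make the two loop shapes concrete, here is a minimal C sketch.  It is
not ATLAS's generated kernel: the function names and the 4-way unrolled dot
product are assumptions chosen only to show "all loads, then all
computation" next to "loads mixed with computation".  A compiler is free to
reschedule either form, which is exactly the effect described above.

```c
#include <stddef.h>

/* Hypothetical illustration: every load is issued at the top of the loop
 * body, followed by all of the multiply-adds. */
double dot4_loads_first(const double *a, const double *b, size_t n)
{
    double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        /* all loads */
        const double a0 = a[i],     b0 = b[i];
        const double a1 = a[i + 1], b1 = b[i + 1];
        const double a2 = a[i + 2], b2 = b[i + 2];
        const double a3 = a[i + 3], b3 = b[i + 3];
        /* all computation */
        c0 += a0 * b0;
        c1 += a1 * b1;
        c2 += a2 * b2;
        c3 += a3 * b3;
    }
    for (; i < n; i++)          /* leftover elements */
        c0 += a[i] * b[i];
    return (c0 + c1) + (c2 + c3);
}

/* Hypothetical illustration: the same arithmetic, with later loads placed
 * between earlier multiply-adds (a mixed load/compute loop). */
double dot4_interleaved(const double *a, const double *b, size_t n)
{
    double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
    size_t i;

    for (i = 0; i + 4 <= n; i += 4) {
        double a0 = a[i],     b0 = b[i];
        double a1 = a[i + 1], b1 = b[i + 1];
        double a2, b2, a3, b3;

        c0 += a0 * b0;
        a2 = a[i + 2]; b2 = b[i + 2];   /* load while c0 is computed */
        c1 += a1 * b1;
        a3 = a[i + 3]; b3 = b[i + 3];   /* load while c1 is computed */
        c2 += a2 * b2;
        c3 += a3 * b3;
    }
    for (; i < n; i++)          /* leftover elements */
        c0 += a[i] * b[i];
    return (c0 + c1) + (c2 + c3);
}
```

Both functions compute the same result; only the source-level order of
loads and arithmetic differs.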

Fortunately, ATLAS can do its own load scheduling if it has to, so I
turned that on, and was able to reproduce the old load pattern by having
the generator put it into the C code (as opposed to having gcc do it, as
in 2.95).  So, so far what we have is (1.2GHz Athlon):

gcc2.95   + no atlas scheduling:  1332Mflops
gcc2.96-80+ no atlas scheduling:   730Mflops
gcc2.96-80+    atlas scheduling:  1245Mflops

It appears that it is not just the case that gcc 3.0 fails to do this scheduling:
it actively reschedules the code back to the all loads, all computation
model.  If I have ATLAS do the scheduling, but then throw the -fschedule-insns
flag, it reorders it so that all the fetches are at top, followed by all
computations, thus slowing the 1245 back down to the 730.  I still need to
get access to gcc 3.0/2.96-80 on a PIII, but I'm wondering if the PIII likes this
scheduling pattern, 'cause the hardware reordering finds it easy to work with.
However, I have user reports that 2.96-80 causes a slowdown on PIIIs too, so
I'm not ready to bet on that yet (see below for more on this) . . .

OK, so now the question is, why 1245 rather than 1332.  If I scope the
assembler of gcc2.95 and gcc2.96-80+atlas scheduling, the main difference
I see is in the use of the register stack.  2.96 is throwing in an
fxch between a lot of the faddp and fmul instructions.  If I remember correctly
fxch is an inst. to swap registers on the x87 stack (anyone want to correct
me on this?), which is free if done in the right place.

To give you an idea, 2.95 issues 2 fxch instructions for the entire loop, while
2.96-80 issues 42.
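
For anyone who wants to repeat this kind of count, here is a small C sketch
that tallies how often a mnemonic such as fxch opens a line of assembler
text.  The function name count_mnemonic is hypothetical, and it assumes one
instruction per line with the mnemonic as the first token, so an
instruction that shares a line with a label would be missed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the lines in `text` whose first token is
 * exactly `mnemonic` (so "fadd" does not match "faddp"). */
size_t count_mnemonic(const char *text, const char *mnemonic)
{
    const size_t mlen = strlen(mnemonic);
    size_t count = 0;
    const char *p = text;

    if (mlen == 0)
        return 0;

    while (*p != '\0') {
        /* Skip leading blanks on this line. */
        while (*p == ' ' || *p == '\t')
            p++;
        /* Match the mnemonic as a whole token. */
        if (strncmp(p, mnemonic, mlen) == 0 &&
            (p[mlen] == '\0' || isspace((unsigned char)p[mlen])))
            count++;
        /* Advance to the start of the next line. */
        while (*p != '\0' && *p != '\n')
            p++;
        if (*p == '\n')
            p++;
    }
    return count;
}
```

Reading one of the .s files into a buffer and calling
count_mnemonic(buf, "fxch") gives a per-file count that can be compared
across compilers.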

OK, here is the guesswork part: I think we are seeing a huge slowdown on
Athlons 'cause the scheduling is all load/all compute, which is good for
the Pentiums, but bad for Athlons.  Then, we are seeing a smaller slowdown
on all x86 platforms because of a new register-stack handling scheme, which
is not as effective as the old.  My guess is that this new scheme handles
code which does no register blocking just fine, and since very little x86
code is register blocked, they never caught this slowdown in testing . . .

In case anyone wants to scope all of this for themselves, I include a tarfile
below with the various assembler routines:

dmm_295.s                : 1332Mflops
dmm_AtlasSched_296-85.s  : 1245Mflops
dmm_gccsched_296-85.s    :  730Mflops

In general, gcc3.0 appears to ATLAS like a completely new compiler, not a new
release of gcc.  For instance, it is looking likely that all of the
architectural defaults that work for gcc 2.95 and earlier are very non-optimal
on gcc 3.0.  I scoped the ev6, and there 3.0 actually is faster than
2.95 or 2.8 (2.8 is faster than 2.95 for alphas), but you need different
arch defaults.  This is going to really balkanize the defaults, and I'm
pretty sure we'll eventually have to support defaults for only the old stuff
or the new stuff, since supporting both would be a bit onerous . . .