[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: binary installation issues (cont'd)


Thanks for the code.  The NCPU, in particular, is godsend.  So, let me handle
the easy, P.S. part first:

>PS Some buglets in 3.1.4D:
>	- config.c line 2131: There is a newline in the string
>	  which causes HP's compiler to complain/barf

Thanks, it'll be fixed in next developer release.

>	- Instead of using "-Aa -D_INCLUDE_POSIX_SOURCE"
>	  or simply "-Aa" for HP flags, you should probably use
>	  "-Ae" instead.

The man page on my decrepit HP-UX box does not discuss -Ae, though cc seems
to take it without warnings.  What does -Ae mean?

>	- Add HP-UX specific code to discover number of CPUs

Thanks a lot for this code!  It will be in the next developer release.
Can syscall also give the size of various caches, or other arch info?

>	- HP-PA machines generally only have an L1 cache,
>	  they don't have an L2/L3 cache.  However, the L1
>	  cache is usually between 256KB and 2MB for many
>	  machines.  tune/sysinfo/L1CacheSize.c assumes
>	  that L1 caches are at most 256KB.  I think a better
>	  algorithm might be to use binary search where the
>	  two end points are very small (say 1K) and very
>	  large (say 2x MaxL2CacheSize).  The binary search
>	  would be on the log2() size of the prospective cache
>	  sizes.

OK, the first thing to know is that at the present time, the Level 3 BLAS
will get no advantage from pumping L1CacheSize larger than 64K.  That's
not to say it won't use larger L1 caches, but just that the L1CacheSize
setting will not effect performance if cranked above 64K.  It may have
a good effect on the Level 2 BLAS performance, though. I don't have
time at the moment to scope the changes you've made to the L1 cache detector,
but I'll try to scope it out at some point . . .

>I apologize --- I looked a bit more carefully at the search code,
>and it looks like you minimize search time by using binary
>search instead of a full grid search.  

It's not precisely a binary search, rather it is a heuristic that
uses a variety of techniques in differing parts of the search.  An
exhaustive search of the code generator's parameter space alone would take
weeks if not months . . .

>Given that a significant fraction of the time spent tuning ATLAS
>is spent compiling, and given that you already use binary 
>search, I think that building binaries for the full grid search 
>would likely be prohibitively expensive.  Also, I think the
>resulting installation package would be huge, since the 
>aggregate size of all .o files for the full search space would
>be overwhelming.
>If the package were distributed by CD-ROM, size wouldn't
>matter so much, and since the build only happens once,
>the huge build time would not be a showstopper.  However,
>it feels a bit like a kludge to me...

I think you better change "CD-ROM" to DVD, at the miminum.  I am willing
to bet ATLAS in source + gccs for all known platforms would be
significantly smaller than the object code for a full parameter space of
the code generator . . .

>So, do you have any suggestions as to whether it might be
>possible to extend/enhance/modify ATLAS to enable
>distributors to build/redistribute binary packages of ATLAS
>which could tune themselves with minimal delay and
>recompilation on target machines?

I really don't think compilation is that big a part of the cost in general.
I can't speak too much to HP-UX, since I now have access to exactly one
HP-UX machine, an ancient HP9000/735, that does indeed take overnight *at
least* to do an ATLAS install.  The other over-nighter install is our
SGI machine, which seems to spend most of it's time in "ar".  I always
wonder how much of these slow-ass machine's problems are caused by NFS.
Have you installed in /tmp or somewhere like that?

Unfortunately, I don't believe there is any way to do a binary-only package
of ATLAS with any real architecture independance.  However, one targeted
to a series of related architectures might possibly be viable.  Again,
I can't speak to HP-UX, but Intel provides an example of this.  There is
presently only a CacheEdge (L2-blocking) difference between
PII/PIII-external cache and PIII-on-chip cache, so both these archs
could be supported by 1 lib, and if you have PIII with on-chip
cache, you replace the something like 2 routines that have CacheEdge.
However, this quickly breaks down; for instance, we now have SSE stuff in
the developer releases, and SSE will work or not depending on your kernel,
so we have a new lib to get, and rapidly we see the precompiled libs
becoming very specific, with less and less object overlap.  At this point,
it makes sense to just precompile libs for every conceivable arch, and save
yourself the complexity of building partial libs with insane interdependencies.

So, for binary-only vendors, that is what I think I would do: a binary-only
distribution is by necessity more restricted than a source-only, so I would
be tempted to try to build the complete, specific libs for all expected
archs, and include them all, with an install script that uncompresses only
the correct one for your arch.  Obviously, this is not as flexible as
installing from source, but I don't see how do be so, without putting in
an incredible amount of work that would need to be updated in lockstep
with ATLAS changes . . .

Install time continues to rise as we add more and more functionality
to ATLAS (i.e., now we search Level 2 and 3, and we plan to add Level 1,
as well as extending already existing stuff);  as this happens, I think
we will need to allow users to guide how we optimize: i.e., they indicate
which routines to optimize, so we don't spend 2 hours optimizing level 1
for a guy who doesn't use it.