From stevens@mcs.anl.gov Sat Jan 7 17:03:08 1995

Summary of ANL Projects for the National HPCC Software Exchange

ANL's work in progress to support the NHSE is focusing on the following areas:

Modular Web Robot---
Development of a modular, programmable web robot designed to efficiently cache web pages on a local server based on programmable starting locations, keywords, file types, and other search criteria. The robot is designed to run in parallel to allow high-performance gathering of web pages, and its modular design enables it to be rapidly modified for experimental purposes. It has collected data on thousands of WWW sites. The robot's purpose with respect to the NHSE is to provide the raw WWW pages needed for queries of various types. The Web Robot is written in Perl5 and runs on several platforms, including Sun workstations and the IBM RS/6000.

Parallel Web Indexing Engine---
We are developing a parallel extension of the Glimpse (University of Arizona) indexing system for rapidly indexing web pages (*.html and other file types) on parallel systems and for providing rapid regular-expression-based parallel searches of web page caches, such as those generated by our web robot. We are also developing extensions to the query system that allow us to locate "software" in the midst of other web information, supporting searches for data that contains software (source files, binaries, tar files, makefiles, etc.) across the WWW. This web indexing engine should in principle be scalable to millions of URLs; a five-million-URL test run is planned for the near future.

DNS/Geographical Database and Mapping Software---
We are developing a database to support mapping internet site domain names to geographical places (coordinates and place names) for display on a variety of GIS systems.
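The robot's cache-or-skip decision described above amounts to matching fetched pages against the configured criteria. A minimal sketch in Python (the function name, criteria layout, and values are illustrative, not taken from the Perl5 implementation):

```python
# Hypothetical search criteria of the kind the robot is configured with:
# keywords and file types (the specific values here are illustrative).
CRITERIA = {
    "keywords": ["linear algebra", "parallel", "hpcc"],
    "file_types": [".html", ".tar", ".ps"],
}

def should_cache(url, page_text, criteria=CRITERIA):
    """Decide whether a fetched page should be cached on the local server."""
    # Accept only URLs ending in one of the configured file types...
    if not any(url.lower().endswith(ext) for ext in criteria["file_types"]):
        return False
    # ...whose text mentions at least one configured keyword.
    text = page_text.lower()
    return any(kw in text for kw in criteria["keywords"])
```

In the real robot this filter would run in each of the parallel fetch processes, so that only matching pages reach the shared cache.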
This database (which is quasi-automatically built from various internet lookup services) makes it possible to map WWW usage log data to geographical coordinates, allowing users to visualize the geographical distribution of requests to WWW servers. The mapping software provides various views of the US and other geographical areas and supplies the background for displaying the locations of server connections and downloads. This tool gives the NHSE contributor or maintainer an instant overview of the number and location of sites that have downloaded NHSE software and data.

Autonomous Agents---
We have begun work on the design and implementation of several types of search agents: software that, given instructions from the user, can automatically use various internet mechanisms to locate or monitor data for the user. Our focus has been on two types of agents. The first is designed to comprehensively build up a database of network-available data/information/software based on a keyword list and to provide the user with daily updates on changes to this database (e.g., new internet sites that contain data matching the keywords, or changes to existing sites). The idea is to allow software providers to monitor the redistribution of software via ftp or the WWW and, for example, to detect incorrect version propagation. The second type of agent is designed to monitor a set of WWW sites and detect significant changes in the web structures of these sites. For example, a set of sites developing linear algebra software might all link to other Web sites relevant to work in linear algebra; the user would like to be notified when something new is referenced by, say, more than four of the chosen sites, since multiple sites linking to a new site gives some indication that it is worthy of attention.
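The second agent's "referenced by more than four sites" test reduces to counting cross-site links. A minimal sketch (the data layout and names are our illustration, not the agent's actual representation):

```python
from collections import Counter

def widely_referenced(site_links, threshold=4):
    """Return URLs linked to by more than `threshold` of the monitored sites.

    site_links maps each monitored site to the set of URLs it references;
    using sets counts each site at most once per URL.
    """
    counts = Counter(url for links in site_links.values() for url in links)
    return {url for url, n in counts.items() if n > threshold}
```

A monitoring agent would recompute this over freshly crawled link sets each day and notify the user of URLs newly crossing the threshold.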
---------------------------------------------------------------------
Work at UT on the NHSE, 12-17-94 to 1-13-95

Survey of NHSE Software
-----------------------
We are in the process of completing a survey of all the software currently pointed to by the NHSE HTML pages. The purposes of this survey are the following:
- to get some idea of the size and scope of the software base the NHSE will provide an interface to
- to provide raw material for derivation of search vocabulary
- to prototype a combined search/browse roadmap-like interface
- to provide a well-studied collection of HPCC software that may be used for measurements of the effectiveness of various search tools

We have compiled a listing of around 250 items with descriptions, including parallel processing tools, numerical software, and application software. We are writing a survey report that will be available in both HTML and PostScript formats. We will compile a preliminary list of thesaurus terms to be assigned to the software descriptions. The software descriptions will also be input to a natural language processing engine that automatically extracts noun phrases from a body of documents.

Further Design of URN/LIFN System
---------------------------------
Recall from last month's report that we are working on a publishing system for Internet-accessible files that includes location-independent naming (URN/LIFNs) and authentication mechanisms. The team working on the URN/LIFN system has met frequently during the past month to hash out various design decisions. These decisions have involved various nitty-gritty details such as:
- character sets to be supported
- the format of assertions and certificates for cataloguing information
- the provision for a history of a URN in terms of the sequence of LIFNs it has been associated with
- steps publishers will need to carry out to initially publish assets and to update assertions/certificates
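One design point above, the history of a URN as the sequence of LIFNs it has been associated with, can be modeled as an append-only mapping. A toy sketch under our reading of the design (class and method names are hypothetical, not the system's actual interface):

```python
class UrnRegistry:
    """Toy model of URN -> LIFN history: a URN names an abstract asset,
    while each LIFN names one immutable instance of its contents."""

    def __init__(self):
        self._history = {}  # urn -> list of LIFNs, oldest first

    def publish(self, urn, lifn):
        # Publishing a new instance appends a LIFN to the URN's history.
        self._history.setdefault(urn, []).append(lifn)

    def current(self, urn):
        # The most recently published LIFN for this URN.
        return self._history[urn][-1]

    def history(self, urn):
        # Full sequence of LIFNs the URN has been associated with.
        return list(self._history[urn])
```

The real server system would additionally attach the assertions and certificates mentioned above to each publish step; this sketch shows only the naming history.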
We are drafting a technical report giving the rationale for the URN/LIFN project and detailing our design decisions. We are proceeding with implementation of the URN/LIFN server system, client library, and publishing tool.