From stevens@mcs.anl.gov Sat Jan 7 17:03:08 1995

Summary of ANL Projects for the National HPCC Software Exchange

ANL's work in progress to support the NHSE is focusing on the following areas:

Modular Web Robot---
Development of a modular, programmable web robot designed to efficiently cache web pages on a local server based on programmable starting locations, keywords, file types, and other search criteria. The robot is designed to run in parallel to allow high-performance gathering of web pages, and its modular design enables it to be rapidly modified for experimental purposes. It has collected data on thousands of WWW sites. The robot's purpose with respect to the NHSE is to provide the raw WWW pages needed for queries of various types. The Web Robot is written in Perl5 and runs on several platforms, including Sun workstations and the IBM RS/6000.

Parallel Web Indexing Engine---
We are developing a parallel extension of the Glimpse (University of Arizona) indexing system for rapidly indexing web pages (*.html and other file types) on parallel systems and for providing rapid regular-expression-based parallel searches of web page caches, such as those generated by our web robot. We are also developing extensions to the query system that allow us to locate "software" in the midst of other web information, supporting searches for data that contains software (source files, binaries, tar files, makefiles, etc.) across the WWW. This web indexing engine should in principle be scalable to millions of URLs; a five-million-URL test run is planned for the near future.

DNS/Geographical Database and Mapping Software---
We are developing a database to support mapping internet site domain names to geographical places (coordinates and place names) for display on a variety of GIS systems.
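The robot's cache-or-skip decision described above amounts to matching fetched pages against the configured criteria. A minimal sketch in Python (the function name, criteria layout, and values are illustrative, not taken from the Perl5 implementation):

```python
# Hypothetical search criteria of the kind the robot is configured with:
# keywords and file types (the specific values here are illustrative).
CRITERIA = {
    "keywords": ["linear algebra", "parallel", "hpcc"],
    "file_types": [".html", ".tar", ".ps"],
}

def should_cache(url, page_text, criteria=CRITERIA):
    """Decide whether a fetched page should be cached on the local server."""
    # Accept only URLs ending in one of the configured file types...
    if not any(url.lower().endswith(ext) for ext in criteria["file_types"]):
        return False
    # ...whose text mentions at least one configured keyword.
    text = page_text.lower()
    return any(kw in text for kw in criteria["keywords"])
```

In the real robot this filter would run in each of the parallel fetch processes, so that only matching pages reach the shared cache.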
This database (which is quasi-automatically built from various internet lookup services) makes it possible to map WWW usage log data to geographical coordinates, allowing users to visualize the geographical distribution of requests to WWW servers. The mapping software provides various views of the US and other geographical areas and supplies the background for displaying the locations of server connections and downloads. This tool gives the NHSE contributor or maintainer an instant overview of the number and location of sites that have downloaded NHSE software and data.

Autonomous Agents---
We have begun work on the design and implementation of several types of search agents: software that, given instructions from the user, can automatically use various internet mechanisms to locate or monitor data for the user. Our focus has been on two types of agents. The first is designed to comprehensively build up a database of network-available data/information/software based on a keyword list and to provide the user with daily updates on changes to this database (e.g., new internet sites that contain data matching the keywords, or changes to existing sites). The idea is to allow software providers to monitor the redistribution of software via ftp or the WWW and, for example, to detect incorrect version propagation. The second type of agent is designed to monitor a set of WWW sites and detect significant changes in the web structures of these sites. For example, a set of sites developing linear algebra software might all link to other Web sites relevant to work in linear algebra; the user would like to be notified when something new is referenced by, say, more than four of the chosen sites, since multiple sites linking to a new site gives some indication that it is worthy of attention.
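The second agent's "referenced by more than four sites" test reduces to counting cross-site links. A minimal sketch (the data layout and names are our illustration, not the agent's actual representation):

```python
from collections import Counter

def widely_referenced(site_links, threshold=4):
    """Return URLs linked to by more than `threshold` of the monitored sites.

    site_links maps each monitored site to the set of URLs it references;
    using sets counts each site at most once per URL.
    """
    counts = Counter(url for links in site_links.values() for url in links)
    return {url for url, n in counts.items() if n > threshold}
```

A monitoring agent would recompute this over freshly crawled link sets each day and notify the user of URLs newly crossing the threshold.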
---------------------------------------------------------------------
Work at UT on the NHSE, 12-17-94 to 1-13-95

Survey of NHSE Software
-----------------------
We are in the process of completing a survey of all the software currently pointed to by the NHSE HTML pages. The purposes of this survey are the following:
- to get some idea of the size and scope of the software base the NHSE will provide an interface to
- to provide raw material for derivation of search vocabulary
- to prototype a combined search/browse roadmap-like interface
- to provide a well-studied collection of HPCC software that may be used for measurements of the effectiveness of various search tools

We have compiled a listing of around 250 items with descriptions, including parallel processing tools, numerical software, and application software. We are writing a survey report that will be available in both HTML and PostScript formats. We will compile a preliminary list of thesaurus terms to be assigned to the software descriptions. The software descriptions will also be input to a natural language processing engine that automatically extracts noun phrases from a body of documents.

Further Design of URN/LIFN System
---------------------------------
Recall from last month's report that we are working on a publishing system for Internet-accessible files that includes location-independent naming (URN/LIFNs) and authentication mechanisms. The team working on the URN/LIFN system has met frequently during the past month to hash out various design decisions. These decisions have involved various nitty-gritty details such as:
- character sets to be supported
- the format of assertions and certificates for cataloguing information
- the provision for a history of a URN in terms of the sequence of LIFNs it has been associated with
- steps publishers will need to carry out to initially publish assets and to update assertions/certificates
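One design point above, the history of a URN as the sequence of LIFNs it has been associated with, can be modeled as an append-only mapping. A toy sketch under our reading of the design (class and method names are hypothetical, not the system's actual interface):

```python
class UrnRegistry:
    """Toy model of URN -> LIFN history: a URN names an abstract asset,
    while each LIFN names one immutable instance of its contents."""

    def __init__(self):
        self._history = {}  # urn -> list of LIFNs, oldest first

    def publish(self, urn, lifn):
        # Publishing a new instance appends a LIFN to the URN's history.
        self._history.setdefault(urn, []).append(lifn)

    def current(self, urn):
        # The most recently published LIFN for this URN.
        return self._history[urn][-1]

    def history(self, urn):
        # Full sequence of LIFNs the URN has been associated with.
        return list(self._history[urn])
```

The real server system would additionally attach the assertions and certificates mentioned above to each publish step; this sketch shows only the naming history.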
We are drafting a technical report giving the rationale for the URN/LIFN project and detailing our design decisions. We are proceeding with implementation of the URN/LIFN server system, client library, and publishing tool.