Failure Detection



Failures may occur at different levels of the NetSolve protocols. They are generally caused by a network malfunction, a server disappearance, or a server failure. A NetSolve process (i.e., a client, a server, or a utility process created by a server) detects such a failure when it tries to establish a TCP connection with a server: the connection attempt either fails or times out before completing. The process then reports the error to the NetSolve agent, which takes the failure into account.
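The detection step can be illustrated with a minimal Python sketch. It is not NetSolve code: the function name probe_server, the default timeout, and the returned strings are assumptions made for illustration only. It shows the two cases described above, a connection attempt that fails outright and one that times out.

```python
import socket


def probe_server(host, port, timeout=5.0):
    """Try to open a TCP connection to a server.

    Returns None on success, or a short failure reason that the caller
    could forward to the agent. Name and return values are hypothetical.
    """
    try:
        # create_connection raises if the attempt fails or exceeds `timeout`.
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except socket.timeout:
        # The attempt did not complete within the allowed time.
        return "timeout"
    except OSError as exc:
        # Refused, unreachable, name resolution failure, and similar errors.
        return f"connection failed: {exc}"
```

A caller would treat any non-None result as a failure to report to the agent. How the report is actually transmitted is not specified here and is left out of the sketch.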

One of the prerequisites for NetSolve was that a server can be stopped and restarted safely. Because NetSolve can be used over a wide area network, an old failure report may well arrive after the failed server has already been restarted. Every error report therefore contains the information needed to determine whether the server was restarted after the error occurred. This is what allows a NetSolve server to be stopped and restarted safely at any time.
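One possible way to recognize a late report is sketched below in Python. The text does not say what information the reports carry, so the incarnation counter used here is purely an assumption, as are the names FailureReport and is_stale. The idea is only that a report tagged with an older incarnation than the one currently known refers to a server instance that has since been restarted.

```python
from dataclasses import dataclass


@dataclass
class FailureReport:
    """Hypothetical report format, not the actual NetSolve message layout."""
    server_id: str
    incarnation: int  # assumed counter that a server increments on each restart


def is_stale(report, current_incarnation):
    """Return True if the report concerns an instance that was later restarted.

    `current_incarnation` is the incarnation the receiver currently has on
    record for report.server_id (an assumption of this sketch).
    """
    return report.incarnation < current_incarnation
```

A stale report can simply be ignored, since it describes a failure that the restart has already superseded. A start timestamp would serve the same purpose as the counter, provided clocks are compared on one machine only.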

When the agent takes a failure into account, it marks the failed server in its data structures but does not remove it immediately. The server is removed only after a given period of time, and only if it has not been restarted in the meantime.
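The mark-then-remove policy can be sketched as a small Python table. This is an illustrative model, not the agent's real data structure: the class name ServerTable, the grace_seconds parameter, and its default value are all assumptions.

```python
import time


class ServerTable:
    """Hypothetical model of the agent's server list."""

    def __init__(self, grace_seconds=300.0):
        self.grace_seconds = grace_seconds  # assumed delay before removal
        self.failed_at = {}  # server_id -> failure time, or None if healthy

    def register(self, server_id):
        # Called when a server starts or restarts; clears any failure mark.
        self.failed_at[server_id] = None

    def mark_failed(self, server_id, now=None):
        # Mark the server as failed but keep its entry in the table.
        if self.failed_at.get(server_id, 0.0) is None:
            self.failed_at[server_id] = time.time() if now is None else now

    def purge(self, now=None):
        # Remove servers that stayed failed for the whole grace period.
        now = time.time() if now is None else now
        expired = [s for s, t in self.failed_at.items()
                   if t is not None and now - t >= self.grace_seconds]
        for server_id in expired:
            del self.failed_at[server_id]
        return expired
```

For example, if servers "a" and "b" are both marked failed and only "b" registers again before the grace period ends, a later call to purge removes "a" and leaves "b" in the table.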

Joint Institute for Computational Science
Mon Apr 29 13:00:40 EDT 1996