The importance of recovery has long been known in the database community, where transactions prevent data corruption and allow applications to manage failure. More recently, the need for failure recovery has moved from specialized applications and systems to the more general arena of commodity systems.
A general approach to recovery is to run application replicas on two machines, a primary and a backup. All inputs to the primary are mirrored to the backup. After a failure of the primary, the backup machine takes over to provide service. The replication can be performed by the hardware, at the hardware-software interface, at the system call interface, or at a message passing or application interface. Shadow drivers similarly replicate all communication between the kernel and device driver (the primary), sending copies to the shadow driver (the backup). If the driver fails, the shadow takes over temporarily until the driver recovers. However, shadows differ from typical replication schemes in several ways. First, because our goal is to tolerate only driver failures, not hardware failures, both the shadow and the “real” driver run on the same machine. Second, and more importantly, the shadow is not a replica of the device driver: it implements only the services needed to manage recovery of the failed driver and to shield applications from the recovery. For this reason, the shadow is typically much simpler than the driver it shadows.
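The primary/backup relationship described above can be illustrated with a small user-space sketch. It is not the kernel mechanism itself: the names `Interposer`, `FlakyDriver`, `Shadow`, and `DriverFailure` are hypothetical, and the "driver" is an ordinary Python object. The sketch only shows the shape of the idea: calls are forwarded to the driver and copied to the shadow, and the shadow answers in the driver's place while the driver is down.

```python
class DriverFailure(Exception):
    """Raised by the hypothetical driver to simulate a crash."""


class FlakyDriver:
    """Stand-in for a real driver; fails whenever `failed` is set."""

    def __init__(self):
        self.failed = False

    def handle(self, request):
        if self.failed:
            raise DriverFailure(request)
        return f"driver:{request}"


class Shadow:
    """Much simpler than the driver: it observes traffic and, during
    recovery, answers just enough to hide the failure from callers."""

    def __init__(self):
        self.observed = []

    def observe(self, request, response):
        # Passive mode: keep a copy of the kernel-driver communication.
        self.observed.append((request, response))

    def handle_during_recovery(self, request):
        # Active mode: stand in for the driver until it recovers.
        return f"shadow:{request}"


class Interposer:
    """Sits between callers and the driver (assumed design, for illustration)."""

    def __init__(self, driver, shadow):
        self.driver = driver
        self.shadow = shadow
        self.shadow_active = False

    def call(self, request):
        if not self.shadow_active:
            try:
                response = self.driver.handle(request)
                self.shadow.observe(request, response)
                return response
            except DriverFailure:
                # The driver failed: the shadow takes over temporarily.
                self.shadow_active = True
        return self.shadow.handle_during_recovery(request)

    def driver_recovered(self):
        # Hand control back once the driver has been restarted.
        self.shadow_active = False
```

Because both objects live in the same process, the sketch mirrors the point that shadow and driver run on the same machine and that only driver failures, not hardware failures, are tolerated.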
Another common recovery approach is to restart applications after a failure. Many systems periodically checkpoint application state, while others combine checkpoints with logs. These systems transparently restart failed applications from their last checkpoint (possibly on another machine) and replay the log if one is present. Shadow drivers take a similar approach by replaying a log of requests made to drivers. Recent work has shown that this approach is limited when recovering from application faults: applications often become corrupted before they fail; hence, their logs or checkpoints may also be corrupted. Shadow drivers reduce this risk by logging only a small subset of requests. Furthermore, application bugs tend to be deterministic and recur after the application is restarted. Driver faults, in contrast, often cause transient failures because of the complexities of the kernel execution environment.
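The log-and-replay idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual shadow driver logic: the request kinds (`open`, `configure`, `send`), the `STATEFUL_KINDS` set, and the `Driver`, `RequestLog`, and `issue` names are all hypothetical. The point it shows is that only the small subset of requests that establish driver state is logged, so replaying the log into a restarted driver restores its configuration without re-executing ordinary traffic.

```python
# Hypothetical request kinds that change long-lived driver state.
STATEFUL_KINDS = {"open", "configure"}


class Driver:
    """Toy driver whose state is an open flag and a configuration table."""

    def __init__(self):
        self.is_open = False
        self.config = {}
        self.sent = 0

    def handle(self, kind, arg=None):
        if kind == "open":
            self.is_open = True
        elif kind == "configure":
            key, value = arg
            self.config[key] = value
        elif kind == "send":
            self.sent += 1
        else:
            raise ValueError(f"unknown request kind: {kind}")


class RequestLog:
    """Records only state-establishing requests, not every request."""

    def __init__(self):
        self.entries = []

    def record(self, kind, arg=None):
        if kind in STATEFUL_KINDS:
            self.entries.append((kind, arg))

    def replay(self, driver):
        # Re-issue the logged requests, in order, to a restarted driver.
        for kind, arg in self.entries:
            driver.handle(kind, arg)


def issue(driver, log, kind, arg=None):
    """Send a request to the driver and log it only if it succeeded."""
    driver.handle(kind, arg)
    log.record(kind, arg)
```

A usage pattern would be to route every request through `issue`, then, after a failure, construct a fresh `Driver` and call `log.replay(new_driver)`. Keeping the log small limits how much possibly corrupted state can be carried across the restart.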
