2010
07.05

Web archives arguably face the largest challenge when it comes to the handling of file formats (such as QIF). Web material is usually harvested from any number of external sites, each of which may have its own ideas and policies about which formats to use. The web archive will often have no influence on these policies and must therefore accept material in any and every form. While every harvested digital object is certainly a bit stream that can be stored and preserved as such, the bit stream is of little value to anyone if the producer’s intended semantics – as defined by its format – cannot be determined and applied.

This challenge expresses itself concretely when we attempt to access an object contained in a web archive. We may want the object to be rendered on screen, to be manifested through other media (e.g. loudspeakers), or we may just want certain features extracted from it. In every case we need our system to help us figure out which applications are appropriate for processing the bits that make up the object. The choice of application will often be determined by the local setup of the computer that is used for accessing the object (the access machine). On the basis of a file extension (e.g. .dat) or a MIME type, the local operating system or browser will pick a preferred application and use that for interpreting the given bit stream.
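The selection step described above can be sketched in a few lines of Python. This is only an illustration: the `PREFERRED_APPS` table and the `pick_application` function are hypothetical names, and a real access machine would consult its operating system or browser configuration rather than a hard-coded dictionary. The sketch also assumes that a MIME type declared at harvest time takes precedence over the file extension.

```python
import mimetypes

# Hypothetical handler table; a real access machine would read this
# from its OS or browser settings.
PREFERRED_APPS = {
    "text/html": "web browser",
    "image/png": "image viewer",
    "audio/mpeg": "audio player",
    "application/pdf": "PDF viewer",
}


def pick_application(filename, declared_mime=None):
    """Return the preferred application for an object, or None if unknown.

    A declared MIME type (for example, one recorded at harvest time) is
    assumed to win; otherwise the type is guessed from the file extension.
    """
    mime = declared_mime
    if mime is None:
        mime, _encoding = mimetypes.guess_type(filename)
    return PREFERRED_APPS.get(mime)
```

When neither the declared type nor the extension maps to a known handler, the function returns `None`, which is exactly the situation in which the archived bit stream cannot be interpreted without further format identification.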

The webarchive file format is available on Mac OS X and Windows for saving and reviewing complete web pages using the Safari browser. Support for webarchive documents was added in Safari 4 Beta on Windows; Safari 3 on Windows does not support the format. The webarchive format is a concatenation of source files with their filenames, saved in the binary plist format using NSKeyedArchiver. Although it is a platform-independent format, many people prefer to use Safari’s Print to PDF feature to store web pages instead. Indeed, the .webarchive format appears to be more of a convenience for Mac developers. The API uses webarchives to simplify cutting and pasting whole or partial web pages.
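Because a webarchive is stored as a binary plist, it can be inspected with a generic plist reader. The sketch below uses Python's standard `plistlib` module. It assumes the commonly observed layout in which the top-level dictionary holds a `WebMainResource` entry and an optional `WebSubresources` list, each resource carrying `WebResourceURL`, `WebResourceMIMEType` and `WebResourceData` keys. That layout is an assumption here, not something the text above specifies, and `summarize_webarchive` is a hypothetical helper name.

```python
import plistlib


def summarize_webarchive(raw_bytes):
    """Summarize the resources stored in a .webarchive byte string.

    Assumes the top-level plist is a dictionary with a "WebMainResource"
    entry and an optional "WebSubresources" list (assumed layout).
    """
    archive = plistlib.loads(raw_bytes)  # handles binary and XML plists
    main = archive.get("WebMainResource", {})
    subresources = archive.get("WebSubresources", [])
    return {
        "main_url": main.get("WebResourceURL"),
        "main_mime": main.get("WebResourceMIMEType"),
        "main_size": len(main.get("WebResourceData", b"")),
        "subresource_urls": [r.get("WebResourceURL") for r in subresources],
    }
```

A file saved by Safari could be summarized by reading it in binary mode and passing the bytes to the function, for example `summarize_webarchive(open("page.webarchive", "rb").read())`, where the filename is a placeholder.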