Electronic documents are often legal, historic, or business
transaction records, and queries against such documents typically
involve entities and relationships that represent features of the text
itself as well as features of the businesses involved in the
contractual agreements. For an XML database, one fundamental
semantic issue is document equivalence [40]: when are two
documents, document parts, or document DTDs the same? For
example, before inserting a document into the database, we might
want to find out if the same document is already in the database.
The question of equivalence is important in satisfying
requirements for evidence and archiving, for version management,
for metadata management, and (as is true of all forms of data) for
query optimization.
The XML 1.0 specification does not define equality of documents
or equality of entities, nor do the Infoset, XPath, or DOM models.
The XQuery 1.0 and XPath 2.0 Data Model includes one equality
operator to test node identity and another to test equality of
values. However, the semantics for the equality of node values do
not encompass all of the data in an XML document. W3C has proposed
that Canonical XML [10] be used to test the equivalence of
two documents. The canonical form is created by a process called
canonicalization, either from an XPath node set or from an octet
stream containing a well-formed XML document. In both cases
canonicalization omits some of the information in the original
XML document. Since such a canonical form does not contain all
information from an XML document, this definition of
equivalence may not satisfy all applications’ needs. One solution
is to define document equivalence in terms of a model that
includes all document features, after which application-dependent
definitions of equivalence can be specified by applying document
equivalence to application-specific transformations of the
documents to be compared.
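
The idea of layering application-dependent equivalence on top of a
stricter base comparison can be illustrated with a minimal sketch.
It rests on several assumptions that are not part of the text above:
the names `equivalent`, `identity`, `c14n`, and
`c14n_ignore_whitespace` are hypothetical, character-for-character
string equality is only a crude stand-in for a model that includes
all document features, and Python's `xml.etree.ElementTree.canonicalize`
implements Canonical XML 2.0, which may differ from the version
cited as [10].

```python
import xml.etree.ElementTree as ET
from typing import Callable


def identity(doc: str) -> str:
    # Stand-in for a model that keeps every document feature:
    # nothing is discarded, so only identical text compares equal.
    return doc


def c14n(doc: str) -> str:
    # Canonical form (C14N 2.0 in Python's standard library).
    # The XML declaration and, by default, comments are omitted.
    return ET.canonicalize(doc)


def c14n_ignore_whitespace(doc: str) -> str:
    # A looser, application-specific transformation: canonicalize
    # and also strip leading/trailing whitespace in text content.
    return ET.canonicalize(doc, strip_text=True)


def equivalent(a: str, b: str, transform: Callable[[str], str]) -> bool:
    # Two documents are equivalent under `transform` when their
    # transformed forms are identical.
    return transform(a) == transform(b)


# Same contract, different declaration, comment, attribute order,
# quoting, empty-element syntax, and indentation.
doc_a = ('<?xml version="1.0"?><!-- signed copy -->'
         "<contract id=\"7\" party='Acme'><clause/></contract>")
doc_b = ('<contract party="Acme" id="7">\n'
         '  <clause></clause>\n'
         '</contract>')

for transform in (identity, c14n, c14n_ignore_whitespace):
    print(transform.__name__, equivalent(doc_a, doc_b, transform))
```

In this sketch the two documents differ under `identity` and under
`c14n` (the indentation in `doc_b` survives canonicalization), and
match only under `c14n_ignore_whitespace`. The comment and XML
declaration in `doc_a` are absent from both canonical forms, which
shows the kind of information that a canonical-form comparison
cannot see.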