pxdom

pxdom 1.5
A Python DOM implementation

pxdom is a W3C DOM Level 3 implementation for XML 1.0/1.1 with/without namespaces, using Python and OMG-style (_get/_set) bindings. All features described in the Core and LS Recommendations are supported, with the following exceptions:

validation;
asynchronous LSParsers;
name character checking is only completely rigorous for XML 1.1.

pxdom runs on Python 1.5.2 or later, and has been tested up to 2.5.2. Certain features are dependent on Python version:

for Unicode, Python 1.6 or later is required;
using an LSSerializer to write to an HTTP URI requires Python 2.0 or later;
Unicode character normalisation options require Python 2.3 or later.

Installation

Copy pxdom.py into any folder in your Python path, for example /usr/lib/python/site-packages or C:\Python23\Lib\site-packages. Pre-compile the bytecode version with ‘import pxdom’ if necessary.

pxdom can also be included and imported as a submodule of another package. This is a good strategy if you wish to distribute a DOM-based application without having to worry about the versions of Python and/or PyXML installed on users’ machines; the only dependencies are the standard library string-handling and URL-related modules.

Usage

The pxdom module implements the DOMImplementationSource interface from DOM Level 3 Core. To parse a document from a file, for example:

    import pxdom
    dom= pxdom.getDOMImplementation('')
    parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
    document= parser.parseURI('file:///f|/data/doc.xml')

And to serialise and save a document to a file, try:

    serialiser= document.implementation.createLSSerializer()
    serialiser.writeToURI(document, 'file:///f|/data/doc.xml')

These interfaces take URIs; you can convert a local filepath to a URI using the standard library urllib module:

uri= 'file:'+urllib.pathname2url(path)
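The urllib module was reorganised in Python 3, which is later than any version pxdom is stated to have been tested on. As a minimal sketch for scripts that have to run on either line, the following tries both import locations. The helper name path_to_file_uri and the example path are hypothetical and not part of pxdom.

```python
try:
    from urllib.request import pathname2url  # Python 3 location
except ImportError:
    from urllib import pathname2url  # Python 2 location, as in the line above


def path_to_file_uri(path):
    """Return a file: URI for a local path (hypothetical helper, not part of pxdom)."""
    # pathname2url percent-encodes characters that are unsafe in a URI, such as spaces
    return 'file:' + pathname2url(path)


# Illustrative path only; any local path can be passed in
uri = path_to_file_uri('/home/data/my doc.xml')
```

The exact number of slashes after ‘file:’ varies between Python versions, so it is safer not to compare the result against a hard-coded string.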

Many features of parsing and serialisation can be set using the domConfig objects on LSParser and LSSerializer, and you can create LSInput and LSOutput objects for more control over the source and destination of these operations. For example, to serialise a document explicitly to the Latin-1 encoding:

    output= document.implementation.createLSOutput()
    output.systemId= 'file:///f|/data/doc.xml'
    output.encoding= 'iso-8859-1'
    serialiser= document.implementation.createLSSerializer()
    serialiser.write(document, output)

For full details on using these standard features, see the DOM Level 3 LS specification.

Shortcuts

As a slightly less verbose alternative to the W3C standard parser interface, the pxdom module offers the convenience functions parse and parseString, which work like the Python minidom module’s functions of the same names:

    doc= pxdom.parse(r'F:\data\doc.xml')
    doc= pxdom.parseString('<el attr="val">content</el>')

You can also get a quick character serialisation by accessing the pxdomContent property of any node.

DOMConfiguration parameters

The result of the parse operation depends on the parameters set on the LSParser.domConfig mapping. By default, in accordance with the DOM specification, all CDATA sections will be replaced with plain text nodes and all bound entity references will be replaced by the contents of the entity referred to. This includes external entity references and the external subset.

If you use the parse and parseString functions, pxdom will default the parameter ‘cdata-sections’ to True, allowing CDATA sections to stay in the document, and the parameter ‘pxdom-resolve-resources’ to False so external entities and the external subset are left alone. This is to emulate the behaviour of the Python standard library’s minidom module.
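To make the minidom behaviour being emulated concrete, here is a small sketch that uses only the standard library’s xml.dom.minidom as a stand-in. It does not import pxdom, so it shows what minidom does with a CDATA section rather than confirming what pxdom.parseString returns. The sample markup is made up for illustration.

```python
from xml.dom import minidom

# Standard-library minidom, used here as a stand-in: this does not import pxdom.
doc = minidom.parseString('<el attr="val"><![CDATA[a < b]]></el>')
root = doc.documentElement
child = root.firstChild

# minidom keeps the CDATA section as its own node type rather than a plain Text node
kept_as_cdata = (child.nodeType == child.CDATA_SECTION_NODE)
```

Going by the description above, the parse and parseString shortcuts should leave a CDATASection node in the same position, whereas an LSParser with default parameters would produce a plain Text node there.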

If you prefer also to receive EntityReference nodes in your document, set the ‘entities’ parameter to a true value. For example:

    parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
    parser.domConfig.setParameter('cdata-sections', 1)
    parser.domConfig.setParameter('entities', 1)
    doc= parser.parseURI('file:///home/data/doc.xml')

Or, using the parse/parseString shortcut functions, you can pass in an optional dictionary of extra DOMConfiguration parameters to set, like:

doc= pxdom.parse('file:///home/data/doc.xml', {'entities': 1})

(Of course, this usage would no longer be minidom-compatible.) See the DOM 3 Core and LS specifications for more DOMConfiguration parameters.

Extensions

pxdom supports some supplemental non-standard features. Their names are always prefixed with ‘pxdom’ to avoid confusion with the standard.

Extra DOMConfiguration parameters

Configuration parameters in DOM Level 3 may affect parsing, serialisation and normalisation operations. pxdom adds a few new parameters not defined in the specification.

If you want to set a pxdom extra parameter to a non-default value but still be compatible with any other DOM Level 3 implementation, you can use the DOMConfiguration.canSetParameter method to ensure that the parameter is supported first.
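A minimal sketch of that pattern follows. The helper set_if_supported and the FakeConfig class are hypothetical names introduced for illustration. FakeConfig only imitates the canSetParameter and setParameter methods of a DOMConfiguration so the example can run without any DOM implementation installed.

```python
def set_if_supported(config, name, value):
    """Set a DOMConfiguration parameter only if the implementation accepts it.

    Returns True if the parameter was set, False if it was skipped.
    (Hypothetical helper, not part of pxdom.)
    """
    if config.canSetParameter(name, value):
        config.setParameter(name, value)
        return True
    return False


class FakeConfig(object):
    """Stand-in for a DOMConfiguration that knows no pxdom extensions."""

    def __init__(self, supported):
        self._supported = set(supported)
        self.values = {}

    def canSetParameter(self, name, value):
        return name in self._supported

    def setParameter(self, name, value):
        if name not in self._supported:
            raise ValueError('unsupported parameter: %s' % name)
        self.values[name] = value


config = FakeConfig(['entities', 'cdata-sections'])
standard_set = set_if_supported(config, 'entities', True)
extension_set = set_if_supported(config, 'pxdom-normalize-text', False)
```

With a real parser, the same helper would be called on parser.domConfig, so that code relying on pxdom extras degrades quietly on another implementation instead of raising an exception.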

pxdom-resolve-resources

Applies to: parsing. Default: True (except with parse/parseString functions).

Dictates whether resources external to the document file will be resolved and used. This affects external entities and the DTD external subset.

pxdom uses only the SYSTEM identifier in fetching an external resource, so parsing an XHTML document, for example, would make many requests to the W3C server to grab the document type information. This is quite slow. Note also that at the time of writing the DTD referenced by XHTML 1.1 documents has acknowledged bugs in it, which pxdom is unable to parse. (This has been corrected for the forthcoming XHTML Modularization Second Edition specification.)

To do something with PUBLIC identifiers, such as supply local copies of DTDs, you would have to provide a standard DOM LSResourceResolver object to the configuration parameter ‘resource-resolver’. Resource resolvers will never be called if ‘pxdom-resolve-resources’ is set to false.
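As a rough sketch of such a resolver, the class below follows the resolveResource signature from the DOM Level 3 LS LSResourceResolver interface and returns an LSInput pointing at a local copy when it recognises the PUBLIC identifier. The class name, the catalogue dictionary and the local file path are all hypothetical, and the stub implementation exists only so the example runs without pxdom. Whether pxdom accepts a plain Python object like this as a resolver is an assumption that has not been checked here.

```python
class LocalCopyResolver(object):
    """Sketch of an LSResourceResolver mapping PUBLIC identifiers to local files."""

    def __init__(self, implementation, local_copies):
        # implementation: a DOM implementation offering createLSInput()
        # local_copies: dict of publicId -> local file: URI (illustrative values)
        self.implementation = implementation
        self.local_copies = local_copies

    def resolveResource(self, type, namespaceURI, publicId, systemId, baseURI):
        local_uri = self.local_copies.get(publicId)
        if local_uri is None:
            # No local copy known: returning None asks the parser to
            # resolve the SYSTEM identifier in the usual way
            return None
        source = self.implementation.createLSInput()
        source.publicId = publicId
        source.systemId = local_uri
        return source


class _StubInput(object):
    """Stand-in for an LSInput, for demonstration only."""
    publicId = None
    systemId = None


class _StubImplementation(object):
    """Stand-in for a DOM implementation, for demonstration only."""
    def createLSInput(self):
        return _StubInput()


XML_TYPE = 'http://www.w3.org/TR/REC-xml'
XHTML_STRICT = '-//W3C//DTD XHTML 1.0 Strict//EN'

resolver = LocalCopyResolver(
    _StubImplementation(),
    {XHTML_STRICT: 'file:///usr/share/dtd/xhtml1-strict.dtd'},  # hypothetical path
)
hit = resolver.resolveResource(
    XML_TYPE, None, XHTML_STRICT,
    'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd', None)
miss = resolver.resolveResource(
    XML_TYPE, None, '-//Example//Unknown//EN', 'http://example.org/x.dtd', None)
```

In real use the stub would be replaced by the implementation object obtained from pxdom.getDOMImplementation, and the resolver would be handed to the parser through the ‘resource-resolver’ parameter on its domConfig.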

When the convenience functions parse and parseString are called, ‘pxdom-resolve-resources’ will be false by default, instead of true, for minidom compatibility. This is also the safest option for parsing simple standalone XML.

pxdom-normalize-text

Applies to: normalisation. Default: True.

Dictates whether text node normalisation (as performed by the DOM Level 1 Core Node.normalize method) will take place when the DOM Level 3 Core Document.normalizeDocument method is called.

By default, matching the DOM specification, text node normalisation does occur, but pxdom allows this to be turned off if unwanted.

pxdom-update-entities

Applies to: normalisation. Default: True.

Dictates whether entity reference nodes have their content child nodes updated from the declaration stored in the doctype. This may result in descendants with different namespaces when the entity reference has been moved, if the entity contains prefixes whose namespaces are not declared in the entity.

By default, matching the DOM specification, entities are updated, but pxdom allows this to be turned off if unwanted.

pxdom-reset-identity

Applies to: normalisation. Default: True.

Dictates whether attributes should have their user-specified-IDness (as set by the setAttributeId etc. methods) reset to false during document normalisation.

By default, matching the DOM specification, this does occur, but pxdom allows this to be turned off if unwanted.

pxdom-preserve-base-uri

Applies to: parsing, normalisation, serialisation. Default: True.

When enabled, pxdom attempts to preserve the base URI context whenever a node that changes base URI is replaced by its contents. This can happen when an element with an xml:base attribute is SKIPped by a DOM 3 LS filter, or when an entity reference with a different base URI to its parent is flattened.

By default, matching the DOM specification, base URIs are preserved. However, the extra xml:base attributes added to child elements may be unwanted if you are working with entities (especially external entities) but do not wish to use XML Base, so pxdom allows it to be turned off. If you do so, the DOMError warning ‘pi-base-uri-lost’ will also not be generated.

pxdom-assume-element-content

Applies to: parsing, normalisation, serialisation, isElementContentWhitespace. Default: False.

In order to support the feature Text.isElementContentWhitespace, pxdom must know the content model of the particular element that contains the text node. Often this is only defined in the DTD external subset, which might have been omitted or not read.

Normally, following the XML Information Set specification, pxdom will guess that elements with unknown content models do not contain ‘element content’ — so Text.isElementContentWhitespace will always return False for elements not mentioned in the DOCTYPE internal subset.

However, if the DOMConfiguration parameter ‘pxdom-assume-element-content’ is True, it will guess that unknown elements do contain element content, and so whitespace nodes inside them will be ‘element content whitespace’ (often referred to as ‘ignorable whitespace’).

This parameter can be combined with the ‘element-content-whitespace’ parameter to parse an XML file and return a DOM tree containing no superfluous whitespace nodes whatsoever, which can make subsequent processing much simpler:

    parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
    parser.domConfig.setParameter('element-content-whitespace', 0)
    parser.domConfig.setParameter('pxdom-assume-element-content', 1)
    doc= parser.parseURI('file:///data/foo.xml')
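To illustrate what the superfluous whitespace nodes look like when they are left in, the sketch below parses an indented fragment with the standard library’s xml.dom.minidom. This is a stand-in only: it does not use pxdom or either of the parameters discussed here, and the sample markup is invented.

```python
from xml.dom import minidom

# Standard-library minidom as a stand-in; no DTD, so content models are unknown
doc = minidom.parseString(
    '<list>\n  <item>one</item>\n  <item>two</item>\n</list>')
children = doc.documentElement.childNodes

# Whitespace-only Text nodes sit before, between and after the <item> elements
whitespace_nodes = [node for node in children
                    if node.nodeType == node.TEXT_NODE and node.data.strip() == '']
elements_only = [node for node in children
                 if node.nodeType == node.ELEMENT_NODE]
```

Code that walks such a tree has to filter by node type at every level, which is the extra work the two parameters are meant to avoid.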

pxdom-html-compatible

Applies to: serialisation. Default: False.

Optionally ensures serialisation operations return markup that is as far as possible compatible with legacy HTML parsers. In particular, satisfies XHTML 1.0’s HTML compatibility guidelines C.2, C.3 and C.10.

Extra object properties

Node.pxdomLocation

Read-only property giving a DOM Level 3 DOMLocator object for any Node. If the Node was created by a parsing operation this will reveal the file and row/column number in which the node was found: particularly useful for error-reporting purposes.

Node.pxdomContent

A convenience property to get the markup for a node, or replace the node with alternative parsed markup, without having to create a separate LSSerializer or LSParser.

All nodes have a readable pxdomContent, but only those at content level are writable (attribute nodes, for instance, are not). The document’s domConfig is used to give parameters for parse and serialise operations invoked by pxdomContent.

The value read from pxdomContent is a character string, not a byte string, so it is not suitable for writing directly to a file. Use an LSSerializer to serialise a document to a byte stream.

pxdomContent is an extended replacement for the ElementLS.markupContent property that was in earlier Working Drafts of the DOM 3 LS spec.

Entity.pxdomAvailable

A flag indicating whether the entity’s replacement content is available in the childNodes property. Internal entities are always available; unparsed external entities never are; for parsed external entities it depends on whether external resources were resolved at parse-time.

Entity.pxdomDocumentURI

On external entities, gives the actual URI the entity was read from, after applying the systemId to the baseURI and going through any LSResourceResolver redirection. For internal and unavailable entities this property is null.

DocumentType.pxdomElements/Attlists

In addition to entities and notations, pxdom includes NamedNodeMaps in the DocumentType for the other two types of declaration that might occur in the DTD. They can be read to get more information on content models than the DOM Level 3 TypeInfo interface makes available.

Extra pxdom node types

ElementDeclaration

ElementDeclaration nodes can be obtained from the DocumentType.pxdomElements map. The nodeName of each is the element name given in the corresponding DTD <!ELEMENT> declaration.

ElementDeclaration nodes have an integer contentType property with enum keys EMPTY_CONTENT, ANY_CONTENT, MIXED_CONTENT and ELEMENT_CONTENT. In the case of mixed and element content, the elements property gives more information on the child elements allowed.

AttributeDeclarationList

AttributeDeclarationList nodes can be obtained from the DocumentType.pxdomAttlists map. The nodeName of each is the name of the element whose attributes it is defining, as given in the <!ATTLIST> declaration.

AttributeListDeclarations hold a NamedNodeMap in their declarations property, mapping attribute names from the declaration to corresponding AttributeDeclaration nodes.

AttributeDeclaration

AttributeDeclaration nodes have an integer attributeType property with enum keys ID_ATTR, IDREF_ATTR, IDREFS_ATTR, ENTITY_ATTR, ENTITIES_ATTR, NMTOKEN_ATTR, NMTOKENS_ATTR, NOTATION_ATTR, CDATA_ATTR and ENUMERATION_ATTR.

In the case of enumeration and notation attribute types, the typeValues property holds a list of possible string values. There is also an integer defaultType property with enum keys REQUIRED_VALUE, IMPLIED_VALUE, DEFAULT_VALUE and FIXED_VALUE. In the case of fixed and defaulting attributes, the childNodes property holds any text and/or entity reference nodes that make up the default value.

Changelog

Updates from 1.4 to 1.5

Allow a DOCTYPE declaration to be parsed for a non-namespace-well-formed root element name, when the 'namespaces' DOMConfiguration parameter is turned off in the parser.
Defer DOM Level 3 UserDataHandler callbacks until the end of a deep clone/import/adopt operation, to ensure the related nodes are in the expected final state. (Hat tip: Anjan Samanta.)
Element content parser made non-recursive. In theory this allows a document with elements nested a thousand levels deep to be parsed without exceeding Python’s recursion limit (reported as a RuntimeError on the Python versions pxdom supports). However, trees of this depth may still hit the recursion limit when dealt with using other recursive algorithms (such as normalisation and serialisation). The main rationale behind making it work for parsing is to ensure a more useful error is generated when trying to parse a long, non-well-formed document that habitually leaves its elements open.

Updates from 1.3 to 1.4

Restored Python 1.5 compatibility by removing string method usage.

Updates from 1.2 to 1.3

Added DOMConfiguration parameter pxdom-html-compatible.
Made the (implementation-defined in spec) Document.cloneNode() do the most likely-useful action, namely creating a new Document (and copying child content with new ownerDocument if it is a deep clone).
Fixed bug that disallowed resetting of NamedNodeNS prefixes to None.
Added specific checks in child-altering methods (appendChild et al) so that they raise an error when a disallowed null is passed instead of letting the operation silently do nothing. Changed order of checks in creating NS-aware nodes so that a more helpful error results from illegal characters.
Various alterations to parser and serialiser handling of narrow strings. Try where possible to coerce string to unicode, resulting in more consistent results with unusual character encodings.
Parser: fixed bug disallowing DOCTYPE declaration with no publicId, systemId, internal subset or whitespace. Fixed possible denormalised text parsing in entities. Fixed possible parameter entity edge cases (spec is woolly here). Use sets where available (Python 2.3+) for marginal performance improvement.
Serialiser: ensured newLine property was consistently used throughout the document. Encoded NEL and Unicode Line Separator characters as character references.

Updates from 1.1 to 1.2

Redid entity reference parse/normalise/serialise operations, hopefully resulting in more consistent results in the face of combinations of text node normalisation, NodeFilter SKIPping and baseURI loss.
Fixed typo in OutputBuffer causing NameError to be raised instead of UnsupportedEncodingErr with DOMError handling, if an unknown encoding is used (thanks to: Andrew Johnson)
Added pxdom-preserve-base-uri parameter to control the baseURI preservation that now also works with element skipping and normalisation. xml:base attributes are now added as non-specified Attrs, similar to default attributes.
Redid node name character handling to cope more gracefully with narrow strings and apply the XML 1.1 restrictions on what Unicode characters can be in a node name. (The additional restrictions of XML 1.0 are not enforced, largely because its character model is an insanity). Updated parsing and serialisation of character references to cope with characters outside the Basic Multilingual Plane in ‘narrow’ (UTF-16) Python builds.
Optimised away an order-N-squared method in LSParser, resulting in faster (though still slow, obv.) parse times for long documents (eg. 10x speedup for the commonly-used test file ot.xml). (Hat tip: Fredrik Lundh.)
Changed isID property to conform to new interpretation of spec: reading Attr.isID returns True if the attribute has schema-determined-IDness or user-determined-IDness, but setting it (through the setAttributeId etc. methods) only affects the user-determined-IDness; xml:id is interpreted as part of schema-determined-IDness. Added new configuration parameter pxdom-reset-identity to allow the removal of user-determined IDness on normalisation to be disabled.
Fixed namespace undeclarations (xmlns:something="") causing namespaceURIs to become empty strings instead of unbound/null at parse-time. Fixed namespace fixup to stop extra redundant declarations being added.
Fixed stupid DTD-parsing bugs that crept into 1.1 before release without tripping the Test Suite. When pxdom-resolve-resources was False, external general entities could cause errors, and INCLUDE sections were broken.
Added trivial repr functions so Nodes are easier to read.

Updates from 1.0 to 1.1

Entity parsing rewritten to include external entity/subset support, full checking for parameter entities in the external subset, catching of circular references and fixing treatment of character references in replacement text. Also ensured serialisation and normalisation with ‘entities’ only use replacement content for entities that are bound and available. Added ‘pxdom-update-entities’ parameter to disable normalizeDocument entity behaviour.
Namespace lookup and handling at parse, serialise and normalisation phases redone to ensure namespaces are correct even when content is filtered out (and also improve parse speed marginally). Following Recommendation, the public lookup methods made ignorant about built-in namespaces.
Made parser use the ‘namespaces’ parameter to decide whether to create Level 1 or Level 2 nodes. Added exceptions and DOMErrors for parsing namespace-ill-formed names or normalising/serialising Level 1 nodes at Level 2. (The previous parsing fall-back-to-Level-1 behaviour is now available by using a DOMErrorHandler that asks to continue parsing.)
Integrated DOMException, LSException and DOMError into the same class to avoid having to use an extra level of exception wrapping. Made normalizeDocument not throw exceptions connected to fatal DOMErrors (unlike LSParser and LSSerializer). Disallowed XML 1.1 output without prolog.
Following change in final Recommendation, made setting ‘entities’ to false no longer remove the entities map in the doctype, and renamed DOMError ‘cdata-section-splitted’ to ‘cdata-sections-splitted’. Made CDATA splitting use Tim Bray’s suggested method from the Annotated XML specification, instead of creating an unnecessary Text node.
Fixed possible spurious exceptions in Node.getFeature and normalizeDocument with check-character-normalization failure, namespace declaration removal and entity removal. Made Node.normalize respect ‘normalize-characters’, but not ‘check-character-normalization’, following Recommendation. Add ‘pxdom-normalize-text’ parameter to disable Level 1-style text node normalisation in normalizeDocument.
Added typeInfo.isDerivedFrom method and fixed DERIVED constants following Recommendation (though this is not relevant to a non-schema-validating implementation like pxdom).
Serialisation now doesn’t escape tabs in non-Attr text content, and doesn’t allow Attr children to be filtered.
Catch case of trying to renameNode to empty string, and trying to use renameNode to make a Level 1 node with non-null namespaceURI. Return the correct type of error on attempt to rename a document.
Fixed insertBefore/replaceChild used with the same node for both parameters following discussion on www-dom list, to do the sensible-but-not-required thing instead of throwing an embarrassingly inaccurate exception.
Made reading a file with invalid byte sequences fall back to replacement instead of raising UnicodeDecodeError. This should make reading non-Unicode files easier.
Redid baseURI to return null for the things the infoset mapping says should be null, and return the right answer for the new external entities. Made parsing without entities add xml:base attributes to elements whose Entity baseURIs are different from the parent, supporting pi-base-not-preserved DOMError too.
Following resolution on www-dom list, no longer use Text nodes to represent white space at the Document level. Instead act as if there is always one newline character between each Document child node.
Made writing to pxdomContent replace the node, rather than its children, as it should have been all along.
Fixed bug where comments are parsed as text when disabled from the configuration. (Yikes! How did the Test Suite miss that one?)
Added support for xml:id attributes. XML ID is currently only a Working Draft but it seems so obviously useful it should make it through standardisation.

Updates from 0.9 to 1.0

Tracking changes in the new DOM 3 Proposed Recommendations, renamed LS config properties, added LSException, changed default newLine behaviour, removed pxdom prefix from previously-non-standard pxdom-no-input-specified error, allow LS namespace parameters to be set False, changed output filter call order
Added support for DOMConfiguration parameters ‘format-pretty-print’ and ‘supported-media-types-only’
Following discussion on www-dom list, changed encoding-to-string to use the string’s native encoding, unless overridden by output.encoding
Added extra error checks for cases in the L3 DOM Test Suite
Fixed recursive readonlyness of entities, notations, entity references
Fixed setting textContent on non-Text-containing nodes
Fixed very silly canSetParameter bug causing occasional erroneous return-false
Added compareDocumentPosition to public interface, and fixed fault in comparison of non-child nodes
Renamed parameterNameList parameterNames and made it return a proper DOM-style List object instead of a Python one
Made namespace/prefix lookup results match the reference algorithm more closely
Reorganised parse/serialisation, allowing application-side LSInput and LSOutput objects to be used
Made isElementContentWhitespace cope with nodes inside entity references
Fixed possibly-incorrect namespaceURI of unprefixed default attributes
Fixed baseURI for entity references and doctype

Updates from 0.8 to 0.9

Lots of interface alterations and renamings to track changes in the new DOM 3 Candidate Recommendations.
Node.pxdomContent replaces ElementLS.markupContent (removed from CR). Other old DocumentLS, ElementLS interfaces removed.
Module code rearranged into separate aspects to cut down on some of the ‘monster-class’ readability problems.
Serialisation mostly rewritten to conform better to specification, particularly the escaping of characters that can't be reproduced in the current encoding.
Normalisation partially rewritten, support for Unicode character normalisation added.
Support for DOMConfiguration parameter ‘canonical-form’.
Parameter pxdom-resolve-resources added as placeholder for future external entity support.
Made PIs with no data part parse and serialise correctly.
Many changes to LSFilters, which were a bit broken.
Allow multiple attributes with the same namespaceURI and localName (but different prefix) to be parsed. (For support of non-namespace-well-formed docs that use attribute names with colons, and unbound namespaces in entities.)
Renamed DocumentType.elements and .attlists to pxdom-prefixed versions, as they are non-standard extensions.
Fixed parsing of <!ATTLIST>s with NMTOKENS, IDREF, IDREFS (whoops!).
Made attribute value normalization happen in more places it should and fixed entref/charref whitespace-char replacement issues.
Fixed normalizeDocument namespace-declarations=false option.
Support for ‘well-formed’ parameter, tightened up invalid character checks at DOM level too.
Made splitTexting a CDATASection correctly create a new CDATASection node, not text.

Updates from 0.7 to 0.8

Tracking forthcoming changes to spec, getDOMImplementations renamed getDOMImplementationList, isWhitespaceInElementContent method becomes isElementContentWhitespace property, isId method becomes property, DOMLocator.offset becomes byteOffset/utf16Offset (non-functional).
Don’t claim to support DOM Core 1.0 — following discussion on www-dom-ts, there is no such feature.
Allow getDOMImplementation[List] to be called with no argument, as a shortcut.
Allow empty string to be passed in to namespaceURI arguments, meaning same as None.
Added NODE_ADOPTED UserDataHandler event, compliance fixes to adoptNode (ents, default attrs).
Added DOMConfig.parameterNameList.
Added minidom-style NamedNodeMap dictionary accessors for compatibility (thanks to: Paul Boddie).
Implemented element-content-whitespace option, added pxdom-assume-element-content to make it more useful.
Refuse to parse invalid < in attribute values (makes finding well-formedness errors easier).

Updates from 0.6 to 0.7

Tracking forthcoming changes to spec, DOMSerialiser.writeURI renamed to writeToURI.
Fix typos in Document.isDefaultNamespace and Text.replaceWholeText raising exceptions (oops).
Made renameNode and writes to Node.prefix update NodeListByTagName objects correctly.
Made ParseError return non-Unicode string for easier debugging.

Future work

Consider supporting DOM Level 3 Events and/or Level 2 Traversal/Range (any interest for these?);
Consider non-monolithic package distribution option: pxdom being a single module is convenient for distribution but it is getting big.

Additional thanks to all responsible for the DOM Test Suite (which has caught many gotchas in previous pxdom versions, regardless of the bugs I keep filing against it), particularly Curt Arnold (for fixing many of them).

Licence (new-BSD-style)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
The name of the copyright holder may not be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holder and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.