pxdom

pxdom 0.9
A Python DOM implementation

pxdom is a W3C DOM Level 3 Core/XML/Load/Save implementation with Python and OMG (_get/_set) bindings. All features in the November 2003 Candidate Recommendations are supported, with the following exceptions:

validation and inclusion of external entities;
the LSSerializer ‘format-pretty-print’ feature;
asynchronous LSParsers.

Additionally, Unicode encodings are only supported on Python 1.6 and later, and Unicode character normalisation features are only available on Python 2.3 and later.

Installation

Copy pxdom.py into any folder in your Python path, for example /usr/lib/python/site-packages or C:\Python23\Lib\site-packages.

pxdom can also be included and imported as a submodule of another package. This is a good strategy if you wish to distribute a DOM-based application without having to worry about the version of Python or other XML tools installed.

The only dependencies are the standard library string, StringIO, urllib and urlparse modules.

Usage

The pxdom module implements the DOMImplementationSource interface from DOM Level 3 Core. To parse a document from a file, use for example:

import pxdom
dom= pxdom.getDOMImplementation('')
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
doc= parser.parseURI('file:///f|/data/doc.xml')

For more on using DOM Level 3 Load to create documents from various sources, see the DOM Level 3 Load/Save specification.

Alternatively, the pxdom module offers the convenience functions parse and parseString, which work like the Python minidom module’s functions of the same names:

doc= pxdom.parse('F:\\data\\doc.xml')
doc= pxdom.parseString('<el attr="val">content</el>')
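Since these two functions are described as working like minidom's, code written against the standard library module should in principle carry over. The sketch below uses xml.dom.minidom as a stand-in so that it runs without pxdom installed. Swapping the import for pxdom is an assumption being illustrated, not something verified here, and the variable names are only for illustration.

```python
# Stand-in: the standard library's minidom, whose parseString has the same
# name as pxdom's. To try pxdom instead, replace this import with
# "import pxdom as dom_module" (assumed, not verified, to be a drop-in).
from xml.dom import minidom as dom_module

doc = dom_module.parseString('<el attr="val">content</el>')
root = doc.documentElement

tag_name = root.tagName                  # name of the root element
attr_value = root.getAttribute('attr')   # value of its only attribute
text = root.firstChild.data              # character data of the text child
```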

DOMConfiguration parameters

The result of the parse operation depends on the parameters set on the LSParser.config mapping. By default, according to the DOM 3 spec, all bound entity references will be replaced by the contents of the entity referred to, and all CDATA sections will be replaced with plain text nodes.

If you use the parse/parseString functions, pxdom will set the parameter ‘cdata-sections’ to True, allowing CDATA sections to stay in the document. This is to emulate the behaviour of minidom.
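To see the minidom behaviour being emulated, the following sketch parses a small document with the standard library's minidom and checks that the CDATA section survives as its own node type. It demonstrates minidom only. That pxdom.parseString gives the same tree is what the paragraph above states, not something this code checks.

```python
from xml.dom import minidom, Node

# minidom keeps the CDATA section as a CDATASection node rather than
# converting it to a plain Text node.
doc = minidom.parseString('<el><![CDATA[a < b]]></el>')
child = doc.documentElement.firstChild

is_cdata = child.nodeType == Node.CDATA_SECTION_NODE
cdata_text = child.data

# Serialising the element writes the section back out as CDATA.
serialised = doc.documentElement.toxml()
```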

If you prefer to receive entity reference nodes too, set the ‘entities’ parameter to a true value. For example:

parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.config.setParameter('entities', 1)
doc= parser.parseURI('file:///home/data/doc.xml')

Or, using the parse/parseString shortcut functions, you can pass in an optional dictionary of extra DOMConfiguration parameters to set, like:

doc= pxdom.parse('file:///home/data/doc.xml', {'entities': 1})
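When working with an explicit LSParser, the same dictionary style can be reproduced by hand. The helper below is a hypothetical sketch, not part of pxdom: it assumes only that the parser exposes config.setParameter(name, value) as in the earlier example, and it is not a description of how the shortcut functions are implemented internally.

```python
def apply_parameters(parser, parameters):
    """Set every entry of `parameters` on parser.config.

    Hypothetical helper (not part of pxdom). It relies only on
    parser.config.setParameter(name, value) being available.
    """
    for name, value in parameters.items():
        parser.config.setParameter(name, value)
    return parser

# Intended use with pxdom (left as comments so the sketch runs without it):
#   parser = dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
#   apply_parameters(parser, {'entities': 1})
```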

(Of course, this usage would no longer be minidom-compatible.)

Extensions

pxdom supports a few features which aren’t available in the DOM standard. Their names are always prefixed with ‘pxdom’.

Node.pxdomContent

A convenience property to get the markup for a node, or replace the node with alternative parsed markup, without having to create a separate LSSerializer or LSParser.

All nodes have a readable pxdomContent, but only those at content level are writable (i.e. attribute nodes are not). The document’s domConfig is used to set parameters for parse and serialise operations invoked by pxdomContent.

pxdomContent is a replacement for the ElementLS.markupContent property that was in earlier Working Drafts of the DOM 3 LS spec.

pxdom-resolve-resources

pxdom is a non-validating, non-external-entity-including DOM implementation. However, it is possible that future versions may support external entities. If this is implemented, it will be turned on by default in new LSParser objects.

If you wish to be sure external entities will never be used in future versions of pxdom, set the LSParser.config parameter ‘pxdom-resolve-resources’ to a false value. Alternatively, use the parse/parseString functions, which will never resolve external entities (as minidom does not).

pxdom-assume-element-content

In order to support the feature Text.isElementContentWhitespace, pxdom must know the content model of the particular element that contains the text node. Often this is only defined in the DTD external subset, which pxdom doesn’t read.

Normally pxdom will (as per spec) guess that elements with unknown content models do not contain ‘element content’ — so Text.isElementContentWhitespace will always return False for elements not defined in the internal subset. However, if the DOMConfiguration parameter ‘pxdom-assume-element-content’ is set to a true value, it will guess that unknown elements do contain element content, and so whitespace nodes inside them will be ‘element content whitespace’ (aka ‘ignorable whitespace’).

This parameter can be combined with the ‘element-content-whitespace’ parameter to parse an XML file and return a DOM tree containing no superfluous whitespace nodes whatsoever, which can make subsequent processing much simpler:

parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.config.setParameter('element-content-whitespace', 0)
parser.config.setParameter('pxdom-assume-element-content', 1)
doc= parser.parseURI('file:///data/foo.xml')
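For a sense of what such a whitespace-free tree looks like, here is a rough standard-library sketch that removes whitespace-only text nodes from a minidom tree after parsing. It is not how pxdom does it, and strip_whitespace_nodes is a hypothetical helper. Unlike the parameters above, it ignores any content model information and strips whitespace-only nodes everywhere, including inside mixed content.

```python
from xml.dom import minidom, Node

def strip_whitespace_nodes(node):
    """Recursively remove whitespace-only Text children (hypothetical helper)."""
    # Iterate over a copy, since children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_nodes(child)

doc = minidom.parseString(
    '<list>\n  <item>one</item>\n  <item>two</item>\n</list>'
)
root = doc.documentElement

before = len(root.childNodes)   # counts the indentation text nodes too
strip_whitespace_nodes(root)
after = len(root.childNodes)    # only the two <item> elements remain
```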

DocumentType.pxdomElements, pxdomAttlists

In addition to the DocumentType NamedNodeMaps ‘entities’ and ‘notations’, pxdom includes maps for the other two types of declaration that might occur in the DTD internal subset. They can be read to get more information on the content models than the schemaTypeInfo interface makes available.

pxdomElements is a NamedNodeMap of element content declaration nodes (as created by the <!ELEMENT> declaration). ElementDeclaration nodes have an integer contentType property with enum keys EMPTY_CONTENT, ANY_CONTENT, MIXED_CONTENT and ELEMENT_CONTENT. In the case of mixed and element content, the elements property gives more information on the child elements allowed.

pxdomAttlists is a NamedNodeMap of elements’ declared attribute lists (as created by the <!ATTLIST> declaration). AttributeListDeclarations hold a NamedNodeMap in their declarations property of attribute names to AttributeDeclaration nodes.

AttributeDeclaration nodes have an integer attributeType property with enum keys ID_ATTR, IDREF_ATTR, IDREFS_ATTR, ENTITY_ATTR, ENTITIES_ATTR, NMTOKEN_ATTR, NMTOKENS_ATTR, NOTATION_ATTR, CDATA_ATTR and ENUMERATION_ATTR. In the case of ENUMERATIONs and NOTATIONs, the typeValues property holds a list of possible string values. There is also an integer defaultType property with enum keys REQUIRED_VALUE, IMPLIED_VALUE, DEFAULT_VALUE and FIXED_VALUE. In the case of FIXED and DEFAULT, the childNodes property holds any Text and/or EntityReference nodes that make up the default value.

Changelog

Updates from 0.8 to 0.9

Lots of interface alterations and renamings to track changes in the new DOM 3 Candidate Recommendations.
Node.pxdomContent replaces ElementLS.markupContent (removed from CR). Other old DocumentLS, ElementLS interfaces removed.
Module code rearranged into separate aspects to cut down on some of the ‘monster-class’ readability problems.
Serialisation mostly rewritten to conform better to specification, particularly the escaping of characters that can't be reproduced in the current encoding.
Normalisation partially rewritten, support for Unicode character normalisation added.
Supports DOMConfiguration parameter ‘canonical-form’.
Parameter pxdom-resolve-resources added as placeholder for future external entity support.
Made PIs with no data part parse and serialise correctly.
Many changes to LSFilters, which were a bit broken.
Allow multiple attributes with the same namespaceURI and localName (but different prefix) to be parsed. (For support of non-namespace-well-formed docs that use attribute names with colons, and unbound namespaces in entities.)
Renamed DocumentType.elements and .attlists to pxdom-prefixed versions, as they are non-standard extensions.
Fixed parsing of <!ATTLIST>s with NMTOKENS, IDREF, IDREFS (whoops!).
Made attribute value normalization happen in more places it should and fixed entref/charref whitespace-char replacement issues.
Fixed normalizeDocument namespace-declarations=false option.
Supports ‘well-formed’ parameter, tightened up invalid character checks at DOM level too.
Made splitTexting a CDATASection correctly create a new CDATASection node, not text.

Updates from 0.7 to 0.8

Tracking forthcoming changes to spec: getDOMImplementations renamed getDOMImplementationList; isWhitespaceInElementContext method becomes isElementContentWhitespace property; isId method becomes property; DOMLocator.offset becomes byteOffset/utf16Offset (non-functional).
Don’t claim to support DOM Core 1.0 — following discussion on www-dom-ts, there is no such feature.
Allow getDOMImplementation[List] to be called with no argument, as a shortcut.
Allow empty string to be passed in to namespaceURI arguments, meaning same as None.
Added NODE_ADOPTED UserDataHandler event, compliance fixes to AdoptNode (ents, default attrs).
Added DOMConfig.parameterNameList.
Added minidom-style NamedNodeMap dictionary accessors for compatibility (hat tip: Paul Boddie).
Implemented element-content-whitespace option, added pxdom-assume-element-content to make it more useful.
Refuse to parse invalid < in attribute values (makes finding well-formedness errors easier).

Updates from 0.6 to 0.7

Tracking forthcoming changes to spec: DOMSerialiser.writeURI renamed to writeToURI.
Fix typos in Document.isDefaultNamespace and Text.replaceWholeText raising exceptions (oops).
Made renameNode and writes to Node.prefix update NodeListByTagName objects correctly.
Made ParseError return non-Unicode string for easier debugging.

Licence (new-BSD)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

The name of Andrew Clover may not be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holder and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.