pxdom

pxdom 0.9
A Python DOM implementation

pxdom is a W3C DOM Level 3 Core/XML/Load/Save implementation with Python and OMG (_get/_set) bindings. All features in the November 2003 Candidate Recommendations are supported, with the following exceptions:

Additionally, Unicode encodings are only supported on Python 1.6 and later, and Unicode character normalisation features are only available on Python 2.3 and later.

Installation

Copy pxdom.py into any folder in your Python path, for example /usr/lib/python/site-packages or C:\Python23\Lib\site-packages.

pxdom can also be included and imported as a submodule of another package. This is a good strategy if you wish to distribute a DOM-based application without having to worry about the version of Python or other XML tools installed.

The only dependencies are the standard library string, StringIO, urllib and urlparse modules.

Usage

The pxdom module implements the DOMImplementationSource interface from DOM Level 3 Core. To parse a document from a file, use for example:

import pxdom
dom= pxdom.getDOMImplementation('')
parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
doc= parser.parseURI('file:///f|/data/doc.xml')

For more on using DOM Level 3 Load to create documents from various sources, see the DOM Level 3 Load/Save specification.

Alternatively, the pxdom module offers the convenience functions parse and parseString, which work like the Python minidom module’s functions of the same names:

doc= pxdom.parse('F:\\data\\doc.xml')
doc= pxdom.parseString('<el attr="val">content</el>')
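Because these functions mirror minidom's, code written against them can be tried with either module. The following is a minimal sketch, not part of pxdom itself: the fallback to xml.dom.minidom and the name dom_module are assumptions made here so the snippet runs even where pxdom is not installed.

```python
# Prefer pxdom when it is importable; otherwise fall back to the standard
# library's minidom, whose parseString has the same calling convention.
# (The fallback is an assumption for illustration, not a pxdom feature.)
try:
    import pxdom as dom_module
except (ImportError, SyntaxError):
    from xml.dom import minidom as dom_module

doc = dom_module.parseString('<el attr="val">content</el>')
root = doc.documentElement

# Standard DOM Level 1 accessors, available in both implementations.
tag_name = root.tagName
attr_value = root.getAttribute('attr')
text = root.firstChild.data
```

Only plain DOM properties (documentElement, tagName, getAttribute, firstChild, data) are used, so the same lines should behave identically whichever module was imported.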

DOMConfiguration parameters

The result of the parse operation depends on the parameters set on the LSParser.config mapping. By default, according to the DOM 3 spec, all bound entity references will be replaced by the contents of the entity referred to, and all CDATA sections will be replaced with plain text nodes.

If you use the parse/parseString functions, pxdom will set the parameter ‘cdata-sections’ to True, allowing CDATA sections to stay in the document. This is to emulate the behaviour of minidom.
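The minidom behaviour being emulated can be checked with the standard library alone. This minimal sketch uses xml.dom.minidom rather than pxdom, and the sample markup is made up for illustration; it shows a CDATA section surviving the parse as its own node type instead of being turned into a plain text node.

```python
from xml.dom import minidom, Node

# Sample markup invented for this illustration.
doc = minidom.parseString('<el><![CDATA[a < b]]></el>')
child = doc.documentElement.firstChild

# minidom keeps the CDATA section as a CDATASection node.
is_cdata = child.nodeType == Node.CDATA_SECTION_NODE
cdata_text = child.data
```

With an LSParser left at its default configuration, the text above says the same content would instead arrive as an ordinary text node.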

If you prefer to receive entity reference nodes too, set the ‘entities’ parameter to a true value. For example:

parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.config.setParameter('entities', 1)
doc= parser.parseURI('file:///home/data/doc.xml')

Or, using the parse/parseString shortcut functions, you can pass in an optional dictionary of extra DOMConfiguration parameters to set, like:

doc= pxdom.parse('file:///home/data/doc.xml', {'entities': 1})

(Of course, this usage would no longer be minidom-compatible.)

Extensions

pxdom supports a few features which aren’t available in the DOM standard. Their names are always prefixed with ‘pxdom’.

Node.pxdomContent

A convenience property to get the markup for a node, or replace the node with alternative parsed markup, without having to create a separate LSSerializer or LSParser.

All nodes have a readable pxdomContent, but only those at content level are writable (i.e. attribute nodes are not). The document’s domConfig is used to set parameters for parse and serialise operations invoked by pxdomContent.

pxdomContent is a replacement for the ElementLS.markupContent property that was in earlier Working Drafts of the DOM 3 LS spec.
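As a rough sketch of the read side only: the helper below (markup_of is a hypothetical name, not part of pxdom) returns a node's pxdomContent when the node has one, and otherwise falls back to minidom's toxml() so that the snippet also runs without pxdom. The exact markup pxdom produces may differ from minidom's output.

```python
from xml.dom import minidom

def markup_of(node):
    """Return serialised markup for a node.

    Uses the pxdomContent property when the node provides it (pxdom nodes),
    otherwise minidom's toxml(). Hypothetical helper for illustration.
    """
    content = getattr(node, 'pxdomContent', None)
    if content is not None:
        return content
    return node.toxml()

doc = minidom.parseString('<el attr="val">content</el>')
markup = markup_of(doc.documentElement)
```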

pxdom-resolve-resources

pxdom is a non-validating, non-external-entity-including DOM implementation. However, it is possible that future versions may support external entities. If this is implemented, it will be turned on by default in new LSParser objects.

If you wish to be sure external entities will never be used in future versions of pxdom, set the LSParser.config parameter ‘pxdom-resolve-resources’ to a false value. Alternatively, use the parse/parseString functions, which will never resolve external entities (matching minidom, which does not resolve them either).

pxdom-assume-element-content

In order to support the feature Text.isElementContentWhitespace, pxdom must know the content model of the particular element that contains the text node. Often this is only defined in the DTD external subset, which pxdom doesn’t read.

Normally pxdom will (as per spec) guess that elements with unknown content models do not contain ‘element content’, so Text.isElementContentWhitespace will always return False for text nodes inside elements not declared in the internal subset. However, if the DOMConfiguration parameter ‘pxdom-assume-element-content’ is set to a true value, pxdom will guess that unknown elements do contain element content, and so whitespace-only text nodes inside them will be ‘element content whitespace’ (aka ‘ignorable whitespace’).

This parameter can be combined with the ‘element-content-whitespace’ parameter to parse an XML file and return a DOM tree containing no superfluous whitespace nodes whatsoever, which can make subsequent processing much simpler:

parser= dom.createLSParser(dom.MODE_SYNCHRONOUS, None)
parser.config.setParameter('element-content-whitespace', 0)
parser.config.setParameter('pxdom-assume-element-content', 1)
doc= parser.parseURI('file:///data/foo.xml')
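To picture the resulting tree, here is a minimal sketch that approximates the same effect after the fact on any DOM tree. It is not how pxdom implements these parameters: remove_whitespace_nodes is a hypothetical helper, it runs on xml.dom.minidom, and it assumes a document with no DTD declarations, so every element counts as having an unknown content model.

```python
from xml.dom import minidom, Node

def remove_whitespace_nodes(node):
    """Recursively drop Text children that contain only whitespace."""
    # Iterate over a copy, because children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == '':
            node.removeChild(child)
        else:
            remove_whitespace_nodes(child)

# Sample document invented for this illustration.
doc = minidom.parseString('<a>\n  <b>text</b>\n  <c/>\n</a>')
remove_whitespace_nodes(doc.documentElement)
remaining = [child.nodeName for child in doc.documentElement.childNodes]
```

After the call only the b and c elements remain under the root, so code walking childNodes no longer has to skip indentation text.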

DocumentType.pxdomElements, pxdomAttlists

In addition to the DocumentType NamedNodeMaps ‘entities’ and ‘notations’, pxdom includes maps for the other two types of declaration that might occur in the DTD internal subset. They can be read to get more information on the content models than the schemaTypeInfo interface makes available.

pxdomElements is a NamedNodeMap of element content declaration nodes (as created by the <!ELEMENT> declaration). ElementDeclaration nodes have an integer contentType property with enum keys EMPTY_CONTENT, ANY_CONTENT, MIXED_CONTENT and ELEMENT_CONTENT. In the case of mixed and element content, the elements property gives more information on the child elements allowed.

pxdomAttlists is a NamedNodeMap of elements’ declared attribute lists (as created by the <!ATTLIST> declaration). AttributeListDeclarations hold a NamedNodeMap in their declarations property of attribute names to AttributeDeclaration nodes.

AttributeDeclaration nodes have an integer attributeType property with enum keys ID_ATTR, IDREF_ATTR, IDREFS_ATTR, ENTITY_ATTR, ENTITIES_ATTR, NMTOKEN_ATTR, NMTOKENS_ATTR, NOTATION_ATTR, CDATA_ATTR and ENUMERATION_ATTR. In the case of ENUMERATIONs and NOTATIONs, the typeValues property holds a list of possible string values. There is also an integer defaultType property with enum keys REQUIRED_VALUE, IMPLIED_VALUE, DEFAULT_VALUE and FIXED_VALUE. In the case of FIXED and DEFAULT, the childNodes property holds any Text and/or EntityReference nodes that make up the default value.
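The properties described above could be walked along the following lines. This is a sketch under stated assumptions, not tested against pxdom: summarise_attlists and its helpers are hypothetical names, the enum keys are assumed to be readable as attributes of each AttributeDeclaration node, and nodeName is assumed to hold the declared element or attribute name.

```python
# Enum key names taken from the description of attributeType above.
ATTRIBUTE_TYPE_NAMES = (
    'ID_ATTR', 'IDREF_ATTR', 'IDREFS_ATTR', 'ENTITY_ATTR', 'ENTITIES_ATTR',
    'NMTOKEN_ATTR', 'NMTOKENS_ATTR', 'NOTATION_ATTR', 'CDATA_ATTR',
    'ENUMERATION_ATTR',
)

def named_items(node_map):
    """List the nodes of a NamedNodeMap using the standard length/item API."""
    return [node_map.item(index) for index in range(node_map.length)]

def attribute_type_name(declaration):
    """Map a declaration's integer attributeType back to its enum key name.

    Assumes the enum keys are available as attributes of the node.
    """
    for name in ATTRIBUTE_TYPE_NAMES:
        if getattr(declaration, name, None) == declaration.attributeType:
            return name
    return None

def summarise_attlists(doctype):
    """Return {(element name, attribute name): attribute type name}."""
    summary = {}
    for attlist in named_items(doctype.pxdomAttlists):
        for declaration in named_items(attlist.declarations):
            key = (attlist.nodeName, declaration.nodeName)
            summary[key] = attribute_type_name(declaration)
    return summary
```

For a parsed document this would be called as summarise_attlists(doc.doctype), provided the document has a DTD internal subset.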

Changelog

Updates from 0.8 to 0.9

Updates from 0.7 to 0.8

Updates from 0.6 to 0.7

Licence (new-BSD)

Copyright © 2003, Andrew Clover. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

The name of Andrew Clover may not be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holder and contributors “as is” and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.