Handling White Space with XmlTextReader

10/17/2014

White space falls into two categories: significant and insignificant. Significant white space is white space that must be preserved from the original document to the output document. It is any white space inside a mixed content model defined by the DTD, or any white space inside the scope of the special attribute xml:space when that attribute is set to "preserve". Insignificant white space is white space that does not need to be preserved between reading the document and writing the output document. White space can be any of the following characters:

Space (ASCII space, 0x20)
Carriage return (CR, 0x0D)
Line feed (LF, 0x0A)
Horizontal tab (HT, 0x09)

The W3C standards dictate that white space be handled differently depending on where in the document it occurs and on the setting of the xml:space attribute. If the characters occur within mixed element content or inside the scope of xml:space="preserve", they must be preserved and passed without modification to the application. Any other white space does not need to be preserved.
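The scoping rule for xml:space can be observed through the XmlTextReader.XmlSpace property, which reports the xml:space value in effect at the current node. The following is a minimal sketch rather than a definitive implementation: the class name, the ScopeByElement helper, and the sample document are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlSpaceScopeDemo
{
    // Hypothetical sample: <a> opens a "preserve" scope, <c> resets it to "default".
    public const string SampleXml =
        "<root><a xml:space=\"preserve\"><b/><c xml:space=\"default\"/></a><d/></root>";

    // Records the xml:space scope in effect at each element start tag.
    public static Dictionary<string, XmlSpace> ScopeByElement(string xml)
    {
        var scopes = new Dictionary<string, XmlSpace>();
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // XmlSpace is None, Default, or Preserve.
                    scopes[reader.Name] = reader.XmlSpace;
                }
            }
        }
        return scopes;
    }

    public static void Main()
    {
        foreach (KeyValuePair<string, XmlSpace> entry in ScopeByElement(SampleXml))
        {
            Console.WriteLine("{0}: {1}", entry.Key, entry.Value);
        }
    }
}
```

In this sketch the b element inherits the "preserve" scope from its parent, while the explicit xml:space="default" on c ends that scope for c only.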

The XmlTextReader only preserves white space that occurs within an xml:space="preserve" context. Because the XmlTextReader does not use the content models declared in a DTD, it cannot tell that an element has mixed content, and so it does not treat the white space in that element as significant. If you need to preserve the white space in a mixed content element, use the XmlValidatingReader, which parses the DTD and recognizes mixed content elements.

To see what type of white space is in the current node, use the XmlReader.NodeType property. Significant white space is reported with the XmlNodeType value SignificantWhitespace, whereas insignificant white space is reported with the value Whitespace. For more information on the NodeType property, see XmlTextReader.NodeType Property and XmlNodeType Enumeration.
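As an illustration of checking NodeType, the sketch below reads a small document and collects the node type of each white space node. The class name, the Classify helper, and the sample document are assumptions for illustration, not part of the original article.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class WhitespaceNodeTypes
{
    // Hypothetical sample document without a DTD.
    public const string SampleXml =
        "<test>\n" +
        "  <item xml:space=\"preserve\">\n" +
        "    <item/>\n" +
        "  </item>\n" +
        "</test>";

    // Returns the node type of every white space node, in document order.
    public static List<XmlNodeType> Classify(string xml)
    {
        var kinds = new List<XmlNodeType>();
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            // All is the default setting; it is set here only to make that explicit.
            reader.WhitespaceHandling = WhitespaceHandling.All;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Whitespace ||
                    reader.NodeType == XmlNodeType.SignificantWhitespace)
                {
                    kinds.Add(reader.NodeType);
                }
            }
        }
        return kinds;
    }

    public static void Main()
    {
        foreach (XmlNodeType kind in Classify(SampleXml))
        {
            Console.WriteLine(kind);
        }
    }
}
```

White space nodes read inside the xml:space="preserve" element should come back as SignificantWhitespace, and those outside it as Whitespace.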

The WhitespaceHandling property uses an enumeration to determine how white space is returned by the reader. For more information on retrieving or setting the property, see XmlTextReader.WhitespaceHandling Property. For more information on the enumeration values, see WhitespaceHandling Enumeration. For more information on the W3C standards, see Section 2.10 of the Extensible Markup Language (XML) 1.0 recommendation at www.w3.org/XML/Group/2000/07/REC-xml-2e-review#sec-white-space.
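The effect of the WhitespaceHandling property can be sketched by counting the white space nodes the reader returns under each enumeration value (All, Significant, and None). The class name, the CountWhitespaceNodes helper, and the sample document are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class WhitespaceHandlingDemo
{
    // Hypothetical sample document without a DTD.
    public const string SampleXml =
        "<test>\n" +
        "  <item xml:space=\"preserve\">\n" +
        "    <item/>\n" +
        "  </item>\n" +
        "</test>";

    // Counts the white space nodes (of either kind) returned under a given setting.
    public static int CountWhitespaceNodes(string xml, WhitespaceHandling handling)
    {
        int count = 0;
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            reader.WhitespaceHandling = handling;
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Whitespace ||
                    reader.NodeType == XmlNodeType.SignificantWhitespace)
                {
                    count++;
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        var settings = new[]
        {
            WhitespaceHandling.All,         // return Whitespace and SignificantWhitespace nodes
            WhitespaceHandling.Significant, // return SignificantWhitespace nodes only
            WhitespaceHandling.None         // return no white space nodes
        };
        foreach (WhitespaceHandling handling in settings)
        {
            Console.WriteLine("{0}: {1} white space node(s)",
                handling, CountWhitespaceNodes(SampleXml, handling));
        }
    }
}
```

Choosing Significant or None is mainly useful when the application does not want to filter white space nodes itself in the read loop.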

Here is an example of XML that contains white space and has the xml:space attribute set to "preserve". The newline character is illustrated as a special white space character at the end of the lines in this example.

<!DOCTYPE test [
<!ELEMENT test (item | book)*> <!-- element content model -->
<!ELEMENT item (item*)> <!-- element content model -->
<!ATTLIST item xml:space (default | preserve) #IMPLIED>
<!ELEMENT book (#PCDATA | b | i)*> <!-- mixed content model -->
<!ELEMENT b (#PCDATA)> <!-- mixed content model -->
<!ELEMENT i (#PCDATA)> <!-- mixed content model -->
]>•
<test>•
••••<item>•
••••••••<item xml:space="preserve">º
ºººººººººººº<item/>º
ºººººººº</item>•
••••</item>•
••••<book>º
ººººººººThisº
ººººººººisº
ººººººººa testº
ºººº</book>•
</test>•

The white space shown as (•) is insignificant white space. The white space shown as (º) is significant white space.

Note: The scope of the xml:space attribute turns white space that would otherwise be considered insignificant into significant white space.

The book element is declared with a mixed content model in the DTD, which means that it can contain character data interleaved with b or i elements. In a mixed content model, the white space within the book element is considered significant white space. The XmlTextReader does not recognize the mixed content model because it does not use the content model information provided in the DTD. Use the XmlValidatingReader to get significant white space nodes in mixed content models.
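The difference between the two readers can be sketched by reading the same mixed content document with each and comparing the node type reported for a white-space-only run between two child elements. This is a minimal sketch under stated assumptions: the class and helper names and the sample document are hypothetical, the DtdProcessing property assumes .NET Framework 4.0 or later, and XmlValidatingReader is marked obsolete in later versions of the framework.

```csharp
using System;
using System.IO;
using System.Xml;

#pragma warning disable 618 // XmlValidatingReader is obsolete in later .NET versions

public static class MixedContentDemo
{
    // Hypothetical sample: a single space separates <b> and <i> inside mixed content.
    public const string SampleXml =
        "<!DOCTYPE book [\n" +
        "<!ELEMENT book (#PCDATA | b | i)*>\n" +
        "<!ELEMENT b (#PCDATA)>\n" +
        "<!ELEMENT i (#PCDATA)>\n" +
        "]>\n" +
        "<book><b>bold</b> <i>italic</i></book>";

    // Returns the node type of the first white space node inside the root element,
    // or XmlNodeType.None if the reader returns no such node.
    private static XmlNodeType FirstInnerWhitespace(XmlReader reader)
    {
        while (reader.Read())
        {
            bool isWhitespace = reader.NodeType == XmlNodeType.Whitespace ||
                                reader.NodeType == XmlNodeType.SignificantWhitespace;
            if (isWhitespace && reader.Depth > 0)
            {
                return reader.NodeType;
            }
        }
        return XmlNodeType.None;
    }

    public static XmlNodeType WithTextReader(string xml)
    {
        using (var reader = new XmlTextReader(new StringReader(xml)))
        {
            reader.DtdProcessing = DtdProcessing.Parse; // allow the DOCTYPE declaration
            return FirstInnerWhitespace(reader);
        }
    }

    public static XmlNodeType WithValidatingReader(string xml)
    {
        var textReader = new XmlTextReader(new StringReader(xml));
        textReader.DtdProcessing = DtdProcessing.Parse;
        using (var reader = new XmlValidatingReader(textReader))
        {
            reader.ValidationType = ValidationType.DTD; // must be set before the first Read
            return FirstInnerWhitespace(reader);
        }
    }

    public static void Main()
    {
        Console.WriteLine("XmlTextReader:       {0}", WithTextReader(SampleXml));
        Console.WriteLine("XmlValidatingReader: {0}", WithValidatingReader(SampleXml));
    }
}
```

The sample places white space between two child elements because white space that is adjacent to other character data, as in the book element of the example above, is returned as part of a Text node rather than as a separate white space node.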
