Module xml

Experimental functionality

This module is not stable and its API and behaviors are subject to change. Do not use this API if you are targeting the ReaPack installation, as future versions are likely to introduce backward incompatible changes. If you bundle your own copy of rtk with your script then this risk is removed.

Feedback is appreciated during this API preview phase.

This module implements a fairly naive XML parser, supporting a limited but useful subset of XML. Despite its limitations (see below for more details), in practice it copes with many XML documents in the wild.

It tries to be robust in what it does implement, tolerating common minor malformations (such as attributes missing quotes). For this reason alone it is not a conforming XML processor, as the XML spec requires any document that is not well-formed to be rejected. rtk's implementation instead favors pragmatism where reasonable.

Simple Example

In its simplest invocation, rtk.xmlparse() reads a string containing an XML document and returns a hierarchy of nested Lua tables representing the parsed document:

local document = [[
    <?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
        <book category="children">
            <title lang="en">Harry Potter</title>
            <author>J K. Rowling</author>
            <year>2005</year>
            <price>29.99</price>
        </book>
        <book category="web">
            <title lang="en">Learning XML</title>
            <author>Erik T. Ray</author>
            <year>2003</year>
            <price>39.95</price>
        </book>
    </bookstore>
]]
local root = rtk.xmlparse(document)
rtk.log.info('root element is: %s', root.tag)
rtk.log.info('first element is: %s, category=%s', root[1].tag, root[1].attrs.category.value)

Each element is expressed as a Lua table. See rtk.xmlparse() for details about how element tables are structured.

This example crawls the parsed element hierarchy, printing a nested tree of elements showing their tag names and content (if any):

local function showelem(elem, depth)
    -- Indent the output according to the element's nested depth
    rtk.log.info('%s- %s (%s)', string.rep(' ', depth * 2), elem.tag, elem.content)
    for _, child in ipairs(elem) do
        showelem(child, depth + 1)
    end
end
showelem(root, 0)
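
Because children are stored as positional fields, the same recursive walk can collect elements instead of printing them. The sketch below is illustrative only: findall is a hypothetical helper (not part of rtk), and the sample table is hand-built to mimic the documented element table layout so that the snippet runs without rtk loaded.

```lua
-- Hypothetical helper (not part of rtk): collects every element with the
-- given tag name, checking the element itself and all of its descendants.
local function findall(elem, tag, results)
    results = results or {}
    if elem.tag == tag then
        results[#results + 1] = elem
    end
    -- Children live in the positional fields of the element table.
    for _, child in ipairs(elem) do
        findall(child, tag, results)
    end
    return results
end

-- Hand-built stand-in for the kind of table rtk.xmlparse() is documented
-- to return (assumption: only the tag, content and positional fields matter here).
local sample = {
    tag = 'bookstore',
    {tag = 'book', {tag = 'title', content = 'Harry Potter'}},
    {tag = 'book', {tag = 'title', content = 'Learning XML'}},
}

local titles = findall(sample, 'title')
for _, title in ipairs(titles) do
    print(title.content)
end
```

In a real script, the root element returned by rtk.xmlparse() would take the place of sample.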

Advanced Example

It's also possible to provide custom callback functions that get invoked when a new tag is started (ontagstart), when an element's attribute is parsed (onattr), and when a tag is ended (ontagend). You can also optionally provide custom userdata of your choosing that is passed along to these callbacks.

Your callbacks can attach custom fields to the element or attribute tables that will be available after parsing is finished. With the onattr callback, you can also rewrite the attribute name and value by replacing those fields in the attribute table that is passed to the callback.

When using this more advanced invocation of rtk.xmlparse(), you pass it a table. The example below reuses the XML document from the first example, and maintains a total count of elements by tag across the document. It also rewrites the category attribute as genre and converts its values to upper case.

-- We'll keep track of element counts per tag.
local state = {counts={}}
local root = rtk.xmlparse{
    document,
    userdata=state,
    ontagstart=function(elem, state)
        state.counts[elem.tag] = (state.counts[elem.tag] or 0) + 1
    end,
    onattr=function(elem, attr, state)
        -- Rewrite the 'category' attribute as 'genre' with an uppercase value
        if attr.name == 'category' then
            attr.name = 'genre'
            attr.value = attr.value:upper()
        end
    end,
}
rtk.log.info('tag counts: %s', table.tostring(state.counts))
-- This time we access the category attribute by its rewritten name 'genre',
-- whose value has been converted to uppercase.
rtk.log.info('first book genre: %s', root[1].attrs.genre.value)

The above example is somewhat contrived, since the callback functions could simply be closures around state and access it directly. Passing userdata is useful when you want to reuse callback functions that are defined independently of the state they operate on.
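
For comparison, a closure-based variant might look like the following sketch. It omits userdata and lets the callback capture a local table instead. To keep the snippet runnable outside REAPER, the callback is invoked by hand with stand-in element tables; in a real script the options table would also hold the document string as its first positional field and be passed to rtk.xmlparse().

```lua
-- Counts are captured by the closure, so no userdata field is needed.
local counts = {}
local options = {
    -- options[1] (or options.xml) would hold the XML document string.
    ontagstart = function(elem)
        counts[elem.tag] = (counts[elem.tag] or 0) + 1
    end,
}

-- Stand-in for the parser invoking the callback once per opening tag
-- (assumption: only the tag field of the element is needed here).
for _, tag in ipairs({'bookstore', 'book', 'title', 'book', 'title'}) do
    options.ontagstart({tag = tag})
end
print(counts.bookstore, counts.book, counts.title)
```

The trade-off is that a closure ties the callback to one particular counts table, whereas the userdata form lets the same function be shared across parses with different state.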

Limitations

The parser has the following limitations:

  • Document Type Definitions (DTD) are not supported and will be ignored. Documents containing DTDs that define custom entities that are used in the document will not be parsed properly (the custom entities will remain unconverted).
  • Because DTDs aren't supported, the parser is non-validating.
  • The encoding field in the XML declaration (<?xml encoding="UTF-8"?>) is ignored. Only UTF-8 is supported.
  • It's not going to win any awards on speed. It's about as fast as Python's minidom parser, and about 2x faster than xml2lua (but it's also less featureful than either), and it gets decimated by C-native parsers. But it's perfectly serviceable for smaller XML docs (under a few thousand lines).

Synopsis

Functions
rtk.xmlparse()

Parses an XML document, returning the root element

rtk.xmlparse(args)

Parses an XML document, returning the root element.

The args parameter is either a string that contains the XML document to parse, or it's a table that acts as keyword arguments that allow defining more options to control parsing. When args is a table, it takes the following fields.

| Field | Type | Required | Description |
|---|---|---|---|
| xml or first positional field | string | yes | the XML document string to parse |
| userdata or second positional field | any | no | arbitrary user-defined data that is passed to callback functions |
| ontagstart | function | no | an optional callback function that will be invoked when a new tag is started, before attribute parsing has begun. The callback takes the arguments (element, userdata), where element is the element table defined below, and userdata is the same-named field passed in the args table, which will be nil if not defined. |
| onattr | function | no | an optional callback function that will be invoked when an attribute has been parsed within the currently open tag. The callback takes the arguments (element, attr, userdata), where attr is the attribute table defined below. |
| ontagend | function | no | an optional callback function that will be invoked when a tag is closed. The callback takes the arguments (element, userdata). |

Element Tables

Elements are represented as Lua tables, with named fields for metadata about the element, and positional fields holding the element's children. Each element table has:

| Field | Type | Description |
|---|---|---|
| tag | string | The tag name of the element |
| attrs | table or nil | A table of attributes for the element keyed on attribute name, or nil if the element has no attributes |
| content | string or nil | The character data content within the element stripped of leading and trailing whitespace, or nil if there is no non-whitespace content |
| Positional fields | element table | Zero or more positional fields in the table representing any child elements |

Attribute Tables

The attrs field within an element table, if present, is a table keyed on attribute name and has the following fields:

| Field | Type | Description |
|---|---|---|
| value | string | the value of the attribute as defined in the XML document, or filtered by a user-defined onattr handler (see above) |
| name | string | the name of the attribute as defined in the XML document, or filtered by a user-defined onattr handler (see above). This is the same as the key in the element's attrs table. |
| type | string | describes how the attribute was parsed, and is one of: quoted, when the attribute was properly quoted in the document (e.g. lang="en"); unquoted, when the attribute lacks quotes (e.g. lang=en, which is technically invalid XML); or novalue, when the attribute has no value at all (where you can decide what, if anything, to use as a default value). |
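
Since attrs is nil for elements without attributes, and an attribute can be present without a value, lookups usually need a couple of guards. The following is a minimal sketch: getattr is a hypothetical helper (not part of rtk), and it assumes that an attribute of type novalue should fall back to the caller's default. The sample tables are hand-built to follow the layout described above.

```lua
-- Hypothetical helper (not part of rtk): returns the value of the named
-- attribute, or the default when the element has no attrs table, the
-- attribute is absent, or the attribute was parsed without a value.
local function getattr(elem, name, default)
    local attr = elem.attrs and elem.attrs[name]
    if not attr or attr.type == 'novalue' then
        return default
    end
    return attr.value
end

-- Hand-built stand-ins for parsed elements.
local book = {
    tag = 'book',
    attrs = {
        category = {name = 'category', value = 'web', type = 'quoted'},
        featured = {name = 'featured', type = 'novalue'},
    },
}
local bare = {tag = 'author'}

print(getattr(book, 'category', 'unknown'))
print(getattr(book, 'featured', 'yes'))
print(getattr(bare, 'category', 'unknown'))
```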
Usage
local root = rtk.xmlparse([[
    <addressbook>
      <contact>
        <name>Alice</name>
        <email>alice@example.com</email>
      </contact>
      <contact>
        <name>Bob</name>
        <email>bob@example.com</email>
      </contact>
    </addressbook>
]])
Parameters
args (string or table)

the XML document string or table containing the XML document string and other parser options (see above)

Return Values
(table or nil)

the element table of the root node, or nil if the document could not be parsed
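
Because nil is returned when parsing fails, callers may want to check the result before indexing into it. A minimal sketch follows. Here parse_or_fail is a hypothetical wrapper (not part of rtk) that takes the parser function as an argument, so the snippet can be exercised with a stub when rtk is not loaded; in a real script you would pass rtk.xmlparse. The error message text is also an illustrative choice.

```lua
-- Hypothetical wrapper (not part of rtk): turns a nil result into the
-- conventional Lua pair of nil plus an error message.
local function parse_or_fail(parser, document)
    local root = parser(document)
    if not root then
        return nil, 'XML document could not be parsed'
    end
    return root
end

-- Stub standing in for rtk.xmlparse: it "parses" only non-empty strings.
local function stubparser(document)
    if document ~= '' then
        return {tag = 'root'}
    end
    return nil
end

local root, err = parse_or_fail(stubparser, '<root/>')
local failed, reason = parse_or_fail(stubparser, '')
print(root.tag, err, failed, reason)
```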