xml
This module is not stable and its API and behaviors are subject to change. Do not use this API if you are targeting the ReaPack installation, as future versions are likely to introduce backward incompatible changes. If you bundle your own copy of rtk with your script then this risk is removed.
Feedback is appreciated during this API preview phase.
This module implements a fairly naive XML parser, supporting a limited but useful subset of XML. Despite its limitations (see below for more details), in practice it copes with many XML documents in the wild.
It tries to be robust in what it does implement, tolerating common minor malformations (such as attributes missing quotes). In this sense alone it is not a valid XML processor, as the XML spec requires any document with syntactically invalid content to be rejected. Meanwhile, rtk's implementation favors pragmatism when reasonable.
In its simplest invocation, rtk.xmlparse() reads a string containing an XML document and returns a hierarchy of nested Lua tables representing the parsed document:
local document = [[
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>
]]
local root = rtk.xmlparse(document)
rtk.log.info('root element is: %s', root.tag)
rtk.log.info('first element is: %s, category=%s', root[1].tag, root[1].attrs.category.value)
Each element is expressed as a Lua table. See rtk.xmlparse() for details about how element tables are structured.
This example crawls the parsed element hierarchy, printing a nested tree of elements showing their tag names and content (if any):
local function showelem(elem, depth)
    -- Indent the output according to the element's nested depth
    rtk.log.info('%s- %s (%s)', string.rep(' ', depth * 2), elem.tag, elem.content)
    for _, child in ipairs(elem) do
        showelem(child, depth + 1)
    end
end
showelem(root, 0)
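The same recursive walk can be used to search the tree rather than print it. The following is a minimal sketch, not part of rtk: findall is a hypothetical helper name, and the sample table is written by hand to mimic the documented element structure so that the code runs without rtk loaded.

```lua
-- Hypothetical helper (not part of rtk): recursively collect every element
-- whose tag matches `tag`, in document order.
local function findall(elem, tag, results)
    results = results or {}
    if elem.tag == tag then
        results[#results + 1] = elem
    end
    -- Child elements live in the positional fields of the element table.
    for _, child in ipairs(elem) do
        findall(child, tag, results)
    end
    return results
end

-- Hand-written stand-in for the table rtk.xmlparse() is documented to return.
local sample = {
    tag = 'bookstore',
    {tag = 'book', {tag = 'title', content = 'Harry Potter'}},
    {tag = 'book', {tag = 'title', content = 'Learning XML'}},
}

local titles = findall(sample, 'title')
for _, title in ipairs(titles) do
    print(title.content)
end
```

With rtk available, the root returned by rtk.xmlparse(document) could be passed in place of sample.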
It's also possible to provide custom callback functions that get invoked when a new tag is started (ontagstart), when an element's attribute is parsed (onattr), and when a tag is ended (ontagend). You can also optionally provide custom userdata of your choosing that is passed along to these callbacks.
Your callbacks can attach custom fields to the element or attribute tables that will be available after parsing is finished. With the onattr callback, you can also rewrite the attribute name and value by replacing those fields in the attribute table that is passed to the callback.
When using this more advanced invocation of rtk.xmlparse(), you pass it a table. The example below reuses the XML document from the first example, and maintains a total count of elements by tag across the document. It also rewrites the category attribute as genre and converts its values to upper case.
-- We'll keep track of element counts per tag.
local state = {counts={}}
local root = rtk.xmlparse{
    document,
    userdata=state,
    ontagstart=function(elem, state)
        state.counts[elem.tag] = (state.counts[elem.tag] or 0) + 1
    end,
    onattr=function(elem, attr, state)
        -- Rewrite the 'category' attribute as 'genre' with an uppercase value
        if attr.name == 'category' then
            attr.name = 'genre'
            attr.value = attr.value:upper()
        end
    end,
}
rtk.log.info('tag counts: %s', table.tostring(state.counts))
-- This time we access the category attribute by its rewritten name 'genre',
-- whose value has been converted to uppercase.
rtk.log.info('first book genre: %s', root[1].attrs.genre.value)
The above example is a little contrived, because the callback functions could easily be closures around state and access it directly, but userdata is useful when you want to reuse callback functions that are independent of any particular state.
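To illustrate that reuse, here is a small sketch in which one callback, defined once, serves two independent state tables. The names count_tag, first, second, doc_a and doc_b are hypothetical, and the callback is invoked by hand to simulate the parser so that the code runs without rtk loaded.

```lua
-- Defined once; it reads nothing from its enclosing scope, so the same
-- function can serve any number of parses.
local function count_tag(elem, state)
    state.counts[elem.tag] = (state.counts[elem.tag] or 0) + 1
end

-- Two independent state tables, as might be used for two different documents.
local first = {counts = {}}
local second = {counts = {}}

-- With rtk loaded, each parse would receive its own userdata, for example
-- (doc_a and doc_b are hypothetical document strings):
--   rtk.xmlparse{doc_a, userdata=first, ontagstart=count_tag}
--   rtk.xmlparse{doc_b, userdata=second, ontagstart=count_tag}
-- Here the callback is called directly to simulate the parser invoking it.
count_tag({tag = 'book'}, first)
count_tag({tag = 'book'}, first)
count_tag({tag = 'contact'}, second)

print(first.counts.book, second.counts.contact, second.counts.book)
```

A closure would need to be recreated for each state table, whereas this function is written once and parameterized through userdata.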
The parser has the following limitations:
The encoding field in the XML declaration (<?xml encoding="UTF-8"?>) is ignored. Only UTF-8 is supported.
rtk.xmlparse(args)
Parses an XML document, returning the root element.
The args parameter is either a string that contains the XML document to parse, or a table that acts as keyword arguments, allowing more options to control parsing. When args is a table, it takes the following fields:
Field | Type | Required | Description |
---|---|---|---|
xml or first positional field | string | yes | the XML document string to parse |
userdata or second positional field | any | no | arbitrary user-defined data that is passed to callback functions |
ontagstart | function | no | an optional callback function that will be invoked when a new tag is started, before attribute parsing has begun. The callback takes the arguments (element, userdata), where element is the element table defined below, and userdata is the same-named field passed in the args table, which will be nil if not defined. |
onattr | function | no | an optional callback function that will be invoked when an attribute has been parsed within the currently open tag. The callback takes the arguments (element, attr, userdata), where attr is the attribute table defined below. |
ontagend | function | no | an optional callback function that will be invoked when a tag is closed. The callback takes the arguments (element, userdata). |
Elements are represented as Lua tables, with named fields for metadata about the element, and positional fields holding the element's children. Each element table has:
Field | Type | Description |
---|---|---|
tag | string | The tag name of the element |
attrs | table or nil | A table of attributes for the element keyed on attribute name, or nil if the element has no attributes |
content | string or nil | The character data content within the element stripped of leading and trailing whitespace, or nil if there is no non-whitespace content |
Positional fields | element table | Zero or more positional fields in the table representing any child elements |
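Because content is always a string or nil, numeric values such as prices need an explicit conversion. The sketch below shows one way to do that defensively. The helper name numcontent is hypothetical and not part of rtk, and the element tables are written by hand so the code runs on its own.

```lua
-- Hypothetical helper (not part of rtk): return an element's content as a
-- number, or `default` when the content is nil or not numeric.
local function numcontent(elem, default)
    local n = elem.content and tonumber(elem.content)
    if n == nil then
        return default
    end
    return n
end

-- Hand-written element tables following the documented structure.
local price = {tag = 'price', content = '29.99'}
local empty = {tag = 'price'}                  -- no non-whitespace content
local text = {tag = 'price', content = 'n/a'}  -- content that is not a number

print(numcontent(price, 0), numcontent(empty, 0), numcontent(text, 0))
```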
The attrs field within an element table, if present, is a table keyed on attribute name. Each value is an attribute table with the following fields:
Field | Type | Description |
---|---|---|
value | string | the value of the attribute as defined in the XML document, or filtered by a user-defined onattr handler (see above) |
name | string | the name of the attribute as defined in the XML document, or filtered by a user-defined onattr handler (see above). This is the same as the key in the element's attrs table. |
type | string | contains context of how the attribute was parsed, which is one of quoted when the attribute was properly quoted in the document (e.g. lang="en"), unquoted when the attribute lacks quotes (e.g. lang=en, which is technically invalid XML), or novalue when the attribute has no value at all (where you can decide what, if anything, to use as a default value). |
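As a sketch of how the type field might be used, the onattr callback below assigns a default to valueless attributes, and a second helper reads attribute values without failing when attrs is nil. Both default_novalue and getattr are hypothetical names, not part of rtk, and the choice of 'true' as the default is arbitrary. The tables are built by hand to simulate what the parser would pass to the callback.

```lua
-- Hypothetical onattr callback: give attributes that have no value
-- (e.g. <input disabled>) the string 'true'. The default is arbitrary.
local function default_novalue(elem, attr, userdata)
    if attr.type == 'novalue' then
        attr.value = 'true'
    end
end

-- Hypothetical helper: read an attribute value, tolerating elements whose
-- attrs field is nil and attributes that are absent.
local function getattr(elem, name, default)
    local attr = elem.attrs and elem.attrs[name]
    if attr and attr.value ~= nil then
        return attr.value
    end
    return default
end

-- Simulate the parser by hand: build an attribute table and invoke the callback.
local disabled = {name = 'disabled', type = 'novalue'}
local input = {tag = 'input', attrs = {disabled = disabled}}
default_novalue(input, disabled, nil)

print(getattr(input, 'disabled'))
print(getattr(input, 'missing', 'n/a'))
print(getattr({tag = 'p'}, 'lang', 'en'))
```

With rtk loaded, the callback would be supplied as rtk.xmlparse{document, onattr=default_novalue}.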
local root = rtk.xmlparse([[
<addressbook>
    <contact>
        <name>Alice</name>
        <email>alice@example.com</email>
    </contact>
    <contact>
        <name>Bob</name>
        <email>bob@example.com</email>
    </contact>
</addressbook>
]])
Parameters
args | (string or table) | the XML document string or table containing the XML document string and other parser options (see above) |
Return Values
(table or nil) | the element table of the root node, or nil if the document could not be parsed |
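Since the return value may be nil, callers should check it before indexing into the result. The following is one possible sketch. The parse_checked wrapper and the two stand-in parser functions are hypothetical and exist only so the example runs without rtk. With rtk loaded, rtk.xmlparse would be passed as the parse function instead.

```lua
-- Hypothetical wrapper (not part of rtk): call the given parse function and
-- turn a nil result into a (nil, message) pair the caller can handle.
local function parse_checked(parse, xml)
    local root = parse(xml)
    if not root then
        return nil, 'document could not be parsed'
    end
    return root
end

-- Stand-in parsers so the sketch is self-contained. With rtk loaded this
-- would be called as parse_checked(rtk.xmlparse, document).
local function always_fails(xml) return nil end
local function always_succeeds(xml) return {tag = 'addressbook'} end

local root, err = parse_checked(always_fails, '<addressbook>')
print(root, err)

root, err = parse_checked(always_succeeds, '<addressbook/>')
print(root.tag, err)
```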