LuaExpat Reference Manual
XML Expat parsing for the Lua programming language



Introduction

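LuaExpat is a SAX XML parser based on the Expat library. SAX is the Simple API for XML and allows programs to process an XML document incrementally and to register handler functions that the parser calls while it processes the document.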

With an event-based API like SAX, the XML document can be fed to the parser in pieces, and the parsing begins as soon as the parser receives some part of the document. LuaExpat reports parsing events (such as the start and end of elements) directly to the application through callbacks. The parsing of huge documents can benefit from this piece-by-piece operation.
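
As a rough sketch of this piece-by-piece operation, the excerpt below feeds one small document to the parser in three arbitrary chunks, one of them cut in the middle of a tag. The document and the chunk boundaries are made up for illustration, and the sketch assumes a LuaExpat version in which require "lxp" returns the module table.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local seen = {}
local p = lxp.new({
  StartElement = function (parser, name)
    seen[#seen + 1] = name        -- record each element as soon as it starts
  end
})

-- hypothetical document, split at arbitrary points
local chunks = { "<lib", "rary><bo", "ok/><book/></library>" }
for _, chunk in ipairs(chunks) do
  assert(p:parse(chunk))          -- each call continues where the last one stopped
end
assert(p:parse())                 -- no argument: finishes the document
p:close()

print(table.concat(seen, " "))    -- library book book
```

The parser never needs the whole document at once, so the chunks could just as well come from a file or a socket.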

Parser objects

SAX implementations usually base all their operations on the concept of a parser that allows the registration of callback functions. LuaExpat offers the same functionality but uses a different registration method, based on a table of callbacks. This table contains references to the callback functions, which are responsible for handling the document parts. The parser assumes no behaviour for callbacks that are not declared.

Constructor

lxp.new(callbacks [, separator])

The parser is created by a call to the function lxp.new, which returns the created parser or raises a Lua error. It receives the callbacks table and, optionally, the separator character used in the namespace-expanded element names.

Methods

parser:close()

Closes the parser, freeing all memory associated with it. A call to parser:close() without a previous call to parser:parse() could result in an error.

parser:getbase()

Returns the base for resolving relative URIs.

parser:getcallbacks()

Returns the callbacks table.

parser:parse(s)

Parses some more of the document. The string s contains part (or perhaps all) of the document. When called without arguments, the document is closed (but the parser still has to be closed).

The function returns a non-nil value when the parser has been successful. When the parser finds an error, it returns five results: nil, msg, line, col, and pos, where msg is the error message and the others are the line, column and absolute position of the error in the XML document.
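
The five results can be inspected as in the sketch below. The malformed input is invented for illustration. The guarded close at the end is an assumption about how the library behaves after a failed parse, not something this manual states.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local p = lxp.new({})

-- hypothetical malformed input: <b> is never closed
local ok, msg, line, col, pos = p:parse("<a>\n  <b></a>")
if not ok then
  io.write("XML error: ", msg, " (line ", line, ", column ", col,
           ", position ", pos, ")\n")
end

-- closing a parser that has already failed may raise an error itself,
-- so the call is guarded here (an assumption, not documented behaviour)
pcall(p.close, p)
```

Checking the first result of every parser:parse call, for instance with assert, keeps a malformed chunk from going unnoticed.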

parser:pos()

Returns three results: the current parsing line, column, and absolute position.
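
A possible use of parser:pos() is to record where each element begins. The sketch below calls it from inside a callback, on a made-up document; the names positions and where are only illustrative.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local positions = {}
local p = lxp.new({
  StartElement = function (parser, name)
    local line, col, pos = parser:pos()   -- position of the current event
    positions[#positions + 1] = { name = name, line = line }
  end
})

assert(p:parse("<root>\n  <child/>\n</root>"))
assert(p:parse())
p:close()

for _, where in ipairs(positions) do
  io.write(where.name, " starts on line ", where.line, "\n")
end
```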

parser:setbase(base)

Set the base to be used for resolving relative URIs in system identifiers.
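
A minimal sketch of the two base methods used together follows. The URI is a placeholder, and a small document is parsed before closing because closing a parser that has parsed nothing could result in an error.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local p = lxp.new({})
p:setbase("http://example.org/docs/")    -- hypothetical base URI
local base = p:getbase()
print(base)                              -- http://example.org/docs/

assert(p:parse("<doc/>"))                -- parse something before closing
assert(p:parse())
p:close()
```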

parser:setencoding(encoding)

Set the encoding to be used by the parser. There are four built-in encodings, passed as strings: "US-ASCII", "UTF-8", "UTF-16", and "ISO-8859-1".
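
The sketch below overrides the encoding for a document that carries no XML declaration. It assumes Expat's usual behaviour of handing text to the callbacks as UTF-8, and it sets the encoding before the first call to parser:parse, as Expat requires.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local pieces = {}
local p = lxp.new({
  CharacterData = function (parser, text)
    pieces[#pieces + 1] = text
  end
})

p:setencoding("ISO-8859-1")            -- before any data is parsed
assert(p:parse("<w>caf\233</w>"))      -- \233 is "é" as one ISO-8859-1 byte
assert(p:parse())
p:close()

local text = table.concat(pieces)
print(#text)                           -- 5: "caf" plus a two-byte UTF-8 sequence
```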

Callbacks

The Lua callbacks define the handlers of the parser events. The use of a table in the parser constructor has some advantages over the registration of callbacks, since there is no need for callback manipulation functionality in the API.

Another difference lies in the behaviour of the callbacks during the parsing itself. The callback table contains references to functions that can be redefined at will; the only restriction is that only the callbacks present in the table at creation time will be called.

The callback indexes are named after the equivalent Expat callbacks and are CharacterData, Comment, Default, DefaultExpand, EndCdataSection, EndElement, EndNamespaceDecl, ExternalEntityRef, NotStandalone, NotationDecl, ProcessingInstruction, StartCdataSection, StartElement, StartNamespaceDecl, and UnparsedEntityDecl.

Each of these indexes can refer to a function with a specific signature, as seen below. The parser constructor also checks for the presence of a field called _nonstrict in the callback table. If _nonstrict is absent, only valid callback names are accepted as indexes in the table (Defaultexpanded would be considered an error, for example). If _nonstrict is defined, any other field names can be used.
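
The effect of _nonstrict can be sketched as follows. The misspelled key is the one mentioned above, and the label field is an arbitrary extra added for illustration.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local function noop() end

-- strict (default): the misspelled key makes the constructor raise an error
local ok, err = pcall(lxp.new, { Defaultexpanded = noop })
print(ok)                                   -- false

-- with _nonstrict set, unknown fields are accepted by the constructor
local p = lxp.new({ _nonstrict = true, Defaultexpanded = noop, label = "extra" })
assert(p:parse("<doc/>"))
assert(p:parse())
p:close()
```

Leaving the check on catches misspelled callback names at construction time instead of letting them fail silently.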

The callbacks can optionally be defined as false, thus behaving as placeholders for future assignment of functions.

Every callback function receives the calling parser itself as its first parameter, which allows the same functions to be used by more than one parser.
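
One way to use the parser parameter is to keep per-parser state in a table indexed by the parser, as in this sketch. The counts table and both documents are illustrative only.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local counts = {}           -- number of elements seen, indexed by parser
local shared = {
  StartElement = function (parser, name)
    counts[parser] = (counts[parser] or 0) + 1
  end
}

local a = lxp.new(shared)   -- both parsers use the same callback table
local b = lxp.new(shared)
assert(a:parse("<x><y/><y/></x>"))
assert(b:parse("<x/>"))
assert(a:parse())
assert(b:parse())

print(counts[a], counts[b]) -- 3	1
a:close()
b:close()
```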

callbacks.CharacterData = function(parser, string)

Called when the parser recognizes an XML CData string.
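
The text of one element may reach this callback in more than one piece, for example when the document is fed in chunks. A cautious handler therefore accumulates the pieces and joins them when the element ends. This sketch uses a made-up document split in the middle of the text.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local buffer, messages = {}, {}
local p = lxp.new({
  StartElement = function (parser, name)
    buffer = {}                            -- start collecting afresh
  end,
  CharacterData = function (parser, text)
    buffer[#buffer + 1] = text             -- may be called several times
  end,
  EndElement = function (parser, name)
    messages[#messages + 1] = table.concat(buffer)
  end
})

assert(p:parse("<msg>fish &amp; "))        -- text is cut between two chunks
assert(p:parse("chips</msg>"))
assert(p:parse())
p:close()

print(messages[1])                         -- fish & chips
```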

callbacks.Comment = function(parser, string)

Called when the parser recognizes an XML comment string.
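
As a small illustration with an invented document, the string passed to the callback is the text between the comment delimiters, surrounding spaces included.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local comment
local p = lxp.new({
  Comment = function (parser, text)
    comment = text
  end
})

assert(p:parse("<r><!-- note --></r>"))
assert(p:parse())
p:close()

print("[" .. comment .. "]")   -- [ note ]
```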

callbacks.Default = function(parser, string)

Called when the parser has a string corresponding to any characters in the document which wouldn't otherwise be handled. Using this handler has the side effect of turning off expansion of references to internally defined general entities. Instead these references are passed to the default handler.

callbacks.DefaultExpand = function(parser, string)

Called when the parser has a string corresponding to any characters in the document which wouldn't otherwise be handled. Using this handler doesn't affect expansion of internal entity references.
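
The difference between Default and DefaultExpand can be sketched by parsing the same document twice, once with each handler. The document and the helper function run are invented for this illustration; the expected behaviour follows the two descriptions above.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

-- hypothetical document with one internal general entity
local doc = '<!DOCTYPE r [<!ENTITY who "world">]><r>hello &who;</r>'

-- parses doc with CharacterData plus one of the two default handlers
local function run(handlerName)
  local text, rest = {}, {}
  local p = lxp.new({
    CharacterData = function (parser, s) text[#text + 1] = s end,
    [handlerName] = function (parser, s) rest[#rest + 1] = s end
  })
  assert(p:parse(doc))
  assert(p:parse())
  p:close()
  return table.concat(text), table.concat(rest, "\n")
end

local plainText, plainRest = run("Default")
local expandedText = run("DefaultExpand")

print(plainText)      -- the reference is not expanded; it goes to Default
print(expandedText)   -- the reference is expanded into the character data
```

With Default the reference &who; should show up among the strings collected in plainRest, while with DefaultExpand its replacement text becomes part of the character data.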

callbacks.EndCdataSection = function(parser)

Called when the parser detects the end of a CDATA section.

callbacks.EndElement = function(parser, elementName)

Called when the parser detects the ending of an XML element with elementName.

callbacks.EndNamespaceDecl = function(parser, namespaceName)

Called when the parser detects the ending of an XML namespace with namespaceName. The handling of the end namespace is done after the handling of the end tag for the element the namespace is associated with.

callbacks.ExternalEntityRef = function(parser, subparser, base, systemId, publicId)

Called when the parser detects an external entity reference.

The subparser is a LuaExpat parser created with the same callbacks and Expat context as the parser and should be used to parse the external entity.

The base parameter is the base to use for relative system identifiers. It is set by parser:setbase and may be nil.

The systemId parameter is the system identifier specified in the entity declaration and is never nil.

The publicId parameter is the public id given in the entity declaration and may be nil.

callbacks.NotStandalone = function(parser)

Called when the parser detects that the document is not "standalone". This happens when the document has an external subset or a reference to a parameter entity, but does not have standalone set to "yes" in an XML declaration.

callbacks.NotationDecl = function(parser, notationName, base, systemId, publicId)

Called when the parser detects XML notation declarations with notationName.

The base parameter is the base to use for relative system identifiers. It is set by parser:setbase and may be nil.

The systemId parameter is the system identifier specified in the notation declaration and is never nil.

The publicId parameter is the public id given in the notation declaration and may be nil.

callbacks.ProcessingInstruction = function(parser, target, data)

Called when the parser detects XML processing instructions. The target is the first word in the processing instruction. The data is the rest of the characters in it after skipping all whitespace after the initial word.
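
The split between target and data can be seen with a made-up processing instruction, as in this sketch.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local target, data
local p = lxp.new({
  ProcessingInstruction = function (parser, t, d)
    target, data = t, d
  end
})

-- hypothetical processing instruction inside the root element
assert(p:parse('<doc><?render mode="fast" now?></doc>'))
assert(p:parse())
p:close()

print(target)   -- render
print(data)     -- mode="fast" now
```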

callbacks.StartCdataSection = function(parser)

Called when the parser detects the beginning of an XML CDATA section.
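
The two CDATA callbacks bracket the calls to CharacterData for the section content. The sketch below logs the events for an invented document, using the key spellings StartCdataSection and EndCdataSection from the callback signatures.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local events = {}
local function log(entry) events[#events + 1] = entry end

local p = lxp.new({
  StartCdataSection = function (parser) log("cdata start") end,
  CharacterData     = function (parser, text) log("text: " .. text) end,
  EndCdataSection   = function (parser) log("cdata end") end
})

assert(p:parse("<r><![CDATA[a < b]]></r>"))
assert(p:parse())
p:close()

print(table.concat(events, "\n"))
```

The log should begin with cdata start and finish with cdata end, with the unescaped text a < b reported in between.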

callbacks.StartElement = function(parser, elementName, attributes)

Called when the parser detects the beginning of an XML element with elementName.

The attributes parameter is a Lua table with all the element attribute names and values. The table contains an entry for every attribute in the element start tag and entries for the default attributes for that element.

The attributes are listed by name (including the inherited ones) and by position: the numeric entries hold the attribute names in the order in which they appear in the start tag (inherited attributes are not considered in the position list).

As an example, if the book element has attributes author, title and an optional format attribute (with "printed" as default value),

<book author="Ierusalimschy, Roberto" title="Programming in Lua">
would be represented as

        {[1] = "author",
         [2] = "title",
         author = "Ierusalimschy, Roberto",
         format = "printed",
         title = "Programming in Lua"}
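
The book example can be tried with a sketch like the one below. The internal DTD subset that declares the default value of format is an assumption added here so that the excerpt is self-contained; the sketch reads the attributes by name and only counts the positional entries.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local attrs
local p = lxp.new({
  StartElement = function (parser, elementName, attributes)
    attrs = attributes
  end
})

-- hypothetical DTD subset declaring the default value of format
assert(p:parse('<!DOCTYPE book [<!ATTLIST book format CDATA "printed">]>'))
assert(p:parse('<book author="Ierusalimschy, Roberto" title="Programming in Lua"/>'))
assert(p:parse())
p:close()

print(attrs.author)   -- Ierusalimschy, Roberto
print(attrs.format)   -- printed (inherited from the declaration)
print(#attrs)         -- 2: only author and title appear in the start tag
```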
        

callbacks.StartNamespaceDecl = function(parser, namespaceName)

Called when the parser detects an XML namespace declaration with namespaceName. Namespace declarations occur inside start tags, but the StartNamespaceDecl handler is called before the StartElement handler for each namespace declared in that start tag.

callbacks.UnparsedEntityDecl = function(parser, entityName, base, systemId, publicId, notationName)

Called when the parser receives declarations of unparsed entities. These are entity declarations that have a notation (NDATA) field.

As an example, in the chunk <!ENTITY logo SYSTEM "images/logo.gif" NDATA gif> entityName would be "logo", systemId would be "images/logo.gif" and notationName would be "gif". For this example the publicId parameter would be nil. The base parameter would be whatever has been set with parser:setbase. If not set, it would be nil.
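
The declaration above can be checked with a sketch along these lines. The surrounding DOCTYPE, the notation declaration and the base URI are invented so that the excerpt forms a complete document.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local decl
local p = lxp.new({
  UnparsedEntityDecl = function (parser, entityName, base, systemId,
                                 publicId, notationName)
    decl = { entityName = entityName, base = base, systemId = systemId,
             publicId = publicId, notationName = notationName }
  end
})

p:setbase("http://example.org/")   -- hypothetical base
assert(p:parse('<!DOCTYPE d [<!NOTATION gif SYSTEM "image/gif">' ..
               '<!ENTITY logo SYSTEM "images/logo.gif" NDATA gif>]><d/>'))
assert(p:parse())
p:close()

print(decl.entityName, decl.systemId, decl.notationName)
print(decl.base, decl.publicId)
```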

The separator character

The optional separator character in the parser constructor defines the character used in the namespace-expanded element names. If it is not defined, the parser will not handle namespaces; if it is defined, it has to be different from the character '\0'.
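
The sketch below creates a namespace-aware parser with "|" as the separator (an arbitrary choice) and logs the order of the namespace and element events for a made-up document. The first argument of the namespace callbacks is printed as received; in this document it is the prefix a.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local events = {}
local function log(kind)
  return function (parser, name)
    events[#events + 1] = kind .. " " .. tostring(name)
  end
end

local p = lxp.new({
  StartNamespaceDecl = log("ns-start"),
  EndNamespaceDecl   = log("ns-end"),
  StartElement       = log("start"),
  EndElement         = log("end")
}, "|")

assert(p:parse('<a:root xmlns:a="urn:example"><a:item/></a:root>'))
assert(p:parse())
p:close()

print(table.concat(events, "\n"))
-- ns-start a
-- start urn:example|root
-- start urn:example|item
-- end urn:example|item
-- end urn:example|root
-- ns-end a
```

Element names arrive as the namespace URI, the separator and the local name, so the separator should be a character that cannot occur in either part.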

Examples

The code excerpt below creates a parser with 2 callbacks and feeds a test string to it. The parsing of the test string triggers the callbacks, printing the results.

  local lxp = require "lxp"

  local count = 0
  local callbacks = {
    StartElement = function (parser, name)
      io.write("+ ", string.rep(" ", count), name, "\n")
      count = count + 1
    end,
    EndElement = function (parser, name)
      count = count - 1
      io.write("- ", string.rep(" ", count), name, "\n")
    end
  }

  local p = lxp.new(callbacks)

  for l in io.lines() do  -- iterate over the input lines
    p:parse(l)            -- parses the line
    p:parse("\n")         -- parses the end of line
  end
  p:parse()               -- finishes the document
  p:close()               -- closes the parser
  

For a test string like

  <elem1>
    text
    <elem2/>
    more text
  </elem1>
  

The example would print

  + elem1
  +  elem2
  -  elem2
  - elem1
  

Note that the text parts are not handled since the corresponding callback (CharacterData) has not been defined. Also note that defining this callback after the call to lxp.new would make no difference. But had the callback table been defined as

  local callbacks = {
    StartElement = function (parser, name)
      io.write("+ ", string.rep(" ", count), name, "\n")
      count = count + 1
    end,
    EndElement = function (parser, name)
      count = count - 1
      io.write("- ", string.rep(" ", count), name, "\n")
    end,
    -- the parameter is called text so that it does not hide the string library
    CharacterData = function (parser, text)
      io.write("* ", string.rep(" ", count), text, "\n")
    end
  }
  

Ignoring the chunks that contain only whitespace and the indentation carried by the text itself, the results would have been

  + elem1
  *  text
  +  elem2
  -  elem2
  *  more text
  - elem1
  

Another example would be the use of false as a placeholder for the callback. Suppose that we would like to print only the text associated with elem2 elements and that the XML sample was

  <elem1>
    text
    <elem2>
      inside text
    </elem2>
    more text
  </elem1>
  

We could define the new callback table as

  callbacks = {
    StartElement = function (parser, name)
      if name == "elem2" then
        callbacks.CharacterData = function (parser, string)   -- redefines CharacterData behaviour
          io.write(string, "\n")
        end
      end
    end,

    EndElement = function (parser, name)
      if name == "elem2" then
        callbacks.CharacterData = false     -- restores placeholder
      end
    end,

    CharacterData = false                   -- placeholder
  }
  

Apart from whitespace, the results would have been

  inside text
  

Note that this example assumes no other elements are present inside elem2 tags.
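
If elem2 elements could be nested, one possible variation (a sketch, not part of the original example) is to count the open elem2 elements and restore the placeholder only when the outermost one ends. The depth counter, the collect function and the sample document are introduced here for illustration.

```lua
local lxp = require "lxp"   -- assumes require returns the module table

local depth = 0             -- number of elem2 elements currently open
local out = {}
local function collect(parser, text) out[#out + 1] = text end

local callbacks
callbacks = {
  StartElement = function (parser, name)
    if name == "elem2" then
      depth = depth + 1
      callbacks.CharacterData = collect     -- redefines CharacterData behaviour
    end
  end,

  EndElement = function (parser, name)
    if name == "elem2" then
      depth = depth - 1
      if depth == 0 then
        callbacks.CharacterData = false     -- restores placeholder
      end
    end
  end,

  CharacterData = false                     -- placeholder
}

local p = lxp.new(callbacks)
assert(p:parse("<elem1>skip<elem2>a<elem2>b</elem2>c</elem2>skip</elem1>"))
assert(p:parse())
p:close()

print(table.concat(out))                    -- abc
```

Without the counter, the inner end tag would restore the placeholder too early and the text c would be lost.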



$Id: manual.html,v 1.11 2004/01/12 10:42:34 tomas Exp $