XMLPARSE(n) 0.9 "XML"

NAME

XMLPARSE - Parser for XML files in Fortran

    TABLE OF CONTENTS
    SYNOPSIS
    DESCRIPTION
    PROCEDURES
    MOTIVATION
    PARAMETERS AND DERIVED TYPES
    GENERATING A READING ROUTINE
    EXAMPLES
    LIMITATIONS
    RELEASE NOTES
    TO DO
    KEYWORDS

SYNOPSIS

subroutine xml_open( info, filename, mustread )

subroutine xml_close( info )

subroutine xml_options( info, ... )

subroutine xml_get( info, tag, endtag, attribs, no_attribs, data, no_data )

subroutine xml_put( info, tag, attribs, no_attribs, data, no_data, type )

logical function xml_ok( info )

logical function xml_error( info )

logical function xml_data_trunc( info )

integer function xml_find_attrib( attribs, no_attribs, name, value )

subroutine read_xml_file_xxx( filename, lurep, error )

subroutine xml_process( filename, attribs, data, startfunc, datafunc, endfunc, lurep, error )

DESCRIPTION

The XML parser provided by this module has been written entirely in Fortran, making it possible to read and write XML files without the need to use mixed-language programming techniques.

It should be noted that the implementation has a number of limitations (cf. the section Limitations). The module has the following features:

Reading an XML-file (within certain limitations) in a stream-oriented manner.
Writing an XML-file in a stream-oriented manner.
Creating a reading routine that will fill a data structure. The data structure is described via an XML file and all necessary code to read files that conform to that structure is generated.

The module has been implemented in standard Fortran 90. The intention is to make it compilable by the F compiler as well, so that it can be used in conjunction with a wide range of Fortran compilers.

(It should even be possible to convert the parsing routines to an equivalent library in FORTRAN 77, though with the availability of several free Fortran 95 compilers, there seems little need for that.)

PROCEDURES

The module defines the following public routines and functions:

subroutine xml_open( info, filename, mustread )

Open an XML-file and fill the structure info, so that it can be used to refer to the opened file.

To check whether the file was opened successfully (it may fail to open for some reason), use the function xml_error().

Arguments:

info - TYPE(XML_PARSE) structure used to identify the file

filename - CHARACTER(LEN=*) name of the file to be opened

mustread - LOGICAL whether the file is opened for reading (true) or for writing (false)

subroutine xml_close( info )

Close an opened XML-file. If the file was not opened, this routine has no effect.

info - TYPE(XML_PARSE) structure used to identify the file
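
As an illustration of the open/check/close sequence described above, here is a minimal sketch. The file name example.xml is hypothetical, and it is assumed that a true value for mustread means "open for reading".

```fortran
program open_close_demo
    use xmlparse              ! the module described in this document
    implicit none

    type(XML_PARSE) :: info

    ! Assumption: .true. opens the (hypothetical) file for reading
    call xml_open( info, 'example.xml', .true. )

    if ( xml_error( info ) ) then
        ! The file could not be opened for some reason
        write(*,*) 'Could not open example.xml'
    else
        ! ... read the contents with xml_get ...
        call xml_close( info )
    endif
end program open_close_demo
```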

subroutine xml_options( info, ... )

Set one or more options. These are all defined as optional arguments, so that the name=value convention can be used to select an option and to set its value. The first argument is fixed:

info - TYPE(XML_PARSE) structure used to identify the file

All other arguments are optional and include:

ignore_whitespace - LOGICAL compress the array of strings (remove empty lines and remove leading blanks) for easier processing

no_data_truncation - LOGICAL if data truncation occurs (too many lines of data or too many attributes, so that they can not all be stored in the arrays), this can be marked as an error or not. If the option is set to true, it is considered an error.

report_lun - INTEGER LU-number of a file to which messages can be logged (use XML_STDOUT for output to screen)

report_errors - LOGICAL write error messages to the report

report_details - LOGICAL write detailed messages to the report, useful for debugging

Note that these options are off by default. They should be set after the file has been opened. The reporting options are an exception: they can be set before an XML file has been opened, because they hold globally (that is, they are in effect for all reading and writing, independent of the files).
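
The name=value convention might be used as in the sketch below. The file name is hypothetical; the option names are the ones listed above.

```fortran
program options_demo
    use xmlparse
    implicit none

    type(XML_PARSE) :: info

    ! Reporting options hold globally, so they may be set before opening
    call xml_options( info, report_lun = XML_STDOUT, report_errors = .true. )

    ! Hypothetical input file, opened for reading
    call xml_open( info, 'example.xml', .true. )

    ! The remaining options are set after the file has been opened
    call xml_options( info, ignore_whitespace = .true., &
                            no_data_truncation = .true. )

    ! ... read the file ...

    call xml_close( info )
end program options_demo
```

Because every option is an optional argument, only the options that should differ from the default need to be mentioned.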

subroutine xml_get( info, tag, endtag, attribs, no_attribs, data, no_data )

Read the current tag in the file up to the next one or the end-of-file. Store the attributes in the given array and do the same for the character data that may be present after the tag.

info - TYPE(XML_PARSE) structure used to identify the file

tag - CHARACTER(LEN=*) string that will hold the tag's name

endtag - LOGICAL indicates whether the current tag has ended or not

attribs - CHARACTER(LEN=*), DIMENSION(:,:) array of strings that will hold the attributes given to the tag

no_attribs - INTEGER number of attributes that were found

data - CHARACTER(LEN=*), DIMENSION(:) array of strings that will hold the character data (one element per line)

no_data - INTEGER number of lines of character data

Note: If an error occurs or end-of-file is found, use the functions xml_ok() and xml_error() to find out which condition applies.
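
A sketch of a complete reading loop, including the declarations of the work arrays, could look as follows. The array sizes, string lengths and the file name are arbitrary choices for this example. It is also an assumption that the first dimension of attribs has extent 2, holding the attribute name in element 1 and its value in element 2.

```fortran
program get_demo
    use xmlparse
    implicit none

    type(XML_PARSE)                     :: info
    character(len=80)                   :: tag
    logical                             :: endtag
    ! Assumed layout: attribs(1,i) is the name, attribs(2,i) the value
    character(len=80),  dimension(2,10) :: attribs
    character(len=200), dimension(100)  :: data
    integer                             :: no_attribs, no_data, i

    call xml_open( info, 'example.xml', .true. )
    if ( xml_error( info ) ) stop 'Could not open example.xml'

    do while ( xml_ok( info ) )
        call xml_get( info, tag, endtag, attribs, no_attribs, data, no_data )

        if ( xml_error( info ) ) then
            write(*,*) 'Error while reading the file'
            exit
        endif

        ! Report what was found for this tag
        write(*,*) 'Tag: ', trim(tag), '  ended: ', endtag
        do i = 1, no_attribs
            write(*,*) '   attribute ', trim(attribs(1,i)), ' = ', trim(attribs(2,i))
        enddo
        do i = 1, no_data
            write(*,*) '   data: ', trim(data(i))
        enddo
    enddo

    call xml_close( info )
end program get_demo
```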

subroutine xml_put( info, tag, attribs, no_attribs, data, no_data, type )

Write the information for the current tag to the file. This subroutine is the inverse, so to speak, of the subroutine xml_get that parses the XML input.

For a description of the arguments, other than type: see above.

type - CHARACTER(LEN=*) string having one of the following values:

'open' - Write an opening tag with attributes and data (if there are any). Useful for creating a hierarchy of tags.
'close' - Write a closing tag
'elem' - Write the element data
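
The three values of type might be combined as in the following sketch, which writes one element nested inside another. The file name, tag names and attribute are made up for the example, and the layout of attribs (name in element 1, value in element 2 of the first dimension) is an assumption.

```fortran
program put_demo
    use xmlparse
    implicit none

    type(XML_PARSE)                   :: info
    character(len=20), dimension(2,1) :: attribs
    character(len=40), dimension(1)   :: data

    attribs = ' '
    data    = ' '

    ! Assumption: .false. opens the (hypothetical) file for writing
    call xml_open( info, 'output.xml', .false. )

    ! Opening tag with one attribute and no character data
    attribs(1,1) = 'id'
    attribs(2,1) = '1'
    call xml_put( info, 'address', attribs, 1, data, 0, 'open' )

    ! Complete element with one line of character data, no attributes
    data(1) = 'John Doe'
    call xml_put( info, 'person', attribs, 0, data, 1, 'elem' )

    ! Closing tag that matches the opening tag above
    call xml_put( info, 'address', attribs, 0, data, 0, 'close' )

    call xml_close( info )
end program put_demo
```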

logical function xml_ok( info )

Returns whether the parser is still okay (no read errors or end-of-file).

info - TYPE(XML_PARSE) structure used to identify the file

logical function xml_error( info )

Returns whether the parser has encountered some error (see also the options).

info - TYPE(XML_PARSE) structure used to identify the file

logical function xml_data_trunc( info )

Returns whether the parser has had to truncate the data or the attributes.

info - TYPE(XML_PARSE) structure used to identify the file

integer function xml_find_attrib( attribs, no_attribs, name, value )

Convenience function that searches the list of attributes and returns the index of the sought attribute in the array, or -1 if it is not present. If the attribute is not present, the argument value is left unchanged, so you can set it to a default before the call.

attribs - CHARACTER(LEN=*), DIMENSION(:,:) array of strings that hold the attributes

no_attribs - INTEGER number of attributes that were found

name - CHARACTER(LEN=*) name of the attribute to be found

value - CHARACTER(LEN=*) actual or default value of the attribute upon return
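
The default-value idiom could be sketched as follows. The attribute names and values are invented, and the array layout (name in element 1, value in element 2 of the first dimension) is an assumption.

```fortran
program find_attrib_demo
    use xmlparse
    implicit none

    character(len=20), dimension(2,2) :: attribs
    character(len=20)                 :: value
    integer                           :: idx

    ! Two hypothetical attributes, as xml_get might have returned them
    attribs(1,1) = 'name'  ; attribs(2,1) = 'grid1'
    attribs(1,2) = 'units' ; attribs(2,2) = 'm'

    ! Set the default first; it is kept if the attribute is absent
    value = 'none'
    idx   = xml_find_attrib( attribs, 2, 'colour', value )

    if ( idx < 1 ) then
        write(*,*) 'Attribute "colour" not given, using default: ', trim(value)
    else
        write(*,*) 'Attribute "colour" has value: ', trim(value)
    endif
end program find_attrib_demo
```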

subroutine read_xml_file_xxx( filename, lurep, error )

Subroutine generated via the method described below to read an XML file of a particular structure.

filename - CHARACTER(LEN=*) name of the XML file to read

lurep - INTEGER LU-number to use for reporting errors (use 0 to write to the screen; optional)

error - LOGICAL variable that indicates if an error occurred while reading (optional).

subroutine xml_process( filename, attribs, data, startfunc, datafunc, endfunc, lurep, error )

Subroutine that reads the XML file and calls three user-defined subroutines to take care of the actual processing. This is a routine that implements the so-called SAX approach.

filename - CHARACTER(LEN=*) name of the XML file to read

attribs - CHARACTER(LEN=*), DIMENSION(:,:) work array to store the attributes

data - CHARACTER(LEN=*), DIMENSION(:) work array to store the character data associated with a tag

startfunc - Subroutine that is called to handle the start of a tag:

    subroutine startfunc( tag, attribs, error )
        character(len=*)                 :: tag
        character(len=*), dimension(:,:) :: attribs
        logical                          :: error

If the argument error is set to true (because the tag was unexpected or something similar), the reading is interrupted and the routine returns. Only the fact that something was wrong is recorded. You need to use other means to convey more information if that is needed.

datafunc - Subroutine that is called to handle the character data associated with a tag:

    subroutine datafunc( tag, data, error )
        character(len=*)               :: tag
        character(len=*), dimension(:) :: data
        logical                        :: error

endfunc - Subroutine that is called to handle the end of a tag:

    subroutine endfunc( tag, error )
        character(len=*) :: tag
        logical          :: error

lurep - INTEGER LU-number to use for reporting errors (use 0 to write to the screen; optional)

error - LOGICAL variable that indicates if an error occurred while reading (optional).
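
A sketch of how the three user-defined subroutines might be supplied is given below. The module name handlers, the routine names my_start, my_data and my_end, the file name and the array sizes are all hypothetical. The argument list of the data handler (tag, data, error) is taken from the declarations shown above. Putting the handlers in a module gives them the explicit interface that the assumed-shape array arguments require.

```fortran
module handlers
    implicit none
contains
    subroutine my_start( tag, attribs, error )
        character(len=*)                 :: tag
        character(len=*), dimension(:,:) :: attribs
        logical                          :: error
        write(*,*) 'Start of tag: ', trim(tag), &
                   ', attributes passed: ', size(attribs,2)
        error = .false.
    end subroutine my_start

    subroutine my_data( tag, data, error )
        character(len=*)               :: tag
        character(len=*), dimension(:) :: data
        logical                        :: error
        write(*,*) 'Data for tag: ', trim(tag), ', lines: ', size(data)
        error = .false.
    end subroutine my_data

    subroutine my_end( tag, error )
        character(len=*) :: tag
        logical          :: error
        write(*,*) 'End of tag: ', trim(tag)
        error = .false.
    end subroutine my_end
end module handlers

program process_demo
    use xmlparse
    use handlers
    implicit none

    ! Work arrays; sizes and lengths are arbitrary choices
    character(len=80),  dimension(2,10) :: attribs
    character(len=200), dimension(100)  :: data
    logical                             :: error

    ! lurep = 0: report errors to the screen
    call xml_process( 'example.xml', attribs, data, &
                      my_start, my_data, my_end, 0, error )

    if ( error ) write(*,*) 'An error occurred while reading example.xml'
end program process_demo
```

A handler that sets error to true stops the processing, so a real handler would do that when it meets a tag it does not expect.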

MOTIVATION

The use of XML-files as a means to store data and more importantly to transfer data between very disparate applications and organisations has been growing these last few years. Standard implementations of libraries that deal with all features of XML or a significant part of them are available in many languages, but as far as we know there was no implementation in Fortran.

One could of course use, say, the well-known Expat library by ... and provide a Fortran interface, but this is slightly awkward as it requires a compatible C compiler. More importantly, it introduces platform dependencies, because the interfacing between Fortran and C depends strongly on the compilers used. It also introduces a way of working that is alien to Fortran programmers: Expat requires the programmer to register a callback function, to be called when some "event" occurs while reading the file (a begin tag is found, character data are found and so on).

The alternative is even more awkward: build a tree of tags and associated data and ask for these data. To a Fortran programmer, one of the first things they will want to do with an XML-file is to get all the information out - so a stream-oriented parsing method is more appropriate.

Among the two predominant types of XML-parsing, SAX or stream-oriented parsing and DOM or object-oriented parsing, the stream-oriented approach is more suitable to the frame of mind of the average Fortran programmer. But instead of registering callbacks, this module uses the method known from, for instance, GNU's getopt() function: parse the data and return to the caller to have it process the information. The caller calls the function again and again, letting getopt() take care of the details.

This is exactly the approach taken by the xmlparse module:

    call xml_open(info, ... )
    do while ( xml_ok(info) )
        call xml_get(info, ... )   ! Get the first/next tag
        ... identify the tag (via xml_check_tag for instance)
        ... process the information
    enddo
    call xml_close(info)
    ... proceed with the rest of the program

For convenience, the module does supply the routine xml_process that takes three user-defined subroutines to perform the actual processing. The file will be processed in its entirety.

PARAMETERS AND DERIVED TYPES

The module defines several parameters and derived types for use by the programmer:

XML_BUFFER_LENGTH: the length of the internal buffer, representing the maximum length of any individual line in an XML file and the maximum length for a tag including all its attributes.
XML_STDOUT: a parameter to indicate the standard output (or *) as the file to write messages to.
type(XML_PARSE): the data structure that holds information about the XML file to be read or written. Its contents are partially accessible via functions such as xml_ok() and xml_error(). Note: do not use its contents directly, as these may change in the future.

GENERATING A READING ROUTINE

Reading an XML file and making sure the data are structured the way they are supposed to be generally requires a lot of code. This can not be avoided: you will want to make sure everything you need is there and anything else is dealt with appropriately.

There is a way out: by automatically generating the reading routine you can reduce the amount of manual coding to a minimum. This has two advantages:

It is much less work to define the data and their place in an XML file than it is to encode the reading routine.
It is much less error-prone: the logic is generated for you, so you need much less testing.

The idea is simple:

In an XML-file you define the data structure and the way this data structure should appear in an input XML file for your program. The process is probably best explained via an example.

Say, you want to read addresses (a classical example). Each address consists of the name of the person, street name and the number of the house, city (let us keep it simple). Of course we have multiple addresses, so they are stored in an array. Then via the xmlreader program you can generate a reading routine that deals with this type of information.

The program takes an XML file as input and produces a Fortran 90 module that reads input files and stores the data in the designated variables. It also creates a writing routine to write the data to an XML file.

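In our case, we want a derived type to hold the various pieces that form a complete address and we want an array of that type. A definition along these lines (using the typedef, component and variable tags described below):

    <typedef name="address_type">
        <component name="person" type="character" length="40" />
        <component name="street" type="character" length="40" />
        <component name="number" type="integer" />
        <component name="city" type="character" length="40" />
    </typedef>
    <variable name="address" type="address_type" dimension="1" />

will produce the following derived type: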

    type address_type
        character(len=40) :: person
        character(len=40) :: street
        integer           :: number
        character(len=40) :: city
    end type address_type

and a variable "address":

type(address_type), dimension(:), pointer :: address

The reading routine will be able to read such XML files as the following:

    <address>
        <person>John Doe</person>
        <street>Wherever street</street>
        <number>30</number>
        <city>Erewhon</city>
    </address>
    <address>
        ...
    </address>
    ...

If in some address the number was forgotten, the reading routine will report this, as by default all variables and components in a derived type must be present.

Here is a more detailed description of the XML files accepted by the xmlreader program:

Use the comment tag to insert comments in the input file to the xmlreader program (or in the input to the resulting reading routines).
The options tag can be used to influence the generated code:
- The attribute "strict" determines whether unknown tags are regarded as an error (strict="yes") or not (strict="no", the default).
- The attribute "globaltype" is used to indicate that all variables should belong to a single derived type, whose name defaults to the name of the file. Use the "typename" attribute to set the name to a different value.
If you want to group tags for several variables, but you do not want to introduce a special derived type, you can do so with the placeholder tag. Its effect is to require an additional pair of opening and closing tags around the data. Any tags defined between the opening and closing placeholder tags will have to appear within the corresponding tags in the input file for the resulting program.

    <placeholder tag="grid">
        <variable x ...>
        <variable y ...>
    </placeholder>
variable tags correspond directly to module variables. They are used to declare these variables and to generate the code that will read them.

Variable tags can appear anywhere except within a type definition. Variables can be of a previously defined derived type or of a primitive type.
<variable name="x" type="integer" default="1" />
Variables can have a number of attributes:
- Required attributes:
  
  name - the name of the variable in the actual program
  
  type - the type of the variable
  
  length - for character types only, the length of the string
- Optional attributes:
  
  default - the default value to be used if information is missing
  
  dimension - the number of dimensions (up to 3), gives rise to a pointer component
  
  shape - the fixed size of an array, if this is present, the number of dimensions is taken from this attribute.
  
  tag - the name of the tag that holds the data (default to the name of the variable)
- Basic types for the variables include:
  
  integer - a single integer value
  
  integer-array - a one-dimensional array of integer values (the values must appear between an opening and ending tag)
  
  real - a single-precision real value
  
  real-array - a one-dimensional array of real values (the values must appear between an opening and ending tag)
  
  double - a double-precision real value
  
  double-array - a one-dimensional array of double-precision values (the values must appear between an opening and ending tag)
  
  logical - a single logical value (represented as "T" or "F")
  
  logical-array - a one-dimensional array of logical values (the values must appear between an opening and ending tag)
  
  word - a character string as can be read via list-directed input (if it should contain spaces, surround it with single or double quotes)
  
  word-array - a one-dimensional array of strings (the values must appear between an opening and ending tag)
  
  line - a character string as can be read from a single line of text (via the '(A)' format)
  
  line-array - a one-dimensional array of strings, read as individual lines between the opening and closing tag
  
  character - a character string (synonym for "line")
  
  character-array - a one-dimensional array of character strings, synonym for line-array
Type definitions (typedef) allow the xmlreader program to define the derived types that you want to use in your reading routine.

The typedef tag may only contain component tags. These are synonymous with variable tags and are subject to the same restrictions.

Future versions may also include options for:

Adding code to handle certain data in a particular way
Version checking (so that an input file is explicitly identified as being of a particular version of the software)

EXAMPLES

The directory "examples" contains some example programs.

The tst_grid program demonstrates how to create a reader for an array of "grids", each consisting of two integers.
The tst_menu program uses a more elaborate structure, a menubar with menus and each menu having an array of items. Items in a menu can have a submenu. This leads to an XML file with multiple hierarchical layers.
The tst_process program uses the xml_process routine to read in an XML file (a "docbook" file) and turn it into an HTML file for viewing.

LIMITATIONS

Basic limitations:

The lines in the XML-file should not exceed 1000 characters. For tags that span more than one line, the limit holds for all the lines together (without leading or trailing blanks).
There is no support for DTDs or namespaces, XSLT, XPath and other more advanced features around XML.
There is currently no support for the object-oriented approach. It is up to the application to store the information that is needed, while the parsing is going on.
No support (yet) for a single quote as delimiter.
No support (yet) for conversion of escape sequences (&gt; for instance).
The parser may not handle malformed XML-files properly.
The parser does not (yet) handle different line-endings properly (that is: reading XML-files that were written under MS Windows in a UNIX or Linux environment)

RELEASE NOTES

This document belongs to version 1.00 of the module.

History:

version 0.1: Proof of concept, August 2003

A very preliminary version meant to show that it is indeed possible to read and write XML files using Fortran only. It was published on the comp.lang.fortran newsgroup and generated enough interest to encourage further development.

version 0.2: First public release, August 2003

After some additional testing with practical XML-files, a number of bugs were found and solved, several enhancements were made:

Handling attributes (especially when tags span more than one line and correctly handling the case that too many attributes are present).
Options for parsing and error handling added, as well as functions to check the status.
Revision of the API, for more uniform names (prefix: xml_)
Setting up the documentation (this document in particular)

version 0.3: Improvements, September 2003

Added the function xml_error()
Implemented the report options
Corrected a bug in xml_close (causing an infinite loop in the test program).
Revised the test program to run through a number of test files.

version 0.4: Corrected xml_put(), October 2003

Adjusted the interface and implementation of the subroutine xml_put(). It will now produce correct and reasonable-looking XML files.
Added a test program, tstwrite.f90, for this.

version 0.9: Added new approach, October 2005

Changes to the interface and implementation of the subroutine xml_put(), from a patch by cinonet.
Added a program, xmlreader, to generate complete reading routines for particular XML files (cf. GENERATING A READING ROUTINE).

version 0.94: Gradually expanding the capabilities, June 2006

Added a routine xml_process that enables you to use an event-based approach like in the famous Expat library.
Added the option strict and the tag placeholder.
Corrected a number of bugs associated with the xmlreader program

version 0.97: Added the following capabilities to the xmlreader program since 0.94, June 2007

Support for the shape option
Defaults for both components of a derived type and for independent variables.
The generated reading routine takes care of elements that have attributes and character data now. The character data is treated as if it were an attribute with the name "value"
Several bugs corrected in the xmlreader program

version 1.00: Added the following capabilities to the xmlreader program since 0.97, April 2008

Generate a writing routine to write the data to an XML file.

The project now also contains a first version of a program to convert an XSD file to a file accepted by the xmlreader program. This is called "xsdconvert".

TO DO

The following items remain on the "to do" list:

Adding checks for truncation of strings (attribute names/values too long, data lines too long; now only the number is checked).
Documenting details about structures and parameters that may be of interest.

KEYWORDS

Fortran, XML, parsing