XMLPARSE(n) 0.9 "XML"
XMLPARSE - Parser for XML files in Fortran
TABLE OF CONTENTS
SYNOPSIS
DESCRIPTION
PROCEDURES
MOTIVATION
PARAMETERS AND DERIVED TYPES
GENERATING A READING ROUTINE
EXAMPLES
LIMITATIONS
RELEASE NOTES
TO DO
KEYWORDS
|
subroutine xml_open( info, filename, mustread )
subroutine xml_close( info )
subroutine xml_options( info, ... )
subroutine xml_get( info, tag, endtag, attribs, no_attribs, data, no_data )
subroutine xml_put( info, tag, attribs, no_attribs, data, no_data, type )
logical function xml_ok( info )
logical function xml_error( info )
logical function xml_data_trunc( info )
integer function xml_find_attrib( attribs, no_attribs, name, value )
subroutine read_xml_file_xxx( filename, lurep, error )
subroutine xml_process( filename, attribs, data, startfunc, datafunc, endfunc, lurep, error )
|
The XML parser provided by this module has been written entirely in
Fortran, making it possible to read and write XML files without the need
to use mixed-language programming techniques.
It should be noted that the implementation has a number of limitations
(cf. the section Limitations). The module has the following features:
-
Reading an XML-file (within certain limitations) in a stream-oriented
manner.
-
Writing an XML-file in a stream-oriented manner.
-
Creating a reading routine that will fill a data structure. The data
structure is described via an XML file and all necessary code to read
files that conform to that structure is generated.
The module has been implemented in standard Fortran 90. The intention is
to make it compilable by the F compiler as well, so that
it can be used with a wide range of Fortran compilers.
(It should even be possible to convert the parsing routines to an
equivalent library in FORTRAN 77, though with the availability of
several free Fortran 95 compilers, there seems little need for that.)
The module defines the following public routines and functions:
- subroutine xml_open( info, filename, mustread )
-
Open an XML-file and fill the structure info, so that it can be
used to refer to the opened file.
To check whether the file was opened successfully (it may, for instance,
not exist or not be accessible), use the function xml_error().
Arguments:
info - TYPE(XML_PARSE) structure used to identify the file
filename - CHARACTER(LEN=*) name of the file to be opened
mustread - LOGICAL whether to read the file (.true.) or to write to it (.false.)
- subroutine xml_close( info )
-
Close an opened XML-file. If the file was not opened, this routine has
no effect.
info - TYPE(XML_PARSE) structure used to identify the file
- subroutine xml_options( info, ... )
-
Set one or more options. These are all defined as optional arguments, so
that the name=value convention can be used to select an option
and to set its value. The first argument is fixed:
info - TYPE(XML_PARSE) structure used to identify the file
All other arguments are optional and include:
ignore_whitespace - LOGICAL compress the array of strings (remove
empty lines and remove leading blanks) for easier processing
no_data_truncation - LOGICAL if data truncation occurs (too many
lines of data or too many attributes, so that they can not all be stored
in the arrays), this can be marked as an error or not. If the option is
set to true, it is considered an error.
report_lun - INTEGER LU-number of a file to which messages can be
logged (use XML_STDOUT for output to screen)
report_errors - LOGICAL write error messages to the report
report_details - LOGICAL write detailed messages to the report,
useful for debugging
Note that these options are off by default. They should be set
after the file has been opened. The reporting options are the exception:
they can be set before an XML file has been opened, because they hold
globally (that is, they are in effect for all reading and writing,
independent of the files).
- subroutine xml_get( info, tag, endtag, attribs, no_attribs, data, no_data )
-
Read the current tag in the file up to the next one or the end-of-file.
Store the attributes in the given array and do the same for the
character data that may be present after the tag.
info - TYPE(XML_PARSE) structure used to identify the file
tag - CHARACTER(LEN=*) string that will hold the tag's name
endtag - LOGICAL indicates whether the current tag has ended or
not
attribs - CHARACTER(LEN=*), DIMENSION(:,:) array of strings that
will hold the attributes given to the tag
no_attribs - INTEGER number of attributes that were found
data - CHARACTER(LEN=*), DIMENSION(:) array of strings that
will hold the character data (one element per line)
no_data - INTEGER number of lines of character data
Note:
If an error occurs or end-of-file is found, then use the functions
xml_ok() and xml_error() to find out the conditions.
- subroutine xml_put( info, tag, attribs, no_attribs, data, no_data, type )
-
Write the information for the current tag to the file. This subroutine
is the inverse, so to speak, of the subroutine xml_get that
parses the XML input.
For a description of the arguments, other than type: see above.
type - CHARACTER(LEN=*) string having one of the following values:
-
'open' - Write an opening tag with attributes and data (if there
are any). Useful for creating a hierarchy of tags.
-
'close' - Write a closing tag
-
'elem' - Write the element data
- logical function xml_ok( info )
-
Returns whether the parser is still okay (no read errors or
end-of-file).
info - TYPE(XML_PARSE) structure used to identify the file
- logical function xml_error( info )
-
Returns whether the parser has encountered some error (see also the
options).
info - TYPE(XML_PARSE) structure used to identify the file
- logical function xml_data_trunc( info )
-
Returns whether the parser has had to truncate the data or the
attributes.
info - TYPE(XML_PARSE) structure used to identify the file
- integer function xml_find_attrib( attribs, no_attribs, name, value )
-
Convenience function that searches the list of attributes and returns
the index of the sought attribute in the array or -1 if not present.
In that case the argument value is not set, so that you can use
this to supply a default.
attribs - CHARACTER(LEN=*), DIMENSION(:,:) array of strings that
hold the attributes
no_attribs - INTEGER number of attributes that were found
name - CHARACTER(LEN=*) name of the attribute to be found
value - CHARACTER(LEN=*) actual or default value of the attribute
upon return
- subroutine read_xml_file_xxx( filename, lurep, error )
-
Subroutine generated via the method described below to read an XML
file of a particular structure.
filename - CHARACTER(LEN=*) name of the XML file to read
lurep - INTEGER LU-number to use for reporting errors (use 0 to
write to the screen; optional)
error - LOGICAL variable that indicates if an error occurred
while reading (optional).
- subroutine xml_process( filename, attribs, data, startfunc, datafunc, endfunc, lurep, error )
-
Subroutine that reads the XML file and calls three user-defined
subroutines to take care of the actual processing. This is a
routine that implements the so-called SAX approach.
filename - CHARACTER(LEN=*) name of the XML file to read
attribs - CHARACTER(LEN=*), DIMENSION(:,:) work array to store the
attributes
data - CHARACTER(LEN=*), DIMENSION(:) work array to store the
character data associated with a tag
startfunc - Subroutine that is called to handle the start
of a tag:
|
subroutine startfunc( tag, attribs, error )
character(len=*) :: tag
character(len=*), dimension(:,:) :: attribs
logical :: error
|
If the argument error is set to true (because the tag was unexpected or
something similar), the reading is interrupted and the routine returns.
Only the fact that something was wrong is recorded. You need to use
other means to convey more information if that is needed.
datafunc - Subroutine that is called to handle the character data
associated with a tag:
|
subroutine datafunc( tag, data, error )
character(len=*) :: tag
character(len=*), dimension(:) :: data
logical :: error
|
endfunc - Subroutine that is called to handle the end
of a tag:
|
subroutine endfunc( tag, error )
character(len=*) :: tag
logical :: error
|
lurep - INTEGER LU-number to use for reporting errors (use 0 to
write to the screen; optional)
error - LOGICAL variable that indicates if an error occurred
while reading (optional).
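As an illustration of the three values of the type argument of xml_put,
the following sketch writes a small file. The file name, the tag names
and the string lengths are made up for the example. It also assumes that
attribs(1,i) holds the name and attribs(2,i) the value of the i-th
attribute, which the description above does not spell out.

```fortran
program write_sketch
   use xmlparse
   implicit none

   type(XML_PARSE)                   :: info
   character(len=40), dimension(2,1) :: attribs
   character(len=40), dimension(1)   :: data

   attribs = ' '
   data    = ' '

   ! .false.: the file is opened for writing
   call xml_open( info, 'addresses.xml', .false. )
   if ( xml_error(info) ) then
      write(*,*) 'Could not open addresses.xml for writing'
      stop
   endif

   ! 'open': opening tag "address" with one attribute and no data
   ! (assumed layout: row 1 = attribute name, row 2 = attribute value)
   attribs(1,1) = 'id'
   attribs(2,1) = '1'
   call xml_put( info, 'address', attribs, 1, data, 0, 'open' )

   ! 'elem': a complete element "person" with one line of character data
   data(1) = 'John Doe'
   call xml_put( info, 'person', attribs, 0, data, 1, 'elem' )

   ! 'close': closing tag that matches the 'open' call above
   call xml_put( info, 'address', attribs, 0, data, 0, 'close' )

   call xml_close( info )
end program write_sketch
```

The 'open' and 'close' calls bracket the nested element, which is how a
hierarchy of tags is built up.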
The use of XML-files as a means to store data and more importantly to
transfer data between very disparate applications and organisations has
been growing these last few years. Standard implementations of libraries
that deal with all features of XML or a significant part of them are
available in many languages, but as far as we know there was no
implementation in Fortran.
One could of course use, say, the well-known Expat library by ... and
provide a Fortran interface, but this is slightly awkward as it forces
one to have a compatible C compiler. More importantly, it introduces
platform dependencies, because the interfacing between Fortran and C
depends strongly on the compilers used. It also introduces a way of
working that is alien to Fortran programmers: Expat requires the
programmer to register a callback function, to be called when some
"event" occurs while reading the file (a begin tag is found, character
data are found and so on).
The alternative is even more awkward: build a tree of tags and
associated data and ask for these data. To a Fortran programmer, one of
the first things they will want to do with an XML-file is to get all the
information out - so a stream-oriented parsing method is more
appropriate.
Among the two predominant types of XML-parsing, SAX or stream-oriented
parsing and DOM or object-oriented parsing, the stream-oriented approach
is more suitable to the frame of mind of the average Fortran programmer.
But instead of registering callbacks, this module uses the method known
from, for instance, GNU's getopt() function: parse the data and return
to the caller to have it process the information. The caller calls the
function again and again, letting getopt() take care of the details.
This is exactly the approach taken by the xmlparse module:
|
call xml_open(info, ... )
do while ( xml_ok(info) )
call xml_get(info, ... ) ! Get the first/next tag
... identify the tag (via xml_check_tag for instance)
... process the information
enddo
call xml_close(info)
... proceed with the rest of the program
|
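A more complete version of this loop might look as follows. It is only a
sketch: the file name, the array sizes and the attribute name "id" are
made up, and it assumes that attribs(1,i) holds the name and attribs(2,i)
the value of the i-th attribute.

```fortran
program read_sketch
   use xmlparse
   implicit none

   type(XML_PARSE)                    :: info
   character(len=80)                  :: tag
   logical                            :: endtag
   character(len=80), dimension(2,10) :: attribs
   integer                            :: no_attribs
   character(len=200), dimension(100) :: data
   integer                            :: no_data
   character(len=80)                  :: id
   integer                            :: i, idx

   ! .true.: the file is opened for reading
   call xml_open( info, 'example.xml', .true. )
   if ( xml_error(info) ) then
      write(*,*) 'Could not open example.xml'
      stop
   endif
   call xml_options( info, ignore_whitespace = .true. )

   do while ( xml_ok(info) )
      call xml_get( info, tag, endtag, attribs, no_attribs, data, no_data )
      if ( xml_error(info) ) then
         write(*,*) 'Error while reading example.xml'
         exit
      endif

      write(*,*) 'Tag: ', trim(tag), ' - end tag? ', endtag

      ! Look up the (hypothetical) attribute "id"; if it is absent,
      ! idx is -1 and the default stored in id is left untouched
      id  = 'none'
      idx = xml_find_attrib( attribs, no_attribs, 'id', id )
      write(*,*) '   id = ', trim(id)

      do i = 1, no_data
         write(*,*) '   data: ', trim(data(i))
      enddo
   enddo

   call xml_close( info )
end program read_sketch
```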
For convenience, the module does supply the routine xml_process
that takes three user-defined subroutines to perform the actual
processing. The file will be processed in its entirety.
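A minimal sketch of the xml_process approach is given here. The module
name, the file name and the array sizes are made up. The three handlers
follow the interfaces documented for startfunc, datafunc and endfunc, and
the sketch assumes that datafunc receives only the data lines that belong
to the tag (blank lines are skipped to be on the safe side).

```fortran
module handlers_sketch
   implicit none
   integer :: depth = 0      ! current nesting level, used for indentation
contains
   subroutine startfunc( tag, attribs, error )
      character(len=*)                 :: tag
      character(len=*), dimension(:,:) :: attribs
      logical                          :: error

      write(*,*) repeat(' ', 2*depth), 'start: ', trim(tag)
      depth = depth + 1
      error = .false.
   end subroutine startfunc

   subroutine datafunc( tag, data, error )
      character(len=*)               :: tag
      character(len=*), dimension(:) :: data
      logical                        :: error
      integer                        :: i

      do i = 1, size(data)
         if ( data(i) /= ' ' ) then
            write(*,*) repeat(' ', 2*depth), trim(data(i))
         endif
      enddo
      error = .false.
   end subroutine datafunc

   subroutine endfunc( tag, error )
      character(len=*) :: tag
      logical          :: error

      depth = max( 0, depth - 1 )
      write(*,*) repeat(' ', 2*depth), 'end:   ', trim(tag)
      error = .false.
   end subroutine endfunc
end module handlers_sketch

program process_sketch
   use xmlparse
   use handlers_sketch
   implicit none

   character(len=80), dimension(2,20) :: attribs   ! work array for attributes
   character(len=200), dimension(100) :: data      ! work array for character data
   logical                            :: error

   ! 0: report errors to the screen
   call xml_process( 'example.xml', attribs, data, &
                     startfunc, datafunc, endfunc, 0, error )
   if ( error ) write(*,*) 'Reading example.xml failed'
end program process_sketch
```

Setting error to .true. in any of the handlers would interrupt the
reading. The handlers keep their own state (here the module variable
depth), because xml_process only records that something went wrong.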
The module defines several parameters and derived types for use by the
programmer:
- XML_BUFFER_LENGTH
-
the length of the internal buffer, representing
the maximum length of any individual line in an XML file and the maximum
length for a tag including all its attributes.
- XML_STDOUT
-
a parameter to indicate the standard output (or *) as the file to
write messages to.
- type(XML_PARSE)
-
the data structure that holds information about
the XML file to be read or written. Its contents are partially
accessible via functions such as XML_OK() and XML_ERROR().
Note: do not use its contents directly, as these may change in
future versions.
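The sketch below shows one way these parameters can be put to use; the
file name and the array sizes are made up. Since no line or tag can be
longer than XML_BUFFER_LENGTH, strings of that length should always be
long enough, so that only the number of array elements can turn out to
be too small. That condition is detected via xml_data_trunc().

```fortran
program trunc_sketch
   use xmlparse
   implicit none

   type(XML_PARSE)                                  :: info
   character(len=XML_BUFFER_LENGTH)                 :: tag
   logical                                          :: endtag
   character(len=XML_BUFFER_LENGTH), dimension(2,5) :: attribs  ! room for 5 attributes
   integer                                          :: no_attribs
   character(len=XML_BUFFER_LENGTH), dimension(20)  :: data     ! room for 20 lines
   integer                                          :: no_data

   call xml_open( info, 'example.xml', .true. )
   if ( xml_error(info) ) then
      write(*,*) 'Could not open example.xml'
      stop
   endif

   ! Send error messages to standard output
   call xml_options( info, report_lun = XML_STDOUT, report_errors = .true. )

   do while ( xml_ok(info) )
      call xml_get( info, tag, endtag, attribs, no_attribs, data, no_data )
      if ( xml_error(info) ) exit

      ! no_data_truncation is off by default, so truncation is not an
      ! error; it has to be checked for explicitly
      if ( xml_data_trunc(info) ) then
         write(*,*) 'Tag ', trim(tag), &
                    ': more attributes or data lines than the arrays can hold'
      endif
   enddo

   call xml_close( info )
end program trunc_sketch
```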
Reading an XML file and making sure the data are structured the way
they are supposed to be generally requires a lot of code. This cannot be
avoided: you will want to make sure everything you need is there and
anything else is dealt with appropriately.
There is a way out: by automatically generating the reading routine
you can reduce the amount of manual coding to a minimum. This has two
advantages:
-
It is much less work to define the data and their place in an XML file
than it is to encode the reading routine.
-
It is much less error-prone: the logic is generated for you, so
you need much less testing.
The idea is simple:
In an XML-file you define the data structure and the way this data
structure should appear in an input XML file for your program.
The process is probably best explained via an example.
Say, you want to read addresses (a classical example). Each address
consists of the name of the person, street name and the number of
the house, city (let us keep it simple). Of course we have multiple
addresses, so they are stored in an array. Then via the
xmlreader program you can generate a reading routine that
deals with this type of information.
The program takes an XML file as input and produces a Fortran 90 module
that reads input files and stores the data in the designated variables.
It also creates a writing routine to write the data to an XML file.
In our case, we want a derived type to hold the various pieces
that form a complete address and we want an array of that type:
|
<typedef name="address_type">
<component name="person" type="character" length="40" />
<component name="street" type="character" length="40" />
<component name="number" type="integer" />
<component name="city" type="character" length="40" />
</typedef>
<variable name="address" type="address_type" dimension="1" />
|
This will produce the following derived type:
|
type address_type
character(len=40) :: person
character(len=40) :: street
integer :: number
character(len=40) :: city
end type address_type
|
and a variable "address":
|
type(address_type), dimension(:), pointer :: address
|
The reading routine will be able to read such XML files as the
following:
|
<address>
<person>John Doe</person>
<street>Wherever street</street>
<number>30</number>
<city>Erewhon</city>
</address>
<address>
...
</address>
...
|
If in some address the number was forgotten, the reading routine will
report this, as by default all variables and components in a derived
type must be present.
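A program that uses the generated code could look like the sketch below.
The names xml_data_addresses and read_xml_file_addresses are assumptions
made for the example: the actual names of the generated module and of the
routine (the "xxx" in read_xml_file_xxx) depend on how xmlreader was run.
The file name is made up as well.

```fortran
program address_sketch
   use xml_data_addresses     ! assumed name of the generated module
   implicit none

   logical :: error
   integer :: i

   ! 0: report errors to the screen
   call read_xml_file_addresses( 'addresses.xml', 0, error )
   if ( error ) then
      write(*,*) 'The file addresses.xml could not be read correctly'
      stop
   endif

   ! The module variable "address" now holds one element per address tag
   do i = 1, size(address)
      write(*,*) trim(address(i)%person), ', ', &
                 trim(address(i)%street), address(i)%number, ', ', &
                 trim(address(i)%city)
   enddo
end program address_sketch
```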
Here is a more detailed description of the XML files accepted by the
xmlreader program:
-
Use the comment tag to insert comments in the input file for the
xmlreader program (or in the input to the resulting reading routines).
-
The options tag can be used to influence the generated code:
-
The attribute "strict" determines whether unknown tags are
regarded as an error (strict="yes") or not (strict="no",
the default).
-
The attribute "globaltype" is used to indicate that all variables should
belong to a single derived type, whose name defaults to the name of the
file. Use the "typename" attribute to set the name to a different value.
-
If you want to group tags for several variables, but you do not
want to introduce a special derived type, you can do so with the
placeholder tag. Its effect is to require an additional pair of
opening and closing tags around the data: any tags defined between the
opening and closing placeholder tags must appear within the
corresponding tags in the input file for the resulting program.
|
<placeholder tag="grid">
<variable x ...>
<variable y ...>
</placeholder>
|
-
variable tags correspond directly to module variables.
They are used to declare these variables and to generate the code that will
read them.
Variable tags can appear anywhere except within a type definition.
Variables can be of a previously defined derived type or of a
primitive type.
|
<variable name="x" type="integer" default="1" />
|
Variables can have a number of attributes:
-
Required attributes:
name - the name of the variable in the actual program
type - the type of the variable
length - for character types only, the length of the string
-
Optional attributes:
default - the default value to be used if information is missing
dimension - the number of dimensions (up to 3), gives rise to a
pointer component
shape - the fixed size of an array, if this is present, the
number of dimensions is taken from this attribute.
tag - the name of the tag that holds the data (defaults to
the name of the variable)
-
Basic types for the variables include:
integer - a single integer value
integer-array - a one-dimensional array of integer values (the
values must appear between an opening and ending tag)
real - a single-precision real value
real-array - a one-dimensional array of real values (the
values must appear between an opening and ending tag)
double - a double-precision real value
double-array - a one-dimensional array of double-precision values
(the values must appear between an opening and ending tag)
logical - a single logical value (represented as "T" or "F")
logical-array - a one-dimensional array of logical values
(the values must appear between an opening and ending tag)
word - a character string as can be read via list-directed input
(if it should contain spaces, surround it with single or double quotes)
word-array - a one-dimensional array of strings
(the values must appear between an opening and ending tag)
line - a character string as can be read from a single line
of text (via the '(A)' format)
line-array - a one-dimensional array of strings, read as
individual lines between the opening and closing tag
character - a character string (synonym for "line")
character-array - a one-dimensional array of character strings,
synonym for line-array
-
Type definitions (typedef) allow the xmlreader program to
define the derived types that you want to use in your reader.
The typedef tag may only contain component tags. They
are synonymous with variable tags and have the same restrictions.
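The difference between the basic types "word" and "line" comes down to
the two ways of reading a string in Fortran. The following self-contained
sketch illustrates this with internal reads; the module and routine names
are made up and the code is not part of the xmlparse module.

```fortran
module word_line_sketch
   implicit none
contains
   ! Read the text the way a "word" is read: list-directed input,
   ! which stops at the first blank unless the string is quoted
   subroutine read_as_word( text, value )
      character(len=*), intent(in)  :: text
      character(len=*), intent(out) :: value

      read( text, * ) value
   end subroutine read_as_word

   ! Read the text the way a "line" is read: the '(A)' format,
   ! which takes the text as it is, blanks and quotes included
   subroutine read_as_line( text, value )
      character(len=*), intent(in)  :: text
      character(len=*), intent(out) :: value

      read( text, '(a)' ) value
   end subroutine read_as_line
end module word_line_sketch
```

For the text Wherever street, read_as_word returns Wherever, whereas
read_as_line returns Wherever street. To get both words via
read_as_word, the text has to be quoted: "Wherever street".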
Future versions may also include options for:
-
Adding code to handle certain data in a particular way
-
Version checking (so that an input file is explicitly identified
as being of a particular version of the software)
The directory "examples" contains some example programs.
-
The tst_grid program demonstrates how to create a reader
for an array of "grids", each consisting of two integers.
-
The tst_menu program uses a more elaborate structure,
a menubar with menus and each menu having an array of items.
Items in a menu can have a submenu. This leads to an XML file with
multiple hierarchical layers.
-
The tst_process program uses the xml_process routine to
read in an XML file (a "docbook" file) and turn it into an HTML file for
viewing.
Basic limitations:
-
The lines in the XML-file should not exceed 1000 characters. For tags
that span more than one line, the limit holds for all the lines together
(without leading or trailing blanks).
-
There is no support for DTDs or namespaces, XSLT, XPath and
other more advanced features around XML.
-
There is currently no support for the object-oriented approach. It is up
to the application to store the information that is needed, while the
parsing is going on.
-
No support (yet) for a single quote as delimiter
-
No support (yet) for conversion of escape sequences (&gt;, for instance)
-
The parser may not handle malformed XML-files properly
-
The parser does not (yet) handle different line-endings properly (that
is: reading XML-files that were written under MS Windows in a UNIX or
Linux environment)
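As long as escape sequences are not converted by the parser, an
application can convert them itself in the strings it gets back. The
sketch below is one possible way to do that for the five predefined XML
entities. The module and function names are made up and the code is not
part of the xmlparse module.

```fortran
module xml_unescape_sketch
   implicit none
   private
   public :: unescape
contains
   ! Return a copy of text in which the five predefined XML entities have
   ! been replaced by the characters they stand for. The result has the
   ! same length as text and is padded with blanks.
   function unescape( text ) result( plain )
      character(len=*), intent(in) :: text
      character(len=len(text))     :: plain

      character(len=6), dimension(5), parameter :: entity = &
         (/ '&lt;  ', '&gt;  ', '&quot;', '&apos;', '&amp; ' /)
      character(len=1), dimension(5), parameter :: symbol = &
         (/ '<', '>', '"', "'", '&' /)
      integer :: i, k, n, pos

      plain = ' '
      i     = 1          ! current position in text
      pos   = 0          ! last position filled in plain
      do while ( i <= len(text) )
         n = 0           ! length of the entity found at position i, if any
         if ( text(i:i) == '&' ) then
            do k = 1, size(entity)
               n = len_trim( entity(k) )
               if ( i+n-1 <= len(text) ) then
                  if ( text(i:i+n-1) == entity(k)(1:n) ) exit
               endif
               n = 0
            enddo
         endif

         pos = pos + 1
         if ( n > 0 ) then
            plain(pos:pos) = symbol(k)     ! entity found: store its character
            i = i + n
         else
            plain(pos:pos) = text(i:i)     ! ordinary character: copy it
            i = i + 1
         endif
      enddo
   end function unescape
end module xml_unescape_sketch
```

The text is scanned in a single pass, so that a character produced by a
conversion is never interpreted again: the sequence &amp;lt; becomes the
literal text &lt; and not the character <.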
This document belongs to version 1.00 of the module.
History:
version 0.1: Proof of concept, August 2003
A very preliminary version meant to show that it is indeed possible to
read and write XML files using Fortran only. It was published on the
comp.lang.fortran newsgroup and generated enough interest to encourage
further development.
version 0.2: First public release, August 2003
After some additional testing with practical XML-files, a number of bugs
were found and solved, several enhancements were made:
-
Handling attributes (especially when tags span more than one line and
correctly handling the case that too many attributes are present).
-
Options for parsing and error handling added, as well as functions to
check the status.
-
Revision of the API, for more uniform names (prefix: xml_)
-
Setting up the documentation (this document in particular)
version 0.3: Improvements, September 2003
-
Added the function xml_error()
-
Implemented the report options
-
Corrected a bug in xml_close (causing an infinite loop in the
test program).
-
Revised the test program to run through a number of test
files.
version 0.4: Corrected xml_put(), October 2003
-
Adjusted the interface and implementation of the subroutine xml_put().
It will now produce correct and reasonable-looking XML files.
-
Added a test program, tstwrite.f90, for this.
version 0.9: Added new approach, October 2005
-
Changes to the interface and implementation of the subroutine xml_put(),
from a patch by cinonet.
-
Added a program, xmlreader, to generate complete reading routines for
particular XML files (cf. GENERATING A READING ROUTINE)
version 0.94: Gradually expanding the capabilities, June 2006
-
Added a routine xml_process that enables you to use an
event-based approach like in the famous Expat library.
-
Added the option strict and the tag placeholder.
-
Corrected a number of bugs associated with the xmlreader program
version 0.97: Added the following capabilities to the
xmlreader program since 0.94, June 2007
-
Support for the shape option
-
Defaults for both components of a derived type and for
independent variables.
-
The generated reading routine takes care of elements that have
attributes and character data now. The character data is treated as if
it were an attribute with the name "value"
-
Several bugs corrected in the xmlreader program
version 1.00: Added the following capabilities to the
xmlreader program since 0.97, April 2008
-
Write a writing routine to write the data to an XML file
The project now also contains a first version of a program to convert an
XSD file to a file accepted by the xmlreader program. This is called
"xsdconvert".
The following items remain on the "to do" list:
-
Adding checks for truncation of strings (attribute names/values too
long, data lines too long; now only the number is checked).
-
Documenting details about structures and parameters that may be of
interest.
Fortran, XML, parsing