dtd.pl

dtd.pl is a Perl library that parses an SGML document type definition (DTD) and builds up data structures containing the structural content of the DTD.


Audience

I assume the reader knows about the scope of packages and how to access variables/subroutines defined in packages. If not, refer to perl(1) or any book on Perl. The reader should be familiar with the basic concepts of SGML.

Unless stated otherwise, all Perl variables mentioned are within the scope of package dtd.


Usage

If installed correctly, the following Perl statement can be used to access the dtd library routines:

require "dtd/dtd.pl";

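The following routines are defined and described below:

DTDread_dtd
DTDread_mapfile
DTDget_elements
DTDget_top_elements
DTDget_elem_attr
DTDget_parents
DTDis_attr_keyword
DTDis_elem_keyword
DTDprint_tree
DTDreset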


Routine Descriptions

DTDread_dtd

&'DTDread_dtd(FILEHANDLE);

DTDread_dtd parses the SGML DTD specified by FILEHANDLE. Parsing of the DTD stops once the end of the file is reached. Any external entity references will be parsed if an entity to filename mapping exists (see DTDread_mapfile).

DTDread_dtd makes the following assumptions when parsing a DTD:

After DTDread_dtd is finished, the following associative arrays are filled (remember, all the arrays are within the scope of package dtd):

%ParEntity
Keys: Non-external parameter entities.
Values: Replacement value.
%PubParEntity
Keys: External public parameter entities (PUBLIC).
Values: Entity identifier, if defined.
%SysParEntity
Keys: External system parameter entities (SYSTEM).
Values: Entity identifier, if defined.
%ElemCont
Keys: Element names.
Values: Base content declaration of the element.
%ElemInc
Keys: Element names.
Values: Inclusion set declarations.
%ElemExc
Keys: Element names.
Values: Exclusion set declarations.
%ElemTag
Keys: Element names.
Values: Omitted tag minimization.
%Attribute
Keys: Element names.
Values: Attributes for elements.

To access the data stored in %Attribute, it is best to use DTDget_elem_attr.

All parameter entities are expanded when data is stored in the %ElemCont, %ElemInc, %ElemExc, %ElemTag, and %Attribute arrays.

Note:
It is recommended not to access these arrays directly, as the format of the data stored in them is subject to change. Use the convenience routines described below to access the data.

When trying to locate external parameter entity files, DTDread_dtd uses the environment variable P_SGML_PATH. P_SGML_PATH is a colon-separated string telling DTDread_dtd where to look for external entities. By default, DTDread_dtd will look in the current working directory or the subdirectory called ents.
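The lookup described above can be pictured with the following sketch. It is not the code used by dtd.pl: the helper name find_entity_file is hypothetical, and it assumes that the default directories (the current directory and ents) are only used when P_SGML_PATH is not set.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, not part of dtd.pl: return the path of the first
# readable file named $file found along the search path, or nothing.
sub find_entity_file {
    my ($file) = @_;

    # Assumption: fall back to ".:ents" only when P_SGML_PATH is unset.
    my $path = $ENV{P_SGML_PATH};
    $path = '.:ents' unless defined $path && length $path;

    foreach my $dir (grep { length } split(/:/, $path)) {
        my $candidate = File::Spec->catfile($dir, $file);
        return $candidate if -f $candidate && -r $candidate;
    }
    return;    # not found
}
```

A caller would test the result with defined() and issue a warning when nothing is found, which mirrors the behavior described for unresolved references.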

If DTDread_dtd cannot resolve an external entity reference, it will issue a warning and continue parsing the DTD.

Current status of DTDread_dtd:

The performance of DTDread_dtd is not the best. DTDread_dtd makes frequent use of Perl's getc function. If SGML did not have such screwy grammar rules, I could have easily avoided getc. I have not bothered to optimize DTDread_dtd's performance. So far it is working, and I do not feel like mucking with it.

DTDread_dtd is meant to process DTDs in separate files. If a document instance is in the file DTDread_dtd is parsing, God only knows what will happen.

DTDread_mapfile

&'DTDread_mapfile($filename);

DTDread_mapfile parses the entity map file specified by $filename.

DTDread_mapfile uses the environment variable P_SGML_PATH, as described in section DTDread_dtd, to locate $filename. This way, one can put the map file in the same location as the entity files.

DTDread_mapfile makes the following assumptions when parsing $filename:

Example of an entity map file:

# DTDread_mapfile will ignore lines beginning with a `#' character.

#####################
# ISO entity files
#
ISO 8879-1986//ENTITIES General Technical//EN iso-tech.ent
ISO 8879-1986//ENTITIES Publishing//EN iso-pub.ent
ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN iso-num.ent
ISO 8879-1986//ENTITIES Greek Letters//EN iso-grk1.ent
ISO 8879-1986//ENTITIES Diacritical Marks//EN iso-dia.ent
ISO 8879-1986//ENTITIES Added Latin 1//EN iso-lat1.ent
ISO 8879-1986//ENTITIES Greek Symbols//EN iso-grk3.ent
ISO 8879-1986//ENTITIES Added Latin 2//EN ISOlat2
ISO 8879-1986//ENTITIES Added Math Symbols: Ordinary//EN ISOamso

#####################
# ArborText entity file
#
-//ArborText//ELEMENTS Math Equation Structures//EN ati-math.elm

#####################
# A sample SYSTEM entities
#
MyGraphics my_graphics.ent

# end of map file

If DTDread_mapfile cannot access $filename, it will issue a warning to that effect.
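As a rough illustration of how a map file like the example could be read, here is a minimal sketch. It is not the parser in dtd.pl: the helper name parse_map_lines is hypothetical, and the rule that the file name is the last whitespace-delimited token on a line is an assumption inferred from the example alone.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of dtd.pl. Assumes (from the example
# only) that the file name is the last whitespace-delimited token on a
# line and that everything before it is the entity identifier.
sub parse_map_lines {
    my (@lines) = @_;
    my %map;
    foreach my $line (@lines) {
        next if $line =~ /^\s*#/;    # comment line
        next if $line =~ /^\s*$/;    # blank line
        if ($line =~ /^\s*(.+?)\s+(\S+)\s*$/) {
            $map{$1} = $2;           # identifier => file name
        }
    }
    return %map;
}

my %map = parse_map_lines(
    "# ISO entity files\n",
    "ISO 8879-1986//ENTITIES Added Latin 1//EN iso-lat1.ent\n",
    "\n",
    "MyGraphics my_graphics.ent\n",
);
print "$_ => $map{$_}\n" foreach sort keys %map;
```

Keeping the identifier as the hash key makes the later lookup of a PUBLIC or SYSTEM identifier a single hash access.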

DTDget_elements

@elements = &'DTDget_elements();

DTDget_elements retrieves a sorted array of all elements defined in the DTD.

This function is only useful after DTDread_dtd has been called.

DTDget_top_elements

@top_elements = &'DTDget_top_elements();

DTDget_top_elements retrieves a sorted array of all top-most elements defined in the DTD. Top-most elements are those elements that cannot be contained within another element or can only be contained within themselves.

This function is only useful after DTDread_dtd has been called.
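The definition of a top-most element can be made concrete with a small sketch. The content table below is a simplified, hypothetical representation (element name to list of child elements), not the format dtd.pl stores internally, and top_elements is an illustrative name.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of dtd.pl. %$content maps an element
# name to a reference to the list of elements it may contain.
sub top_elements {
    my ($content) = @_;
    my %has_other_parent;
    foreach my $parent (keys %$content) {
        foreach my $child (@{ $content->{$parent} }) {
            # Containing itself does not disqualify an element.
            $has_other_parent{$child} = 1 if $child ne $parent;
        }
    }
    my @top = sort(grep { !$has_other_parent{$_} } keys %$content);
    return @top;
}

my %content = (
    book    => [ 'chapter' ],
    chapter => [ 'para', 'list' ],
    list    => [ 'list', 'para' ],
    para    => [],
);
print join(' ', top_elements(\%content)), "\n";    # prints "book"
```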

DTDget_elem_attr

%attribute = &'DTDget_elem_attr($elem);

DTDget_elem_attr returns an associative array containing the attributes of $elem. The keys of the array are the attribute names, and the array values are $; separated strings of the possible values for the attributes. Example of extracting an attribute's values:

@values = split(/$;/, $attribute{'alignment'});

The first value of the resulting array is the default value for the attribute (which may be an SGML reserved word), and the remaining values are all possible values for the attribute.

Note:
$; is assumed to be the default value assigned by Perl: \034. If $; is changed, unpredictable results may occur.

This function is only useful after DTDread_dtd has been called.
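The following sketch shows one way to separate the default from the list of possible values. The %attribute hash here is a hand-built stand-in for the hash returned by DTDget_elem_attr, and the attribute name and its values are invented for illustration.

```perl
use strict;
use warnings;

# Stand-in for the hash returned by DTDget_elem_attr; the attribute
# name and its values are invented for illustration.
my %attribute = (
    'alignment' => join($;, 'left', 'left', 'center', 'right'),
);

my $sep = $;;    # "\034" unless it has been changed

# First field is the default; the rest are the possible values.
my ($default, @allowed) = split(/\Q$sep\E/, $attribute{'alignment'});

print "default: $default\n";
print "allowed: @allowed\n";
```

Copying $; into a lexical and quoting it with \Q...\E keeps the split pattern literal even if the separator contains regular expression metacharacters.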

DTDget_parents

@parent_elements = &'DTDget_parents($elem);

DTDget_parents returns an array of all elements that may be a parent of $elem.

This function is only useful after DTDread_dtd has been called.

DTDis_attr_keyword

&'DTDis_attr_keyword($word);

DTDis_attr_keyword returns 1 if $word is an attribute content reserved value; otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:

Character case is ignored.

DTDis_elem_keyword

&'DTDis_elem_keyword($word);

DTDis_elem_keyword returns 1 if $word is an element content reserved value; otherwise, it returns 0. In the reference concrete syntax, the following values of $word will return 1:

DTDprint_tree

&'DTDprint_tree($elem, $depth, FILEHANDLE);

DTDprint_tree prints the content hierarchy of a single element, $elem, to a maximum depth of $depth to the file specified by FILEHANDLE. If FILEHANDLE is not specified then output goes to STDOUT. A depth of 5 is used if $depth is not specified. The root of the tree has a depth of 1.

The routine cuts the tree at elements that already exist at a higher (or equal) level, or when $depth has been reached. The string "..." is appended to an element if it has been cut off because it already exists at a higher (or equal) level.

Cutting the tree at repeated elements is necessary to avoid a combinatorial explosion with recursive element definitions.

Here's an example of what the output will look like due to pruning of recursive element contents:

htmlplus
|
|_body
| |
| |_address
| | |
| | |_p ...
| |
| |_div1
| | |
| | |_address ...
| | |_div2 ...
| | |_div3 ...
| | |_div4 ...
| | |_div5 ...
| | |_div6 ...

If you see an element with "...", just search through the output until you find the element without the "...".

Note:
Cousins at a higher or equal level are not recognized when determining if an element should be pruned. Pruning is determined from siblings, ancestors, and ancestors' siblings (aunts & uncles). Therefore, some sub-element content hierarchies may be repeated.

In order to recognize cousins, a breadth-first search is needed, or a full traversal of the hierarchy before outputting. The above technique is currently sufficient to avoid combinatorial explosions. Plus, it allows the printing of the tree while traversing the element data; there is no need to create a Perl tree data structure before printing (saves time, memory, and debugging).
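The pruning rule can be sketched as a depth-first walk that carries along the set of elements already open at the current level or above. This is only one reading of the rule, not the code of DTDprint_tree: the %content table, the routine name tree_lines, and the two-space indentation (instead of the "|_" drawing) are all assumptions, and the sketch collects lines rather than printing while it traverses.

```perl
use strict;
use warnings;

# Hypothetical content table (element => children); not the format
# used inside dtd.pl.
my %content = (
    doc  => [ 'body' ],
    body => [ 'div', 'p' ],
    div  => [ 'div', 'p', 'note' ],
    p    => [ 'em' ],
    note => [ 'p' ],
    em   => [],
);

# Return the indented lines of the tree for $elem. %$seen holds the
# elements already open at this level or above: $elem itself, its
# siblings, its ancestors, and the ancestors' siblings.
sub tree_lines {
    my ($elem, $depth, $level, $seen) = @_;
    my @lines = ('  ' x ($level - 1) . $elem);
    return @lines if $level >= $depth;         # depth limit reached

    my @kids = @{ $content{$elem} || [] };
    # The children of $elem count as "seen" for everything below them.
    my %below = (%$seen, map { $_ => 1 } @kids);
    foreach my $kid (@kids) {
        if ($seen->{$kid}) {
            push(@lines, '  ' x $level . "$kid ...");    # pruned
        } else {
            push(@lines, tree_lines($kid, $depth, $level + 1, \%below));
        }
    }
    return @lines;
}

print "$_\n" foreach tree_lines('doc', 5, 1, { doc => 1 });
```

In this sketch, p is pruned under div because it is a sibling of div, and its full content appears later under body. Cousins are not tracked, so a subtree can still be repeated, as the note above explains.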

Since the tree that is output is static, the inclusion and exclusion sets of elements are treated specially. Inclusion and exclusion elements inherited from ancestors are not propagated down to determine what elements are printed, but special markup is presented at a given element if inclusion or exclusion elements from ancestors apply to it. The reason inclusion and exclusion elements are not propagated down is the pruning that is done. An element without "..." may be the only place of reference to see the content hierarchy of that element. However, the element may occur in multiple contexts and have different ancestral inclusion and exclusion elements applied to it.

Have I lost you? Maybe an example will help:

openbook
|
|_d1
| | (I): idx needbegin needend newline
| |
| |_abbrev
| | | (Ia): idx needbegin needend newline
| | | (X): needbegin needend
| | |
| | |_#PCDATA
| | |_acro
| | | | (Ia): idx needbegin needend newline
| | | | (Xa): needbegin needend
| | | |
| | | |_#PCDATA
| | | |_sub ...
| | | |_super ...
| | |

Ignoring the lines starting with ()'s, one gets the content hierarchy of an element as defined by the DTD without concern for where it may occur in the overall structure. The ()'s lines give additional information regarding the element with respect to its existence within a specific context. For example, when an acro element occurs within openbook/d1/abbrev, along with its normal content, it can contain idx and newline elements due to inclusions from ancestors. However, it cannot contain needbegin or needend, regardless of its defined content, since one or more ancestors exclude them.

Note:
Exclusions override inclusions. If an element occurs in an inclusion set and an exclusion set, the exclusion takes precedence. Therefore, in the above example, needbegin and needend are excluded from acro.

Explanation of ()'s keys:

(I)
The list of inclusion elements defined by the current element. Since this is part of the content model of the element, the inclusion elements are printed as part of the content hierarchy of the current element.
(Ia)
The list of inclusion elements due to ancestors. This is listed as reference to determine the content of an element within a given context. None of the ancestral inclusion elements are printed as part of the content hierarchy of the element.
(X)
The list of exclusion elements defined by the current element. Since this is part of the content model of the element, the exclusion elements prevent elements defined in the base content and inclusion sets from being printed.
(Xa)
The list of exclusion elements due to ancestors. This is listed as reference to determine the content of an element within a given context. None of the ancestral exclusion elements have any effect on the printing of the content hierarchy of the current element.
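To work out which extra elements an element may contain in a given context, the (Ia) and (Xa) lists can be combined by hand or with a few lines of Perl. The helper below is hypothetical and not part of dtd.pl; it only applies the rule that exclusions override inclusions.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of dtd.pl: given the ancestor
# inclusions and ancestor exclusions for one element in one context,
# return the inclusions that remain in effect.
sub context_additions {
    my ($inc_anc, $exc_anc) = @_;
    my %excluded = map { $_ => 1 } @$exc_anc;
    # Exclusions override inclusions.
    return grep { !$excluded{$_} } @$inc_anc;
}

# Values taken from the (Ia) and (Xa) lines of acro in the example.
my @extra = context_additions(
    [ qw(idx needbegin needend newline) ],
    [ qw(needbegin needend) ],
);
print "@extra\n";    # prints "idx newline"
```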

DTDreset

&'DTDreset();

DTDreset clears all data associated with the DTD read via DTDread_dtd. This routine is useful if multiple DTDs need to be processed.


See Also

Here is a list of Perl programs that utilize dtd.pl:

dtd2html
Generates HTML documents that allow navigation through the structure of an SGML DTD.
dtdtree
Generates content hierarchy trees of SGML elements (with the use of the DTDprint_tree routine).

Earl Hood, ehood@convex.com