[lmi] an xml schema for (single|multiple)_cell_document file XML format
From: Evgeniy Tarassov
Subject: [lmi] an xml schema for (single|multiple)_cell_document file XML format
Date: Thu, 27 Dec 2007 13:43:19 +0100
On Dec 27, 2007 12:45 PM, Evgeniy Tarassov <address@hidden> wrote:
>
> The attachment contains :
A newer version of the XML Schema files for cns/ill files can be downloaded
from the lmi project download area at Savannah:
  http://download.savannah.nongnu.org/releases/lmi/cell_document.tar.bz2
The original message (updated):
I've spent some time trying to write an XML Schema for the existing
cns/ill files. The motivation is to be able to validate the content of
a cns/ill file, to detect errors or corruption, and to check that every
part of the system produces coherent output.
Ideally we would have a single *.xsd (XML Schema) file that describes
the XML structure of cns/ill files, the data format of every XML node,
and all the implicit constraints. This file would not only validate any
existing correct cns/ill file but would also _in_validate a file
containing incorrect data. The second part seems really hard to achieve
with the current file format.
Unfortunately, for the moment only a subset of the file structure
constraints can be expressed in XML Schema. Below is the list of
issues encountered. (The list is taken from the schema file
'cell_document.no_ns.v1.xsd'):
Issue A (major):
===============
The current format contains only 'cell' nodes, which represent cases,
classes and cells. To specify the number of nodes of each type, the helper
nodes 'NumberOfCases' and 'NumberOfClasses' are used. Each 'NumberOfXXX'
node holds a positive integer N and is followed by exactly N 'cell' nodes.
Unfortunately it is not possible to express such a constraint
in XML Schema. The best the schema can do is ignore the values of the
'NumberOfXXX' nodes.
A simple workaround would be to rename the 'cell' nodes after
the corresponding cell type: 'case', 'class', 'cell'. This would fix
the document node structure and make the 'NumberOfXXX' nodes
redundant, so they could be removed.
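
Until the format changes, the count constraint from Issue A could be
checked outside the schema. Below is a minimal Python sketch of such a
check. The function name check_counts, the root element name and the
sample layout are assumptions made for illustration; only the rule
"a 'NumberOfXXX' node is followed by exactly N 'cell' nodes" comes from
the description above.

```python
import xml.etree.ElementTree as ET

def check_counts(parent):
    """Return (tag, declared, actual) for every NumberOfXXX node whose
    value differs from the number of 'cell' siblings that follow it."""
    mismatches = []
    children = list(parent)
    i = 0
    while i < len(children):
        node = children[i]
        i += 1
        if not node.tag.startswith("NumberOf"):
            continue
        declared = int(node.text)
        actual = 0
        # Count the run of 'cell' nodes directly after the counter node.
        while i < len(children) and children[i].tag == "cell":
            actual += 1
            i += 1
        if actual != declared:
            mismatches.append((node.tag, declared, actual))
    return mismatches

# Hypothetical document layout, for illustration only.
SAMPLE = """
<multiple_cell_document>
  <NumberOfClasses>1</NumberOfClasses>
  <cell/>
  <NumberOfCases>2</NumberOfCases>
  <cell/>
</multiple_cell_document>
"""

print(check_counts(ET.fromstring(SAMPLE)))  # [('NumberOfCases', 2, 1)]
```

Such a check would have to run as a separate step after schema
validation, which is exactly the redundancy that renaming the nodes
would avoid.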
Issue B (minor):
===============
Most of the elements that represent an array/list/sequence are stored as
a single string with items separated by spaces.
The brute-force approach solves the issue by supplying a complex regular
expression for each sequence type. But if changing the current format
of cns/ill files can be considered, then sequence elements could be
properly represented by a node with child nodes (instead of a single
string), which would allow each array/list/sequence item to be validated
separately and rather simply.
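
To illustrate the difference between the two approaches in Issue B, here
is a small Python sketch. The item pattern (a non-negative decimal) and
the function names are assumptions chosen for the example; they are not
taken from the actual schema.

```python
import re

# Hypothetical item format: a non-negative decimal such as "0.05" or "7".
NUMBER = r"\d+(\.\d+)?"

# Whole-string approach: the pattern has to describe the entire list.
SEQUENCE = re.compile(rf"{NUMBER}( {NUMBER})*")

def valid_as_string(text):
    """True if the whole space-separated string matches the list pattern."""
    return SEQUENCE.fullmatch(text) is not None

def invalid_items(items):
    """Per-item approach: return the indexes of the items that are malformed."""
    item = re.compile(NUMBER)
    return [i for i, s in enumerate(items) if item.fullmatch(s) is None]

print(valid_as_string("0.05 0.06 0.07"))        # True
print(valid_as_string("0.05 x 0.07"))           # False
print(invalid_items(["0.05", "x", "0.07"]))     # [1]
```

The whole-string check can only say that the list is wrong, while the
per-item check (which is what child nodes would allow) also says which
item is wrong and keeps the pattern simple.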
Issue C (unsure/major):
======================
It is impossible to force two strings to have the same number of words,
yet this seems to be a common validity constraint in xxx_cell_document.
The current schema ignores these constraints and lets nodes
representing sequences have any number of items.
There is no trivial workaround (AFAICS).
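
For comparison, the constraint from Issue C is easy to state in
application code even though the schema cannot express it. The helper
below is only a sketch; the name same_length and the sample strings are
hypothetical.

```python
def same_length(*sequences):
    """True if all space-separated strings contain the same number of words."""
    counts = {len(s.split()) for s in sequences}
    return len(counts) <= 1

print(same_length("1 2 3", "a b c"))   # True
print(same_length("1 2 3", "a b"))     # False
```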
Issue D (minor):
===============
Enum element values may contain '_' instead of spaces (' '). This does
not seem to depend on the XML content format version (the 'version'
attribute of the 'cell' nodes).
This is handled by allowing each enumeration item both with spaces and
with '_' in place of spaces.
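
The same tolerance could be obtained by normalising values before
comparing them, instead of listing every item twice. A minimal Python
sketch follows; the function names and the sample enumeration values are
hypothetical and do not come from the real schema.

```python
def normalize_enum(value):
    """Treat '_' and ' ' as equivalent by mapping '_' to a space."""
    return value.replace("_", " ")

def is_allowed(value, allowed):
    """True if value matches an allowed item, ignoring the '_' / ' ' difference."""
    return normalize_enum(value) in {normalize_enum(a) for a in allowed}

ALLOWED = ["Option A", "Option B"]   # hypothetical enumeration items

print(is_allowed("Option_A", ALLOWED))   # True
print(is_allowed("Option C", ALLOWED))   # False
```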
Do you think that changes to the current cns/ill file format are possible?
I would fix issue (A) first and then, maybe, issue (B), because these
are defects of the data serialisation mechanism: an implementation
detail that does not come from the application's internal logic.
Fixing these two issues would bring a direct benefit: more precise
document validation.
--
Best wishes,
Evgeniy Tarassov
http://five.sentenc.es/