[lmi] an xml schema for (single|multiple)_cell_document file XML format


From: Evgeniy Tarassov
Subject: [lmi] an xml schema for (single|multiple)_cell_document file XML format
Date: Thu, 27 Dec 2007 13:43:19 +0100

On Dec 27, 2007 12:45 PM, Evgeniy Tarassov <address@hidden> wrote:
>
> The attachment contains :

A newer version of the XML Schema files for cns/ill files can be downloaded
from the lmi project download area at savannah:
| http://download.savannah.nongnu.org/releases/lmi/cell_document.tar.bz2

The original message (updated):

I've spent some time trying to write an XML Schema for the existing
cns/ill files. The motivation is to be able to validate the content of a
cns/ill file, to detect errors or corruption, and to check that every part
of the system produces coherent output.

Ideally we would have a single *.xsd (XML Schema) file that describes
the XML structure of cns/ill files, the data format of every XML node, and
all the implicit constraints. Such a file would not only validate any
existing correct cns/ill file but would also _in_validate a file
containing incorrect data. The second part seems really hard to achieve
with the current file format.

Unfortunately, for the moment only a subset of the file structure
constraints can be expressed in XML Schema. Below is the list of the
issues encountered. (The list is taken from the schema file
'cell_document.no_ns.v1.xsd'.)

  Issue A (major):
  ===============
  The current format contains only 'cell' nodes, which represent cases,
  classes and cells. To specify the number of nodes of each type, the helper
  nodes 'NumberOfCases' and 'NumberOfClasses' are used. Each 'NumberOfXXX'
  node holds a positive integer N and is followed by exactly N 'cell' nodes.
  Unfortunately it is not possible to express such a constraint
  in XML Schema. The best the schema can do is to ignore the values of
  the 'NumberOfXXX' nodes.

  A simple workaround would be to rename the 'cell' nodes after
  the corresponding cell type: 'case', 'class', 'cell'. This would make it
  possible to fix the document node structure in the schema and to get rid
  of the redundant 'NumberOfXXX' nodes.
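
Until such a change is made, the count constraint could be checked outside
the schema. The following is only a minimal sketch of that idea: the sample
document layout and the function name count_mismatches are assumptions made
for illustration, not the actual cns/ill structure or any lmi code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified documents; the real cns layout may differ.
GOOD = """
<document>
  <NumberOfCases>1</NumberOfCases>
  <cell/>
  <NumberOfClasses>2</NumberOfClasses>
  <cell/>
  <cell/>
</document>
"""

BAD = """
<document>
  <NumberOfCases>1</NumberOfCases>
  <cell/>
  <NumberOfClasses>2</NumberOfClasses>
  <cell/>
</document>
"""


def count_mismatches(root):
    """Return (tag, declared, actual) for every 'NumberOfXXX' node whose
    run of immediately following 'cell' siblings has the wrong length."""
    mismatches = []
    children = list(root)
    i = 0
    while i < len(children):
        child = children[i]
        if child.tag.startswith("NumberOf"):
            declared = int(child.text)
            # Count the consecutive 'cell' siblings after the counter node.
            j = i + 1
            while j < len(children) and children[j].tag == "cell":
                j += 1
            actual = j - i - 1
            if actual != declared:
                mismatches.append((child.tag, declared, actual))
            i = j
        else:
            i += 1
    return mismatches
```

Calling count_mismatches(ET.fromstring(BAD)) reports the 'NumberOfClasses'
node, which declares two cells but is followed by only one.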

  Issue B (minor):
  ===============
  Most of the elements that represent an array/list/sequence are stored as
  a single string with items separated by spaces.

  The brute-force approach is to supply a complex regular expression for
  each sequence type. But if changing the current format of cns/ill files
  can be considered, then a sequence could be properly represented by
  a node with child nodes (instead of a single string), which would allow
  rather simple validation of each array/list/sequence item separately.
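
To make the contrast concrete, here is a small sketch of both approaches for
a hypothetical sequence of decimal numbers. The element names 'seq' and
'item' and the number format are assumptions; the real sequence types in
cns/ill files are likely to need more elaborate patterns.

```python
import re
import xml.etree.ElementTree as ET

# Pattern for one item, and the brute-force pattern for the whole string.
NUMBER = r"[0-9]+(\.[0-9]+)?"
NUMBER_LIST = re.compile(rf"{NUMBER}( {NUMBER})*")


def valid_flat(text):
    """Current style: one string, items separated by single spaces."""
    return NUMBER_LIST.fullmatch(text) is not None


def valid_nested(element):
    """Proposed style: one child node per item, each checked on its own."""
    items = element.findall("item")
    return bool(items) and all(
        re.fullmatch(NUMBER, item.text or "") for item in items
    )
```

With child nodes, only the short per-item pattern is needed, and a failure
can be attributed to one specific item instead of to the whole string.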

  Issue C (unsure/major):
  ======================
  It is impossible to force two strings to have the same number of words,
  yet this seems to be a common validity constraint in xxx_cell_document.
  The current schema ignores these constraints and lets nodes that
  represent sequences have any number of items.

  There is no trivial workaround (AFAICS).
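
One possibility, outside of XML Schema itself, would be a separate check run
after schema validation. A minimal sketch, assuming whitespace-separated
items; the function name same_item_count is hypothetical:

```python
def same_item_count(*sequences):
    """True if all whitespace-separated strings hold the same number of items."""
    return len({len(s.split()) for s in sequences}) <= 1
```

For example, same_item_count("1 2 3", "a b c") holds, while
same_item_count("1 2 3", "a b") does not.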

  Issue D (minor):
  ===============
  Enum element values can contain '_' instead of spaces (' '). This does
  not seem to depend on the XML content format version (the 'version'
  attribute of the 'cell' nodes).

  This is handled by allowing each enumeration item both with spaces and
  with '_' in place of the spaces.
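
The same tolerance can be sketched in a few lines. The enumeration values
below are invented for illustration and are not taken from the real schema:

```python
# Hypothetical enumeration values, written with spaces.
ALLOWED = {"first option", "second option"}


def is_allowed(value):
    """Accept an enum value spelled with spaces or with '_' in their place."""
    return value.replace("_", " ") in ALLOWED
```

Normalising the value first keeps a single spelling of each item in the
list, whereas the schema has to enumerate both spellings explicitly.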

Do you think that changes to the current cns/ill file format would be possible?
Myself, I would first fix issue (A) and then, maybe, issue (B),
because these are defects of the data serialisation mechanism: an
implementation detail that does not come from the application's internal logic.
Fixing these two issues would bring a direct benefit -- the possibility of
more precise document validation.

-- 
Best wishes,
Evgeniy Tarassov
http://five.sentenc.es/



