dotgnu-general

Re: [DotGNU]Disadvantages of XML


From: Rhys Weatherley
Subject: Re: [DotGNU]Disadvantages of XML
Date: Wed, 02 Jan 2002 22:39:36 +1000

"Gopal.V" wrote:

> Sorry, the only KDE tools I use are kppp & konqueror. I haven't noticed
> others. But gzip is only a half measure for XML, we really need to develop
> an XML-specific compression (pseudo-lossy) and put that into libxml &
> port/wrap it in other languages (read Java, Python, PHP...).

I would humbly disagree with this ...

The WAP Forum created something like this for WML, which
is a lightweight XML-based page format for mobile phones
and other wireless devices.  http://www.wapforum.org

Essentially, they chose 1-byte numbers for each tag type,
and then replaced the tag structure with those numbers when
transmitting the data over the wireless network.  i.e., instead
of sending <TABLE>, you would send "3" (or whatever the
tag value was).  This compacted the XML quite well.

IMHO, it was a big mistake.  There is a massive versioning bug in
how they did it.  Because version 1 clients don't understand
version 2 tag numbers, it creates migration problems when
moving to a new version of the standard.  It also lost data:
DTDs, comments, stylesheets, and other meta information
were stripped from the input, which made things difficult for
clients that wanted that information.
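
To make the versioning problem concrete, here is a minimal Python
sketch.  The tag names and code numbers are made up for illustration;
they are not the real WML binary assignments, and the real format
carries much more than tag names.

```python
# Hypothetical code tables; the real WML binary format uses different values.
V1_TAGS = {"card": 1, "p": 2, "table": 3}
V2_TAGS = {**V1_TAGS, "marquee": 4}  # version 2 of the standard adds a tag


def encode(tags, table):
    """Replace each tag name with its pre-assigned 1-byte code."""
    return bytes(table[name] for name in tags)


def decode(data, table):
    """Map codes back to names; a code missing from the table is lost."""
    names = {code: name for name, code in table.items()}
    return [names.get(code, "?") for code in data]


message = encode(["card", "marquee", "p"], V2_TAGS)
print(decode(message, V2_TAGS))  # a version 2 client recovers every tag
print(decode(message, V1_TAGS))  # a version 1 client cannot name code 4
```

The message itself carries no hint of what code 4 means, so anything
holding only the version 1 table has to guess or give up.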

They made it worse by doing the conversion in the gateway
between the server and client.  That way, a version 1 gateway
sitting between a version 2 server and a version 2 client
would mangle any tags it did not recognise, resulting in data
loss even though both ends understood version 2.

They should have used a dictionary approach: at the start of
each message is a table that maps 1-byte numbers to textual
tag names.  i.e. instead of allocating the numbers ahead of time,
allow the mapping to be sent along with the compacted data.
Gateways in the middle would not get confused by unknown
tags: they would assign a new dictionary entry and move on.
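
A rough Python sketch of that dictionary idea follows.  The message
layout here (a 2-byte table length, newline-separated tag names, then
one byte per tag) is my own assumption for illustration, not anything
the WAP Forum specified, and it only handles up to 256 distinct tags.

```python
def encode(tags):
    """Build a message: the tag-name table, then one byte per tag."""
    table = []  # a name's position in this list is its 1-byte code
    codes = bytearray()
    for name in tags:
        if name not in table:
            table.append(name)  # unseen tag: assign a new entry and move on
        codes.append(table.index(name))
    header = "\n".join(table).encode("utf-8")
    return len(header).to_bytes(2, "big") + header + bytes(codes)


def decode(message):
    """Rebuild the tag sequence using only the table inside the message."""
    size = int.from_bytes(message[:2], "big")
    table = message[2:2 + size].decode("utf-8").split("\n")
    return [table[code] for code in message[2 + size:]]


tags = ["card", "marquee", "p", "p", "card"]
print(decode(encode(tags)))  # round-trips without any shared code table
```

Neither end needs to know the tag set in advance, so a new tag in a
later version of the standard does not break older software.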

The funny thing is, once you start down the road of dictionaries,
you eventually end up at gzip.  The DEFLATE format that gzip uses
is built on LZ77 (plus Huffman coding), which is essentially a
dictionary-based compactor: the dictionary is built on the fly from
the data already seen, so the receiver can rebuild it from the
message itself.

Implementing a special-purpose compression scheme for XML
is not really as good an idea as it sounds, IMHO: you run into
versioning problems, dictionary encoding issues, data loss, etc.

At the end of the day, it is easier to just gzip it and forget about
the problem.  No data loss, and roughly the same level of
compaction.  Highly redundant data like XML compresses
very well.  For example, the 6 MB All.xml file for the C#
library specification compresses to ~630 KB using gzip: about
10% of the original size.
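
If you want to check this kind of figure yourself, something like the
following Python sketch will do.  The XML here is synthetic and far
more repetitive than a real file like All.xml, so the ratio it prints
will not match the number above; substitute your own file contents for
a meaningful measurement.

```python
import gzip

# Synthetic, repetitive XML standing in for a real document.
rows = "".join(
    f'<member name="Method{i}"><summary>Returns value {i}.</summary></member>\n'
    for i in range(5000)
)
original = f"<library>\n{rows}</library>\n".encode("utf-8")

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)  # lossless: identical bytes come back

ratio = len(compressed) / len(original)
print(len(original), len(compressed), f"{ratio:.1%}")
```

Comments, DTD references and everything else in the input survive the
round trip, since gzip works on the raw bytes.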

Cheers,

Rhys.



