[Bug-gnupedia] Re: Classification difficulty and incompleteness
From: <address@hidden>
Subject: [Bug-gnupedia] Re: Classification difficulty and incompleteness
Date: Thu, 18 Jan 2001 10:41:45 +1300
> Good stuff ... but we need some 'dotted' classification system such as:
>
> article.science.biology.genetics.human.gene ... ala Dewey Decimal ... so we
> can do effective searches.
>
> Could be pull down menus on the submission site etc...
>
> I also think we want a user-feedback system to correct bad classifications
> and even (pray tell) rate articles for usefullness etc...
Classification has been studied in library science since Alexandria. What it
comes down to is that classifications are ontologies: sets of assumptions we
make about how the world works and the relative importance of the different
parts of it.
Classification is incomplete in the mathematical sense, and it is unclear
whether all documents should be classified [see below for explanations of
these points]. What has been found to work best is:
*) the ability to assign multiple subjects to a document, so a document can be
both science.biology.genetics.human and ethics.biology
*) separation of subject from format (so films, biographies and articles on a
topic can be found in the same place).
*) using multiple classification schemes, preferably ones known to and
understood by the users (this means LoC and Dewey mainly).
*) pointers from one category into another (exemplified by the Yahoo system)
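The four points above can be sketched as a small data model. This is only an
illustration under assumptions, not a proposal for the actual schema: the
names Document, SEE_ALSO and find are hypothetical, as are the scheme names
and the example labels.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    fmt: str  # format (article, film, biography) is stored apart from subject
    # scheme name -> set of subject labels; several schemes, several subjects
    subjects: dict[str, set[str]] = field(default_factory=dict)

    def classify(self, scheme: str, label: str) -> None:
        """File the document under one more subject in the given scheme."""
        self.subjects.setdefault(scheme, set()).add(label)


# Hypothetical cross-references: a category points into related categories.
SEE_ALSO = {
    "ethics.biology": {"science.biology.genetics.human"},
}


def find(docs, scheme, label, follow_refs=True):
    """Titles filed under label or a sub-category of it, whatever the format."""
    wanted = {label}
    if follow_refs:
        wanted |= SEE_ALSO.get(label, set())
    hits = []
    for doc in docs:
        labels = doc.subjects.get(scheme, set())
        if any(l == w or l.startswith(w + ".") for l in labels for w in wanted):
            hits.append(doc.title)
    return hits


paper = Document("Gene patents", "article")
paper.classify("dotted", "science.biology.genetics.human")
paper.classify("dotted", "ethics.biology")  # second subject, same scheme
paper.classify("dewey", "570")  # illustrative label in a second scheme
film = Document("Double helix", "film")
film.classify("dotted", "science.biology.genetics")

matches = find([paper, film], "dotted", "science.biology")
```

Here a search on science.biology picks up both the article and the film,
because format plays no part in the lookup.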
What has been found not to work is:
*) X.X.general categories as in the ACM classification system.
*) numbering the categories as in both the LoC and Dewey system (the Dewey
system has as much space for Christian material as for all other religious
material).
*) unchanging labels on categories. 50 years ago automobiles was a fine name
for a category; the LoC system still uses automobile where car would be
much better.
So what should we use? Personally I believe we should classify using the Dewey
Decimal and LoC systems in parallel, both with a generous number of
cross-references. When we have enough articles on a topic (physics, computers etc)
to justify it we can also include subject-specific classification systems.
What the world doesn't need is another ad-hoc classification system.
stuart
An Undecidable Classification Problem
=====================================
Consider a digital library classification scheme that denoted whether a
document used humor, and further, whether or not the humor was funny. Consider
an author writing a piece of humor which relied entirely for its humor on
being classified as being not funny. If classified as funny, the humor fails
and the document is mis-classified; if classified as not funny, the humor
succeeds and the document is mis-classified. Either way, the document is
mis-classified.
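The argument above can be checked mechanically with a toy model. The function
names are hypothetical, and the sketch assumes the joke's effect depends only
on the label it is given:

```python
def is_actually_funny(assigned_label: str) -> bool:
    # The pathological joke works only when it is filed as "not funny".
    return assigned_label == "not funny"


def consistent_labels(labels=("funny", "not funny")):
    """Labels under which the classification agrees with the joke's effect."""
    return [l for l in labels if (l == "funny") == is_actually_funny(l)]


remaining = consistent_labels()  # no label survives the check
```

Neither label agrees with the effect it produces, so the list of consistent
labels is empty.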
Such classification schemes exist and are useful in the real world---consider
for example the newsgroup rec.humour.funny , a moderated newsgroup which tries
to carry only `funny' humour. Pathological jokes have been attempted (by
myself) and submitted, but without response from the moderators (who must
judge the humour of the joke).
It was suggested that this apparent paradox can be resolved because the joke
is impossible to construct as it contains an internal paradox (i.e. it's only
true when it's false). The problem with this argument is that jokes are a
literary form which has no requirement of internal consistency; indeed many
famous examples (much of Lewis Carroll's work, for example) contain many
internal contradictions.
Should all documents be classified?
===================================
Consider a new document that is sufficiently metaphorical and allusive that
it could be about anything (something like the prophecies of Nostradamus). Any
assignment of subject classification by a classifier to the document instantly
places that subject at the forefront of a reader's mind when interpreting the
document; thus the classifier biases all subsequent readers of the document.
-- stuart yeates <address@hidden> aka `loam'
"Oh, havoc," cried Pooh, as he let slip the heffalumps of war.
X-no-archive:yes