James Junmin Fan
Wed, 27 Jun 2001 11:44:59 -0500 (CDT)
Copyright (C) 2000 Free Software Foundation, Inc.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.1 or
any later version published by the Free Software Foundation; with no
Invariant Sections, with no Front-Cover Texts, and with no Back-Cover
Texts. A copy of the license is included in the file COPYING.
%%short-description: a text parser that converts text documents into a vector
space model based on word frequencies.
%%full-description: MC is a C++ program that creates vector-space models from
text documents that can be used for text mining applications. MC provides an
efficient multi-threaded implementation that can process very large document
<p> The MC program:
1. Recursively descends directories, finding text files.
2. Processes files selectively through full regular expression matching of
   file names.
3. Builds a sparse matrix of word/token counts. The particular sparse matrix
   format used is given here.
4. Processes any user-specified text formats (email addresses or URLs) as a
   whole token through regular expression matching or a FLEX definition.
5. Prunes the vocabulary by word length and frequency.
6. Excludes user-specified stop words.
7. Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx,
   lxn, lfn, lfx scaling schemes.
8. Writes all data structures to disk in the Compressed Column Storage format.
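<p> To illustrate the counting, pruning, stop-word and storage steps listed above, here is a minimal Python sketch. It is not MC's code: MC is written in C++ and tokenizes with FLEX, while this sketch uses a simple regular expression. The function name build_ccs, its parameters, and the choice of documents as columns are assumptions made for illustration only. Frequency-based pruning and the weighting schemes are left out.

```python
import re
from collections import Counter


def build_ccs(docs, stop_words=frozenset(), min_len=1, max_len=25):
    """Build a word-by-document count matrix in Compressed Column Storage.

    Hypothetical helper, not part of MC. Rows are vocabulary words in
    sorted order, columns are documents. Returns (vocab, values, row_idx,
    col_ptr), where column j occupies values[col_ptr[j]:col_ptr[j + 1]].
    """
    # Count tokens per document, dropping stop words and words whose
    # length falls outside [min_len, max_len].
    counts = []
    for text in docs:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        kept = [t for t in tokens
                if t not in stop_words and min_len <= len(t) <= max_len]
        counts.append(Counter(kept))

    # Assign one row index to every surviving word.
    vocab = sorted(set().union(*counts))
    row_of = {word: i for i, word in enumerate(vocab)}

    # Emit the nonzero counts column by column, rows in ascending order.
    values, row_idx, col_ptr = [], [], [0]
    for doc_counts in counts:
        for word in sorted(doc_counts, key=row_of.get):
            values.append(doc_counts[word])
            row_idx.append(row_of[word])
        col_ptr.append(len(values))
    return vocab, values, row_idx, col_ptr


if __name__ == "__main__":
    docs = ["The cat sat on the mat; the cat slept.", "The dog sat."]
    vocab, values, row_idx, col_ptr = build_ccs(docs, stop_words={"the", "on"})
    print(vocab)    # row labels
    print(values)   # nonzero counts, column by column
    print(row_idx)  # row index of each nonzero
    print(col_ptr)  # start offset of each column in values/row_idx
```

Only nonzero counts are stored, so the memory used grows with the number of distinct word/document pairs rather than with vocabulary size times document count.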
<p> The application does not:
1. Have English parsing or part-of-speech tagging facilities.
2. Have complete documentation.
3. Claim to be bug-free.
%%license verified by:
%%license verified on:
%%maintainer: James Fan <address@hidden>
%%keywords: text mining, data mining, vector space model, bag of words
%%interface: Command line
%%build-prerequisites: FLEX, STL, pthread lib
%%version: 2.19 stable released 2001-06-26
%%entry written by: James Fan <address@hidden>
- [gfsd]MC entry,
James Junmin Fan <=