[Pspp-cvs] pspp/doc pspp.texinfo data-file-format.texi


From: Ben Pfaff
Subject: [Pspp-cvs] pspp/doc pspp.texinfo data-file-format.texi
Date: Wed, 18 Jul 2007 02:20:46 +0000

CVSROOT:        /cvsroot/pspp
Module name:    pspp
Changes by:     Ben Pfaff <blp> 07/07/18 02:20:45

Modified files:
        doc            : pspp.texinfo data-file-format.texi 

Log message:
        Improve the description of the SPSS system file format.  Patch #6103.
        Thanks to John Darrington for review.

CVSWeb URLs:
http://cvs.savannah.gnu.org/viewcvs/pspp/doc/pspp.texinfo?cvsroot=pspp&r1=1.7&r2=1.8
http://cvs.savannah.gnu.org/viewcvs/pspp/doc/data-file-format.texi?cvsroot=pspp&r1=1.16&r2=1.17

Patches:
Index: pspp.texinfo
===================================================================
RCS file: /cvsroot/pspp/pspp/doc/pspp.texinfo,v
retrieving revision 1.7
retrieving revision 1.8
diff -u -b -r1.7 -r1.8
--- pspp.texinfo        3 Jun 2007 22:01:18 -0000       1.7
+++ pspp.texinfo        18 Jul 2007 02:20:45 -0000      1.8
@@ -88,7 +88,7 @@
 * Configuration::               Configuring PSPP.
 
 * Portable File Format::        Format of PSPP portable files.
-* Data File Format::            Format of PSPP system files.
+* System File Format::          Format of PSPP system files.
 * q2c Input Format::            Format of syntax accepted by q2c.
 
 * GNU Free Documentation License:: License for copying this manual.

Index: data-file-format.texi
===================================================================
RCS file: /cvsroot/pspp/pspp/doc/data-file-format.texi,v
retrieving revision 1.16
retrieving revision 1.17
diff -u -b -r1.16 -r1.17
--- data-file-format.texi       6 Jun 2007 05:17:48 -0000       1.16
+++ data-file-format.texi       18 Jul 2007 02:20:45 -0000      1.17
@@ -1,52 +1,101 @@
-@node Data File Format
-@chapter Data File Format
+@node System File Format
+@chapter System File Format
 
-PSPP necessarily uses the same format for system files as do the
-products with which it is compatible.  This chapter is a description of
-that format.
-
-There are three data types used in system files: 32-bit integers, 64-bit
-floating points, and 1-byte characters.  In this document these will
-simply be referred to as @code{int32}, @code{flt64}, and @code{char},
-the names that are used in the PSPP source code.  Every field of type
-@code{int32} or @code{flt64} is aligned on a 32-bit boundary relative to
-the start of the record.
-
-The endianness of data in PSPP system files is not specified.  System
-files output on a computer of a particular endianness will have the
-endianness of that computer.  However, PSPP can read files of either
-endianness, regardless of its host computer's endianness.  PSPP
-translates endianness for both integer and floating point numbers.
-
-Floating point formats are also not specified.  PSPP does not
-translate between floating point formats.  This is unlikely to be a
-problem as all modern computer architectures use IEEE 754 format for
-floating point representation.
+A system file encapsulates a set of cases and dictionary information
+that describes how they may be interpreted.  This chapter describes
+the format of a system file.
+
+System files use three data types: 8-bit characters, 32-bit integers,
+and 64-bit floating points, called here @code{char}, @code{int32}, and
+@code{flt64}, respectively.  Data is not necessarily aligned on a word
+or double-word boundary: the long variable name record (@pxref{Long
+Variable Names Record}) and very long string records (@pxref{Very Long
+String Record}) have arbitrary byte length and can therefore cause all
+data coming after them in the file to be misaligned.
+
+Integer data in system files may be big-endian or little-endian.  A
+reader may detect the endianness of a system file by examining
+@code{layout_code} in the file header record
+(@pxref{layout_code,,@code{layout_code}}).
+
+Floating-point data in system files may nominally be in IEEE 754, IBM,
+or VAX formats.  A reader may detect the floating-point format in use
+by examining @code{bias} in the file header record
+(@pxref{bias,,@code{bias}}).
+
+PSPP detects big-endian and little-endian integer formats in system
+files and translates as necessary.  PSPP also detects the
+floating-point format in use, as well as the endianness of IEEE 754
+floating-point numbers, and translates as needed.  However, only IEEE
+754 numbers with the same endianness as integer data in the same file
+have actually been observed in system files, and it is likely that
+other formats are obsolete or were never used.
 
 The PSPP system-missing value is represented by the largest possible
-negative number in the floating point format; in C, this is most likely
-@code{-DBL_MAX}.  There are two other important values used in missing
-values: @code{HIGHEST} and @code{LOWEST}.  These are represented by the
-largest possible positive number (probably @code{DBL_MAX}) and the
-second-largest negative number.  The latter must be determined in a
-system-dependent manner; in IEEE 754 format it is represented by value
-@code{0xffeffffffffffffe}.
-
-System files are divided into records.  Each record begins with an
-@code{int32} giving a numeric record type.  Individual record types are
-described below:
+negative number in the floating point format (@code{-DBL_MAX}).  Two
+other values are important for use as missing values: @code{HIGHEST},
+represented by the largest possible positive number (@code{DBL_MAX}),
+and @code{LOWEST}, represented by the second-largest negative number
+(in IEEE 754 format, @code{0xffeffffffffffffe}).
+
+System files are divided into records, each of which begins with a
+4-byte record type, usually regarded as an @code{int32}.
+
+The records must appear in the following order:
+
+@itemize @bullet
+@item
+File header record.
+
+@item
+Variable records.
+
+@item
+All pairs of value labels records and value label variables records,
+if present.
+
+@item
+Document record, if present.
+
+@item
+Any of the following records, if present, in any order:
+
+@itemize @minus
+@item
+Machine integer info record.
+
+@item
+Machine floating-point info record.
+
+@item
+Variable display parameter record.
+
+@item
+Long variable names record.
+
+@item
+Miscellaneous informational records.
+@end itemize
+
+@item
+Dictionary termination record.
+
+@item
+Data record.
+@end itemize
+
+Each type of record is described separately below.
 
 @menu
 * File Header Record::          
 * Variable Record::             
-* Value Label Record::          
-* Value Label Variable Record::  
+* Value Labels Records::
 * Document Record::             
-* Machine int32 Info Record::   
-* Machine flt64 Info Record::   
-* Auxiliary Variable Parameter Record::
+* Machine Integer Info Record::
+* Machine Floating-Point Info Record::
+* Variable Display Parameter Record::
 * Long Variable Names Record::
-* Very Long String Length Record::
+* Very Long String Record::
 * Miscellaneous Informational Records::  
 * Dictionary Termination Record::  
 * Data Record::                 
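Note on the hunk above: the revised text defines the system-missing value as `-DBL_MAX`, `HIGHEST` as `DBL_MAX`, and `LOWEST` as the second-largest negative number (`0xffeffffffffffffe` in IEEE 754). A minimal C sketch of those three sentinels follows. It assumes the host stores `double` as IEEE 754 binary64, and the helper names are hypothetical, not taken from the PSPP source.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Returns the raw bit pattern of a double.
   Assumption: double is IEEE 754 binary64 on this host. */
static uint64_t
double_bits (double d)
{
  uint64_t bits;
  assert (sizeof bits == sizeof d);
  memcpy (&bits, &d, sizeof bits);
  return bits;
}

/* Hypothetical helper: the system-missing value. */
static double
sysmis_value (void)
{
  return -DBL_MAX;
}

/* Hypothetical helper: the HIGHEST sentinel. */
static double
highest_value (void)
{
  return DBL_MAX;
}

/* Hypothetical helper: the LOWEST sentinel, one representable step
   closer to zero than -DBL_MAX.  For a negative IEEE 754 double,
   subtracting 1 from the bit pattern reduces the magnitude. */
static double
lowest_value (void)
{
  uint64_t bits = double_bits (-DBL_MAX) - 1;
  double d;
  memcpy (&d, &bits, sizeof d);
  return d;
}
```

Under that assumption, `double_bits (lowest_value ())` yields the `0xffeffffffffffffe` pattern quoted in the patch.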
@@ -55,30 +104,27 @@
 @node File Header Record
 @section File Header Record
 
-The file header is always the first record in the file.
+The file header is always the first record in the file.  It has the
+following format:
 
 @example
-struct sysfile_header
-  @{
-    char                rec_type[4];
-    char                prod_name[60];
-    int32               layout_code;
-    int32               nominal_case_size;
-    int32               compressed;
-    int32               weight_index;
-    int32               ncases;
-    flt64               bias;
-    char                creation_date[9];
-    char                creation_time[8];
-    char                file_label[64];
-    char                padding[3];
-  @};
+char                rec_type[4];
+char                prod_name[60];
+int32               layout_code;
+int32               nominal_case_size;
+int32               compressed;
+int32               weight_index;
+int32               ncases;
+flt64               bias;
+char                creation_date[9];
+char                creation_time[8];
+char                file_label[64];
+char                padding[3];
 @end example
 
 @table @code
 @item char rec_type[4];
-Record type code.  Always set to @samp{$FL2}.  This is the only record
-for which the record type is not of type @code{int32}.
+Record type code, set to @samp{$FL2}.
 
 @item char prod_name[60];
 Product identification string.  This always begins with the characters
@@ -88,9 +134,11 @@
 would be longer than 60 characters; otherwise it is padded on the right
 with spaces.
 
+@anchor{layout_code}
 @item int32 layout_code;
-Always set to 2.  PSPP reads this value to determine the
-file's endianness.
+Normally set to 2, although a few system files have been spotted in
+the wild with a value of 3 here.  PSPP uses this value to determine the
+file's integer endianness (@pxref{System File Format}).
 
 @item int32 nominal_case_size;
 Number of data elements per case.  This is the number of variables,
@@ -104,8 +152,9 @@
 Set to 1 if the data in the file is compressed, 0 otherwise.
 
 @item int32 weight_index;
-If one of the variables in the data set is used as a weighting variable,
-set to the index of that variable.  Otherwise, set to 0.
+If one of the variables in the data set is used as a weighting
+variable, set to the dictionary index of that variable, plus 1
+(@pxref{Dictionary Index}).  Otherwise, set to 0.
 
 @item int32 ncases;
 Set to the number of cases in the file if it is known, or -1 otherwise.
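Note on `layout_code` (described in the preceding hunk): the patch says only that PSPP uses this field to determine integer endianness, not how. One plausible approach, sketched below as an assumption, is to decode the four bytes both ways and accept whichever reading gives 2 or 3. The offset of 64 follows from the header layout (`rec_type[4]` plus `prod_name[60]`); the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum int_endian { ENDIAN_UNKNOWN, ENDIAN_LITTLE, ENDIAN_BIG };

/* layout_code follows rec_type[4] and prod_name[60]. */
#define LAYOUT_CODE_OFFSET 64

/* Decodes 4 bytes as a little-endian 32-bit integer. */
static uint32_t
read_le32 (const unsigned char *p)
{
  return (uint32_t) p[0] | (uint32_t) p[1] << 8
         | (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24;
}

/* Decodes 4 bytes as a big-endian 32-bit integer. */
static uint32_t
read_be32 (const unsigned char *p)
{
  return (uint32_t) p[3] | (uint32_t) p[2] << 8
         | (uint32_t) p[1] << 16 | (uint32_t) p[0] << 24;
}

/* Guesses the integer byte order of a system file from the first
   LEN bytes of its header.  Accepts layout codes 2 and 3, the two
   values the documentation reports having been seen. */
static enum int_endian
detect_int_endian (const unsigned char *header, size_t len)
{
  assert (header != NULL);
  if (len < LAYOUT_CODE_OFFSET + 4)
    return ENDIAN_UNKNOWN;

  const unsigned char *p = header + LAYOUT_CODE_OFFSET;
  uint32_t le = read_le32 (p);
  uint32_t be = read_be32 (p);
  if (le == 2 || le == 3)
    return ENDIAN_LITTLE;
  if (be == 2 || be == 3)
    return ENDIAN_BIG;
  return ENDIAN_UNKNOWN;
}
```

Decoding byte by byte keeps the check independent of the host's own byte order.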
@@ -118,24 +167,31 @@
 this is not valid, the seek operation fails.  In this case,
 @code{ncases} remains -1.
 
+@anchor{bias}
 @item flt64 bias;
-Compression bias.  Always set to 100.  The significance of this value is
-that only numbers between @code{(1 - bias)} and @code{(251 - bias)} can
-be compressed.
+Compression bias, ordinarily set to 100.  Only integers between
+@code{1 - bias} and @code{251 - bias} can be compressed.
+
+By assuming that its value is 100, PSPP uses @code{bias} to determine
+the file's floating-point format and endianness (@pxref{System File
+Format}).  If the compression bias is not 100, PSPP cannot auto-detect
+the floating-point format and assumes that it is IEEE 754 format with
+the same endianness as the system file's integers, which is correct
+for all known system files.
 
 @item char creation_date[9];
-Set to the date of creation of the system file, in @samp{dd mmm yy}
+Date of creation of the system file, in @samp{dd mmm yy}
 format, with the month as standard English abbreviations, using an
 initial capital letter and following with lowercase.  If the date is not
 available then this field is arbitrarily set to @samp{01 Jan 70}.
 
 @item char creation_time[8];
-Set to the time of creation of the system file, in @samp{hh:mm:ss}
+Time of creation of the system file, in @samp{hh:mm:ss}
 format and using 24-hour time.  If the time is not available then this
 field is arbitrarily set to @samp{00:00:00}.
 
 @item char file_label[64];
-Set the file label declared by the user, if any (@pxref{FILE LABEL}).
+File label declared by the user, if any (@pxref{FILE LABEL}).
 Padded on the right with spaces.
 
 @item char padding[3];
@@ -146,30 +202,44 @@
 @node Variable Record
 @section Variable Record
 
-Immediately following the header must come the variable records.  There
-must be one variable record for every variable and every 8 characters in
-a long string beyond the first 8.
-
-@example
-struct sysfile_variable
-  @{
-    int32               rec_type;
-    int32               type;
-    int32               has_var_label;
-    int32               n_missing_values;
-    int32               print;
-    int32               write;
-    char                name[8];
-
-    /* The following two fields are present 
-       only if has_var_label is 1. */
-    int32               label_len;
-    char                label[/* variable length */];
-
-    /* The following field is present only
-       if n_missing_values is not 0. */
-    flt64               missing_values[/* variable length */];
-  @};
+There must be one variable record for each numeric variable and each
+string variable with width 8 bytes or less.  String variables wider
+than 8 bytes have one variable record for each 8 bytes, rounding up.
+The first variable record for a long string specifies the variable's
+correct dictionary information.  Subsequent variable records for a
+long string are filled with dummy information: a type of -1, no
+variable label or missing values, print and write formats that are
+ignored, and an empty string as name.  A few system files have been
+encountered that include a variable label on dummy variable records,
+so readers should take care to parse dummy variable records in the
+same way as other variable records.
+
+@anchor{Dictionary Index}
+The @dfn{dictionary index} of a variable is its offset in the set of
+variable records, including dummy variable records for long string
+variables.  The first variable record has a dictionary index of 0, the
+second has a dictionary index of 1, and so on.
+
+The system file format does not directly support string variables
+wider than 255 bytes.  Such very long string variables are represented
+by a number of narrower string variables.  @xref{Very Long String
+Record}, for details.
+
+@example
+int32               rec_type;
+int32               type;
+int32               has_var_label;
+int32               n_missing_values;
+int32               print;
+int32               write;
+char                name[8];
+
+/* @r{Present only if @code{has_var_label} is 1.} */
+int32               label_len;
+char                label[];
+
+/* @r{Present only if @code{n_missing_values} is nonzero}. */
+flt64               missing_values[];
 @end example
 
 @table @code
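Note on the hunk above: the revised text gives one variable record per numeric or short string variable, one record per 8 bytes (rounding up) for wider strings, and a 0-based dictionary index that counts the dummy records too. The C sketch below illustrates that counting. The function names and the convention that width 0 means "numeric" are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of variable records for one variable.
   Convention assumed here: WIDTH 0 means numeric, 1..255 is a
   string width in bytes. */
static int
variable_record_count (int width)
{
  assert (width >= 0 && width <= 255);
  return width <= 8 ? 1 : (width + 7) / 8;
}

/* Stores in DICT_INDEX[i] the 0-based dictionary index of the first
   variable record of variable i, counting dummy records for long
   strings.  Returns the total number of variable records. */
static int
assign_dictionary_indexes (const int *widths, size_t n, int *dict_index)
{
  int next = 0;
  for (size_t i = 0; i < n; i++)
    {
      dict_index[i] = next;
      next += variable_record_count (widths[i]);
    }
  return next;
}
```

For example, a numeric variable followed by strings of width 20, 8, and 9 gets dictionary indexes 0, 1, 4, and 5, for 7 variable records in total. Per the header description, a `weight_index` naming the first of these would hold 1, since that field stores the dictionary index plus 1.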
@@ -210,12 +280,12 @@
 set to the length, in characters, of the variable label, which must be a
 number between 0 and 120.
 
-@item char label[/* variable length */];
+@item char label[];
 This field is present only if @code{has_var_label} is set to 1.  It has
 length @code{label_len}, rounded up to the nearest multiple of 32 bits.
 The first @code{label_len} characters are the variable's variable label.
 
-@item flt64 missing_values[/* variable length */];
+@item flt64 missing_values[];
 This field is present only if @code{n_missing_values} is not 0.  It has
 the same number of elements as the absolute value of
 @code{n_missing_values}.  For discrete missing values, each element
@@ -228,166 +298,176 @@
 @end table
 
 The @code{print} and @code{write} members of sysfile_variable are output
-formats coded into @code{int32} types.  The LSB (least-significant byte)
+formats coded into @code{int32} types.  The least-significant byte
 of the @code{int32} represents the number of decimal places, and the
 next two bytes in order of increasing significance represent field width
-and format type, respectively.  The MSB (most-significant byte) is not
+and format type, respectively.  The most-significant byte is not
 used and should be set to zero.
 
 Format types are defined as follows:
-@table @asis
+
+@quotation
+@multitable {Value} {@code{DATETIME}}
+@headitem Value
+@tab Meaning
 @item 0
-Not used.
+@tab Not used.
 @item 1
-@code{A}
+@tab @code{A}
 @item 2
-@code{AHEX}
+@tab @code{AHEX}
 @item 3
-@code{COMMA}
+@tab @code{COMMA}
 @item 4
-@code{DOLLAR}
+@tab @code{DOLLAR}
 @item 5
-@code{F}
+@tab @code{F}
 @item 6
-@code{IB}
+@tab @code{IB}
 @item 7
-@code{PIBHEX}
+@tab @code{PIBHEX}
 @item 8
-@code{P}
+@tab @code{P}
 @item 9
-@code{PIB}
+@tab @code{PIB}
 @item 10
-@code{PK}
+@tab @code{PK}
 @item 11
-@code{RB}
+@tab @code{RB}
 @item 12
-@code{RBHEX}
+@tab @code{RBHEX}
 @item 13
-Not used.
+@tab Not used.
 @item 14
-Not used.
+@tab Not used.
 @item 15
-@code{Z}
+@tab @code{Z}
 @item 16
-@code{N}
+@tab @code{N}
 @item 17
-@code{E}
+@tab @code{E}
 @item 18
-Not used.
+@tab Not used.
 @item 19
-Not used.
+@tab Not used.
 @item 20
-@code{DATE}
+@tab @code{DATE}
 @item 21
-@code{TIME}
+@tab @code{TIME}
 @item 22
-@code{DATETIME}
+@tab @code{DATETIME}
 @item 23
-@code{ADATE}
+@tab @code{ADATE}
 @item 24
-@code{JDATE}
+@tab @code{JDATE}
 @item 25
-@code{DTIME}
+@tab @code{DTIME}
 @item 26
-@code{WKDAY}
+@tab @code{WKDAY}
 @item 27
-@code{MONTH}
+@tab @code{MONTH}
 @item 28
-@code{MOYR}
+@tab @code{MOYR}
 @item 29
-@code{QYR}
+@tab @code{QYR}
 @item 30
-@code{WKYR}
+@tab @code{WKYR}
 @item 31
-@code{PCT}
+@tab @code{PCT}
 @item 32
-@code{DOT}
+@tab @code{DOT}
 @item 33
-@code{CCA}
+@tab @code{CCA}
 @item 34
-@code{CCB}
+@tab @code{CCB}
 @item 35
-@code{CCC}
+@tab @code{CCC}
 @item 36
-@code{CCD}
+@tab @code{CCD}
 @item 37
-@code{CCE}
+@tab @code{CCE}
 @item 38
-@code{EDATE}
+@tab @code{EDATE}
 @item 39
-@code{SDATE}
+@tab @code{SDATE}
+@end multitable
+@end quotation
+
+@node Value Labels Records
+@section Value Labels Records
+
+The value label record has the following format:
+
+@example
+int32               rec_type;
+int32               label_count;
+
+/* @r{Repeated @code{label_count} times}. */
+char                value[8];
+char                label_len;
+char                label[];
+@end example
+
+@table @code
+@item int32 rec_type;
+Record type.  Always set to 3.
+
+@item int32 label_count;
+Number of value labels present in this record.
 @end table
 
-@node Value Label Record
-@section Value Label Record
+The remaining fields are repeated @code{label_count} times.  Each
+repetition specifies one value label.
 
-Value label records must follow the variable records and must precede
-the header termination record.  Other than this, they may appear
-anywhere in the system file.  Every value label record must be
-immediately followed by a label variable record, described below.
-
-Value label records begin with @code{rec_type}, an @code{int32} value
-set to the record type of 3.  This is followed by @code{count}, an
-@code{int32} value set to the number of value labels present in this
-record.
-
-These two fields are followed by a series of @code{count} tuples.  Each
-tuple is divided into two fields, the value and the label.  The first of
-these, the value, is composed of a 64-bit value, which is either a
-@code{flt64} value or up to 8 characters (padded on the right to 8
-bytes) denoting a short string value.  Whether the value is a
-@code{flt64} or a character string is not defined inside the value label
-record.
-
-The second field in the tuple, the label, has variable length.  The
-first @code{char} is a count of the number of characters in the value
-label.  The remainder of the field is the label itself.  The field is
-padded on the right to a multiple of 64 bits in length.
-
-@node Value Label Variable Record
-@section Value Label Variable Record
-
-Every value label variable record must be immediately preceded by a
-value label record, described above.
-
-@example
-struct sysfile_value_label_variable
-  @{
-     int32              rec_type;
-     int32              count;
-     int32              vars[/* variable length */];
-  @};
+@table @code
+@item char value[8];
+A numeric value or a short string value padded as necessary to 8 bytes
+in length.  Its type and width cannot be determined until the
+following value label variables record (see below) is read.
+
+@item char label_len;
+The label's length, in bytes.
+
+@item char label[];
+@code{label_len} bytes of the actual label, followed by up to 7 bytes
+of padding to bring @code{label} and @code{label_len} together to a
+multiple of 8 bytes in length.
+@end table
+
+The value label record is always immediately followed by a value label
+variables record with the following format:
+
+@example
+int32               rec_type;
+int32               var_count;
+int32               vars[];
 @end example
 
 @table @code
 @item int32 rec_type;
 Record type.  Always set to 4.
 
-@item int32 count;
+@item int32 var_count;
 Number of variables to which the associated value labels from the value
 label record are to be applied.
 
-@item int32 vars[/* variable length */];
-A list of variables to which to apply the value labels.  There are
-@code{count} elements.  Each element identifies a variable record, where
-the first element is numbered 1 and long string variables are considered
-to occupy multiple indexes.
+@item int32 vars[];
+A list of dictionary indexes of variables to which to apply the value
+labels (@pxref{Dictionary Index}).  There are @code{var_count}
+elements.
+
+String variables wider than 8 bytes may not have value labels.
 @end table
 
 @node Document Record
 @section Document Record
 
-There must be no more than one document record per system file.
-Document records must follow the variable records and precede the
-dictionary termination record.
+The document record, if present, has the following format:
 
 @example
-struct sysfile_document
-  @{
-    int32               rec_type;
-    int32               n_lines;
-    char                lines[/* variable length */][80];
-  @};
+int32               rec_type;
+int32               n_lines;
+char                lines[][80];
 @end example
 
 @table @code
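Note on the `print` and `write` fields (described near the start of the hunk above): the least-significant byte holds the number of decimal places, the next byte the field width, the next the format type, and the most-significant byte is zero. The C sketch below packs and unpacks that layout. The struct and function names are hypothetical, and the value is assumed to have already been converted to host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of a print or write format. */
struct format_spec
  {
    int type;      /* Format type code, e.g. 5 for F. */
    int width;     /* Field width. */
    int decimals;  /* Number of decimal places. */
  };

/* Unpacks a format from its int32 encoding (host byte order). */
static struct format_spec
decode_format (uint32_t raw)
{
  struct format_spec f;
  f.decimals = raw & 0xff;
  f.width = (raw >> 8) & 0xff;
  f.type = (raw >> 16) & 0xff;
  return f;
}

/* Packs a format into its int32 encoding, leaving the
   most-significant byte zero. */
static uint32_t
encode_format (struct format_spec f)
{
  assert (f.type >= 0 && f.type <= 255);
  assert (f.width >= 0 && f.width <= 255);
  assert (f.decimals >= 0 && f.decimals <= 255);
  return (uint32_t) f.type << 16 | (uint32_t) f.width << 8
         | (uint32_t) f.decimals;
}
```

With this layout, the format `F8.2` (type 5, width 8, 2 decimals) encodes as `0x00050802`.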
@@ -397,37 +477,32 @@
 @item int32 n_lines;
 Number of lines of documents present.
 
-@item char lines[/* variable length */][80];
+@item char lines[][80];
 Document lines.  The number of elements is defined by @code{n_lines}.
 Lines shorter than 80 characters are padded on the right with spaces.
 @end table
 
-@node Machine int32 Info Record
-@section Machine @code{int32} Info Record
+@node Machine Integer Info Record
+@section Machine Integer Info Record
 
-There must be no more than one machine @code{int32} info record per
-system file.  Machine @code{int32} info records must follow the variable
-records and precede the dictionary termination record.
-
-@example
-struct sysfile_machine_int32_info
-  @{
-    /* Header. */
-    int32               rec_type;
-    int32               subtype;
-    int32               size;
-    int32               count;
-
-    /* Data. */
-    int32               version_major;
-    int32               version_minor;
-    int32               version_revision;
-    int32               machine_code;
-    int32               floating_point_rep;
-    int32               compression_code;
-    int32               endianness;
-    int32               character_code;
-  @};
+The integer info record, if present, has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Data.} */
+int32               version_major;
+int32               version_minor;
+int32               version_revision;
+int32               machine_code;
+int32               floating_point_rep;
+int32               compression_code;
+int32               endianness;
+int32               character_code;
 @end example
 
 @table @code
@@ -475,27 +550,22 @@
 Windows code page numbers are also valid.
 @end table
 
-@node Machine flt64 Info Record
-@section Machine @code{flt64} Info Record
+@node Machine Floating-Point Info Record
+@section Machine Floating-Point Info Record
 
-There must be no more than one machine @code{flt64} info record per
-system file.  Machine @code{flt64} info records must follow the variable
-records and precede the dictionary termination record.
-
-@example
-struct sysfile_machine_flt64_info
-  @{
-    /* Header. */
-    int32               rec_type;
-    int32               subtype;
-    int32               size;
-    int32               count;
-
-    /* Data. */
-    flt64               sysmis;
-    flt64               highest;
-    flt64               lowest;
-  @};
+The floating-point info record, if present, has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Data.} */
+flt64               sysmis;
+flt64               highest;
+flt64               lowest;
 @end example
 
 @table @code
@@ -521,25 +591,23 @@
 The value used for LOWEST in missing values.
 @end table
 
-@node Auxiliary Variable Parameter Record
-@section Auxiliary Variable Parameter Record
+@node Variable Display Parameter Record
+@section Variable Display Parameter Record
 
-There must be no more than one auxiliary variable parameter record per
-system file.  This  record must follow the variable
-records and precede the dictionary termination record.
-
-@example
-struct sysfile_aux_var_parameter
-  @{
-    /* Header. */
-    int32               rec_type;
-    int32               subtype;
-    int32               size;
-    int32               count;
-
-    /* Data. */
-    struct aux_params   aux_params[/* variable length */];
-  @};
+The variable display parameter record, if present, has the following
+format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Repeated @code{count} times}. */
+int32               measure;
+int32               width;
+int32               alignment;
 @end example
 
 @table @code
@@ -550,28 +618,20 @@
 Record subtype.  Always set to 11.
 
 @item int32 size;
-The size  @code{int32}. Always set to 4.
+The size of @code{int32}.  Always set to 4.
 
 @item int32 count;
-The total number of records in @code{aux_params}, multiplied by 3.
-
-@item struct aux_params aux_params[];
-An array of @code{struct aux_params}.   The order of the elements corresponds 
-to the order of the variables in the Variable Records.  No element
-corresponds to variable records that continue long string variables.
-The @code{struct aux_params} type is defined as follows:
+The number of sets of variable display parameters (ordinarily the
+number of variables in the dictionary), times 3.
+@end table
 
-@example
-struct aux_params
-  @{
-    int32 measure;
-    int32 width;
-    int32 alignment;
-  @};
-@end example
+The remaining members are repeated @code{count} times, in the same
+order as the variable records.  No element corresponds to variable
+records that continue long string variables.  The meanings of these
+members are as follows:
 
 @table @code
-@item int32 measure
+@item int32 measure;
 The measurement type of the variable:  
 @table @asis
 @item 1
@@ -582,13 +642,13 @@
 Continuous Scale
 @end table
 
-Occasionally a value of 0 is seen here.  PSPP interprets this to mean
-a nominal scale.
+SPSS 14 sometimes writes a @code{measure} of 0.  PSPP interprets this
+as nominal scale.
 
-@item int32 width
+@item int32 width;
 The width of the display column for the variable in characters.
 
-@item int32 alignment
+@item int32 alignment;
 The alignment of the variable for display purposes:
 
 @table @asis
@@ -599,34 +659,22 @@
 @item 2
 Centre aligned
 @end table
-
-@end table
-
-
-
 @end table
 
-
-
 @node Long Variable Names Record
 @section Long Variable Names Record
 
-There must be no more than one long variable names record per
-system file.  This  record must follow the variable
-records and precede the dictionary termination record.
-
-@example
-struct sysfile_long_variable_names
-  @{
-    /* Header. */
-    int32               rec_type;
-    int32               subtype;
-    int32               size;
-    int32               count;
-
-    /* Data. */
-    char                var_name_pairs[/* variable length */];
-  @};
+If present, the long variable names record has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                var_name_pairs[];
 @end example
 
 @table @code
@@ -642,7 +690,7 @@
 @item int32 count;
 The total number of bytes in @code{var_name_pairs}.
 
-@item char var_name_pairs[/* variable length */];
+@item char var_name_pairs[];
 A list of @var{key}--@var{value} tuples, where @var{key} is the name
 of a variable, and @var{value} is its long variable name. 
 The @var{key} field is at most 8 bytes long and must match the
@@ -655,27 +703,61 @@
 The total length is @code{count} bytes.
 @end table
 
-@node Very Long String Length Record
-@comment  node-name,  next,  previous,  up
-@section Very Long String Length Record
-
-
-There must be no more than one very long string length record per
-system file.  This  record must follow the variable records and precede the 
-dictionary termination record. 
-
-@example
-struct sysfile_very_long_string_lengths
-  @{
-    /* Header. */
-    int32               rec_type;
-    int32               subtype;
-    int32               size;
-    int32               count;
-
-    /* Data. */
-    char                string_lengths[/* variable length */];
-  @};
+@node Very Long String Record
+@section Very Long String Record
+
+Old versions of SPSS limited string variables to a width of 255 bytes.
+For backward compatibility with these older versions, the system file
+format represents a string longer than 255 bytes, called a @dfn{very
+long string}, as a collection of strings no longer than 255 bytes
+each.  The strings concatenated to make a very long string are called
+its @dfn{segments}; for consistency, variables other than very long
+strings are considered to have a single segment.
+
+A very long string with a width of @var{w} has @var{n} =
+(@var{w} + 251) / 252 segments, that is, one segment for every
+252 bytes of width, rounding up.  It would be logical, then, for each
+of the segments except the last to have a width of 252 and the last
+segment to have the remainder, but this is not the case.  In fact,
+each segment except the last has a width of 255 bytes.  The last
+segment has width @var{w} - (@var{n} - 1) * 252; some versions
+of SPSS make it slightly wider, but not wide enough to make the last
+segment require another 8 bytes of data.
+
+Data is packed tightly into segments of a very long string, 255 bytes
+per segment.  Because 255 bytes of segment data are allocated for
+every 252 bytes of the very long string's width (approximately), some
+unused space is left over at the end of the allocated segments.  Data
+in unused space is ignored.
+
+Example: Consider a very long string of width 20,000.  Such a very
+long string has 20,000 / 252 = 80 (rounding up) segments.  The first
+79 segments have width 255; the last segment has width 20,000 - 79 *
+252 = 92 or slightly wider (up to 96 bytes, the next multiple of 8).
+The very long string's data is actually stored in the 19,890 bytes in
+the first 78 segments, plus the first 110 bytes of the 79th segment
+(19,890 + 110 = 20,000).  The remaining 145 bytes of the 79th segment
+and all 92 bytes of the 80th segment are unused.
+
+The very long string record explains how to stitch together segments
+to obtain very long string data.  For each of the very long string
+variables in the dictionary, it specifies the name of its first
+segment's variable and the very long string variable's actual width.
+The remaining segments immediately follow the named variable in the
+system file's dictionary.
+
+The very long string record, which is present only if the system file
+contains very long string variables, has the following format:
+
+@example
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{count} bytes of data.} */
+char                string_lengths[];
 @end example
 
 @table @code
@@ -691,7 +773,7 @@
 @item int32 count;
 The total number of bytes in @code{string_lengths}.
 
-@item char string_lengths[/* variable length */];
+@item char string_lengths[];
 A list of @var{key}=@var{value} tuples, where @var{key} is the name
 of a variable, and @var{value} is its length.
 The @var{key} field is at most 8 bytes long and must match the
@@ -705,30 +787,22 @@
 The total length is @code{count} bytes.
 @end table
 
-
-
 @node Miscellaneous Informational Records
 @section Miscellaneous Informational Records
 
-Miscellaneous informational records must follow the variable records and
-precede the dictionary termination record.
-
 Some specific types of miscellaneous informational records are
 documented here, but others are known to exist.  PSPP ignores unknown
 miscellaneous informational records when reading system files.
 
 @example
-struct sysfile_misc_info
-  @{
-    /* Header. */
-    int32               rec_type;
-    int32               subtype;
-    int32               size;
-    int32               count;
-
-    /* Data. */
-    char                data[/* variable length */];
-  @};
+/* @r{Header.} */
+int32               rec_type;
+int32               subtype;
+int32               size;
+int32               count;
+
+/* @r{Exactly @code{size * count} bytes of data.} */
+char                data[];
 @end example
 
 @table @code
@@ -741,13 +815,14 @@
 indicates date info (probably related to USE).
 
 @item int32 size;
-Size of each piece of data in the data part.  Should have the value 4 or
-8, for @code{int32} and @code{flt64}, respectively.
+Size of each piece of data in the data part.  Should have the value 1,
+4, or 8, for @code{char}, @code{int32}, and @code{flt64} format data,
+respectively.
 
 @item int32 count;
 Number of pieces of data in the data part.
 
-@item char data[/* variable length */];
+@item char data[];
 Arbitrary data.  There must be @code{size} times @code{count} bytes of
 data.
 @end table
@@ -755,16 +830,12 @@
 @node Dictionary Termination Record
 @section Dictionary Termination Record
 
-The dictionary termination record must follow all other records, except
-for the actual cases, which it must precede.  There must be exactly one
-dictionary termination record in every system file.
+The dictionary termination record separates all other records from the
+data records.
 
 @example
-struct sysfile_dict_term
-  @{
-    int32               rec_type;
-    int32               filler;
-  @};
+int32               rec_type;
+int32               filler;
 @end example
 
 @table @code
@@ -778,7 +849,7 @@
 @node Data Record
 @section Data Record
 
-Data records must follow all other records in the data file.  There must
+Data records must follow all other records in the system file.  There must
 be at least one data record in every system file.
 
 The format of data records varies depending on whether the data is
@@ -790,10 +861,10 @@
 the variable declared in the respective variable record (@pxref{Variable
 Record}).  Numeric values are given in @code{flt64} format; string
 values are literal character strings, padded on the right when
-necessary.
+necessary to fill out 8-byte units.
 
-Compressed data is arranged in the following manner: the first 8-byte
-element in the data section is divided into a series of 1-byte command
+Compressed data is arranged in the following manner: the first 8 bytes
+in the data section are divided into a series of 1-byte command
 codes.  These codes have meanings as described below:
 
 @table @asis
@@ -803,10 +874,10 @@
 bytes remaining at the end of a fixed-size block.
 
 @item 1 through 251
-These values indicate that the corresponding numeric variable has the
-value @code{(@var{code} - @var{bias})} for the case being read, where
+A number with
+value @var{code} - @var{bias}, where
 @var{code} is the value of the compression code and @var{bias} is the
-variable @code{compression_bias} from the file header.  For example,
+variable @code{bias} from the file header.  For example,
 code 105 with bias 100.0 (the normal value) indicates a numeric variable
 of value 5.
 
@@ -815,21 +886,21 @@
 stream.  PSPP always outputs this code but its use is not required.
 
 @item 253
-This value indicates that the numeric or string value is not
-compressible.  The value is stored in the 8-byte element following the
+A numeric or string value that is not
+compressible.  The value is stored in the 8 bytes following the
 current block of command bytes.  If this value appears twice in a block
-of command bytes, then it indicates the second element following the
+of command bytes, then it indicates the second group of 8 bytes following the
 command bytes, and so on.
 
 @item 254
-Used to indicate a string value that is all spaces.
+An 8-byte string value that is all spaces.
 
 @item 255
-Used to indicate the system-missing value.
+The system-missing value.
 @end table
 
-When the end of the first 8-byte element of command bytes is reached,
-any blocks of non-compressible values are skipped, and the next element
-of command bytes is read and interpreted, until the end of the file is
-reached.
+When the end of an 8-byte group of command bytes is reached, any
+blocks of non-compressible values indicated by code 253 are skipped,
+and the next element of command bytes is read and interpreted, until
+the end of the file or a code with value 252 is reached.
 @setfilename ignored
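
The segment arithmetic in the Very Long String Record section of the patch
can be checked with a short sketch.  This is only an illustration in Python,
not code from PSPP; the function names segment_widths and
used_bytes_per_segment are made up for this example.  It assumes the
"exact" last-segment width and ignores the slightly wider last segment that
some versions of SPSS write.

```python
def segment_widths(w):
    """Declared width of each segment of a string variable of width w.

    Hypothetical helper: strings of 255 bytes or fewer have one segment.
    """
    if w <= 255:
        return [w]
    n = (w + 251) // 252            # one segment per 252 bytes, rounding up
    # Every segment but the last is declared 255 bytes wide.
    return [255] * (n - 1) + [w - (n - 1) * 252]


def used_bytes_per_segment(w):
    """Bytes of real data held by each segment.

    Data is packed tightly, 255 bytes per segment, so trailing
    segments may be partly or wholly unused.
    """
    used = []
    remaining = w
    for _ in segment_widths(w):
        take = min(255, remaining)
        used.append(take)
        remaining -= take
    return used
```

For a width of 20,000 this gives 80 segments with a last segment of width
92, and the data ends 110 bytes into the 79th segment, matching the worked
example in the patch.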


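The compressed data layout described in the Data Record section of the patch
can be sketched the same way.  Again this is a hedged illustration, not
PSPP's reader: decode_compressed is a hypothetical name, None stands in for
the system-missing value, and code 0 is assumed to be the padding code whose
description falls outside the hunk shown.  A value introduced by code 253 is
returned as 8 raw bytes, because whether it is a flt64 number or string data
depends on the variable being read, which this sketch does not track.

```python
def decode_compressed(data, bias=100.0):
    """Decode a compressed data section into a flat list of values.

    data is a bytes object; bias is the bias from the file header.
    """
    values = []
    pos = 0
    while pos + 8 <= len(data):
        codes = data[pos:pos + 8]       # one group of 8 command bytes
        pos += 8
        for code in codes:
            if code == 0:
                continue                # assumed: padding, ignored
            elif code <= 251:
                values.append(code - bias)
            elif code == 252:
                return values           # end of the compressed stream
            elif code == 253:
                # Uncompressed value: the next unread 8 bytes after
                # the current group of command bytes.
                values.append(data[pos:pos + 8])
                pos += 8
            elif code == 254:
                values.append(b" " * 8)  # 8-byte string of spaces
            else:
                values.append(None)      # 255: system-missing
    return values
```

For example, the command bytes 105, 253, 254, 255, 0, 0, 0, 252 followed by
the 8 bytes "ABCDEFGH" decode to the number 5.0, the raw bytes "ABCDEFGH", a
string of 8 spaces, and one system-missing value.  The sketch does not
validate truncated input.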