Re: Import large field-delimited file with strings and numbers


From: Philip Nienhuis
Subject: Re: Import large field-delimited file with strings and numbers
Date: Thu, 11 Sep 2014 14:21:34 -0700 (PDT)

Joao Rodrigues wrote
> On 08-09-2014 17:49, Philip Nienhuis wrote:
>>> 
> <snip>
>>> Yet, csv2cell is orders of magnitude faster. I will break the big file
>>> into chunks (using fileread, strfind to determine newlines and fprintf)
>>> and then apply csv2cell chunk-wise.
>> Why do you need to break it up using csv2cell? AFAICS that reads the
>> entire
>> file and directly translates the data into "values" in the output cell
>> array, using very little temporary storage (the latter quite unlike
>> textscan/strread).
>> It does read the entire file twice, once to assess the required
>> dimensions
>> for the cell array, the second (more intensive) pass for actually reading
>> the data.
> The file I want to read has around 35 million rows, 15 columns and takes
> 200 MB of disk space: csv2cell would simply eat up all memory and the
> computer would stop responding.
> 
> I tried to feed it small chunks of increasing size and found out that it
> behaved well until it received a chunk of 500 million rows (when memory
> use went through the stratosphere).
> 
> So I opted for the clumsy solution of breaking the file into small
> pieces and spoon-feeding csv2cell.
> 
> But then I found out something interesting. If I saved a cell with
> 35 million rows and only 3 columns in gzip format, it would take very
> little disk space (20 MB or so), but when I tried to open it... it would
> again take forever and eat up GBs of memory.
> 
> Bottom line: I think it has to do with the way Octave allocates memory 
> to cells, which is not very efficient (as opposed to dense or sparse 
> numerical data, which it handles very well).
> 
> I managed to solve the problem I had, thanks to the help of you guys.
> 
> However, I think it would probably be nice if in future versions of 
> Octave there was something akin to ulimit installed by default to 
> prevent a process from eating up all available memory.
> 
> If someone wants to check this issue, the data I am working with is public:
> 
> http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip
> 
> where * = 1990:2013
> 
> http://www.bls.gov/cew/datatoc.htm explains the content.
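
As an aside, a minimal sketch of the chunk-splitting approach described above
(fileread + strfind for the newlines, then csv2cell per chunk) could look
roughly like the code below. The file names, chunk size and temporary chunk
file are just illustrative, and csv2cell needs the io package loaded:

pkg load io;                                    % csv2cell lives in the io package

txt = fileread ('2013.annual.singlefile.csv');  % whole file as one char array
nl  = strfind (txt, "\n");                      % positions of the newlines

chunklines = 500000;                            % lines per chunk (arbitrary choice)
parts = {};
startpos = 1;
for k = chunklines:chunklines:numel (nl)
  fid = fopen ('chunk.csv', 'w');
  fprintf (fid, '%s', txt(startpos:nl(k)));     % write whole lines only
  fclose (fid);
  parts{end+1} = csv2cell ('chunk.csv');        % parse just this chunk
  startpos = nl(k) + 1;
endfor
if (startpos <= numel (txt))                    % any remaining (partial) chunk
  fid = fopen ('chunk.csv', 'w');
  fprintf (fid, '%s', txt(startpos:end));
  fclose (fid);
  parts{end+1} = csv2cell ('chunk.csv');
endif
data = vertcat (parts{:});   % NB: reassembling everything still needs the full RAM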

I downloaded the 2013 file and gave it a try with csv2cell on a 64-bit Octave.
csv2cell() didn't even need the new headerlines parameter - it is a neat
.csv file from top to bottom.
Results:


>> tic; data = csv2cell ('2013.annual.singlefile.csv'); toc
Elapsed time is 20.2152 seconds.
>> size (data)
ans =

   3565139        15

>> whos
Variables in the current scope:

   Attr Name         Size                     Bytes  Class
   ==== ====         ====                     =====  =====
        ans          1x2                         16  double
        data   3565139x15                 354645851  cell

Total is 53477087 elements using 354645867 bytes

>>

...and Octave's memory usage is ~4.6 GB (total occupied RAM on my Win7-64b
box was 5.75 GB). So you'd need at least a 64-bit Octave + a 64-bit OS. For
Windows an (experimental but IMO fairly good) 64-bit Octave is available
these days.

Even after stripping away the rightmost columns, saving the result to a .mat
file, restarting Octave and reading back the .mat file, Octave still needs >
4 GB to read the file. Once in the workspace the data occupies > 2 GB RAM,
while according to "whos" the cell array (3565139 x 4) occupies ~100 MB.
Puzzling numbers... as you say, Octave apparently needs a lot more RAM
behind the scenes to hold such big cell arrays.
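
The test itself was along these lines (file name and save format are
illustrative, not necessarily what I actually typed):

data = data(:, 1:4);                        % keep only the 4 leftmost columns
save ('-v7', 'cew2013_4cols.mat', 'data');  % save the stripped cell array to a .mat file
clear all
% ... restart Octave ...
load ('cew2013_4cols.mat');                 % loading alone pushes RAM use past 4 GB
whos data                                   % ...yet whos reports only ~100 MB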

Philip






