help-octave

Re: Import large field-delimited file with strings and numbers


From: João Rodrigues
Subject: Re: Import large field-delimited file with strings and numbers
Date: Sat, 06 Sep 2014 22:39:25 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0


I need to import a large CSV file with multiple columns with mixed
string and number entries, such as:

field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,

and I want to pass this on to something like

cell1 = {'A'; 'B'; 'C'};
cell2 = {'a'; 'b'; 'c'};
arr3 =[1 2 3]';
arr4 =[1.0 2.0 3.0]';

Furthermore, some columns can be ignored, the total number of entries is
known, and there is a header.

In response to Thomas Dean, Ben Abbott and Philip Nienhuis: thanks for the tips. (I did not know about either the fileread or the csv2cell function.)

Nevertheless, the problem persists. Whether I use fscanf, cellfun + strsplit, or csv2cell (indeed the fastest), the memory requirements blow up (4 GB of RAM plus a few more GB of swap) when I import the big CSV (~3.5 million lines, 15 columns, 200 MB).

By contrast, a loop with strsplit takes forever (~30 min), but memory use is just a few hundred MB.
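
For reference, the slow but memory-light loop could look roughly like the sketch below. It is only an illustration for the four-column sample above: the function name read_csv_loop and its arguments are made up here, and it assumes the row count is known in advance (as stated earlier) so that the outputs can be preallocated.

```octave
% Sketch only: read_csv_loop, fname and n_rows are illustrative names.
% Assumes one header line and four fields per line (string, string,
% number, number), as in the sample above.
function [cell1, cell2, arr3, arr4] = read_csv_loop (fname, n_rows)
  % Preallocate, since the number of entries is known beforehand.
  cell1 = cell (n_rows, 1);
  cell2 = cell (n_rows, 1);
  arr3 = zeros (n_rows, 1);
  arr4 = zeros (n_rows, 1);

  fid = fopen (fname, 'r');
  if (fid < 0)
    error ('read_csv_loop: could not open %s', fname);
  end
  fgetl (fid);                          % discard the header line

  for k = 1:n_rows
    txt = fgetl (fid);
    if (~ ischar (txt))                 % fgetl returns -1 at end of file
      break;
    end
    f = strtrim (strsplit (txt, ','));  % split on commas, drop padding
    cell1{k} = f{1};
    cell2{k} = f{2};
    arr3(k) = str2double (f{3});
    arr4(k) = str2double (f{4});
  end
  fclose (fid);
end
```

Only one line is held as text at any time, which is why memory use stays low; the price is one strsplit call per line. If the file ends before n_rows lines have been read, the remaining preallocated entries are left at their initial values.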

> Another useful trick that I sometimes use myself would be to read the file
> with textscan but then in chunks. You can specify the number of lines to
> read. textscan should remember the file position (see "help textscan").
> After having read chunk# N, you can simply restart textscan (w/o headerlines
> param!) to read chunk# N+1, and repeat until EOF.

I guess this is the right direction. I had never noticed the N option in textscan. Using textscan to read chunk by chunk is much faster (~ 8 min).
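
A minimal sketch of the chunked textscan approach, again for the four-column sample. The function name, the chunk size and the format string are assumptions for illustration, and the sketch assumes data lines with exactly four fields and no trailing comma. The header is skipped once with fgetl, which has the same effect as passing the headerlines parameter on the first call only.

```octave
% Sketch only: read_csv_textscan_chunks and chunk_size are illustrative.
% Assumes four fields per line (string, string, number, number) and no
% trailing delimiter at the end of a line.
function [cell1, cell2, arr3, arr4] = read_csv_textscan_chunks (fname, chunk_size)
  fmt = '%s %s %f %f';      % use e.g. %*s instead of %s to skip a column

  fid = fopen (fname, 'r');
  if (fid < 0)
    error ('read_csv_textscan_chunks: could not open %s', fname);
  end
  fgetl (fid);              % skip the header once, before the first chunk

  cell1 = cell (0, 1);
  cell2 = cell (0, 1);
  arr3 = zeros (0, 1);
  arr4 = zeros (0, 1);
  while (true)
    % Read at most chunk_size rows; the file position is kept in fid,
    % so the next call continues where this one stopped.
    c = textscan (fid, fmt, chunk_size, 'Delimiter', ',');
    if (isempty (c{1}))     % nothing left to read
      break;
    end
    cell1 = [cell1; c{1}];
    cell2 = [cell2; c{2}];
    arr3 = [arr3; c{3}];
    arr4 = [arr4; c{4}];
  end
  fclose (fid);
end
```

Because the numeric columns are appended to plain arrays after every chunk, only one chunk at a time exists in textscan's intermediate form. Since the total number of entries is known, the outputs could also be preallocated and filled by index instead of being grown.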

Yet, csv2cell is orders of magnitude faster. I will break the big file into chunks (using fileread, strfind to determine newlines and fprintf) and then apply csv2cell chunk-wise.
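
The splitting step might be sketched as follows. The function name apply_chunkwise and its arguments are made up for illustration; parse_chunk is any function handle that takes a file name, so that the splitting logic does not depend on a particular parser.

```octave
% Sketch only: apply_chunkwise, lines_per_chunk and parse_chunk are
% illustrative names. The whole file is held in memory once, as text.
function out = apply_chunkwise (fname, lines_per_chunk, parse_chunk)
  out = {};
  txt = fileread (fname);
  if (isempty (txt))
    return;
  end
  nl = strfind (txt, "\n");            % positions of all newlines
  if (isempty (nl) || nl(end) ~= numel (txt))
    nl(end+1) = numel (txt);           % last line has no trailing newline
  end
  n_lines = numel (nl);

  tmp = [tempname() '.csv'];           % temporary file, reused per chunk
  start_line = 1;
  while (start_line <= n_lines)
    stop_line = min (start_line + lines_per_chunk - 1, n_lines);
    if (start_line == 1)
      first = 1;
    else
      first = nl(start_line - 1) + 1;  % character after previous newline
    end
    last = nl(stop_line);

    fid = fopen (tmp, 'w');
    fprintf (fid, '%s', txt(first:last));
    fclose (fid);

    out{end+1} = parse_chunk (tmp);    % e.g. @csv2cell from the io package
    start_line = stop_line + 1;
  end
  delete (tmp);
end
```

With the io package loaded (pkg load io), a call such as parts = apply_chunkwise ('big.csv', 500000, @csv2cell) would return one cell array per chunk; the file name and chunk size here are placeholders. The first chunk still contains the header line. To actually keep memory down, parse_chunk should probably be a small wrapper that keeps only the needed columns and converts the numeric ones to plain arrays before returning.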

Thank you all
Joao







