Re: Import large field-delimited file with strings and numbers

help-octave

From:	João Rodrigues
Subject:	Re: Import large field-delimited file with strings and numbers
Date:	Sat, 06 Sep 2014 22:39:25 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

I need to import a large CSV file with multiple columns with mixed
string and number entries, such as:

field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,

and I want to pass this on to something like

cell1 = {'A'; 'B'; 'C'};
cell2 = {'a'; 'b'; 'c'};
arr3 = [1 2 3]';
arr4 = [1.0 2.0 3.0]';

Furthermore, some columns can be ignored, the total number of entries is
known, and there is a header.

In response to Thomas Dean, Ben Abbott and Philip Nienhuis: Thanks for the tips. (I did not know about either the fileread or the csv2cell function.)

Nevertheless, the problem persists. Whether I use fscanf, cellfun + strsplit or csv2cell (indeed the fastest), the memory requirements blow up (4 GB of RAM plus a few more GB of swap) when I import the big CSV (~3.5 million lines, 15 columns, 200 MB).

By contrast, if I use a loop with strsplit it takes forever (~30 min), but memory use is just a few hundred MB.

> Another useful trick that I sometimes use myself would be to read the file
> with textscan but then in chunks. You can specify the number of lines to
> read. textscan should remember the file position (see "help textscan").
> After having read chunk# N, you can simply restart textscan (w/o headerlines
> param!) to read chunk# N+1, and repeat until EOF.

I guess this is the right direction. I had never noticed the N option in textscan. Using textscan to read chunk by chunk is much faster (~8 min).
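
For reference, the chunked textscan approach can be sketched roughly as below. This is only an illustration under some assumptions: the sample file, the chunk size and the variable names (fname, ntotal, chunk, col1, col3, col4) are made up for the example, the header is skipped with a single fgetl call instead of the headerlines parameter, and field2 is dropped with %*s to show how a column can be ignored. It relies on the total number of rows being known, as stated above.

```octave
% Build a tiny sample file so the sketch runs on its own (hypothetical data).
fname = [tempname() ".csv"];
fid = fopen(fname, "w");
fputs(fid, "field1,field2,field3,field4\n");
fputs(fid, "A,a,1,1.0\nB,b,2,2.0\nC,c,3,3.0\nD,d,4,4.0\nE,e,5,5.0\n");
fclose(fid);

ntotal = 5;   % number of data rows, assumed known in advance
chunk  = 2;   % rows per textscan call; something like 1e5 for a real file

% Preallocate the outputs, since the total number of rows is known.
col1 = cell(ntotal, 1);
col3 = zeros(ntotal, 1);
col4 = zeros(ntotal, 1);

fid = fopen(fname, "r");
fgetl(fid);   % skip the header line once, before the first chunk
nread = 0;
while nread < ntotal
  n = min(chunk, ntotal - nread);
  % %*s skips field2; textscan continues from the current file position.
  c = textscan(fid, "%s %*s %f %f", n, "Delimiter", ",");
  got = numel(c{1});
  if got == 0
    break;    % nothing more could be read (e.g. the file is shorter than expected)
  end
  idx = nread + (1:got);
  col1(idx) = c{1};
  col3(idx) = c{2};
  col4(idx) = c{3};
  nread = nread + got;
end
fclose(fid);
delete(fname);
```

Only one chunk is held as a raw textscan result at any time, which is what keeps the peak memory use down; the chunk size trades speed against memory.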

Yet, csv2cell is orders of magnitude faster. I will break the big file into chunks (using fileread, strfind to determine newlines, and fprintf) and then apply csv2cell chunk-wise.
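
A rough sketch of that plan, under stated assumptions: the sample file, the chunk size and all variable names (fname, lines_per_chunk, chunkfile, nrows) are hypothetical, csv2cell comes from the io package (which must be installed), and the per-chunk processing is reduced to counting rows. It is not tuned for a 200 MB file.

```octave
pkg load io   % csv2cell is provided by the io package

% Build a tiny sample file so the sketch runs on its own (hypothetical data).
fname = [tempname() ".csv"];
fid = fopen(fname, "w");
fputs(fid, "field1,field2,field3,field4\n");
fputs(fid, "A,a,1,1.0\nB,b,2,2.0\nC,c,3,3.0\nD,d,4,4.0\nE,e,5,5.0\n");
fclose(fid);

lines_per_chunk = 2;          % something like 5e5 for a real file

txt = fileread(fname);        % whole file as one char array
nl  = strfind(txt, char(10)); % positions of all newline characters
if isempty(nl) || nl(end) < numel(txt)
  nl(end+1) = numel(txt);     % no trailing newline: treat end of text as a line end
end
nlines = numel(nl) - 1;       % data lines; nl(1) terminates the header

nrows = 0;
for i = 1:lines_per_chunk:nlines
  j = min(i + lines_per_chunk - 1, nlines);
  % Data line i starts right after newline i and data line j ends at newline j+1.
  chunkfile = [tempname() ".csv"];
  fid = fopen(chunkfile, "w");
  fprintf(fid, "%s", txt(nl(i)+1 : nl(j+1)));
  fclose(fid);

  c = csv2cell(chunkfile);    % cell array with one row per line of this chunk
  nrows = nrows + size(c, 1); % placeholder for the real per-chunk processing
  delete(chunkfile);
end
delete(fname);
```

The text of the file stays in memory as a char array (about one byte per character), while the much larger cell representation only ever exists for one chunk at a time.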


Thank you all
Joao
