help-octave

Re: Textscan and csv fitness data problem


From: Philip Nienhuis
Subject: Re: Textscan and csv fitness data problem
Date: Wed, 3 Jan 2018 23:47:50 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 SeaMonkey/2.48

Ben Abbott wrote:
On Jan 3, 2018, at 12:19 AM, PhilipNienhuis <address@hidden> wrote:

bpabbott wrote:
On Jan 1, 2018, at 11:51 AM, PhilipNienhuis <pr.nienhuis@…> wrote:

NJank wrote:
On Jan 1, 2018 12:13 PM, "PhilipNienhuis" <pr.nienhuis@…> wrote:

NJank wrote:
On Jan 1, 2018 9:06 AM, "Thomas Esbensen"


As to textscan, Dan did a lot of good work lately; I think the bugs you
implied have been fixed in the development branch.


Yeah, I noticed that. Would those make it into a 4.2.2 release, or not until
4.4.0?
I've been keeping my fingers crossed that it would suddenly "just work" and I
wouldn't have to dive into his data again.

Have a look in the log: http://hg.savannah.gnu.org/hgweb/octave
Bugs 52116 and 52479 have been fixed on stable; the last one (bug 52550)
has not. If you want, you can ask in the latter bug report to have it
backported to stable.

As to csv2cell's erroneous column conversion, I've fixed that stupid bug
and
pushed it. To use it, get csv2cell.cc from here:

http://hg.code.sf.net/p/octave/io/file/31b7ff5ee040/src/csv2cell.cc

and then do

mkoctfile csv2cell.cc

to build a fixed version. Swap it into place, using
"pkg load io; which csv2cell"
to find out where it should live, followed by
"pkg unload io; clear -f"
to clear the way for copying (otherwise csv2cell.oct is locked), and then
copy csv2cell.oct into place.
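
Put together, the whole procedure could look roughly like this at the Octave
prompt (a minimal sketch; the destination path is only a placeholder for
whatever "which csv2cell" reports on your system):

  mkoctfile csv2cell.cc               % build the fixed version in the current dir
  pkg load io
  which csv2cell                      % note the directory holding csv2cell.oct
  pkg unload io
  clear -f                            % otherwise the installed csv2cell.oct stays locked
  copyfile ("csv2cell.oct", "/path/reported/by/which/csv2cell.oct");
  pkg load io                         % reload; csv2cell now uses the fixed build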

Philip

The original file has lines with a varied number of columns.  As a result
…

error: csv2cell: incorrect CSV file, line 2 too short

The first row contains the column labels (127 of them), and the 2nd row only
has 19 columns (18 commas). There are other rows deep in the file with 127
columns too.

Sure, but if you call csv2cell with a spreadsheet-style range as its 2nd argument,
it'll read .csv files with a varying nr. of fields per row just fine; see my
first answer in this thread.
If you want to read all of the file, just supply a sufficiently large range;
it'll fill empty fields beyond the current line length with "" (empty string).
See "help csv2cell".

The only practical limit is the max line length of 4096 chars (a #DEFINEd
setting; changing that is easy as csv2cell() is just an .oct file).

(of course, as usual I can only vouch for csv2cell() to work fine on the 4
boxes I have access to: my 2 multiboot Linux/Win7/Win10 boxes + 2 Win7 boxes
at work.)

Philip

Ok. I'm not familiar with the history behind the default behavior. I was
expecting the default behavior to load the full csv file. Given the current
design, I don't know how to determine the actual size of the file. Meaning that
when a range of rows/cols is specified, there is no way to be sure all the
information is included.

Sure, I sympathize with your (and probably anyone else's) expectation. But csv2cell isn't so flexible yet.

I usually inspect csv files with, e.g., Notepad++ before feeding them to Octave. For csv2cell one can always specify a range that is sufficiently "wide" to contain all possible columns (max 4096, see [*] below). Afterwards one can invoke parsecell.m in the io package to separate text and numerical info; the resulting arrays are stripped of enveloping empty columns/rows. Alternatively, strip the empty outer columns by hand (e.g., by re-using the code in parsecell.m).
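
Roughly like this (a sketch only; the file name, range and variable names are
just examples):

  pkg load io
  raw = csv2cell ("fitness.csv", "A1:DW5000");    % deliberately over-wide range
  [numarr, txtarr] = parsecell (raw);             % numeric and text parts, with
                                                  % enveloping empty rows/cols stripped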

csv files can also be read into Octave by LibreOffice (or Excel) using xlsread or odsread.

If the behavior were changed such that "error: csv2cell: incorrect CSV file,
line 2 too short" was replaced by "warning: csv2cell: line 2 has fewer columns than
the prior lines" and the entire file were read, would there be an adverse
impact on compatibility?

Compatibility? You mean with the competition? Matlab doesn't have csv2cell or my other "easy" function to read mixed-type delimited files.

When implementing csv2cell's "range" option some io package releases ago, I changed the actual reading part so that variable numbers of fields per line are now easily coped with. The crux is efficiently finding the required number of columns, so that the output array can also be preallocated efficiently. For a small file an initial 4096 columns and resizing afterwards could be fine, but I've read up to GB-size files with csv2cell, and then cutting down on the initial output array size becomes vital.
(10^6 lines (= not extreme) times 4096 columns is about 4.1e9 cells, beyond the 2^31 - 1 limit of 32-bit indexing, so it needs 64-bit indexing.)

I'm open to suggestions, but implementation will be in a future io-2.4.10 release, as I think this needs careful thinking over. (FYI, yesterday I opened a ticket for an io-2.4.9 release.)

Philip

[*] The line buffer is currently 4096 characters. A line can consist of nothing but consecutive separators, separating 4096 empty fields (or is it 4097 empty fields?). So there's the default max nr. of columns to take into account.


