help-octave

Re: indexing expression performance


From: Nicholas Tung
Subject: Re: indexing expression performance
Date: Wed, 14 Jan 2009 11:24:43 -0500

On Wed, Jan 14, 2009 at 01:36, Jaroslav Hajek <address@hidden> wrote:
On Wed, Jan 14, 2009 at 5:00 AM, Nicholas Tung <address@hidden> wrote:
> On Tue, Jan 13, 2009 at 16:46, Francesco Potortì <address@hidden>
> wrote:
>>
>> >I've noticed that indexing expressions on large arrays can be quite slow,
>>
>> Have a look at this, which I wrote in November but for some reason does
>> not appear in the list archives.  In an application of mine, I wrote an
>> ad hoc function to access a 5-dim matrix using the last method below,
>> getting a significant speedup.  I usually run 3.0.1, and I did not make
>> any check to compare its speed with the more recent Octave versions.
>>
>> |Date: 11 Nov 2008 10:44:45 +0100
>> |From: Francesco Potortì <address@hidden>
>> |To: Jaroslav Hajek <address@hidden>
>> |CC: Octave help list <address@hidden>
>> |Subject: Re: slow access to consecutive matrix elements
>> |
>> |>>On Sat, Oct 11, 2008 at 10:22 AM, Francesco Potortì
>> |><address@hidden> wrote:
>> |>> This is a real life example that demonstrates how access to matrix
>> |>> elements is not optimised for the case of complete ranges in the first
>> |>> dimensions.  I am sending it here and not to the bug list because this
>> |>> example may be useful to others, at least until this case is
>> |>> optimised in the sources.
>> |>>
>> |>> # This is a big 5-dim matrix, about 100MB size
>> |>> octave> kk=rand(156,222,1,44,8);
>> |>>
>> |>> # I access a slice of it, which is sequential in memory
>> |>> octave> t=cputime; for ii=1:44, for jj=1:8
>> |>>  mm=kk(:,:,:,ii,jj); endfor, endfor, cputime-t
>> |>> ans =  5.9124
>> |>>
>> |>> # Using a linear index is much faster
>> |>> octave> span=(1:156*222);
>> |>>        t=cputime; for ii=1:44, for jj=1:8
>> |>>  mm=kk(sub2ind(size(kk),1,1,1,ii,jj)+span-1); endfor, endfor, cputime-t
>> |>> ans =  2.6642
>> |>>
>> |>> # Removing the call to sub2ind reaches top speed
>> |>> octave> cp=[1,cumprod(size(kk)(1:end-1))]; span=(1:156*222);
>> |>>        t=cputime; for ii=1:44, for jj=1:8
>> |>>  mm=kk(sum(([1,1,1,ii,jj]-1).*cp)+span); endfor, endfor, cputime-t
>> |>> ans =  1.4001
>> |>>
>> |>> Still, I wish access were faster yet.  Is there a reason why copying
>> |>> consecutive memory is so slow?  I wish I could help with optimising
>> |>> this, even if I am certainly not the most qualified person to do it.
>> |>>
>> |>
>> |>I guess the indexing routines do deserve some attention w.r.t
>> |>performance. Reducing code duplication would also be nice. I have this
>> |>on my TODO list, but I don't think it's a good idea to do it before
>> |>3.2.x is forked, as such changes are, IMO, likely to introduce bugs.
>> |
>> |Following up on this, I realised that there is room for further
>> |significant speedup:
>> |
>> |# Be careful not to convert ranges into matrices
>> |octave> cp=[1,cumprod(size(kk)(1:end-1))]; len=156*222;
>> |      t=cputime; for ii=1:44, for jj=1:8
>> |base=sum(([1,1,1,ii,jj]-1).*cp); mm=kk(base+1:base+len); endfor, endfor, cputime-t
>> |ans =  0.26402
>> |
>> |The fact is, I was assuming that a range remains a range even after a
>> |linear transformation, but this does not appear to be the case:
>> |
>> |octave3.1> typeinfo(1:4)
>> |ans = range
>> |octave3.1> typeinfo(4+(1:4))
>> |ans = matrix
>> |octave3.1> typeinfo(4*(1:4))
>> |ans = matrix
>> |
>> |From the depth of my ignorance about Octave's internals, I dare say that
>> |it should not be too difficult to keep ranges as ranges even after sum
>> |or product with a scalar.  Maybe even after sum with a range with the
>> |same number of elements.  Am I wrong?
>
> Neat, I didn't know that 1-D vector indexing worked on matrices. However, I
> think it may be faster for you because of the multi-dimensional aspect,
> whereas I am working with more values (potentially large image samples). On
> Octave 3.1.51+ (built today) the following code runs in about the same time.
> If you're doing something more complicated, please tell me. I know the two
> snippets below don't compute exactly the same value...
>
> tic, x1 = 1:299; x2 = 2:300; for i = 1:100, a(:, x1) += a(:, x2); endfor, toc
> x1 = 1:89700; x2 = 301:90000; tic, for i = 1:100, a(x1) += a(x2); endfor, toc
>
> As another note, it doesn't seem to matter that much which way the matrix is
> indexed for the first, i.e. a(x1, :) vs a(:, x1).
>
> Jaroslav - I updated to hg head and it was faster (cut time in half for the
> test example), but on real code not quite as much as I'd hoped.
>
> Nicholas
>
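The two quoted snippets touch the same memory when `a` is 300x300: column slices are contiguous in column-major order, so `a(:, cols)` and the corresponding linear range address identical elements. A quick check (sizes chosen to match the snippets; the snippets themselves shift by one column versus 300 elements, so they are not numerically identical, as Nicholas notes):

```octave
% Column indexing vs. the equivalent linear-index range on a 300x300
% matrix: element (r,c) has linear index r + (c-1)*300, so columns
% 2:300 occupy linear indices 301:90000.
a = rand (300, 300);
cols = 2:300;          % column-based index
lin  = 301:90000;      % the same elements as one linear range
assert (isequal (a(:, cols)(:), a(lin)(:)));
```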

Hi Nicholas,

if you can identify bottlenecks and post them as reasonably small
self-contained examples, we can surely discuss them and possible
improvements.

regards

--
RNDr. Jaroslav Hajek
computing expert
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

Thanks so much. I uploaded a real example to
http://gatoatigrado.arvixe.com/files/temp/octave_indexing_perf.tgz

The script is 60 lines long; it should perform a two-step Haar wavelet decomposition using the lifting method (actually computing only the wavelet coefficients). Run it with the following:
octave encode_image_standalone.m attachment.jpg

I looped it 450 times to simulate the amount of processing that I need. On my machine, the base time is 26.6 seconds; if you remove the matrix scalar indexing (original file lines 36, 37 and 43, 44), it still takes 17 seconds just to run through the interpreter. Replacing both of the lines above with
gamma_ += alpha_arr(shift_idx) * gamma_;
lambda_ += alpha_arr(shift_idx) * lambda_;
respectively, it takes about 20 seconds, which is much closer to the interpreter overhead (which I might be able to cut down by reducing the number of lines; I'm not sure).
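For readers without the tarball, a single 1-D Haar lifting step looks roughly like the following. This is the textbook lifting scheme, not necessarily identical to the script above; the `lambda_`/`gamma_` names follow the message, while `alpha_arr` and `shift_idx` from the original file are not reproduced here:

```octave
% Generic 1-D Haar lifting step: split, predict, update, then invert.
x = rand (1, 8);                 % signal with an even number of samples
lambda_ = x(1:2:end);            % even samples (approximation channel)
gamma_  = x(2:2:end);            % odd samples  (detail channel)
gamma_  -= lambda_;              % predict: detail = odd - even
lambda_ += gamma_ / 2;           % update: preserve the signal mean
% Inverse transform: undo the steps in reverse order.
lambda_ -= gamma_ / 2;
gamma_  += lambda_;
y = zeros (size (x));
y(1:2:end) = lambda_;
y(2:2:end) = gamma_;
assert (max (abs (y - x)) < 1e-12);  % perfect reconstruction
```

The in-place `+=` updates on index vectors are exactly the operations whose indexing cost is being measured in this thread.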

Also, Francesco, I forgot to do the obvious thing and test your examples with 3.1.51. At least on my machine, it looks like the ranking has been reversed by the latest dense indexing improvements: colon indexing is now the faster of the two.

octave:1> kk=rand(156,222,1,44,8);
octave:2> t=cputime; for ii=1:44, for jj=1:8, mm=kk(:,:,:,ii,jj); endfor, endfor, cputime-t
ans =  0.072004
octave:3> cp=[1,cumprod(size(kk)(1:end-1))]; len=156*222; t=cputime; for ii=1:44, for jj=1:8,base=sum(([1,1,1,ii,jj]-1).*cp); mm=kk(base+1:base+len); endfor, endfor, cputime-t
ans =  0.12401

Thanks again,
Nicholas
