Re: Performance problem running multiple copies of Octave on a multicore

From: Ian McCallion
Subject: Re: Performance problem running multiple copies of Octave on a multicore processor
Date: Thu, 10 Dec 2015 13:46:53 +0000

Hi Olaf,

Here is the code. ldd is Unix-only, so I haven't run the commands you
suggested. However, I will send you the Windows equivalent shortly.

The cmd script is how I switch blas versions.

One important thing I have noticed (which I earlier dismissed as a
testing error) is that the only combination that runs fast is
Octave 3.8.0 with the blas shipped with it.

I will experiment with different versions of lapack here. Please take
a quick look at the code to make sure there are no silly errors. You will
need to change the source for your environment to run it.

Cheers... Ian
P.S. Rename the attachment to .zip. Bloody Google!

On 10 December 2015 at 13:16, Olaf Till <address@hidden> wrote:
> On Thu, Dec 10, 2015 at 11:47:02AM +0000, Ian McCallion wrote:
>> On 1 December 2015 at 08:09, Olaf Till <address@hidden> wrote:
>> ><snip>
>> > An explanation could be using up the common L2 cache of the processor
>> > cores, so that they must share the main memory bus. Maybe some
>> > calculation in the newer Octave is spread over a larger memory area.
>> >
>> > It could be useful to find the smallest possible unit (e.g a native
>> > Octave function, or a builtin function called by it) which shows your
>> > problem.
>> >
>> > Your idea with different BLAS libraries seemed good to me, make sure
>> > you tested this influence correctly.
>>
>> On 5 December 2015 at 11:30, Ian McCallion <address@hidden> wrote:
>> > <snip>
>> > The bad performance definitely follows Octave 4.0.0.
>> >
>> > I am planning to instrument the code to find where the excess time is
>> > being consumed
>>
>> I instrumented the code. This showed matrix multiplication and
>> log(matrix) are affected, whereas dot multiplication and division do
>> not seem to be affected.
>>
>> I have written demonstrator functions that show the problem for matrix
>> multiplication:
>>
>>     perf1(8,1000,12,500,1000);
>>     Running 8 processes.
>>     4.0.0,  libopenblas382, rand(1000,12)*rand(12,500) 1000 times
>>                took 11.07 seconds
>>     3.8.2,  libopenblas382, rand(1000,12)*rand(12,500) 1000 times
>>                took 7.22 seconds
>>
>> The same magnitude of task on one processor:
>>
>>     perf1(1,1000,12,500,8000)
>>     Running 1 process
>>     4.0.0,  libopenblas382, rand(1000,12)*rand(12,500) 8000 times
>>                took 13.04 seconds
>>     3.8.2,  libopenblas382, rand(1000,12)*rand(12,500) 8000 times
>>                took 21.14 seconds
>>
>> I will happily share the demonstrator with anyone who wants it. It may
>> be useful to know whether the problem shows up on other processors and
>> other motherboards. Mine is an intel i7-3610QM ASUS laptop with 8GB
>> RAM running win7.
>
> The code for matrix multiplication (in liboctave/array/,
> function xgemm and how it is called) has only code-formatting changes
> between Octave 3.8.2 and 4.0.0. So the same lapack functions should be
> called. This indicates that the cause is a difference in the (lapack
> or) blas being used. (I know you stated the same blas was used in both,
> but you didn't state how you checked it.)
>
> As for your test, your code surely called rand() _before_ the loop,
> not within? It would be best to post the code, including any shell
> scripts or similar used to call it.
>
> What do
>
>   ldd /path/to/octave-3.8.0 | grep lapack
>   ldd /path/to/octave-3.8.0 | grep blas
>
> and
>
>   ldd /path/to/octave-4.0.0 | grep lapack
>   ldd /path/to/octave-4.0.0 | grep blas
>
> output on your system? (Replace octave-... with octave-...-cli or
> whatever the (symlink to the) final executable is named.)
>
> Olaf
>
> --
> public key id EAFE0591, e.g. on x-hkp://

Attachment: perf.zzz
Description: Binary data
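
The attachment is not reproduced in this archive. For readers who want to try the measurement themselves, the following is a minimal sketch of the kind of timing loop discussed in the thread, not the code from perf.zzz. The function name time_matmul and its argument order are hypothetical. As Olaf suggests, the rand() calls are made once before the loop, so only the matrix multiplication is timed.

```octave
1;  % marks this file as a script, so the function below can be defined inline

% Hypothetical helper (an assumption, not taken from perf.zzz):
% time `reps` multiplications of an m-by-k matrix by a k-by-n matrix.
function elapsed = time_matmul (m, k, n, reps)
  a = rand (m, k);    % operands are created once, outside the timed loop
  b = rand (k, n);
  t0 = tic ();
  for i = 1:reps
    c = a * b;        % only the multiplication is inside the timed region
  end
  elapsed = toc (t0); % elapsed wall-clock time in seconds
end

% Same sizes and repeat count as the 8-process run quoted in the thread.
elapsed = time_matmul (1000, 12, 500, 1000);
printf ("%s, rand(1000,12)*rand(12,500) 1000 times took %.2f seconds\n", ...
        version (), elapsed);
```

To approximate the multi-process case, the same script would be started in several Octave processes at once (eight in the thread) and the per-process times compared with a single process doing eight times the repeats. For reference, the figures quoted in the thread give a ratio of 13.04/11.07, about 1.18, between the one-process and eight-process runs for 4.0.0, against 21.14/7.22, about 2.93, for 3.8.2.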
