octave-maintainers
From: Jarno Rajahalme
Subject: Changesets: Re: Parallel access to data created with Octave
Date: Mon, 24 May 2010 21:16:49 -0700

On May 3, 2010, at 10:33 PM, ext Jaroslav Hajek wrote:

> On Mon, May 3, 2010 at 6:53 PM, Jarno Rajahalme <address@hidden> wrote:
> 
>> For the case where I test for scalar type, it would suffice if numel() on a 
>> scalar would not depend on a static initialization of a "static dim_vector 
>> dv(1,1); return dv;" within a function dims(). If octave_base_scalar 
>> implemented a numel() always returning 1, I would not have to care whether a 
>> specific cell is a scalar or not. And it would be faster, too :-).
>> 
> 
> Feel free to do this, but it's just the tip of an iceberg. You simply
> shouldn't depend on internals like this.
> 


I did this (see the two attached changesets):

Attachment: changeset1
Description: Binary data

Attachment: changeset2
Description: Binary data


(Please apply both changesets; there is some minor back and forth between them.)

All tests pass (except for svds, which has one unrelated failure either way).

liboctave:

- Followed the Array.h practice of returning const ref from dims() (Sparse.h)
- Added ov_ndims(), const methods declared as such (dim_vector.h)
- Changed idx_vector::orig_rows() and idx_vector::orig_columns() to call new 
internal rep virtual functions (idx-vector.h)
- Changed dim_vector local variable to const ref when applicable (Array.cc, 
CNDArray.cc, CSparse.cc, Sparse.cc, dNDArray.cc, dSparse.cc, fCNDArray.cc, 
fNDArray.cc, idx-vector.cc, lo-specfun.cc, mx-inlines.cc)

src:

- Made rows() and columns() virtual; changed the default implementations of 
ndims, rows, columns, numel, and capacity to be thread-safe (subclasses need to 
override them). (ov-base.h)
- Added thread-safe (virtual) ndims(), numel(), capacity(), rows() and 
columns() (ov-base-diag.h, ov-base-mat.h, ov-base-scalar.h, ov-base-sparse.h, 
ov-class.h, ov-cs-list.h, ov-fcn-handle.h, ov-lazy-idx.h, ov-perm.h, 
ov-range.h, ov-struct.h)
- Followed the Array.h practice of returning const ref from dims() (oct-map.h)
- Changed dim_vector local variable to const ref when available (bitfcns.cc, 
data.cc, oct-map.cc, ov-base-diag.cc, ov-base-int.cc, ov-base-mat.cc, 
ov-base-sparse.cc, ov-bool-mat.cc, ov-bool-sparse.cc, ov-cell.cc, ov-cx-mat.cc, 
ov-cx-sparse.cc, ov-flt-cx-mat.cc, ov-flt-re-mat.cc, ov-intx.h, ov-perm.cc, 
ov-re-mat.cc, ov-re-sparse.cc, ov-str-mat.cc, ov-struct.cc, pr-output.cc, 
pt-mat.cc, strfns.cc, xpow.cc)
- Changed dim_vector local variable to const when possible (data.cc, 
gl-render.cc, graphics.cc, ls-mat5.cc, mex.cc, ov-base.cc, ov.cc, pt-mat.cc)
- Changed to avoid calling dims() or numel() via the virtual members 
(ov-base-diag.cc, ov-base-mat.cc, ov-base-sparse.cc, ov-bool-mat.cc, 
ov-bool-mat.h, ov-bool-sparse.cc, ov-cell.cc, ov-cx-mat.cc, ov-cx-sparse.cc, 
ov-flt-cx-mat.cc, ov-flt-re-mat.cc, ov-intx.h, ov-perm.cc, ov-range.cc, 
ov-re-mat.cc, ov-re-sparse.cc, ov-str-mat.cc, ov-struct.cc)


With these changes, the octave_(base)_value functions ndims, rows, columns, 
numel, and capacity are now thread-safe. My own OpenMP code used to crash when 
calling numel, due to the unprotected reference counting, but does not crash 
any more, as these functions no longer need to touch any reference counters.

These, and dependent functions, are now also marginally faster. However, when 
timing these I noticed some variation due to changed branch prediction 
behavior, as the patterns of calling virtual functions have changed. From this 
viewpoint it would be nice if all numeric data types had a real dim_vector, so 
that they could share the same implementation of all these functions. This 
would require a common base class for the numeric types. Also, I believe I had 
some earlier crashes related to the initialization of static members in 
threaded code, so this would need some more testing.

dims() itself cannot be made thread-safe without protecting the reference 
counting, since for some types the dim_vector is created dynamically.

There are minor efficiency gains all over the place from not creating a local 
dim_vector instance where one is not needed.

Anyway, in the following you will see that the time it takes for cellfun () to 
complete depends on the order of the data. The data consists of (1) a null 
matrix, (2) a single-element vector (== scalar), and (3) a three-element 
vector (== matrix). The first column has the data in an alternating 
(1,2,3,1,2,3,...) pattern (mix), the second in three long stripes of each 
element, and the third in a random order (the same order in each case).

Before:        mix      stripe   random
numel:         0.3174   0.1877   0.2800
ndims:         0.3452   0.2599   0.3160
length:        1.6094   1.4972   1.6913
isempty:       0.1355   0.1321   0.2059
islogical:     0.0734   0.0733   0.0735
isreal:        0.1773   0.0735   0.1429
columns:       11.2638  11.0394  11.3478
rows:          11.2953  11.0780  11.2934

After:         mix      stripe   random
numel:         0.2478   0.1334   0.2257
ndims:         0.2483   0.1396   0.2213
length:        0.2941   0.2843   0.4240
isempty:       0.0788   0.0790   0.1518
islogical:     0.0782   0.0809   0.0771
isreal:        0.0821   0.0820   0.1603
columns:       10.0994  9.6064   9.7986
rows:          9.7863   9.6609   10.1380

Times are in seconds; overall the gain is mostly insignificant, even if 
relatively large at times (length is 4x faster), as these functions were 
called 15 million times. My intent is just to show that there is *some* 
overall efficiency gain in addition to the functions now being thread-safe. 
The last two are slower than the others because cellfun () does not support 
them directly.

Here is the timing code I used:

Attachment: crn_test.m
Description: Binary data


As I have not contributed C++ code to Octave before, I think someone should 
check the changesets to make sure I did not do anything stupid :-)

One point for discussion may be that making rows () and columns () virtual 
will require all C++ .oct modules to be recompiled, as the octave_base_value 
vtable layout changes. The other changes should not force this, as the ov and 
ov-base interfaces have not (otherwise) changed.

Regards,

 Jarno

> 
>>> I would suggest you try to hoist as much of the value extraction calls
>>> as possible out of the parallelized loops. If it's not possible, you
>>> can use #pragma omp critical. Perhaps the pragmas will be inserted in
>>> Octave's sources in the future.
>>> 
>> 
>> In this case doing this would effectively mean copying a multi-megabyte 
>> structure from Octave to C, and is not an option, at least for now. I could 
>> add a separate initialization function to that effect, but then I would have 
>> to worry about calling that each time the Octave structure is changed to 
>> keep them in sync.
>> 
> 
> That's not exactly what I meant. Consider your earlier example
> 
> const octave_value snv(su.xelem(0,si));
> const uint16NDArray sna(snv.uint16_array_value());
> const octave_idx_type snl = sna.nelem();
> const octave_uint16 * snp = sna.fortran_vec();
> 
> You could do something like:
> 
> OCTAVE_LOCAL_BUFFER (uint16NDArray, sna, nsi);
> 
> // prepare array values (serial part)
> for (si = 0; si < nsi; si++)
>   sna[si] = su(0,si).uint16_array_value ();
> 
> // parallel loop
> #pragma omp parallel for
> for (si = 0; si < nsi; si++)
>   {
>     const octave_idx_type snl = sna[si].nelem (); // thread safe
>     const octave_uint16 *snp = sna[si].fortran_vec (); // thread safe
>   }
> 
> The sna buffer is an added memory overhead, but it shouldn't be a
> problem unless you have lots of small arrays (which is not efficient
> in Octave anyway).
> 
>> Declaring value extraction functions critical would unnecessarily slow down 
>> the code. IMO, just inspecting data should not do anything that would not 
>> work when parallel.
> 
> Unfortunately, at the interpreter level, a lot can happen behind the
> scene just because uint16_array_value is called. Sometimes, the
> resulting array is a shared copy, or it may be created afresh. Because
> many things are reference-counted, it often results in race
> conditions. Static initializers are used in several places to boost up
> things, in particular default constructors.
> 
>> I guess there are NOT that many places where things break down. Maybe one 
>> issue is with the reference counts being incremented and decremented, when 
>> extracting values, and every now and then there will be a race regarding a 
>> reference count: 2 threads read the same value, increment it and then 
>> independently decrement it, causing the data to be freed. This could be 
>> solved by making reference count manipulation critical, or by enabling data 
>> inspection without touching the reference values at all (as I have now done).
> 
> Yes, the former is an option, but it has disadvantages, because many
> of those locks will be redundant, still, I think we may do it in
> future. The latter, as explained, is extremely quirky given the very
> interchangeable and dynamic nature of octave_values.
> 
> regards
> 
> -- 
> RNDr. Jaroslav Hajek, PhD
> computing expert & GNU Octave developer
> Aeronautical Research and Test Institute (VZLU)
> Prague, Czech Republic
> url: www.highegg.matfyz.cz

