Changesets: Re: Parallel access to data created with Octave
From: Jarno Rajahalme
Subject: Changesets: Re: Parallel access to data created with Octave
Date: Mon, 24 May 2010 21:16:49 -0700
On May 3, 2010, at 10:33 PM, ext Jaroslav Hajek wrote:
> On Mon, May 3, 2010 at 6:53 PM, Jarno Rajahalme <address@hidden> wrote:
>
>> For the case where I test for scalar type, it would suffice if numel() on a
>> scalar would not depend on a static initialization of a "static dim_vector
>> dv(1,1); return dv;" within a function dims(). If octave_base_scalar
>> implemented a numel() always returning 1, I would not have to care whether a
>> specific cell is a scalar or not. And it would be faster, too :-).
>>
>
> Feel free to do this, but it's just the tip of an iceberg. You simply
> shouldn't depend on internals like this.
>
I did this (see the two attached changesets):
changeset1 (attachment)
changeset2 (attachment)
(Please apply both sets; there is some minor back and forth between them.)
All tests pass (except for svds, which has one unrelated failure either way).
liboctave:
- Followed the Array.h practice of returning const ref from dims() (Sparse.h)
- Added ov_ndims(), const methods declared as such (dim_vector.h)
- Changed idx_vector::orig_rows() and idx_vector::orig_columns() to call new
internal rep virtual functions (idx-vector.h)
- Changed dim_vector local variable to const ref when applicable (Array.cc,
CNDArray.cc, CSparse.cc, Sparse.cc, dNDArray.cc, dSparse.cc, fCNDArray.cc,
fNDArray.cc, idx-vector.cc, lo-specfun.cc, mx-inlines.cc)
src:
- Made rows() and columns() virtual, changed default implementations of ndims,
rows, columns, numel, and capacity to be thread-safe, subclasses need to
override. (ov-base.h)
- Added thread-safe (virtual) ndims(), numel(), capacity(), rows() and
columns() (ov-base-diag.h, ov-base-mat.h, ov-base-scalar.h, ov-base-sparse.h,
ov-class.h, ov-cs-list.h, ov-fcn-handle.h, ov-lazy-idx.h, ov-perm.h,
ov-range.h, ov-struct.h)
- Followed the Array.h practice of returning const ref from dims() (oct-map.h)
- Changed dim_vector local variable to const ref when available (bitfcns.cc,
data.cc, oct-map.cc, ov-base-diag.cc, ov-base-int.cc, ov-base-mat.cc,
ov-base-sparse.cc, ov-bool-mat.cc, ov-bool-sparse.cc, ov-cell.cc, ov-cx-mat.cc,
ov-cx-sparse.cc, ov-flt-cx-mat.cc, ov-flt-re-mat.cc, ov-intx.h, ov-perm.cc,
ov-re-mat.cc, ov-re-sparse.cc, ov-str-mat.cc, ov-struct.cc, pr-output.cc,
pt-mat.cc, strfns.cc, xpow.cc)
- Changed dim_vector local variable to const when possible (data.cc,
gl-render.cc, graphics.cc, ls-mat5.cc, mex.cc, ov-base.cc, ov.cc, pt-mat.cc)
- Changed to avoid calling dims() or numel() via the virtual members
(ov-base-diag.cc, ov-base-mat.cc, ov-base-sparse.cc, ov-bool-mat.cc,
ov-bool-mat.h, ov-bool-sparse.cc, ov-cell.cc, ov-cx-mat.cc, ov-cx-sparse.cc,
ov-flt-cx-mat.cc, ov-flt-re-mat.cc, ov-intx.h, ov-perm.cc, ov-range.cc,
ov-re-mat.cc, ov-re-sparse.cc, ov-str-mat.cc, ov-struct.cc)
With these changes the octave_(base)_value functions ndims, rows, columns,
numel and capacity are now thread-safe. My own OpenMP code used to crash
when calling numel, due to the unprotected reference counting, but it no
longer crashes, as these functions no longer need to touch any reference
counters.
These, and dependent functions, are now also marginally faster. However,
when timing them I noticed some variation due to changed branch prediction
behavior, as the patterns of calling virtual functions have changed. From
this viewpoint it would be nice if all numeric data types had a real
dim_vector, so that they could share the same implementation of all these
functions. This would require a common base class for numeric types. Also,
I had some earlier crashes that I believe were related to the
initialization of static members in threaded code, so this would need some
more testing.
dims() itself cannot be made thread-safe without protecting the reference
counting, as for some types the dim_vector is created dynamically.
There are minor efficiency gains all over the place from not creating a
local instance of dim_vector where one is not needed.
Anyway, in the following you will see that the time it takes for cellfun ()
to complete depends on the order of the data. The data consists of (1) a
null matrix, (2) a single-element vector (== scalar), and (3) a
three-element vector (== matrix). The first column has the data in an
alternating (1,2,3,1,2,3,...) pattern (mix), the second in three long
stripes of each element, and the third in a random order (the same order
in each case).
Before:     mix       stripe    random
numel:      0.3174    0.1877    0.2800
ndims:      0.3452    0.2599    0.3160
length:     1.6094    1.4972    1.6913
isempty:    0.1355    0.1321    0.2059
islogical:  0.0734    0.0733    0.0735
isreal:     0.1773    0.0735    0.1429
columns:   11.2638   11.0394   11.3478
rows:      11.2953   11.0780   11.2934
After:      mix       stripe    random
numel:      0.2478    0.1334    0.2257
ndims:      0.2483    0.1396    0.2213
length:     0.2941    0.2843    0.4240
isempty:    0.0788    0.0790    0.1518
islogical:  0.0782    0.0809    0.0771
isreal:     0.0821    0.0820    0.1603
columns:   10.0994    9.6064    9.7986
rows:       9.7863    9.6609   10.1380
Times are in seconds; overall the gain is mostly insignificant, even if
relatively large at times (length is 4x faster), given that these functions
were called 15 million times. My intent is just to show that there is
*some* overall efficiency gain in addition to the functions being
thread-safe. The last two are slower than the others because cellfun ()
does not directly support them.
Here is the timing code I used:
crn_test.m (attachment)
As I have not contributed C++ code to Octave before, I think someone should
check the changesets to make sure I did not do anything stupid :-)
One point for discussion may be that making rows () and columns () virtual will
require all C++ Oct modules to be recompiled, as the octave_base_value vtable
layout changes. Other changes might not force this, as the ov and ov-base
interfaces have not (otherwise) changed.
Regards,
Jarno
>
>>> I would suggest you try to hoist as much of the value extraction calls
>>> as possible out of the parallelized loops. If it's not possible, you
>>> can use #pragma omp critical. Perhaps the pragmas will be inserted in
>>> Octave's sources in the future.
>>>
>>
>> In this case doing this would effectively mean copying a multi-megabyte
>> structure from Octave to C, and is not an option, at least for now. I could
>> add a separate initialization function to that effect, but then I would have
>> to worry about calling that each time the Octave structure is changed to
>> keep them in sync.
>>
>
> That's not exactly what I meant. Consider your earlier example
>
> const octave_value snv(su.xelem(0,si));
> const uint16NDArray sna(snv.uint16_array_value());
> const octave_idx_type snl = sna.nelem();
> const octave_uint16 * snp = sna.fortran_vec();
>
> You could do something like:
>
> OCTAVE_LOCAL_BUFFER (uint16NDArray, sna, nsi);
>
> // prepare array values
> for (si = 0; si < nsi; si++)
>   sna[si] = su(0,si).uint16_array_value ();
>
> // parallel loop
> #pragma omp parallel for
> for (si = 0; si < nsi; si++)
>   {
>     const octave_idx_type snl = sna[si].nelem (); // thread safe
>     const octave_uint16 * snp = sna[si].fortran_vec (); // thread safe
>   }
>
> The sna buffer is an added memory overhead, but it shouldn't be a
> problem unless you have lots of small arrays (which is not efficient
> in Octave anyway).
>
>> Declaring value extraction functions critical would unnecessarily slow down
>> the code. IMO, just inspecting data should not do anything that would not
>> work when parallel.
>
> Unfortunately, at the interpreter level, a lot can happen behind the
> scene just because uint16_array_value is called. Sometimes, the
> resulting array is a shared copy, or it may be created afresh. Because
> many things are reference-counted, it often results in race
> conditions. Static initializers are used in several places to speed
> things up, in particular in default constructors.
>
>> I guess there are NOT that many places where things break down. Maybe one
>> issue is with the reference counts being incremented and decremented, when
>> extracting values, and every now and then there will be a race regarding a
>> reference count: 2 threads read the same value, increment it and then
>> independently decrement it, causing the data to be freed. This could be
>> solved by making reference count manipulation critical, or by enabling data
>> inspection without touching the reference values at all (as I have now done).
>
> Yes, the former is an option, but it has disadvantages, because many
> of those locks would be redundant; still, I think we may do it in the
> future. The latter, as explained, is extremely quirky given the very
> interchangeable and dynamic nature of octave_values.
>
> regards
>
> --
> RNDr. Jaroslav Hajek, PhD
> computing expert & GNU Octave developer
> Aeronautical Research and Test Institute (VZLU)
> Prague, Czech Republic
> url: www.highegg.matfyz.cz