Re: Problem with MPITB on IA64 arch
From: Javier Fernandez Baldomero
Subject: Re: Problem with MPITB on IA64 arch
Date: Sun, 05 Feb 2006 12:37:24 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
Hi all,
just in case our results are useful to anyone, we summarize the
workaround for this MPITB problem.
Gianvito Quarta wrote:
> Hi,
> I'm trying to set up a parallel octave environment on an Itanium II,
> IA64, 128 cpu cluster.
> I have a problem during the mpitb re-compilation because, on the IA64
> arch, the cast from pointer to int causes a problem (during the
> compilation with gcc 3.2.3 the error:
> reinterpret_cast from `_comm*' to `int' loses precision
> occurs).
Yes, the MPI pointer types were being systematically translated to
Octave flints (floating-point integers) with code like this:
RET_2_ARG (info,reinterpret_cast<int>(comm));
where, as one could guess,
______________________________
#define RET_2_ARG(ARG0,ARG1) \
octave_value_list retval; \
retval(0) = ARG0; \
retval(1) = ARG1; \
return retval;
______________________________
LAM/MPI uses pointers to describe internal objects such as process
groups, communicators, reduction operations, callback functions, etc,
used as arguments in MPI calls. The C bindings use C pointers to pass
these arguments. The FORTRAN bindings use integers. It seemed normal to
use Octave flints to store the C pointers, being thus able to store
returned pointers and pass them to subsequent MPI calls.
On IA-32, the assignment to retval(1) in the code above invokes the
octave_value (int) constructor, which converts the 32-bit int (obtained
from the reinterpret_cast further above) to an Octave flint. In
particular, the place in ov.h where these constructors are declared
reads like this:
______________________________
octave_value (int i);
octave_value (unsigned int i);
octave_value (long int i);
octave_value (unsigned long int i);
// XXX FIXME XXX -- these are kluges. They turn into doubles
// internally, which will break for very large values. We just use
// them to store things like 64-bit ino_t, etc, and hope that those
// values are never actually larger than can be represented exactly
// in a double.
#if defined (HAVE_LONG_LONG_INT)
octave_value (long long int i);
#endif
#if defined (HAVE_UNSIGNED_LONG_LONG_INT)
octave_value (unsigned long long int i);
#endif
______________________________
Gianvito obtained the following error message when compiling with gcc
3.2.3, because pointers are 8B wide on IA-64 and no longer fit in an int:
reinterpret_cast from `_comm*' to `int' loses precision
These are the sizes on each architecture:
        int   long  void*  double
IA-32   4B    4B    4B     8B
IA-64   4B    8B    8B     8B
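The table can be checked on any machine with a few sizeof expressions. The following is only a minimal sketch, not MPITB code: the names type_sizes, pointer_fits_in_int and pointer_fits_in_long are made up for this illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: report the widths (in bytes) of the types from the
// table above, for whatever platform this file is compiled on.
std::string type_sizes()
{
  std::ostringstream os;
  os << "int=" << sizeof(int) << "B "
     << "long=" << sizeof(long) << "B "
     << "void*=" << sizeof(void*) << "B "
     << "double=" << sizeof(double) << "B";
  return os.str();
}

// True when a pointer can be stored in an int without losing bits.
// This holds on IA-32 and fails on IA-64, which is what gcc complained about.
constexpr bool pointer_fits_in_int()
{
  return sizeof(void*) <= sizeof(int);
}

// True when a pointer can be stored in a long. This holds on both rows of
// the table, but it is a platform assumption, not a language guarantee.
constexpr bool pointer_fits_in_long()
{
  return sizeof(void*) <= sizeof(long);
}
```

On a platform matching the IA-64 row, pointer_fits_in_int() is false and pointer_fits_in_long() is true, which is why the cast to long compiles there while the cast to int does not.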
Changing to reinterpret_cast<long>(comm) was an excellent try. I
thought it should work.
> I tried to change the casting of pointers to long and then I
> successfully compiled MPITB.
> Unfortunately some problems occur at run time,
> ...
> [info rank]=MPI_Comm_rank(MPI_COMM_WORLD)% rank=0
> MPI process rank 0 (n0, p31218) caught a SIGSEGV in MPI_Comm_rank.
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD): - MPI_Comm_rank()
> Rank (0, MPI_COMM_WORLD): - main()
But I forgot 2 details:
*****
* 1.- *
*****
Not all 64 bits of an 8B double are devoted to the mantissa:
sign+exp+man=1+11+52. So if the address stored in the pointer
requires more than 53 bits (there is an implicit leading 1), it will be
rounded when converted to double. When we later try to access that
address, we'll access an unintended location, and SegFault.
That should be investigated. Are addresses really that big? The "FIXME"
comment about 64b ino_t would make one think that you should have...
errr... 2^52 bytes = 4 PetaBytes of memory before such an address would
show up.
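The 53-bit limit is easy to test outside Octave. The following is a minimal sketch, assuming IEEE-754 doubles with the default round-to-nearest mode; survives_double is a name invented for this illustration and is not part of MPITB or Octave.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when an address-sized integer converts to
// double and back without changing, i.e. when it is safe to keep it in a
// flint.
bool survives_double(std::uint64_t addr)
{
  // volatile forces the value to be stored as a real 64-bit double, so
  // that an x87 extended-precision register cannot hide the rounding.
  volatile double d = static_cast<double>(addr);

  // Values close to 2^64 round up to 2^64 itself, which does not fit in a
  // uint64_t; converting it back would be undefined behaviour.
  if (d >= 18446744073709551616.0)
    return false;

  return static_cast<std::uint64_t>(d) == addr;
}
```

With this check, 2^53 survives the trip and 2^53+1 does not. Any address needing more than 53 significant bits between its highest and lowest set bit is altered in the same way.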
These are the results from Gianvito:
address@hidden ~]$ octave
Set SSI rpi to tcp with the command:
putenv('LAM_MPI_SSI_rpi','tcp'), MPI_Init
Help on MPI: help mpi
octave-2.1.72:1> MPI_COMM_WORLD
ans = 2.3058e+18
Hmpf, I didn't instruct him to use format long, so I can't see all the
decimal digits. Anyway, 2.3*10^18 is a huge address. Since
1K=2^10~10^3, coarsely 10^18 ~ 2^60, so this address is about 2^61: it
fits in 64b but not in the 53b of a double's mantissa.
octave-2.1.72:2> MPI_Init
ans = 0
octave-2.1.72:3> a=MPI_COMM_WORLD
a = 2.3058e+18
octave-2.1.72:4> whos a
*** local user variables:
Prot Name Size Bytes Class
==== ==== ==== ===== =====
rwd a 1x1 8 scalar
Total is 1 element using 8 bytes
The communicator was being translated to a scalar double. Here I'm a bit
embarrassed to confess it took me some 6 to 8 trials to discover the right
sequence of typecasts to get the pointer stored in an Octave uint64, and
back to a C pointer. Gianvito proved to have an enduring, unbreakable
patience. Some of the tried casts are:
RET_1_ARG(reinterpret_cast<octave_uint64>( comm ))
MPI_Comm comm = reinterpret_cast<MPI_Comm>(
args(ARGN).uint64_scalar_value() );
Wrong, octave_uint64 is an object. Changing to octave_uint64_t does not
help, since uint64_scalar_value() returns an object too.
RET_1_ARG(reinterpret_cast<octave_uint64_t>( comm ))
MPI_Comm comm = reinterpret_cast<MPI_Comm>( args(ARGN).ulong_value() );
Wrong, the value becomes a flint under Octave, since the octave_value
constructor for 8B ints is used.
The final solution is
#if __ia64__
#define MPITB_OctPtrTyp octave_uint64         // IA-64
#define MPITB_OctIntFcn uint64_scalar_value
#else                                         // #elif __i386__
#define MPITB_OctPtrTyp octave_uint32         // IA-32
#define MPITB_OctIntFcn uint32_scalar_value
#endif
// accept any numeric scalar, including the new 8B integer type
#define MPITB_isOPtrTyp(ov) \
        (((ov).is_scalar_type())&&((ov).is_numeric_type()))
// C pointer -> Octave integer of the pointer's width
#define MPITB_intcast(cptr) \
        MPITB_OctPtrTyp (reinterpret_cast <unsigned long> (cptr))
// Octave integer -> C pointer of type typ
#define MPITB_ptrcast(typ,ov) \
        reinterpret_cast <typ> ((ov).MPITB_OctIntFcn().value())
so the casts become:
RET_1_ARG(octave_uint64(reinterpret_cast<unsigned long>( comm )))
MPI_Comm comm = reinterpret_cast<MPI_Comm>(
args(ARGN).uint64_scalar_value().value() );
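The same round trip can be tried without Octave or LAM. This is only a sketch under stated assumptions: fake_comm stands in for LAM's opaque _comm struct, handle_from_ptr and ptr_from_handle are invented names, and std::uintptr_t is used here instead of the unsigned long that MPITB uses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an opaque MPI object such as LAM's _comm.
struct fake_comm { int dummy; };

// Pointer -> integer handle. std::uintptr_t is an unsigned integer type
// wide enough to hold a pointer on the platforms that provide it
// (4B on IA-32, 8B on IA-64).
std::uintptr_t handle_from_ptr(fake_comm* p)
{
  return reinterpret_cast<std::uintptr_t>(p);
}

// Integer handle -> pointer. As long as the handle was kept in an integer
// of full pointer width (never in a double), the original pointer comes
// back unchanged.
fake_comm* ptr_from_handle(std::uintptr_t h)
{
  return reinterpret_cast<fake_comm*>(h);
}
```

The point of the design is that the handle stays in an integer type of pointer width at every step, which is the role octave_uint64 plays on the Octave side.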
*****
* 2.- *
*****
The args error-checking code was refusing to accept those new 8B
integers as valid MPI objects. So we also included the MPITB_isOPtrTyp()
macro to replace the previous ov.is_scalar() test.
-javier
P.S.:
Out of curiosity, these are the pointer values causing the SegFault
_____________________________________________________
octave-2.1.72:4> a=MPI_COMM_WORLD
a = 2.3058430092933637e+18
octave-2.1.72:5> whos a
*** local user variables:
Prot Name Size Bytes Class
==== ==== ==== ===== =====
rwd a 1x1 8 scalar
Total is 1 element using 8 bytes
octave-2.1.72:6> [info rank]=MPI_Comm_rank(MPI_COMM_WORLD)
MPI process rank 0 (n0, p8187) caught a SIGSEGV in MPI_Comm_rank.
_____________________________________________________
octave-2.1.72:4> a=MPI_COMM_WORLD
a = 2305843009293363640
octave-2.1.72:5> whos a
*** local user variables:
Prot Name Size Bytes Class
==== ==== ==== ===== =====
rwd a 1x1 8 uint64 scalar
Total is 1 element using 8 bytes
octave-2.1.72:6> [info rank]=MPI_Comm_rank(MPI_COMM_WORLD)
error: MPI_Comm_rank: required arg#1: comm(int)
_____________________________________________________
Can you see the rounding up (not truncation) when translated to double?
I didn't expect such a huge address.
a = 2.3058430092933637e+18
a = 2305843009293363640
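The rounding can be reproduced with a little arithmetic. The sketch below assumes IEEE-754 doubles and round-to-nearest; spacing_at and nearest_double_as_int are names made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: gap between x and the next larger double.
double spacing_at(double x)
{
  return std::nextafter(x, INFINITY) - x;
}

// Hypothetical helper: the integer actually stored when v is converted to
// double. Only meant for v well below 2^64, such as the address above.
std::uint64_t nearest_double_as_int(std::uint64_t v)
{
  // volatile forces a genuine 64-bit double (no x87 extended precision).
  volatile double d = static_cast<double>(v);
  return static_cast<std::uint64_t>(d);
}
```

Around 2^61 consecutive doubles are 2^(61-52) = 512 apart, so spacing_at(2305843009293363640.0) gives 512. The pointer 2305843009293363640 lies 440 above one representable value and 72 below the next, so nearest_double_as_int returns 2305843009293363712, i.e. the address is moved up by 72 bytes.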