Re: Problem with MPITB on IA64 arch
From: Javier Fernández Baldomero
Subject: Re: Problem with MPITB on IA64 arch
Date: Fri, 20 Jan 2006 18:25:31 +0100
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040804 Netscape/7.2 (ax)
Hi Gianvito,
Gianvito Quarta wrote:
Dear Javier,
first of all, thanks for your fast answer. (I apologize for the delay
in my answer, but
this morning I was occupied with my course.)
I'm busy in the evenings and make you wait as well, and this time it
took me a long time to double-check all those typecasts, so no
problem there. I'm pleasantly surprised again, this time by the
precision and completeness of your answer. I hope I can keep up with your
standards :-) And sorry for such a long, long e-mail.
1.- copy-paste LAM config.log lines related to int and void* sizes,
alignments and endianness:
configure:5363: checking size of int
configure:5408: result: 4
configure:5436: checking size of long
configure:5481: result: 8
...
configure:5801: checking size of void *
configure:5846: result: 8
configure:6111: checking alignment of int
configure:6172: result: 4
configure:6188: checking alignment of long
configure:6249: result: 8
...
configure:6573: checking alignment of void *
configure:6634: result: 8
configure:19090: checking whether byte ordering is bigendian
configure:19301: result: no
Excellent, your choice of "long" for holding a pointer was perfect.
"int" typecast won't work on IA64, and the compiler chokes appropriately.
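The same size check can be reproduced directly in C++. The following is only a minimal sketch, not MPITB or LAM code; `can_hold_pointer` and `report_pointer_sizes` are hypothetical helper names introduced for illustration.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper (not part of MPITB or LAM): true when an integer
// type is wide enough to store a pointer without losing bits.
template <typename Int>
constexpr bool can_hold_pointer() {
    return sizeof(Int) >= sizeof(void*);
}

// Prints the sizes that the configure checks above report, plus the
// verdict for int and long on the machine where this is compiled.
inline void report_pointer_sizes() {
    std::cout << "sizeof(int)    = " << sizeof(int) << '\n'
              << "sizeof(long)   = " << sizeof(long) << '\n'
              << "sizeof(void *) = " << sizeof(void*) << '\n'
              << "int holds a pointer:  " << can_hold_pointer<int>() << '\n'
              << "long holds a pointer: " << can_hold_pointer<long>() << '\n';
}
```

With the sizes listed in the config.log above (int 4, long 8, void * 8), `can_hold_pointer<int>()` would be false and `can_hold_pointer<long>()` true.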
2.- copy-paste Octave config.log lines related to int sizes:
configure:13130: checking size of int
configure:13521: result: 4
configure:13528: checking for long
configure:13589: result: yes
configure:13592: checking size of long
configure:13983: result: 8
configure:13990: checking for long long
configure:14051: result: yes
configure:14054: checking size of long long
configure:14445: result: 8
Yup, both configures (LAM's and Octave's) reach the same conclusion.
Octave's configure does not check for pointer size, obviously. LAM
requires detailed knowledge of data representation to send the data to
other computers, which might be heterogeneous, with different data size,
alignment and even endianness. I should study how to extract that
information from LAM and reuse it when recompiling MPITB, instead of
suffering these ugly dependencies in the MPITB sources. Sigh.
At least, I'll try to add two macros "CPOINTER_TO_OCTAVE(c-ptr)" and
"OCTAVE_TO_CPTR(ov)" to do the casts in all these 75+ places (see
below). I'll also try to look for an appropriate available compile-time
definition (IA64 or whatever) so the macros automatically change for
your cluster. Now that I think of it, since I need local variables of the
appropriate type, I'll also need an "OCT_PTRTYPE" symbol that
changes accordingly (int for IA32, long for IA64). I'll ask you to do
the beta test, if you don't mind :-)
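A rough sketch of how such macros could look follows. It is an illustration under stated assumptions, not the final MPITB code: it selects the type with the GCC predefined macros `__LP64__` / `__ia64__` (MPITB may end up testing something else), it takes a plain integer rather than an Octave value, it adds a TYPE parameter to the reverse macro so the example compiles on its own, and `fake_comm` / `FakeComm` are hypothetical stand-ins for an MPI handle. The `static_assert` assumes a C++11 compiler.

```cpp
#include <cassert>

// Assumption: __LP64__ / __ia64__ are the compile-time definitions tested
// here (GCC predefines them on LP64 and IA-64 targets).
#if defined(__LP64__) || defined(__ia64__)
typedef long OCT_PTRTYPE;   // 8-byte pointers: long is wide enough
#else
typedef int OCT_PTRTYPE;    // 4-byte pointers (IA32): int is wide enough
#endif

// Fail at compile time instead of truncating pointers at run time.
static_assert(sizeof(OCT_PTRTYPE) >= sizeof(void*),
              "OCT_PTRTYPE is too narrow to hold a pointer");

// Direct cast: C pointer -> integer that can be handed back to Octave.
#define CPOINTER_TO_OCTAVE(cptr) (reinterpret_cast<OCT_PTRTYPE>(cptr))

// Reverse cast: integer coming from Octave -> C pointer of type TYPE.
// The TYPE parameter is an addition of this sketch.
#define OCTAVE_TO_CPTR(TYPE, value) \
    (reinterpret_cast<TYPE>(static_cast<OCT_PTRTYPE>(value)))

// Hypothetical stand-in for an opaque MPI handle such as MPI_Comm.
struct fake_comm { int rank; };
typedef fake_comm* FakeComm;

// Sends a handle through the direct cast and back through the reverse one.
inline bool handle_round_trips(FakeComm comm) {
    OCT_PTRTYPE as_integer = CPOINTER_TO_OCTAVE(comm);
    FakeComm back = OCTAVE_TO_CPTR(FakeComm, as_integer);
    return back == comm;
}
```

Keeping the type choice in one typedef means the 75 or so cast sites would only ever mention the macros, so a new architecture needs a change in one place.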
3.- tell me if you modified the line I mentioned (MPI_COMM_WORLD.cc,
on line 33)
I verified that the MPI_COMM_WORLD was also modified as:
#define MPI_COMM_WORLD ((MPI_Comm) &lam_mpi_comm_world)
RET_1_ARG(reinterpret_cast<long>( NAME )) // defined -> expanded
}
I see, you modified them as the compiler spotted them one by one. Good.
Let me count them, that makes... (I'm grepping right now)
2 MPI_Attr_put.cc int <- MPI_Comm
8 MPI_B* L* MAX/MIN.cc(8 files) int <- MPI_Op
3 MPI_COMM* (3 files) int <- MPI_Comm
2 MPI_Comm_spawn* (2 files) int <- MPI_Comm (children)
1 MPI_DUP_FN.cc int <- MPI_Copy_fn
3 MPI_ERR* (3 files) int <- MPI_Errhandler
10 MPI_Errhandler/Keyval _create/free.cc (4 files)
int <- MPI_Comm*, MPI_Errhandler
3 MPI_GROUP*/INFO* (3 files) int <- MPI_Group, MPI_Info
6 MPI_NULL*_F, OP_NULL, PROD, SUM, REPL (6 files)
int <- MPI_*_fn, MPI_Op
1 MPI_REQUEST_NULL int <- MPI_Request
2 hColl.h (0 direct+2 reverse)
2 hErr.h (1+1)
5 hInfo.h (4+1)
1 hSend.h (1+0)
3 hTopo.h (3+0)
4 hTstWait (2+2)
17 hGrp.h (10+7)
2 mpitb.h (2 direct+0 reverse)
To explain this "direct" and "reverse" accounting, let's use hGrp.h as
an example. It also includes the int-to-int typecast credited to
Michael. I can't recall what I was smoking when I wrote that. Around
line 125, you can see a (reverse) MPI_Comm<-int typecast, and what was
meant to be a (direct) int<-MPI_* typecast but became a silly,
embarrassing bug.
____________________
...
MPI_Comm comm1=reinterpret_cast<MPI_Comm>( args(ARGM).int_value() ), \
comm2=reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );
...
#define BLCK_COMLDR(NAME,ARGN,P) /* P for PREFIX */ \
...
MPI_Comm P##_comm=reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );\
int P##_lead = ( args(ARGN+1).int_value() );
// int P##_lead = reinterpret_cast<int >( args(ARGN+1).int_value() );
// BUG:BRAINDAMAGE!!casting int to int, compile error w/gcc4.0.2 Ubuntu
// MPI_Intercomm_create.cc:49: error: invalid cast from 'int' to 'int'
// CREDIT: detected by Michael Creel, Econometrics UAB Spain
____________________
Since this fix is still only in my sources, not in the tarball, I think
you should find 75 int casts, 13 of them reverse (MPI_*<-int) and the
remaining 62 direct (int<-MPI_*). Did you count them? Did you mark them
in the sources so they are easy to locate? If not, you could still
compare against the original sources and double-check that there are
indeed 75 = 62 direct + 13 reverse.
BTW, don't keep the int-to-int cast :-) Change the others to long.
Do you have an AOL or Messenger or Skype or ICQ account?
Hm, sorry, no. I know this is terribly slow; I spent all this time
thinking about the answer and writing the e-mail. Sorry for the
inconvenience, please bear with me :-)
4.- tell me if you modified (and how) any other line:
I modified each reinterpret_cast where the compilation problem occurred.
In particular, I modified the code in the header file mpitb.h in each
definition, for example:
#define PATN_COM_F_COM_INT(NAME,INAM) \
{ \
NARGCHK (NAME,2) \
BLCK_ONECOM (NAME,0) \
BLCK_ONEINT (NAME,1,high) \
MPI_Comm intracomm=0; \
int info=NAME (comm, high, &intracomm ); \
RET_2_ARG (info, reinterpret_cast<long>( intracomm ) ) \
}
where before the original code was
...
RET_2_ARG (info, reinterpret_cast<int>( intracomm ) )
...
Excellent! There is another before that one, in PATN_COM_F_VOID. There
are 2 direct casts in mpitb.h and you correctly changed them to <long>.
Double-check the remaining ones are also changed (up to 62 direct
casts). Then also change the 13 reverse casts, like this one in hGrp.h
(which accumulates 7 of them)
________________
MPI_Comm P##_comm=reinterpret_cast<MPI_Comm>(args(ARGN).long_value() );\
instead of
MPI_Comm P##_comm=reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );\
________________
I think the compiler might not have choked on those, since the
declarations in ov.h are
________________
virtual int
int_value (bool req_int = false, bool frc_str_conv = false) const
{ return rep->int_value (req_int, frc_str_conv); }
...
virtual long int
long_value (bool req_int = false, bool frc_str_conv = false) const
{ return rep->long_value (req_int, frc_str_conv); }
________________
so int_value() is returning 4 bytes (an int) instead of 8 bytes (a long
int), but the compiler is happy since it can promote the int to a full
8-byte pointer. Is that true, or did the compiler also complain about
those reverse casts?
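To illustrate why such a reverse cast can compile cleanly and still produce a bad handle, here is a small self-contained sketch. `FakeValue` is a hypothetical stand-in, not Octave's octave_value: it uses fixed-width types to mimic the IA64 sizes (4-byte int, 8-byte long) and simply models "keep the low 4 bytes". What Octave's int_value() really returns for an out-of-range value is not shown in this thread, and the handle value is made up.

```cpp
#include <cstdint>

// Hypothetical stand-in for an Octave value that stores an integer handle.
struct FakeValue {
    std::int64_t stored;

    // Models a 4-byte result: only the low 32 bits are kept (assumption).
    std::int32_t int_value() const {
        return static_cast<std::int32_t>(stored & 0x7FFFFFFFLL);
    }

    // Models an 8-byte result: the whole handle is kept.
    std::int64_t long_value() const { return stored; }
};

// Widening the 4-byte result back to pointer width, as the compiler does
// silently in the reverse cast, cannot restore the lost upper bits.
inline bool survives_int_value(const FakeValue& v) {
    return static_cast<std::int64_t>(v.int_value()) == v.stored;
}

// The 8-byte path returns the handle unchanged.
inline bool survives_long_value(const FakeValue& v) {
    return v.long_value() == v.stored;
}
```

A handle such as 0x2000000000001234 fails the first check and passes the second, while a small value such as 0x1234 passes both, which is why a truncating path can go unnoticed when addresses happen to fit in 4 bytes.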
6.- (just a joke) locate in the sources the last line of code shown,
the one with the bad C-style typecast
I think I have already answered this question under question 4.
Hmmm, no, I think we are talking about different casts. Here I meant the
untagged, C-style casts. These are more difficult (impossible) to
track, since I didn't tag them as "reinterpret". That's why I said it
was my fault. Recall your problem was
Unfortunately some problems occur at run time,
...
[info rank]=MPI_Comm_rank(MPI_COMM_WORLD)% rank=0
MPI process rank 0 (n0, p31218) caught a SIGSEGV in MPI_Comm_rank.
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD): - MPI_Comm_rank()
Rank (0, MPI_COMM_WORLD): - main()
Perhaps the pointer is being correctly cast to long but it is not
being correctly cast back to a pointer, since it's using this code:
________________
MPI_Comm comm = (MPI_Comm) args(ARGN).int_value();
________________
That's my fault. Right now I cannot remember why I didn't write any
XXX_cast keyword there. When I learned one shouldn't use C-style
casts in C++, I started using static_cast and reinterpret_cast. Perhaps
I wrote that line before I learned that.
MPI_Comm_rank uses PATN_INT_F_COM (also defined in mpitb.h), which in
turn uses BLCK_ONECOM, which includes the line shown at mpitb.h:141.
It's my fault. It should have read:
________________
MPI_Comm comm = reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );
and on IA64 now it should read (make it look like that)
MPI_Comm comm = reinterpret_cast<MPI_Comm>( args(ARGN).long_value() );
and when I finish the modifications I told you before, it will read
something like this
MPI_Comm comm = OCTAVE_TO_CPTR( args(ARGN) );
________________
5.- copy-paste a screen dump with the same Octave command sequence I
showed above
this is the command sequence:
address@hidden ~]$ octave
Set SSI rpi to tcp with the command:
putenv('LAM_MPI_SSI_rpi','tcp'), MPI_Init
Help on MPI: help mpi
octave-2.1.72:1> MPI_COMM_WORLD
ans = 2.3058e+18
Hmpf, I should have suggested format long :-) since my suspicion was long=8B
octave-2.1.72:2> MPI_Init
ans = 0
octave-2.1.72:3> a=MPI_COMM_WORLD
a = 2.3058e+18
octave-2.1.72:4> whos a
*** local user variables:
Prot Name Size Bytes Class
==== ==== ==== ===== =====
rwd a 1x1 8 scalar
Total is 1 element using 8 bytes
octave-2.1.72:5> MPI_Finalize
ans = 0
octave-2.1.72:6>
Good, good, I think it's crystal clear now. There is again a lot of
homework for both of us :-) I'll start hunting all the C-style typecasts
that might remain in the sources, and rewrite all of them (reinterpret
and C-style) with the macros I'm planning. Your homework is now:
1.- Double-check that all reinterpret casts use long (both direct and
reverse)
2.- Correct the C-style cast I told you about in item 6.- (just that, we
have not found any more yet)
3.- Go on checking MPITB for errors... I think you'll find them faster
than I will. I must read sources, you just need to run the examples in
the tutorial and complain when you find another error :-)
4.- Also, answer the question I asked in item 4.- ... the compiler didn't
complain about the reverse typecasts... did it?
n.- Go back to direct e-mail when people in the mailing list start to
complain because they get too bored :-)
Dear Javier, thanks a lot for your time and I'm impatient
for your reply.
I hope you can make interesting use of MPITB (in class or in research)
and that I can soon include a link to your work on the MPITB web page.
-javier