Re: Problem with MPITB on IA64 arch

From: Javier Fernández Baldomero
Subject: Re: Problem with MPITB on IA64 arch
Date: Fri, 20 Jan 2006 18:25:31 +0100
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.2) Gecko/20040804 Netscape/7.2 (ax)

Hi Gianvito,

Gianvito Quarta wrote:

Dear Javier,
first of all, thanks for your fast answer. (I apologize for the delay in my answer, but this morning I was occupied with my course.)

I'm busy in the evenings and make you wait as well, and this time it has
taken me a long time to double-check all those typecasts, so no
problem there. I'm pleasantly surprised again, this time by the
precision and completeness of your answer. I hope I can keep up with your
standards :-) And sorry for such a long e-mail.

1.- copy-paste LAM config.log lines related to int and void* sizes, alignments and endianness:

configure:5363: checking size of int
configure:5408: result: 4
configure:5436: checking size of long
configure:5481: result: 8
...
configure:5801: checking size of void *
configure:5846: result: 8
configure:6111: checking alignment of int
configure:6172: result: 4
configure:6188: checking alignment of long
configure:6249: result: 8
...
configure:6573: checking alignment of void *
configure:6634: result: 8
configure:19090: checking whether byte ordering is bigendian
configure:19301: result:no

Excellent, your choice of "long" for holding a pointer was perfect.
"int" typecast won't work on IA64, and the compiler chokes appropriately.
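
As a quick cross-check of those configure results, a minimal sketch like the following could be compiled on the target machine. It is not part of MPITB or LAM: the helper name can_hold_pointer is made up for illustration, and std::intptr_t is used only as the standard integer type intended to hold a pointer.

```cpp
#include <cstdint>

// Hypothetical helper (not from MPITB): true when integer type T is at least
// as wide as a data pointer, so a pointer value can be stored in it.
template <typename T>
constexpr bool can_hold_pointer() {
    return sizeof(T) >= sizeof(void*);
}

// std::intptr_t is specified to be able to hold a pointer where it exists.
static_assert(can_hold_pointer<std::intptr_t>(),
              "intptr_t should be wide enough for a pointer");

// With the sizes reported by configure above (int = 4, long = 8, void * = 8),
// can_hold_pointer<int>() would be false and can_hold_pointer<long>() true.
```

On a machine where int and void * have the same size, both checks would be true, which is why the int casts went unnoticed on IA32.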

2.- copy-paste Octave config.log lines related to int sizes:

configure:13130: checking size of int
configure:13521: result: 4
configure:13528: checking for long
configure:13589: result: yes
configure:13592: checking size of long
configure:13983: result: 8
configure:13990: checking for long long
configure:14051: result: yes
configure:14054: checking size of long long
configure:14445: result: 8

yup, both configures (LAM's and Octave's) reach the same conclusion.
Octave's configure does not check for pointer size, obviously. LAM
requires detailed knowledge of data representation to send the data to
other computers, which might be heterogeneous, with different data size,
alignment and even endianness. I should study how to extract that
information from LAM and reuse it for recompiling MPITB, instead of
suffering these ugly dependences on MPITB sources. Sigh.

At least, I'll try to add two macros "CPOINTER_TO_OCTAVE(c-ptr)" and
"OCTAVE_TO_CPTR(ov)" to do the casts in all these 75+ places (see
below). I'll also try to look for an appropriate available compile-time
definition (IA64 or whatever) so the macros automatically change for
your cluster. Now that I think of it, since I need local variables of the
appropriate type, I'll also need an "OCT_PTRTYPE" symbol that
changes accordingly (int for IA32, long for IA64). I'll ask you to do
the beta test, if you don't mind :-)
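
A rough sketch of what such macros might look like follows. It is only an illustration under several assumptions, not the final MPITB code: the predefined symbols __ia64__, __LP64__ and _WIN64 are assumed to be what the compiler provides, FakeComm is a made-up stand-in for an MPI handle type, and OCTAVE_TO_CPTR here takes the target type as an extra first argument and a plain integer instead of an octave_value, so the example can be compiled without the Octave or LAM headers.

```cpp
// Pick an integer type wide enough for a pointer.
// ASSUMPTION: these predefined macros are available on the compilers in use.
#if defined(__ia64__) || defined(__LP64__)
typedef long OCT_PTRTYPE;        // 64-bit pointers, 64-bit long (e.g. IA64)
#elif defined(_WIN64)
typedef long long OCT_PTRTYPE;   // 64-bit pointers, 32-bit long
#else
typedef int OCT_PTRTYPE;         // 32-bit pointers (e.g. IA32)
#endif

// Fail at compile time instead of at run time if the choice above is wrong.
static_assert(sizeof(OCT_PTRTYPE) >= sizeof(void*),
              "OCT_PTRTYPE is too narrow to hold a pointer");

// Direct cast: C pointer -> integer that can be handed back to the interpreter.
#define CPOINTER_TO_OCTAVE(cptr) (reinterpret_cast<OCT_PTRTYPE>(cptr))

// Reverse cast: integer -> C pointer of type TYPE.
// (Hypothetical two-argument form; the real macro would wrap long_value().)
#define OCTAVE_TO_CPTR(TYPE, val) \
    (reinterpret_cast<TYPE>(static_cast<OCT_PTRTYPE>(val)))

// Made-up stand-in for an MPI handle such as MPI_Comm (a pointer in LAM).
struct FakeComm { int rank; };
typedef FakeComm* FakeCommHandle;

// Returns true when a handle survives the direct + reverse conversion.
bool round_trips(FakeCommHandle h) {
    OCT_PTRTYPE as_integer = CPOINTER_TO_OCTAVE(h);
    FakeCommHandle back = OCTAVE_TO_CPTR(FakeCommHandle, as_integer);
    return back == h;
}
```

Keeping the type choice in one typedef means the 75 cast sites would no longer need to be edited by hand for each architecture.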

3.- tell me if you modified the line I mentioned (, on line 33)

I verified that the MPI_COMM_WORLD was also modified as:

#define MPI_COMM_WORLD ((MPI_Comm) &lam_mpi_comm_world)
RET_1_ARG(reinterpret_cast<long>( NAME )) // defined -> expanded }

I see, you modified them as the compiler spotted them one by one. Good.
Let me count them, that makes... (I'm grepping right now)

2                  int <- MPI_Comm
8    MPI_B* L* MAX/ files)    int <- MPI_Op
3    MPI_COMM* (3 files)              int <- MPI_Comm
2    MPI_Comm_spawn* (2 files)        int <- MPI_Comm (children)
1                    int <- MPI_Copy_fn
3    MPI_ERR* (3 files)               int <- MPI_Errhandler
10  MPI_Errhandler/Keyval _create/ (4 files)
                                      int <- MPI_Comm*, MPI_Errhandler
3    MPI_GROUP*/INFO* (3 files)       int <- MPI_Group, MPI_Info
6    MPI_NULL*_F, OP_NULL, PROD, SUM, REPL (6 files)
                                      int <- MPI_*_fn, MPI_Op
1    MPI_REQUEST_NULL                 int <- MPI_Request
2    hColl.h    (0 direct+2 reverse)
2    hErr.h     (1+1)
5    hInfo.h    (4+1)
1    hSend.h    (1+0)
3    hTopo.h    (3+0)
4    hTstWait   (2+2)
17    hGrp.h    (10+7)
2    mpitb.h    (2 direct+0 reverse)

To explain this "direct" and "reverse" accounting, let's use hGrp.h as
example. It also includes Michael's credited int-to-int typecast. I
can't recall what I was smoking when I wrote that. Around line 125, you
can see a (reverse) MPI_Comm<-int typecast, and what tried to be a
(direct) int<-MPI_* typecast but became a silly embarrassing bug.
MPI_Comm comm1=reinterpret_cast<MPI_Comm>( args(ARGM).int_value() ),  \
         comm2=reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );
#define BLCK_COMLDR(NAME,ARGN,P)              /* P for PREFIX */      \
MPI_Comm P##_comm=reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );\
   int   P##_lead =                       ( args(ARGN+1).int_value() );
// int   P##_lead = reinterpret_cast<int >( args(ARGN+1).int_value() );
// BUG:BRAINDAMAGE!!casting int to int, compile error w/gcc4.0.2 Ubuntu
// error: invalid cast from 'int' to 'int'
// CREDIT: detected by Michael Creel, Econometrics UAB Spain

Since this fix is still only in my sources, not in the tarball, I think you
should find 75 int casts: 13 of them reverse (MPI_*<-int) and the
remaining 62 direct (int<-MPI_*). Did you count them? Did you mark them
in the sources so they are easy to locate? If not, you could still compare against the
original sources and double-check that there are indeed 75 = 62 direct + 13 reverse.

BTW, don't keep the int-to-int cast :-) Modify the others to long

Do you have an AOL, Messenger, Skype or ICQ account?

Hm, sorry, no. I know this is terribly slow; I spent all this time
thinking about the answer and writing the e-mail. Sorry for the inconvenience,
please bear with me :-)

4.- tell me if you modified (and how) any other line:

I modified each reinterpret_cast at which the compilation problem occurred.
In particular, I modified the definitions in the header file mpitb.h, for example:

        NARGCHK       (NAME,2)
        BLCK_ONECOM   (NAME,0)
        BLCK_ONEINT   (NAME,1,high)     /

        MPI_Comm                    intracomm=0;
        int info=NAME (comm, high, &intracomm );
        RET_2_ARG     (info, reinterpret_cast<long>( intracomm )    )

where before the original code was
RET_2_ARG (info, reinterpret_cast<int>( intracomm )  )

Excellent! there is another before that one, in PATN_COM_F_VOID. There
are 2 direct casts in mpitb.h and you correctly changed them to <long>.
Double-check the remaining ones are also changed (up to 62 direct
casts). Then also change the 13 reverse casts, like this one in hGrp.h
(which accumulates 7 of them)
MPI_Comm P##_comm=reinterpret_cast<MPI_Comm>(args(ARGN).long_value() );\
instead of
MPI_Comm P##_comm=reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );\

I think the compiler might not have choked on those, since the
declarations in ov.h are
 virtual int
 int_value (bool req_int = false, bool frc_str_conv = false) const
   { return rep->int_value (req_int, frc_str_conv); }
 virtual long int
 long_value (bool req_int = false, bool frc_str_conv = false) const
   { return rep->long_value (req_int, frc_str_conv); }

so int_value() returns 4 bytes (an int) instead of 8 bytes (a long
int), but the compiler is happy since it can widen the int to a full
8-byte pointer. Is that right, or did the compiler also complain about those
reverse casts?
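
To make the suspected failure concrete, here is a small stand-alone sketch of what happens when a 64-bit address passes through a 32-bit integer on the way back. It does not use Octave or LAM: kAddr is a made-up address of the same magnitude as the MPI_COMM_WORLD value printed in the Octave session (about 2.3058e+18, close to 2^61), and fixed-width unsigned types are used so the behaviour is well defined on any machine.

```cpp
#include <cstdint>

// Made-up 64-bit address, roughly 2^61 plus a small offset.
constexpr std::uint64_t kAddr = 0x2000000000001234ULL;

// Simulates the int_value() path: only the low 32 bits survive.
std::uint64_t through_32bit(std::uint64_t addr) {
    std::uint32_t narrow = static_cast<std::uint32_t>(addr);  // high bits dropped
    return static_cast<std::uint64_t>(narrow);                // widened, but too late
}

// Simulates the long_value() path on a machine where long is 64 bits.
std::uint64_t through_64bit(std::uint64_t addr) {
    std::int64_t wide = static_cast<std::int64_t>(addr);      // all bits kept
    return static_cast<std::uint64_t>(wide);
}
```

With this input, through_32bit(kAddr) yields 0x1234 while through_64bit(kAddr) returns the original address. Dereferencing a pointer rebuilt from the truncated value would be consistent with a SIGSEGV like the one reported in MPI_Comm_rank.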

6.- (just a joke) locate in the sources the last line of code shown, the one with the bad C-style typecast

I think I have already answered this in question 4.

Hmmm, no, I think we are talking about different casts. Here I meant the
untagged, C-style casts. These are more difficult (impossible) to
track, since I didn't tag them as "reinterpret". That's why I said it
was my fault. Recall your problem was

Unfortunately some problems occur at run time,
[info rank]=MPI_Comm_rank(MPI_COMM_WORLD)% rank=0
MPI process rank 0 (n0, p31218) caught a SIGSEGV in MPI_Comm_rank.
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD):  - MPI_Comm_rank()
Rank (0, MPI_COMM_WORLD):  - main()

Perhaps the pointer is being correctly cast to long but it is not being correctly cast back to pointer, since it's using this code:
       MPI_Comm comm = (MPI_Comm) args(ARGN).int_value();
That's my fault. Right now I cannot remember why I didn't write any XXX_cast reserved word there. When I learned one shouldn't directly cast in C++, I started to static_ and reinterpret_cast. Perhaps I wrote that line before I learned that.

MPI_Comm_rank uses PATN_INT_F_COM (also defined in mpitb.h), which in
turn uses BLCK_ONECOM, which includes the line shown at mpitb.h:141.
It's my fault. It should have read:
MPI_Comm comm = reinterpret_cast<MPI_Comm>( args(ARGN).int_value() );

and on IA64 now it should read (make it look like that)

MPI_Comm comm = reinterpret_cast<MPI_Comm>( args(ARGN).long_value() );

and when I finish the modifications I told you before, it will read
something like this

MPI_Comm comm = OCTAVE_TO_CPTR( args(ARGN) );

5.- copy-paste a screen dump with the same Octave command sequence I showed above

this is the command sequence:

address@hidden ~]$ octave
Set SSI rpi to tcp with the command:
  putenv('LAM_MPI_SSI_rpi','tcp'), MPI_Init
Help on MPI: help mpi
octave-2.1.72:1> MPI_COMM_WORLD
ans = 2.3058e+18

Hmpf, I should have suggested format long :-) since my suspicion was long=8B

octave-2.1.72:2> MPI_Init
ans = 0
octave-2.1.72:3> a=MPI_COMM_WORLD
a =  2.3058e+18
octave-2.1.72:4> whos a

*** local user variables:

  Prot Name        Size                     Bytes  Class
  ==== ====        ====                     =====  =====
   rwd a           1x1                          8  scalar

Total is 1 element using 8 bytes

octave-2.1.72:5> MPI_Finalize
ans = 0

Good, good, I think it's crystal clear now. There is again a lot of
homework for both of us :-) I'll start hunting all the C-style typecasts
that might remain in the sources, and rewrite all of them (reinterpret and
C-style) with the macros I'm planning. Your homework is now:

1.- Double check all reinterpret casts are long (both direct and reverse)
2.- Correct the C-style cast I told you about in item 6.- (just that one, we have
not found any more yet)
3.- Go on checking MPITB for errors... I think you'll find them faster
than I. I must read sources, you just need to run the examples in the
tutorial and complain when you find another error :-)
4.- Also, reply to the question I asked in item 4.- ... the compiler didn't
complain about the reverse typecasts... did it?
n.- Go back to direct e-mail when people in the mailing list start to
complain because they get too bored :-)

Dear Javier, thanks a lot for your time and I'm impatient
for your reply.

I hope you can make interesting use of MPITB (in class or in research)
and that I can soon include a link to your work on the MPITB web page.

