
Re: [Gnucap-devel] Gnucap-devel Digest, Vol 88, Issue 3


From: beranger six
Subject: Re: [Gnucap-devel] Gnucap-devel Digest, Vol 88, Issue 3
Date: Tue, 04 Mar 2014 17:35:50 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0

I have started work to parallelize (with OpenMP, and then maybe
CUDA) the LU decomposition in gnucap.
OpenMP looks like a good way to do it.  It looks simple, and
comes with most compilers, including GCC.

Do not use CUDA.  Unless I misunderstand, the licensing of CUDA
makes it unsuitable for use in a GNU project.
Regarding CUDA licensing, I referred to the FAQ entry
#SystemLibraryException, which points to section 1 of the GNU
GPLv3.  Based on that, I assume we could distribute the software
with CUDA code in it, but without the actual library (not
considering here libraries built on CUDA, e.g. cuBLAS).  To me,
CUDA looks like one of the easiest ways to work with GPGPUs,
since it provides debugging and performance-analysis tools.
CUDA would give good insight into what happens, letting me
detect issues before refactoring the code to use OpenCL, which
is more compatible with the GPLv3 license.
I saw your topic about parallelism, and I have some time to
work on it.

Furthermore, we definitely need faster simulation results for
our application.

What kind of solution did you have in mind?
Very simple ..  Identify certain loops that can run in parallel.
That is really all.

You should look at the output of the "status" command to see
where the time is spent, which will show where parallelism could
be of benefit and how much benefit to expect.
In our projects (big systems with low accuracy), a lot of time
is spent inside lu_decomp().
In the LU decomposition, running the outermost loop in parallel
should be all that is needed there.  But to get enough benefit,
model evaluation should also be parallel, and that is likely
more important.
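
For reference, here is a minimal sketch of what such a parallel
loop looks like on a dense textbook LU (Doolittle, no pivoting).
This is a toy stand-in, not gnucap's actual lu_decomp(), and the
names toy_lu, a, and n are my assumptions.  With the pivot
column k fixed, each row update is independent, so that loop
parallelizes directly:

    // Toy dense LU (Doolittle, no pivoting), not gnucap's code.
    #include <vector>

    void toy_lu(std::vector<std::vector<double> >& a, int n)
    {
        for (int k = 0; k < n; ++k) {           // pivots: sequential
            #pragma omp parallel for
            for (int i = k + 1; i < n; ++i) {   // rows: independent
                a[i][k] /= a[k][k];             // multiplier l(i,k)
                for (int j = k + 1; j < n; ++j) {
                    a[i][j] -= a[i][k] * a[k][j];  // trailing update
                }
            }
        }
    }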

I think that we cannot parallelize the outermost loop, because
we noticed that most elements depend on previous computations:
a block depends on all blocks between its lownode (bn) and its
row value (mm).  To get around this, we thought about using a
dependence tree.  This tree would be created when we receive
the system and updated each time the lownodes change, so
building it should not be time consuming (a sketch follows
below).  Please refer to the PDF attached to this mail, which
explains the idea with some figures.  Please keep in mind that
this document is a draft for now.
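
To make the dependence tree concrete, here is a hedged sketch
of how blocks could be levelized, assuming only the rule stated
above (block mm depends on every block between lownode[mm] and
mm).  The function name and the plain std::vector representation
are my assumptions, not gnucap's actual data structures:

    // Hypothetical level scheduling from the dependence rule
    // above.  level[mm] is 0 for an independent block, otherwise
    // one more than the deepest block it depends on.  Blocks that
    // share a level could be factored in parallel.
    #include <algorithm>
    #include <vector>

    std::vector<int> build_levels(const std::vector<int>& lownode)
    {
        std::vector<int> level(lownode.size(), 0);
        for (int mm = 0; mm < (int)lownode.size(); ++mm) {
            for (int bn = lownode[mm]; bn < mm; ++bn) {
                level[mm] = std::max(level[mm], level[bn] + 1);
            }
        }
        return level;
    }

Note that a long chain of overlapping blocks collapses to one
block per level, i.e. no parallelism, which is exactly the
dependence problem described above; the tree only pays off
where the lownodes leave blocks independent.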

-Is it the "section" you design with row, diagonal, and
column? In this case did you want to use , the fact that if
all the section beetween _lownode[mm] and mm are calculated
we could computed the element.

In this case we could have a dependence graph (or tree)
applied to your storage-matrix sections, as is mostly used to
parallelize the Gilbert-Peierls algorithm.
I don't think that makes sense here, but you might want to try
it.  Remember ..  gnucap's matrix solver usually does low-rank
updates and partial solutions.  If you lose this feature, it
could make it so much slower that any parallel operation cannot
come close to recovering the loss.

The simpler solver used for AC analysis is not parallel ready.
To parallelize the AC matrix solution it may be necessary to
switch to the other lu_decomp, which requires double matrix
storage.

-Is it an iterative method, with the problem that convergence
could theoretically take an infinite number of operations?  (So
maybe not a good way.)
No -- not iterative -- except for the standard DC_tran Newton
iteration, which would not change.

-Is it parallelizing only the map (multiplication between
elements) of the dot product, and then maybe parallelizing the
reduction (addition between elements)?
I think the overhead of parallelizing the dot product would be
too high, thinking of the multi-thread model.  The dot product
might be a candidate for GPU type processing, but look at
"status" to judge whether there is enough potential benefit
before doing this.
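
For what it's worth, the multi-threaded version is a one-liner
with OpenMP's reduction clause.  This sketch (my own, not from
gnucap) fuses the map (multiply) with the reduce (add); the
overhead mentioned above is the cost of scheduling this loop
across threads:

    // Parallel dot product: the multiplies are the "map", the
    // reduction(+) clause performs the "reduce" across threads.
    double dot(const double* a, const double* b, int n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i) {
            sum += a[i] * b[i];
        }
        return sum;
    }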

You are right; my tests show that the number of data transfers
between the host (CPU) and the device (GPU) makes it unsuitable
for our kind of application.  It still seems like a good
approach for very large systems, though we need to do some
testing to determine the matrix size at which it becomes
interesting.

Thanks

