Re: [Gnucap-devel] Gnucap-devel Digest, Vol 88, Issue 3


From: al davis
Subject: Re: [Gnucap-devel] Gnucap-devel Digest, Vol 88, Issue 3
Date: Tue, 4 Mar 2014 23:43:43 -0500
User-agent: KMail/1.13.7 (Linux/3.2.0-4-amd64; KDE/4.8.4; x86_64; ; )

On Tuesday 04 March 2014, beranger six wrote:
> Regarding CUDA licensing, I referred to the FAQ
> #SystemLibraryException pointing to section 1 of
> GNU/GPL3.
> Based on that I assume we could distribute the software with
> CUDA code in it but without the actual library.

For us to distribute, it must be licensed GPL-3.  To include the 
code, the license must allow us to relicense it GPL-3.

Having said that ... sometimes it makes sense, as a learning 
experience, to write trial code you don't expect to leave in.

> Using our projects (big system with low accuracy), a lot of
> time is spent inside lu_decomp().
> 
> >> In the LU decomposition, running the outermost loop in
> >> parallel should be all that is needed there.  But to get
> >> enough benefit model evaluation also should be parallel,
> >> and likely more important.
> 
> I think that we cannot parallelize the outermost loop
> because we noticed that most elements are dependent on the
> previous computations. (a block depends on all blocks
> between its lownode(bn) and its row_value(mm))

25 years later ...  I look at the code again  ....

Yes .. the inner loop uses previously calculated values in L and 
U.
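That dependence is visible even in a minimal dense Doolittle LU.  
This is only a sketch for illustration (gnucap's actual solver is 
sparse and block-structured; `lu_decomp` here is just a stand-in 
sharing the name):

```cpp
#include <cstddef>
#include <vector>

// Dense in-place Doolittle LU: afterwards, a holds U on and above
// the diagonal and L (unit diagonal implied) below it.
// The update a[i][j] -= a[i][p] * a[p][j] is the dependence:
// finishing row i / column j needs all earlier pivots 0..p done.
void lu_decomp(std::vector<std::vector<double>>& a) {
    const std::size_t n = a.size();
    for (std::size_t p = 0; p < n; ++p) {            // pivot: outermost loop
        for (std::size_t i = p + 1; i < n; ++i) {
            a[i][p] /= a[p][p];                      // L entry, needs pivot p
            for (std::size_t j = p + 1; j < n; ++j) {
                a[i][j] -= a[i][p] * a[p][j];        // uses L and U just computed
            }
        }
    }
}
```

So a blind parallel run of the outermost loop is wrong; the 
structure that makes parallelism possible is described next.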

But it is in blocks.  It is a bordered-block-diagonal form.

You need to solve each block as a unit, but blocks can be solved 
in parallel with other blocks, and must be solved before the 
border can be solved.  The border is connections between blocks.
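A sketch of that ordering, assuming a hypothetical `Block` type and 
using `std::async` for the per-block work (the real solver's data 
structures are different; this only shows the dependency pattern: 
blocks in parallel, border strictly after):

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical: one diagonal block of a bordered-block-diagonal matrix.
struct Block {
    std::vector<std::vector<double>> a;
    bool factored = false;
    void factor() {                    // in-place LU of this block alone
        const std::size_t n = a.size();
        for (std::size_t p = 0; p < n; ++p)
            for (std::size_t i = p + 1; i < n; ++i) {
                a[i][p] /= a[p][p];
                for (std::size_t j = p + 1; j < n; ++j)
                    a[i][j] -= a[i][p] * a[p][j];
            }
        factored = true;
    }
};

// Diagonal blocks are mutually independent, so each can run on its
// own thread.  Only after all of them finish can the border (the
// connections between blocks) be eliminated, serially.
void factor_bbd(std::vector<Block>& blocks /*, Border& border */) {
    std::vector<std::future<void>> jobs;
    for (Block& b : blocks)
        jobs.push_back(std::async(std::launch::async, [&b] { b.factor(); }));
    for (auto& j : jobs)
        j.get();                       // barrier: border needs every block
    // border.eliminate(blocks);       // serial final step (not shown)
}
```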

I considered moving the block decomposition to be part of the 
subckt code, perhaps doing the decomposition as part of the load 
pass.  Then the final LU would only need to do the main border.

Block loading can be done in parallel but I question how 
effective it would be because of the non-parallel hardware data 
transfer path.

This move makes parallelizing matrix blocks easy, but that is not 
the biggest advantage.  A much bigger advantage is that it makes 
it easier to skip whole blocks when doing partial solutions.

Another big advantage is that when there are multiple 
instantiations of linear or quasi-linear subckt blocks, it 
becomes possible to solve the block once and reuse the 
decomposed blocks.
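That reuse could be sketched as a cache keyed by subckt type.  
Every name here is hypothetical, not gnucap's actual API; it only 
shows the idea that identical linear instances share one 
factorization:

```cpp
#include <map>
#include <memory>
#include <string>

// Stand-in for the LU factors of one subckt's internal matrix.
struct FactoredBlock {
    int id;  // just to make sharing observable in this sketch
};

// Hypothetical cache: N instances of the same linear subckt get one
// shared factorization instead of N separate decompositions.
class BlockCache {
    std::map<std::string, std::shared_ptr<FactoredBlock>> cache_;
    int next_id_ = 0;
public:
    std::shared_ptr<FactoredBlock> get(const std::string& subckt_type) {
        auto it = cache_.find(subckt_type);
        if (it != cache_.end())
            return it->second;                 // reuse: already decomposed
        auto fb = std::make_shared<FactoredBlock>();
        fb->id = next_id_++;                   // stand-in for lu_decomp()
        cache_[subckt_type] = fb;
        return fb;
    }
};
```

Nonlinear blocks would have to be refactored on every iteration, 
so in practice the win is limited to the linear and quasi-linear 
ones, as noted above.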

> To bypass this issue we thought about using a dependence
> tree. This tree would be created when we get the system and
> updated each time the lownodes change, so maintaining this
> dependence tree is not time-consuming. Please refer to the PDF
> attached to this mail explaining it with some figures.
> Please keep in mind this document is a draft for now.

The 'lownode' array is a dependence tree.  It is set up once 
when the matrix is allocated then remains constant.
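A simplified model of that idea (the names and exact semantics 
here are approximations for illustration, not gnucap's actual 
code):

```cpp
#include <vector>

// Simplified model of the lownode array: lownode[n] is the lowest
// node number that row/column n of the matrix reaches.  Eliminating
// node n needs nodes lownode[n] .. n-1 already finished, so the
// array itself encodes the dependence tree -- no separate tree
// structure is required.
bool depends_on(const std::vector<int>& lownode, int n, int m) {
    // does eliminating node n require node m to be finished first?
    return m < n && m >= lownode[n];
}
```

Nodes whose `[lownode[n], n]` ranges do not overlap have no direct 
dependence, which is exactly the block independence discussed 
above.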

The list filters do not pass attachments.  Can you post a link?

If you need a place to upload it I can set up an account for you 
on the wiki and you can upload it there.

> You are right, my tests show that the number of data
> transfers between host (CPU) and device (GPU) makes it not
> suitable for our kind of application. It still seems like a
> good approach for very large systems, though we need to do
> some testing to determine the matrix size at which this
> becomes interesting.

For math, the CPU is speed-limited by memory access.  The 
operations add, subtract, and multiply all take essentially no 
time, because the time spent transferring between memory and CPU 
registers dominates and overlaps the actual calculation.

Since we are discussing speed ...  There are two reasons the 
matrix solver is not a plugin.  

The first is speed.  Plugins use the virtual function mechanism, 
which for speed is equivalent to a real function call.  Having 
the access functions in the header file, not a plugin, allows 
inline code generation without function call overhead.  It makes 
about a 2:1 difference in run speed for the load step.
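Roughly the difference, in a made-up example (neither class is 
gnucap's actual interface; this just contrasts the two dispatch 
styles):

```cpp
// Plugin-style access: every element access goes through the vtable,
// an indirect call the compiler cannot inline across the plugin
// boundary.
struct MatrixIface {
    virtual double& m(int i, int j) = 0;
    virtual ~MatrixIface() = default;
};

// Header-style access: the accessor body is visible at the call
// site, so the compiler emits direct loads with no call overhead --
// which matters when the load step touches every element.
class InlineMatrix {
    double* data_;
    int n_;
public:
    InlineMatrix(double* d, int n) : data_(d), n_(n) {}
    double& m(int i, int j) { return data_[i * n_ + j]; }
};
```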

The other reason is template generation.  The matrix is a 
template class.  Gnucap already uses double and complex in the 
release version.  I have done experiments using other types 
including polynomials.  This would not be possible with the 
plugin mechanism.
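A minimal sketch of the point, with a hypothetical class name (not 
gnucap's real matrix class): the template must be visible to the 
compiler so each value type gets its own generated code, which a 
runtime plugin boundary cannot provide.

```cpp
#include <complex>

// Sketch: a matrix parameterized on its element type.  Any type
// supplying the arithmetic operators works -- double, complex,
// or (experimentally) polynomials.
template <class T>
class BlockMatrix {
    // ... storage and lu_decomp() over T would go here ...
public:
    T example_op(T a, T b) { return a * b; }  // generated per type T
};

// compile-time instantiations, one per element type in use
template class BlockMatrix<double>;
template class BlockMatrix<std::complex<double>>;
```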


On the pdf ....

That particular matrix structure is unlikely.  It shows two 
circuit blocks with no connection to each other.

One paper referenced states the run time of NGSpice as O(n^1.2) 
or something like that.  My tests show it as O(n^2), with gnucap 
somewhere between O(n) and O(n log(n)) for the same set of 
circuits.  I believe that if the authors of that paper didn't 
see spice at O(n^2), it is because they used circuits that are 
too small.

It depends on node ordering.  I know that the gnucap ordering is 
not optimal, particularly when fed big scrambled flat circuits.  
Working on it can be very frustrating, particularly if you don't 
understand why the patterns come out as they do.


