Re: [Gnucap-devel] Gnucap-devel Digest, Vol 88, Issue 3
From: al davis
Subject: Re: [Gnucap-devel] Gnucap-devel Digest, Vol 88, Issue 3
Date: Tue, 4 Mar 2014 23:43:43 -0500
User-agent: KMail/1.13.7 (Linux/3.2.0-4-amd64; KDE/4.8.4; x86_64; ; )
On Tuesday 04 March 2014, beranger six wrote:
> Regarding CUDA licensing, I referred to the FAQ entry
> #SystemLibraryException, which points to section 1 of the
> GNU GPLv3.
> Based on that, I assume we could distribute the software with
> CUDA code in it but without the actual library.
For us to distribute, it must be licensed GPL-3. To include the
code the license must allow us to relicense it GPL-3.
Having said that ... sometimes it makes sense, as a learning
experience, to write trial code you don't expect to leave in.
> Using our projects (big systems with low accuracy), a lot of
> time is spent inside lu_decomp().
>
> >> In the LU decomposition, running the outermost loop in
> >> parallel should be all that is needed there. But to get
> >> enough benefit model evaluation also should be parallel,
> >> and likely more important.
>
> I think that we cannot parallelize the outermost loop
> because we noticed that most elements are dependent on
> previous computations. (A block depends on all blocks
> between its lownode (bn) and its row value (mm).)
25 years later ... I look at the code again ....
Yes .. the inner loop uses previously calculated values in L and
U.
But it is in blocks. It is a bordered-block-diagonal form.
You need to solve each block as a unit, but blocks can be solved
in parallel with other blocks, and all blocks must be solved
before the border can be solved. The border is the connections
between blocks.
I considered moving the block decomposition to be part of the
subckt code, perhaps doing the decomposition as part of the load
pass. Then the final LU would only need to do the main border.
Block loading can be done in parallel but I question how
effective it would be because of the non-parallel hardware data
transfer path.
This move makes parallelizing matrix blocks easy, but that is
not the biggest advantage. A much bigger advantage is that it
makes it easier to skip whole blocks when doing partial
solutions.
Another big advantage is that when there are multiple
instantiations of linear or quasi-linear subckt blocks it
becomes possible to solve the block once and reuse the
decomposed blocks.
> To bypass this issue we thought about using a dependence
> tree. This tree would be created when we get the system and
> updated each time the lownodes change, so building the
> dependence tree would not be time-consuming. Please refer to
> the PDF attached to this mail, which explains it with some
> figures. Please keep in mind this document is a draft for now.
The 'lownode' array is a dependence tree. It is set up once
when the matrix is allocated then remains constant.
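To illustrate the point that lownode already is the dependence
tree, here is a hypothetical sketch (0-indexed, illustrative
data, not gnucap's code). Assuming lownode[i] is the
lowest-numbered column that column i connects to, column i of
the LU only reads columns lownode[i]..i-1, so the independent
diagonal blocks can be read straight off the array:

```cpp
#include <utility>
#include <vector>

// Recover [first,last] column ranges of independent diagonal blocks
// from a lownode-style array.  Hypothetical helper, not gnucap's API.
std::vector<std::pair<int,int> >
find_blocks(const std::vector<int>& lownode) {
  std::vector<std::pair<int,int> > blocks;
  for (int i = 0; i < (int)lownode.size(); ++i) {
    if (blocks.empty() || lownode[i] > blocks.back().second) {
      // column i depends on nothing earlier: start a new block
      blocks.push_back(std::make_pair(i, i));
    } else {
      blocks.back().second = i;   // column i extends the current block
      while (blocks.size() > 1
             && lownode[i] <= blocks[blocks.size()-2].second) {
        // column i reaches back into the previous block: merge them
        blocks[blocks.size()-2].second = blocks.back().second;
        blocks.pop_back();
      }
    }
  }
  return blocks;
}
```

For example, lownode = {0, 1, 1, 3, 3} yields three blocks:
{0}, {1,2}, and {3,4}; no extra tree structure is needed.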
The list filters do not pass attachments. Can you post a link?
If you need a place to upload it I can set up an account for you
on the wiki and you can upload it there.
> You are right, my tests show that the number of data
> transfers between the host (CPU) and the device (GPU) makes
> it not suitable for our kind of application. It still seems
> like a good way for very large systems, though we need to do
> some testing to determine the matrix size at which it becomes
> interesting.
For math, the CPU is speed-limited by memory access. The
operations add, subtract, and multiply all take essentially no
time, because the time spent transferring between memory and
CPU registers dominates and overlaps the actual calculation.
Since we are discussing speed ... There are two reasons the
matrix solver is not a plugin.
The first is speed. Plugins use the virtual function mechanism,
which for speed is equivalent to a real function call. Having
the access functions in the header file, not a plugin, allows
inline code generation without function call overhead. It makes
about a 2:1 difference in run speed for the load step.
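The contrast can be sketched like this. The names are
illustrative, not gnucap's real interface; the point is what the
compiler can do with each style:

```cpp
// Plugin style: every matrix access goes through the vtable, so it
// costs a real indirect call and cannot be inlined at the call site.
struct MATRIX_PLUGIN {
  virtual double& m(int r, int c) = 0;
  virtual ~MATRIX_PLUGIN() {}
};

// A concrete plugin implementation (for demonstration only).
struct DENSE : public MATRIX_PLUGIN {
  double _d[16];
  double& m(int r, int c) override { return _d[r*4 + c]; }
};

// Header style: the accessor is visible to the compiler, so inside
// the load loop it compiles down to a single indexed load/store.
struct MATRIX_INLINE {
  double _d[16];
  double& m(int r, int c) { return _d[r*4 + c]; }
};
```

In a load pass that touches the matrix millions of times, the
difference between an indirect call and an inlined load is where
a 2:1 run-speed difference plausibly comes from.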
The other reason is template generation. The matrix is a
template class. Gnucap already uses double and complex in the
release version. I have done experiments using other types
including polynomials. This would not be possible with the
plugin mechanism.
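A minimal sketch of why the template matters, using a toy 2x2
class (illustrative only, not gnucap's matrix class): the same
in-place LU code instantiates for double, std::complex, or any
type with +, -, *, and /.

```cpp
#include <complex>

// Toy 2x2 matrix [[a b], [c d]] with in-place Doolittle LU,
// no pivoting.  Works for any type T with +, -, *, /.
template <class T>
struct Mat2 {
  T a, b, c, d;
  void lu() {
    c = c / a;                      // L(2,1)
    d = d - c * b;                  // U(2,2), the Schur complement
  }
  void solve(T& x1, T& x2) const {  // overwrite the rhs with the solution
    x2 = x2 - c * x1;               // forward substitution: L y = b
    x2 = x2 / d;                    // back substitution: U x = y
    x1 = (x1 - b * x2) / a;
  }
};
```

A plugin boundary would have to fix the element type in the ABI;
with the template, Mat2<double> and Mat2<std::complex<double> >
come from the same source, and experimental types cost nothing.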
On the pdf ....
That particular matrix structure is unlikely. It shows two
circuit blocks with no connection to each other.
One paper referenced states the run time of NGSpice as O(n^1.2)
or something like that. My tests show it as O(n^2). Gnucap
comes in somewhere between O(n) and O(n log(n)) for the same
set of circuits. I believe that if the authors of that paper
didn't see spice at O(n^2), they used circuits that are too
small.
It depends on node ordering. I know that the gnucap ordering is
not optimal, particularly when fed big scrambled flat circuits.
Working on it can be very frustrating, particularly if you don't
understand why the patterns come out as they do.