
Re: [Gnucap-devel] [devel-gnucap] Parallelism


From: al davis
Subject: Re: [Gnucap-devel] [devel-gnucap] Parallelism
Date: Sun, 2 Mar 2014 20:53:03 -0500
User-agent: KMail/1.13.7 (Linux/3.2.0-4-amd64; KDE/4.8.4; x86_64; ; )

On Thursday 27 February 2014, Felix Salfelder wrote:
> On Wed, Feb 26, 2014 at 07:34:24PM -0500, al davis wrote:
> > Very simple ..  Identify certain loops that can run in
> > parallel. That is really all.
> > 
> > You should look at the output of the "status" command to
> > see where the time is spent, which will show where
> > parallelism could be of benefit and how much benefit to
> > expect.
> 
> the status command says a lot of time is used in "advance".
> advance apparently just shifts the history for each device. 
> so, why does every device store this individually? is this
> bound to a condition that is not implemented? something like
> 'dormant subcircuits'?

Storing time: yes, for multi-rate.

Otherwise, some of what is stored is not available elsewhere, 
not part of the voltage vector.  Example: capacitor charge.

It is needed, but I am sure there is a way to do it faster.  It 
could use a queue to do some optimization, or maybe it could be 
sped up by rotating pointers instead of data.

> and... "review" takes quite some time. i have started to
> implement an audio processor, mostly consisting of
> behavioural models connecting to jack. here, the
> sophisticated time step control is of no use, and a
> simplified variant of the tran command runs significantly
> faster. it reaches realtime on simple circuits.

If you are forcing a higher sample rate, what you say is 
probably true.  You can turn it off by "option trsteporder=0".

Also, there are some loops that count to OPT::_keep_time_steps that 
probably should count to order() or order()+1 instead because 
the higher order derivatives are not used.  Then you could make 
the review step faster by "option trsteporder=1".

But then .. would that make the simulation faster? .. maybe.  A 
higher trsteporder allows larger time steps.

I see a way to speed it up, maybe.  ....  It estimates 
derivatives by divided differences, and uses the result to 
suggest a step size.

Starting with 5 data points, it computes 4 first derivatives, 3 
second derivatives, 2 third derivatives, 1 fourth derivative.  3 
of the 4 first derivatives are the same as last time, just 
shifted.  Likewise, 2 of the 3 second derivatives, and so on 
down the table.  This is an opportunity for optimization. 

> > I think the overhead of parallelizing the dot product would
> > be too high, thinking of the multi-thread model.  The dot
> > product might be a candidate for GPU type processing, but
> > look at "status" to judge whether there is enough
> > potential benefit before doing this.
> 
> when i read this article [1] i had the impression that with
> some hints to the compiler, the dot product can be computed
> faster on recent hardware. i haven't tried. maybe alignment
> hints help in other places as well?

You are mixing "parallelize" with "vectorize".

Unlike most sparse packages, the inner loop here (dot_product) 
is designed as a vector operation, known to be non-overlapping.  
I presented a conference paper on this a long time ago.  1987?  

That referenced another paper (1985?) comparing the different 
ways to do LU decomposition on a dense matrix on a vector 
machine.  I think it was written by Dongarra and others, 
published in one of the SIAM journals.  Summary: 6 variants, all 
theoretically the same, but vectorization and memory access 
speed makes a difference.

I can't find these papers now.

This was a big factor in why the matrix is stored the way it is, 
and why it doesn't do post-ordering.

It is probably worth checking to see if those hints make any 
difference, but the code should already be in the best form for 
this.

> 
> > No .. that would probably make it slower.  The speed gain
> > of a better order would be offset by the overhead of
> > ordering and the more complex access.
> 
> there's already a permutation matrix involved in matrix
> allocation, _sim->_nm (u_sim_data). what i do not understand
> is, why the permutation is computed before iwant_matrix is
> called on the circuit. i have changed this (in gnucap-uf
> [2]). the current use case is weeding out unconnected nodes.
> but it is also possible to compute a permutation from the
> adjacency matrix (which is constructed in BSMATRIX), or from
> the netlist hierarchy or from anything. i have implemented
> some simple examples.

What I was referring to by "slower" is a permutation that needs 
to be looked up on every memory access.  During calculations, 
the actual matrix index numbers are the only ones used.  The 
unpermuted numbers are only used for i/o.

But yes .. _sim->_nm is the place to store that.  Just once.

I did some experimenting with ordering, also a long time ago, 
and went back to it a few times over the years, always backing 
out the fancy stuff.

The popular Markowitz algorithm takes quadratic time to run.  
This alone makes it unsuitable for use here.  Most (all that I 
know of) ordering algorithms are slower than linear, which is a 
big reason that ordering is done just locally, not on the entire 
matrix.  For big mixed-mode simulations, even linear is too 
slow.

Also .. the Markowitz algorithm applied globally results in a 
poor ordering for mixed-mode.  It totally destroys the 
incremental update advantage.

One conclusion I reached early was that ordering must be done 
separately on each subckt definition, before expansion.  One 
exception to this is the case where all or almost all elements 
within a subcircuit connect to ports.




