guile-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

array operations


From: Daniel Llorens
Subject: array operations
Date: Sat, 9 Mar 2013 00:42:43 +0100

On Mar 4, 2013, at 03:27, Noah Lavine wrote:

> (I'm snipping the rest of your message because it needs more thought than I 
> can give it right now.)

That message contained a number of errors, the consequence of talking before 
walking, I suppose.

I have added a simple reduction operator to the library, enough to define an 
inner product operation:

(define* (dot + * A B #:key type)
  "dot + * A B
   Inner product between the last axis of A and the first of B."
  (let ((type (or type (array-type* A))))
    (ply/type type
              (w/rank (Jv (cut folda/type type (lambda (c a b) (+ c (* a b))) 0 
<> <>) '_ '_)
                      1 '_)
              A B)))

There are three loops in here:

---the outer loop (ply) over the axes of A, save the last (whence the arg 1 to 
w/rank)
---the middle loop (folda) over the last of A and the first of B (the axes 
being reduced)
---the inner loop (ply) over the axes of B, save the first —--this is because 
folda applies its op (lambda (c a b) ...) as an array op.

If A & B are both rank 2, this is like 'ikj' in §1.1.11 of Golub & Van Loan.

For example:

(define a (i. 300 100))
(define b (i. 100 200))
(define af (array-copy 'f64 a))
(define bf (array-copy 'f64 b))
(define as (array-copy 's64 a))
(define bs (array-copy 's64 b))

> ,time (dot + * a b)
$19 = #
;; 4.836000s real time, 4.870000s run time.  0.350000s spent in GC.
> ,time (dot + * af bf)
$20 = #
;; 7.166000s real time, 7.210000s run time.  0.690000s spent in GC.
> ,time (dot + * as bs)
$21 = #
;; 10.218000s real time, 10.280000s run time.  0.240000s spent in GC.

So it's horribly slow, and the differences between types are very large. Still, 
this shows that it's spending a significant fraction of the time in  
arithmetic, or at least not in loop overhead (I think?)

The profiler output for the type #t version is

> ,profile (dot + * a b)
%     cumulative   self             
time   seconds     seconds      name
 71.43      5.07      5.07  #<procedure 11d767ab0 at util/reduce.scm:87:47 (c a 
b)>
  7.14      1.01      0.51  map
  7.14      0.51      0.51  =
  7.14      0.51      0.51  %after-gc-thunk
  7.14      0.51      0.51  #<procedure 11c52ff90 at util/ploy.scm:165:16 i>
  0.00      7.10      0.00  #<procedure 1163083a0 at ice-9/top-repl.scm:66:5 ()>
  0.00      7.10      0.00  array-index-map!
  0.00      7.10      0.00  call-with-prompt
  0.00      7.10      0.00  folda/type
  0.00      7.10      0.00  apply-smob/1
  0.00      7.10      0.00  catch
  0.00      7.10      0.00  start-repl
  0.00      7.10      0.00  run-repl
  0.00      7.10      0.00  ply/type
  0.00      7.10      0.00  statprof
  0.00      7.10      0.00  #<procedure 1163085a0 at ice-9/top-repl.scm:31:6 
(thunk)>
  0.00      7.10      0.00  #<procedure 1189d1d20 at util/ploy.scm:121:5 i>
  0.00      7.10      0.00  #<procedure 1189cf210 at statprof.scm:655:4 ()>
  0.00      5.07      0.00  array-map!
  0.00      1.52      0.00  nested-op-frames
  0.00      0.51      0.00  nested-match-frame
  0.00      0.51      0.00  array-map/frame!
  0.00      0.51      0.00  length=?
  0.00      0.51      0.00  make-shared-array
---
Sample count: 14
Total time: 7.1 seconds (0.22 seconds in GC)

The top entry is (lambda (c a b) ...) in the definition of dot above. The map 
comes from the definition of folda/type (the middle loop).

(define (folda/type type op_ z . a)
  (if (null? a)
    z
    (let ((op (if (verb? op_) op_ (make-verb op_ #f #f))))
      (let ((end (tally (car a))))
        (let loop ((i 0) (c z))
          (if (< i end)
; @todo factor out frame matching from repeated application of ply
            (loop (+ 1 i) (apply ply/type type op c (map (cut from_ <> i) a)))
            c))))))                                   
                                                      ^
   here -----------------------------------------------

There's no map in the inner loop since the op has rank 0, so it resolves to 
array-map!.

------------

As a reference point,

> ,time (blas-dgemm a b 0. 'no 'no )
$28 = #
;; 0.009000s real time, 0.000000s run time.  0.000000s spent in GC.
> ,time (blas-dgemm af bf 0. 'no 'no )
$29 = #
;; 0.002000s real time, 0.000000s run time.  0.000000s spent in GC.

is a few 1000 times faster. It takes more than 3 times longer to convert from 
scm to f64 than to multiply... lots of room for improvement.

------------


1) The variable rank/arity code is by necessity full of apply / map / rest 
args. However, in practice, it will be used most of the time with rank 1 / 2 / 
maybe 3. It's not acceptable to have  this apply-map business going on in an 
inner loop, but I don't want to write rank-specific or arity-specific variants 
by hand. I could try to use syntax-rules' ... for the fixed rank/arity 
functions and select with case-lambda, but that still requires that I write two 
versions of nearly every function in the library. Is there a way to avoid this? 
It occurred to me that the partial evaluator could help, but I'm not sure how. 
Or maybe somebody has a meta-macro that takes the version using ... and 
converts it to the variable rank/arity version or viceversa.

2) Iterating over an array is a pain. I had this issue before when mapping over 
subarrays, so I wrote an array-map/frame!. If I write an array-for-each/frame

(let ((c z))
  (apply array-for-each/frame f (lambda a (set! c (apply ply/type type op c 
a))) a)
  c)

at least this would remove the map-from in the scalar case. It's something... 
still, carrying c outside it's a bit ugly.

3) is it normal that the arithmetic is so slow? the code above generates tons 
of temporaries and has a ton of overhead and it still manages to spend most of 
its time in the + * op.

Regards,

        Daniel





reply via email to

[Prev in Thread] Current Thread [Next in Thread]