help-octave
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Fwd: Problem with pdist


From: Esteban Cervetto
Subject: Fwd: Problem with pdist
Date: Thu, 15 Jul 2010 14:43:17 -0300

I would like to make clear that mi initial problem of memory dissapear when i run the scripts directly from de console, without the frontend i used.
But is interesting the improvement the ind2sub has done.
 
Esteban 

---------- Forwarded message ----------
From: Esteban Cervetto <address@hidden>
Date: 2010/7/15
Subject: Fwd: Problem with pdist
To: address@hidden




---------- Forwarded message ----------
From: Jaroslav Hajek <address@hidden>
Date: 2010/7/15
Subject: Re: Problem with pdist
To: Esteban Cervetto <address@hidden>
Cc: address@hidden


On Wed, Jul 14, 2010 at 8:09 PM, Esteban Cervetto
<address@hidden> wrote:
> Hello:
>
> I am having problems with the pdist function. It seems not to do enough
> strong to calculate the next problem:
>
>
> octave:22> pdist(X,"mahalanobis")
>
> error: memory exhausted or requested size too large for range of Octave's
> index
>
> type -- trying to return to prompt
>
> octave:22> size(X)
>
> ans =
>
> 2 8993
>
> octave:23>
>
>
>
>
>
> However, Mahalanobis distance is usually used in datamining studies, and a
> sample of 8993 registers and two variables is small.
>
>
>
> It exist a form to improve dramatically the efficiency of octave
> with functions related with data mining or pattern classification?
>
>
>
> Links to similar topics are well received.
>
>
>

It seems the bottleneck here is Octave's nchoosek, which apparently
sucks for nchoosek(1:N, 2) where N is several thousand.
Hmmm. Perhaps something can be done there...

--
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

 
I am trying to improve nchoosek when n = 2. I used teh ind2sub function instead
 
n=4000;
x = [1:n]';
tic
nchoosek(1:n,2);
toc
tic
[r,c] = ind2sub ([n,n], 1:n.^2);
order = [r(r>c);c(r>c)];
toc
 
 
Results with n = 40
Elapsed time is 0.0157 seconds.
Elapsed time is -6.98e-009 seconds.
Results with n = 4000
Elapsed time is 290 seconds.
Elapsed time is 10.4 seconds.
 
Another thing is that I am using octave-3.2.3, but i am not using the pdist() of the package. The one is:
 
## Copyright (C) 2008  Francesco Potortì
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this software; see the file COPYING.  If not, see
## <http://www.gnu.org/licenses/>.
## -*- texinfo -*-
## @deftypefn {Function File} @var{y} = pdist (@var{x})
## @deftypefnx {Function File} @var{y} = pdist (@var{x}, @var{distfun})
## @deftypefnx {Function File} @var{y} = pdist (@var{x},
## @var{distfun}, @var{distfunarg}, @dots{})
##
## Return the distance between any two rows in @var{x}.
##
## @var{x} is the matrix representing @var{q} row vectors of
## size @var{d}.
##
## The output is a dissimilarity matrix arranged into a row vector
## @var{y},(n - 1) * (n / 2) long, where the distances are in the order
## [(1, 2) (1, 3) @dots{} (2, 3) @dots{} (n-1, n)].  You can use the
## @seealso{squareform} function to display the distances between the
## vectors arranged into an matrix.
##
## @var{distfun} is an optional argument specifying how the distance is
## computed. It can be any of the following ones, defaulting to
## "euclidean", or a user defined function that takes two arguments
## distfun @var{x} and @var{y} plus any number of optional arguments,
## where @var{x} is a row vector and and @var{y} is a matrix having the
## same number of columns as @var{x}.  @var{distfun} returns a column
## vector where row @var{i} is the distance between @var{x} and row
## @var{i} of @var{y}. Any additional arguments after the @var{distfun}
## are passed as distfun (@var{x}, @var{y}, @var{distfunarg1},
## @var{distfunarg2} @dots{}).
##
## Predefined distance functions are:
##
## @table @samp
## @item "euclidean"
## Euclidean distance (default).
##
## @item "seuclidean"
## Standardized Euclidean distance. Each coordinate in the sum of
## squares is inverse weighted by the sample variance of that
## coordinate.
##
## @item "mahalanobis"
## Mahalanobis distance: @seealso{mahalanobis}.
##
## @item "cityblock"
## City Block metric, aka Manhattan distance.
##
## @item "minkowski"
## Minkowski metric.  Accepts a numeric parameter @var{p}: for @var{p}=1
## this is the same as the cityblock metric, with @var{p}=2 (default) it
## is equal to the euclidean metric.
##
## @item "cosine"
## One minus the cosine of the included angle between rows, seen as
## vectors.
##
## @item "correlation"
## One minus the sample correlation between points (treated as
## sequences of values).
##
## @item "spearman"
## One minus the sample Spearman's rank correlation between
## observations, treated as sequences of values.
##
## @item "hamming"
## Hamming distance: the quote of the number of coordinates that differ.
##
## @item "jaccard"
## One minus the Jaccard coefficient, the quote of nonzero
## coordinates that differ.
##
## @item "chebychev"
## Chebychev distance: the maximum coordinate difference.
## @end table
## @seealso{linkage,squareform}
## @end deftypefn
## Author: Francesco Potortì
function y = pdist (x, distfun, varargin)
  if (nargin < 1)
    print_usage ();
  elseif ((nargin > 1)
          && ! ischar (distfun)
          && ! isa (distfun, "function_handle"))
    error (["pdist: the distance function must be either a string or a "
            "function handle."]);
  endif
  if (nargin < 2)
    distfun = "euclidean";
  endif
  if (! ismatrix (x) || isempty (x))
    error ("pdist: x must be a nonempty matrix");
  elseif (length (size (x)) > 2)
    error ("pdist: x must be 1 or 2 dimensional");
  endif
  y = [];
  if (ischar (distfun))
    order = nchoosek(1:rows(x),2);
    Xi = order(:,1);
    Yi = order(:,2);
    X = x';
    switch (distfun)
      case "euclidean"
        diff = X(:,Xi) - X(:,Yi);
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, "cols");
        else
          y = sqrt (sumsq (diff));
        endif
      case "seuclidean"
        diff = X(:,Xi) - X(:,Yi);
        weights = inv (diag (var (x)));
        y = sqrt (sum ((weights * diff) .* diff));
      case "mahalanobis"
        diff = X(:,Xi) - X(:,Yi);
        weights = inv (cov (x));
        y = sqrt (sum ((weights * diff) .* diff));
      case "cityblock"
        diff = X(:,Xi) - X(:,Yi);
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, 1, "cols");
        else
          y = sum (abs (diff));
        endif
      case "minkowski"
        diff = X(:,Xi) - X(:,Yi);
        p = 2;                  # default
        if (nargin > 2)
          p = varargin{1};      # explicitly assigned
        endif;
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, p, "cols");
        else
          y = (sum ((abs (diff)).^p)).^(1/p);
        endif
      case "cosine"
        prod = X(:,Xi) .* X(:,Yi);
        weights = sumsq (X(:,Xi)) .* sumsq (X(:,Yi));
        y = 1 - sum (prod) ./ sqrt (weights);
      case "correlation"
        corr = cor (X);
        y = 1 - corr (sub2ind (size (corr), Xi, Yi))';
      case "spearman"
        corr = spearman (X);
        y = 1 - corr (sub2ind (size (corr), Xi, Yi))';
      case "hamming"
        diff = logical (X(:,Xi) - X(:,Yi));
        y = sum (diff) / rows (X);
      case "jaccard"
        diff = logical (X(:,Xi) - X(:,Yi));
        weights = X(:,Xi) | X(:,Yi);
        y = sum (diff & weights) ./ sum (weights);
      case "chebychev"
        diff = X(:,Xi) - X(:,Yi);
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, Inf, "cols");
        else
          y = max (abs (diff));
        endif
    endswitch
  endif
  if (isempty (y))
    ## Distfun is a function handle or the name of an external function
    l = rows (x);
    y = zeros (1, nchoosek (l, 2));
    idx = 1;
    for ii = 1:l-1
      for jj = ii+1:l
        y(idx++) = feval (distfun, x(ii,:), x, varargin{:})(jj);
      endfor
    endfor
  endif
endfunction
 


reply via email to

[Prev in Thread] Current Thread [Next in Thread]