Fwd: Problem with pdist

---------- Forwarded message ----------
From: Esteban Cervetto <address@hidden>
Date: 2010/7/15
Subject: Fwd: Problem with pdist
To: address@hidden

---------- Forwarded message ----------
From: Jaroslav Hajek <address@hidden>
Date: 2010/7/15
Subject: Re: Problem with pdist
To: Esteban Cervetto <address@hidden>
Cc: address@hidden

On Wed, Jul 14, 2010 at 8:09 PM, Esteban Cervetto
<address@hidden> wrote:
> Hello:
>
> I am having problems with the pdist function. It seems not to do enough
> strong to calculate the next problem:
>
>
> octave:22> pdist(X,"mahalanobis")
>
> error: memory exhausted or requested size too large for range of Octave's
> index
>
> type -- trying to return to prompt
>
> octave:22> size(X)
>
> ans =
>
> 2 8993
>
> octave:23>
>
>
>
>
>
> However, Mahalanobis distance is usually used in datamining studies, and a
> sample of 8993 registers and two variables is small.
>
>
>
> It exist a form to improve dramatically the efficiency of octave
> with functions related with data mining or pattern classification?
>
>
>
> Links to similar topics are well received.
>
>
>

It seems the bottleneck here is Octave's nchoosek, which apparently
sucks for nchoosek(1:N, 2) where N is several thousand.
Hmmm. Perhaps something can be done there...

--
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

I am trying to improve nchoosek when n = 2. I used teh ind2sub function instead

n=4000;
x = [1:n]';
tic
nchoosek(1:n,2);
toc
tic
[r,c] = ind2sub ([n,n], 1:n.^2);
order = [r(r>c);c(r>c)];
toc

Results with n = 40

Elapsed time is 0.0157 seconds.
Elapsed time is -6.98e-009 seconds.

Results with n = 4000

Elapsed time is 290 seconds.
Elapsed time is 10.4 seconds.

Another thing is that I am using octave-3.2.3, but i am not using the pdist() of the package. The one is:

## Copyright (C) 2008 Francesco Potortì
##
## This program is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3, or (at your option)
## any later version.
##
## This program is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with this software; see the file COPYING. If not, see
## <http://www.gnu.org/licenses/>.

## -*- texinfo -*-
## @deftypefn {Function File} @var{y} = pdist (@var{x})
## @deftypefnx {Function File} @var{y} = pdist (@var{x}, @var{distfun})
## @deftypefnx {Function File} @var{y} = pdist (@var{x},
## @var{distfun}, @var{distfunarg}, @dots{})
##
## Return the distance between any two rows in @var{x}.
##
## @var{x} is the matrix representing @var{q} row vectors of
## size @var{d}.
##
## The output is a dissimilarity matrix arranged into a row vector
## @var{y},(n - 1) * (n / 2) long, where the distances are in the order
## [(1, 2) (1, 3) @dots{} (2, 3) @dots{} (n-1, n)]. You can use the
## @seealso{squareform} function to display the distances between the
## vectors arranged into an matrix.
##
## @var{distfun} is an optional argument specifying how the distance is
## computed. It can be any of the following ones, defaulting to
## "euclidean", or a user defined function that takes two arguments
## distfun @var{x} and @var{y} plus any number of optional arguments,
## where @var{x} is a row vector and and @var{y} is a matrix having the
## same number of columns as @var{x}. @var{distfun} returns a column
## vector where row @var{i} is the distance between @var{x} and row
## @var{i} of @var{y}. Any additional arguments after the @var{distfun}
## are passed as distfun (@var{x}, @var{y}, @var{distfunarg1},
## @var{distfunarg2} @dots{}).
##
## Predefined distance functions are:
##
## @table @samp
## @item "euclidean"
## Euclidean distance (default).
##
## @item "seuclidean"
## Standardized Euclidean distance. Each coordinate in the sum of
## squares is inverse weighted by the sample variance of that
## coordinate.
##
## @item "mahalanobis"
## Mahalanobis distance: @seealso{mahalanobis}.
##
## @item "cityblock"
## City Block metric, aka Manhattan distance.
##
## @item "minkowski"
## Minkowski metric. Accepts a numeric parameter @var{p}: for @var{p}=1
## this is the same as the cityblock metric, with @var{p}=2 (default) it
## is equal to the euclidean metric.
##
## @item "cosine"
## One minus the cosine of the included angle between rows, seen as
## vectors.
##
## @item "correlation"
## One minus the sample correlation between points (treated as
## sequences of values).
##
## @item "spearman"
## One minus the sample Spearman's rank correlation between
## observations, treated as sequences of values.
##
## @item "hamming"
## Hamming distance: the quote of the number of coordinates that differ.
##
## @item "jaccard"
## One minus the Jaccard coefficient, the quote of nonzero
## coordinates that differ.
##
## @item "chebychev"
## Chebychev distance: the maximum coordinate difference.
## @end table
## @seealso{linkage,squareform}
## @end deftypefn

## Author: Francesco Potortì

function y = pdist (x, distfun, varargin)

if (nargin < 1)
    print_usage ();
elseif ((nargin > 1)
          && ! ischar (distfun)
          && ! isa (distfun, "function_handle"))
    error (["pdist: the distance function must be either a string or a "
            "function handle."]);
endif

if (nargin < 2)
distfun = "euclidean";
endif

if (! ismatrix (x) || isempty (x))
error ("pdist: x must be a nonempty matrix");
elseif (length (size (x)) > 2)
error ("pdist: x must be 1 or 2 dimensional");
endif

y = [];
if (ischar (distfun))
    order = nchoosek(1:rows(x),2);
    Xi = order(:,1);
    Yi = order(:,2);
    X = x';
    switch (distfun)
      case "euclidean"
        diff = X(:,Xi) - X(:,Yi);
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, "cols");
        else
          y = sqrt (sumsq (diff));
        endif

      case "seuclidean"
        diff = X(:,Xi) - X(:,Yi);
        weights = inv (diag (var (x)));
        y = sqrt (sum ((weights * diff) .* diff));

      case "mahalanobis"
        diff = X(:,Xi) - X(:,Yi);
        weights = inv (cov (x));
        y = sqrt (sum ((weights * diff) .* diff));

      case "cityblock"
        diff = X(:,Xi) - X(:,Yi);
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, 1, "cols");
        else
          y = sum (abs (diff));
        endif

      case "minkowski"
        diff = X(:,Xi) - X(:,Yi);
        p = 2;                  # default
        if (nargin > 2)
          p = varargin{1};      # explicitly assigned
        endif;
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, p, "cols");
        else
          y = (sum ((abs (diff)).^p)).^(1/p);
        endif

      case "cosine"
        prod = X(:,Xi) .* X(:,Yi);
        weights = sumsq (X(:,Xi)) .* sumsq (X(:,Yi));
        y = 1 - sum (prod) ./ sqrt (weights);

      case "correlation"
        corr = cor (X);
        y = 1 - corr (sub2ind (size (corr), Xi, Yi))';

      case "spearman"
        corr = spearman (X);
        y = 1 - corr (sub2ind (size (corr), Xi, Yi))';

      case "hamming"
        diff = logical (X(:,Xi) - X(:,Yi));
        y = sum (diff) / rows (X);

      case "jaccard"
        diff = logical (X(:,Xi) - X(:,Yi));
        weights = X(:,Xi) | X(:,Yi);
        y = sum (diff & weights) ./ sum (weights);

      case "chebychev"
        diff = X(:,Xi) - X(:,Yi);
        if (str2num(version()(1:3)) > 3.1)
          y = norm (diff, Inf, "cols");
        else
          y = max (abs (diff));
        endif

endswitch
endif

if (isempty (y))
    ## Distfun is a function handle or the name of an external function
    l = rows (x);
    y = zeros (1, nchoosek (l, 2));
    idx = 1;
    for ii = 1:l-1
      for jj = ii+1:l
        y(idx++) = feval (distfun, x(ii,:), x, varargin{:})(jj);
      endfor
    endfor
endif

endfunction

From:	Esteban Cervetto
Subject:	Fwd: Problem with pdist
Date:	Thu, 15 Jul 2010 14:43:17 -0300