bug-gsl
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

gsl_multifit_linear: chisq is not residual sum of squares when X is not


From: Lijun WANG
Subject: gsl_multifit_linear: chisq is not residual sum of squares when X is not full column rank
Date: Sun, 4 Sep 2022 22:34:27 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

Hi,

Recently, I am using `gsl_multifit_linear` to regress a response vector y on a design matrix X. I found that `chisq` is not the sum of squares of the residuals when X is not full column rank. In other words, it is different from the residual sum of squares (RSS) calculated based on the returned coefficients, gif.latex (81×22).

In the related source code, "gsl-2.7/multifit/linear_common.c", on line 192, it assigns the norm calculated on line 118. However, if X is not full column rank, the modified SVD sets alpha = 0 on line 172 to discard the zero singular value, RSS is no longer the one || y - U U^T y || commented on line 113. Instead, it should be || y - U_* U_*^T y ||, where U_* is constructed by U after discarding the columns corresponding to zero singular values.

If you are familiar with R language, I can illustrate the idea with a toy example,

    > set.seed(1234)
    > x = matrix(rnorm(10*2), 10, 2)
    > y = x[,1] + rnorm(10) * 0.1
    > x2 = cbind(x, 0)
    > X = cbind(1, x2)
    > sv = svd(X)
    > # chisq in gsl_linear_multifit
    > norm(y - sv$u %*% t(sv$u) %*% y, type = "2")^2
   [1] 0.02189036
    > # residual sum of squares since there is a zero singular value
    > norm(y - sv$u[,1:3] %*% t(sv$u[,1:3]) %*% y, type = "2")^2
   [1] 0.02193989
    > # calculate via R
    > sum(lm(y ~ x2)$residuals^2)
   [1] 0.02193989


which shows that the current gsl_linear_multifit returns a different RSS compared to R language (0.02189036 vs 0.02193989), but the results (0.02193989 vs 0.02193989) agree after discarding the columns of zero singular values.

So `rho_ls` on line 192 should be recalculated after discarding the columns of U for zero singular values. Or a more direct way is to calculate `rho_ls` with the estimated coefficient `c`,

   /// Line 191

   gsl_vector_memcpy(&t.vector, y);
                    gsl_blas_dgemv(CblasNoTrans, -1.0, X, c, 1.0,
   &t.vector);
                    rho_ls = gsl_blas_dnrm2(&t.vector);

   /// Line 192

What do you think?

Best,
Lijun


reply via email to

[Prev in Thread] Current Thread [Next in Thread]