Page d'accueil du laboratoire > Équipe de Probabilités > Page principale > Statistique, Distributions de Kolmogorov

Distributions de Kolmogorov

Il est terriblement décevant de ne pas voir implémenté de manière standard les distributions de Kolmogorov dans des logiciels de calcul aussi sérieux que Maple, Scilab, Mat... (et non, je viens de vérifier sur le web, c'est fait !). Il faut dire que ça paraît compliqué de prime abord, surtout si on n'a jamais eu la curiosité d'approfondir un peu le test de Kolmogorov--Smirnov, et qu'on n'a pas eu la chance de trouver l'article de George Marsaglia and Wai Wan Tsang, « Evaluating Kolmogorov's Distribution ». Avec cet article, tout paraît facile (modulo la lecture des travaux de Dudley), même le code C dont l'apreté peut refroidir le débutant. On trouvera ci-dessous des définition des commandes

kolmocorovcdf(x, n) et kolmocorovicdf(p, n)

(cdf pour cumulative distribution function et icdf pour inverse cumulative distribution function comme il se doit) pour les langages

Scilab, Maple, C, MetaPost,

(et TeX ?). Des copier-coller devraient suffire ou presque pour ajouter à ses macros usuelles les fonctions de répartition des lois de Kolmogorov (le paramètre n en seconde position est la taille de l'échantillon) et ainsi construire sa petite procédure de test de Kolmogorov--Smirnov « maison ». On rappelle à cet effet que la statistique du test est

ks = max_{i = 1,...,n} { ( F(x_(i)) - (i-1)/n ) v ( i/n - F(x_(i)) ) }

où F est la fonction de répartition cible (qui est supposée continue, et c'est la seule restriction de ce test), a v b = max(a, b), et l'échantillon x, de taille n, est supposé trié dans l'ordre croissant. Si

1 - kolmogorovcdf(ks, n) < α,

où α est le seuil du test qu'on a choisi, alors on rejette l'hypothèse selon laquelle l'échantillon a pu être obtenu à partir de la loi dont F est la fonction de répartition. (Et si avec tout ça, il en a encore qui rechigne à faire des tests de Kolmogorov--Smirnov...)

Remarque. --- Le code de George Marsaglia and Wai Wan Tsang utilise le calcul asymptotique uniquement si le calcul exact n'apporte rien de mieux sur les 7 premières décimales (...). Pour inverser la fonction de répartition, j'ai juste écrit une petite dichotomie, ce qui, comparé au code des auteurs, est une horreur.

Code Scilab

// Distributions de Kolmogorov (Scilab)
// d'apr\`es George Marsaglia and Wai Wan Tsang,
// ``Evaluating Kolmogorov's Distribution''

// puissance de matrices avec conservation d'exposants

function [V, eV] = mPower(A, eA, n);
	if n == 1 then V = A; eV = eA; return;
	else
		[V, eV] = mPower(A, eA, floor(n/2));
		B = V*V; eB = 2*eV;
		if modulo(n, 2) == 0 then
			V = B; eV = eB; return;
		else
			V=A*B; eV = eA; return;
		end
	end
endfunction

function p = kolmogorovcdf(d, n)
//	local k m eH eQ i j k // entiers
//	      h s H eH Q eQ; // doubles
// Supprimer les prochaines lignes pour une
// pr\'ecision sup\'erieure \`a  7 d\'ecimales
	s = d*d*n;
	if s > 7.24 | (s > 3.76 & n > 99)
		p = 1-2*exp(-(2.000071+.331/sqrt(n)+1.409/n)*s);
		return;
	end
	k = floor((n*d)+1); m = 2*k-1; h = k-n*d;
	H = ones(m, m); Q = zeros(m, m);
	for i = 1:m;
		for j = 1:m;
			if (i+1 < j) then H(i,j) = 0;
			end
		end
	end
	for i = 1:m;
		// le calcul des puissances $h^n$ pourrait \^etre am\'elior\'e
		// au niveau du temps. La question se pose au niveau
		// de la pr\'ecision...
		H(i,1) = H(i,1)-h^i;
		H(m,i) = H(m, i)-h^(m-i+1);
	end
	if 2*h-1 > 0 then H(m,1) = H(m,1)+(2*h-1)^m; end
	for i = 1:m;
		for j = 1:m;
			if i-j+1 > 0 then
				for g = 1:(i-j+1);
					H(i,j) = H(i,j)/g;
				end
			end
		end
	end
	eH = 0; [Q, eQ] = mPower(H, eH, n);
	p = Q(k,k);
	for i = 1:n;
		p = p*i/n;
		if p < 1e-140 then p = p*1e140; eQ = eQ-140; end
	end
	p = p*(10^eQ);
endfunction

function x = kolmogorovicdf(p, n)
	global accuracy;
//	local a b;// doubles
	if p > 1-accuracy then x = 1; return;
	elseif p < accuracy then x = 0; return;
	else
		a = 0; b = 1;
		while b-a >= accuracy;
			if kolmogorovcdf((a+b)/2, n) < p then
				a = (a+b)/2;
			else b = (a+b)/2;
			end
		end
		x = (a+b)/2; return;
	end
endfunction

// la fonction suivante calcule la statistique de Kolmogorov--Smirnov
// lorsque la fonction de r\'epartition cible est donn\'ee par $F$.
// syntaxe : KSstat(<\'echantillon>, <fonction de r\'epartition>)
// exemple :
// --> deff("y = F(x)", "y = x");
// --> k := KSstat(u, F);
// --> p := 1-kolmogorovcdf(k, n);

function k = KSstat(u, F)
	// local k v n i;
	k = 0; n = size(u,1);
	v = -gsort(-u);// sort() est d\'epr\'ci\'e (attention \`a l'ordre)
	for i = 1:n; k = max(k, F(v(i))-(i-1)/n, i/n-F(v(i))); end
endfunction

Code Maple

# Distributions de Kolmogorov (Maple)
# d'apr\`es George Marsaglia and Wai Wan Tsang,
# ``Evaluating Kolmogorov's Distribution''

with(LinearAlgebra):

# puissance de matrices avec conservation d'exposants

accuracy := 1e-8:

mPower := proc(A, eA, m, n)
	local B, eB;# double, int
	B := Matrix(m);
	if n = 1 then return A, eA;
	else
		(B, eB) := mPower(A, eA, m, floor(n/2));
		B := B.B; eB := 2*eB;
		if modp(n, 2) = 0 then return B, eB;
		else return A.B, eA;
		end if;
	end if;
end proc:

kolmogorovcdf := proc(d, n)
	local k, m, eH, eQ, i, j, g,# int
		p, h, s, H, Q;# double
	# Supprimer les prochaines lignes pour une
	# pr\'ecision sup\'erieure \`a 7 d\'ecimales
	s := d*d*n;
	if (s > 7.24 or (s > 3.76 and n > 99)) then
		return evalf(1-2*exp(-(2.000071+.331/sqrt(n)+1.409/n)*s));
	end if;
	k := floor((n*d)+1); m := 2*k-1; h := k-n*d;
	H := Matrix(m); Q := Matrix(m);
	for i from 1 to m do
		for j from 1 to m do
			if (i+1 < j) then H[i,j] := 0;
			else H[i,j] := 1;
			end if;
		end do;
	end do;
	for i from 1 to m do
	# le calcul des puissances $h^n$ pourrait \^etre am\'elior\'e
	# au niveau du temps. La question se pose au niveau
	# de la pr\'ecision...
		H[i,1] := H[i,1]-h^i;
		H[m,i] := H[m, i]-h^(m-i+1);
	end do;
	if 2*h-1 > 0 then H[m,1] := H[m,1]+(2*h-1)^m; end if;
	for i from 1 to m do
		for j from 1 to m do
			if i-j+1 > 0 then
				for g from 1 to (i-j+1) do
					H[i,j] := H[i,j]/g;
				end do;
			end if;
		end do;
	end do;
	eH := 0; (Q, eQ) := mPower(H, eH, m, n); p := Q[k,k];
	for i from 1 to n do
		p := p*i/n;
		if p < 1e-140 then
			p := p*1e140; eQ := eQ-140;
		end if;
	end do;
	return p*(10^eQ);
end proc:

kolmogorovicdf := proc(p, n)
	global accuracy;
	local a, b;# double
	if p > 1-accuracy then return 1;
	elif p < accuracy then return 0;
	else a := 0; b := 1;
		while b-a >= accuracy do
			if kolmogorovcdf((a+b)/2, n) < p then
			a := (a+b)/2;
			else b := (a+b)/2;
			end if;
		end do;
	return evalf((a+b)/2);
	end if;
end proc:

# la fonction suivante calcule la statistique de Kolmogorov--Smirnov
# lorsque la fonction de r\'epartition cible est donn\'ee par $F$.
# syntaxe : KSstat(<\'echantillon>, <fonction de r\'epartition>)
# exemple : F := x -> x; k := KSstat(u, F); p := 1-kolmogorovcdf(k, n);

KSstat := proc(u, F)
	local k, v, n, i;
	k := 0; n := nops(u); v := sort(u);
	for i from 1 to n do
	k := evalf(max(k, max(F(v[i])-(i-1)/n, i/n-F(v[i]))));
	end do;
	return k;
end proc:

Code C

/* test-kolmogorov.h
   d'apr\`es George Marsaglia and Wai Wan Tsang,
   ``Evaluating Kolmogorov's Distribution'' */

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

double accuracy = 1e-12;

/* Les matrices sont trait\'ees comme des colonnes
   unidimensionnelles de flottants en double pr\'ecision. */
 
void mMultiply(double *A, double *B, double *C, int m){
  int i, j, k;
  double s;
  for (i = 0; i < m; i++)
    for (j = 0; j < m; j++)
      {s = 0.; /* initialisation d'un flottant */
      for (k = 0; k < m; k++) s += A[i*m+k]*B[k*m+j];
      C[i*m+j] = s;}
}

void mPower(double *A, int eA, double *V, int *eV, int m, int n){
  double *B;
  int eB, i;
  if (n == 1) {for (i = 0; i < m*m; i++) V[i] = A[i]; *eV = eA; return;}
  mPower(A, eA, V, eV, m, n/2); /* n/2 est en fait \'egal \`a floor(n/2) */
  B = (double*)malloc((m*m)*sizeof(double));
  mMultiply(V, V, B, m);
  eB = 2*(*eV);
  if (n % 2 == 0)
    {for (i = 0; i < m*m; i++) V[i] = B[i];
    *eV = eB;}
  else
    {mMultiply(A, B, V, m);
    *eV = eA+eB;}
  if (V[(m/2)*m+(m/2)] > 1e140)
    {for (i = 0; i < m*m; i++) V[i] = V[i]*1e-140;
    *eV += 140;}
  free(B);
}

double kolmogorovcdf(double d, int n){
  int k, m, i, j, g, eH, eQ;
  double h, s, *H, *Q;
  /* Supprimer les 3 prochaines lignes pour une
     pr\'ecision sup\'erieure \à 7 d\'ecimales */
  s = d*d*n;
  if (s > 7.24 || (s > 3.76 && n > 99))
    return 1-2*exp(-(2.000071+.331/sqrt(n)+1.409/n)*s);
  k = (int)(n*d)+1; m = 2*k-1; h = k-n*d;
  H = (double*)malloc((m*m)*sizeof(double));
  Q = (double*)malloc((m*m)*sizeof(double));
  for (i = 0; i < m; i++)
    for (j = 0; j < m; j++)
      if (i-j+1 < 0) H[i*m+j] = 0;
      else H[i*m+j] = 1;
  for (i = 0; i < m; i++)
    {H[i*m] -= pow(h, i+1);
    H[(m-1)*m+i] -= pow(h, (m-i));}
  H[(m-1)*m] += (2*h-1 > 0 ? pow(2*h-1, m) : 0);
  for (i = 0; i < m; i++)
    for (j = 0; j < m; j++)
      if (i-j+1 > 0) for (g = 1; g <= i-j+1; g++) H[i*m+j] /= g;
  eH = 0; mPower(H, eH, Q, &eQ, m, n);
  s = Q[(k-1)*m+k-1];
  for (i = 1; i <= n; i++)
    {s = s*i/n; if (s < 1e-140) {s *= 1e140; eQ -= 140;}}
  s *= pow(10., eQ); /* attention au flottant 10.*/
  free(H);
  free(Q);
  return s;
}

double kolmogorovicdf(double alpha, int n){
  double a, b;
  if (alpha > 1-accuracy) return 1;
  else if (alpha < accuracy) return 0;
  else
    {
      a = 0; b = 1;
      while(b-a >= accuracy)
	{if (kolmogorovcdf((a+b)/2, n) < alpha)
	  a = (a+b)/2;
	else b = (a+b)/2;
	}
      return (a+b)/2;
    }
}

/* Trial
int main(){
  int n = 10;
  double d = .274;
  printf("kolmogorovcdf(%1.3f, %d) = %f\n", d, n, kolmogorovcdf(d, n));
 return 0;
}
*/

Code MetaPost

% test-kolmogorov.mp 
% d'apr\`es George Marsaglia and Wai Wan Tsang,
% ``Evaluating Kolmogorov's Distribution''

newinternal accuracy; accuracy := 0.00001;

% Les matrices sont trait\'ees comme des colonnes
% unidimensionnelles de ``pair'' (flottants).

fbasis := 64;

vardef reducefloat expr z =
  if xpart z = 0: (0,0)
  else:
    save x, y; x = xpart z; y = ypart z;
%    if floor y <> y: x := x*(fbasis**(y-floor y)); y := floor y; fi
    forever:
      if abs x >= fbasis: x := x/fbasis; y := y+1; fi
      if abs x < 1: x := x*fbasis; y := y-1; fi
      exitif (abs x >= 1) and (abs x < fbasis);
    endfor
    (x, y)
  fi
enddef;

tertiarydef x floatPlus y =
  if ypart x-10 > ypart y: x
  elseif ypart y-10 > ypart x: y
  else:
    begingroup
      save m;
      m = max(ypart x, ypart y);
      reducefloat(xpart x*(fbasis**(ypart x-m))
	+xpart y*(fbasis**(ypart y-m)), m)
    endgroup
  fi
enddef;

tertiarydef x floatMinus y =
  if ypart x-10 > ypart y: x
  elseif ypart y-10 > ypart x: (-xpart y, ypart y)
  else:
    begingroup
      save m;
      m = max(ypart x, ypart y);
      reducefloat(xpart x*(fbasis**(ypart x-m))
	-xpart y*(fbasis**(ypart y-m)), m)
    endgroup
  fi
enddef;

secondarydef x floatTimes y =
  reducefloat(xpart x*xpart y, ypart x+ypart y)
enddef;

vardef floatPower(expr x, n) =
  (1, 0) for i = 1 upto n: floatTimes x endfor
enddef;

vardef exp primary x = mexp(256x) enddef;

def mMultiply(suffix A, B, C)(expr m)=
  begingroup save s; pair s;
    for i = 0 upto m-1:
      for j = 0 upto m-1:
	s := (0, 0);
	for k = 0 upto m-1:
	  s := s floatPlus (A[i*m+k] floatTimes B[k*m+j]);
	endfor
	C[i*m+j] := s;
      endfor
    endfor
  endgroup
enddef;

def mPower(suffix A, V)(expr m, n) =
  if n = 1:
    for i = 0 upto m*m-1: V[i] := A[i]; endfor
  else:
    begingroup save B, eB;
      mPower(A, V, m, floor(n/2));
      pair B[];
      mMultiply(V, V, B, m);
      if n mod 2 = 0:
	for i = 0 upto m*m-1: V[i] := B[i]; endfor
      else:
	mMultiply(A, B, V, m);
      fi
    endgroup
  fi
enddef;

vardef kolmogorovcdf(expr d, n) =
  save k, m, eH, eQ, h, s, H, Q;
  % Omit 3 next lines if you require > 7 digits accuracy on the right tail
  s = d*d*n;
  if (s > 7.24) or ((s > 3.76) and (n > 99)):
    message "asymptotic kolmogorovcdf";
    1-2*exp(-(2.000071+.331/sqrt(n)+1.409/n)*s)
  else:
    k = floor(n*d)+1; m = 2*k-1; h = k-n*d;
    pair H[], Q[];
    for i = 0 upto m-1:
      for j = 0 upto m-1:
	if i-j+1 < 0: H[i*m+j] = (0, 0); else: H[i*m+j] = (1, 0); fi
      endfor
    endfor
    pair hi, hmi;
    hi := reducefloat(h, 0); hmi := floatPower((h,0), m);
    for i = 0 upto m-1:
      H[i*m] := H[i*m] floatMinus hi;
      H[(m-1)*m+i] := H[(m-1)*m+i] floatMinus hmi;
      hi := hi floatTimes (h,0);
      hmi := hmi floatTimes (1/h,0);
    endfor
    if 2*h-1 > 0: H[(m-1)*m] := H[(m-1)*m] floatPlus ((2*h-1)**m,0); fi
    for i = 0 upto m-1:
      for j = 0 upto m-1:
	if i-j+1 > 0:
	  for g = 1 upto i-j+1:
	    H[i*m+j] := reducefloat(xpart H[i*m+j]/g, ypart H[i*m+j]);
	  endfor
	fi
      endfor
    endfor
    mPower(H, Q, m, n);
    pair s;
    s := Q[(k-1)*m+k-1];
    for i = 1 upto n:
      s := s floatTimes (i/n, 0);
    endfor
    xpart s
%    if ypart s > 0: for i = 1 upto ypart s: * fbasis endfor
%    elseif ypart s < 0:
    for i = 1 upto -ypart s: /fbasis endfor
%  fi
  fi
enddef;

vardef kolmogorovicdf(p, n) =
  save a, b;
  if p > 1-accuracy: 1
  elseif p < accuracy: 0
  else
    a = 0; b = 1;
    forever:
      exitif b-a < accuracy;
      if kolmogorovcdf((a+b)/2, n) < p:
	a = (a+b)/2;
      else: b = (a+b)/2;
      fi
    endfor
    (a+b)/2
  fi
enddef;

%%
%% trial
%%
%n := 10; d:=.274;
%%for d = 0 step 0.01 until 1:
%message "kolmogorovcdf(" & decimal d &", "
%  & decimal n & ") = " & decimal(kolmogorovcdf(d, n));
%%endfor
%end.

Anthony Phan,
Laboratoire de Mathématiques et Applications
Boulevard Marie et Pierre Curie, Téléport 2,
BP 30179, F-86962 Chasseneuil-Futuroscope cedex
Tél : 05.49.49.69.00
Fax : 05.49.49.69.01