Saturday, 6 April 2013

pr.probability - Is any bias introduced from initial clustering

I've looked at this for PCA-based clustering.



alt text



Here is some PC clustering of a simulated dataset. We first draw the clusters as red xs on PCs computed from the entire dataset. A few random points (these are marked blue) are removed from the dataset and the PCs are computed on reduced data. The blue data are then plotted against the reduced PCs and colored green.



alt text



I don't have a rigorous bound (I would really like some references) but from what I can tell the most drastic changes are in directions that involve only noise.



Here is some matlab code:



clear all;close all;clc;

%create garbage SNP-style data
m = 500;
n = 25000;
M = zeros(m,n);

%create 2 populations
%one has 0s in the interesting columns
%the other has 2s
interesting_variables = randsample(1:n, .1*n);
M(1:2:m, interesting_variables) = 2.0;

%add sampling noise
sigma = 4.0;
M = M + double(sigma*randn(m,n) > ones(m,n));

%center data
for j=1:n
M(:,j) = M(:,j) - mean(M(:,j));
end

% plot PCA based clusters, Tygert's code is faster.
tic
[~,~,V] = svds(M,2);
toc


coordsx = zeros(1,m);
coordsy = zeros(1,m);
for i=1:m
coordsx(i) = M(i,:)*V(:,1);
coordsy(i) = M(i,:)*V(:,2);
end

figure()
title('strcuture')
plot(coordsx, coordsy, 'rx')


%%
% Now pick some guys from the dataset to monitor
% identify them as blue triangles in the plot
%

some = m/5;
guys = randsample(1:m, some);

hold on;
coordsx_blue = zeros(1,some);
coordsy_blue = zeros(1,some);
for i=1:some
coordsx_blue(i) = M(guys(i),:)*V(:,1);
coordsy_blue(i) = M(guys(i),:)*V(:,2);
end
plot(coordsx_blue, coordsy_blue, 'b<');


%%
% Finally, run PCA again but without the blue guys.
% Observe how this changes the clustering when we plot the blue
% against the PCs found from the reduced data
% green is the new blue.
%

M_small = M;
M_small(guys, :) = [];
[ms ns] = size(M_small);

tic
[~,~,V_small] = svds(M_small,2);
toc

coordsx_newblue = zeros(1,some);
coordsy_newblue = zeros(1,some);
for i=1:some
coordsx_newblue(i) = M(guys(i),:)*V_small(:,1);
coordsy_newblue(i) = M(guys(i),:)*V_small(:,2);
end
plot(coordsx_newblue, coordsy_newblue, 'gh');


% draw some lines to see how things moved
for i=1:some
line([coordsx_blue(i) coordsx_newblue(i)], [coordsy_blue(i) coordsy_newblue(i)],'Color','c');
end

No comments:

Post a Comment