Partial Least Squaresで次元圧縮

はじめに

近頃の機械学習・画像処理だと数万次元の特徴量を扱うのが当たり前の感があります。それだと扱いにくいですし、本当に分離できているのか不安なので次元圧縮することがあります。たとえば２次元に圧縮すれば分離できているか視覚的に分かるので、特徴量が特徴を表しているかどうかの判断材料になります。

Partial Least Squares(PLS)

有名な次元圧縮法として主成分分析があげられます。他にもあるか調べてみると、PLSという手法の性能がいいとのことでした。
Human Detection Using Partial Least Squares Analysisという論文には、HOGや色で構成する17万次元を２次元にした絵があります。

試しに実装しましょう。

実装

Wikipediaにpseudo codeがのってたので、そのまま行列演算ライブラリーで書き直します。
理論が分からなくても実装できるよ(-.-;)
Partial least squares regression - Wikipedia, the free encyclopedia

void PLS::nipals(const MatrixXf& x0, const VectorXf& y, int *factors)
{
	// PLS1 NIPALS
	// http://en.wikipedia.org/wiki/Partial_least_squares_regression

	std::vector<VectorXf> ws;
	MatrixXf x = x0;

	// an initial estimate of w
	VectorXf w = x.transpose() * y;
	w.normalize();
	ws.push_back(w);

	VectorXf t = x * w;

	VectorXf p;
	VectorXf c;
	for (int k = 1; k < *factors; k++)
	{
		std::cout << k << std::endl;
		float tk = t.transpose() * t;
		t /= tk;
		p.noalias() = x.transpose() * t;

		float q = y.transpose() * t;
		//if (q < 0.00001) {
		//	*factors = k;
		//	break;
		//}

		x.noalias() -= tk * t * p.transpose();
		w.noalias() = x.transpose() * y;
		w.normalize();
		ws.push_back(w);

		t.noalias() = x * w;

		c.noalias() = t.transpose() * y / tk;
	}

	W_ = MatrixXf::Zero(w.rows(), ws.size());
	for (size_t i = 0; i < ws.size(); ++i)
	{
		W_.col(i) = ws[i];
	}
	std::cout << "derive completed" << std::endl;
}

テスト用のデータで手持ちに良いものがなかったので、LIBSVMのサイトからiris.scaleをダウロードします。３値分類なので2値分類になるように抜粋します。
あとこのデータは欠損を含むので、欠損データは除外してます。これでだいぶ手間取りました。
pythonでグラフかくとこんな感じ。4次元を2次元にしています。主成分分析との比較は想像にお任せします。おい。
f:id:wildpie:20141007233810p:plain
残りの断片コード

PLS::PLS(int num_samples = 0, int num_features = 0)
	: sample_num_(0)
{
	x_ = MatrixXf::Zero(num_samples, num_features);
	y_ = VectorXf::Zero(num_samples);
}

// xをzに変換
void PLS::compute(std::vector<float>& x, std::vector<float>& z) const
{
	const Map<VectorXf> vec_x(&x[0], x.size());
	Map<VectorXf> vec_z(&z[0], W_.cols());
	vec_z.noalias() = vec_x.transpose() * W_;
}

  // データ読み込み LIBSVM形式
 void PLS::loadDescriptors(const std::string& filename)
 {
 	std::ifstream ifs;
 	ifs.open(filename);
 	if (!ifs.is_open())
 		std::cerr << "Error in loadDescriptor: File Not Found" << std::endl;

 	// TODO :　特徴量の次元，サンプル数は事前に設定する必要がある
 	std::string word;
 	std::string line;
 	const int rows = x_.rows();
 	const int cols = x_.cols();
 	for (int y = 0; y < rows; y++) {
  		getline(ifs, line);
 		std::istringstream oss(line);
 		oss >> word;
 		y_(y) = static_cast<float>(atof(word.c_str()));
 
 		for (int x = 0; x < cols; x++)
		{
 			oss >> word;
			char *ctx;
 			char *p = strtok_s(const_cast<char*>(word.c_str()), ":", &ctx);  
   
 			p = strtok_s(NULL, ":", &ctx);  
 			x_(y, x) = static_cast<float>(atof(p));
 		}
 	}
 	std::cout << "Loading completed" << std::endl;
 }

学習データからWをもとめて、新規のデータxとWをかけ算することで次元圧縮しています。最近だとDeep Learningでやっちゃうんでしょうけど、Wさえ分かれば行列のかけ算１回で求まるから軽くてよさそう。