One might wonder why SiMath was conceived, since several good libraries are already available. SiMath was originally created within Silicos to give stand-alone programs access to a collection of existing and newly implemented tools for working with data matrices in the context of chemoinformatics. The library was designed to cover the steps involved in building a classifier or a predictive model. Such a procedure typically starts by preprocessing the data matrix; next, some form of feature selection is applied; then a model is trained; and finally the model is applied to new data. These steps are rarely all present in one library. For instance, libsvm is an excellent library for training a support vector machine, but it has no functionality for principal component analysis. PCA can be derived from an SVD, which is available in JAMA, but JAMA uses a template-based matrix representation while libsvm works with its own sparse matrix representation. Combining both libraries directly in one program is therefore a non-trivial task.
The goal of SiMath was to create a simple and consistent interface to existing tools that can easily be added to C++ applications. Each of the included models therefore has more or less the same set of methods. For instance, each model is initialised with a set of parameters and has a method to train it directly from a data matrix or vector. There are also
SiMath is only intended to work with real-valued matrices and vectors. Therefore, the Matrix and Vector classes only support double values. All included tools have been adapted so that they work directly with Matrix and Vector objects. By not using a template-based implementation, the design and implementation of the library's interface were kept simple and straightforward.
Where possible, it was also decided to work with classes and algorithms from the std namespace. A typical example is the Vector class, which holds a std::vector<double> as its internal data structure rather than a raw array of double. As such, operations on its elements can be passed efficiently to the standard algorithms, which decreases the probability of errors and memory leaks.
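The design choice described above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in named DoubleVector; it is not the SiMath::Vector class and its method names are assumptions made for illustration. It only shows how a double-only wrapper around std::vector<double> can delegate its work to the standard algorithms instead of hand-written loops over a raw array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a double-only vector class.
// This is NOT the SiMath::Vector API, only an illustration of the idea.
class DoubleVector {
public:
    DoubleVector(std::size_t n, double value) : data_(n, value) {}

    std::size_t size() const { return data_.size(); }
    double& operator[](std::size_t i) { return data_[i]; }
    double operator[](std::size_t i) const { return data_[i]; }

    // Sum of all elements, delegated to std::accumulate.
    double sum() const {
        return std::accumulate(data_.begin(), data_.end(), 0.0);
    }

    // Arithmetic mean; only defined for a non-empty vector.
    double mean() const {
        assert(!data_.empty());
        return sum() / static_cast<double>(data_.size());
    }

    // Largest element, delegated to std::max_element.
    double max() const {
        assert(!data_.empty());
        return *std::max_element(data_.begin(), data_.end());
    }

    // Divide every element by a scalar, delegated to std::transform.
    DoubleVector& operator/=(double d) {
        std::transform(data_.begin(), data_.end(), data_.begin(),
                       [d](double x) { return x / d; });
        return *this;
    }

private:
    // std::vector owns the memory, so no destructor or copy logic is needed.
    std::vector<double> data_;
};
```

Because std::vector manages the storage, the compiler-generated copy, assignment and destruction are already correct, which is where the reduced risk of memory leaks comes from.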
There is no class available to do the IO of matrices and vectors. This is deliberate: it gives the user of the library as much freedom as possible in the way the data are represented. Again, libsvm, for instance, imposes a strict representation of a sparse matrix, while TNT uses a completely different scheme to write a matrix.
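Since the library leaves IO to the user, an application has to supply its own reader. The following is a minimal sketch of one possible approach, assuming the data are stored as whitespace- or tab-separated numbers with one matrix row per line. The function name readTable and the file layout are assumptions for illustration; neither is part of SiMath.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of SiMath: parse whitespace-separated
// numbers, one matrix row per line, into a vector of rows.
std::vector<std::vector<double>> readTable(std::istream& in) {
    std::vector<std::vector<double>> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<double> row;
        double value;
        while (fields >> value) row.push_back(value);
        if (row.empty()) continue;  // skip blank lines

        // Sketch-level check that every row has the same number of columns;
        // a real reader would report the error to the caller instead.
        assert(rows.empty() || row.size() == rows.front().size());
        rows.push_back(row);
    }
    return rows;
}
```

The resulting rows can then be copied element by element into a matrix created with the required dimensions, using the A[i][j] element access shown in the example code.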
// example.cpp
#include <cmath>
#include <iostream>
#include <vector>
#include "SiMath/SiMath.h"

SiMath::Matrix A(n,m,0.0); // initialise an [nxm] matrix with zeros

// add some code to fill the matrix
for ( int i=0; i<n; ++i )
  for ( int j=0; j<m; ++j )
    A[i][j] = some_input_function(i,j);

// calibrate matrix by setting each column to have a zero mean and unit variance
// (the vectors are named mean and sd so that they do not clash with the dimension m)
SiMath::Vector mean = SiMath::meanOfAllColumns(A);
SiMath::Vector sd = SiMath::stDevOfAllColumns(A,mean);
columnNormalise(A,mean,sd);

// to do a PCA analysis, do a SVD and store the V matrix
SiMath::SVD svd(A,false,true);

// get the V matrix
SiMath::Matrix V = svd.getV();

// get the singular values and scale them by sqrt(n-1); this gives the
// standard deviation along each principal axis (its square is the variance)
SiMath::Vector sv = svd.getSingularValues();
sv /= std::sqrt(static_cast<double>(n-1));

// SV's are sorted, find those that are above a given threshold
unsigned int c = 0;
for ( ; c<sv.size(); ++c ) {
  if ( sv[c] < 0.1 ) break;
}
// c now holds index of first SV lower than threshold
// if all are above threshold c is equal to sv.size()

// approximate V with c best singular values
SiMath::Matrix Vc(m,c,0.0);
for ( int i=0; i<m; ++i )
  for ( unsigned int j=0; j<c; ++j )
    Vc[i][j] = V[i][j];

// do the rotation of A with the approximated V
SiMath::Matrix Arot = product(A,Vc);

// cluster the rotated data into 10 clusters with k-means clustering
SiMath::KMeansParameters p;
p.dimension = c;    // dimension of the data is c
p.nbrClusters = 10; // nbr of clusters

// initialise the model
SiMath::KMeans model(p);

// cluster the data and return the cluster labels for each data point
std::vector<unsigned int> labels = model.cluster(Arot);

// print out the labels
for ( int i=0; i<n; ++i )
  std::cout << i << "\t" << labels[i] << std::endl;
The first example reads in the data and normalises the columns to have zero mean and unit variance. This normalised data matrix is then used to train an SVM classifier. The predicted labels are printed.
The second example reads in the data and the number of clusters requested. It then normalises the columns to have zero mean and unit variance and transforms the data matrix using PCA. Finally, a k-means clustering is done using the requested number of clusters. The cluster numbers are printed in 3 columns corresponding to the three classes.
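Both examples start by normalising each column to zero mean and unit variance. As a library-independent illustration of that step, here is a minimal sketch that works on a plain vector of rows. The name standardiseColumns and the Table alias are hypothetical, and the use of the sample variance (division by n-1) is an assumption; the SiMath functions may differ in detail.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Table = std::vector<std::vector<double>>;  // rows of equal length

// Hypothetical helper, not part of SiMath: scale every column in place to
// zero mean and unit sample variance. Constant columns are only centred,
// which avoids a division by zero.
void standardiseColumns(Table& a) {
    const std::size_t n = a.size();
    assert(n > 1);  // the sample variance needs at least two rows
    const std::size_t m = a.front().size();

    for (std::size_t j = 0; j < m; ++j) {
        // column mean
        double mean = 0.0;
        for (std::size_t i = 0; i < n; ++i) mean += a[i][j];
        mean /= static_cast<double>(n);

        // column standard deviation, using n-1 in the denominator
        double ss = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            const double d = a[i][j] - mean;
            ss += d * d;
        }
        const double sd = std::sqrt(ss / static_cast<double>(n - 1));

        // centre and scale the column
        for (std::size_t i = 0; i < n; ++i) {
            a[i][j] -= mean;
            if (sd > 0.0) a[i][j] /= sd;
        }
    }
}
```

Centring the columns before the SVD matters for the PCA step: only for a centred matrix do the scaled singular values describe the spread of the data along the principal axes.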
To compile the code against the installed version of libSiMath, use commands like the following (assuming SiMath is installed in the default location):
# first example
$ g++ -I/usr/local/include -o example1 example1.cpp -L/usr/local/lib -lSiMath
$ ./example1 data.tab

# second example
$ g++ -I/usr/local/include -o example2 example2.cpp -L/usr/local/lib -lSiMath
$ ./example2 data.tab 3
$ ./configure
$ make
$ sudo make install
$ ./configure --prefix=/my/install/dir
To use SiMath in your application, add the following include statement to your code:
#include "SiMath/SiMath.h"
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY SILICOS NV AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SILICOS NV AND CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
1.3.4