Mission:
The mission of the Integrated Health Science Facility Core (IHSFC) is to assist investigators in extracting knowledge from complex data by applying data mining approaches to problems in today's biological systems, and to identify new issues in biomedical research. The IHSFC specializes in the analysis of “wide” data characteristic of multi-omics data sets, such as those produced by genomics, proteomics or metabolomics experiments. In these types of data sets, the number of features is much greater than the number of samples, a characteristic that poses special problems in analysis. For this purpose, the IHSFC has acquired training, software and tools for both supervised (classification) and unsupervised learning. A large emphasis is placed on feature reduction, with the intent to eliminate “noisy” features and retain those carrying the most information. The IHSFC will assist in selecting the most appropriate data analysis approach based on investigator need and type of data. Some of these approaches are described below.
Supervised learning (Classification):
The goal of supervised learning is to predict an outcome or class. Supervised classification is a specialized area of machine learning that uses computers to detect patterns and trends. In supervised learning, a complex “omics” data set is analyzed using various computational algorithms to determine the behavior of the data in relation to a particular event, or to identify the types/subtypes of disease present. A wide range of classifiers (supervised learning methods) is available, each with its own strengths and weaknesses. Classifier performance depends greatly on the characteristics of the data being analyzed, so determining a suitable classifier for a given problem is an important step in characterizing how a particular disease relates to a particular response. Some of the available classifiers are described below.
- Decision Trees: Decision trees are a machine learning tool that produces a series of choices leading to a prediction. This method uses the concept of “information gain” to make a split on the attribute or variable that most accurately separates the data. Decision trees are very tolerant of “noisy” or absent data, and the resulting analysis is very easy to interpret. Trees are drawn as nodes and leaves; an example is shown at right.
- Boosting: Boosting is a procedure that combines the outputs of many weak classifiers to produce a more powerful classifier. Frequently described as a vote by “committee”, boosting applies the weak classification algorithm to repeatedly modified versions of the data, producing a group of classifiers. The final prediction is made by a majority vote of the group, typically weighted by the accuracy of each classifier.
- Nearest Shrunken Centroids (NSC): NSC is a dimension reduction and classification technique developed primarily for gene expression data. A centroid is a vector representing a class, typically the mean expression of each gene across the samples in that class. In NSC, the class centroids are shrunken toward the overall centroid by a user-specified threshold. This has the added benefit of feature reduction, leaving only a small number of genes that best describe each class. Next, the gene expression profile of an unknown sample is compared to the shrunken centroid of each class, and the sample is assigned to the class whose centroid is closest.
- Lasso: The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients.
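The “information gain” criterion mentioned under Decision Trees can be illustrated with a minimal sketch. It assumes gain is measured as the reduction in Shannon entropy, which is one common definition; the function names, the expression values and the class labels below are hypothetical and are not taken from any IHSFC tool or data set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting samples at value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Hypothetical expression values for one gene across six samples
expression = [0.2, 0.4, 0.5, 1.8, 2.1, 2.5]
classes = ["control", "control", "control", "disease", "disease", "disease"]

gain_clean = information_gain(expression, classes, threshold=1.0)
gain_poor = information_gain(expression, classes, threshold=0.3)
print("gain at 1.0:", round(gain_clean, 3))
print("gain at 0.3:", round(gain_poor, 3))
```

A tree-building algorithm would evaluate many candidate thresholds across many features and keep the split with the largest gain; here the threshold of 1.0 separates the two classes perfectly and therefore scores higher than the threshold of 0.3.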
Unsupervised learning:
Unsupervised learning is used to identify patterns or differences existing in the data without reference to a known outcome or class. It may reveal distinctive patterns that are representative of a particular event dominating the data. It gives an understanding of what information the gene expression data contain and indicates whether and how further analysis can be done. A few of the available techniques are described below.
- Hierarchical clustering: Hierarchical clustering identifies natural groups in a data set. It is known as a “bottom-up” approach, which starts by assigning each item to its own cluster. The method then calculates a distance metric between the index item and the remaining items in the data set. The item closest to the index is then added, forming a new group. The distances are then re-computed between the new group and the remaining data, and a new cluster is formed. This is repeated until all the data are clustered together. The length of the dendrogram branches connecting the items indicates the degree of similarity between the groups. An example of a heat map is shown.
- Self-organizing maps: A self-organizing map is a type of clustering algorithm based on neural networks. The algorithm produces a trellis profile chart in which similar records appear close to each other and less similar records appear farther apart. From this map it is possible to visually investigate how records are related.
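The bottom-up merging described under Hierarchical clustering can be sketched as follows. This is an illustration under stated assumptions, not the procedure used by any particular package: it assumes Euclidean distance and single linkage (the distance between two clusters is the smallest distance between any of their members), and the function names and sample profiles are hypothetical.

```python
import math
from itertools import combinations


def single_linkage(cluster_a, cluster_b, points):
    """Cluster distance = smallest pairwise Euclidean distance."""
    return min(math.dist(points[i], points[j])
               for i in cluster_a for j in cluster_b)


def agglomerative(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(i,) for i in range(len(points))]  # each item starts alone
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], points))
        height = single_linkage(a, b, points)
        # Replace the two merged clusters with their union
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, height))
    return merges


# Hypothetical two-feature profiles for five samples
profiles = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
merges = agglomerative(profiles)
for a, b, height in merges:
    print(a, "+", b, "at distance", round(height, 3))
```

Each recorded merge distance corresponds to the height at which two branches join in a dendrogram, so items that merge early (at small distances) are the most similar.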

Tools/Software:
Below are some of the software tools available:
- Spotfire
- R – (http://www.bioconductor.org)
- Weka – (http://www.cs.waikato.ac.nz/~ml/weka)
- CART (Classification and Regression Trees) (http://www.salford-systems.com/)
- Matlab (http://www.mathworks.com/)
Contact information if you have a project or data to analyze: