Fionn Murtagh's
Multivariate
Data Analysis
Software and Resources
Page
Contents
- Hierarchical Clustering Software in R/S-Plus
- MDA-J: Multivariate Data Analysis - Java
- Multivariate Data Analysis Software as Standalone Java Applications
- Gaussian Mixture Modeling with Bayes Factors in C
- Wavelet Transform on a Hierarchy or Dendrogram
- Multivariate Data Analysis Software in Fortran (and C)
- Resources and Links, including: JP Benzécri's Pascal Code
for Correspondence Analysis; Multidimensional Scaling;
Point Pattern Matching; book F. Murtagh, Multidimensional
Clustering Algorithms, Physica-Verlag, 1985;
Reading FITS Files in R.
- Software accompanying
the book:
Correspondence Analysis and Data Coding with R and Java
Legal notice: all software here can be freely used and incorporated
into any system, for
any purpose. It is required that the origin and authorship of any code
taken from here is acknowledged in code and other documentation.
1. Hierarchical Clustering Software in R/S-Plus
What is unique here: (1) All hierarchical clustering programs achieve
the optimal O(n2) computational bound using the nearest neighbors
chain algorithm. (2) The "stored dissimilarity" algorithm is used,
implying O(n2) storage (the "stored data" algorithm is an
alternative, with O(n) storage, but greater absolute computational
requirement).
(3) For hierarchical clustering, both
native R code and linked C code (crucial for efficiency, with large
data sets);
implementations are identical.
(4) Weighting of cases/observations (rows) supported. (5) Easy and
straightforward linkage with correspondence analysis to normalize (by
"Euclideanizing") the data input to the hierarchical clustering. (6)
The Ward minimum variance criterion is used in the software here. (See note
below.)
-
"Readme": Background details, and
examples of use on Intel Windows and Unix (including Mac OS X) systems.
- All
functions in one (standalone) R/S-Plus file
(call in R as function hierclust).
- Hierarchical clustering in C, used in R environment.
- Normalizing input data
- Correspondence analysis area
(including lots of data sets).
- The hierarchical clustering is identical to what has been listed in
this section.
- The correspondence analysis is used in R to "Euclideanize"
the data (e.g. frequencies of occurrences, ranks, qualitatitive, or
mixed qualitative/quantititive data types).
- In correspondence analysis
such data (both rows/objects and columns/variables)
is mapped into a constant and identically weighted coordinate system.
- Note 1 on other agglomerative criteria:
It is a
very easy matter to allow for any of the other Lance-Williams criteria.
Note in particular that the computational time of the complete link method
using this NN-chain algorithm is O(n2). (Quite a number of
published papers erroneously claim a higher computational requirement.)
- Note 2 on timing experiments with my hclust program available
in the official
R distribution: this program additionally finds cluster assignments,
and that part of the processing is O(n3), which dominates.
- Note 3: for background see F. Murtagh, Multidimensional Clustering
Algorithms, Physica-Verlag, 1985. Scanned version of this book
available (see below).
2. MDA-J: Multivariate Data Analysis - Java
- A new area has been set up for this code, which has its own address: see
www.correspondances.info.
- For background see F. Murtagh, Correspondence Analysis and Data Coding
with R and Java, Chapman & Hall/CRC Press, 2005.
3. Multivariate Data Analysis Software as Individual Java Applications
- [Doc:]
Documentation on all programs which follow.
- [Sample data:]
Fisher iris data, iris.dat.
Format:
row and column numbers (integer), followed by sequence of
matrix data values (floating), read row-wise. The output obtained is
available in each case below.
- [PCA:]
Principal Components Analysis on correlations,
PCAcorr.java,
which requires the JAMA class library. (See below for
a link to JAMA.)
To run:
(i) Assume JAMA class library is in subdirectory Jama in
current directory.
(ii) Compile: javac PCAcorr.java
General note: if your classpath variable is not set, you can run
"javac" or "java" as: "javac -classpath ." and "java -classpath .".
(iii) Get syntax to standard output: java PCAcorr
(iv) Run on iris.dat: java PCAcorr iris.dat >
pcaoutput.txt.
-
Hierarchical clustering - background reading:
A small text, but one with everything necessary on the most effective
hierarchical clustering algorithms (i.e., using nearest neighbor chains,
and/or reciprocal or mutual nearest neighbors) is F. Murtagh,
Multidimensional Clustering Algorithms, Physica-Verlag, 1985. See
in particular the table on page 68.
[HCL:] Hierachical Clustering, HCL.java.
No additional class
libraries required. Currently the minimum variances criterion (Ward's
method) is supported. Changes for other
criteria (single, complete link, etc.) are indicated in the code.
To run:
(i) Compile: javac HCL.java
(ii) Get syntax to standard output: java HCL
(iii) Run on iris.dat: java HCL iris.dat >
hcloutput.txt.
- Correspondence analysis - background reading:
For theoretical background and thoroughly
comprehensive discussion of applications and practice, and most of all
providing a deep discussion of algorithms, see J.-P. Benzécri,
Correspondence Analysis Handbook, translated by T.K. Gopolan,
Marcel Dekker, 1992.
[CA:]
Correspondence Analysis on fuzzy coded input,
CAfuzzy.java,
which requires the JAMA class library (see below for a link).
To run:
(i) Assume JAMA class library is in subdirectory Jama in
current directory.
(ii) Compile: javac CAfuzzy.java
(iii) Get syntax to standard output: java CAfuzzy
(iv) Run on iris.dat: java CAfuzzy iris.dat >
caoutput.txt.
- Partitioning and k-means clustering - background reading: H. Späth,
Cluster Dissection and Analysis: Theory, Fortran Programs, Examples,
Ellis Horwood, 1985, covers various algorithms including the
exchange algorithm. (This algorithm is related to work by Simon
Régnier, at the Maison des Sciences de l'Homme around 1964.)
[Partition:]
Partitioning or k-means using minimal
distance criterion and exchange algorithm, partition.java,
which requires the JAMA class library (see below for a link),
because the initial clusters
are defined from principal component projections.
To run: i) Assume JAMA class library is in subdirectory Jama in
current directory.
(ii) Compile: javac partition.java
(iii) Get syntax to standard output: java partition
(iv) Run on iris.dat: java partition iris0.dat >
partoutput.txt.
4. Gaussian Mixture Modeling with Bayes Factors
This is a new area, where we will get - soon - programs in C uploaded,
mainly for image segmentation (including multiband images) based on
Markov random field models, and with use of Bayes factor inference
- Bayes information criterion and BIC in the pseudolikelihood case.
- Gaussian mixture model of a univariate data set, and BIC to
assess goodness of model fit. Program gmm-bic.c.
Sample input data oil.dat, oil prices from Ross.
To run: gmm-bic 3 749 oil.dat (3 = number of clusters, 749 =
number of observations or values in the input data file.) Output to
standard output device.
Output from the above.
(Reference for the Ross data: Sheldon M. Ross, An Elementary
Introduction to Mathematical Finance : Options and other Topics,
Cambridge University Press, 2002.)
- Version of the same program, but with the following normalizion used
by Ross: values analyzed are
log value(i+1) - log value(i). Program (only
difference: use of this preprocessing of the data).
Output from the foregoing.
- Coming soon: code for 2D and 3D single band, and 2D and 3D
multiband, image segmentation, using BIC or pseudo-likelihood information
criterion. (Used for work on multiband astronomy, Earth observation,
medical imagery - see publications list and recent papers accessible from
homepage.)
5. Wavelet Transform on a Hierarchy or Dendrogram
New hierarchical Haar wavelet transform
in R (see commented lines at start for example of use), which works on
a hierarchy produced by the foregoing hierarchical clustering programs.
This hierarchical Haar wavelet transform carries out the following
processing tasks: (i) from the data and a hierarchy, produce the
wavelet transform; (ii) filter the wavelet coefficients, using a
user-specified hard threshold; and (iii) reconstruct the data, i.e. perform
the inverse wavelet transform.
6. Multivariate Data Analysis Software in Fortran (and C)
The following is provided in case it is still of
interest. (Of note:
the agglomeration sequence visualization program.)
This is a collection of stand-alone routines, in Fortran (mostly) and C.
Sample data sets are available. Indications are given on how to compile, link
and run. Download the programs and run on your system. Many of these
programs were originally used on a VAX/VMS system, and later on Linux and
Solaris systems.
Please notify the author of any problems (although the programs are
provided "as is" and there are evident improvements which could be made).
The example of compiling, linking and running
given for principal components analysis (in Fortran) is similar to what is
required for the other Fortran programs here.
- S-Plus/R area. This subdirectory took all code
into a package. Listed below are standalone, individual programs.
- Principal components analysis (Fortran)
pcat.f, driver program
pca.f, routines used
spectr.dat, small sample dataset, originally
related to stellar spectra.
To compile and link: f77 pcat.f pca.f -o pcat
To run: pcat (output to screen, which may be directed to a file).
- Principal components analysis (C)
pca.c, program
To compile and link: cc pca.c -lm -o pcac
To run: pcac spectr.dat 36 8 R
Or: pcac spectr.dat 36 8 r
- Partitioning (Fortran)
partt.f, driver program
part.f, routines used
To compile and link: f77 partt.f part.f -o partt
Set up to run on spectr.dat (hard-wired in partt.f).
- Hierarchical clustering using stored dissimilarities (Fortran)
hct.f, driver program
hc.f, routines
hcass.f, cluster assignments routine
hcden.f, draw part of dendrogram
To compile and link: f77 hct.f hc.f hcass.f hcden.f -o hct
To run on spectr.dat (hard-wired): hct
- Hierarchical clustering without storage of dissimilarities (Fortran)
hcon2t.f, driver program
hcon2.f, routines
To compile and link: f77 hcon2t.f hcon2.f hcass.f hcden.f -o hcon2t
To run on spectr.dat (hard-wired): hcon2t
- Linear discrimant analysis (Fortran)
ldat.f, driver program
lda.f, routine
spectr2.dat, input dataset
- Multiple distriminant analysis (Fortran)
mdalum.f, driver program
mda.f, routine
luminosity.dat, input dataset
Additional diagnostics provided for this 3-class
supergiant - giant - dwarf dataset.
- K-nearest neighbors discriminant analysis - 2-class case (Fortran)
knn.f, driver program and routines used
ngc.dat, data - class number, followed by11
variables, for 500 objects.
The data is a set of 500 objects derived from an HST WF/PC image, in
the context of a study in star/cosmic ray hit discrimination.
To use: copy ngc.dat to ngc2.dat, to provide a second dataset. We
check assignments among the latter, relative to the assignments of objects
in ngc.dat. Hence, evidently for k=1 we must get exactly the correct
assignments in all cases. But for k=2 and higher, we may find
inconsistencies, which point to incorrect original class assignments.
Note: for k = even number, we allow for no assignment decision. I.e.
no clear majority.
Both idential or proportional prior situations are considered (i.e.
taking into account the relative numbers of class 1 and 2 memberships,
or not).
Also we run through all cases from k = 1, 2, ..., 15, which takes some
time.
To run: f77 knn.f -o knn
Then: knn (which takes ngc.dat and ngc2.dat as inputs, and produces
output on screen).
- Correspondence analysis (Fortran)
cat.f, driver program
ca.f, routines
uno.dat, input data (UNO voting)
- Classical or metric multidimensional scaling (Fortran)
cmdst.f, driver program
cmds.f, routines
cities.dat, input dataset
Currently this program generates its own random input data.
- Sammon mapping (Fortran)
sammon.f, program
iris.dat, input dataset (Fisher's iris data)
- Kohonen self-organizing feature map (Fortran)
koh.f, program
iris_norm.dat, input dataset (normalized
Fisher's iris data)
iris.f, for information, program to normalize data
- Sort routine (Fortran)
sort.f
- Weighted Levenshtein or edit distance between character strings (Fortran)
levenshtein.f
- Errors-in-variables regression, 2-dimensional (Fortran)
leiv1.f, York (Can. J. Phys. 44, 1079-1086, 1966)
leiv2.f, Fasano and Vio
leiv3.f, Ripley
All these programs are set up with sample driver routines and sample data.
They should perform exactly the same task.
- Minimal spanning tree, efficient algorithm in 2 dimensions (Fortran)
mst2d.f, program due to F.J. Rohlf, SUNY
- GMDH - group method for data handling, or Ivakhnenko polynomial
method (Fortran)
gmdh.f, program
gmhd.dat, sample data
See book by Farrow on GMDH.
- Point pattern matching approaches, software, data sets.
Preliminary web
page.
7. Links and Resources
- F. Murtagh, Multidimensional Clustering Algorithms,
Physica-Verlag, 1985.
Title pages and preface,
Contents,
Chapter 1,
Chapter 2,
Chapter 3,
Chapter 4,
Chapter 5.
- F. Murtagh
(with A. Heck), Multivariate Data Analysis, Kluwer, Dordrecht, 1987.
An expanded version of this
work is available online as a
271-page PDF file.
- Lectures, presentations on principal components
analysis, correspondence analysis,
other dimensionality reduction methods, discriminant analysis,
cluster analysis, with various applications.
- Collection of time series,
used in, among other work, F. Murtagh, "Identifying the
ultrametricity of time series", European Physical Journal B, 43, 573-579, 2005.
- [Data sets:]
Time
series data library. Over 800 time series from various domains. Very
useful.
- Prof. J.P. Benzécri's original
correspondence analysis code in Pascal, for Macintosh OS 9 (or earlier)
platform.
- NewMDSX - Multidimensional
Scaling, including CANDECOMP, HICLUS, INDSCAL-S, MDPREF, MINI-RSA,
MINISSA, MRSCAL, PINDIS, PREFMAP, PRO-FIT, TRISOSCAL, WOMBATS. Windows
executables and documentation.
For original programs, see
Bell Labs MDS programs.
- Visual user interfaces, using Kohonen
self-organizing feature maps and other data display approaches.
- FITS Table and Image File Reading in R: Currently only reading
FITS (Flexible Image Transport System)
Table or Image files in Windows is supported.
File rfitsio.zip, containing two dll files to
be loaded, and four R files to get height, width and total size information,
and to read the FITS file. Uses CFITSIO libraries from GSFC.
- Pointers
to, and addresses of, lots of
multivariate data analysis code, a text file collected or summarized
by F. Murtagh from mail messages or digest announcements. Look here for
whereabouts of code for: decision trees, clustering (C code),
multidimensional scaling and lots of other psychometric mapping methods,
Voronoi diagrams, etc.
- Statlib,
major repository of
statistical software, datasets, and information such as email lists and
organisational addresses, at Carnegie Mellon University (Mike Meyer,
mikem@stat.cmu.edu).
- Weka Machine Learning
Project, Waikato Environment for Knowledge Analysis,
machine learning workbench.
- Mixture modeling page.
- Java resources.
- [Multiresolution analysis:]
Examples of application. Matlab (and IDL)
software,
accompanying the book Sparse Image and Signal Processing:
Wavelets, Curvelets, Morphological Diversity, JL Starck, F Murtagh
and J Fadili, Cambridge University Press, 2010. Some other sites
with software, a large
package, executables for common platforms, and
for Windows PCs.
Author:
fmurtagh at acm dot org
Homepage