inchlib_clust
inchlib_clust is a Python script that performs data clustering and prepares input data for InCHlib. It can be used either from the command line or from Python code. Data for clustering are supplied to inchlib_clust as a CSV file; the first row with attribute headers is optional. If the command-line option -m is specified, metadata are also supplied in CSV format. New in inchlib_clust 0.1.3: if the command-line option -cm is specified, column metadata are also supplied in CSV format. Alternative data (command-line option -ad) are likewise supplied as a CSV file.
Input data example (file containing comma-separated values)
id,feature 1,feature 2,feature 3,feature 4
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
4,4.6,3.1,1.5,0.2
5,5,3.6,1.4,0.2
Metadata example (file containing comma-separated values)
id,class
1,class 1
2,class 1
3,class 2
4,class 2
5,class 3
Column metadata example (file containing comma-separated values)
Classification,active,inactive,inactive,active
Order,2,1,3,4
Alternative data example (file containing comma-separated values)
id,alt. header 1,alt. header 2,alt. header 3,alt. header 4
1,alt. value 1,alt. value 2,alt. value 3,alt. value 4
2,alt. value 1,alt. value 2,alt. value 3,alt. value 4
3,alt. value 1,alt. value 2,alt. value 3,alt. value 4
4,alt. value 1,alt. value 2,alt. value 3,alt. value 4
5,alt. value 1,alt. value 2,alt. value 3,alt. value 4
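The relationship between these files can be checked before clustering. The following is a minimal sketch that uses only Python's standard csv module and is not part of inchlib_clust. It assumes that metadata rows are matched to data rows by the id in the first column and that each column metadata row supplies one value per feature column. The helper names read_rows and check_inputs are hypothetical.

```python
import csv
import io

DATA = """id,feature 1,feature 2,feature 3,feature 4
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
"""

METADATA = """id,class
1,class 1
2,class 1
3,class 2
"""

COLUMN_METADATA = """Classification,active,inactive,inactive,active
Order,2,1,3,4
"""


def read_rows(text, delimiter=","):
    """Parse delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def check_inputs(data_text, metadata_text, column_metadata_text):
    """Return column metadata keyed by feature name after basic checks."""
    data = read_rows(data_text)
    metadata = read_rows(metadata_text)
    column_metadata = read_rows(column_metadata_text)

    header, rows = data[0], data[1:]
    features = header[1:]  # the first column holds the row id

    # Assumption: metadata rows refer to data rows through the id column.
    data_ids = [row[0] for row in rows]
    metadata_ids = [row[0] for row in metadata[1:]]
    if set(data_ids) != set(metadata_ids):
        raise ValueError("metadata ids do not match data ids")

    # Column metadata are row-oriented: the label is in the first column,
    # followed by one value per feature column.
    per_feature = {name: {} for name in features}
    for label, *values in column_metadata:
        if len(values) != len(features):
            raise ValueError(f"{label}: expected {len(features)} values")
        for name, value in zip(features, values):
            per_feature[name][label] = value
    return per_feature


result = check_inputs(DATA, METADATA, COLUMN_METADATA)
print(result["feature 2"])  # {'Classification': 'inactive', 'Order': '1'}
```

A check of this kind catches mismatched ids or a column metadata row of the wrong length before the files are passed to inchlib_clust.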
In the example below, inchlib_clust is invoked from the command line. Input data for clustering are given in input_file.csv, which contains attribute names in the first row (-dh parameter); individual values are separated by a comma (-dd , parameter). Metadata are supplied (-m parameter) in the metadata.csv file. It also contains metadata names in the first row (-mh parameter), and its values are separated by a comma (-md , parameter). Column metadata are supplied (-cm parameter) in the column_metadata.csv file. It also contains column metadata names (-cmh parameter), but unlike data and row metadata names, they are in the first column, because column metadata are stored one item per row. Individual column metadata values are separated by a comma (-cmd , parameter). Both rows and columns are clustered (-a both parameter) using Ward’s linkage (-rl ward and -cl ward parameters) with Euclidean distance (-rd euclidean and -cd euclidean parameters).
Command-line example
python inchlib_clust.py input_file.csv -m metadata.csv -cm column_metadata.csv -dh -mh -cmh -rd euclidean -rl ward -cd euclidean -cl ward -a both -dd , -md , -cmd ,
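The same invocation can be assembled from Python, which avoids shell quoting problems with delimiter characters. This is only a sketch: build_command is a hypothetical helper, it uses the option names from the parameter list in this document, and it assumes that inchlib_clust.py is in the current working directory.

```python
import subprocess
import sys


def build_command(data_file, metadata=None, column_metadata=None,
                  axis="both", distance="euclidean", linkage="ward",
                  delimiter=","):
    """Assemble the argument list for inchlib_clust.py (hypothetical helper)."""
    cmd = [sys.executable, "inchlib_clust.py", data_file,
           "-dh", "-dd", delimiter,
           "-rd", distance, "-rl", linkage,
           "-cd", distance, "-cl", linkage,
           "-a", axis]
    if metadata is not None:
        cmd += ["-m", metadata, "-mh", "-md", delimiter]
    if column_metadata is not None:
        cmd += ["-cm", column_metadata, "-cmh", "-cmd", delimiter]
    return cmd


cmd = build_command("input_file.csv", "metadata.csv", "column_metadata.csv")
print(" ".join(cmd))

# To actually run it (requires inchlib_clust.py and the input files):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list means each delimiter reaches the script as a single argument, with no shell interpretation.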
The full list of parameters is given below.
Command-line parameters
positional arguments:
data_file csv(text) data file with delimited values
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_FILE, --output_file OUTPUT_FILE
the name of output file (default: None)
-html DIRECTORY, --html_dir DIRECTORY
directory where simple HTML page with embedded cluster heatmap
and dependencies is stored (default: .)
-rd ROW_DISTANCE, --row_distance ROW_DISTANCE
set the distance to use for clustering rows(default: euclidean)
possible values: dice, hamming, jaccard,
kulsinski, matching, rogerstanimoto,
russellrao, sokalmichener, sokalsneath,
yule, braycurtis, canberra, chebyshev,
cityblock, correlation, cosine, euclidean,
mahalanobis, minkowski, seuclidean, sqeuclidean
-rl ROW_LINKAGE, --row_linkage ROW_LINKAGE
set the linkage to use for clustering rows (default: ward)
possible values: single, complete,
average, weighted, centroid, median, ward
-cd COLUMN_DISTANCE, --column_distance COLUMN_DISTANCE
set the distance to use for clustering columns(default: euclidean)
possible values: dice, hamming, jaccard,
kulsinski, matching, rogerstanimoto,
russellrao, sokalmichener, sokalsneath,
yule, braycurtis, canberra, chebyshev,
cityblock, correlation, cosine, euclidean,
mahalanobis, minkowski, seuclidean, sqeuclidean
-cl COLUMN_LINKAGE, --column_linkage COLUMN_LINKAGE
set the linkage to use for clustering columns (default: ward)
possible values: single, complete,
average, weighted, centroid, median, ward
-a AXIS, --axis AXIS define clustering axis (row/both) (default: row)
-dt DATATYPE, --datatype DATATYPE
specify the type of the data (numeric/binary)
(default: numeric)
-dd DATA_DELIMITER, --data_delimiter DATA_DELIMITER
delimiter of values in data file (default: ,)
-m METADATA, --metadata METADATA
csv(text) metadata file with delimited values
(default: None)
-md METADATA_DELIMITER, --metadata_delimiter METADATA_DELIMITER
delimiter of values in metadata file (default: ,)
-dh, --data_header whether the first row of data file is a header
(default: False)
-mh, --metadata_header
whether the first row of metadata file is a header
(default: False)
-c COMPRESS, --compress COMPRESS
compress the data to contain maximum of specified
count of rows
-cv COMPRESSED_VALUE, --compressed_value COMPRESSED_VALUE
the resulted value of merged rows (data points) (default: median)
possible values: median, mean
-mcv METADATA_COMPRESSED_VALUE, --metadata_compressed_value METADATA_COMPRESSED_VALUE
the resulted value from merged rows when the data are compressed (default: median)
possible values: median, mean, frequency
-dwd, --dont_write_data
don't write clustered data to the inchlib data format
(default: False)
-n, --normalize
normalize data to (0,1) scale
-wo, --write_original
only when normalize is set to True
cluster normalized data, but write original data to the heatmap
-cm COLUMN_METADATA, --column_metadata COLUMN_METADATA
csv(text) column metadata file with delimited values
(default: None)
-cmd COLUMN_METADATA_DELIMITER, --column_metadata_delimiter COLUMN_METADATA_DELIMITER
delimiter of values in column metadata file (default: ,)
-cmh, --column_metadata_header
whether the first column of the column metadata is the
row label ('header') (default: False)
-mv MISSING_VALUES, --missing_values MISSING_VALUES
defines the string representing missing/unknown values in the
data (default: False)
-ad ALTERNATIVE_DATA, --alternative_data ALTERNATIVE_DATA
csv(text) alternative data file with delimited values
(default: None)
-add ALTERNATIVE_DATA_DELIMITER, --alternative_data_delimiter ALTERNATIVE_DATA_DELIMITER
delimiter of values in alternative data file (default: ,)
-adh, --alternative_data_header
whether the first row of the alternative data is a
header (default: False)
-adcv ALTERNATIVE_DATA_COMPRESSED_VALUE, --alternative_data_compressed_value ALTERNATIVE_DATA_COMPRESSED_VALUE
the resulted value from merged rows when the data are compressed (default: median)
possible values: median, mean, frequency
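The -n option rescales the data to the (0,1) range. As a rough illustration of what such scaling does, the sketch below applies min-max normalization to each column independently. Both the per-column scaling and the function name min_max_normalize are assumptions made for this example; it is not the inchlib_clust implementation.

```python
def min_max_normalize(rows):
    """Rescale each column of a list of numeric rows to the range [0, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        # A constant column has no spread; map it to 0.0 to avoid
        # dividing by zero.
        scaled_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    # Transpose back from columns to rows.
    return [list(row) for row in zip(*scaled_columns)]


data = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
]
normalized = min_max_normalize(data)
print(normalized[0])  # [1.0, 1.0, 1.0, 0.0]
```

With -wo, the clustering would run on values such as normalized, while the heatmap would still show the values from data.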
inchlib_clust can also be invoked programmatically from your Python code. Its API is fully documented. There are two main classes in inchlib_clust: Cluster and Dendrogram. Cluster reads in and clusters the data. Dendrogram takes the Cluster instance as input and generates the cluster heatmap in the InCHlib input format. An example of use is given below.
Programming interface example
import inchlib_clust
# instantiate the Cluster object
c = inchlib_clust.Cluster()
# read a csv data file with the specified delimiter; also specify whether there is a header row (True/False), the string representing missing/unknown values (False if there are none) and the type of the data ("numeric" or "binary")
c.read_csv(filename="/path/to/file.csv", delimiter=",", header=True, missing_value=False, datatype="numeric")
# for a list of lists instead of a data file, use read_data():
# c.read_data(data, header=True, missing_value=False, datatype="numeric")
# normalize data to (0,1) scale, but after clustering write the original data to the heatmap (write_original=True)
c.normalize_data(feature_range=(0,1), write_original=True)
# cluster data according to the parameters
c.cluster_data(row_distance="euclidean", row_linkage="single", axis="row", column_distance="euclidean", column_linkage="ward")
# instantiate the Dendrogram class with the Cluster instance as an input
d = inchlib_clust.Dendrogram(c)
# create the cluster heatmap representation; compress is the maximum number of heatmap rows (here 100), compressed_value ("median" or "mean") is the resulting value of compressed (merged) rows, and write_data says whether to write the features
d.create_cluster_heatmap(compress=100, compressed_value="median", write_data=True)
# read a metadata file with the specified delimiter; also specify whether there is a header row
d.add_metadata_from_file(metadata_file="/path/to/file.csv", delimiter=",", header=True, metadata_compressed_value="frequency")
# read a column metadata file with the specified delimiter; also specify whether there is a 'header' column
d.add_column_metadata_from_file(column_metadata_file="/path/to/file.csv", delimiter=",", header=True)
# export the cluster heatmap to standard output, or to a file if a filename is specified
d.export_cluster_heatmap_as_json("filename.json")
# to export a simple HTML page with the embedded cluster heatmap and its dependencies to a given directory, use:
# d.export_cluster_heatmap_as_html("/path/to/directory")
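When compress is set, several rows are merged into one heatmap row, and compressed_value and metadata_compressed_value determine how the merged values are summarized. The sketch below illustrates the three options on a single group of rows. It assumes that "frequency" means the most frequent value in the group; merge_rows is a hypothetical helper, and how inchlib_clust chooses which rows to merge is not shown.

```python
from collections import Counter
from statistics import mean, median


def merge_rows(rows, method="median"):
    """Summarize a group of rows column by column."""
    merged = []
    for column in zip(*rows):
        if method == "median":
            merged.append(median(column))
        elif method == "mean":
            merged.append(mean(column))
        elif method == "frequency":
            # Assumption: "frequency" picks the most common value.
            merged.append(Counter(column).most_common(1)[0][0])
        else:
            raise ValueError(f"unknown method: {method}")
    return merged


features = [[5.1, 3.5], [4.9, 3.0], [4.7, 3.2]]
classes = [["class 1"], ["class 1"], ["class 2"]]

print(merge_rows(features, "median"))    # [4.9, 3.2]
print(merge_rows(classes, "frequency"))  # ['class 1']
```

Median and mean only make sense for numeric values, which is consistent with "frequency" being offered for metadata and alternative data but not for the clustered data itself.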