inchlib_clust
inchlib_clust is a Python script that performs data clustering and prepares input data for InCHlib. It can be used either from the command line or from Python code. Data for clustering are supplied to inchlib_clust as a CSV file; a first row with attribute headers is optional. If the command-line option -m is specified, row metadata are also supplied in CSV format. New in inchlib_clust 0.1.3: if the command-line option -cm is specified, column metadata are also supplied in CSV format.
Input data example (file containing comma-separated values)
id,feature 1,feature 2,feature 3,feature 4
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
4,4.6,3.1,1.5,0.2
5,5,3.6,1.4,0.2
Metadata example (file containing comma-separated values)
id,class
1,class 1
2,class 1
3,class 2
4,class 2
5,class 3
Column metadata example (file containing comma-separated values)
Classification,active,inactive,inactive,active
Order,2,1,3,4
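Column metadata run across the file rather than down it: each row is one metadata item, and each value belongs to one data column. The sketch below illustrates that orientation using only the Python standard library. The helper name map_column_metadata is hypothetical and not part of inchlib_clust, and the rule that every row must carry exactly one value per feature is an assumption drawn from the examples above.

```python
import csv
import io

# Example files from this page, inlined so the sketch is self-contained.
DATA_CSV = """id,feature 1,feature 2,feature 3,feature 4
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
"""
COLUMN_METADATA_CSV = """Classification,active,inactive,inactive,active
Order,2,1,3,4
"""


def map_column_metadata(data_text, column_metadata_text):
    """Pair each column metadata value with the data column it describes.

    Hypothetical helper for illustration only; not part of inchlib_clust.
    """
    header = next(csv.reader(io.StringIO(data_text)))
    feature_names = header[1:]  # skip the id column
    mapping = {}
    for row in csv.reader(io.StringIO(column_metadata_text)):
        label, values = row[0], row[1:]  # the label sits in the first column
        if len(values) != len(feature_names):
            raise ValueError(
                f"{label}: expected {len(feature_names)} values, got {len(values)}"
            )
        mapping[label] = dict(zip(feature_names, values))
    return mapping


column_metadata = map_column_metadata(DATA_CSV, COLUMN_METADATA_CSV)
print(column_metadata["Classification"]["feature 2"])  # inactive
```

A check like this can catch a column metadata file whose rows are shorter or longer than the data header before the files are passed to inchlib_clust.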
Alternative data example (file containing comma-separated values)
id,alt. header 1,alt. header 2,alt. header 3,alt. header 4
1,alt. value 1,alt. value 2,alt. value 3,alt. value 4
2,alt. value 1,alt. value 2,alt. value 3,alt. value 4
3,alt. value 1,alt. value 2,alt. value 3,alt. value 4
4,alt. value 1,alt. value 2,alt. value 3,alt. value 4
5,alt. value 1,alt. value 2,alt. value 3,alt. value 4
In the example below, inchlib_clust is invoked from the command line. Input data for clustering are given in input_file.csv, which contains attribute names in the first row (-dh parameter); individual values are separated by a comma (-dd , parameter). Metadata are supplied (-m parameter) in metadata.csv. That file also contains metadata names in its first row (-mh parameter), and its values are separated by a comma (-md , parameter). Column metadata are supplied (-cm parameter) in column_metadata.csv. That file also contains column metadata names (-cmh parameter), but unlike data and row metadata names they are in the first column, because column metadata are laid out row by row. Individual column metadata values are separated by a comma (-cmd , parameter). Both rows and columns are clustered (-a both parameter) using Ward's linkage (-rl ward and -cl ward parameters) with Euclidean distance (-rd euclidean and -cd euclidean parameters).
Command-line example
python inchlib_clust.py input_file.csv -m metadata.csv -cm column_metadata.csv -dh -mh -cmh -rd euclidean -rl ward -cd euclidean -cl ward -a both -dd , -md , -cmd ,
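When the script is driven from another program, it can help to assemble the same invocation as an argument list instead of a shell string. The following is a minimal sketch under stated assumptions: build_inchlib_command is a hypothetical helper, it assumes all input files have a header and share one delimiter, and it relies on the documented defaults for distance and linkage. Only option names that appear in the parameter list on this page are used.

```python
def build_inchlib_command(data_file, metadata=None, column_metadata=None,
                          axis="both", delimiter=","):
    """Assemble an argument list for inchlib_clust.py.

    Hypothetical helper for illustration only; not part of inchlib_clust.
    """
    cmd = ["python", "inchlib_clust.py", data_file,
           "-dh", "-dd", delimiter, "-a", axis]
    if metadata is not None:
        # row metadata file with a header row
        cmd += ["-m", metadata, "-mh", "-md", delimiter]
    if column_metadata is not None:
        # column metadata file with labels in the first column
        cmd += ["-cm", column_metadata, "-cmh", "-cmd", delimiter]
    return cmd


command = build_inchlib_command("input_file.csv", "metadata.csv",
                                "column_metadata.csv")
print(" ".join(command))

# To run it (requires inchlib_clust.py in the working directory):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with delimiters such as a comma or a tab.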
The full list of parameters is given below.
Command-line parameters
positional arguments:
  data_file             csv(text) data file with delimited values

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        the name of output file (default: None)
  -html DIRECTORY, --html_dir DIRECTORY
                        directory where simple HTML page with embedded
                        cluster heatmap and dependencies is stored
                        (default: .)
  -rd ROW_DISTANCE, --row_distance ROW_DISTANCE
                        set the distance to use for clustering rows(default:
                        euclidean)
                        possible values: dice, hamming, jaccard, kulsinski,
                        matching, rogerstanimoto, russellrao, sokalmichener,
                        sokalsneath, yule, braycurtis, canberra, chebyshev,
                        cityblock, correlation, cosine, euclidean,
                        mahalanobis, minkowski, seuclidean, sqeuclidean
  -rl ROW_LINKAGE, --row_linkage ROW_LINKAGE
                        set the linkage to use for clustering rows
                        (default: ward)
                        possible values: single, complete, average,
                        weighted, centroid, median, ward
  -cd COLUMN_DISTANCE, --column_distance COLUMN_DISTANCE
                        set the distance to use for clustering
                        columns(default: euclidean)
                        possible values: dice, hamming, jaccard, kulsinski,
                        matching, rogerstanimoto, russellrao, sokalmichener,
                        sokalsneath, yule, braycurtis, canberra, chebyshev,
                        cityblock, correlation, cosine, euclidean,
                        mahalanobis, minkowski, seuclidean, sqeuclidean
  -cl COLUMN_LINKAGE, --column_linkage COLUMN_LINKAGE
                        set the linkage to use for clustering columns
                        (default: ward)
                        possible values: single, complete, average,
                        weighted, centroid, median, ward
  -a AXIS, --axis AXIS  define clustering axis (row/both) (default: row)
  -dt DATATYPE, --datatype DATATYPE
                        specify the type of the data (numeric/binary)
                        (default: numeric)
  -dd DATA_DELIMITER, --data_delimiter DATA_DELIMITER
                        delimiter of values in data file (default: ,)
  -m METADATA, --metadata METADATA
                        csv(text) metadata file with delimited values
                        (default: None)
  -md METADATA_DELIMITER, --metadata_delimiter METADATA_DELIMITER
                        delimiter of values in metadata file (default: ,)
  -dh, --data_header    whether the first row of data file is a header
                        (default: False)
  -mh, --metadata_header
                        whether the first row of metadata file is a header
                        (default: False)
  -c COMPRESS, --compress COMPRESS
                        compress the data to contain maximum of specified
                        count of rows
  -cv COMPRESSED_VALUE, --compressed_value COMPRESSED_VALUE
                        the resulted value of merged rows (data points)
                        (default: median)
                        possible values: median, mean
  -mcv METADATA_COMPRESSED_VALUE, --metadata_compressed_value METADATA_COMPRESSED_VALUE
                        the resulted value from merged rows when the data
                        are compressed (default: median)
                        possible values: median, mean, frequency
  -dwd, --dont_write_data
                        don't write clustered data to the inchlib data
                        format (default: False)
  -n, --normalize       normalize data to (0,1) scale
  -wo, --write_original
                        only when normalize is set to True
                        cluster normalized data, but write original data to
                        the heatmap
  -cm COLUMN_METADATA, --column_metadata COLUMN_METADATA
                        csv(text) column metadata file with delimited values
                        (default: None)
  -cmd COLUMN_METADATA_DELIMITER, --column_metadata_delimiter COLUMN_METADATA_DELIMITER
                        delimiter of values in column metadata file
                        (default: ,)
  -cmh, --column_metadata_header
                        whether the first column of the column metadata is
                        the row label ('header') (default: False)
  -mv MISSING_VALUES, --missing_values MISSING_VALUES
                        defines the string representing missing/unknown
                        values in the data (default: False)
  -ad ALTERNATIVE_DATA, --alternative_data ALTERNATIVE_DATA
                        csv(text) alternative data file with delimited
                        values (default: None)
  -add ALTERNATIVE_DATA_DELIMITER, --alternative_data_delimiter ALTERNATIVE_DATA_DELIMITER
                        delimiter of values in alternative data file
                        (default: ,)
  -adh, --alternative_data_header
                        whether the first row of the alternative data is a
                        header (default: False)
  -adcv ALTERNATIVE_DATA_COMPRESSED_VALUE, --alternative_data_compressed_value ALTERNATIVE_DATA_COMPRESSED_VALUE
                        the resulted value from merged rows when the data
                        are compressed (default: median)
                        possible values: median, mean, frequency
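The -n option is described only as normalizing data to the (0,1) scale. As a rough illustration of what such scaling usually means, here is a minimal sketch that assumes column-wise min-max scaling. That is an assumption, not a description of the script's internals, and the function name min_max_normalize is hypothetical.

```python
def min_max_normalize(rows, feature_range=(0.0, 1.0)):
    """Scale every column of a list of numeric rows into feature_range.

    Illustrative only; inchlib_clust may scale its data differently.
    """
    low, high = feature_range
    scaled_columns = []
    for column in zip(*rows):  # iterate over columns instead of rows
        col_min, col_max = min(column), max(column)
        span = col_max - col_min
        if span == 0:
            # a constant column carries no information; map it to the lower bound
            scaled_columns.append([low] * len(column))
        else:
            scaled_columns.append(
                [low + (value - col_min) / span * (high - low) for value in column]
            )
    # transpose back so the result is again a list of rows
    return [list(row) for row in zip(*scaled_columns)]


data = [[5.1, 3.5], [4.9, 3.0], [4.7, 3.2]]
normalized = min_max_normalize(data)
for original, scaled in zip(data, normalized):
    print(original, "->", [round(value, 3) for value in scaled])
```

With -wo, clustering would run on values like the scaled ones while the heatmap still shows the original ones.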

inchlib_clust can also be invoked programmatically from your Python code. Its API is fully documented. There are two main classes in inchlib_clust: Cluster and Dendrogram. Cluster reads in the data and clusters them. Dendrogram takes the Cluster instance as input and generates the cluster heatmap in the InCHlib input format. An example of their use is given below.
Programming interface example
import inchlib_clust

# instantiate the Cluster object
c = inchlib_clust.Cluster()

# read a csv data file with the given delimiter; also specify whether there is a header row,
# the type of the data ("numeric" or "binary") and the string that represents
# missing/unknown values (False if there is none)
c.read_csv(filename="/path/to/file.csv", delimiter=",", header=True, missing_value=False, datatype="numeric")
# for a list of lists instead of a data file, use read_data():
# c.read_data(data, header=True, missing_value=False, datatype="numeric")

# normalize data to the (0,1) scale; with write_original=True the original
# (unnormalized) data are written to the heatmap after clustering
c.normalize_data(feature_range=(0,1), write_original=True)

# cluster data according to the parameters
c.cluster_data(row_distance="euclidean", row_linkage="single", axis="row", column_distance="euclidean", column_linkage="ward")

# instantiate the Dendrogram class with the Cluster instance as an input
d = inchlib_clust.Dendrogram(c)

# create the cluster heatmap representation; compress is the maximum number of heatmap rows
# (100 is an arbitrary example value), compressed_value is the value that replaces
# compressed (merged) rows, and write_data controls whether the features are written
d.create_cluster_heatmap(compress=100, compressed_value="median", write_data=True)

# read a metadata file with the given delimiter; also specify whether there is a header row
d.add_metadata_from_file(metadata_file="/path/to/file.csv", delimiter=",", header=True, metadata_compressed_value="frequency")

# read a column metadata file with the given delimiter; also specify whether there is a 'header' column
d.add_column_metadata_from_file(column_metadata_file="/path/to/file.csv", delimiter=",", header=True)

# export the cluster heatmap to standard output, or to a file if a filename is given
d.export_cluster_heatmap_as_json("filename.json")
# d.export_cluster_heatmap_as_html("/path/to/directory") exports a simple HTML page with the embedded cluster heatmap and its dependencies to the given directory
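The read_data() call mentioned above takes a list of lists instead of a file name. The sketch below shows one way to build such a list with the standard csv module. It assumes that the expected layout mirrors the CSV input (optional header row first, row id in the first column) and that values may be passed as strings; neither is confirmed here, and csv_to_rows is a hypothetical helper.

```python
import csv
import io

# Inlined sample data so the sketch runs without any files.
CSV_TEXT = """id,feature 1,feature 2,feature 3,feature 4
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
"""


def csv_to_rows(text, delimiter=","):
    """Parse delimited text into a list of lists, skipping empty lines.

    Hypothetical helper for illustration only; not part of inchlib_clust.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if row]


rows = csv_to_rows(CSV_TEXT)
print(rows[0])   # header row
print(rows[1:])  # data rows, id first

# Assumed usage with inchlib_clust (not executed here):
# c = inchlib_clust.Cluster()
# c.read_data(rows, header=True, missing_value=False, datatype="numeric")
```

This route is convenient when the data already live in memory, for example as the result of a database query, and writing a temporary CSV file would be unnecessary.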