MolSetInspector (Molecular Sets Inspector) is a Python package that facilitates the processing of multiple molecular sets stored in various text file formats. As input, it takes a directory containing sets of molecules stored in sdf, csv, smi or txt files. The sets are read and merged into one library of distinct molecules. During processing, the molecules are canonicalized; optionally, they can be standardised (neutralised, unsalted etc.) and tautomers can be removed. As output, MolSetInspector provides the intersections of the individual molecular sets, the IDs of defective (unparsable) molecules, and the list of distinct molecules, optionally with a hit table showing in which set(s) each molecule was found. MolSetInspector can also filter the distinct molecules by diversity in two ways: by setting 1) the maximum total number of diverse molecules or 2) the maximum similarity threshold for any pair of molecules in the set.
Installation
On Linux, install MolSetInspector using the following commands:
sudo unzip MolSetInspector-0.1.0.zip
cd MolSetInspector-0.1.0/
sudo python setup.py install
On Windows, install MolSetInspector using the following command:
python setup.py install
MolSetInspector requires the following dependencies to be installed:
- RDKit: Open-Source Cheminformatics Software (tested with release 2015-03-01, download RDKit 2015-03-01)
- Standardiser (tested with version 0.1.7, download Standardiser 0.1.7)
Usage
MolSetInspector can be used either from the command line or from Python code.
The example below invokes MolSetInspector from the command line. The path to the input directory (/path/to/input_directory) is the only required argument. The -o option (-o /path/to/output_directory) specifies the directory where all exports are written. All molecular sets in the input directory are read and the molecules are standardised (-standardise option). The intersection table for all pairs of sets is written to the output directory (-inter option). The distinct compounds combined from all sets are filtered so that no pair of molecules exceeds a similarity of 0.7, measured as Tanimoto similarity on ECFP4 fingerprints (-dist and -dbs 0.7 options). The diverse set is written to the output directory as an .sdf file (-outf sdf option).
python molsetinspector.py /path/to/input_directory -o /path/to/output_directory -dist -dbs 0.7 -inter -standardise -outf sdf
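How MolSetInspector selects diverse molecules internally is not described here. The following is only a minimal sketch of the two filtering ideas (maximum similarity threshold and maximum total number), assuming fingerprints are given as sets of "on" bits. The function names and the toy fingerprints are hypothetical; real ECFP4 fingerprints would come from RDKit.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def filter_by_max_similarity(fps: Dict[str, FrozenSet[int]], threshold: float) -> List[str]:
    """Greedily keep a molecule only if its similarity to every
    already kept molecule does not exceed the threshold."""
    kept: List[str] = []
    for name, fp in fps.items():
        if all(tanimoto(fp, fps[other]) <= threshold for other in kept):
            kept.append(name)
    return kept


def pick_total_number(fps: Dict[str, FrozenSet[int]], n: int) -> List[str]:
    """Pick at most n molecules; each new pick is the candidate whose
    highest similarity to the already picked molecules is lowest."""
    names = list(fps)
    if n <= 0 or not names:
        return []
    picked = [names[0]]
    while len(picked) < min(n, len(names)):
        candidates = [m for m in names if m not in picked]
        best = min(
            candidates,
            key=lambda m: max(tanimoto(fps[m], fps[p]) for p in picked),
        )
        picked.append(best)
    return picked


# Toy fingerprints (hypothetical bit sets, not real ECFP4 output).
fingerprints = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),     # similarity to mol_a: 3/5 = 0.6
    "mol_c": frozenset({1, 2, 3, 4, 6}),  # similarity to mol_a: 4/5 = 0.8
    "mol_d": frozenset({7, 8, 9}),        # shares no bits with the others
}

by_similarity = filter_by_max_similarity(fingerprints, threshold=0.7)
by_number = pick_total_number(fingerprints, n=2)
print(by_similarity)
print(by_number)
```

In this sketch the result of the threshold filter depends on the order in which molecules are visited, which is a property of the greedy approach and not necessarily of MolSetInspector.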
Example
We have an input directory with 4 molecular sets in files with different formats:
input_directory/
set_1.sdf (810 molecules, classic SDF file)
set_2.smi (319 molecules, file with molecules in SMILES format one per line)
set_3.csv (94 molecules, CSV file with 2 columns with header, molecules in SMILES)
set_4.csv (354 molecules, CSV file with 7 columns without header, molecules in SMILES)
Running the command above creates the output directory:
output_directory/
distinct.sdf (1442 distinct molecules, classic SDF file)
interesections.csv (CSV file with table containing number of common molecules for all pairs of sets)
If we ran the command with the -hit and -outf csv options instead, we would get:
output_directory/
distinct.csv (1442 distinct molecules in SMILES with information about the presence of molecule in each set)
interesections.csv (CSV file with table containing number of common molecules for all pairs of sets)
interesections.csv
| | set_2.smi | set_4.csv | set_1.sdf | set_3.csv |
|---|---|---|---|---|
| set_2.smi | 319 | 26 | 0 | 0 |
| set_4.csv | 26 | 351 | 10 | 0 |
| set_1.sdf | 0 | 10 | 791 | 0 |
| set_3.csv | 0 | 0 | 0 | 94 |
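A table like the one above can be pictured as pairwise set intersections over canonical SMILES strings. The sketch below is an illustration only, not MolSetInspector's implementation; the function name, set names and molecules are made up, and the SMILES are assumed to be canonicalized already.

```python
from typing import Dict, List, Set


def intersection_table(sets: Dict[str, Set[str]]) -> List[List[object]]:
    """Build a list-of-lists table with the number of common
    molecules for every pair of sets (header row first)."""
    names = list(sets)
    table: List[List[object]] = [[""] + names]
    for row_name in names:
        row: List[object] = [row_name]
        for col_name in names:
            row.append(len(sets[row_name] & sets[col_name]))
        table.append(row)
    return table


# Hypothetical sets of already canonicalized SMILES strings.
molecule_sets = {
    "set_a.smi": {"CCO", "CCN", "c1ccccc1"},
    "set_b.csv": {"CCO", "CCC"},
    "set_c.sdf": {"O"},
}

table = intersection_table(molecule_sets)
for row in table:
    print(row)
```

In such a table the diagonal holds the number of distinct molecules in each set, which may be why the diagonal values above can be lower than the molecule counts listed for the input files (duplicates and unparsable molecules would not be counted).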
distinct.csv (5 of 1443 lines in total)
smiles | set_2.smi | set_4.csv | set_1.sdf | set_3.csv | sum |
---|---|---|---|---|---|
CC(=NNC(=O)C1CCCC1)c1ccc(NC(=O)C(F)(F)F)cc1 | 1 | 1 | 2 | ||
O=C(O)c1cccnc1SCCc1ccccc1 | 1 | 1 | |||
Cc1c2cnccc2c(C)c2c1c1ccccc1n2CCOC(=O)c1ccccc1 | 1 | 1 | |||
CC1=C(C)C(Cc2ccc(O)cc2)N(Cc2ccccc2)CC1 | 1 | 1 |
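The idea behind the hit table can be sketched as follows: take the union of all sets, then mark for each distinct molecule the sets that contain it. This is a simplified illustration under the assumption that molecules are compared as canonical SMILES strings; the function name and data are hypothetical, and the 0/1 encoding follows the description of the -hit option.

```python
from typing import Dict, List, Set


def hit_table(sets: Dict[str, Set[str]]) -> List[List[object]]:
    """Return rows of [smiles, hit per set..., sum] for all distinct
    molecules, preceded by a header row."""
    names = list(sets)
    distinct = sorted(set().union(*sets.values()))
    rows: List[List[object]] = [["smiles"] + names + ["sum"]]
    for smiles in distinct:
        hits = [1 if smiles in sets[name] else 0 for name in names]
        rows.append([smiles] + hits + [sum(hits)])
    return rows


# Hypothetical sets of already canonicalized SMILES strings.
molecule_sets = {
    "set_a.smi": {"CCO", "CCN", "c1ccccc1"},
    "set_b.csv": {"CCO", "CCC"},
    "set_c.sdf": {"O"},
}

rows = hit_table(molecule_sets)
for row in rows:
    print(row)
```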
List of all possible MolSetInspector arguments:
positional arguments:
input_directory directory containing molecular sets files
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
output directory (default: False)
-outf OUTPUT_FORMAT, --output_format OUTPUT_FORMAT
output format for distinct molecules (smi, sdf, csv)
(default: csv)
-dist, --distinct write file with distinct molecules from all sets
(default: False)
-hit, --hit_table append table with 0/1 to distinct export which shows
the sets that contains the molecule (default: False)
-inter, --intersections
write file with set intersections (default: False)
-defect, --defective_molecules
write file with molecules that couldn't be parsed
(default: False)
-standardise, --standardise_structures
standardise structures using the standardiser
(default: False)
-tautomer, --remove_tautomers
remove tautomers (default: False)
-dbs DIVERSE_BY_MAX_SIMILARITY, --diverse_by_max_similarity DIVERSE_BY_MAX_SIMILARITY
get diverse structures by maximum similarity treshold (from 0 to 1, 1 is most similar)
(default: False)
-dbn DIVERSE_BY_TOTAL_NUMBER, --diverse_by_total_number DIVERSE_BY_TOTAL_NUMBER
get maximum specified number of diverse structures (positive integer)
(default: False)
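For readers who want to see how such an interface could be declared, here is a minimal argparse sketch that mirrors the option names and defaults listed above. It is an assumption based only on the help text, not the actual source of molsetinspector.py; the build_parser function is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options from the help text above (a sketch only)."""
    parser = argparse.ArgumentParser(prog="molsetinspector.py")
    parser.add_argument("input_directory",
                        help="directory containing molecular sets files")
    parser.add_argument("-o", "--output_directory", default=False,
                        help="output directory")
    parser.add_argument("-outf", "--output_format", default="csv",
                        help="output format for distinct molecules (smi, sdf, csv)")
    # Boolean switches: all default to False and become True when given.
    switches = [
        ("-dist", "--distinct"),
        ("-hit", "--hit_table"),
        ("-inter", "--intersections"),
        ("-defect", "--defective_molecules"),
        ("-standardise", "--standardise_structures"),
        ("-tautomer", "--remove_tautomers"),
    ]
    for short, long_name in switches:
        parser.add_argument(short, long_name, action="store_true", default=False)
    parser.add_argument("-dbs", "--diverse_by_max_similarity",
                        type=float, default=False)
    parser.add_argument("-dbn", "--diverse_by_total_number",
                        type=int, default=False)
    return parser


# Parse the same arguments as the command-line example in the Usage section.
args = build_parser().parse_args([
    "/path/to/input_directory",
    "-o", "/path/to/output_directory",
    "-dist", "-dbs", "0.7", "-inter", "-standardise", "-outf", "sdf",
])
print(args)
```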
The example below shows how to use MolSetInspector from Python code:
from molsetinspector import MolSetInspector as MSI

# Create an instance of MolSetInspector: specify the input directory, the output
# directory, and whether molecules should be standardised and tautomers removed.
msi = MSI(indirectory="/path/to/input_directory", outdirectory="/path/to/output_directory", standardise=True, remove_tautomers=False)

# Get a table (list of lists) with the number of common molecules for all pairs of sets.
# When an output directory is specified, the result is also written to a .csv file.
msi.get_set_intersections()

# Get the distinct molecules combined from all molecular sets.
# diversity_type: filter the set by diversity ("max_similarity" or "total_number").
# treshold: criterion for the diversity selection
#           (max_similarity: float from 0.0 to 1.0; total_number: positive integer).
# hit_table: append information about the presence of each molecule in each set;
#            only works with the csv output format.
# output_format: "sdf", "smi" or "csv".
# When an output directory is specified, the result is also written to an .sdf/.smi/.csv file.
msi.get_distinct(diversity_type="max_similarity", treshold=0.7, hit_table=True, output_format="csv")

# Get the molecules that could not be parsed by RDKit. Each molecule is identified
# by its original molecular set and its position index in the file (starting from 1).
# When an output directory is specified, the result is also written to a .csv file.
msi.get_defective()
This work was sponsored by the Project NPU I (LO1220) of the Ministry of Education, Youth and Sports.