aimsim.chemical_datastructures package
Submodules
aimsim.chemical_datastructures.molecule module
Abstraction of RDKit molecule with relevant property manipulation methods.
- class aimsim.chemical_datastructures.molecule.Molecule(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)
Bases:
object
An abstraction of a molecule
- mol_graph
Graph-level information of the molecule. Implemented as an RDKit mol object.
- Type
RDKit mol object
- mol_text
Text identifier of the molecule.
- Type
str
- mol_property_val
Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc.
- Type
float
- descriptor
Vector representation of a molecule. Commonly a fingerprint.
- Type
Descriptor object
- set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)
Set the descriptor value either by passing an arbitrary value or by specifying a fingerprint that will be generated.
- get_descriptor_val()
Get the descriptor value as a numpy array.
- match_fingerprint_from(reference_mol)
Generate the same fingerprint as the reference_mol.
- get_similarity_to(target_mol, similarity_measure)
Get the similarity to target_mol using a similarity_measure of choice.
- get_name()
Get the mol_text attribute.
- get_mol_property_val()
Get mol_property_val attribute.
- draw(fpath=None, **kwargs)
Draw the molecule.
- is_same(source_molecule, target_molecule)
Static method used to check equivalence of two molecules.
- __init__(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)
Constructor
- Parameters
mol_graph (RDKit mol object) – Graph-level information of the molecule. Implemented as an RDKit mol object. Default is None.
mol_text (str) – Text identifier of the molecule. Default is None. Identifiers can be: 1. Name of the molecule. 2. SMILES string representing the molecule.
mol_property_val (float) – Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc. Default is None.
mol_descriptor_val (numpy ndarray) – Descriptor value for the molecule. Must be numpy array or list. Default is None.
mol_src (str) –
Source file or SMILES string to load the molecule from. Acceptable files:
- .pdb file
- .txt file with the SMILES string in the first column, first row and (optionally) the property in the second column, first row.
Default is None. If provided, loading of mol_graph is attempted from it.
mol_smiles (str) – SMILES string for molecule. If provided, mol_graph is loaded from it. If mol_text not set in keyword argument, this string is used to set it.
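The .txt layout described for mol_src (SMILES string in the first column, optional property in the second) can be read with a few lines of standard-library Python. This is only a sketch and not AIMSim's loader: the helper name `parse_mol_txt_line` and the whitespace column separator are assumptions made for illustration.

```python
def parse_mol_txt_line(first_line):
    """Split the first row of a .txt source into (smiles, property).

    Hypothetical helper: assumes columns are separated by whitespace.
    """
    fields = first_line.split()
    if not fields:
        raise ValueError("Empty line: no SMILES string found.")
    smiles = fields[0]
    # The property column is optional; keep None when it is missing.
    mol_property_val = float(fields[1]) if len(fields) > 1 else None
    return smiles, mol_property_val


print(parse_mol_txt_line("CCO 78.4"))  # SMILES plus a property value
print(parse_mol_txt_line("c1ccccc1"))  # SMILES only
```

The returned pair corresponds to what would be passed as mol_smiles and mol_property_val in the constructor above.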
- draw(fpath=None, **kwargs)
Draw the molecule graph.
- Parameters
fpath (str) – Path of the file in which to store the image. If None, the image is displayed in a Tkinter window. Default is None.
kwargs (keyword arguments) – Arguments to modify plot properties.
- get_descriptor_val()
Get value of molecule descriptor.
- Returns
value(s) of the descriptor.
- Return type
np.ndarray
- get_mol_property_val()
- get_name()
- get_similarity_to(target_mol, similarity_measure)
Get a similarity metric to a target molecule.
- Parameters
target_mol (AIMSim.ops Molecule) – Target molecule. Similarity score is with respect to this molecule
similarity_measure (AIMSim.ops SimilarityMeasure) – metric used.
- Returns
Similarity coefficient by the chosen method.
- Return type
similarity_score (float)
- Raises
NotInitializedError – If target_mol has an uninitialized descriptor. See note.
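As an illustration of what a similarity coefficient between two descriptors looks like, the sketch below computes the Tanimoto coefficient of two binary fingerprints. It does not call AIMSim: the function name is hypothetical, the `NotInitializedError` class here is a local stand-in for the error named above, and Tanimoto is just one possible choice of similarity_measure.

```python
class NotInitializedError(Exception):
    """Stand-in for the error named above (hypothetical definition)."""


def tanimoto_similarity(source_bits, target_bits):
    """Tanimoto coefficient between two equal-length binary vectors."""
    if source_bits is None or target_bits is None:
        raise NotInitializedError("Descriptor has not been set.")
    if len(source_bits) != len(target_bits):
        raise ValueError("Fingerprints must have the same length.")
    both_on = sum(1 for a, b in zip(source_bits, target_bits) if a and b)
    either_on = sum(1 for a, b in zip(source_bits, target_bits) if a or b)
    # Two all-zero fingerprints are treated as identical in this sketch.
    return both_on / either_on if either_on else 1.0


fp_a = [1, 1, 0, 1, 0, 0]
fp_b = [1, 0, 0, 1, 1, 0]
print(tanimoto_similarity(fp_a, fp_b))  # 2 shared bits / 4 set bits = 0.5
```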
- static is_same(source_molecule, target_molecule)
Check if the target_molecule is a duplicate of source_molecule.
- Parameters
source_molecule (AIMSim.chemical_datastructures Molecule) – Source molecule to compare.
target_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.
- Returns
True if the molecules are the same.
- Return type
bool
- match_fingerprint_from(reference_mol)
If reference_mol.descriptor is a fingerprint, this method tries to calculate the same fingerprint for this molecule (self). If this fails because the mol_graph attribute is absent, a ValueError is raised.
- Parameters
reference_mol (AIMSim.ops Molecule) – Target molecule. Fingerprint of this molecule is used as the reference.
- Raises
ValueError –
- set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)
Sets molecular descriptor attribute.
- Parameters
arbitrary_descriptor_val (np.array or list) – Arbitrary descriptor vector. Default is None.
fingerprint_type (str) – String label specifying which fingerprint to use. Default is None.
fingerprint_params (dict) – Additional parameters for modifying fingerprint defaults. Default is None.
aimsim.chemical_datastructures.molecule_set module
Abstraction of a data set comprising multiple Molecule objects.
- class aimsim.chemical_datastructures.molecule_set.MoleculeSet(molecule_database_src: str, molecule_database_src_type: str, is_verbose: bool, similarity_measure: str, n_threads=1, fingerprint_type=None, fingerprint_params=None, sampling_ratio=1.0, sampling_random_state=42)
Bases:
object
An abstraction of a collection of molecules constituting a chemical dataset.
- is_verbose
Controls how much information is displayed during plotting.
- Type
bool
- molecule_database
Collection of Molecule objects.
- Type
list
- descriptor
Descriptor or fingerprint used to featurize molecules in the molecule set.
- Type
- similarity_measure
Similarity measure used.
- Type
- similarity_matrix
n_mols X n_mols matrix of pairwise similarity scores.
- Type
numpy ndarray
- sampling_ratio
Fraction of dataset to keep for analysis. Default is 1.
- Type
float
- n_threads
Number of threads used for analysis. Can be an integer denoting the number of threads or ‘auto’ to heuristically determine if multiprocessing is worthwhile, based on a curve fitted to the speedup data in the manuscript SI. Default is 1.
- Type
int or str
- is_present(target_molecule)
Searches the name of a target molecule in the molecule set to determine if the target molecule is present in the molecule set.
- compare_against_molecule(query_molecule)
Compare a query molecule to all molecules of the set.
- get_most_similar_pairs()
Get pairs of samples which are most similar.
- get_most_dissimilar_pairs()
Get pairs of samples which are least similar.
- get_property_of_most_similar()
Get property of pairs of molecules which are most similar to each other.
- get_property_of_most_dissimilar()
Get property of pairs of molecules which are most dissimilar to each other.
- get_similarity_matrix()
Get the similarity matrix for the data set.
- get_distance_matrix()
Get the distance matrix for the data set. This can only be done for similarity measures which yield valid distances.
- get_pairwise_similarities()
Get an array of pairwise similarities of molecules in the set.
- get_mol_names()
Get names of the molecules in the set.
- get_mol_properties()
Get properties of all the molecules in the dataset.
- cluster(n_clusters=8, clustering_method=None, **kwargs)
Cluster the molecules of the MoleculeSet. Implemented methods:
- ‘complete_linkage’, ‘complete’:
Complete linkage agglomerative hierarchical clustering.
- ‘average_linkage’, ‘average’:
Average linkage agglomerative hierarchical clustering.
- ‘single_linkage’, ‘single’:
Single linkage agglomerative hierarchical clustering.
- ‘ward’:
Ward’s algorithm.
- get_cluster_labels()
Get cluster membership of Molecules.
- get_transformed_descriptors(method_='pca', **kwargs)
Use an embedding method to transform molecular descriptors to a low-dimensional representation. Implemented methods are Principal Component Analysis (‘pca’), Multidimensional scaling (‘mds’), t-SNE (‘tsne’), Isomap (‘isomap’) and Spectral Embedding (‘spectral_embedding’).
- __init__(molecule_database_src: str, molecule_database_src_type: str, is_verbose: bool, similarity_measure: str, n_threads=1, fingerprint_type=None, fingerprint_params=None, sampling_ratio=1.0, sampling_random_state=42)
Constructor for the MoleculeSet class.
- Parameters
sampling_ratio (float) – Fraction of the molecules to keep. Useful for selecting a subset of the dataset for quick computations.
sampling_random_state (int) – Random state used for sampling. Default is 42.
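The effect of sampling_ratio together with sampling_random_state can be pictured as a seeded random subsample. The sketch below uses only the standard library; how AIMSim actually draws its sample is not stated here, so the helper `subsample` and its rounding rule are assumptions.

```python
import random


def subsample(molecule_names, sampling_ratio=1.0, sampling_random_state=42):
    """Keep a reproducible random fraction of the input list."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1].")
    if not molecule_names:
        return []
    # Assumed rounding rule: nearest integer, but never fewer than one item.
    n_keep = max(1, round(len(molecule_names) * sampling_ratio))
    rng = random.Random(sampling_random_state)  # seeded, so repeatable
    return rng.sample(molecule_names, n_keep)


names = [f"mol_{i}" for i in range(10)]
subset = subsample(names, sampling_ratio=0.3)
print(subset)
```

Calling `subsample` twice with the same random state returns the same subset, which is the point of exposing the seed as a parameter.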
- cluster(n_clusters=8, clustering_method=None, **kwargs)
Cluster the molecules of the MoleculeSet.
- Parameters
n_clusters (int) – Number of clusters. Default is 8.
clustering_method (str) –
Clustering algorithm to use. Default is None, in which case the algorithm is chosen from the similarity measure in use. Implemented clustering_methods are:
- ‘complete_linkage’, ‘complete’:
Complete linkage agglomerative hierarchical clustering [2].
- ‘average_linkage’, ‘average’:
Average linkage agglomerative hierarchical clustering [2].
- ‘single_linkage’, ‘single’:
Single linkage agglomerative hierarchical clustering [2].
- ‘ward’:
Ward’s algorithm [2]. This method is useful for Euclidean descriptors.
kwargs (keyword args) –
Keyword arguments to supply to the clustering algorithm. See the documentation pages listed below for these arguments: ‘complete_linkage’, ‘average_linkage’, ‘single_linkage’, ‘ward’
- Returns
Dictionary mapping cluster id (key) to the names of the molecules in that cluster.
- Return type
cluster_grouped_mol_names (dict)
References
[1] Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., Springer Series in Statistics (2009).
[2] Murtagh, F. and Contreras, P., Algorithms for hierarchical clustering: an overview. WIREs Data Mining Knowl Discov (2011). https://doi.org/10.1002/widm.53
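To make the linkage options concrete, here is a naive agglomerative clustering sketch on a precomputed distance matrix that returns the same kind of dictionary (cluster id to molecule names). It is written from the textbook definition of complete, single and average linkage, not from AIMSim's code; the function name is hypothetical and Ward's method is not covered.

```python
def agglomerative_cluster(distance, names, n_clusters, linkage="complete"):
    """Naive agglomerative clustering on a precomputed distance matrix.

    Returns {cluster_id: [names]}. O(n^3) and intended for illustration only.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1.")
    reducers = {
        "complete": max,                       # farthest pair of members
        "single": min,                         # closest pair of members
        "average": lambda d: sum(d) / len(d),  # mean over all pairs
    }
    reduce_ = reducers[linkage]
    clusters = [[i] for i in range(len(names))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_dists = [distance[i][j]
                              for i in clusters[a] for j in clusters[b]]
                score = reduce_(pair_dists)
                if best is None or score < best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the two closest clusters
        del clusters[b]
    return {cid: [names[i] for i in members]
            for cid, members in enumerate(clusters)}


distance = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
print(agglomerative_cluster(distance, ["A", "B", "C", "D"], n_clusters=2))
```

The three linkage rules differ only in how the distance between two clusters is reduced from the member-to-member distances, which is why a single reducer lookup is enough here.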
- compare_against_molecule(query_molecule)
Compare a query molecule to all molecules of the set.
- Parameters
query_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.
- Returns
Similarity scores between the query molecule and all other molecules of the molecule set.
- Return type
set_similarity (np.ndarray)
- get_cluster_labels()
Get cluster membership of Molecules.
- Raises
NotInitializedError – If the MoleculeSet object has not been clustered.
- get_distance_matrix()
Get the distance matrix for the data set. This can only be done for similarity measures which yield valid distances.
- Returns
Distance matrix of the dataset. Shape (n_samples, n_samples).
- Return type
(np.ndarray)
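One well-known case of a similarity measure that yields a valid distance is the Tanimoto (Jaccard) coefficient on binary fingerprints: its complement, 1 - s, is the Jaccard distance, which is a proper metric. The sketch below applies that conversion to a small matrix. Which conversion AIMSim uses for each measure is not stated here, so treat this as an assumption; plain lists stand in for NumPy arrays to keep the example dependency-free.

```python
def similarity_to_distance(similarity_matrix):
    """Map Tanimoto similarities s to Jaccard distances d = 1 - s."""
    return [[1.0 - s for s in row] for row in similarity_matrix]


similarity = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.75],
    [0.25, 0.75, 1.0],
]
distance = similarity_to_distance(similarity)
print(distance)  # zero diagonal, symmetric off-diagonal entries
```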
- get_mol_features()
Get features of the molecules in the set.
- Returns
(n_molecules, feature_dimensionality) array.
- Return type
np.ndarray
- get_mol_names()
Get names of the molecules in the set. This is the Molecule.mol_text attribute of the Molecule objects in the MoleculeSet. If this attribute is not present, then a collection of mol_ids in the form “id: ” + str(mol_id) is returned.
- Returns
Array with molecule names.
- Return type
np.ndarray
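The naming fallback described above can be sketched as follows. The helper is hypothetical, and two details are assumptions: that mol_id is the zero-based position in the set, and that the fallback is applied per molecule rather than to the whole collection.

```python
def get_names_with_fallback(mol_texts):
    """Return each text identifier, or "id: <index>" when it is None."""
    return [
        text if text is not None else "id: " + str(mol_id)
        for mol_id, text in enumerate(mol_texts)
    ]


print(get_names_with_fallback(["ethanol", None, "benzene"]))
```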
- get_mol_properties()
Get properties of all the molecules in the dataset. If all molecules don’t have properties, None is returned.
- Returns
Array with molecule properties or None.
- Return type
np.ndarray or None
- get_most_dissimilar_pairs()
Get pairs of samples which are least similar.
- Returns
- List(Tuple(Molecule, Molecule))
List of pairs of Molecules least similar to one another.
- Raises
NotInitializedError – If MoleculeSet object does not have similarity_measure attribute.
- get_most_similar_pairs()
Get pairs of samples which are most similar.
- Returns
- List(Tuple(Molecule, Molecule))
List of pairs of Molecules closest to one another. Since ties are broken randomly, this may be non-symmetric, i.e. (A, B) does not imply (B, A).
- Raises
NotInitializedError – If MoleculeSet object does not have similarity_measure attribute.
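The idea of pairing each molecule with its most similar partner can be sketched on a plain similarity matrix. This is an illustration, not AIMSim's implementation: the function name is hypothetical and ties are resolved deterministically (lowest index wins) rather than randomly. The example also shows that the pairing need not be mutual.

```python
def most_similar_pairs(similarity_matrix, names):
    """Pair every molecule with its nearest neighbour (self excluded)."""
    pairs = []
    for i, row in enumerate(similarity_matrix):
        candidates = [j for j in range(len(row)) if j != i]
        # max() keeps the first maximum, so ties go to the lowest index.
        nearest = max(candidates, key=lambda j: row[j])
        pairs.append((names[i], names[nearest]))
    return pairs


similarity = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.8],
    [0.2, 0.8, 1.0],
]
print(most_similar_pairs(similarity, ["A", "B", "C"]))
```

Here C is paired with B, but B is paired with A, so (C, B) appears in the result while (B, C) does not.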
- get_pairwise_similarities()
Get an array of pairwise similarities of molecules in the set.
- Returns
Array of pairwise similarities of the molecules in the set. Self similarities are not calculated.
- Return type
(np.ndarray)
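A flat array of pairwise similarities without self-similarities can be read off the similarity matrix by visiting each unordered pair once. The sketch below assumes that reading (entries with i < j only); whether AIMSim returns each pair once or twice is not stated here, and the function name is hypothetical.

```python
from itertools import combinations


def pairwise_similarities(similarity_matrix):
    """Collect s[i][j] for every unordered pair i < j (no diagonal)."""
    n_mols = len(similarity_matrix)
    return [similarity_matrix[i][j] for i, j in combinations(range(n_mols), 2)]


similarity = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.8],
    [0.2, 0.8, 1.0],
]
print(pairwise_similarities(similarity))  # n * (n - 1) / 2 = 3 values
```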
- get_property_of_most_dissimilar()
Get property of pairs of molecules which are most dissimilar to each other.
- Returns
The first index is an array of reference mol properties and the second index is an array of the property of the respective most dissimilar molecule. Skips pairs of molecules for which molecule properties are not initialized.
- Return type
(tuple)
- get_property_of_most_similar()
Get property of pairs of molecules which are most similar to each other.
- Returns
The first index is an array of reference mol properties and the second index is an array of the property of the respective most similar molecule. Skips pairs of molecules for which molecule properties are not initialized.
- Return type
(tuple)
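The returned tuple of reference properties and nearest-neighbour properties, including the skipping of uninitialized properties, can be sketched as below. The function name and the deterministic tie-breaking are assumptions, and plain lists are used in place of arrays.

```python
def property_of_most_similar(similarity_matrix, properties):
    """Return (reference properties, properties of nearest neighbours)."""
    reference, neighbour = [], []
    for i, row in enumerate(similarity_matrix):
        candidates = [j for j in range(len(row)) if j != i]
        nearest = max(candidates, key=lambda j: row[j])
        # Skip the pair when either property is uninitialized.
        if properties[i] is None or properties[nearest] is None:
            continue
        reference.append(properties[i])
        neighbour.append(properties[nearest])
    return reference, neighbour


similarity = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.8],
    [0.2, 0.8, 1.0],
]
print(property_of_most_similar(similarity, [78.4, 64.7, None]))
```

The two lists line up element by element, so they can be plotted against each other to see how well similar molecules share similar property values.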
- get_similarity_matrix()
Get the similarity matrix for the data set.
- Returns
Similarity matrix of the dataset. Shape (n_samples, n_samples).
- Return type
(np.ndarray)
Note
If un-set, sets the self.similarity_matrix attribute.
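The note about setting the attribute when it is un-set describes a compute-on-first-use pattern. The sketch below shows that pattern with a small hypothetical class, `TinySet`, which is not part of AIMSim; Tanimoto on binary fingerprints is used as an example measure.

```python
class TinySet:
    """Hypothetical stand-in that caches its similarity matrix on first use."""

    def __init__(self, fingerprints):
        self.fingerprints = fingerprints
        self.similarity_matrix = None  # un-set until first requested

    @staticmethod
    def _tanimoto(a, b):
        both = sum(1 for x, y in zip(a, b) if x and y)
        either = sum(1 for x, y in zip(a, b) if x or y)
        return both / either if either else 1.0

    def get_similarity_matrix(self):
        # Compute once, then reuse the stored matrix on later calls.
        if self.similarity_matrix is None:
            fps = self.fingerprints
            self.similarity_matrix = [
                [self._tanimoto(a, b) for b in fps] for a in fps
            ]
        return self.similarity_matrix


tiny = TinySet([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]])
matrix = tiny.get_similarity_matrix()
print(matrix)
```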
- get_transformed_descriptors(method_='pca', **kwargs)
Use an embedding method to transform molecular descriptor to a low dimensional representation.
- Parameters
method_ (str) – The method used for generating the lower-dimensional embedding. Implemented methods are:
- ‘pca’: Principal Component Analysis [1]
- ‘mds’: Multidimensional scaling [2-4]
- ‘tsne’: t-SNE [5]
- ‘isomap’: Isomap [6]
- ‘spectral_embedding’: Spectral Embedding [7]
kwargs (dict) – Keyword arguments to modify the behaviour of the respective embedding methods. See the documentation pages listed below for these arguments.
- ‘pca’: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- ‘mds’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
- ‘tsne’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
- ‘isomap’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html
- ‘spectral_embedding’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html
- Returns
Transformed descriptors of shape (n_samples, n_components).
- Return type
X (np.ndarray)
- Raises
InvalidConfigurationError – If illegal method_ passed.
References
[1] Bishop, C. M., Pattern recognition and machine learning. 2006.
[2] Borg, I. and P. Groenen, Modern Multidimensional Scaling: Theory and Applications (Springer Series in Statistics). 2005.
[3] Kruskal, J., Nonmetric multidimensional scaling: A numerical method. Psychometrika, 1964. 29(2): p. 115-129.
[4] Kruskal, J., Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 1964. 29: p. 1-27.
[5] van der Maaten, L. and G. Hinton, Visualizing data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.
[6] Tenenbaum, J.B., V. de Silva, and J.C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000. 290(5500): p. 2319-2323.
[7] Ng, A.Y., M.I. Jordan, and Y. Weiss, On Spectral Clustering: Analysis and an algorithm. 2001. MIT Press.
- is_present(target_molecule)
Searches the name of a target molecule in the molecule set to determine if the target molecule is present in the molecule set.
- Parameters
target_molecule (AIMSim.chemical_datastructures.Molecule) – Target molecule to search.
- Returns
True if the molecule is present in the molecule set, False otherwise.
- Return type
(bool)