aimsim.chemical_datastructures package

Submodules

aimsim.chemical_datastructures.molecule module

Abstraction of RDKit molecule with relevant property manipulation methods.

class aimsim.chemical_datastructures.molecule.Molecule(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Bases: object

An abstraction of a molecule

mol_graph

Graph-level information of the molecule. Implemented as an RDKit mol object.

Type: RDKit mol object

mol_text

Text identifier of the molecule.

Type: str

mol_property_val

Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc.

Type: float

descriptor

Vector representation of a molecule. Commonly a fingerprint.

Type: Descriptor object

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None): Set the descriptor value either by passing an arbitrary value or by specifying a fingerprint that will be generated.

get_descriptor_val(): Get the descriptor value as an numpy array.

match_fingerprint_from(reference_mol): Generate the same fingerprint as the reference_mol.

get_similarity_to(target_mol, similarity_measure): Get the similarity to target_mol using a similarity_measure of choice.

get_name(): Get the mol_text attribute.

get_mol_property_val(): Get mol_property_val attribute.

draw(fpath=None, **kwargs): Draw the molecule.

is_same(source_molecule, target_molecule): Static method used to check equivalence of two molecules.

__init__(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Constructor

Parameters

mol_graph (RDKit mol object) – Graph-level information of the molecule. Implemented as an RDKit mol object. Default is None.
mol_text (str) – Text identifier of the molecule. Default is None. Identifiers can be: 1. Name of the molecule. 2. SMILES string representing the molecule.
mol_property_val (float) – Some property associated with the molecule. This is typically the response being studied, e.g. boiling point, selectivity, etc. Default is None.
mol_descriptor_val (numpy ndarray) – Descriptor value for the molecule. Must be a numpy array or list. Default is None.
mol_src (str) – Source file or SMILES string to load the molecule from. Acceptable files: a .pdb file, or a .txt file with the SMILES string in the first column, first row and (optionally) the property in the second column, first row. Default is None. If provided, an attempt is made to load mol_graph from it.
mol_smiles (str) – SMILES string for the molecule. If provided, mol_graph is loaded from it. If mol_text is not set as a keyword argument, this string is used to set it.
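
As an illustration of the .txt layout described for mol_src, the following minimal sketch parses one such row. The helper name parse_smiles_row and the whitespace delimiter are assumptions made for this example; they are not part of AIMSim, and the library's own loader may behave differently.

```python
from typing import Optional, Tuple


def parse_smiles_row(row: str) -> Tuple[str, Optional[float]]:
    """Split a row into a SMILES string and an optional property value.

    Hypothetical helper: assumes whitespace-separated columns.
    """
    columns = row.split()
    if not columns:
        raise ValueError("row is empty")
    smiles = columns[0]
    # The second column is optional; a missing value becomes None.
    property_val = float(columns[1]) if len(columns) > 1 else None
    return smiles, property_val


print(parse_smiles_row("CCO 78.4"))
print(parse_smiles_row("c1ccccc1"))
```

A row without a second column yields None for the property, mirroring the "Default is None" behaviour documented for mol_property_val.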

draw(fpath=None, **kwargs)

Draw the molecule graph.

Parameters

fpath (str) – Path of the file in which to store the image. If None, the image is displayed in a Tkinter window. Default is None.
kwargs (keyword arguments) – Arguments to modify plot properties.

get_descriptor_val()

Get value of molecule descriptor.

Returns: value(s) of the descriptor.
Return type: np.ndarray

get_mol_property_val()

get_name()

get_similarity_to(target_mol, similarity_measure)

Get a similarity metric to a target molecule.

Parameters

target_mol (AIMSim.ops Molecule) – Target molecule. The similarity score is computed with respect to this molecule.
similarity_measure (AIMSim.ops SimilarityMeasure) – Similarity measure used.

Returns

Similarity coefficient by the chosen method.

Return type

similarity_score (float)

Raises

NotInitializedError – If target_mol has an uninitialized descriptor. See note.
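
To make the idea of a similarity coefficient between two descriptors concrete, here is a minimal sketch of the Tanimoto coefficient computed on fingerprints represented as sets of on-bit indices. Tanimoto is only one commonly used measure and is chosen here as an assumption; which measures AIMSim supports is determined by the SimilarityMeasure object, and the function name below is hypothetical.

```python
def tanimoto(fingerprint_a: set, fingerprint_b: set) -> float:
    """Tanimoto coefficient: |A and B| / |A or B| on sets of on-bit indices."""
    union = fingerprint_a | fingerprint_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(fingerprint_a & fingerprint_b) / len(union)


# Two fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```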

static is_same(source_molecule, target_molecule)

Check if the target_molecule is a duplicate of source_molecule.

Parameters

source_molecule (AIMSim.chemical_datastructures Molecule) – Source molecule to compare.
target_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.

Returns

True if the molecules are the same.

Return type

bool

match_fingerprint_from(reference_mol)

If reference_mol.descriptor is a fingerprint, this method tries to calculate the same fingerprint for this molecule (self). If this fails because the mol_graph attribute is absent, a ValueError is raised.

Parameters

reference_mol (AIMSim.ops Molecule) – Target molecule. The fingerprint of this molecule is used as the reference.

Raises

ValueError –

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)

Sets molecular descriptor attribute.

Parameters

arbitrary_descriptor_val (np.array or list) – Arbitrary descriptor vector. Default is None.
fingerprint_type (str) – String label specifying which fingerprint to use. Default is None.
fingerprint_params (dict) – Additional parameters for modifying fingerprint defaults. Default is None.

aimsim.chemical_datastructures.molecule_set module

Abstraction of a data set comprising multiple Molecule objects.

class aimsim.chemical_datastructures.molecule_set.MoleculeSet(molecule_database_src: str, molecule_database_src_type: str, is_verbose: bool, similarity_measure: str, n_threads=1, fingerprint_type=None, fingerprint_params=None, sampling_ratio=1.0, sampling_random_state=42)

Bases: object

An abstraction of a collection of molecules constituting a chemical dataset.

is_verbose

Controls how much information is displayed during plotting.

Type: bool

molecule_database

Collection of Molecule objects.

Type: list

descriptor

Descriptor or fingerprint used to featurize molecules in the molecule set.

Type: Descriptor

similarity_measure

Similarity measure used.

Type: SimilarityMeasure

similarity_matrix

n_mols X n_mols matrix of pairwise similarity scores.

Type: numpy ndarray

sampling_ratio

Fraction of dataset to keep for analysis. Default is 1.

Type: float

n_threads

Number of threads used for analysis. Can be an integer denoting the number of threads or ‘auto’ to heuristically determine if multiprocessing is worthwhile, based on a curve fitted to the speedup data in the manuscript SI. Default is 1.

Type: int or str

is_present(target_molecule): Searches the name of a target molecule in the molecule set to determine if the target molecule is present in the molecule set.

compare_against_molecule(query_molecule): Compare the a query molecule to all molecules of the set.

get_most_similar_pairs(): Get pairs of samples which are most similar.

get_most_dissimilar_pairs(): Get pairs of samples which are least similar.

get_property_of_most_similar(): Get property of pairs of molecules which are most similar to each other.

get_property_of_most_dissimilar(): Get property of pairs of molecule which are most dissimilar to each other.

get_similarity_matrix(): Get the similarity matrix for the data set.

get_distance_matrix(): Get the distance matrix for the data set. This is can only be done for similarity measures which yields valid distances.

get_pairwise_similarities(): Get an array of pairwise similarities of molecules in the set.

get_mol_names(): Get names of the molecules in the set.

get_mol_properties(): Get properties of all the molecules in the dataset.

cluster(n_clusters=8, clustering_method=None, **kwargs)

Cluster the molecules of the MoleculeSet. Implemented methods:

‘complete_linkage’, ‘complete’:
Complete linkage agglomerative hierarchical clustering.

‘average_linkage’, ‘average’:
Average linkage agglomerative hierarchical clustering.

‘single_linkage’, ‘single’:
Single linkage agglomerative hierarchical clustering.

‘ward’:
Ward’s algorithm.

get_cluster_labels(): Get cluster membership of Molecules.

get_transformed_descriptors(method_='pca', **kwargs): Use an embedding method to transform molecular descriptor to a low dimensional representation. Implemented methods are Principal Component Analysis (‘pca’), Multidimensional scaling (‘mds’), t-SNE (‘tsne’), Isomap (‘isomap’), Spectral Embedding (‘spectral_embedding’)

__init__(molecule_database_src: str, molecule_database_src_type: str, is_verbose: bool, similarity_measure: str, n_threads=1, fingerprint_type=None, fingerprint_params=None, sampling_ratio=1.0, sampling_random_state=42)

Constructor for the MoleculeSet class.

Parameters

sampling_ratio (float) – Fraction of the molecules to keep. Useful for selecting a subset of the dataset for quick computations.
sampling_random_state (int) – Random state used for sampling. Default is 42.
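
The effect of sampling_ratio and sampling_random_state can be sketched with the standard library. This is an illustration under assumptions, not AIMSim's implementation: the function name sample_molecules and the rounding rule (truncate, keep at least one item) are hypothetical.

```python
import random


def sample_molecules(molecules, sampling_ratio=1.0, sampling_random_state=42):
    """Keep a reproducible random fraction of a list of molecules."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1]")
    if not molecules:
        return []
    # Assumed rounding rule: truncate, but never drop to zero molecules.
    n_keep = max(1, int(sampling_ratio * len(molecules)))
    # A seeded generator makes the subset identical across runs.
    rng = random.Random(sampling_random_state)
    return rng.sample(molecules, n_keep)


names = ["mol_" + str(i) for i in range(10)]
subset = sample_molecules(names, sampling_ratio=0.5)
print(len(subset))
```

Fixing the random state means repeated runs analyse the same subset, which keeps results comparable between quick exploratory computations.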

cluster(n_clusters=8, clustering_method=None, **kwargs)

Cluster the molecules of the MoleculeSet.

Parameters

n_clusters (int) – Number of clusters. Default is 8.
clustering_method (str) – Clustering algorithm to use. Default is None, in which case the algorithm is chosen from the similarity measure in use. Implemented clustering_methods are:

‘complete_linkage’, ‘complete’:
Complete linkage agglomerative hierarchical clustering [2].

‘average_linkage’, ‘average’:
Average linkage agglomerative hierarchical clustering [2].

‘single_linkage’, ‘single’:
Single linkage agglomerative hierarchical clustering [2].

‘ward’:
Ward’s algorithm [2]. This method is useful for Euclidean descriptors.
kwargs (keyword args) – Keyword arguments to supply to the clustering algorithm. See the documentation page listed below for these arguments.
‘complete_linkage’, ‘average_linkage’, ‘single_linkage’, ‘ward’: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

Returns

Dictionary of cluster id (key) –> names of molecules in cluster.

Return type

cluster_grouped_mol_names (dict)

References

[1] Hastie, T., Tibshirani R. and Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., Springer Series in Statistics (2009).
[2] Murtagh, F. and Contreras, P., Algorithms for hierarchical clustering: an overview. WIREs Data Mining Knowl Discov (2011). https://doi.org/10.1002/widm.53
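
The complete linkage criterion listed above can be illustrated with a small pure-Python sketch that works on a precomputed distance matrix and returns a dictionary of cluster id to molecule names, the same shape as the documented return value. The documentation points to scikit-learn's AgglomerativeClustering for the actual options; the function below is a hypothetical, unoptimized stand-in written only to show the merging rule.

```python
def complete_linkage(distance_matrix, names, n_clusters):
    """Agglomerative clustering with the complete linkage criterion."""
    # Start with every molecule in its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # pairwise distance between their members.
                d = max(distance_matrix[i][j]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return {cid: [names[i] for i in members]
            for cid, members in enumerate(clusters)}


# Four points on a line at positions 0, 1, 10 and 11.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
print(complete_linkage(dist, ["A", "B", "C", "D"], n_clusters=2))
```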

compare_against_molecule(query_molecule)

Compare a query molecule to all molecules of the set.

Parameters

query_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.

Returns

Similarity scores between the query molecule and all other molecules of the molecule set.

Return type

set_similarity (np.ndarray)

get_cluster_labels(): Get cluster membership of Molecules. :raises NotInitializedError: If MoleculeSet object not clustered.

get_distance_matrix()

Get the distance matrix for the data set. This can only be done for similarity measures which yield valid distances.

Returns

Distance matrix of the dataset. Shape (n_samples, n_samples).

Return type

(np.ndarray)
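
As a rough illustration of how a distance matrix can relate to a similarity matrix, the sketch below uses the conversion distance = 1 - similarity for similarities in [0, 1]. This conversion is an assumption made for the example; the documentation does not state which conversion AIMSim applies for each similarity measure, and not every measure yields a valid distance.

```python
def similarity_to_distance(similarity_matrix):
    """Convert similarities in [0, 1] to distances via d = 1 - s (assumed rule)."""
    return [[1.0 - s for s in row] for row in similarity_matrix]


similarities = [[1.0, 0.25],
                [0.25, 1.0]]
print(similarity_to_distance(similarities))
```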

get_mol_features()

Get features of the molecules in the set.

Returns: (n_molecules, feature_dimensionality) array.
Return type: np.ndarray

get_mol_names()

Get names of the molecules in the set. This is the Molecule.mol_text attribute of the Molecule objects in the MoleculeSet. If this attribute is not present, then a collection of mol_ids in the form “id: ” + str(mol_id) is returned.

Returns: Array with molecule names.
Return type: np.ndarray

get_mol_properties()

Get properties of all the molecules in the dataset. If all molecules don’t have properties, None is returned.

Returns: Array with molecule properties or None.
Return type: np.ndarray or None

get_most_dissimilar_pairs()

Get pairs of samples which are least similar.

Returns

List(Tuple(Molecule, Molecule)): List of pairs of Molecules least similar to one another.

Raises

NotInitializedError – If MoleculeSet object does not have similarity_measure attribute.

get_most_similar_pairs()

Get pairs of samples which are most similar.

Returns

List(Tuple(Molecule, Molecule)): List of pairs of Molecules closest to one another. Since ties are broken randomly, this may be asymmetric, i.e. (A, B) does not imply (B, A).

Raises

NotInitializedError – If MoleculeSet object does not have similarity_measure attribute.
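
The pairing described above can be sketched from a similarity matrix: for each molecule, pick the other molecule with the highest similarity. The sketch returns index pairs rather than Molecule objects and breaks ties by lowest index instead of randomly; both are simplifications assumed for this example, and the function name is hypothetical.

```python
def most_similar_pairs(similarity_matrix):
    """For each row i, pair i with its most similar other row (needs n >= 2)."""
    n = len(similarity_matrix)
    pairs = []
    for i in range(n):
        # Exclude the self similarity on the diagonal.
        candidates = [j for j in range(n) if j != i]
        # max() keeps the first maximal candidate, so ties go to the lowest index.
        best = max(candidates, key=lambda j: similarity_matrix[i][j])
        pairs.append((i, best))
    return pairs


similarities = [[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.8],
                [0.2, 0.8, 1.0]]
print(most_similar_pairs(similarities))
```

In this example molecule 2 is paired with molecule 1, but molecule 1 is paired with molecule 0, which shows why the list of pairs need not be symmetric even without ties.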

get_pairwise_similarities()

Get an array of pairwise similarities of molecules in the set.

Returns: Array of pairwise similarities of the molecules in the set. Self similarities are not calculated.
Return type: (np.ndarray)
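
One way to obtain pairwise similarities without self similarities is to read the strict upper triangle of the similarity matrix, so each unordered pair appears once. Whether AIMSim returns each pair once or twice is not stated here, so the layout below is an assumption, as is the function name.

```python
def pairwise_similarities(similarity_matrix):
    """Flatten the strict upper triangle (no diagonal) of a square matrix."""
    n = len(similarity_matrix)
    return [similarity_matrix[i][j]
            for i in range(n)
            for j in range(i + 1, n)]


similarities = [[1.0, 0.9, 0.2],
                [0.9, 1.0, 0.8],
                [0.2, 0.8, 1.0]]
print(pairwise_similarities(similarities))
```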

get_property_of_most_dissimilar()

Get property of pairs of molecules which are most dissimilar to each other.

Returns: The first index is an array of reference mol properties and the second index is an array of the property of the respective most dissimilar molecule. Skips pairs of molecules for which molecule properties are not initialized.
Return type: (tuple)

get_property_of_most_similar()

Get property of pairs of molecules which are most similar to each other.

Returns: The first index is an array of reference mol properties and the second index is an array of the property of the respective most similar molecule. Skips pairs of molecules for which molecule properties are not initialized.
Return type: (tuple)

get_similarity_matrix()

Get the similarity matrix for the data set.

Returns

Similarity matrix of the dataset. Shape (n_samples, n_samples).

Return type

(np.ndarray)

Note

If un-set, sets the self.similarity_matrix attribute.
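
The behaviour in the note (compute the matrix on first request, then reuse the stored attribute) can be sketched as follows. TinyMoleculeSet is a hypothetical stand-in and not the AIMSim class; fingerprints are modelled as non-empty sets of on-bit indices and Tanimoto similarity is assumed purely for illustration.

```python
def _tanimoto(a, b):
    # Assumes non-empty sets of on-bit indices.
    return len(a & b) / len(a | b)


class TinyMoleculeSet:
    """Hypothetical stand-in showing lazy creation of a similarity matrix."""

    def __init__(self, fingerprints):
        self.fingerprints = fingerprints
        self.similarity_matrix = None  # un-set until first requested

    def get_similarity_matrix(self):
        if self.similarity_matrix is None:
            # n_mols x n_mols matrix of pairwise similarity scores.
            self.similarity_matrix = [
                [_tanimoto(a, b) for b in self.fingerprints]
                for a in self.fingerprints
            ]
        return self.similarity_matrix


mol_set = TinyMoleculeSet([{1, 2, 3}, {2, 3, 4}])
print(mol_set.get_similarity_matrix())
```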

get_transformed_descriptors(method_='pca', **kwargs)

Use an embedding method to transform molecular descriptor to a low dimensional representation.

Parameters

method_ (str) – The method used for generating the lower dimensional embedding. Implemented methods are:
‘pca’: Principal Component Analysis [1]
‘mds’: Multidimensional scaling [2-4]
‘tsne’: t-SNE [5]
‘isomap’: Isomap [6]
‘spectral_embedding’: Spectral Embedding [7]
kwargs (dict) – Keyword arguments to modify the behaviour of the respective embedding methods. See the documentation pages listed below for these arguments.
‘pca’: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
‘mds’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
‘tsne’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
‘isomap’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html
‘spectral_embedding’: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html

Returns

Transformed descriptors of shape (n_samples, n_components).

Return type

X (np.ndarray)

Raises

InvalidConfigurationError – If an illegal method_ is passed.

References

[1] Bishop, C. M., Pattern recognition and machine learning. 2006.
[2] Borg, I. and P. Groenen, Modern Multidimensional Scaling: Theory and Applications (Springer Series in Statistics). 2005.
[3] Kruskal, J., Nonmetric multidimensional scaling: A numerical method. Psychometrika, 1964. 29(2): p. 115-129.
[4] Kruskal, J., Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 1964. 29: p. 1-27.
[5] van der Maaten, L. and G. Hinton, Visualizing data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.
[6] Tenenbaum, J.B., V.d. Silva, and J.C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000. 290(5500): p. 2319-2323.
[7] Ng, A.Y., M.I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. 2001. MIT Press.
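
To show what a low dimensional embedding of descriptors looks like, the sketch below computes only the first principal component with power iteration in pure Python. It is an illustration under assumptions, not the scikit-learn PCA that the linked documentation describes: the function name is hypothetical, only one component is returned, and the sign of the component may differ from other implementations.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Centre each descriptor dimension on its mean.
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix of the descriptors (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    # Power iteration; fails only if the start vector is orthogonal
    # to the leading eigenvector.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Project every centred row onto the component: shape (n_samples,).
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores


descriptors = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(descriptors)
print(direction, scores)
```

For descriptors that lie on the line y = 2x, the recovered direction is proportional to (1, 2) and the scores preserve the ordering of the samples along that line.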

is_present(target_molecule)

Searches the name of a target molecule in the molecule set to determine if the target molecule is present in the molecule set.

Parameters: target_molecule (AIMSim.chemical_datastructures.Molecule) – Target molecule to search.
Returns: If the molecule is present in the molecule set or not.
Return type: (bool)
