aimsim.chemical_datastructures package

Submodules

aimsim.chemical_datastructures.molecule module

Abstraction of RDKit molecule with relevant property manipulation methods.

class aimsim.chemical_datastructures.molecule.Molecule(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Bases: object

An abstraction of a molecule

mol_graph

Graph-level information of molecule. Implemented as an RDKit mol object.

Type

RDKit mol object

mol_text

Text identifier of the molecule.

Type

str

mol_property_val

Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc.

Type

float

descriptor

Vector representation of a molecule. Commonly a fingerprint.

Type

Descriptor object

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)

Set the descriptor value either by passing an arbitrary value or by specifying a fingerprint that will be generated.

get_descriptor_val()

Get the descriptor value as a numpy array.

match_fingerprint_from(reference_mol)

Generate the same fingerprint as the reference_mol.

get_similarity_to(target_mol, similarity_measure)

Get the similarity to target_mol using a similarity_measure of choice.

get_name()

Get the mol_text attribute.

get_mol_property_val()

Get mol_property_val attribute.

draw(fpath=None, **kwargs)

Draw the molecule.

is_same(source_molecule, target_molecule)

Static method used to check equivalence of two molecules.

__init__(mol_graph=None, mol_text=None, mol_property_val=None, mol_descriptor_val=None, mol_src=None, mol_smiles=None)

Constructor

Parameters
  • mol_graph (RDKit mol object) – Graph-level information of molecule. Implemented as an RDKit mol object. Default is None.

  • mol_text (str) – Text identifier of the molecule. Default is None. Identifiers can be: 1. Name of the molecule. 2. SMILES string representing the molecule.

  • mol_property_val (float) – Some property associated with the molecule. This is typically the response being studied. E.g. Boiling point, Selectivity etc. Default is None.

  • mol_descriptor_val (numpy ndarray) – Descriptor value for the molecule. Must be numpy array or list. Default is None.

  • mol_src (str) – Source file or SMILES string to load molecule. Acceptable files are a .pdb file, or a .txt file with a SMILES string in the first column, first row and (optionally) a property in the second column, first row. Default is None. If provided, an attempt is made to load mol_graph from it.

  • mol_smiles (str) – SMILES string for molecule. If provided, mol_graph is loaded from it. If mol_text is not set as a keyword argument, this string is used to set it.
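As a rough illustration of the .txt layout described for mol_src, the sketch below splits the first row of such a file into a SMILES string and an optional property value. The helper name parse_mol_txt_line and the whitespace delimiter are assumptions made for this example; the documentation above does not state which delimiter is expected, and this is not AIMSim's own loader.

```python
from typing import Optional, Tuple


def parse_mol_txt_line(line: str) -> Tuple[str, Optional[float]]:
    """Split one row into (smiles, property_value).

    Hypothetical helper: assumes whitespace-separated columns.
    """
    columns = line.split()
    if not columns:
        raise ValueError("Row is empty; expected a SMILES string.")
    smiles = columns[0]
    # The second column is optional and holds the property value.
    property_val = float(columns[1]) if len(columns) > 1 else None
    return smiles, property_val


with_property = parse_mol_txt_line("CCO 78.4")
without_property = parse_mol_txt_line("c1ccccc1")
print(with_property, without_property)
```

A row without a second column yields None for the property, mirroring the "Default is None" behaviour documented for mol_property_val.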

draw(fpath=None, **kwargs)

Draw the molecule graph.

Parameters
  • fpath (str) – Path of file to store image. If None, image is displayed in a Tkinter window. Default is None.

  • kwargs (keyword arguments) – Arguments to modify plot properties.

get_descriptor_val()

Get value of molecule descriptor.

Returns

value(s) of the descriptor.

Return type

np.ndarray

get_mol_property_val()
get_name()
get_similarity_to(target_mol, similarity_measure)

Get a similarity metric to a target molecule.

Parameters
  • target_mol (AIMSim.ops Molecule) – Target molecule. Similarity score is with respect to this molecule.

  • similarity_measure (AIMSim.ops SimilarityMeasure) – Similarity measure used.

Returns

Similarity coefficient by the chosen method.

Return type

similarity_score (float)

Raises

NotInitializedError – If target_mol has an uninitialized descriptor.
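To make the idea of a similarity score between two molecules concrete, the sketch below computes a Tanimoto coefficient between two binary fingerprints written as plain Python lists. Tanimoto is chosen here only as a familiar example; the measures that SimilarityMeasure supports are not listed in this section, and the function name tanimoto_similarity is hypothetical rather than part of AIMSim.

```python
from typing import Sequence


def tanimoto_similarity(fp_a: Sequence[int], fp_b: Sequence[int]) -> float:
    """Tanimoto coefficient between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("Fingerprints must have the same length.")
    # Bits set in both fingerprints.
    both_on = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # Bits set in at least one fingerprint.
    either_on = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention assumed here: two all-zero fingerprints count as identical.
    return both_on / either_on if either_on else 1.0


source_fp = [1, 1, 0, 1, 0, 0]
target_fp = [1, 0, 0, 1, 1, 0]
score = tanimoto_similarity(source_fp, target_fp)
print(score)  # 2 shared bits out of 4 set bits -> 0.5
```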

static is_same(source_molecule, target_molecule)

Check if the target_molecule is a duplicate of source_molecule.

Parameters
  • source_molecule (AIMSim.chemical_datastructures Molecule) – Source molecule to compare.

  • target_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.

Returns

True if the molecules are the same.

Return type

bool

match_fingerprint_from(reference_mol)

If reference_mol.descriptor is a fingerprint, this method tries to calculate the same fingerprint for this molecule. If this fails because a mol_graph attribute is absent, a ValueError is raised.

Parameters
  • reference_mol (AIMSim.ops Molecule) – Target molecule. Fingerprint of this molecule is used as the reference.

Raises

ValueError

set_descriptor(arbitrary_descriptor_val=None, fingerprint_type=None, fingerprint_params=None)

Sets molecular descriptor attribute.

Parameters
  • arbitrary_descriptor_val (np.array or list) – Arbitrary descriptor vector. Default is None.

  • fingerprint_type (str) – String label specifying which fingerprint to use. Default is None.

  • fingerprint_params (dict) – Additional parameters for modifying fingerprint defaults. Default is None.

aimsim.chemical_datastructures.molecule_set module

Abstraction of a data set comprising multiple Molecule objects.

class aimsim.chemical_datastructures.molecule_set.MoleculeSet(molecule_database_src: str, molecule_database_src_type: str, is_verbose: bool, similarity_measure: str, n_threads=1, fingerprint_type=None, fingerprint_params=None, sampling_ratio=1.0, sampling_random_state=42)

Bases: object

An abstraction of a collection of molecules constituting a chemical dataset.

is_verbose

Controls how much information is displayed during plotting.

Type

bool

molecule_database

Collection of Molecule objects.

Type

list

descriptor

Descriptor or fingerprint used to featurize molecules in the molecule set.

Type

Descriptor

similarity_measure

Similarity measure used.

Type

SimilarityMeasure

similarity_matrix

n_mols X n_mols matrix of pairwise similarity scores.

Type

numpy ndarray

sampling_ratio

Fraction of dataset to keep for analysis. Default is 1.

Type

float

n_threads

Number of threads used for analysis. Can be an integer denoting the number of threads or ‘auto’ to heuristically determine if multiprocessing is worthwhile based on a curve fitted to the speedup data in the manuscript SI. Default is 1.

Type

int or str

is_present(target_molecule)

Searches the name of a target molecule in the molecule set to determine if the target molecule is present in the molecule set.

compare_against_molecule(query_molecule)

Compare a query molecule to all molecules of the set.

get_most_similar_pairs()

Get pairs of samples which are most similar.

get_most_dissimilar_pairs()

Get pairs of samples which are least similar.

get_property_of_most_similar()

Get property of pairs of molecules which are most similar to each other.

get_property_of_most_dissimilar()

Get property of pairs of molecules which are most dissimilar to each other.

get_similarity_matrix()

Get the similarity matrix for the data set.

get_distance_matrix()

Get the distance matrix for the data set. This can only be done for similarity measures which yield valid distances.

get_pairwise_similarities()

Get an array of pairwise similarities of molecules in the set.

get_mol_names()

Get names of the molecules in the set.

get_mol_properties()

Get properties of all the molecules in the dataset.

cluster(n_clusters=8, clustering_method=None, **kwargs)

Cluster the molecules of the MoleculeSet. Implemented methods:

‘complete_linkage’, ‘complete’:

Complete linkage agglomerative hierarchical clustering.

‘average_linkage’, ‘average’:

Average linkage agglomerative hierarchical clustering.

‘single_linkage’, ‘single’:

Single linkage agglomerative hierarchical clustering.

‘ward’:

Ward’s algorithm.

get_cluster_labels()

Get cluster membership of Molecules.

get_transformed_descriptors(method_='pca', **kwargs)

Use an embedding method to transform molecular descriptors to a low dimensional representation. Implemented methods are Principal Component Analysis (‘pca’), multidimensional scaling (‘mds’), t-SNE (‘tsne’), Isomap (‘isomap’) and spectral embedding (‘spectral_embedding’).

__init__(molecule_database_src: str, molecule_database_src_type: str, is_verbose: bool, similarity_measure: str, n_threads=1, fingerprint_type=None, fingerprint_params=None, sampling_ratio=1.0, sampling_random_state=42)

Constructor for the MoleculeSet class.

Parameters
  • sampling_ratio (float) – Fraction of the molecules to keep. Useful for selecting a subset of the dataset for quick computations.

  • sampling_random_state (int) – Random state used for sampling. Default is 42.
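The interplay of sampling_ratio and sampling_random_state can be sketched with the standard library alone. The function below keeps a reproducible random fraction of a list of molecule names. The name subsample, the rounding rule (round down, keep at least one item) and the accepted range of the ratio are assumptions made for this illustration, not behaviour confirmed by the documentation above.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], sampling_ratio: float = 1.0,
              sampling_random_state: int = 42) -> List[T]:
    """Keep a reproducible random fraction of items (hypothetical helper)."""
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be in (0, 1].")
    if not items:
        return []
    # Assumed rounding rule: round down, but never drop below one item.
    n_keep = max(1, int(sampling_ratio * len(items)))
    # A seeded generator makes the selection repeatable across runs.
    rng = random.Random(sampling_random_state)
    return rng.sample(list(items), n_keep)


names = ["ethanol", "propanol", "butanol", "benzene", "toluene", "xylene"]
kept = subsample(names, sampling_ratio=0.5)
print(kept)
```

Calling subsample twice with the same random state returns the same subset, which is the point of exposing the seed as a constructor argument.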

cluster(n_clusters=8, clustering_method=None, **kwargs)

Cluster the molecules of the MoleculeSet.

Parameters
  • n_clusters (int) – Number of clusters. Default is 8.

  • clustering_method (str) –

    Clustering algorithm to use. Default is None, in which case the algorithm is chosen from the similarity measure in use. Implemented clustering_methods are:

    ‘complete_linkage’, ‘complete’:

    Complete linkage agglomerative hierarchical clustering [2].

    ‘average_linkage’, ‘average’:

    Average linkage agglomerative hierarchical clustering [2].

    ‘single_linkage’, ‘single’:

    Single linkage agglomerative hierarchical clustering [2].

    ‘ward’:

    Ward’s algorithm [2]. This method is useful for Euclidean descriptors.

  • kwargs (keyword args) –

    Keyword arguments to supply to the clustering algorithm. See the documentation pages listed below for these arguments: ‘complete_linkage’, ‘average_linkage’, ‘single_linkage’, ‘ward’

Returns

Dictionary of cluster id (key) -> names of molecules in cluster.

Return type

cluster_grouped_mol_names (dict)

References

[1] Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Ed., Springer Series in Statistics (2009).

[2] Murtagh, F. and Contreras, P., Algorithms for hierarchical clustering: an overview. WIREs Data Mining Knowl Discov (2011). https://doi.org/10.1002/widm.53
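For readers unfamiliar with the ‘complete_linkage’ option, the following is a naive pure-Python sketch of complete-linkage agglomerative clustering on a precomputed distance matrix. It returns a dictionary of cluster id to molecule names, in the same spirit as the return value documented for cluster. The function name, the toy distance values and the cluster numbering are assumptions for illustration only. This is not AIMSim's implementation, and library routines are far more efficient than this cubic-time loop.

```python
from typing import Dict, List, Sequence


def complete_linkage(distances: Sequence[Sequence[float]],
                     names: Sequence[str],
                     n_clusters: int) -> Dict[int, List[str]]:
    """Naive complete-linkage clustering on a symmetric distance matrix."""
    if not 1 <= n_clusters <= len(names):
        raise ValueError("n_clusters must be between 1 and len(names).")
    # Start with every molecule in its own cluster (lists of row indices).
    clusters: List[List[int]] = [[i] for i in range(len(names))]

    def linkage(left: List[int], right: List[int]) -> float:
        # Complete linkage: distance between the two farthest members.
        return max(distances[i][j] for i in left for j in right)

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        _, p, q = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)

    return {cluster_id: sorted(names[i] for i in members)
            for cluster_id, members in enumerate(clusters)}


names = ["ethanol", "propanol", "benzene", "toluene"]
distances = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
]
grouped = complete_linkage(distances, names, n_clusters=2)
print(grouped)
```

With these toy distances the two alcohols end up in one cluster and the two aromatics in the other.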

compare_against_molecule(query_molecule)

Compare a query molecule to all molecules of the set.

Parameters

query_molecule (AIMSim.chemical_datastructures Molecule) – Target molecule to compare.

Returns

Similarity scores between query molecule and all other molecules of the molecule set.

Return type

set_similarity (np.ndarray)

get_cluster_labels()

Get cluster membership of Molecules.

Raises

NotInitializedError – If MoleculeSet object is not clustered.

get_distance_matrix()

Get the distance matrix for the data set. This can only be done for similarity measures which yield valid distances.

Returns

Distance matrix of the dataset.

Shape (n_samples, n_samples).

Return type

(np.ndarray)
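One common way a similarity measure yields a valid distance is the complement d = 1 - s, which for the Tanimoto coefficient on binary fingerprints gives the Jaccard distance, a proper metric. The sketch below applies that conversion to a small similarity matrix stored as nested lists. The function name and the choice of conversion are assumptions for illustration; the conversion AIMSim applies for each similarity measure is not described in this section.

```python
from typing import List, Sequence


def similarity_to_distance(
        similarity: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert similarities in [0, 1] to distances via d = 1 - s."""
    return [[1.0 - s for s in row] for row in similarity]


similarity_matrix = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.0],
    [0.25, 0.0, 1.0],
]
distance_matrix = similarity_to_distance(similarity_matrix)
print(distance_matrix)
```

The result keeps the (n_samples, n_samples) shape, has zeros on the diagonal, and stays symmetric whenever the input is symmetric.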

get_mol_features()

Get features of the molecules in the set.

Returns

(n_molecules, feature_dimensionality) array.

Return type

np.ndarray

get_mol_names()

Get names of the molecules in the set. This is the Molecule.mol_text attribute of the Molecule objects in the MoleculeSet. If this attribute is not present, then collection of mol_ids in the form “id: ” + str(mol_id) is returned.

Returns

Array with molecules names.

Return type

np.ndarray
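The naming fallback described above can be sketched as follows. The helper resolve_mol_names is hypothetical, and it assumes the fallback is applied per molecule and that mol_id is the zero-based position of the molecule in the set; neither detail is confirmed by the text above. It also returns a plain list, whereas the documented method returns an np.ndarray.

```python
from typing import List, Optional, Sequence


def resolve_mol_names(mol_texts: Sequence[Optional[str]]) -> List[str]:
    """Use mol_text when available, else fall back to "id: <index>"."""
    return [
        text if text is not None else "id: " + str(mol_id)
        for mol_id, text in enumerate(mol_texts)
    ]


resolved = resolve_mol_names(["ethanol", None, "benzene"])
print(resolved)
```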

get_mol_properties()

Get properties of all the molecules in the dataset.

If all molecules don’t have properties, None is returned.

Returns

Array with molecules properties or None.

Return type

np.ndarray or None

get_most_dissimilar_pairs()

Get pairs of samples which are least similar.

Returns

List(Tuple(Molecule, Molecule))

List of pairs of Molecules least similar to one another.

Raises

NotInitializedError – If MoleculeSet object does not have similarity_measure attribute.

get_most_similar_pairs()

Get pairs of samples which are most similar.

Returns

List(Tuple(Molecule, Molecule))

List of pairs of Molecules closest to one another. Since ties are broken randomly, this may be non-symmetric, i.e. (A, B) =/=> (B, A).

Raises

NotInitializedError – If MoleculeSet object does not have similarity_measure attribute.
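A minimal sketch of how most-similar pairs can be read off a similarity matrix is given below. For each row it picks the column with the highest similarity, skipping the diagonal. The helper returns index pairs rather than Molecule objects and breaks ties by taking the lowest index; both are simplifications assumed for this example, whereas the documented method breaks ties randomly.

```python
from typing import List, Sequence, Tuple


def most_similar_pairs(
        similarity: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """For each molecule index, pair it with its most similar neighbour."""
    if len(similarity) < 2:
        raise ValueError("At least two molecules are required.")
    pairs = []
    for i, row in enumerate(similarity):
        # Exclude the self-similarity on the diagonal.
        candidates = [j for j in range(len(row)) if j != i]
        # max() returns the first maximum, so ties go to the lowest index.
        best = max(candidates, key=lambda j: row[j])
        pairs.append((i, best))
    return pairs


similarity_matrix = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.9],
    [0.3, 0.9, 1.0],
]
pairs = most_similar_pairs(similarity_matrix)
print(pairs)
```

In this toy matrix molecule 0 is paired with molecule 1, but molecule 1 is paired with molecule 2, so the presence of (A, B) in the result does not guarantee (B, A).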

get_pairwise_similarities()

Get an array of pairwise similarities of molecules in the set.

Returns

Array of pairwise similarities of the molecules in the set. Self similarities are not calculated.

Return type

(np.ndarray)
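One way to picture this array is as the off-diagonal entries of the similarity matrix. The sketch below collects the upper triangle of a symmetric matrix, so each unordered pair appears once and self similarities are left out. Counting each pair once is an assumption of this example; the text above only states that self similarities are excluded.

```python
from itertools import combinations
from typing import List, Sequence


def pairwise_similarities(
        similarity: Sequence[Sequence[float]]) -> List[float]:
    """Collect similarity[i][j] for every pair i < j (no diagonal)."""
    n = len(similarity)
    return [similarity[i][j] for i, j in combinations(range(n), 2)]


similarity_matrix = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.9],
    [0.3, 0.9, 1.0],
]
values = pairwise_similarities(similarity_matrix)
print(values)  # pairs (0, 1), (0, 2), (1, 2)
```

For n molecules this yields n * (n - 1) / 2 values.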

get_property_of_most_dissimilar()

Get property of pairs of molecules which are most dissimilar to each other.

Returns

The first index is an array of reference mol properties and the second index is an array of the property of the respective most dissimilar molecule. Skips pairs of molecules for which molecule properties are not initialized.

Return type

(tuple)

get_property_of_most_similar()

Get property of pairs of molecules which are most similar to each other.

Returns

The first index is an array of reference mol properties and the second index is an array of the property of the respective most similar molecule. Skips pairs of molecules for which molecule properties are not initialized.

Return type

(tuple)
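The shape of this return value, and the skipping of uninitialized properties, can be sketched as below. Given each molecule's property and the index of its most similar partner, the helper builds two aligned lists. The name property_pairs and the partner_index input are hypothetical, and plain lists stand in for the arrays the documented method returns.

```python
from typing import List, Optional, Sequence, Tuple


def property_pairs(
        properties: Sequence[Optional[float]],
        partner_index: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Pair each molecule's property with that of its partner molecule."""
    reference: List[float] = []
    partner: List[float] = []
    for i, j in enumerate(partner_index):
        # Skip the pair if either property is not initialized.
        if properties[i] is None or properties[j] is None:
            continue
        reference.append(properties[i])
        partner.append(properties[j])
    return reference, partner


properties = [78.4, None, 80.1, 110.6]
partner_index = [2, 0, 3, 2]  # hypothetical most-similar partner per molecule
reference_props, partner_props = property_pairs(properties, partner_index)
print(reference_props, partner_props)
```

The two lists always have equal length, so they can be plotted against each other directly.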

get_similarity_matrix()

Get the similarity matrix for the data set.

Returns

Similarity matrix of the dataset.

Shape (n_samples, n_samples).

Return type

(np.ndarray)

Note

If un-set, sets the self.similarity_matrix attribute.

get_transformed_descriptors(method_='pca', **kwargs)

Use an embedding method to transform molecular descriptor to a low dimensional representation.

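Parameters
  • method_ (str) – Embedding method to use. Implemented methods are Principal Component Analysis (‘pca’), multidimensional scaling (‘mds’), t-SNE (‘tsne’), Isomap (‘isomap’) and spectral embedding (‘spectral_embedding’). Default is ‘pca’.

  • kwargs (keyword args) – Keyword arguments supplied to the embedding method.

Returns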

Transformed descriptors of shape (n_samples, n_components).

Return type

X (np.ndarray)

Raises

InvalidConfigurationError – If illegal method_ passed.

References

[1] Bishop, C. M., Pattern recognition and machine learning. 2006.

[2] Borg, I. and P. Groenen, Modern Multidimensional Scaling: Theory and Applications (Springer Series in Statistics). 2005.

[3] Kruskal, J., Nonmetric multidimensional scaling: A numerical method. Psychometrika, 1964. 29(2): p. 115-129.

[4] Kruskal, J., Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 1964. 29: p. 1-27.

[5] van der Maaten, L. and G. Hinton, Visualizing data using t-SNE. Journal of Machine Learning Research, 2008. 9: p. 2579-2605.

[6] Tenenbaum, J.B., V.d. Silva, and J.C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000. 290(5500): p. 2319-2323.

[7] Ng, A.Y., M.I. Jordan, and Y. Weiss. On Spectral Clustering: Analysis and an algorithm. 2001. MIT Press.

is_present(target_molecule)

Searches the name of a target molecule in the molecule set to determine if the target molecule is present in the molecule set.

Parameters

target_molecule (AIMSim.chemical_datastructures.Molecule) – Target molecule to search.

Returns

If the molecule is present in the molecule set or not.

Return type

(bool)

Module contents