Frequently Asked Questions





1.


SuperPred2 offers a knowledge-based method for ATC code and target prediction for your compounds, based on a 2D, fragment and 3D similarity search pipeline.

If you use our database, please cite us. Thanks.



2.


We recommend a recent version of Mozilla Firefox, Microsoft Internet Explorer or Google Chrome. For the ChemDoodle web component, Internet Explorer 9 or newer is recommended. If you have an older version of Internet Explorer, Google Chrome Frame is required for ChemDoodle to be displayed correctly. In addition, JavaScript has to be enabled. ChemDoodle is used on the "Classify Molecule" page.



3.


3.1
The Anatomical Therapeutic Chemical (ATC) classification system is used for the classification of drugs. It is published by the World Health Organization (WHO). The classification into groups is based on therapeutic and chemical characteristics of the drugs. Each ATC code is divided into 5 levels:
  • 1. level: Anatomical main group
  • 2. level: Therapeutic main group
  • 3. level: Therapeutic/pharmacological subgroup
  • 4. level: Chemical/therapeutic/pharmacological subgroup
  • 5. level: Chemical substance
Substances or combinations of substances in the 5th level refer to a single indication. Drugs having more than one indication belong to more than one ATC code. Aspirin, for example, has 3 ATC codes assigned.
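As an illustration of the five levels, the following minimal Python sketch splits a full ATC code into its hierarchical prefixes. The function name `atc_levels` is hypothetical and not part of SuperPred; the assumed code layout (one letter, two digits, two letters, two digits) follows the WHO scheme.

```python
import re

# Layout of a full (5th-level) ATC code: letter, 2 digits, 2 letters, 2 digits.
ATC_PATTERN = re.compile(r"^[A-Z]\d{2}[A-Z]{2}\d{2}$")

# Number of leading characters that identify each of the 5 levels.
LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code):
    """Return the five hierarchical prefixes of a full ATC code."""
    code = code.strip().upper()
    if not ATC_PATTERN.match(code):
        raise ValueError(f"not a full 5th-level ATC code: {code!r}")
    return [code[:n] for n in LEVEL_LENGTHS]


# Acetylsalicylic acid (aspirin) in its analgesic classification.
print(atc_levels("N02BA01"))
# ['N', 'N02', 'N02B', 'N02BA', 'N02BA01']
```

The same substance also appears under A01AD05 and B01AC06, which is why a drug with several indications ends up in several branches of the tree.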



3.2

The score is based on the output of the similarity pipeline. The similarity pipeline compares structural fingerprints as well as 3D structures using a superposition algorithm. JChem's Extended-Connectivity Fingerprints (ECFP4) are applied for 2D and fragment similarity. They are circular topological fingerprints and consist of 1024 bits. Their comparison is calculated by the Tanimoto coefficient. A superposition algorithm is used for the 3D similarity search. Here, the similarity of compounds is represented by RMSD values.

3.3

Molecular fingerprints represent certain structural features of a molecule. Fingerprints are primarily used for two processes: similarity calculation and screening. Similarity calculation quantifies the similarity of two molecules, whereas screening is a way of eliminating molecules as candidates in a substructure search. The fingerprint algorithm examines the molecule and generates patterns from its atoms. The output for each pattern is a string of bits, which is added to the fingerprint.

3.4

For similarity screening of a compound against the SuperPred database, the fingerprints of both molecules are used. Fragments of the molecules are assigned to set bits in the 1024-bit ECFP4 fingerprint vector. To compare the similarity between the compounds, the Tanimoto coefficient is applied.



Tanimoto coefficient = AB / (A + B - AB)

The Tanimoto coefficient uses the bits set to one in both fingerprints. AB is the number of bits set to one in both molecules, A is the number of bits set to one in molecule A and B is the number of bits set to one in molecule B.
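The definition can be sketched in a few lines of Python, assuming the usual form of the coefficient, AB / (A + B - AB), and representing each fingerprint as the set of positions of its on-bits. The function name and the example bit positions are made up for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a = len(bits_a)            # bits set to one in molecule A
    b = len(bits_b)            # bits set to one in molecule B
    ab = len(bits_a & bits_b)  # bits set to one in both molecules
    union = a + b - ab
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return ab / union


fp_a = {3, 17, 256, 900}
fp_b = {3, 17, 512}
print(tanimoto(fp_a, fp_b))  # 2 shared bits, 5 distinct on-bits: 0.4
```

Identical fingerprints give 1.0 and fingerprints without any common bit give 0.0.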

3.5

The root-mean-square deviation (RMSD) measures the average distance between atoms of superimposed molecules. The smaller the RMSD, the better the superimposition of the molecules.
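A minimal Python sketch of this measure, assuming the two molecules are already superimposed and that atoms are paired by their position in the lists. The function name and coordinates are illustrative only.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) atom coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between paired atoms.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


mol_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
mol_b = [(0.0, 0.0, 1.0), (1.5, 1.0, 0.0)]
print(rmsd(mol_a, mol_b))  # every atom is displaced by 1.0, so the RMSD is 1.0
```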

3.6

The dataset for the SuperPred server can be divided into the following two subsets:
Firstly, the ATC prediction dataset, which contains 2,650 drugs with assigned ATC codes from the SuperDrug database. To evaluate the prediction accuracy, subsets were built from the ATC prediction dataset as well as the target prediction dataset. To ensure comparability between the previous SuperPred version (SuperPred I) and the current version, the number of drugs (1,035) for the ATC evaluation remained the same. Based on the drugs currently classified by the WHO, an external dataset was created for validation of the prediction pipeline. The validation dataset, containing 190 novel drugs, was filtered in the same way as the evaluation dataset described in SuperPred I.
Secondly, the target prediction dataset, which is based on approximately 13,000,000 ligand-target interaction records extracted from SuperTarget, ChEMBL and BindingDB. Several filtering steps were applied, restricting the data to certain targets and binding types (e.g. IC50, Ki), resulting in a reduced dataset with 1,900,000 interactions. Additionally, only targets that interact with at least five compounds were considered. The resulting final dataset consists of 341,000 compounds and 1,800 targets with 665,000 compound-target interactions. Furthermore, the target evaluation set consists of 221 therapeutic targets, which are classified as 'successful targets' in the Therapeutic Target Database (TTD) and have more than four compounds per target. This set contains a total of 95,000 compounds and 174,000 compound-target interactions.

3.7

The cutoff was chosen so that random similarity between compounds could be distinguished from the greater similarity between compounds that bind to the same target. For the similarity evaluations we used the Tanimoto score between their ECFP4 fingerprints. We estimated that, with this type of fingerprint, 99.998% of the random similarities lie below the cutoff of 0.45. Additionally, at least one compound pair with a Tanimoto score above this cutoff could be found inside every target group.

3.8

Raw scores are calculated by summing up the Tanimoto coefficients above the cutoff of 0.45. They are subsequently normalized by the target set sizes and converted to Z-scores with the formula:

z_AB = (raw score_AB / N_AB - mean) * exp(0.335 * ln(N_AB)) / σ,

where N_AB represents the number of compounds in the target sets,

mean = 0.0000269,

σ = 0.00523.

The mean is the mean of the normalized raw scores (raw score / set size). σ is the standard deviation of the prototype Z-scores calculated by:

prototype_z_AB = (raw score_AB / N_AB - mean) * exp(0.335 * ln(N_AB)) / 1.22

In this way, the real Z-scores have a distribution comparable to the generalized extreme value distribution, which is used to convert the Z-scores into E-values. This is done by applying the formula:

E(z) = N_db * (1 - exp(-exp(-z * π / sqrt(6) - Γ'(1)))),

where N_db is the number of targets in the database.
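The scoring steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the server's implementation: the function names are hypothetical, the exponent of the E-value formula is read as -z * π / sqrt(6) - Γ'(1), and Γ'(1) is taken as the negative Euler-Mascheroni constant. The constants are the ones given in the text.

```python
import math

CUTOFF = 0.45      # Tanimoto cutoff from the text
MEAN = 0.0000269   # mean of the normalized raw scores
SIGMA = 0.00523    # standard deviation from the text
EXPONENT = 0.335   # set-size exponent from the text
GAMMA_PRIME_1 = -0.5772156649015329  # assumed value of Γ'(1)


def raw_score(tanimotos, cutoff=CUTOFF):
    """Sum of all Tanimoto coefficients above the cutoff."""
    return sum(t for t in tanimotos if t > cutoff)


def z_score(raw, n_ab):
    """Normalize a raw score by the set size n_ab and convert it to a Z-score."""
    return (raw / n_ab - MEAN) * math.exp(EXPONENT * math.log(n_ab)) / SIGMA


def e_value(z, n_db):
    """Convert a Z-score into an E-value for a database with n_db targets."""
    inner = -z * math.pi / math.sqrt(6) - GAMMA_PRIME_1
    if inner > 700:  # guard against overflow in exp() for very negative z
        return float(n_db)
    return n_db * (1.0 - math.exp(-math.exp(inner)))


# Made-up similarities of a query compound against a target set of 4 ligands.
similarities = [0.9, 0.5, 0.3, 0.2]
z = z_score(raw_score(similarities), n_ab=len(similarities))
print(z, e_value(z, n_db=1800))
```

Note that exp(0.335 * ln(N)) is simply N raised to the power 0.335, so larger target sets are scaled up by a slowly growing factor.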

3.9

The E-value is the expectation value of finding an equally high Z-score in the dataset by chance. Predictions with an E-value above 1 are therefore considered random. Nevertheless, our cross-validation showed that even with an E-value > 1, the prediction rate of the targets ranked in first place was about 80% when the weighted Z-score was used for the ranking.

3.10

Weighted Z-scores are the Z-scores that take the diversity within target groups into consideration.



4.


4.1
The ATC prediction pipeline is a combination of three different structure-based methods, considering 2D, fragment and 3D similarity, described in detail below. This combination ensures an optimal coverage of the structural features represented by a final score. The consensus of these methods is taken into account. If at least two methods predict the same ATC class, that class is considered the final prediction. If three different ATC classes are predicted, a threshold for every method is used to decide on the most probable ATC class.
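The consensus rule might be sketched as follows. The majority vote follows the description above; how exactly the thresholds settle a three-way disagreement is not spelled out here, so the tie-break below (take the method whose score exceeds its threshold by the largest margin) is purely an assumption, as are the function name and the input layout. The threshold values are the ones reported in this FAQ.

```python
from collections import Counter

# Thresholds reported in this FAQ for the three methods.
THRESHOLDS = {"2d": 0.35, "fragment": 0.45, "3d": 0.75}


def consensus_atc(predictions):
    """predictions maps a method name to (predicted ATC class, similarity score)."""
    votes = Counter(atc for atc, _score in predictions.values())
    atc, count = votes.most_common(1)[0]
    if count >= 2:
        return atc  # at least two methods agree
    # ASSUMPTION: with no agreement, prefer the method that clears its own
    # threshold by the largest margin; return None if no method clears it.
    margins = {m: score - THRESHOLDS[m] for m, (_atc, score) in predictions.items()}
    best = max(margins, key=margins.get)
    if margins[best] < 0:
        return None
    return predictions[best][0]


print(consensus_atc({
    "2d": ("N02BA", 0.80),
    "fragment": ("N02BA", 0.50),
    "3d": ("B01AC", 0.90),
}))  # two methods agree on N02BA
```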


4.2

In order to select the optimal fingerprint for the 2D similarity comparison, several fingerprints were compared (see paper). The extended-connectivity fingerprints (ECFP) exhibited the best performance for our dataset and hence were used in the prediction pipeline. ECFP4, which belongs to the class of radial fingerprints, is generated by a modified version of the Morgan algorithm (32). It identifies the substructural environment of each atom up to a diameter of 4 (in this case). The calculated fingerprints were subsequently compared by the Tanimoto similarity measure for bit strings.


4.2.1

We analysed the prediction values for ECFP fingerprints. The results are shown in the following figure.

In our case, we do not have a simple yes/no test; instead, we have to deal with many ATC groups. Normally, every ATC group would therefore need an individual limit. However, many ATC groups hold only a small number of compounds, so that one always obtains a large number of true negative and a small number of true positive predictions. To optimize the true positive predictions, we used the following formula:

2d_threshold = sensitivity^2 * (percentage of dataset where the best Tanimoto is above the actual limit) (visualised in the overlaid figure)

By doing this we came up with a 2D-threshold of 0.35.
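A rough Python sketch of how such a limit could be scanned is given below. Several points are assumptions and not taken from the source: "sensitivity" is read as the fraction of correct predictions among the compounds whose best Tanimoto lies above the limit, the limit with the highest value of the formula is taken as the threshold, and the records are made-up toy data, not SuperPred results.

```python
def threshold_score(records, limit):
    """Evaluate sensitivity^2 * coverage for one candidate limit.

    records is a list of (best_tanimoto, predicted_correctly) pairs.
    """
    above = [correct for tanimoto, correct in records if tanimoto > limit]
    if not above:
        return 0.0
    coverage = len(above) / len(records)   # share of the dataset above the limit
    sensitivity = sum(above) / len(above)  # ASSUMED reading of "sensitivity"
    return sensitivity ** 2 * coverage


def best_limit(records, candidates):
    """Return the candidate limit with the highest score (assumed selection rule)."""
    return max(candidates, key=lambda limit: threshold_score(records, limit))


# Made-up toy records: (best Tanimoto to any drug, was the ATC class correct?)
toy_records = [
    (0.9, True), (0.8, True), (0.6, True), (0.5, False),
    (0.4, True), (0.3, False), (0.2, False), (0.1, False),
]
print(best_limit(toy_records, [0.0, 0.2, 0.4, 0.6, 0.8]))
```

Raising the limit makes the remaining predictions more reliable but leaves fewer compounds that can be classified at all; the formula balances the two effects.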


4.3

All 2,650 drugs from the prediction dataset have been fragmented according to the linker rule. This method preferentially generates cyclic fragments by removing the linker atoms between ring structures. All non-redundant fragments produced by the fragmentation method are considered for comparison. When comparing the fragments of two small molecules (A and B) having n and m fragments, a similarity matrix with n x m fields is constructed. Each field contains the Tanimoto coefficient of the particular fragment comparison. The matrix is used to calculate nm possible fragment combinations. For each combination, a final Tanimoto score is estimated by summing up its Tanimoto coefficients from the matrix. The final score is further divided by the smaller number of fragments belonging to one of the molecules.


4.3.1

Using the same calculation as described in section 4.2.1, we arrived at a fragment threshold of 0.45.


4.4

In the superposition method, the superimposition of one molecule onto a reference molecule structure is done by mapping atoms with optimal distances. In order to reduce time complexity, only 100 low-energy conformations are generated for each of the two molecules to be compared, and afterwards each conformation of the first molecule is compared with each conformation of the second molecule. Hence, given a molecule pair, a maximum of 10,000 comparisons are computed (Thimm et al.).

The basic steps of this 3D algorithm are explained in the following example:

(I): Two point sets - (A-H) and (1-9) before the superposition
(II): Both point sets after normalization according to the principal axes of inertia:
  1. horizontal is the largest extent
  2. vertical is the middle extent
  3. orthogonal to the picture (indicated by the arrow in the right picture) is the least extent
The normalization of the first set is chosen arbitrarily but kept fixed. The possible normalizations of the second set in relation to the first set are the 4 versions shown, where the inversion of the arrow symbolizes the inversion of the z-axis (the least extent).
(III): For all 4 possible normalizations the assignment of the points is evaluated. Assigned pairs are indicated by ellipses. Points are assigned to pairs if each is the nearest point to the other.
Assignment procedure, e.g.:
Within the first picture the nearest point to A is 1 and the nearest point to 1 is A, therefore they are assigned as a pair;
the nearest point to G is 7, while the nearest point to 7 is F, therefore G is not assigned.

For all normalizations the number of pairs is evaluated, and the normalization with the most pairs is considered further. If several normalizations have the same number of pairs, the normalization with the smallest RMSD is considered in the further steps. Here we have 6 pairs in the first assignment, 8 in the second, 6 in the third and 8 pairs in the fourth figure.
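The assignment and selection in step (III) can be illustrated with a small Python sketch. It assumes both point sets are already normalized and given as coordinate tuples; the function names and the example points are hypothetical, and the sketch covers only the mutual-nearest-neighbour pairing and the choice between normalizations, not the full superposition.

```python
import math


def nearest(point, candidates):
    """Index of the candidate closest to the given point."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def mutual_pairs(set_a, set_b):
    """Pairs (i, j) where set_a[i] and set_b[j] are each other's nearest points."""
    pairs = []
    for i, point in enumerate(set_a):
        j = nearest(point, set_b)
        if nearest(set_b[j], set_a) == i:
            pairs.append((i, j))
    return pairs


def pair_rmsd(set_a, set_b, pairs):
    """RMSD over the assigned pairs (infinite if nothing was assigned)."""
    if not pairs:
        return math.inf
    squared = sum(math.dist(set_a[i], set_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(squared / len(pairs))


def best_normalization(set_a, normalizations):
    """Index of the normalization with the most pairs, ties broken by smallest RMSD."""
    def key(k):
        pairs = mutual_pairs(set_a, normalizations[k])
        return (-len(pairs), pair_rmsd(set_a, normalizations[k], pairs))
    return min(range(len(normalizations)), key=key)


set_a = [(0.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
set_b = [(0.0, 1.0), (5.4, 0.0)]
print(mutual_pairs(set_a, set_b))  # [(0, 0), (1, 1)]; the third point stays unassigned
```

The third point of `set_a` behaves like G in the example: its nearest neighbour already has a closer partner, so it is left without a pair.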
(IV): The best assignment is chosen from the 4 normalizations, and the best transformation that minimizes the distances of the pairs is evaluated according to the Kabsch algorithm (W. Kabsch).
(V): The assignments and transformations are optimized. During the optimization the assignment of the pairs can change: the former pair (D,5) is changed to (D,6) because their distances changed.



4.4.1

The paper of Thimm et al. deals with the 2D and 3D similarity of ATC drugs (N-class). Unfortunately, it is not possible to compare the values of Tanimoto coefficients with the 3D similarity scores one-to-one. According to its introduction, it is generally accepted that Tanimoto coefficients larger than 0.85 (for pathway-based fingerprint methods) start to indicate similar activity. To find a corresponding value for the 3D score, we counted the number of hits with a Tanimoto coefficient larger than 0.85 and found that a 3D score of about 0.75 gives approximately the same number of hits. For that reason we decided to use a 3D threshold of 0.75.




5.


5.1
The website allows you to classify your compound into an ATC group, if it has drug character, and to estimate possible targets. You can classify your small molecule via the Drug Classification button or predict targets for it via the Target-Prediction button.

If you have any questions, which are not answered in the FAQs, please feel free to contact us!

5.2

Drug Classification makes it possible to classify compounds into ATC groups by using 2D, fragment and 3D similarity. This information might help in the further drug development process.
There are 4 input options available:
  • using PubChem names (make sure that the names of the compounds are written like in PubChem)
  • using SMILES
  • uploading a MOL file
  • drawing a structure


The top of the result page shows Details for Drug Classification. Here, information about the input compound and the most similar drug is supplied. The input compound is screened against 2,650 drugs that have ATC codes assigned. Therefore, only those drugs can be displayed as similar drugs. Previous and next buttons are provided to switch between the 5 most similar drugs for the input compound. On this page, the results are based on 2D similarity only. To get the results considering 2D, fragment and 3D similarity, please click on Detailed report. Please note that the calculations may take at least 5 to 10 minutes, and even longer if the overall load on the server is high. Additionally, the Detailed report page can be bookmarked. While the calculation is running, the page refreshes itself every 30 seconds until the results are available. If you do not want to wait, you can use the bookmark to see the results later. They will be kept for at least one month.



The detailed report gives the classification information calculated by the prediction pipeline consisting of 2D, fragment and 3D similarity. In the example shown below, the classification was calculated by 2D and fragment similarity.



Furthermore, statistics for the predicted ATC class are given.



5.3

The Target-Prediction site offers the possibility to predict targets for an input compound of interest.
There are 4 input options available:
  • using PubChem names (make sure that the names of the compounds are written like in PubChem)
  • using SMILES
  • uploading a MOL file
  • drawing a structure


The result page is ordered as follows: First, structure information about the input compound is shown. Second, known ligand-target interactions for the input compound are displayed. Third, predicted ligand-target interactions for the input compound are displayed. A dropdown list is provided to select the target of interest, both for known and for predicted ligand-target interactions. The dropdown list is divided into two parts: Therapeutic Targets (TTD, Drugbank), which are found in the Therapeutic Target Database (TTD) and DrugBank for the input compound; and More known targets, which are known to bind the input compound but are not therapeutic targets for it (in the case of known ligand-target interactions).
After selecting a target from the dropdown list, the available information about the target is shown. Several links to other publicly available databases are given, as well as PDB structures if available. Note that only PDB structures from mammalian organisms (human, bovine, rabbit, mouse, rat, pig) are displayed. As a result, only incomplete PDB structures may be shown even though a complete PDB structure is available for a species that is not displayed. The same holds true for predicted targets.


5.4

The ATC-Tree gives an overview of the WHO ATC codes according to the classification system. By browsing the tree you get information about the classification of a drug. Clicking on the 5th level of the tree forwards you to the compound's detail page.



