In order to obtain a dataset that is best suited for predicting ATC classes, a number of filtering steps were established.
Starting with 5067 ATC codes, only codes with different descriptions were kept, reducing the number to 4403 ATC codes.
After removing all classes that are assigned to "V" at level 1, 4128 structures remained; excluding combinations of different drugs
further reduced the number to 3208. Next, only ATC codes that are assigned to structures, and can therefore be translated into
a SMILES string, were considered. This removed plants, materials, proteins and similar substances and left 2652 structures. After removing salts, 2576
drugs remained.
Next, only structures with a single assigned ATC code were kept in the training data set, and all groups that were thereby reduced
to a single remaining element were removed, which left 1959. Finally, removing all ATC classes that end in "X" left the final 1552 drugs
for the prediction dataset, spread over 233 ATC classes.
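The filtering cascade described above can be sketched in code. The following Python function is only an illustration under stated assumptions, not the pipeline used for this work: the record keys (`atc`, `description`, `smiles`, `is_combination`, `is_salt`) are hypothetical, keeping the first code per description is one possible reading of the first step, and treating the first five characters of a code as its class (`class_len=5`) is an assumption.

```python
from collections import Counter, defaultdict


def filter_atc_records(records, class_len=5):
    """Apply the filtering steps to a list of dicts.

    Hypothetical keys per record: 'atc', 'description', 'smiles'
    (None if no structure exists), 'is_combination', 'is_salt'.
    """
    # 1. Keep one code per distinct description (assumption: first occurrence wins).
    seen, kept = set(), []
    for r in records:
        if r["description"] not in seen:
            seen.add(r["description"])
            kept.append(r)

    # 2. Drop everything assigned to "V" at level 1.
    kept = [r for r in kept if not r["atc"].startswith("V")]
    # 3. Drop combinations of different drugs.
    kept = [r for r in kept if not r["is_combination"]]
    # 4. Keep only entries that can be translated into a SMILES string.
    kept = [r for r in kept if r["smiles"]]
    # 5. Drop salts.
    kept = [r for r in kept if not r["is_salt"]]

    # 6. Keep only structures with a single assigned ATC code.
    codes_per_structure = defaultdict(set)
    for r in kept:
        codes_per_structure[r["smiles"]].add(r["atc"])
    kept = [r for r in kept if len(codes_per_structure[r["smiles"]]) == 1]

    # 7. Drop classes that are left with only one member.
    class_sizes = Counter(r["atc"][:class_len] for r in kept)
    kept = [r for r in kept if class_sizes[r["atc"][:class_len]] > 1]

    # 8. Drop classes whose code ends in "X".
    kept = [r for r in kept if not r["atc"][:class_len].endswith("X")]
    return kept
```

The order of the steps matters in this sketch: singleton classes are removed after multi-code structures have been dropped, so a class can become a singleton only because of the preceding step, as in the description above.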
For the target definition, the ChEMBL database version 29 was used. Binding values
were used for assays with organism 'Homo sapiens', assay type B (binding)
or F (functional), and confidence level of at least 7. Activity values were used
only for the types IC50, EC50, Ki, Kd and Potency. The standard unit for these values is nM;
values of 0 were discarded.
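The selection criteria for activity records can be written as a single predicate. The sketch below is illustrative only: the dictionary keys are modelled on ChEMBL column names but are assumptions here, as is the choice to reject records whose unit is not nM.

```python
ALLOWED_ASSAY_TYPES = {"B", "F"}  # binding, functional
ALLOWED_ACTIVITY_TYPES = {"IC50", "EC50", "Ki", "Kd", "Potency"}
MIN_CONFIDENCE = 7


def is_usable_activity(rec):
    """Return True if a record (dict with assumed keys) passes all criteria."""
    value = rec["standard_value"]
    return (
        rec["assay_organism"] == "Homo sapiens"
        and rec["assay_type"] in ALLOWED_ASSAY_TYPES
        and rec["confidence_score"] >= MIN_CONFIDENCE
        and rec["standard_type"] in ALLOWED_ACTIVITY_TYPES
        and rec["standard_units"] == "nM"  # assumption: other units are rejected
        and value is not None
        and value > 0  # values of 0 are discarded
    )
```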
A substance is considered a strong binder of a target if at least one such activity value is 1000 nM or less
with a published relation of '=' or '<'. For the prediction data set, non-binders are defined as
substances with activity values of 30 000 nM or more, a relation of '=' or '>', and no additional
data record for the same substance-target relation with a value below 1000 nM. For the strong binders of the prediction data set,
the additional condition was applied that no data record for the same substance-target relation defines the substance as
a non-binder. As a result, several examples in the prediction data set with the data validity comment
'Outside typical range' were used neither for the strong-binder data set nor for the non-binder data set.
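The labelling rules above can be illustrated with a short Python sketch. It rests on assumptions: the function name and the `(relation, value)` input format are hypothetical, and the mutual exclusion is simplified so that any strong-binder evidence blocks the non-binder label and vice versa, which differs from the wording above for a value of exactly 1000.

```python
STRONG_MAX_NM = 1000.0  # strong binder: value at or below this threshold
NON_BINDER_MIN_NM = 30000.0  # non-binder: value at or above this threshold


def classify_pair(records):
    """Label one substance-target pair from its activity records.

    records: iterable of (relation, value_nM) tuples, e.g. ("=", 50.0).
    Returns "strong_binder", "non_binder", or None if the pair is
    unlabelled or carries contradictory evidence.
    """
    records = list(records)
    strong = any(
        rel in ("=", "<") and val <= STRONG_MAX_NM for rel, val in records
    )
    weak = any(
        rel in ("=", ">") and val >= NON_BINDER_MIN_NM for rel, val in records
    )
    if strong and not weak:
        return "strong_binder"
    if weak and not strong:
        return "non_binder"
    return None  # no evidence, or conflicting evidence: not used
```

A pair with both a 50 nM and a 40 000 nM measurement, for example, receives no label in this sketch and would therefore appear in neither data set.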
These definitions result in 500 979 trusted unique relations between
365 719 strong binders and 2353 targets.
A target can have up to 6203 trusted strong-binding substances.
1736 targets have trusted data for both binders and non-binders.
691 targets have data for at least 20 binders and non-binders and were part of the prediction.
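How targets could be selected for modelling is sketched below. This is a minimal illustration with assumptions: the input is a list of unique `(target, substance, label)` tuples with hypothetical labels `"strong_binder"` and `"non_binder"`, and the threshold is read as at least 20 substances in each of the two classes.

```python
from collections import Counter, defaultdict


def select_targets(labelled_pairs, min_per_class=20):
    """Return the sorted targets that have enough binders and non-binders.

    labelled_pairs: iterable of unique (target, substance, label) tuples.
    Assumption: the minimum applies to each class separately.
    """
    counts = defaultdict(Counter)
    for target, _substance, label in labelled_pairs:
        counts[target][label] += 1
    return sorted(
        target
        for target, c in counts.items()
        if c["strong_binder"] >= min_per_class
        and c["non_binder"] >= min_per_class
    )
```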
Accuracy of the 691 machine learning models for the target prediction, determined by
10-fold cross-validation.