To obtain a dataset well suited for predicting ATC classes, a series of filtering steps was applied. Starting from 5067 ATC codes, only codes with distinct descriptions were kept, reducing the number to 4403. Removing all codes assigned to class "V" at level 1 left 4128 entries, and excluding combinations of different drugs reduced the number further to 3208. Next, only ATC codes that are assigned to a structure, and can therefore be translated into a SMILES string, were retained; this removed plants, materials, proteins, and similar substances and left 2652 structures. After salts were removed, 2576 drugs remained.
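
The first filtering steps can be sketched in a few lines of Python. This is only an illustration, not the pipeline actually used: the record fields (`description`, `atc_code`, `is_combination`, `smiles`) and the function name `prefilter` are hypothetical, and treating any multi-component SMILES string (one containing `.`) as a salt is an assumed heuristic rather than the rule applied to the dataset.

```python
def prefilter(records):
    """Apply the early filtering steps to a list of dict records.

    All field names are hypothetical; adapt them to the actual data source.
    """
    seen_descriptions = set()
    kept = []
    for rec in records:
        # Keep only the first code for each distinct description.
        description = rec["description"].strip().lower()
        if description in seen_descriptions:
            continue
        seen_descriptions.add(description)

        # Drop everything in level-1 class "V" (first character of the code).
        if rec["atc_code"].startswith("V"):
            continue

        # Drop combinations of different drugs (assumed boolean flag).
        if rec.get("is_combination"):
            continue

        # Drop entries without a structure (plants, materials, proteins, ...).
        smiles = rec.get("smiles")
        if not smiles:
            continue

        # Assumed salt heuristic: "." separates disconnected components in SMILES.
        if "." in smiles:
            continue

        kept.append(rec)
    return kept
```

Each condition corresponds to one step of the description above, so the number of surviving records can be logged after every check to reproduce a count table like the one given in the text.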

In the next step, only structures with a single assigned ATC code were kept in the training dataset, and all groups that were left with only one member were removed, which left 1959 structures. Finally, removing all ATC classes that end in "X" yielded the final prediction dataset of 1552 drugs, spread over 233 ATC classes.
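
A minimal sketch of these last steps is given below. It assumes that the input is a list of `(structure, atc_code)` pairs and that a "class" is the prefix of the ATC code of a fixed length (here five characters, i.e. ATC level 4); the text does not state the level, so `class_length` is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def atc_class(code, class_length=5):
    """Return the class prefix of an ATC code (assumed: first 5 characters)."""
    return code[:class_length]


def filter_for_training(pairs, class_length=5):
    """Return {structure: atc_code} after the three final filtering steps."""
    # Step 1: keep only structures with exactly one assigned ATC code.
    codes_by_structure = defaultdict(set)
    for structure, code in pairs:
        codes_by_structure[structure].add(code)
    single = {
        s: next(iter(codes))
        for s, codes in codes_by_structure.items()
        if len(codes) == 1
    }

    # Step 2: drop classes that are left with only one member.
    class_sizes = Counter(atc_class(c, class_length) for c in single.values())
    kept = {
        s: c
        for s, c in single.items()
        if class_sizes[atc_class(c, class_length)] > 1
    }

    # Step 3: drop classes whose label ends in "X".
    return {
        s: c
        for s, c in kept.items()
        if not atc_class(c, class_length).endswith("X")
    }
```

The steps run in the order given in the text, so a class is judged to be a singleton before the "X" classes are removed. Applying them in a different order could change which classes survive.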