Toxicity is one of the five pillars of the drug development filtering process, ADMET, where T stands for Toxicity. Its relevance extends far beyond pharmaceuticals, playing a critical role in environmental toxicology, public health, regulatory safety, and reducing reliance on animal testing. Today, toxicity is among the most heavily studied parameters in the biomedical sciences, with emerging fields like toxicogenomics shaping our ability to predict and prevent adverse health effects caused by chemical exposure.
To explore this domain, I worked with the Tox21 dataset, originally released as part of the 2014 Tox21 Challenge by the NIH's National Center for Advancing Translational Sciences. The dataset presents real-world modeling challenges: severe class imbalance, heterogeneous multi-label targets, and limited positive samples for certain endpoints.
In this project, I approached toxicity prediction with a low-compute, interpretable pipeline:
- Descriptor-based feature extraction using RDKit,
- Resampling techniques (SMOTEENN) to mitigate class imbalance,
- Target-wise modeling using One-vs-Rest classifiers,
- Ensemble methods (VotingClassifier, Random Forest, XGBoost) for performance benchmarking.
Each target was modeled independently, providing granular control over thresholding and evaluation. The final classifier was deployed via a lightweight Streamlit interface, enabling user-friendly exploration of molecular toxicity predictions.
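As a rough illustration of that last step, here is a minimal Streamlit sketch; the model path, endpoint, and descriptor subset shown are hypothetical placeholders rather than the project's actual code.

```python
# Hypothetical Streamlit front end: paste a SMILES string, get a toxicity probability.
# Model path, endpoint, and descriptor subset are illustrative placeholders.
import joblib
import streamlit as st
from rdkit import Chem
from rdkit.Chem import Descriptors

st.title("Tox21 Toxicity Predictor")
smiles = st.text_input("Enter a SMILES string", "CCO")

mol = Chem.MolFromSmiles(smiles)
if mol is None:
    st.error("Could not parse the SMILES string.")
else:
    # In practice the full training-time descriptor vector must be reproduced here;
    # only a small subset is shown for brevity.
    features = [[
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ]]
    model = joblib.load("models/nr_ahr_classifier.joblib")  # hypothetical export path
    proba = model.predict_proba(features)[0, 1]
    st.metric("Predicted probability of toxicity (NR-AhR)", f"{proba:.2f}")
```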
This post walks through the full pipeline, demonstrating that with domain knowledge, even classical models can perform competitively on complex biological problems.
Toxicity prediction remains a notoriously hard task, the difficulty arising from a complex interplay of biological, chemical, and data-centric factors. Challenges include multifaceted chemical interactions, heterogeneous and incomplete datasets, unclear adverse outcome pathways, poor consensus across models, and the persistent need for robust validation strategies.
The Tox21 dataset is a product of the Tox21 collaborative research initiative, jointly coordinated by the EPA, NIEHS, NCATS, and FDA. Its central mission is to revolutionize toxicology through rapid, high-throughput, and cost-effective approaches to evaluating chemical safety. Tox21 offers a vast amount of data generated by quantitative high-throughput screening (qHTS), making it a goldmine for machine learning applications.
Several landmark studies, such as DeepTox and MolToxPred, leverage deep learning and molecular fingerprints for toxicity classification. However, such approaches often demand significant computational resources, making them less accessible to researchers with limited infrastructure.
My approach takes a leaner route: building models based on physicochemical molecular descriptors using RDKit. This choice not only reduces the computational burden but also enhances interpretability, a crucial aspect when moving from algorithm to real-world decision-making.
Dataset Overview
The Tox21 dataset was established to revolutionize the toxicology testing process by applying rapid, high-throughput, and cost-effective methods for chemical safety evaluation. The data was generated by qHTS systems, in which automated robotics platforms tested thousands of chemicals against a diverse panel of cell-based assays.
Two main types of toxicity endpoints are assessed:
- Nuclear Receptor Assays: measure activity on key receptors involved in hormone regulation.
- Stress Response Pathway Assays: evaluate activation of cellular defenses against damage.
Results for each assay are typically binary labels. The dataset also contains many missing values, as not every chemical was tested in every assay.
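For orientation, a minimal loading sketch is shown below, assuming a MoleculeNet-style CSV with a smiles column and one binary label column per assay; the file name is an assumption.

```python
# Minimal loading sketch, assuming a MoleculeNet-style CSV with a 'smiles' column
# and one binary label column per assay (the file name is an assumption).
import pandas as pd

TARGETS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]

df = pd.read_csv("tox21.csv")

# Fraction of missing labels per assay: not every chemical was tested in every assay.
print(df[TARGETS].isna().mean().sort_values(ascending=False))

# Fraction of actives among measured compounds per assay: the class imbalance problem.
print(df[TARGETS].mean().sort_values())
```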
Descriptor Extraction
RDKit, an open-source cheminformatics toolkit, was used to extract descriptors that characterize molecular structure in numerical form. This avoided reliance on SMILES-based embeddings and fingerprints, reducing computational load; a short extraction sketch follows the descriptor list below.
The following descriptors were generated:
- Physicochemical descriptors: molecular weight, lipophilicity, topological polar surface area, molar refractivity, aqueous solubility (LogS)
- Hydrogen Bonding Capacity: hydrogen bond donors and hydrogen bond acceptors
- Structural Flexibility: number of rotatable bonds, FractionCSP3
- Topological and Graph-Based Complexity: zero-order molecular connectivity index and number of heavy atoms
- Ring Systems and Aromaticity: number of rings, aromatic ring count, aromatic rings containing heteroatoms, non-aromatic ring systems
- Electronic Properties: formal charge
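Below is a minimal sketch of the extraction step using RDKit's standard descriptor functions; the helper name is mine, and the set shown is a representative subset of the groups listed above (aqueous solubility, for example, typically needs a separate estimator).

```python
# Sketch of per-molecule descriptor extraction with RDKit (helper name is mine;
# the set below is a representative subset of the descriptor groups listed above).
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, GraphDescriptors, Lipinski, rdMolDescriptors

def featurize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparsable SMILES are dropped upstream
    return {
        "MolWt": Descriptors.MolWt(mol),                  # molecular weight
        "LogP": Crippen.MolLogP(mol),                     # lipophilicity
        "TPSA": Descriptors.TPSA(mol),                    # topological polar surface area
        "MolMR": Crippen.MolMR(mol),                      # molar refractivity
        "HBD": Lipinski.NumHDonors(mol),                  # hydrogen bond donors
        "HBA": Lipinski.NumHAcceptors(mol),               # hydrogen bond acceptors
        "RotatableBonds": Lipinski.NumRotatableBonds(mol),
        "FractionCSP3": Lipinski.FractionCSP3(mol),
        "Chi0": GraphDescriptors.Chi0(mol),               # zero-order connectivity index
        "HeavyAtoms": mol.GetNumHeavyAtoms(),
        "RingCount": rdMolDescriptors.CalcNumRings(mol),
        "AromaticRings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "AromaticHeterocycles": rdMolDescriptors.CalcNumAromaticHeterocycles(mol),
        "SaturatedRings": rdMolDescriptors.CalcNumSaturatedRings(mol),
        "FormalCharge": Chem.GetFormalCharge(mol),
    }

print(featurize("c1ccccc1O"))  # phenol
```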
While molecular fingerprints and embeddings are powerful representations, descriptors offer several advantages, such as interpretability and simplicity. They also align much better with pharmacokinetics and medicinal chemistry.
Class Imbalance Handling
One of the key caveats faced was the severe class imbalance and the missing values present in the Tox21 dataset. The class distribution skewed model learning toward predicting the majority class with high accuracy, but also caused poor recall for toxic compounds.
Random oversampling and undersampling would lead to overfitting and loss of chemical diversity respectively, which is why a combination of SMOTEENN and random sampling performed best.
Random sampling fine-tuned the class balance without completely erasing diversity. SMOTE (Synthetic Minority Oversampling Technique) generates synthetic examples of the minority class based on feature-space similarities, while ENN (Edited Nearest Neighbors) cleans up the majority class by removing borderline and noisy samples, acting as a filter for ambiguous majority points. Together they keep the dataset balanced and representative, and led to higher F1 scores, especially for the minority class.
In toxicity screening, false negatives are costly. Handling imbalance is therefore not just a statistical choice but a biological imperative. The combination of SMOTEENN and random sampling makes the model more conservative in screening while ensuring that it learns true biochemical separations in descriptor space.
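A minimal sketch of this rebalancing step, assuming imbalanced-learn's SMOTEENN plus a RandomUnderSampler for the random-sampling component; the ordering and ratio are assumptions, not the project's exact settings.

```python
# Sketch of the rebalancing step for a single target, assuming imbalanced-learn;
# the order (under-sample, then SMOTEENN) and the ratio are assumptions.
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import RandomUnderSampler

def rebalance(X, y, seed=42):
    # Mild random under-sampling of the majority class first, so SMOTE does not
    # have to synthesize an extreme number of minority points.
    rus = RandomUnderSampler(sampling_strategy=0.3, random_state=seed)
    X_mid, y_mid = rus.fit_resample(X, y)
    # SMOTE oversamples the toxic minority; ENN then prunes borderline/noisy points.
    return SMOTEENN(random_state=seed).fit_resample(X_mid, y_mid)
```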
Model Selection
Several machine learning models were tested to evaluate the performance of toxicity prediction from molecular descriptors.
Different types of models were used:
- Random Forest: robust to overfitting and works well with tabular data
- XGBoost: excellent at handling class imbalance, while also offering efficient and powerful gradient boosting
- One-vs-Rest: suitable for multilabel binary classification problems
- Voting Classifier: an ensemble of diverse models combined by soft voting improves generalization.
All of these models are robust and interpretable and work well with imbalanced data.
A stratified 67/33 train-test split was used to ensure class balance was preserved across splits, and 5-fold cross-validation was used to assess stability and avoid overfitting. Given the extreme class imbalance, the performance metrics reported include precision, recall, F1 score, and ROC-AUC.
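A sketch of the benchmarking loop under those settings is shown below; hyperparameters are illustrative assumptions, and X and y stand for the descriptor matrix and binary labels of a single target.

```python
# Sketch of the benchmarking step for one target: stratified 67/33 split, 5-fold CV,
# and the models listed above (hyperparameters here are illustrative assumptions).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from xgboost import XGBClassifier

# X, y: descriptor matrix and binary labels for a single Tox21 endpoint.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)
X_train, y_train = rebalance(X_train, y_train)  # rebalancing sketch from above

models = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "xgboost": XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
models["voting"] = VotingClassifier(estimators=list(models.items()), voting="soft")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: CV ROC-AUC {cv_auc.mean():.3f} | test ROC-AUC {test_auc:.3f}")
    print(classification_report(y_test, model.predict(X_test), digits=3))
```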
Per-Target Training
Instead of treating the problem as a multilabel classification task across all 12 endpoints, I trained separate classification models for each individual target. This allowed cleaner evaluation with less cross-target noise, reduced complexity, and permitted flexible per-target optimization. It also made it easier to identify which molecular descriptors were most informative, which would not be as clear in a multitask setting.
The 12 models, trained for the 12 different targets, were evaluated using AUC-ROC, F1 score for the toxic class, and overall accuracy. The F1 score for the minority class was prioritized for final model selection because it most directly reflects performance at detecting toxic compounds. Each target represents a unique mechanism, and model performance mirrored both the complexity of the descriptor-to-phenotype mapping and each model's inductive bias.
All models were trained on class-rebalanced datasets using SMOTE-ENN to address the pronounced class imbalance across Tox21 targets. Each model was exported modularly per target, enabling focused optimization and evaluation.
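A per-target loop along these lines might look as follows; TARGETS and df come from the loading sketch above, X_all stands for the descriptor matrix built with the featurization sketch (aligned to df's rows), build_candidates is a hypothetical factory returning fresh copies of the models from the previous sketch, and the export paths are assumptions.

```python
# Sketch of the per-target loop: filter measured compounds, rebalance, pick the model
# with the best toxic-class F1, and export it (paths and helper names are assumptions).
import joblib
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

best_models = {}
for target in TARGETS:  # the 12 endpoints from the loading sketch
    mask = df[target].notna()                          # labels contain NaNs per assay
    X_t, y_t = X_all[mask], df.loc[mask, target].astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X_t, y_t, test_size=0.33, stratify=y_t, random_state=42
    )
    X_tr, y_tr = rebalance(X_tr, y_tr)

    scored = []
    for name, model in build_candidates().items():     # fresh copies of the models above
        model.fit(X_tr, y_tr)
        scored.append((f1_score(y_te, model.predict(X_te)), name, model))

    f1, name, model = max(scored)                      # select on toxic-class F1
    best_models[target] = model
    joblib.dump(model, f"models/{target}_{name}.joblib")  # assumes a models/ directory
    print(f"{target}: best = {name} (toxic-class F1 = {f1:.3f})")
```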
- Logistic Regression performed exceptionally well for targets like NR-AhR, NR-ER, and NR-Aromatase. These endpoints showed strong linear relationships with classical descriptors such as molecular weight, topological polar surface area (TPSA), and LogP, making them amenable to simpler, interpretable models.
- XGBoost outperformed other models for targets such as NR-AR, SR-HSE, and SR-MMP, where toxicity appeared to hinge on non-linear interactions between structural and physicochemical properties. Its gradient-boosted decision trees effectively captured these complex thresholds.
- Ensemble models (particularly soft-voting combinations of tree-based and neural models) proved useful for targets like SR-ARE, NR-ER-LBD, and SR-ATAD5, where no single architecture consistently dominated. The ensemble approach helped smooth out model-specific biases, leading to more robust predictions.
- For SR-p53, a target associated with highly context-dependent transcriptional stress responses, performance was especially challenging due to sparse positives and significant intra-class heterogeneity. An OvR ensemble was needed here to stabilize predictions and balance sensitivity with specificity.
Overall, the variety in target-level performance shows how descriptor-based models capture some toxicity mechanisms well, notably those governed by broad physicochemical properties. However, more intricate endpoints may require richer features, such as substructure-based fingerprints or graph-level representations, to fully characterize their predictive signals.
How Does This Compare to Existing Work?
The Tox21 challenge has long served as a benchmark for evaluating machine learning approaches in toxicology. While deep learning and molecular fingerprints have dominated recent entries, descriptor-based methods still hold their ground, especially with the right modeling strategy.
Huang et al. (2016) reported strong performance using a mix of random forests and deep neural networks, notably on targets such as NR-AhR and SR-MMP, with ROC-AUCs reaching up to 0.82. However, they also noted that "performance drops sharply on sparse-label targets," underscoring the difficulty of generalizing across biologically diverse endpoints like SR-p53.
In a related effort, Mayr et al. (2018) explored multi-task deep learning, aiming to improve generalization across all 12 targets. While their architecture performed competitively, they acknowledged that "multi-task models struggle when endpoints have minimal shared biological mechanisms," suggesting that per-target specialization may still be necessary.
Wu et al. (2017), on the other hand, took a more interpretable route. Their combination of XGBoost and Mordred descriptors performed competitively, particularly on nuclear receptor pathways such as NR-AR and NR-ER, leading them to conclude that "descriptor-based models, when properly tuned, can rival fingerprint-driven pipelines."
In the present study, a modular per-target approach using physicochemical descriptors alone (e.g., molecular weight, TPSA, LogP) achieved AUCs in the range of 0.74–0.78 on several targets, notably NR-AhR, NR-ER, and SR-HSE. Without relying on fingerprints or SMILES parsing, these models demonstrated that traditional descriptors can remain highly effective, especially when paired with SMOTEENN rebalancing and architecture-specific tuning.
In contrast to monolithic models that aim to fit all endpoints equally, this modular approach respects the biological heterogeneity across targets, and leverages it.
While it may not surpass all state-of-the-art multi-task deep learning models, the results affirm an important insight shared by earlier work:
"When descriptor quality is high and the modeling is targeted, simpler pipelines can deliver competitive and interpretable toxicology predictions."
This project set out to answer a deceptively simple question: can traditional molecular descriptors, when paired with lightweight machine learning models, still hold their ground in the era of deep learning and molecular graphs?
The answer, as shown across twelve distinct toxicity targets from the Tox21 dataset, is a cautious but confident yes. By taking a per-target approach, fine-tuning model architectures like XGBoost, Random Forest, and Logistic Regression on rebalanced datasets, the models achieved AUC scores comparable to more complex baselines on several endpoints.
What stood out most was not just the performance, but the interpretability and modularity of the whole pipeline. Every feature was human-readable. Every model was inspectable. And every target was treated as a separate biological question rather than an abstract node in a multi-task grid. In a field that increasingly leans toward black-box modeling, this approach served as a reminder that clarity and customization can still compete.
While this project focused on descriptor-based models, several extensions could strengthen and generalize the findings:
- Richer Features: incorporating substructure alerts, docking scores, or toxicophore flags could capture mechanisms missed by general descriptors.
- Target-Specific Interpretability: using SHAP or permutation importance to explain key features per endpoint could deepen biological insight.
- External Validation: testing on datasets like ToxCast or REACH would assess real-world robustness and highlight domain shifts.
- Hybrid Models: future work could explore combining descriptors with SMILES- or graph-based encodings to merge interpretability with structural fidelity.
Thanks for reading! If you're working on something similar, toxicity prediction, cheminformatics, or just exploring AI for drug discovery, I'd love to connect.
🔬 Project Repository: Tox21 Descriptor-Based Toxicity Prediction (GitHub)
Let's build smarter models, together.