Toxicity is one of the five pillars of the drug development filtering process, ADMET, where T stands for Toxicity. Its relevance extends far beyond pharmaceuticals, playing a critical role in environmental toxicology, public health, regulatory safety, and reducing reliance on animal testing. Today, toxicity is among the most heavily studied parameters in the biomedical sciences, with emerging fields like toxicogenomics shaping our ability to predict and prevent adverse health effects caused by chemical exposure.
To explore this domain, I worked with the Tox21 dataset, originally released as part of the 2014 Tox21 Challenge by the NIH's National Center for Advancing Translational Sciences. The dataset presents real-world modeling challenges: severe class imbalance, heterogeneous multi-label targets, and limited positive samples for certain endpoints.
In this project, I approached toxicity prediction with a low-compute, interpretable pipeline:
- Descriptor-based feature extraction using RDKit,
- Resampling techniques (SMOTEENN) to mitigate class imbalance,
- Target-wise modeling using One-vs-Rest classifiers,
- Ensemble methods (VotingClassifier, Random Forest, XGBoost) for performance benchmarking.
Each target was modeled independently, providing granular control over thresholding and evaluation. The final classifier was deployed via a lightweight Streamlit interface, enabling user-friendly exploration of molecular toxicity predictions.
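As a rough illustration of that last step, here is a minimal Streamlit sketch; the model path, endpoint, and descriptor subset shown are hypothetical placeholders rather than the project's actual code.

```python
# Hypothetical Streamlit front end: paste a SMILES string, get a toxicity probability.
# Model path, endpoint, and descriptor subset are illustrative placeholders.
import joblib
import streamlit as st
from rdkit import Chem
from rdkit.Chem import Descriptors

st.title("Tox21 Toxicity Predictor")
smiles = st.text_input("Enter a SMILES string", "CCO")

mol = Chem.MolFromSmiles(smiles)
if mol is None:
    st.error("Could not parse the SMILES string.")
else:
    # In practice the full training-time descriptor vector must be reproduced here;
    # only a small subset is shown for brevity.
    features = [[
        Descriptors.MolWt(mol),
        Descriptors.MolLogP(mol),
        Descriptors.TPSA(mol),
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ]]
    model = joblib.load("models/nr_ahr_classifier.joblib")  # hypothetical export path
    proba = model.predict_proba(features)[0, 1]
    st.metric("Predicted probability of toxicity (NR-AhR)", f"{proba:.2f}")
```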
This post walks through the full pipeline, demonstrating that with domain knowledge, even classical models can perform competitively on complex biological problems.
Toxicity prediction remains a notoriously hard task, the difficulty arising from a complex interplay of biological, chemical, and data-centric factors. Challenges include multifaceted chemical interactions, heterogeneous and incomplete datasets, unclear adverse outcome pathways, poor consensus across models, and the persistent need for robust validation strategies.
The Tox21 dataset is a product of the Tox21 collaborative research initiative, jointly coordinated by the EPA, NIEHS, NCATS, and FDA. Its central mission is to revolutionize toxicology through rapid, high-throughput, and cost-effective approaches to evaluating chemical safety. Tox21 offers a vast amount of data generated by quantitative high-throughput screening (qHTS), making it a goldmine for machine learning applications.
Several landmark studies, such as DeepTox and MolToxPred, leverage deep learning and molecular fingerprints for toxicity classification. However, such approaches often demand significant computational resources, making them less accessible to researchers with limited infrastructure.
My approach takes a leaner route: building models based on physicochemical molecular descriptors using RDKit. This choice not only reduces the computational burden but also enhances interpretability, a crucial aspect when moving from algorithm to real-world decision-making.
Dataset Overview
The Tox21 dataset was established to revolutionize the toxicology testing process by applying rapid, high-throughput, and cost-effective methods for chemical safety evaluation. The data was generated by qHTS systems, in which automated robotics platforms tested thousands of chemicals against a diverse panel of cell-based assays.
Two main types of toxicity endpoints are assessed:
- Nuclear Receptor Assays: measure activity on key receptors involved in hormone regulation.
- Stress Response Pathway Assays: evaluate activation of cellular defenses against damage.
Results for each assay are typically binary labels. The dataset also contains many missing values, as not every chemical was tested in every assay.
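For orientation, a minimal loading sketch is shown below, assuming a MoleculeNet-style CSV with a smiles column and one binary label column per assay; the file name is an assumption.

```python
# Minimal loading sketch, assuming a MoleculeNet-style CSV with a 'smiles' column
# and one binary label column per assay (the file name is an assumption).
import pandas as pd

TARGETS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]

df = pd.read_csv("tox21.csv")

# Fraction of missing labels per assay: not every chemical was tested in every assay.
print(df[TARGETS].isna().mean().sort_values(ascending=False))

# Fraction of actives among measured compounds per assay: the class imbalance problem.
print(df[TARGETS].mean().sort_values())
```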
Descriptor Extraction
RDKit, an open-source cheminformatics toolkit, was used to extract descriptors that characterize molecular structure in numerical form. This avoided reliance on SMILES-based embeddings and fingerprints, reducing computational load; a short extraction sketch follows the descriptor list below.
The following descriptors were generated:
- Physicochemical descriptors: molecular weight, lipophilicity, topological polar surface area, molar refractivity, aqueous solubility (LogS)
- Hydrogen Bonding Capacity: hydrogen bond donors and hydrogen bond acceptors
- Structural Flexibility: number of rotatable bonds, FractionCSP3
- Topological and Graph-Based Complexity: zero-order molecular connectivity index and number of heavy atoms
- Ring Systems and Aromaticity: number of rings, aromatic ring count, aromatic rings containing heteroatoms, non-aromatic ring systems
- Electronic Properties: formal charge
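Below is a minimal sketch of the extraction step using RDKit's standard descriptor functions; the helper name is mine, and the set shown is a representative subset of the groups listed above (aqueous solubility, for example, typically needs a separate estimator).

```python
# Sketch of per-molecule descriptor extraction with RDKit (helper name is mine;
# the set below is a representative subset of the descriptor groups listed above).
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, GraphDescriptors, Lipinski, rdMolDescriptors

def featurize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparsable SMILES are dropped upstream
    return {
        "MolWt": Descriptors.MolWt(mol),                  # molecular weight
        "LogP": Crippen.MolLogP(mol),                     # lipophilicity
        "TPSA": Descriptors.TPSA(mol),                    # topological polar surface area
        "MolMR": Crippen.MolMR(mol),                      # molar refractivity
        "HBD": Lipinski.NumHDonors(mol),                  # hydrogen bond donors
        "HBA": Lipinski.NumHAcceptors(mol),               # hydrogen bond acceptors
        "RotatableBonds": Lipinski.NumRotatableBonds(mol),
        "FractionCSP3": Lipinski.FractionCSP3(mol),
        "Chi0": GraphDescriptors.Chi0(mol),               # zero-order connectivity index
        "HeavyAtoms": mol.GetNumHeavyAtoms(),
        "RingCount": rdMolDescriptors.CalcNumRings(mol),
        "AromaticRings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "AromaticHeterocycles": rdMolDescriptors.CalcNumAromaticHeterocycles(mol),
        "SaturatedRings": rdMolDescriptors.CalcNumSaturatedRings(mol),
        "FormalCharge": Chem.GetFormalCharge(mol),
    }

print(featurize("c1ccccc1O"))  # phenol
```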
While molecular fingerprints and embeddings are powerful representations, descriptors offer several advantages, such as interpretability and simplicity. They also align much better with pharmacokinetics and medicinal chemistry.
Class Imbalance Handling
One of the key caveats faced was the severe class imbalance and the missing values present in the Tox21 dataset. The class distribution skewed model learning toward predicting the majority class with high accuracy, but also caused poor recall for toxic compounds.
Random oversampling and undersampling would lead to overfitting and loss of chemical diversity respectively, which is why a combination of SMOTEENN and random sampling performed best.
Random sampling fine-tuned the class balance without completely erasing diversity. SMOTE (Synthetic Minority Oversampling Technique) generates synthetic examples of the minority class based on feature-space similarities, while ENN (Edited Nearest Neighbors) cleans up the majority class by removing borderline and noisy samples, acting as a filter for ambiguous majority points. Together they keep the dataset balanced and representative, and led to higher F1 scores, especially for the minority class.
In toxicity screening, false negatives are costly. Handling imbalance is therefore not just a statistical choice but a biological imperative. The combination of SMOTEENN and random sampling makes the model more conservative in screening while ensuring that it learns true biochemical separations in descriptor space.
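A minimal sketch of this rebalancing step, assuming imbalanced-learn's SMOTEENN plus a RandomUnderSampler for the random-sampling component; the ordering and ratio are assumptions, not the project's exact settings.

```python
# Sketch of the rebalancing step for a single target, assuming imbalanced-learn;
# the order (under-sample, then SMOTEENN) and the ratio are assumptions.
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import RandomUnderSampler

def rebalance(X, y, seed=42):
    # Mild random under-sampling of the majority class first, so SMOTE does not
    # have to synthesize an extreme number of minority points.
    rus = RandomUnderSampler(sampling_strategy=0.3, random_state=seed)
    X_mid, y_mid = rus.fit_resample(X, y)
    # SMOTE oversamples the toxic minority; ENN then prunes borderline/noisy points.
    return SMOTEENN(random_state=seed).fit_resample(X_mid, y_mid)
```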
Model Selection
Several machine learning models were tested to evaluate the performance of toxicity prediction from molecular descriptors.
Different types of models were used:
- Random Forest: robust to overfitting and works well with tabular data
- XGBoost: excellent at handling class imbalance, while also offering efficient and powerful gradient boosting
- One-vs-Rest: suitable for multilabel binary classification problems
- Voting Classifier: an ensemble of diverse models combined by soft voting improves generalization.
All of these models are robust and interpretable and work well with imbalanced data.
A stratified 67/33 train-test split was used to ensure class balance was preserved across splits, and 5-fold cross-validation was used to assess stability and avoid overfitting. Given the extreme class imbalance, the performance metrics reported include precision, recall, F1 score, and ROC-AUC.
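A sketch of the benchmarking loop under those settings is shown below; hyperparameters are illustrative assumptions, and X and y stand for the descriptor matrix and binary labels of a single target.

```python
# Sketch of the benchmarking step for one target: stratified 67/33 split, 5-fold CV,
# and the models listed above (hyperparameters here are illustrative assumptions).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from xgboost import XGBClassifier

# X, y: descriptor matrix and binary labels for a single Tox21 endpoint.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)
X_train, y_train = rebalance(X_train, y_train)  # rebalancing sketch from above

models = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "xgboost": XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
models["voting"] = VotingClassifier(estimators=list(models.items()), voting="soft")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    cv_auc = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: CV ROC-AUC {cv_auc.mean():.3f} | test ROC-AUC {test_auc:.3f}")
    print(classification_report(y_test, model.predict(X_test), digits=3))
```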
Per-Target Training
Instead of treating the problem as a multilabel classification task across all 12 endpoints, I trained separate classification models for each individual target. This allowed cleaner evaluation with less cross-target noise, reduced complexity, and permitted flexible per-target optimization. It also made it easier to identify which molecular descriptors were most informative, which would not be as clear in a multitask setting.
The 12 models, trained for the 12 different targets, were evaluated using AUC-ROC, F1 score for the toxic class, and overall accuracy. The F1 score for the minority class was prioritized for final model selection because it most directly reflects performance at detecting toxic compounds. Each target represents a unique mechanism, and model performance mirrored both the complexity of the descriptor-to-phenotype mapping and each model's inductive bias.
All models were trained on class-rebalanced datasets using SMOTE-ENN to address the pronounced class imbalance across Tox21 targets. Each model was exported modularly per target, enabling focused optimization and evaluation.
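A per-target loop along these lines might look as follows; TARGETS and df come from the loading sketch above, X_all stands for the descriptor matrix built with the featurization sketch (aligned to df's rows), build_candidates is a hypothetical factory returning fresh copies of the models from the previous sketch, and the export paths are assumptions.

```python
# Sketch of the per-target loop: filter measured compounds, rebalance, pick the model
# with the best toxic-class F1, and export it (paths and helper names are assumptions).
import joblib
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

best_models = {}
for target in TARGETS:  # the 12 endpoints from the loading sketch
    mask = df[target].notna()                          # labels contain NaNs per assay
    X_t, y_t = X_all[mask], df.loc[mask, target].astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X_t, y_t, test_size=0.33, stratify=y_t, random_state=42
    )
    X_tr, y_tr = rebalance(X_tr, y_tr)

    scored = []
    for name, model in build_candidates().items():     # fresh copies of the models above
        model.fit(X_tr, y_tr)
        scored.append((f1_score(y_te, model.predict(X_te)), name, model))

    f1, name, model = max(scored)                      # select on toxic-class F1
    best_models[target] = model
    joblib.dump(model, f"models/{target}_{name}.joblib")  # assumes a models/ directory
    print(f"{target}: best = {name} (toxic-class F1 = {f1:.3f})")
```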
- Logistic Regression performed exceptionally well for targets like NR-AhR, NR-ER, and NR-Aromatase. These endpoints showed strong linear relationships with classical descriptors such as molecular weight, topological polar surface area (TPSA), and LogP, making them amenable to simpler, interpretable models.
- XGBoost outperformed other models for targets such as NR-AR, SR-HSE, and SR-MMP, where toxicity appeared to hinge on non-linear interactions between structural and physicochemical properties. Its gradient-boosted decision trees effectively captured these complex thresholds.
- Ensemble models (particularly soft-voting combinations of tree-based and neural models) proved useful for targets like SR-ARE, NR-ER-LBD, and SR-ATAD5, where no single architecture consistently dominated. The ensemble approach helped smooth out model-specific biases, leading to more robust predictions.
- For SR-p53, a target associated with highly context-dependent transcriptional stress responses, performance was especially challenging due to sparse positives and significant intra-class heterogeneity. An OvR ensemble was needed here to stabilize predictions and balance sensitivity with specificity.
Overall, the variety in target-level performance shows how descriptor-based models capture some toxicity mechanisms well, notably those governed by broad physicochemical properties. However, more intricate endpoints may require richer features, such as substructure-based fingerprints or graph-level representations, to fully characterize their predictive signals.
How Does This Compare to Existing Work?
The Tox21 challenge has long served as a benchmark for evaluating machine learning approaches in toxicology. While deep learning and molecular fingerprints have dominated recent entries, descriptor-based methods still hold their ground, especially with the right modeling strategy.
Huang et al. (2016) reported strong performance using a mix of random forests and deep neural networks, notably on targets such as NR-AhR and SR-MMP, with ROC-AUCs reaching up to 0.82. However, they also noted that "performance drops sharply on sparse-label targets," underscoring the difficulty of generalizing across biologically diverse endpoints like SR-p53.
In a related effort, Mayr et al. (2018) explored multi-task deep learning, aiming to improve generalization across all 12 targets. While their architecture performed competitively, they acknowledged that "multi-task models struggle when endpoints have minimal shared biological mechanisms," suggesting that per-target specialization may still be necessary.
Wu et al. (2017), on the other hand, took a more interpretable route. Their combination of XGBoost and Mordred descriptors performed competitively, particularly on nuclear receptor pathways such as NR-AR and NR-ER, leading them to conclude that "descriptor-based models, when properly tuned, can rival fingerprint-driven pipelines."
In the present study, a modular per-target approach using physicochemical descriptors alone (e.g., molecular weight, TPSA, LogP) achieved AUCs in the range of 0.74–0.78 on several targets, notably NR-AhR, NR-ER, and SR-HSE. Without relying on fingerprints or SMILES parsing, these models demonstrated that traditional descriptors can remain highly effective, especially when paired with SMOTEENN rebalancing and architecture-specific tuning.
In contrast to monolithic models that aim to fit all endpoints equally, this modular approach respects the biological heterogeneity across targets, and leverages it.
While it may not surpass all state-of-the-art multi-task deep learning models, the results affirm an important insight shared by earlier work:
"When descriptor quality is high and the modeling is targeted, simpler pipelines can deliver competitive and interpretable toxicology predictions."
This project set out to answer a deceptively simple question: can traditional molecular descriptors, when paired with lightweight machine learning models, still hold their ground in the era of deep learning and molecular graphs?
The answer, as shown across twelve distinct toxicity targets from the Tox21 dataset, is a cautious but confident yes. By taking a per-target approach, fine-tuning model architectures like XGBoost, Random Forest, and Logistic Regression on rebalanced datasets, the models achieved AUC scores comparable to more complex baselines on several endpoints.
What stood out most was not just the performance, but the interpretability and modularity of the whole pipeline. Every feature was human-readable. Every model was inspectable. And every target was treated as a separate biological question rather than an abstract node in a multi-task grid. In a field that increasingly leans toward black-box modeling, this approach served as a reminder that clarity and customization can still compete.
While this project focused on descriptor-based models, several extensions could strengthen and generalize the findings:
- Richer Features: incorporating substructure alerts, docking scores, or toxicophore flags could capture mechanisms missed by general descriptors.
- Target-Specific Interpretability: using SHAP or permutation importance to explain key features per endpoint could deepen biological insight.
- External Validation: testing on datasets like ToxCast or REACH would assess real-world robustness and highlight domain shifts.
- Hybrid Models: future work could explore combining descriptors with SMILES- or graph-based encodings to merge interpretability with structural fidelity.
Thanks for reading! If you're working on something similar, toxicity prediction, cheminformatics, or just exploring AI for drug discovery, I'd love to connect.
🔬 Project Repository: Tox21 Descriptor-Based Toxicity Prediction (GitHub)
Let's build smarter models, together.