
    From RDKit to Streamlit: A Complete ML Pipeline for Toxicity Prediction | by Shreyasree G | Jul, 2025

    By Team_AIBS News | July 21, 2025 | 10 Mins Read


    Toxicity Prediction using Tox21

    Toxicity is one of the five pillars of the drug development filtering process, ADMET (absorption, distribution, metabolism, excretion, and toxicity). Its relevance extends far beyond pharmaceuticals, playing a critical role in environmental toxicology, public health, regulatory safety, and reducing reliance on animal testing. Today, toxicity is among the most heavily studied parameters in the biomedical sciences, with emerging fields like toxicogenomics shaping our ability to predict and prevent adverse health effects caused by chemical exposure.

    To explore this domain, I worked with the Tox21 dataset, originally released as part of the 2014 Tox21 Data Challenge by the NIH’s National Center for Advancing Translational Sciences (NCATS). The dataset reflects real-world modeling challenges: severe class imbalance, heterogeneous multi-label targets, and limited positive samples for certain endpoints.

    In this project, I approached toxicity prediction using a low-compute, interpretable pipeline:

    • Descriptor-based feature extraction using RDKit,
    • Resampling strategies (SMOTEENN) to mitigate class imbalance,
    • Target-wise modeling using One-vs-Rest classifiers,
    • Ensemble methods (VotingClassifier, Random Forest, XGBoost) for performance benchmarking.

    Each target was modeled independently, offering granular control over thresholding and evaluation. The final classifier was deployed via a lightweight Streamlit interface, enabling user-friendly exploration of molecular toxicity predictions.

    This post walks through the full pipeline, demonstrating that with domain knowledge, even classical models can perform competitively on complex biological problems.

    Toxicity prediction remains a notoriously hard task, with the difficulty stemming from a complex interplay of biological, chemical, and data-centric factors. Challenges include multifaceted chemical interactions, heterogeneous and incomplete datasets, unclear adverse outcome pathways, poor consensus across models, and the persistent need for robust validation methods.

    The Tox21 dataset is a product of the Tox21 collaborative research initiative, jointly coordinated by the EPA, NIEHS, NCATS, and FDA. Its central mission is to revolutionize toxicology through rapid, high-throughput, and cost-effective approaches to evaluating chemical safety. Tox21 offers a vast amount of data generated by quantitative high-throughput screening (qHTS), making it a goldmine for machine learning applications.

    Several landmark studies, such as DeepTox and MolToxPred, leverage deep learning and molecular fingerprints for toxicity classification. However, such approaches often demand significant computational resources, making them less accessible to researchers with limited infrastructure.

    My approach takes a leaner route: building models based on physicochemical molecular descriptors using RDKit. This choice not only reduces the computational burden but also enhances interpretability, a vital aspect when transitioning from algorithm to real-world decision-making.

    Workflow of Toxicity Prediction

    Dataset Overview

    The Tox21 dataset was established to revolutionize the toxicology testing process by applying rapid, high-throughput, and effective methods for chemical safety evaluation. The data was generated by qHTS systems, in which automated robotics were used to test thousands of chemicals against a diverse panel of cell-based assays.

    Two main types of toxicity endpoints are assessed:

    • Nuclear Receptor (NR) Assays: measure activity on key receptors involved in hormone regulation.
    • Stress Response (SR) Pathway Assays: evaluate activation of cellular defenses against damage.

    Results for each assay are typically binary labels. The dataset also contains many missing values, as not every chemical was tested in every assay.
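
    Before modeling, it helps to quantify both problems per assay: how many compounds were actually tested, and how rare the positives are among them. The sketch below does this with pandas on a tiny made-up label table. The two column names are real Tox21 target names, but the values and the `summarize_labels` helper are illustrative assumptions, not the project's data or code.

```python
import numpy as np
import pandas as pd

# Toy label table: rows are compounds, columns are assays, NaN means "not tested".
labels = pd.DataFrame(
    {
        "NR-AhR": [0, 1, np.nan, 0, 0],
        "SR-p53": [np.nan, 0, 0, np.nan, 1],
    }
)


def summarize_labels(df):
    """Per assay: number of tested compounds, missing entries, and positive rate."""
    return pd.DataFrame(
        {
            "n_tested": df.notna().sum(),
            "n_missing": df.isna().sum(),
            # mean() skips NaN, so this is the positive rate among tested compounds only
            "positive_rate": df.mean(skipna=True),
        }
    )


summary = summarize_labels(labels)
print(summary)
```

    A table like this makes it easy to spot endpoints where both coverage and positive rate are low, which are the ones most likely to need careful resampling.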

    Descriptor Extraction

    RDKit, an open-source cheminformatics toolkit, was used to extract descriptors that represent molecular structure in numerical form. This avoided reliance on SMILES-based embeddings and fingerprints, reducing computational load.

    The following descriptors were generated:

    • Physicochemical descriptors: Molecular weight, lipophilicity (LogP), topological polar surface area (TPSA), molar refractivity, aqueous solubility (LogS)
    • Hydrogen Bonding Capacity: Hydrogen bond donors and hydrogen bond acceptors
    • Structural Flexibility: Number of rotatable bonds, FractionCSP3
    • Topological and Graph-Based Complexity: Zero-order molecular connectivity index and number of heavy atoms
    • Ring systems and aromaticity: number of rings, aromatic ring count, aromatic rings containing heteroatoms, non-aromatic ring systems
    • Electronic Properties: formal charge

    While molecular fingerprints and embeddings are powerful representations, descriptors offer several advantages, such as interpretability and simplicity. They also align much more closely with pharmacokinetics and medicinal chemistry.
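
    As a rough illustration of what this extraction step can look like, the sketch below computes most of the descriptors listed above with RDKit. It assumes each compound is available as a SMILES string. The function name and dictionary keys are my own choices, the mapping from "non-aromatic ring systems" to RDKit's aliphatic ring count is an assumption, and LogS is left out because it needs a separate solubility model rather than a single built-in RDKit call.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors


def compute_descriptors(smiles):
    """Return a dict of descriptors for one SMILES string, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        # Physicochemical
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Crippen.MolLogP(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        "MolMR": Crippen.MolMR(mol),
        # Hydrogen bonding
        "HBD": Lipinski.NumHDonors(mol),
        "HBA": Lipinski.NumHAcceptors(mol),
        # Flexibility
        "RotatableBonds": Lipinski.NumRotatableBonds(mol),
        "FractionCSP3": rdMolDescriptors.CalcFractionCSP3(mol),
        # Topology / size
        "Chi0": Descriptors.Chi0(mol),
        "HeavyAtoms": mol.GetNumHeavyAtoms(),
        # Rings and aromaticity
        "Rings": rdMolDescriptors.CalcNumRings(mol),
        "AromaticRings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "AromaticHeterocycles": rdMolDescriptors.CalcNumAromaticHeterocycles(mol),
        "AliphaticRings": rdMolDescriptors.CalcNumAliphaticRings(mol),
        # Electronic
        "FormalCharge": Chem.GetFormalCharge(mol),
    }


ethanol = compute_descriptors("CCO")
benzene = compute_descriptors("c1ccccc1")
print(ethanol)
```

    Each row of the resulting feature table stays human-readable: every column is a named chemical property rather than a bit position in a fingerprint.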

    Class Imbalance Handling

    One of the key challenges was the severe class imbalance and the missing values present in the Tox21 dataset. The class distribution skewed model learning toward predicting the majority class, which yielded high accuracy but poor recall for toxic compounds.

    Random oversampling alone would lead to overfitting, and random undersampling alone would lose chemical diversity, which is why a combination of SMOTEENN and random sampling performed best.

    Random sampling fine-tuned the class balance without completely erasing diversity. SMOTE (Synthetic Minority Oversampling Technique) generates synthetic examples of the minority class based on feature-space similarities, and ENN (Edited Nearest Neighbors) cleans up the majority class by removing borderline and noisy samples, acting as a filter for ambiguous majority points. The combination keeps the dataset balanced and representative, and led to higher F1 scores, especially for the minority class.

    In toxicity screening, false negatives are costly. Hence, handling imbalance is not just a statistical choice but a biological imperative. The combination of SMOTEENN and random sampling makes the model more conservative in screening, while helping it learn genuine biochemical separations in descriptor space.
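
    To make the two steps concrete, here is a simplified NumPy-only sketch of the mechanics: SMOTE-style interpolation between minority neighbors, followed by an ENN-style filter on the majority class. This is an illustration under stated assumptions, not the project's code. In practice a library implementation such as imbalanced-learn's `SMOTEENN` would be used, and the additional random sampling step is not shown. The function names, the `only_label` parameter, and the toy Gaussian data are all hypothetical.

```python
import numpy as np


def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority points by interpolating toward random neighbors."""
    rng = np.random.default_rng(rng)
    # Brute-force pairwise distances within the minority class (fine for small data).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)          # seed point for each new sample
    pick = neighbors[base, rng.integers(0, k, size=n_new)]  # one of its k nearest neighbors
    gap = rng.random((n_new, 1))                            # position along the connecting segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])


def enn_keep_mask(X, y, k=3, only_label=None):
    """Keep a sample if most of its k nearest neighbors share its label."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    agree = (y[neighbors] == y[:, None]).sum(axis=1)
    keep = agree > k / 2
    if only_label is not None:
        keep |= y != only_label  # never drop samples outside the class being cleaned
    return keep


rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))   # toy "non-toxic" majority
X_min = rng.normal(2.0, 1.0, size=(20, 2))    # toy "toxic" minority
X_syn = smote(X_min, n_new=len(X_maj) - len(X_min), rng=1)

X = np.vstack([X_maj, X_min, X_syn])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(20 + len(X_syn), dtype=int)])
keep = enn_keep_mask(X, y, k=3, only_label=0)  # clean ambiguous majority points only
X_res, y_res = X[keep], y[keep]
print("majority kept:", int((y_res == 0).sum()), "minority:", int((y_res == 1).sum()))
```

    Because every synthetic point is a convex combination of two real minority points, the oversampled class never leaves the region the real toxic compounds already occupy. Resampling of this kind should only be applied to training data, never to the held-out test set.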

    Model Selection

    Several machine learning models were evaluated for predicting toxicity from molecular descriptors.
    The following model types were used:

    • Random Forest: Used because it is robust to overfitting and works well with tabular data
    • XGBoost: Excellent at handling class imbalance, while also offering efficient and powerful gradient boosting
    • One-vs-Rest: suitable for multilabel binary classification problems
    • Voting Classifier: an ensemble of diverse models combined through soft voting improves generalization.

    All of these models are robust, reasonably interpretable, and work well with imbalanced data.
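
    A minimal sketch of a soft-voting ensemble in scikit-learn is shown below. It runs on synthetic data standing in for a descriptor matrix, and it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so the example has no extra dependency. The choice of members and hyperparameters is an assumption for illustration, not the configuration used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a descriptor matrix with roughly 10% positives.
X, y = make_classification(n_samples=500, n_features=15, weights=[0.9, 0.1], random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # average predicted probabilities instead of counting class votes
)
ensemble.fit(X, y)

proba = ensemble.predict_proba(X[:5])
# Soft voting with no weights is just the mean of the members' probabilities.
manual_mean = np.mean([est.predict_proba(X[:5]) for est in ensemble.estimators_], axis=0)
print(np.allclose(proba, manual_mean))
```

    Averaging probabilities lets a confident member outweigh two uncertain ones, which is why soft voting tends to suit endpoints where no single model is consistently best.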

    A stratified 67–33 train-test split was used to ensure that the class balance was preserved across splits. 5-fold cross-validation was used to assess stability and avoid overfitting. Given the extreme class imbalance, the reported performance metrics include precision, recall, F1 score, and ROC-AUC.
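
    The evaluation protocol described here maps directly onto scikit-learn utilities. The sketch below uses synthetic data and a Random Forest as a placeholder model. The seeds, the model settings, and the choice to shuffle the folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for one target: about 10% positives.
X, y = make_classification(n_samples=600, n_features=15, weights=[0.9, 0.1], random_state=0)

# 67-33 split; stratify=y keeps the positive rate the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold stratified cross-validation on the training part, scored by F1 of the positive class.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print("CV F1 per fold:", cv_f1.round(3))

# Final fit and held-out metrics: precision, recall, F1, and ROC-AUC.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(classification_report(y_test, model.predict(X_test), digits=3, zero_division=0))
print("Test ROC-AUC:", round(test_auc, 3))
```

    The spread of the per-fold F1 scores is often as informative as their mean: with few positives per fold, a large spread signals that a single split would give an unreliable estimate.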

    Per-Target Training

    Instead of treating the problem as a single multilabel classification task across all 12 endpoints, I trained a separate classification model for each individual target. This allowed cleaner evaluation with less cross-target noise, reduced complexity, and permitted flexible per-target optimization. It also made it easier to identify which molecular descriptors were most informative, which would not be as clear in a multitask setting.
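
    The per-target idea can be sketched as a simple loop: for each endpoint, keep only the compounds that were actually tested, fit one model, and inspect it on its own. The example below uses random features and two made-up label columns (named after real Tox21 targets) with simulated missing entries. The `fit_per_target` helper and the use of Logistic Regression for every target are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_per_target(X, Y, target_names):
    """Fit one binary classifier per target, using only compounds tested in that assay."""
    models = {}
    for j, name in enumerate(target_names):
        tested = ~np.isnan(Y[:, j])  # drop compounds with a missing label for this target
        clf = make_pipeline(
            StandardScaler(),
            LogisticRegression(max_iter=1000, class_weight="balanced"),
        )
        clf.fit(X[tested], Y[tested, j].astype(int))
        models[name] = clf
    return models


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))  # hypothetical descriptor matrix (6 features)
Y = np.column_stack(
    [
        X[:, 0] + 0.5 * rng.normal(size=300) > 1.0,            # driven by feature 0
        X[:, 1] - X[:, 2] + 0.5 * rng.normal(size=300) > 1.5,  # driven by features 1 and 2
    ]
).astype(float)
Y[rng.random(Y.shape) < 0.2] = np.nan  # simulate untested compound/assay pairs

models = fit_per_target(X, Y, ["NR-AhR", "SR-p53"])
top_feature = {}
for name, clf in models.items():
    coefs = clf[-1].coef_[0]  # coefficients on standardized features
    top_feature[name] = int(np.argmax(np.abs(coefs)))
    print(name, "most influential feature index:", top_feature[name])
```

    Because each model sees only its own target, the coefficient ranking (or feature importances for tree models) can be read per endpoint without disentangling shared representations.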

    The 12 models, one per target, were evaluated using AUC-ROC, F1 score for the toxic class, and overall accuracy. The F1 score for the minority class was prioritized for final model selection because it most directly reflects performance in detecting toxic compounds. Every target represents a unique mechanism, and model performance reflected both the complexity of the descriptor-to-phenotype mapping and each model’s inductive bias.
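
    Prioritizing the toxic-class F1 also makes the decision threshold a per-target tuning knob, since the default cutoff of 0.5 is rarely optimal under imbalance. The sketch below scans a grid of thresholds and picks the one with the best F1. The labels and scores are invented for illustration, and `best_f1_threshold` is a hypothetical helper, not a function from the project.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_f1_threshold(y_true, scores, grid=None):
    """Return (threshold, f1) that maximizes F1 for the positive (toxic) class."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    f1s = [f1_score(y_true, (scores >= t).astype(int), zero_division=0) for t in grid]
    best = int(np.argmax(f1s))
    return float(grid[best]), float(f1s[best])


# Toy validation set: 3 toxic compounds out of 10, with predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.03, 0.08, 0.12, 0.22, 0.31, 0.37, 0.42, 0.47, 0.62, 0.81])

threshold, f1 = best_f1_threshold(y_true, scores)
default_f1 = f1_score(y_true, (scores >= 0.5).astype(int))
print(f"best threshold={threshold:.2f}, F1={f1:.3f}, F1 at 0.5={default_f1:.3f}")
```

    In this toy case, lowering the cutoff to 0.40 recovers the third toxic compound at the cost of one false positive, lifting F1 from 0.800 to about 0.857. The threshold should be chosen on validation data and then fixed before touching the test set.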

    Model Performance

    All models were trained on datasets rebalanced with SMOTEENN to address the pronounced class imbalance across Tox21 targets. Each model was exported as a separate module per target, enabling focused optimization and evaluation.

    • Logistic Regression performed exceptionally well for targets like NR-AhR, NR-ER, and NR-Aromatase. These endpoints showed strong linear relationships with classical descriptors such as molecular weight, topological polar surface area (TPSA), and LogP, making them amenable to simpler, interpretable models.
    • XGBoost outperformed other models for targets such as NR-AR, SR-HSE, and SR-MMP, where toxicity appeared to hinge on non-linear interactions between structural and physicochemical properties. Its gradient-boosted decision trees effectively captured these complex thresholds.
    • Ensemble models (especially soft-voting combinations of tree-based and neural models) proved useful for targets like SR-ARE, NR-ER-LBD, and SR-ATAD5, where no single architecture consistently dominated. The ensemble approach helped smooth out model-specific biases, leading to more robust predictions.
    • For SR-p53, a target associated with highly context-dependent transcriptional stress responses, prediction was especially difficult due to sparse positives and significant intra-class heterogeneity. An OvR ensemble was necessary here to stabilize predictions and balance sensitivity with specificity.

    Overall, the variation in target-level performance shows that descriptor-based models capture some toxicity mechanisms well, particularly those governed by broad physicochemical properties. However, more intricate endpoints may require enhanced features, such as substructure-based fingerprints or graph-level representations, to fully capture their predictive signals.

    How Does This Compare to Existing Work?

    The Tox21 challenge has long served as a benchmark for evaluating machine learning approaches in toxicology. While deep learning and molecular fingerprints have dominated recent entries, descriptor-based methods still hold their ground, especially with the right modeling strategy.

    Huang et al. (2016) reported strong performance using a combination of random forests and deep neural networks, particularly on targets such as NR-AhR and SR-MMP, with ROC-AUCs reaching up to 0.82. However, they also noted that “performance drops sharply on sparse-label targets,” underscoring the difficulty of generalizing across biologically diverse endpoints like SR-p53.

    In a related effort, Mayr et al. (2018) explored multi-task deep learning, aiming to improve generalization across all 12 targets. While their architecture performed competitively, they acknowledged that “multi-task models struggle when endpoints have minimal shared biological mechanisms,” suggesting that per-target specialization may still be necessary.

    Wu et al. (2017), on the other hand, took a more interpretable route. Their combination of XGBoost and Mordred descriptors performed competitively, especially on nuclear receptor pathways such as NR-AR and NR-ER, leading them to conclude that “descriptor-based models, when properly tuned, can rival fingerprint-driven pipelines.”

    In the present study, a modular per-target approach using physicochemical descriptors alone (e.g., molecular weight, TPSA, LogP) achieved AUCs in the range of 0.74–0.78 on several targets, notably NR-AhR, NR-ER, and SR-HSE. Without relying on fingerprints or SMILES-based embeddings, these models demonstrated that traditional descriptors can remain highly effective, especially when paired with SMOTEENN rebalancing and architecture-specific tuning.

    In contrast to monolithic models that aim to fit all endpoints equally, this modular approach respects the biological heterogeneity across targets and leverages it.

    While it may not surpass all state-of-the-art multi-task deep learning models, the results affirm an important insight shared by earlier work:

    “When descriptor quality is high and the modeling is targeted, simpler pipelines can deliver competitive and interpretable toxicology predictions.”

    This project set out to answer a deceptively simple question: Can traditional molecular descriptors, when paired with lightweight machine learning models, still hold their ground in the era of deep learning and molecular graphs?

    The answer, as shown across twelve distinct toxicity targets from the Tox21 dataset, is a cautious but confident yes. By taking a per-target approach and fine-tuning models like XGBoost, Random Forest, and Logistic Regression on rebalanced datasets, the models achieved AUC scores comparable to more complex baselines on several endpoints.

    What stood out most wasn’t just the performance, but the interpretability and modularity of the entire pipeline. Every feature was human-readable. Every model was inspectable. And every target was treated as a separate biological question rather than an abstract node in a multi-task grid. In a field that increasingly leans toward black-box modeling, this approach served as a reminder that clarity and customization can still compete.

    While this project focused on descriptor-based models, several extensions could strengthen and generalize the findings:

    • Richer Features: Incorporating substructure alerts, docking scores, or toxicophore flags could capture mechanisms missed by general descriptors.
    • Target-Specific Interpretability: Using SHAP or permutation importance to explain key features per endpoint could deepen biological insight.
    • External Validation: Testing on datasets like ToxCast or REACH would assess real-world robustness and highlight domain shifts.
    • Hybrid Models: Future work could explore combining descriptors with SMILES- or graph-based encodings to merge interpretability with structural fidelity.

    Thanks for reading! If you’re working on something similar (toxicity prediction, cheminformatics, or just exploring AI for drug discovery), I’d love to connect.

    🔬 Project Repository: GitHub - Tox21 Descriptor-Based Toxicity Prediction

    💻 LinkedIn

    📨 shreyagopal28@gmail.com

    Let’s build smarter models, together.


