Toxicological research papers relevance prediction


The client, a leading pharmaceutical manufacturer, manually reviews hundreds of toxicology publications from various online scientific databases to stay aware of, and compliant with, the latest research findings. Because the review is manual, it is extremely time-consuming (about 70 man-hours per month) and prone to bias (experts classify documents based on their own experience). The client wanted to automate the classification process to make it efficient and standardized.

Data Ingestion and Pre-processing

An automated pipeline was developed to ingest articles on relevant substances from public research databases such as PubMed at regular intervals. The pipeline could process and use chemical and biological ontologies that were curated manually, ingested from public APIs (e.g., PubChem), or scraped from public websites such as Cactus, Bio-Portal, AberOWL, and Systems Biology. Relevant content was extracted into a structured format, then cleaned, standardized, and disambiguated using the ontologies and lexicons, making the data suitable for downstream machine learning.
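The cleaning and ontology-based standardization step can be illustrated with a minimal Python sketch. The lexicon entries and the function names (`clean_text`, `normalize_terms`, `preprocess`) are hypothetical and chosen only for illustration; the actual pipeline drew its synonyms from the ontology sources listed above.

```python
import re

# Hypothetical mini-lexicon mapping synonyms to one canonical substance name.
# A real lexicon would be built from ontology sources, not hard-coded.
LEXICON = {
    "acetaminophen": "paracetamol",
    "apap": "paracetamol",
    "n-acetyl-p-aminophenol": "paracetamol",
    "bpa": "bisphenol a",
}


def clean_text(raw):
    """Strip markup remnants, lowercase, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def normalize_terms(text, lexicon):
    """Replace each known synonym with its canonical name."""
    # Longest synonyms first, so a long name is handled before any shorter
    # synonym that might overlap with it.
    for synonym in sorted(lexicon, key=len, reverse=True):
        # Match only whole terms, not fragments of longer words or
        # hyphenated names.
        pattern = r"(?<![\w-])" + re.escape(synonym) + r"(?![\w-])"
        text = re.sub(pattern, lexicon[synonym], text)
    return text


def preprocess(raw, lexicon=LEXICON):
    """Clean the raw text, then standardize substance names."""
    return normalize_terms(clean_text(raw), lexicon)


example = preprocess("<p>APAP  overdose and  BPA exposure</p>")
```

After this step, articles that mention the same substance under different names share one token, which is what makes them comparable for a downstream classifier.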

Fine-grained Relevancy Prediction

Natural Language Processing (NLP) techniques for importance-weighted vectorization, followed by an ensemble of text classification algorithms, were applied to predict whether each research article was relevant. Articles that the model found relevant were then grouped into 20+ clusters of toxicological profiling sections, each corresponding to a potential area of anatomical or ecological impact.
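As a rough illustration of importance-weighted vectorization followed by an ensemble, the sketch below uses TF-IDF weighting and a majority vote over three toy classifiers. Both choices are assumptions: the case study does not name the weighting scheme or the algorithms used. The training sentences, labels, and keywords are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every training token."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(doc, idf):
    """Sparse TF-IDF vector; tokens unseen in training are dropped."""
    tf = Counter(t for t in tokenize(doc) if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid_classifier(vecs, labels):
    """Predict the class whose summed vector is closest by cosine."""
    sums = {}
    for v, y in zip(vecs, labels):
        acc = sums.setdefault(y, Counter())
        for t, w in v.items():
            acc[t] += w
    return lambda v: max(sums, key=lambda y: cosine(v, sums[y]))


def nearest_neighbour_classifier(vecs, labels):
    """Predict the label of the single most similar training document."""
    def predict(v):
        best = max(range(len(vecs)), key=lambda i: cosine(v, vecs[i]))
        return labels[best]
    return predict


def keyword_classifier(keywords):
    """Rule-based voter: relevant if any keyword is present."""
    return lambda v: "relevant" if any(k in v for k in keywords) else "not_relevant"


def majority_vote(classifiers, v):
    votes = Counter(clf(v) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Toy training data, invented for illustration only.
train_docs = [
    "hepatotoxicity observed in rats after oral exposure",
    "dermal exposure caused skin irritation in rabbits",
    "quarterly sales report for the retail division",
    "marketing plan for a new packaging design",
]
train_labels = ["relevant", "relevant", "not_relevant", "not_relevant"]

idf = fit_idf(train_docs)
train_vecs = [vectorize(d, idf) for d in train_docs]
ensemble = [
    centroid_classifier(train_vecs, train_labels),
    nearest_neighbour_classifier(train_vecs, train_labels),
    keyword_classifier({"exposure", "toxicity"}),
]


def predict(text):
    return majority_vote(ensemble, vectorize(text, idf))
```

An odd number of voters avoids ties in the binary case. A production ensemble would more likely combine trained statistical models, but the voting logic stays the same.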

Active Learning

Because historical annotated data was lacking, the model was trained on a small number of annotated articles. A feedback loop, in which an expert can review and give feedback on the model's prediction for each article, was therefore built into the model workflow to support continuous active learning and to improve accuracy and robustness over time.
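One common way to run such a feedback loop is uncertainty sampling: the articles whose predicted relevance is closest to 50% are sent to the expert first, and the corrected labels are merged back into the training set. The sketch below assumes this strategy, which the case study does not specify; the function names, article IDs, and scores are hypothetical.

```python
def select_for_review(scores, budget):
    """Return the `budget` article IDs the model is least sure about.

    `scores` maps article ID -> predicted probability of relevance.
    Ties are broken by article ID so the result is deterministic.
    """
    ranked = sorted(scores, key=lambda a: (abs(scores[a] - 0.5), a))
    return ranked[:budget]


def apply_feedback(training_labels, expert_labels):
    """Merge expert-provided labels into a copy of the training labels.

    Expert labels override any earlier label for the same article.
    """
    updated = dict(training_labels)
    updated.update(expert_labels)
    return updated


# Hypothetical model scores for four unlabeled articles.
scores = {"a1": 0.95, "a2": 0.52, "a3": 0.10, "a4": 0.45}
to_review = select_for_review(scores, budget=2)

# The expert labels the selected articles; the model is then retrained
# on the enlarged training set (retraining is not shown here).
training = apply_feedback({"a0": "relevant"}, {"a2": "relevant", "a4": "not_relevant"})
```

Spending the expert's time on borderline articles tends to improve the model faster than reviewing predictions it is already confident about.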


The model achieved an F1-score of ~75% on unseen data, with ~71% precision and ~80% recall for relevant articles, and the client considered it highly successful for adoption in their business process.
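The three reported figures are mutually consistent, since F1 is the harmonic mean of precision and recall. A short check, using only the rounded values quoted above (the function names are illustrative):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute all three metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Rounded values from the case study: 2 * 0.71 * 0.80 / (0.71 + 0.80)
reported_f1 = f1_from(0.71, 0.80)
```

With 71% precision and 80% recall, the result is approximately 0.752, which matches the stated F1-score of ~75%.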
