Relevance Prediction and Toxicological Profiling of Research Papers


The client, a leading pharmaceutical manufacturer, manually reviewed hundreds of toxicology publications from various online scientific databases to stay aware of, and compliant with, the latest research findings. Because it was manual, the process was incredibly time-consuming, requiring approximately 70 person-hours per month. In addition, the classification relied on subjective expert judgment, which made it susceptible to bias. The client therefore wanted to automate the classification process to ensure efficiency and standardization.

Relevance Prediction for Toxicological Research Papers


Data Ingestion and Pre-processing

We collaborated with the client to build an automated pipeline that ingests articles from public research databases such as PubMed at regular intervals, focusing on substances of interest. The pipeline could process and utilize chemical and biological ontologies that were curated manually, ingested from public APIs (e.g., PubChem), or scraped from public websites such as Cactus, Bio-Portal, AberOWL, and Systems Biology.
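PubMed exposes its search interface through the NCBI E-utilities API. The sketch below shows how a scheduled ingestion job might build a substance-focused query; the `toxicity` search term and the parameter values are illustrative assumptions, not the client's actual query:

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_pubmed_search_url(substance: str, days_back: int = 30, retmax: int = 200) -> str:
    """Build an E-utilities esearch URL for recent papers on a substance."""
    params = {
        "db": "pubmed",
        "term": f"{substance} AND toxicity",  # assumed query shape
        "reldate": days_back,   # only papers from the last N days
        "datetype": "pdat",     # filter on publication date
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

A scheduler (e.g., a cron job) can call this at regular intervals and pass the returned PMIDs to `efetch` to retrieve the full records.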

Relevant content was then extracted into a structured format, followed by text cleaning, standardization, and disambiguation using the ontologies and lexicons, making the data suitable for machine learning.
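As an illustration, the cleaning and standardization step might look like this sketch, where `LEXICON` is a hypothetical stand-in for the curated ontologies and lexicons mentioned above:

```python
import re

# Hypothetical lexicon: synonym -> canonical ontology term (illustrative only)
LEXICON = {
    "ld50": "median lethal dose",
    "hepatotoxic": "liver toxicity",
}

def clean_text(raw: str, lexicon: dict = LEXICON) -> str:
    """Normalize raw article text and map synonyms to canonical terms."""
    text = raw.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # strip residual HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    for synonym, canonical in lexicon.items():
        text = text.replace(synonym, canonical)
    return text
```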

Model Evolution

Our initial approach involved creating a fine-grained Relevance Prediction model

To determine the relevance of research articles, we used advanced NLP techniques for importance-weighted vectorization. An ensemble of text classification algorithms then predicted the relevance or non-relevance of each article.
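The vectorization and ensemble idea can be sketched with TF-IDF (a common choice for importance weighting, used here as an assumed example rather than the client's exact method) and a simple majority vote over classifier outputs:

```python
import math
from collections import Counter

def tfidf_vectors(docs: list) -> list:
    """Importance-weighted vectorization: rare, discriminative terms get high weights."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))       # count each term once per document
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: (count / len(tokens)) * math.log(n / doc_freq[term])
                        for term, count in tf.items()})
    return vectors

def majority_vote(predictions: list) -> str:
    """Ensemble step: combine labels from several classifiers by majority."""
    return Counter(predictions).most_common(1)[0][0]
```

In production this would typically use library implementations (e.g., scikit-learn's `TfidfVectorizer` and `VotingClassifier`), but the weighting and voting logic is the same.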

Finally, articles the model found relevant were grouped into 20+ clusters of toxicological profiling sections, corresponding to potential areas of anatomical and ecological impact. However, the results were still unsatisfactory because too little annotated data was available.
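One simple way to group relevant articles into profiling sections is by overlap with per-section keyword sets derived from the ontologies. The section names and keywords below are invented placeholders for the client's 20+ sections:

```python
# Hypothetical section keyword sets (placeholders, not the client's taxonomy)
SECTIONS = {
    "hepatotoxicity": {"liver", "hepatic", "enzyme"},
    "neurotoxicity": {"brain", "neural", "cognitive"},
    "ecotoxicity": {"aquatic", "soil", "wildlife"},
}

def assign_section(tokens: set) -> str:
    """Assign an article's token set to the section with highest Jaccard overlap."""
    scores = {name: len(tokens & kw) / len(tokens | kw)
              for name, kw in SECTIONS.items()}
    return max(scores, key=scores.get)
```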


Hence, we incorporated an active learning model

Because the model was trained on only a small number of annotated articles, we added a feedback loop to the workflow: an expert reviews the model's prediction for each article and provides feedback. This continuous active learning improved accuracy and robustness over time.
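Such a feedback loop pairs naturally with uncertainty sampling: the articles the model is least sure about are queued for expert review first, so each annotation is maximally informative. A minimal sketch, assuming a binary relevant/irrelevant setup where 0.5 marks maximum uncertainty:

```python
def select_for_review(scored_articles: list, batch_size: int = 5) -> list:
    """Uncertainty sampling: pick articles whose relevance probability is
    closest to 0.5, i.e. where the model is least confident.

    scored_articles: list of (article_id, relevance_probability) pairs.
    """
    return sorted(scored_articles, key=lambda pair: abs(pair[1] - 0.5))[:batch_size]
```

Expert labels for the selected batch are appended to the training set and the model is retrained, closing the active learning loop.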


For further refinements, we leveraged a zero-shot classifier

With a zero-shot classifier in place, both the Relevance Prediction model and the Section Identification model (developed for toxicological profiling) achieved better accuracy. Pooled embeddings from titles, keywords, and other fields improved relevance accuracy, and meta-tags such as author names and affiliations raised it further.
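Conceptually, a zero-shot classifier compares a document's embedding against embeddings of the label descriptions and picks the nearest label, while pooled (averaged) embeddings combine signals from the title, keywords, and abstract. The sketch below uses tiny hand-written vectors in place of real model embeddings:

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_pool(*vectors: list) -> list:
    """Pool title/keyword/abstract embeddings into one document vector."""
    return [sum(values) / len(values) for values in zip(*vectors)]

def zero_shot_classify(doc_vector: list, label_vectors: dict) -> str:
    """Pick the label whose description embedding is closest to the document."""
    return max(label_vectors,
               key=lambda label: cosine(doc_vector, label_vectors[label]))
```

In practice the embeddings would come from a pretrained language model; the similarity-and-argmax logic stays the same.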


The entire workflow was deployed as a reusable solution that can be extended to other substances. Multiple competing models, including fine-tuned LLMs (large language models), were incorporated. As data richness improved through expert feedback, more complex models began to outperform the simpler ones. Model outputs were written to the client's database for downstream consumption, and a monitoring dashboard was developed.

End-to-End Model Approach


  • Robust model performance: the model achieved an F1-score of ~86% for article relevance prediction and ~75% for toxicological section prediction on new data.
  • ~80%+ reduction in manual effort through automatic ingestion and processing of research papers.
  • Standardized relevance prediction and section identification, greatly reducing human bias, with continuous active learning from expert feedback driving accuracy improvements over time.
