07 / NLP

Smoker Status Detection from Clinical Notes

A 5-class NLP pipeline that classifies patient smoking status directly from clinical discharge notes.

NLP / CLINICALBERT / PYTHON
[Image: NLP text classification visualization over clinical discharge notes]

Overview

Whether a patient smokes is a strong predictor of many clinical outcomes, but that information is usually locked inside free-text discharge notes. I built an NLP pipeline to extract it automatically.

The pipeline classifies each note into one of five smoking-status categories. It builds Bag-of-Words features, uses ADASYN to rebalance the skewed classes, and benchmarks Logistic Regression, Random Forest, and SVM against a fine-tuned ClinicalBERT.
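To make the Bag-of-Words step concrete, here is a minimal sketch in plain Python. It is an illustration only, not the project's actual code: the three example notes, the simple alphabetic tokenizer, and the helper names (`tokenize`, `build_vocabulary`, `bag_of_words`) are all assumptions. A real pipeline would more likely use a library vectorizer such as scikit-learn's `CountVectorizer`.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a note and keep only alphabetic tokens (digits are dropped)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(notes):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for note in notes for tok in tokenize(note)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(note, vocab):
    """Return the count vector for one note; out-of-vocabulary tokens are ignored."""
    counts = Counter(tokenize(note))
    # dicts preserve insertion order, so columns follow the sorted vocabulary
    return [counts.get(tok, 0) for tok in vocab]


# Hypothetical notes, invented for this example
notes = [
    "Patient denies smoking.",
    "Patient quit smoking in 2010.",
    "Smokes one pack per day.",
]

vocab = build_vocabulary(notes)
features = [bag_of_words(note, vocab) for note in notes]

print(list(vocab))   # column order of the feature matrix
print(features[0])   # [0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
```

Each note becomes a fixed-length count vector, which is the kind of input that classifiers such as Logistic Regression, Random Forest, and SVM can consume directly.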

The result replaces slow, inconsistent manual chart review with a reproducible model that turns unstructured notes into structured, research-ready labels.

At a glance

OPPORTUNITY

Smoking status is buried in free-text clinical notes, making it costly to extract for research and risk models.

WHAT I BUILT

A 5-class NLP pipeline that combines Bag-of-Words features and ADASYN rebalancing with classic classifiers, benchmarked against a fine-tuned ClinicalBERT.

IMPACT

Automated smoking-status classification from discharge notes, replacing manual chart review with a reproducible model.

Highlights

  • 5-class classification from unstructured discharge notes
  • ADASYN rebalancing to handle heavy class imbalance
  • Fine-tuned ClinicalBERT benchmarked vs. classic ML
  • Reproducible pipeline replacing manual chart review
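
The ADASYN rebalancing named in the highlights can be sketched as follows. This is a simplified, two-class illustration of the general ADASYN idea, not the project's code: the toy 2-D points, the function names, and the parameter choices (`k=2`, `n_new=4`) are assumptions. ADASYN gives more synthetic samples to minority points whose nearest neighbours mostly belong to other classes, then creates each new sample by interpolating towards a nearby minority point. For five classes, the same procedure would be repeated for each under-represented class, or delegated to a library implementation such as imbalanced-learn's `ADASYN`.

```python
import math
import random


def nearest(point, candidates, k):
    """Indices of the k candidates closest to point (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(point, candidates[i]))
    return order[:k]


def adasyn_weights(X, y, minority, k=2):
    """For each minority sample, the normalized share of its k nearest
    neighbours (searched over the whole dataset) that are NOT minority."""
    ratios = []
    for i, (x, label) in enumerate(zip(X, y)):
        if label != minority:
            continue
        others = [j for j in range(len(X)) if j != i]
        neigh = nearest(x, [X[j] for j in others], k)
        ratios.append(sum(y[others[n]] != minority for n in neigh) / k)
    total = sum(ratios)
    if total == 0:  # no minority sample is hard to learn: weight them equally
        return [1 / len(ratios)] * len(ratios)
    return [r / total for r in ratios]


def adasyn_oversample(X, y, minority, n_new, k=2, seed=0):
    """Generate roughly n_new synthetic minority samples (rounding per sample
    means the total can differ slightly from n_new)."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    if len(minority_pts) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    weights = adasyn_weights(X, y, minority, k)
    synthetic = []
    for x, w in zip(minority_pts, weights):
        peers = [p for p in minority_pts if p is not x]
        close = nearest(x, peers, min(k, len(peers)))
        for _ in range(round(w * n_new)):
            neighbour = peers[rng.choice(close)]
            lam = rng.random()  # interpolation factor in [0, 1)
            synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy data, invented for this example: label 1 is the minority class.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0],      # minority
     [5.1, 5.0], [4.9, 5.0], [5.0, 5.1], [5.0, 4.9]]      # majority
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights = adasyn_weights(X, y, minority=1)
synthetic = adasyn_oversample(X, y, minority=1, n_new=4)
print(weights)         # the point at (5, 5) sits among majority samples
print(len(synthetic))  # number of synthetic samples generated
```

In this toy setup only the minority point surrounded by majority samples receives weight, so all synthetic samples are generated from it. That is the adaptive behaviour that distinguishes ADASYN from uniform oversampling.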