Smoker Status Detection from Clinical Notes
A 5-class NLP pipeline that classifies patient smoking status directly from clinical discharge notes.

Overview
Whether a patient smokes is one of the most predictive signals in medicine, but it is usually locked inside free-text discharge notes. I built an NLP pipeline to pull it out automatically.
The pipeline classifies each note into one of five smoking-status categories. It builds Bag-of-Words features, uses ADASYN to rebalance the skewed classes, and benchmarks Logistic Regression, Random Forest, and SVM classifiers against a fine-tuned ClinicalBERT.
The result replaces slow, inconsistent manual chart review with a reproducible model that turns unstructured notes into structured, research-ready labels.
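
To make the Bag-of-Words step concrete, here is a minimal standard-library sketch of how notes could be turned into count vectors. It is an illustration under assumptions, not the project's actual code: the tokenizer, the `min_count` parameter, the helper names, and the two example notes are all hypothetical.

```python
import re
from collections import Counter

# Assumption: lowercase alphanumeric tokens are enough for this illustration.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase a note and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(notes, min_count=1):
    """Map every token seen at least min_count times to a column index."""
    counts = Counter(tok for note in notes for tok in tokenize(note))
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: i for i, tok in enumerate(kept)}


def vectorize(note, vocab):
    """Turn one note into a dense count vector over the vocabulary."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokenize(note)).items():
        if tok in vocab:  # tokens not in the vocabulary are ignored
            vec[vocab[tok]] = n
    return vec


# Made-up snippets for illustration only, not real patient data.
notes = [
    "Patient denies tobacco use.",
    "Smokes one pack per day; smokes at home.",
]
vocab = build_vocabulary(notes)
matrix = [vectorize(note, vocab) for note in notes]
```

Each row of `matrix` is one note and each column is one vocabulary term, which is the input shape that classic classifiers such as Logistic Regression, Random Forest, and SVM expect. A library vectorizer such as scikit-learn's `CountVectorizer` does the same job with sparse output; the source does not say which implementation was used.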

At a glance
- Smoking status is buried in free-text clinical notes, which makes it costly to extract for research and risk models.
- The pipeline is a 5-class classifier that pairs Bag-of-Words features and ADASYN rebalancing with classic models, benchmarked against a fine-tuned ClinicalBERT.
- It automates smoking-status classification from discharge notes, replacing manual chart review with a reproducible model.

Highlights
- 5-class classification from unstructured discharge notes
- ADASYN rebalancing to handle heavy class imbalance
- Fine-tuned ClinicalBERT benchmarked vs. classic ML
- Reproducible pipeline replacing manual chart review
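
The ADASYN rebalancing mentioned above can be sketched in a few lines. ADASYN generates more synthetic samples around minority points whose neighbourhoods are dominated by other classes. The version below is a simplified, standard-library illustration and not the project's code: it oversamples one class against all others, takes the number of new samples as an explicit `n_new` argument, and interpolates only toward same-class neighbours. The function names, the toy 2-D points, and the label strings are hypothetical.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return positions of the k candidates closest to point (Euclidean)."""
    order = sorted(range(len(candidates)),
                   key=lambda j: math.dist(point, candidates[j]))
    return order[:k]


def adasyn_oversample(X, y, minority, n_new, k=5, seed=0):
    """Create roughly n_new synthetic samples for the minority class."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    if len(minority_idx) < 2:
        raise ValueError("need at least two minority samples to interpolate")

    # Step 1: hardness of each minority point, measured as the share of
    # other-class points among its k nearest neighbours in the full dataset.
    hardness = []
    for i in minority_idx:
        others = [j for j in range(len(X)) if j != i]
        near = k_nearest(X[i], [X[j] for j in others], k)
        n_other_class = sum(1 for pos in near if y[others[pos]] != minority)
        hardness.append(n_other_class / len(near))

    # Step 2: normalise hardness into a distribution. If no minority point
    # has other-class neighbours, fall back to uniform weights (a choice
    # made for this sketch).
    total = sum(hardness)
    if total == 0:
        weights = [1 / len(hardness)] * len(hardness)
    else:
        weights = [h / total for h in hardness]

    # Step 3: give each point its share of n_new, then interpolate toward a
    # random same-class neighbour. Rounding means the final count can differ
    # slightly from n_new.
    synthetic = []
    for pos, i in enumerate(minority_idx):
        peers = [j for j in minority_idx if j != i]
        near_peers = [peers[p]
                      for p in k_nearest(X[i], [X[j] for j in peers], k)]
        for _ in range(round(weights[pos] * n_new)):
            z = rng.choice(near_peers)
            lam = rng.random()  # position along the segment from X[i] to X[z]
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[z])])
    return synthetic


# Toy 2-D data with hypothetical label names, for illustration only.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.5, 1.0], [1.0, 1.5],
     [1.0, 1.0], [2.0, 2.0], [6.0, 6.0]]
y = ["unknown"] * 6 + ["current smoker"] * 3
synthetic = adasyn_oversample(X, y, minority="current smoker", n_new=7, k=3)
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points never leak into evaluation data. In practice a library implementation such as `ADASYN` from `imblearn.over_sampling` would likely be used; the source does not say which one the project relies on.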