News topic modeling with matrix factorization

Image created by author with ChatGPT
I implement and compare different approaches to classify BBC news articles into five categories: business, entertainment, politics, sports, and technology. The project specifically contrasts Non-negative Matrix Factorization (NMF), an unsupervised technique, with Histogram-Based Gradient Boosting Classification (HGBC), a supervised learning approach.
Key Components
Data Exploration and Preprocessing
- Analyzed a dataset of 2,225 BBC news articles, with 1,490 available for training
- Performed text preprocessing using spaCy, including lemmatization and removal of stop words
- Evaluated class distribution and text length characteristics across categories
- Applied TF-IDF vectorization to convert text into a numerical representation
Unsupervised Approach: Non-negative Matrix Factorization
- Implemented NMF to decompose the TF-IDF matrix into topic distributions
- Achieved 91.25% accuracy on training data and 92.52% on the test set with default parameters
- Performed hyperparameter tuning via GridSearchCV to optimize model performance
- Visualized results using confusion matrices to analyze classification patterns
Supervised Approach: Histogram-Based Gradient Boosting
- Implemented HGBC as a comparative supervised technique
- Analyzed performance across different training data proportions (10% to 90%)
- Observed accuracy improvements as training data increased, with optimal performance at 70%
- Achieved 94.69% accuracy on the test set, outperforming the unsupervised approach
Model Evaluation
- Created confusion matrices to identify classification patterns and errors
- Submitted predictions to Kaggle for independent evaluation
- Analyzed trade-offs between model complexity, accuracy, and potential overfitting
Results and Insights
The analysis revealed that while the unsupervised NMF approach provided impressive accuracy (92.52%) without requiring labeled data, the supervised HGBC method achieved slightly higher accuracy (94.69%) when sufficient training data was available. This project demonstrates the effectiveness of both approaches and provides insights into their relative strengths for text classification tasks.