News topic modeling with matrix factorization

April 14, 2025
Machine Learning NLP spaCy

I implement and compare two approaches to classifying BBC news articles into five categories: business, entertainment, politics, sports, and technology. The project specifically contrasts Non-negative Matrix Factorization (NMF), an unsupervised technique, with Histogram-Based Gradient Boosting Classification (HGBC), a supervised learning approach.

Key Components

Data Exploration and Preprocessing

  • Analyzed a dataset of 2,225 BBC news articles, with 1,490 available for training
  • Performed text preprocessing using spaCy, including lemmatization and removal of stop words
  • Evaluated class distribution and text length characteristics across categories
  • Applied TF-IDF vectorization to convert text into a numerical representation

Unsupervised Approach: Non-negative Matrix Factorization

  • Implemented NMF to decompose the TF-IDF matrix into topic distributions
  • Achieved 91.25% accuracy on training data and 92.52% on the test set with default parameters
  • Performed hyperparameter tuning via GridSearchCV to optimize model performance
  • Visualized results using confusion matrices to analyze classification patterns
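The core NMF step can be sketched like this: the TF-IDF matrix is factored into a document-topic matrix `W` and a topic-term matrix `H`, and each article is assigned to its strongest topic. The toy corpus and `n_components=2` are illustrative assumptions; the project would use five components, one per news category.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "market share profit company stock",
    "film actor award festival premiere",
    "market profit company revenue stock",
]
X = TfidfVectorizer().fit_transform(docs)

# Decompose X ≈ W @ H with non-negative factors.
nmf = NMF(n_components=2, random_state=0)
W = nmf.fit_transform(X)   # (n_docs, n_topics): topic weights per article
H = nmf.components_        # (n_topics, n_terms): term weights per topic
labels = W.argmax(axis=1)  # assign each article to its strongest topic
print(labels)
```

Mapping these unsupervised topic indices to the five named categories (for the accuracy figures above) requires a small labeled set to align topics with labels.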

Supervised Approach: Histogram-Based Gradient Boosting

  • Implemented HGBC as a comparative supervised technique
  • Analyzed performance across different training data proportions (10% to 90%)
  • Observed accuracy improvements as training data increased, with optimal performance at 70%
  • Achieved 94.69% accuracy on the test set, outperforming the unsupervised approach

Model Evaluation

  • Created confusion matrices to identify classification patterns and errors
  • Submitted predictions to Kaggle for independent evaluation
  • Analyzed trade-offs between model complexity, accuracy, and potential overfitting
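A confusion matrix like the ones used above can be built directly from scikit-learn; the tiny label vectors here are illustrative, not the project's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted categories for five articles.
y_true = ["business", "sport", "tech", "sport", "business"]
y_pred = ["business", "sport", "business", "sport", "tech"]
labels = ["business", "sport", "tech"]

# Rows are true classes, columns are predicted classes; diagonal
# entries count correct predictions, off-diagonals count errors.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Reading off-diagonal cells shows which category pairs the model confuses most, which is exactly the classification-pattern analysis described above.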

Results and Insights

The analysis showed that while the unsupervised NMF approach reached strong accuracy (92.52%) without requiring labeled data, the supervised HGBC method achieved higher accuracy (94.69%) once sufficient training data was available. The project demonstrates the effectiveness of both approaches and highlights their relative strengths for text classification: NMF when labels are scarce, HGBC when they are plentiful.

Filed under: Machine Learning, NLP, spaCy