NLP Primer with Humor Detection in Texts

Objective

This mini-project evaluates the effectiveness of different text preprocessing and vectorization techniques in conjunction with three classification models (Naïve Bayes, Logistic Regression, and Decision Trees) for classifying humor in text. Using the ColBERT dataset from Kaggle, it pairs various text preprocessing and feature extraction approaches with these classifiers, serving as an introductory exploration of the rich field of Natural Language Processing (NLP).
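As a concrete illustration, here is a minimal sketch of the preprocessing and vectorization building blocks involved, using NLTK for stemming/lemmatization and scikit-learn for feature extraction. The file path `dataset.csv` and the column names `text` and `humor` are assumptions based on the Kaggle ColBERT release; adjust them to your local copy.

```python
import pandas as pd
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

# Load the ColBERT humor dataset (path and column names assumed)
df = pd.read_csv("dataset.csv")          # columns: "text", "humor"
texts, labels = df["text"], df["humor"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stem(text):
    # Crude whitespace tokenization + Porter stemming
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

def lemmatize(text):
    # WordNet lemmatization on the same simple tokens
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.lower().split())

# Two vectorization options: raw token counts vs. TF-IDF weighting
X_counts = CountVectorizer().fit_transform(texts.map(lemmatize))
X_tfidf = TfidfVectorizer().fit_transform(texts.map(stem))
```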

Goal

Identify the most effective combination of text preprocessing steps, vectorization methods, and classifiers for classifying humor in textual data. The project seeks to enhance understanding of how different NLP techniques influence model performance, providing foundational insights into text classification and sentiment analysis.
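One way to make that comparison concrete is a simple grid loop over preprocessors, vectorizers, and classifiers, timing each fit and scoring accuracy on a held-out split. The sketch below builds on the helpers from the previous snippet; the split ratio, `random_state`, and `max_iter` values are illustrative choices, not the project's exact setup.

```python
import time
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

preprocessors = {"stem": stem, "lemma": lemmatize}   # defined above
vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "naive_bayes": MultinomialNB,
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier,
}

results = []
for p_name, prep in preprocessors.items():
    cleaned = texts.map(prep)
    X_tr_txt, X_te_txt, y_tr, y_te = train_test_split(
        cleaned, labels, test_size=0.2, random_state=42
    )
    for v_name, Vec in vectorizers.items():
        vec = Vec()
        X_tr = vec.fit_transform(X_tr_txt)   # fit on train only
        X_te = vec.transform(X_te_txt)
        for c_name, Clf in classifiers.items():
            clf = Clf()
            start = time.perf_counter()
            clf.fit(X_tr, y_tr)
            elapsed = time.perf_counter() - start
            acc = accuracy_score(y_te, clf.predict(X_te))
            results.append((p_name, v_name, c_name, acc, elapsed))

# Rank combinations by accuracy to find the winner
results.sort(key=lambda r: r[3], reverse=True)
for row in results:
    print(row)
```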

Summary

[Figures: a results screenshot, a comparison graph, and a bar graph summarizing the experiments]

Conclusion

ColBERT Dataset Humor Classification: The Winning Formula

For top-notch accuracy, go with a Logistic Regression classifier using stemming and TF-IDF vectorization—it scores a solid 0.86 accuracy in just 6.19 seconds!

But if you’re looking to strike the perfect balance between speed and performance, the Naïve Bayes classifier with lemmatization and count vectorization is your best bet, delivering a nearly identical 0.8556 accuracy in just 3.31 seconds, roughly half the training time.
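For reference, the top-accuracy configuration could be reproduced with a scikit-learn pipeline along these lines. This is a sketch under the same assumptions as the earlier snippets (`texts`, `labels`, and `stem` defined above; `max_iter=1000` is an assumption, not a reported hyperparameter).

```python
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stemming + TF-IDF + Logistic Regression: the highest-accuracy combination above
stemmed = texts.map(stem)
X_train, X_test, y_train, y_test = train_test_split(
    stemmed, labels, test_size=0.2, random_state=42
)
best_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
best_model.fit(X_train, y_train)
print("accuracy:", best_model.score(X_test, y_test))

# New inputs must go through the same stemming step before prediction
print(best_model.predict([stem("why did the chicken cross the road?")]))
```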