Detecting Racism in Online French Text

 
Abstract
The proliferation of hate speech online is an increasingly pressing issue for both companies and governments. Given the unprecedented scale of such content, the rise of automatic systems capable of detecting racism seems inevitable.

The main goal of this project is to build a binary Support Vector Machine (SVM) text classification model to detect racism in French.
First, this project involved building an annotated gold standard of scraped French social media comments containing racist text for training and testing purposes.
Second, it entailed developing two parallel SVM classifiers based on the optimal combinations of training data, preprocessing steps, features, and hyperparameters. The first model was trained on the unbalanced data, while the second was trained on a subsample obtained by undersampling.
Quantitative and qualitative analyses of the models’ predictions on a held-out evaluation set are also presented.
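
The two-model setup described above can be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the project's actual code: the TF-IDF features, the choice of `LinearSVC`, the value of `C`, the random undersampling strategy, the helper names `undersample` and `build_classifier`, and the toy placeholder texts are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def undersample(texts, labels, seed=0):
    """Randomly drop examples so that every class keeps as many
    items as the smallest class (random undersampling, assumed here)."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original ordering of the kept examples
    return [texts[i] for i in keep], labels[keep].tolist()


def build_classifier():
    """TF-IDF word n-grams fed into a linear SVM (hypothetical settings)."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
        LinearSVC(C=1.0),
    )


# Toy placeholder data: 1 = flagged, 0 = not flagged (not real corpus data).
texts = [
    "message neutre sur la météo",
    "message neutre sur le sport",
    "message neutre sur la cuisine",
    "message neutre sur le cinéma",
    "message neutre sur la musique",
    "message neutre sur les voyages",
    "exemple signalé marqueur alpha",
    "exemple signalé marqueur beta",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Model 1: trained on the unbalanced data as-is.
full_model = build_classifier().fit(texts, labels)

# Model 2: trained on a class-balanced subsample.
balanced_texts, balanced_labels = undersample(texts, labels)
balanced_model = build_classifier().fit(balanced_texts, balanced_labels)

# Both models can then be compared on the same held-out comments.
held_out = ["nouveau message neutre", "nouvel exemple marqueur alpha"]
full_predictions = full_model.predict(held_out)
balanced_predictions = balanced_model.predict(held_out)
```

Undersampling discards majority-class examples, so the second model sees less data but a balanced class distribution. Comparing both models on the same held-out set shows how much that trade-off matters.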

The results are promising given the limited training set and the subtleties of language data. Further research is needed to improve the performance of classifiers for imbalanced text data.
Context
This project served as my thesis for my Master's in Artificial Intelligence at KU Leuven and my research project during my internship at Textgain.
Technologies
Python (scikit-learn, NLTK, spaCy, Pattern, pandas, NumPy, matplotlib, seaborn)