In this study, we address the task of identifying gender from the textual content of social media posts, a crucial step in detecting and mitigating counterfeit account activity. Ensuring accurate gender portrayal on digital platforms is essential for creating a secure and inclusive cyberspace. While research exists for languages such as English, Russian, and Arabic, Bangla remains underexplored. To address this, we compiled 15,000 Bangla posts from Facebook groups, profiles, pages, blogs, and forums. We trained seven traditional machine learning algorithms (NB, SVM, LR, DT, RF, SGD, KNN) and three deep learning models (MLP, LSTM, GRU), using stylometric features, Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings. Traditional models generally outperformed deep learning models, except with stylometric features. Notably, the Stochastic Gradient Descent (SGD) model with TF-IDF achieved the highest accuracy (78.33%) and F1-Score (87.67%). Additionally, Continuous Bag of Words (CBOW) outperformed Skip-Gram (SG) in training the word2vec model, with top accuracy and F1-Score of 75.13% and 79.92%, respectively. These findings represent a significant stride forward in the field of gender identification from Bangla text.
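
To make the best-performing combination more concrete, the following is a minimal sketch of TF-IDF features fed to a linear classifier trained by stochastic gradient descent. It is not the authors' implementation: the smoothed IDF formula, the logistic loss, the hyperparameters (`epochs`, `lr`), all function names, and the toy tokens and 0/1 labels are assumptions chosen purely for illustration, and no data from the study is used.

```python
import math
import random
from collections import Counter


def tfidf_fit(docs):
    """Learn a vocabulary and smoothed IDF weights from tokenised documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))  # document frequency
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    return vocab, idf


def tfidf_transform(doc, vocab, idf):
    """Return a sparse, L2-normalised TF-IDF vector as {feature_index: weight}."""
    tf = Counter(tok for tok in doc if tok in vocab)
    vec = {vocab[tok]: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {i: w / norm for i, w in vec.items()}


def sgd_train(X, y, dim, epochs=200, lr=0.5, seed=0):
    """Fit a logistic-loss linear model, updating on one example at a time."""
    rng = random.Random(seed)
    w, b = [0.0] * dim, 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for k in order:
            z = b + sum(w[i] * v for i, v in X[k].items())
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y[k]  # gradient of the log loss with respect to z
            for i, v in X[k].items():
                w[i] -= lr * g * v
            b -= lr * g
    return w, b


def predict(x, w, b):
    """Predict class 1 if the linear score is positive, else class 0."""
    return 1 if b + sum(w[i] * v for i, v in x.items()) > 0 else 0


# Hypothetical toy corpus: placeholder tokens and arbitrary 0/1 class labels.
train_docs = [
    ["alpha", "beta", "beta"],
    ["alpha", "gamma"],
    ["delta", "epsilon"],
    ["epsilon", "zeta", "delta"],
]
train_labels = [0, 0, 1, 1]

vocab, idf = tfidf_fit(train_docs)
X = [tfidf_transform(d, vocab, idf) for d in train_docs]
w, b = sgd_train(X, train_labels, dim=len(vocab))
preds = [predict(x, w, b) for x in X]
print(preds)
```

In practice, a library implementation with proper tokenisation of Bangla text, a held-out test split, and tuned regularisation would replace this toy setup; the sketch only shows how sparse TF-IDF weights and per-example gradient updates fit together.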