School of Engineering Science, College of Engineering, University of Tehran, Tehran, Iran
Abstract
Natural Language Processing (NLP) is one of the promising fields of artificial intelligence. In recent years, a large volume of text data has been generated on the Internet. This data is a valuable source of information for fields such as information retrieval and recommender systems. One practical text mining task is document classification. In this paper, we focus on Persian document classification. We introduce a new feature extraction approach that combines K-means clustering with Word2Vec to obtain semantically relevant and discriminative word representations. We call the proposed approach CC-Word2Vec (Categorical Clustering-Word2Vec) and use several classification models to compare its performance with that of other techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Latent Dirichlet Allocation (LDA). The proposed method improved the accuracy of all classifiers compared with these techniques.
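The abstract does not spell out how CC-Word2Vec combines the two components, so the following is only a generic sketch of one common way to pair K-means with word vectors, not the authors' method. It clusters word embeddings and represents each document as a normalized histogram over the resulting clusters. Everything here is an assumption made for illustration: the two-dimensional `toy_vectors` stand in for real Word2Vec embeddings, and `kmeans` and `doc_features` are hypothetical helper names.

```python
import math
import random

# Toy 2-D vectors standing in for Word2Vec embeddings (assumed values,
# not taken from the paper).
toy_vectors = {
    "goal": [0.9, 0.1], "team": [0.8, 0.2], "match": [0.85, 0.15],
    "bank": [0.1, 0.9], "stock": [0.2, 0.8], "market": [0.15, 0.85],
}

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float lists."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def doc_features(tokens, word_cluster, k):
    """Normalized histogram of cluster ids over in-vocabulary tokens."""
    counts = [0] * k
    for tok in tokens:
        if tok in word_cluster:
            counts[word_cluster[tok]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * k

words = list(toy_vectors)
labels = kmeans([toy_vectors[w] for w in words], k=2)
word_cluster = dict(zip(words, labels))

# A mostly sports-related toy document with one finance word.
features = doc_features(["team", "goal", "match", "bank"], word_cluster, k=2)
```

The resulting `features` vector could then be passed to any classifier. In practice the embeddings would come from a trained Word2Vec model and the number of clusters would be tuned; both choices are outside what the abstract states.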
Davoudi, S. and Mirzaei, S. (2020). A Semantic-based Feature Extraction Method Using Categorical Clustering for Persian Document Classification. The CSI Journal on Computer Science and Engineering, 18(1), 28-35.