SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings
Data Technologies and Applications
ISSN: 2514-9288
Article publication date: 29 April 2021
Issue publication date: 11 October 2021
Abstract
Purpose
Topic models have been widely applied to discover important information from vast amounts of unstructured data. Traditional long-text topic models such as Latent Dirichlet Allocation may suffer from the sparsity problem when dealing with short texts, which mostly come from the Web. These models also suffer from a readability problem when displaying the discovered topics. The purpose of this paper is to propose a novel model, the Sense Unit based Phrase Topic Model (SenU-PTM), that addresses both the sparsity and readability problems.
Design/methodology/approach
SenU-PTM is a novel phrase-based short-text topic model built on a two-phase framework. The first phase introduces a phrase-generation algorithm that exploits word embeddings to generate phrases from the original corpus. The second phase introduces the new concept of a sense unit, a set of semantically similar tokens, and models topics on sense units using the token vectors generated in the first phase. Finally, SenU-PTM infers topics based on these two phases.
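To make the idea of a sense unit concrete, the following is a minimal sketch of grouping tokens whose vectors are semantically close. The abstract does not specify how sense units are constructed, so everything here is an illustrative assumption and not the authors' algorithm: the greedy seed-based grouping, the cosine-similarity threshold of 0.8, the toy three-dimensional vectors and the names `cosine` and `build_sense_units` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_sense_units(vectors, threshold=0.8):
    """Greedily group tokens into sense units (hypothetical procedure).

    A token joins the first unit whose seed (its first token) is at least
    `threshold` similar to it; otherwise it starts a new unit.
    """
    units = []
    for token, vec in vectors.items():
        for unit in units:
            if cosine(vec, vectors[unit[0]]) >= threshold:
                unit.append(token)
                break
        else:
            # No existing unit is close enough, so this token seeds a new one.
            units.append([token])
    return units


# Toy token vectors; in practice these would come from trained word embeddings.
toy_vectors = {
    "neural_network": [0.9, 0.1, 0.0],
    "deep_learning": [0.8, 0.2, 0.1],
    "stock_market": [0.0, 0.9, 0.3],
    "share_price": [0.1, 0.8, 0.4],
}

units = build_sense_units(toy_vectors)
print(units)
```

In this toy setting the two machine-learning phrases fall into one unit and the two finance phrases into another. A topic model operating on such units, instead of on individual tokens, would see denser co-occurrence counts in short texts, which is the intuition behind the sparsity claim.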
Findings
Experimental results on two real-world, publicly available datasets show the effectiveness of SenU-PTM in terms of topical quality and document characterization. The results indicate that modeling topics on sense units can address the sparsity of short texts and improve the readability of topics at the same time.
Originality/value
The originality of SenU-PTM lies in the new procedure of modeling topics on the proposed sense units with word embeddings for short-text topic discovery.
Keywords
Acknowledgements
This research was funded by the National Natural Science Foundation of China [Grant No. 62002137], the Fundamental Research Funds for the Central Universities [No. JUSRP12021] and the State Key Lab. for Novel Software Technology, Nanjing University, P.R. China [No. KFKT2020B02].
Citation
Lu, H.-Y., Zhang, Y. and Du, Y. (2021), "SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings", Data Technologies and Applications, Vol. 55 No. 5, pp. 643-660. https://doi.org/10.1108/DTA-02-2021-0039
Publisher
Emerald Publishing Limited
Copyright © 2021, Emerald Publishing Limited