Toshiki Tomihira, Atsushi Otsuka, Akihiro Yamashita and Tetsuji Satoh
Abstract
Purpose
With the standardization of emojis in Unicode and the spread of social networking services, the use of emojis has become common. Emojis are highly effective at expressing emotions in sentences. Sentiment analysis in natural language processing usually relies on emotion labels that are manually assigned to sentences. By using the emojis in text posted on social media as labels, the authors can predict sentiment without manual labeling. The purpose of this paper is to propose a new model that learns from sentences using emojis as labels, collecting English and Japanese tweets from Twitter as the corpus. The authors verify and compare multiple models based on attention long short-term memory (LSTM), convolutional neural networks (CNN) and Bidirectional Encoder Representations from Transformers (BERT).
Design/methodology/approach
Using the Twitter application programming interface, the authors collected tweets containing 2,661 kinds of emoji registered as Unicode characters, for a total of 6,149,410 tweets in Japanese. First, the authors visualized the vector space that Word2Vec produces for the emojis. They found that emojis and words with similar meanings are adjacent in this space, and thereby verified that emojis can be used for sentiment analysis. Second, the models take a tweet containing an emoji as input and are trained and tested with that emoji as the label. The authors compared the BERT model with the conventional models [CNN, FastText and attention bidirectional long short-term memory (BiLSTM)] that achieved high scores in the previous study.
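The idea of using an emoji as the label of its own tweet can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name `to_labeled_examples` and the simplified emoji pattern (two Unicode ranges, no handling of multi-code-point sequences) are assumptions made only for illustration.

```python
import re

# Assumption: a deliberately simplified emoji pattern covering two common
# Unicode ranges. The study used 2,661 registered emoji; a real pipeline
# would need a complete list and handling of multi-code-point sequences.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def to_labeled_examples(tweet):
    """Turn one tweet into (text, emoji_label) training pairs.

    The emojis are removed from the text and each distinct emoji found
    becomes the label of one example. Tweets without emoji yield nothing.
    """
    found = EMOJI_PATTERN.findall(tweet)
    if not found:
        return []
    # Strip the emojis, then normalize the whitespace left behind.
    text = " ".join(EMOJI_PATTERN.sub(" ", tweet).split())
    # dict.fromkeys keeps first-seen order while dropping repeats.
    return [(text, emoji) for emoji in dict.fromkeys(found)]
```

Each resulting pair could then be fed to any text classifier; which tokenizer and model to use is outside the scope of this sketch.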
Findings
By visualizing the Word2Vec vector space, the authors found that emojis and words with similar meanings are adjacent, and verified that emojis can be used for sentiment analysis. The authors obtained higher scores with the BERT models than with the conventional models, and the experiments demonstrate this improvement in two languages. In general, emoji prediction is greatly influenced by context, and the score may be lowered when the meaning is misunderstood. By using BERT, which is based on a bidirectional transformer, the authors can take the context into account.
Practical implications
When a word is typed with an input method editor (IME), emojis appear among the output candidates. Current IMEs consider only the most recently input word, whereas the approach in this study makes it possible to recommend emojis that reflect the context of the input sentence. Therefore, the research can be used to improve IME performance in the future.
Originality/value
In the paper, the authors focus on multilingual emoji prediction. This is the first attempt to compare emoji prediction between Japanese and English. It is also the first attempt to use the transformer-based BERT model for predicting a limited set of emojis, although the transformer is known to be effective for various NLP tasks. The authors found that a bidirectional transformer is suitable for emoji prediction.
Shuhei Yamamoto, Kei Wakabayashi, Tetsuji Satoh, Yuri Nozaki and Noriko Kando
Abstract
Purpose
The purpose of this paper is to clarify the long-term characteristics of growth users in order to strategically collect a large number of tweets from specific users. Twitter reflects events and trends in users’ real lives because many of them post tweets related to their experiences. Many studies have succeeded in detecting events along with real-life information from a large number of tweets by treating users as social sensors. To collect a large number of tweets from specific users for successful Twitter studies, the authors have to know the characteristics of users who are active over long periods of time.
Design/methodology/approach
The authors explore the status of users who were active in 2012 and classify them into three statuses: Dead, Lock and Alive. Based on the differences between the numbers of tweets in 2012 and 2016, the authors further classify Alive users into three types: Eraser, Slumber and Growth. The authors analyze the characteristic feature values observed in each user's behavior and provide findings for each status/type based on Gaussian mixture model clustering and point-wise mutual information.
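Point-wise mutual information (PMI) measures how much more often a feature and a user type co-occur than they would if they were independent: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The following Python sketch shows the computation on toy data. The function name `pmi_table`, the feature names and the input layout are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Compute PMI in bits for every observed (feature, user_type) pair.

    `pairs` is an iterable of (feature, user_type) observations, one per
    user-feature occurrence. This input layout is an assumption made for
    illustration.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)                      # counts of (feature, type)
    feature_counts = Counter(f for f, _ in pairs)
    type_counts = Counter(t for _, t in pairs)
    return {
        (f, t): math.log2(
            (count / n) / ((feature_counts[f] / n) * (type_counts[t] / n))
        )
        for (f, t), count in joint.items()
    }


# Toy observations (hypothetical, not data from the study).
observations = (
    [("reciprocal", "Growth")] * 2 + [("not_retweeted", "Eraser")] * 2
)
table = pmi_table(observations)
```

A positive value indicates that the feature is observed with that type more often than chance would predict. Pairs that never co-occur are simply absent from the table.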
Findings
From their experimental evaluations, the authors found that active users dropped out more easily than inactive users, and that users who engaged in reciprocal communications often became Growth type. The authors also found that active users and users who were not retweeted by other users often became Eraser type. The authors’ proposed methods predicted Growth/Eraser-type users more effectively than the logistic regression model. From these results, the authors clarified the effectiveness of five feature values per active hour for detecting intended Twitter user growth in order to strategically collect a large number of tweets.
Originality/value
The authors focus on user growth prediction. To appropriately estimate users who have potential for growth, they collect a large number of users and explore their status and growth after three years. The research quantitatively clarifies the characteristics of growth users by clustering with robust feature values and reports the findings obtained by the analysis. The authors then propose a prediction method for growth users and evaluate its effectiveness.
Shuhei Yamamoto, Kei Wakabayashi, Noriko Kando and Tetsuji Satoh
Abstract
Purpose
Many Twitter users post tweets that are related to their particular interests. Users can also collect information by following other users. One approach to clarifying user interests is to assign tags to users. A user tagging method is important for discovering candidate users with similar interests. This paper aims to propose a new user tagging method using time series data of the number of posted tweets.
Design/methodology/approach
The authors' hypothesis concerns the relationship between a user’s interests and the posting times of tweets: users post more tweets when events related to their interests occur than at ordinary times. The authors treat hashtags as tags that label users and observe their occurrence counts at each timestamp. The authors extract burst timestamps using Kleinberg’s burst enumeration algorithm and estimate the burst levels. The authors treat the burst levels as term frequencies in documents and calculate scores using typical methods such as cosine similarity, Naïve Bayes and term frequency-inverse document frequency (TF-IDF).
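As a rough illustration of scoring with burst levels, the Python sketch below ranks hashtags for one user by the cosine similarity between the user's burst-level series and each hashtag's burst-level series over the same timestamps. It assumes the burst levels have already been estimated (the burst enumeration step itself is not shown), and the function names and the pairing of series are assumptions rather than the authors' exact formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # An all-zero series has no direction; treat its similarity as 0.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_hashtags(user_bursts, hashtag_bursts):
    """Rank hashtags by similarity to a user's burst levels.

    `user_bursts` is a list of burst levels, one per timestamp;
    `hashtag_bursts` maps each hashtag to a list over the same timestamps.
    Returns (hashtag, score) pairs, best match first.
    """
    scores = {
        tag: cosine(user_bursts, levels)
        for tag, levels in hashtag_bursts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical burst levels over four timestamps.
ranked = rank_hashtags(
    [0, 2, 0, 1],
    {"#soccer": [0, 2, 0, 1], "#weather": [1, 0, 1, 0]},
)
```

Because the inputs are only counts per timestamp, nothing in this scoring depends on the language of the tweets.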
Findings
From the experimental evaluations, the authors demonstrate the high efficiency of the tagging method. Naïve Bayes and cosine similarity are particularly suitable for the user tagging and tag score calculation tasks, respectively. Some users, whose hashtags were appropriately estimated by the proposed methods, had a higher maximum number of tweets than other users.
Originality/value
Many approaches estimate user interest based on the terms in tweets and apply graph theory to structures such as following networks. The authors propose a new estimation method that uses time series data of the number of tweets. The merits of estimating user interest from time series data are that the method does not depend on language and can reduce calculation costs compared with the above-mentioned approaches because it uses fewer features.
Shuhei Yamamoto and Tetsuji Satoh
Abstract
Purpose
This paper aims to propose a multi-label method that estimates appropriate aspects for unknown tweets using a two-phase estimation method. Many Twitter users share daily events and opinions. Some beneficial comments are posted on real-life aspects such as eating, traffic and weather. A post such as “The train is not coming” is categorized in the Traffic aspect, whereas a tweet such as “The train is delayed by heavy rain” is categorized in both the Traffic and Weather aspects.
Design/methodology/approach
The proposed method consists of two phases. In the first, many topics are extracted from a large collection of tweets using Latent Dirichlet Allocation (LDA). In the second, associations among many topics and fewer aspects are built using a small set of labeled tweets. The aspect scores for tweets are calculated using the associations based on the extracted terms. Appropriate aspects are labeled for unknown tweets by averaging the aspect scores.
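The second phase can be sketched in Python as follows, assuming the first phase has already produced a term-to-topic mapping. The data structures (`term_topic`, `topic_aspect`), the hard assignment of one topic per term and the threshold value are all hypothetical simplifications, not the paper's actual formulation.

```python
def aspect_scores(terms, term_topic, topic_aspect):
    """Average the aspect associations of the topics behind a tweet's terms.

    `term_topic` maps a term to its topic id (assumed to come from LDA);
    `topic_aspect` maps a topic id to {aspect: association weight}.
    Terms with no known topic are ignored.
    """
    totals = {}
    counted = 0
    for term in terms:
        topic = term_topic.get(term)
        if topic is None:
            continue
        counted += 1
        for aspect, weight in topic_aspect.get(topic, {}).items():
            totals[aspect] = totals.get(aspect, 0.0) + weight
    if counted == 0:
        return {}
    return {aspect: total / counted for aspect, total in totals.items()}


def label_tweet(terms, term_topic, topic_aspect, threshold=0.3):
    """Return every aspect whose averaged score reaches the threshold.

    The threshold of 0.3 is an arbitrary illustrative value.
    """
    scores = aspect_scores(terms, term_topic, topic_aspect)
    return sorted(a for a, s in scores.items() if s >= threshold)


# Hypothetical toy associations.
term_topic = {"train": 0, "delayed": 0, "rain": 1}
topic_aspect = {0: {"Traffic": 1.0}, 1: {"Weather": 1.0}}
labels = label_tweet(["train", "delayed", "heavy", "rain"], term_topic, topic_aspect)
```

Returning every aspect above a threshold, rather than only the top-scoring one, is what allows a single tweet to receive multiple labels.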
Findings
Using a large number of actual tweets, the experimental evaluations demonstrate the high efficiency of the proposed multi-label classification method. It is confirmed that high F-measure aspects are strongly associated with topics that have high relevance. Low F-measure aspects are associated with topics that are connected to many other aspects.
Originality/value
The proposed method features two-phase semi-supervised learning. Many topics are extracted using an unsupervised learning model called LDA. Associations among many topics and fewer aspects are built using labeled tweets.
Yutaro Yamaguchi, Shuhei Yamamoto and Tetsuji Satoh
Abstract
Purpose
The purpose of this paper is to activate the posts of latent users by modeling user behavior as transitions among clusters that represent particular posting activities. Twitter has rapidly spread and become an easy and convenient microblog that enables users to exchange instant text messages called tweets. There are many latent users whose posting activities have decreased.
Design/methodology/approach
Under this model, two kinds of time-series analysis methods are proposed to clarify the lifecycles of Twitter users. In the first one, every user belongs, at each time slot, to a cluster defined by several features and moves among the clusters over time. In the second one, the posting activities of Twitter users are analyzed by the number of tweets, which varies with time.
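The first method, in which users move among clusters over time, lends itself to a state transition view. The Python sketch below estimates transition probabilities from per-user sequences of cluster labels, one label per time slot. The cluster names and the function name `transition_probabilities` are hypothetical, and the clustering step that assigns the labels is assumed to have been done already.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next cluster | current cluster) from label sequences.

    `sequences` holds one list of cluster labels per user, ordered by
    time slot. Only clusters that appear as a source are included.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair each slot with the following slot to count transitions.
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        current: {
            following: n / sum(nexts.values())
            for following, n in nexts.items()
        }
        for current, nexts in counts.items()
    }


# Hypothetical label sequences for two users over three time slots.
probs = transition_probabilities(
    [["active", "active", "quiet"], ["active", "quiet", "quiet"]]
)
```

Computing such tables separately for long-term and short-term users is one way to compare their state transition diagrams.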
Findings
An evaluation using a large set of actual tweets demonstrated the effectiveness of the proposed methods. The authors found a big difference in the state transition diagrams between long- and short-term users. Analysis of short-term users yields useful knowledge for encouraging continued Twitter use.
Originality/value
An efficient user behavior model, which describes transitions of posting activities, is proposed. Two kinds of longitudinal analysis methods are evaluated using a large number of actual tweets.