Ammara Zamir, Hikmat Ullah Khan, Waqar Mehmood, Tassawar Iqbal and Abubakker Usman Akram
Abstract
Purpose
This research study proposes a feature-centric spam email detection model (FSEDM) based on content, sentiment, semantic, user and spam-lexicon feature sets. The purpose of this study is to exploit the role of sentiment features, along with the other proposed features, in evaluating the classification accuracy of machine learning algorithms for spam email detection.
Design/methodology/approach
Existing studies primarily exploit a content-based feature engineering approach; however, only a limited number of features are considered. In this regard, this research study proposes a feature-centric framework (FSEDM) based on existing and novel features of the email data set, which are extracted after pre-processing. Afterwards, diverse supervised learning techniques are applied to the proposed features in conjunction with feature selection techniques such as information gain, gain ratio and Relief-F to rank the most prominent features and classify the emails as spam or ham (not spam).
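As a rough illustration of how information gain and gain ratio can rank features, the sketch below scores two discrete features against spam/ham labels. It is not the authors' implementation: the feature names and the six-email toy data are hypothetical, and the features are assumed to be already discretised.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalised by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: two binary features for six emails.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
features = {
    "has_spam_word": [1, 1, 1, 0, 0, 0],       # separates the classes perfectly
    "negative_sentiment": [1, 0, 1, 0, 1, 0],  # only weakly related to the label
}

# Rank feature names from most to least informative.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], labels),
    reverse=True,
)
```

In a pipeline of this kind, the top-ranked features would then be passed to the classifiers; Relief-F uses a different, neighbour-based scoring and is not covered by this sketch.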
Findings
Analysis and experimental results indicated that the proposed model with sentiment analysis is a competitive approach for spam email detection. Using the proposed model, a deep neural network applied with sentiment features outperformed the other classifiers, with a classification accuracy of up to 97.2%.
Originality/value
This research is novel in that no previous research focuses on sentiment analysis in conjunction with other email features for the detection of spam emails.
Fung Yuen Chin, Kong Hoong Lem and Khye Mun Wong
Abstract
Purpose
The number of features in handwritten digit data is often very large due to the different aspects of personal handwriting, leading to high-dimensional data. Therefore, the employment of a feature selection algorithm becomes crucial for successful classification modeling, because the inclusion of irrelevant or redundant features can mislead the modeling algorithms, resulting in overfitting and a decrease in efficiency.
Design/methodology/approach
Minimum redundancy and maximum relevance (mRMR) and recursive feature elimination (RFE) are two frequently used feature selection algorithms. While mRMR is capable of identifying a subset of features that are highly relevant to the targeted classification variable, it still carries the weakness of capturing redundant features. On the other hand, although RFE can effectively eliminate the less important features and exclude redundant features, it is flawed by the fact that the features it selects are not ranked by importance.
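To make the relevance/redundancy trade-off concrete, here is a minimal sketch of greedy mRMR in its common "difference" form: relevance to the target minus mean redundancy with the features already chosen. It assumes discrete features and uses hypothetical toy data; it is not the hybrid mRMR + SVMRFE procedure of the study.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr(features, target, k):
    """Greedily pick k features maximising relevance minus mean redundancy."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: f_a_copy duplicates f_a, f_b is independent of f_a.
target = [0, 0, 0, 0, 0, 0, 1, 1]
features = {
    "f_a": [0, 0, 0, 0, 1, 1, 1, 1],
    "f_a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "f_b": [0, 0, 1, 1, 0, 0, 1, 1],
}

selected = mrmr(features, target, k=2)
```

In this toy case the duplicated feature is penalised by its redundancy with `f_a`, so the independent feature `f_b` is preferred as the second pick.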
Findings
The hybrid method was exemplified in a binary classification between digits “4” and “9” and between digits “6” and “8” from a multiple features dataset. The result showed that the hybrid mRMR + support vector machine recursive feature elimination (SVMRFE) is better than both the sole support vector machine (SVM) and mRMR.
Originality/value
In view of the respective strengths and deficiencies of mRMR and RFE, this study combined the two methods and used an SVM as the underlying classifier, anticipating that mRMR would make an excellent complement to SVMRFE.
Jiunn-Liang Guo, Hei-Chia Wang and Ming-Way Lai
Abstract
Purpose
The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents, namely e-books in an online library system. The main idea is to automatically identify discourse features in order to improve the feature selection process, rather than focussing on the size of the corpus.
Design/methodology/approach
The proposed framework intends to automatically identify the discourse segments within e-books, capture the discourse subtopics that are cohesively expressed in those segments and treat these subtopics as informative and prominent features. The selected set of features is then used to train a support vector machine and perform the e-book classification task.
Findings
The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features lead to better performance than two conventional feature selection techniques: TFIDF and mutual information. It also demonstrates that discourse features play important roles among textual features, especially for large documents such as e-books.
Research limitations/implications
Automatically extracted subtopic features cannot be entered directly into the feature selection (FS) process but require control of the threshold.
Practical implications
The proposed technique demonstrates the promise of using discourse analysis to enhance the classification of large digital documents such as e-books, compared with conventional techniques.
Originality/value
A new FS technique is proposed that can inspect the narrative structure of large documents; it is new to the text classification domain. The other contribution is that it encourages the consideration of discourse information in future text analysis by providing more evidence through evaluation of the results. The proposed system can be integrated into other library management systems.
Ammara Zamir, Hikmat Ullah Khan, Tassawar Iqbal, Nazish Yousaf, Farah Aslam, Almas Anjum and Maryam Hamdani
Abstract
Purpose
This paper aims to present a framework to detect phishing websites using a stacking model. Phishing is a type of fraud used to access users’ credentials. The attackers access users’ personal and sensitive information for monetary purposes. Phishing affects diverse fields, such as e-commerce, online business, banking and digital marketing, and is ordinarily carried out by sending spam emails and developing websites that closely resemble the original websites. As people surf the targeted website, the phishers hijack their personal information.
Design/methodology/approach
Features of the phishing data set are analysed using feature selection techniques including information gain, gain ratio, Relief-F and recursive feature elimination (RFE). Two features are proposed combining the strongest and weakest attributes. Principal component analysis with diverse machine learning algorithms, including random forest (RF), neural network (NN), bagging, support vector machine, Naïve Bayes and k-nearest neighbour (kNN), is applied to the proposed and remaining features. Afterwards, two stacking models, Stacking1 (RF + NN + Bagging) and Stacking2 (kNN + RF + Bagging), are applied by combining the highest scoring classifiers to improve the classification accuracy.
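The general idea of stacking, in which a meta-learner is trained on the predictions of several base classifiers, can be sketched as follows. Everything here is an assumption made for illustration: the base learners are simple one-feature threshold rules standing in for RF, NN and bagging, the meta-learner is a lookup table, and the data are hypothetical. A held-out split is used so the meta-learner does not see the rows the base learners were fitted on.

```python
from collections import Counter, defaultdict


def fit_stump(rows, labels, index):
    """Base learner: threshold one feature at the midpoint of the class means."""
    pos = [r[index] for r, y in zip(rows, labels) if y == 1]
    neg = [r[index] for r, y in zip(rows, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    high_is_positive = mean_pos >= mean_neg

    def predict(row):
        return int((row[index] > threshold) == high_is_positive)

    return predict


def fit_meta(base_models, rows, labels):
    """Meta-learner: majority label for each observed pattern of base predictions."""
    votes = defaultdict(Counter)
    for row, y in zip(rows, labels):
        votes[tuple(m(row) for m in base_models)][y] += 1
    table = {p: counts.most_common(1)[0][0] for p, counts in votes.items()}

    def predict(row):
        pattern = tuple(m(row) for m in base_models)
        # Unseen pattern: fall back to a plain majority vote of the base models.
        return table.get(pattern, int(sum(pattern) * 2 > len(pattern)))

    return predict


# Hypothetical data: three numeric features per website, label 1 = phishing.
base_rows = [(0.9, 0.8, 0.2), (0.8, 0.7, 0.9), (0.7, 0.9, 0.1),
             (0.2, 0.1, 0.8), (0.1, 0.3, 0.3), (0.3, 0.2, 0.7)]
base_labels = [1, 1, 1, 0, 0, 0]
meta_rows = [(0.8, 0.9, 0.3), (0.9, 0.6, 0.8), (0.2, 0.2, 0.9), (0.3, 0.1, 0.2)]
meta_labels = [1, 1, 0, 0]

# Level 0: fit one base learner per feature. Level 1: fit the meta-learner
# on the base learners' predictions for the held-out rows.
base_models = [fit_stump(base_rows, base_labels, i) for i in range(3)]
stacked = fit_meta(base_models, meta_rows, meta_labels)
```

With real classifiers, the same two-level structure applies, and cross-validated predictions are commonly used in place of a single held-out split.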
Findings
The proposed features played an important role in improving the accuracy of all the classifiers. The results show that RFE plays an important role in removing the least important feature from the data set. Furthermore, Stacking1 (RF + NN + Bagging) outperformed all other classifiers in detecting phishing websites, with a classification accuracy of 97.4%.
Originality/value
This research is novel in that no previous research focusses on using a feed-forward NN and ensemble learners for detecting phishing websites.
Sandeep Kumar Hegde and Monica R. Mundada
Abstract
Purpose
Chronic diseases are considered one of the serious concerns and threats to public health across the globe. Diseases such as chronic diabetes mellitus (CDM), cardiovascular disease (CVD) and chronic kidney disease (CKD) are major chronic diseases responsible for millions of deaths. Each of these diseases is considered a risk factor for the other two. Therefore, noteworthy attention is being paid to reducing the risk of these diseases. A gigantic amount of medical data is generated in digital form from smart healthcare appliances in the current era. Although numerous machine learning (ML) algorithms have been proposed for the early prediction of chronic diseases, these algorithmic models are neither generalized nor adaptive when applied to new disease datasets. Hence, these algorithms have to process a huge amount of disease data iteratively until the model converges. This limitation may make it difficult for ML models to fit and may lead to imprecise results. A single algorithm may not yield accurate results. Nonetheless, an ensemble of classifiers built from multiple models that works on a voting principle has been successfully applied to solve many classification tasks. The purpose of this paper is to make early predictions of chronic diseases using a hybrid generative regression based deep intelligence network (HGRDIN) model.
Design/methodology/approach
In this paper, a generative regression (GR) model is used in combination with a deep neural network (DNN) for the early prediction of chronic disease. The GR model obtains prior knowledge about the labelled data by analyzing the correlation between features and class labels. Hence, the weight assignment process of the DNN is influenced by the relationship between attributes rather than by random assignment. The knowledge obtained through these processes is passed as input to the DNN for further prediction. Since the inference about the input data instances is drawn at the DNN through the GR model, the model is named the hybrid generative regression-based deep intelligence network (HGRDIN).
Findings
The credibility of the implemented approach is rigorously validated using various metrics such as accuracy, precision, recall, F score and area under the curve (AUC) score. During the training phase, the proposed algorithm is constantly regularized using the elastic net regularization technique and tuned using hyperparameters such as momentum and learning rate to minimize the misprediction rate. The experimental results illustrate that the proposed approach predicted chronic disease with minimal error by avoiding possible overfitting and local minima problems. The result obtained with the proposed approach is also compared with various traditional approaches.
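For reference, the threshold-based measures named above (accuracy, precision, recall and F score) can be computed from the confusion counts as in the sketch below. The function name and the toy predictions are hypothetical, the F score shown is the balanced F1 variant, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical predictions for eight patients (1 = disease present).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```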
Research limitations/implications
Diagnostic data are usually multi-dimensional in nature, so the performance of ML algorithms degrades because of overfitting and the curse of dimensionality. The experiment achieved an average accuracy of 95%. Hence, further analysis can be carried out to improve predictive accuracy by overcoming the curse of dimensionality.
Practical implications
The proposed ML model can mimic the behavior of the doctor's brain. These algorithms have the capability to replace clinical tasks. The accurate result obtained through the innovative algorithms can free the physician from the mundane care and practices so that the physician can focus more on the complex issues.
Social implications
Utilizing the proposed predictive model at the decision-making level for the early prediction of the disease is considered as a promising change towards the healthcare sector. The global burden of chronic disease can be reduced at an exceptional level through these approaches.
Originality/value
In the proposed HGRDIN model, a transfer learning approach is used in which the knowledge acquired through the GR process is applied to the DNN. The GR process identifies the possible relationship between the dependent and independent feature variables by mapping the chronic data instances to their corresponding target classes before they are passed as input to the DNN. The results of the experiments illustrated that the proposed approach obtained superior performance to the existing conventional techniques in terms of various validation parameters.
Abstract
Purpose
The purpose of this paper is to present research on control methods for man‐machine systems with a brain machine interface (BMI). A concrete target system is, for instance, a car cruising system.
Design/methodology/approach
The improved receding horizon control (RHC) method for sampled‐data systems is used together with an adaptive digital‐to‐analog (DA) converter that switches the sampling functions according to the system status. A feature selection method for the BMI signals, based on kernel support vector machines with backward stepwise selection, is also used.
Findings
This paper proposes a new improved RHC method with an adaptive DA converter for BMI‐based systems. Simulations illustrate that the proposed method is useful and effective for systems in which switching of control laws is indispensable.
Research limitations/implications
Although the proposed method is effective for BMI‐based systems with switching of control laws, a faster algorithm for RHC will be needed to apply it to man‐machine systems with the BMI in practical use.
Practical implications
The basic concept or framework of the proposed method can be used for real man‐machine systems with the BMI, for example, car cruising systems and wheelchair systems.
Originality/value
The paper contributes to the development of a new effective control method for BMI‐based man‐machine systems.
Yuting Jiang, Shengli Deng, Hongxiu Li and Yong Liu
Abstract
Purpose
The purposes of this paper are to (1) explore how personality traits pertaining to the dominance, influence, steadiness and compliance (DISC) model manifest themselves in terms of user interaction behavior on social media and (2) examine whether social interaction data on social media platforms can predict user personality.
Design/methodology/approach
Social interaction data was collected from 198 users of Sina Weibo, a popular social media platform in China. Their personality traits were also measured via questionnaire. Machine learning techniques were applied to predict the personality traits based on the social interaction data.
Findings
The results demonstrated that the proposed classifiers had high prediction accuracy, indicating that our approach is reliable and can be used with social interaction data on social media platforms to predict user personality. “Reposting,” “being reposted,” “commenting” and “being commented on” were found to be the key interaction features that reflected Weibo users' personalities, whereas “liking” was not found to be a key feature.
Originality/value
The findings of this study are expected to enrich personality prediction research based on social media data and to provide insights into the potential of employing social media data for the purpose of personality prediction in the context of the Weibo social media platform in China.
Farshad Faezy Razi and Seyed Hooman Shariat
Abstract
Purpose
The purpose of this paper is twofold: the selection of project portfolios through hybrid artificial neural network algorithms, feature selection based on grey relational analysis, decision tree and regression; and the identification of the features affecting project portfolio selection using the artificial neural network algorithm, decision tree and regression. The authors also aim to classify the available options using the decision tree algorithm.
Design/methodology/approach
In order to achieve the research goals, a project-oriented organization was selected and studied. In all, 49 project management indicators were chosen from A Guide to the Project Management Body of Knowledge (PMBOK Guide), and the most important indicators were identified using a feature selection algorithm and decision tree. After the extraction of rules, decision rule-based multi-criteria decision making matrices were produced. Each matrix was ranked through grey relational analysis, similarity to ideal solution method and multi-criteria optimization. Finally, a model for choosing the best ranking method was designed and implemented using the genetic algorithm. To analyze the responses, stability of the classes was investigated.
Findings
The results showed that projects ranked based on neural network weights by the grey relational analysis method prove to be better options for the selection of a project portfolio. The process of identification of the features affecting project portfolio selection resulted in the following factors: scope management, project charter, project management plan, stakeholders and risk.
Originality/value
This study presents the most effective features affecting project portfolio selection which is highly impressive in organizational decision making and must be considered seriously. Deploying sensitivity analysis, which is an innovation in such studies, played a constructive role in examining the accuracy and reliability of the proposed models, and it can be firmly argued that the results have had an important role in validating the findings of this study.
Gabrijela Dimic, Dejan Rancic, Nemanja Macek, Petar Spalevic and Vida Drasute
Abstract
Purpose
This paper aims to deal with the previously unknown prediction accuracy of students’ activity pattern in a blended learning environment.
Design/methodology/approach
To extract the most relevant activity feature subset, different feature-selection methods were applied. For different cardinality subsets, classification models were used in the comparison.
Findings
Experimental evaluation contradicts the hypothesis that reducing feature vector dimensionality increases prediction accuracy.
Research limitations/implications
Improving prediction accuracy in the described learning environment was based on applying the synthetic minority oversampling technique (SMOTE), which affected the results of the correlation-based feature-selection method.
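The core idea of synthetic minority oversampling is to create new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration under stated assumptions: base points are drawn at random rather than iterated over, distances are Euclidean, and the function name, parameters and data are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples with two activity features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote(minority, n_new=5)
```

Because the synthetic points are combinations of existing ones, oversampling of this kind can change feature correlations, which is one plausible reason a correlation-based feature-selection method would be affected.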
Originality/value
The major contribution of the research is the proposed methodology for selecting the optimal low-cardinality subset of students’ activities and a significant improvement in prediction accuracy in a blended learning environment.
Tsung-Yi Chen, Meng-Che Tsai and Yuh-Min Chen
Abstract
Purpose
For an enterprise, it is essential to win as many customers as possible. The key to successfully winning customers is often determined by understanding the personality characteristics of the object of communication in order to employ an effective communication strategy. An enterprise needs to obtain the personality information of target or potential customers. However, the traditional method for personality evaluation is extremely costly in terms of time and labor, and it cannot acquire customer personality information without their awareness. Therefore, the manner in which to effectively conduct automated personality predictions for a large number of objects is an important issue. The paper aims to discuss these issues.
Design/methodology/approach
The diverse social media that have emerged in recent years represent a digital platform on which users can publicly express themselves and interact with others. Thus, social media may be able to serve the needs of automated personality predictions. Based on user data from Facebook, the main social media platform around the world, this research developed a method for predicting personality types based on interaction logs.
Findings
Experimental results show that the Naïve Bayes classification algorithm combined with a feature selection algorithm produces the best performance for predicting personality types, with 70-80 percent accuracy.
Research limitations/implications
In this research, the dominance, inducement, submission, and compliance (DISC) theory was used to determine personality types. Some specific limitations were encountered. As Facebook was used as the main data source, it was necessary to obtain related data via Facebook’s API (FB API). However, the data types accessible via FB API are very limited.
Practical implications
This research serves to build a universal model for social media interaction, and can be used to propose an efficient method for designing interaction features.
Originality/value
This research has developed an approach for automatically predicting the personality types of network users based on their Facebook interactions.