Karlo Puh and Marina Bagić Babac
As the tourism industry becomes more vital for the success of many economies around the world, the importance of technology in tourism grows daily. Alongside increasing tourism…
Abstract
Purpose
As the tourism industry becomes more vital for the success of many economies around the world, the importance of technology in tourism grows daily. Alongside increasing tourism importance and popularity, the amount of significant data grows, too. On daily basis, millions of people write their opinions, suggestions and views about accommodation, services, and much more on various websites. Well-processed and filtered data can provide a lot of useful information that can be used for making tourists' experiences much better and help us decide when selecting a hotel or a restaurant. Thus, the purpose of this study is to explore machine and deep learning models for predicting sentiment and rating from tourist reviews.
Design/methodology/approach
This paper used machine learning models such as Naïve Bayes, support vector machines (SVM), convolutional neural network (CNN), long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM) for extracting sentiment and ratings from tourist reviews. These models were trained to classify reviews into positive, negative, or neutral sentiment, and into one to five grades or stars. Data used for training the models were gathered from TripAdvisor, the world's largest travel platform. The models based on multinomial Naïve Bayes (MNB) and SVM were trained using the term frequency-inverse document frequency (TF-IDF) for word representations while deep learning models were trained using global vectors (GloVe) for word representation. The results from testing these models are presented, compared and discussed.
Findings
The performance of machine and learning models achieved high accuracy in predicting positive, negative, or neutral sentiments and ratings from tourist reviews. The optimal model architecture for both classification tasks was a deep learning model based on BiLSTM. The study’s results confirmed that deep learning models are more efficient and accurate than machine learning algorithms.
Practical implications
The proposed models allow for forecasting the number of tourist arrivals and expenditure, gaining insights into the tourists' profiles, improving overall customer experience, and upgrading marketing strategies. Different service sectors can use the implemented models to get insights into customer satisfaction with the products and services as well as to predict the opinions given a particular context.
Originality/value
This study developed and compared different machine learning models for classifying customer reviews as positive, negative, or neutral, as well as predicting ratings with one to five stars based on a TripAdvisor hotel reviews dataset that contains 20,491 unique hotel reviews.
Details
Keywords
This study aims to find suitable replacements for hypothesis testing and variable-importance measures.
Abstract
Purpose
This study aims to find suitable replacements for hypothesis testing and variable-importance measures.
Design/methodology/approach
This study explores under-used predictive methods.
Findings
The study's hypothesis testing can and should be replaced by predictive methods. It is the only way to know if models have any value.
Originality/value
This is the first time predictive methods have been used to demonstrate measure and variable importance. Hypothesis testing can never prove the goodness of models. Only predictive methods can.
Details
Keywords
Rahila Umer, Teo Susnjak, Anuradha Mathrani and Suriadi Suriadi
The purpose of this paper is to propose a process mining approach to help in making early predictions to improve students’ learning experience in massive open online courses…
Abstract
Purpose
The purpose of this paper is to propose a process mining approach to help in making early predictions to improve students’ learning experience in massive open online courses (MOOCs). It investigates the impact of various machine learning techniques in combination with process mining features to measure effectiveness of these techniques.
Design/methodology/approach
Student’s data (e.g. assessment grades, demographic information) and weekly interaction data based on event logs (e.g. video lecture interaction, solution submission time, time spent weekly) have guided this design. This study evaluates four machine learning classification techniques used in the literature (logistic regression (LR), Naïve Bayes (NB), random forest (RF) and K-nearest neighbor) to monitor weekly progression of students’ performance and to predict their overall performance outcome. Two data sets – one, with traditional features and second, with features obtained from process conformance testing – have been used.
Findings
The results show that techniques used in the study are able to make predictions on the performance of students. Overall accuracy (F1-score, area under curve) of machine learning techniques can be improved by integrating process mining features with standard features. Specifically, the use of LR and NB classifiers outperforms other techniques in a statistical significant way.
Practical implications
Although MOOCs provide a platform for learning in highly scalable and flexible manner, they are prone to early dropout and low completion rate. This study outlines a data-driven approach to improve students’ learning experience and decrease the dropout rate.
Social implications
Early predictions based on individual’s participation can help educators provide support to students who are struggling in the course.
Originality/value
This study outlines the innovative use of process mining techniques in education data mining to help educators gather data-driven insight on student performances in the enrolled courses.
Details
Keywords
Important research once thought unassailable has failed to replicate. Not just in economics, but in all science. The problem is therefore not in dispute nor are some of the…
Abstract
Purpose
Important research once thought unassailable has failed to replicate. Not just in economics, but in all science. The problem is therefore not in dispute nor are some of the causes, like low power, selective reporting, the file drawer effect, publicly unavailable data and so forth. Some partially worthy solutions have already been offered, like pre-registering hypotheses and data analysis plans.
Design/methodology/approach
This is a review paper on the replication crisis, which is by now very well known.
Findings
This study offers another partial solution, which is to remind researchers that correlation does not logically imply causation. The effect of this reminder is to eschew “significance” testing, whether in frequentist or Bayesian form (like Bayes factors) and to report models in predictive form, so that anybody can check the veracity of any model. In effect, all papers could undergo replication testing.
Originality/value
The author argues that this, or any solution, will never eliminate all errors.
Details
Keywords
Loan default risk or credit risk evaluation is important to financial institutions which provide loans to businesses and individuals. Loans carry the risk of being defaulted. To…
Abstract
Purpose
Loan default risk or credit risk evaluation is important to financial institutions which provide loans to businesses and individuals. Loans carry the risk of being defaulted. To understand the risk levels of credit users (corporations and individuals), credit providers (bankers) normally collect vast amounts of information on borrowers. Statistical predictive analytic techniques can be used to analyse or to determine the risk levels involved in loans. This paper aims to address the question of default prediction of short-term loans for a Tunisian commercial bank.
Design/methodology/approach
The authors have used a database of 924 files of credits granted to industrial Tunisian companies by a commercial bank in the years 2003, 2004, 2005 and 2006. The naive Bayesian classifier algorithm was used, and the results show that the good classification rate is of the order of 63.85 per cent. The default probability is explained by the variables measuring working capital, leverage, solvency, profitability and cash flow indicators.
Findings
The results of the validation test show that the good classification rate is of the order of 58.66 per cent; nevertheless, the error types I and II remain relatively high at 42.42 and 40.47 per cent, respectively. A receiver operating characteristic curve is plotted to evaluate the performance of the model. The result shows that the area under the curve criterion is of the order of 69 per cent.
Originality/value
The paper highlights the fact that the Tunisian central bank obliged all commercial banks to conduct a survey study to collect qualitative data for better credit notation of the borrowers.
Propósito
El riesgo de incumplimiento de préstamos o la evaluación del riesgo de crédito es importante para las instituciones financieras que otorgan préstamos a empresas e individuos. Existe el riesgo de que el pago de préstamos no se cumpla. Para entender los niveles de riesgo de los usuarios de crédito (corporaciones e individuos), los proveedores de crédito (banqueros) normalmente recogen gran cantidad de información sobre los prestatarios. Las técnicas analíticas predictivas estadísticas pueden utilizarse para analizar o determinar los niveles de riesgo involucrados en los préstamos. En este artículo abordamos la cuestión de la predicción por defecto de los préstamos a corto plazo para un banco comercial tunecino.
Diseño/metodología/enfoque
Utilizamos una base de datos de 924 archivos de créditos concedidos a empresas industriales tunecinas por un banco comercial en 2003, 2004, 2005 y 2006. El algoritmo bayesiano de clasificadores se llevó a cabo y los resultados muestran que la tasa de clasificación buena es del orden del 63.85%. La probabilidad de incumplimiento se explica por las variables que miden el capital de trabajo, el apalancamiento, la solvencia, la rentabilidad y los indicadores de flujo de efectivo.
Hallazgos
Los resultados de la prueba de validación muestran que la buena tasa de clasificación es del orden de 58.66% ; sin embargo, los errores tipo I y II permanecen relativamente altos, siendo de 42.42% y 40.47%, respectivamente. Se traza una curva ROC para evaluar el rendimiento del modelo. El resultado muestra que el criterio de área bajo curva (AUC, por sus siglas en inglés) es del orden del 69%.
Originalidad/valor
El documento destaca el hecho de que el Banco Central tunecino obligó a todas las entidades del sector llevar a cabo un estudio de encuesta para recopilar datos cualitativos para un mejor registro de crédito de los prestatarios.
Palabras clave
Curva ROC, Evaluación de riesgos, Riesgo de incumplimiento, Sector bancario, Algoritmo clasificador bayesiano.
Tipo de artículo
Artículo de investigación
Details
Keywords
Grace W.Y. Wang, Zhisen Yang, Di Zhang, Anqiang Huang and Zaili Yang
This study aims to develop an assessment methodology using a Bayesian network (BN) to predict the failure probability of oil tanker shipping firms.
Abstract
Purpose
This study aims to develop an assessment methodology using a Bayesian network (BN) to predict the failure probability of oil tanker shipping firms.
Design/methodology/approach
This paper proposes a bankruptcy prediction model by applying the hybrid of logistic regression and Bayesian probabilistic networks.
Findings
The proposed model shows its potential of contributing to a powerful tool to predict financial bankruptcy of shipping operators, and provides important insights to the maritime community as to what performance measures should be taken to ensure the shipping companies’ financial soundness under dynamic environments.
Research limitations/implications
The model and its associated variables can be expanded to include more factors for an in-depth analysis in future when the detailed information at firm level becomes available.
Practical implications
The results of this study can be implemented to oil tanker shipping firms as a prediction tool for bankruptcy rate.
Originality/value
Incorporating quantitative statistical measurement, the application of BN in financial risk management provides advantages to develop a powerful early warning system in shipping, which has unique characteristics such as capital intensive and mobile assets, possibly leading to catastrophic consequences.
Details
Keywords
Kumash Kapadia, Hussein Abdel-Jaber, Fadi Thabtah and Wael Hadi
Indian Premier League (IPL) is one of the more popular cricket world tournaments, and its financial is increasing each season, its viewership has increased markedly and the…
Abstract
Indian Premier League (IPL) is one of the more popular cricket world tournaments, and its financial is increasing each season, its viewership has increased markedly and the betting market for IPL is growing significantly every year. With cricket being a very dynamic game, bettors and bookies are incentivised to bet on the match results because it is a game that changes ball-by-ball. This paper investigates machine learning technology to deal with the problem of predicting cricket match results based on historical match data of the IPL. Influential features of the dataset have been identified using filter-based methods including Correlation-based Feature Selection, Information Gain (IG), ReliefF and Wrapper. More importantly, machine learning techniques including Naïve Bayes, Random Forest, K-Nearest Neighbour (KNN) and Model Trees (classification via regression) have been adopted to generate predictive models from distinctive feature sets derived by the filter-based methods. Two featured subsets were formulated, one based on home team advantage and other based on Toss decision. Selected machine learning techniques were applied on both feature sets to determine a predictive model. Experimental tests show that tree-based models particularly Random Forest performed better in terms of accuracy, precision and recall metrics when compared to probabilistic and statistical models. However, on the Toss featured subset, none of the considered machine learning algorithms performed well in producing accurate predictive models.
Details
Keywords
Khalid Iqbal and Muhammad Shehrayar Khan
In this digital era, email is the most pervasive form of communication between people. Many users become a victim of spam emails and their data have been exposed.
Abstract
Purpose
In this digital era, email is the most pervasive form of communication between people. Many users become a victim of spam emails and their data have been exposed.
Design/methodology/approach
Researchers contribute to solving this problem by a focus on advanced machine learning algorithms and improved models for detecting spam emails but there is still a gap in features. To achieve good results, features also play an important role. To evaluate the performance of applied classifiers, 10-fold cross-validation is used.
Findings
The results approve that the spam emails are correctly classified with the accuracy of 98.00% for the Support Vector Machine and 98.06% for the Artificial Neural Network as compared to other applied machine learning classifiers.
Originality/value
In this paper, Point-Biserial correlation is applied to each feature concerning the class label of the University of California Irvine (UCI) spambase email dataset to select the best features. Extensive experiments are conducted on selected features by training the different classifiers.
Details
Keywords
The purpose of this paper is to compare nine different models to evaluate consumer credit risk, which are the following: Logistic Regression (LR), Naive Bayes (NB), Linear…
Abstract
Purpose
The purpose of this paper is to compare nine different models to evaluate consumer credit risk, which are the following: Logistic Regression (LR), Naive Bayes (NB), Linear Discriminant Analysis (LDA), k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), Classification and Regression Tree (CART), Artificial Neural Network (ANN), Random Forest (RF) and Gradient Boosting Decision Tree (GBDT) in Peer-to-Peer (P2P) Lending.
Design/methodology/approach
The author uses data from P2P Lending Club (LC) to assess the efficiency of a variety of classification models across different economic scenarios and to compare the ranking results of credit risk models in P2P lending through three families of evaluation metrics.
Findings
The results from this research indicate that the risk classification models in the 2013–2019 economic period show greater measurement efficiency than for the difficult 2007–2012 period. Besides, the results of ranking models for predicting default risk show that GBDT is the best model for most of the metrics or metric families included in the study. The findings of this study also support the results of Tsai et al. (2014) and Teplý and Polena (2019) that LR, ANN and LDA models classify loan applications quite stably and accurately, while CART, k-NN and NB show the worst performance when predicting borrower default risk on P2P loan data.
Originality/value
The main contributions of the research to the empirical literature review include: comparing nine prediction models of consumer loan application risk through statistical and machine learning algorithms evaluated by the performance measures according to three separate families of metrics (threshold, ranking and probabilistic metrics) that are consistent with the existing data characteristics of the LC lending platform through two periods of reviewing the current economic situation and platform development.
Details
Keywords
Ahmad Younso, Ziad Kanaya and Nour Azhari
We consider the kernel-based classifier proposed by Younso (2017). This nonparametric classifier allows for the classification of missing spatially dependent data. The weak…
Abstract
We consider the kernel-based classifier proposed by Younso (2017). This nonparametric classifier allows for the classification of missing spatially dependent data. The weak consistency of the classifier has been studied by Younso (2017). The purpose of this paper is to establish strong consistency of this classifier under mild conditions. The classifier is discussed in a multi-class case. The results are illustrated with simulation studies and real applications.