Search results
1 – 10 of 10
Zhenyuan Wang, Chih-Fong Tsai and Wei-Chao Lin
Abstract
Purpose
Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers.
Design/methodology/approach
In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values.
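As a rough illustration of this kind of pipeline, the sketch below imputes missing values with a CART regressor and then filters the majority class before training a one-class classifier. It assumes scikit-learn; the simple neighbourhood-based filter is only a stand-in for the IB3, DROP3 and GA algorithms evaluated in the paper, and the helper names, data and parameters are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import OneClassSVM
from sklearn.tree import DecisionTreeRegressor

def impute_with_cart(X):
    """Fill missing values by iteratively regressing each feature with CART."""
    imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5),
                               random_state=0)
    return imputer.fit_transform(X)

def select_instances(X, k=5, keep_ratio=0.9):
    """Stand-in instance selection (not IB3/DROP3/GA): drop the samples whose
    mean distance to their k nearest neighbours is largest, as likely noise."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    score = dist[:, 1:].mean(axis=1)               # column 0 is self-distance
    keep = np.sort(score.argsort()[: int(len(X) * keep_ratio)])
    return X[keep]

# Hypothetical majority-class (normal) training data with injected missingness.
rng = np.random.default_rng(0)
X_major = rng.normal(size=(500, 8))
X_major[rng.random(X_major.shape) < 0.05] = np.nan

X_clean = select_instances(impute_with_cart(X_major))  # impute, then select
ocsvm = OneClassSVM(nu=0.05).fit(X_clean)              # trained on normal data only
print(ocsvm.predict(X_clean[:5]))                      # +1 = normal, -1 = anomaly
```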
Findings
The experimental results, based on 44 class imbalanced datasets, three instance selection algorithms (IB3, DROP3 and the GA), the CART decision tree for missing value imputation, and three one-class classifiers (OCSVM, IFOREST and LOF), show that if the instance selection algorithm is carefully chosen, performing this step can improve the quality of the training data, making the one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain some missing values, combining missing value imputation and instance selection, regardless of which step is performed first, can maintain data quality similar to that of datasets without missing values.
Originality/value
The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has not been done before. Moreover, this study is the first attempt to consider the scenario in which missing values exist in the training set used to train one-class classifiers. In this case, performing missing value imputation and instance selection in different orders is compared.
Wei-Chao Lin, Shih-Wen Ke and Chih-Fong Tsai
Abstract
Purpose
Data mining is widely considered necessary in many business applications for effective decision-making. The importance of business data mining is reflected by the existence of numerous surveys in the literature focusing on the investigation of related works using data mining techniques for solving specific business problems. The purpose of this paper is to answer the following question: What are the widely used data mining techniques in business applications?
Design/methodology/approach
The aim of this paper is to examine related surveys in the literature and thus to identify the frequently applied data mining techniques. To ensure the recency and quality of the conclusions, the criterion for selecting related studies is that the works be published in reputed journals within the past 10 years.
Findings
There are 33 different data mining techniques employed in eight different application areas. Most of them are supervised learning techniques, and the application area where such techniques are most often seen is bankruptcy prediction, followed by customer relationship management, fraud detection, intrusion detection and recommender systems. Furthermore, the ten most widely used data mining techniques for business applications are the decision tree (including the C4.5 decision tree and the classification and regression tree), genetic algorithm, k-nearest neighbor, multilayer perceptron neural network, naïve Bayes and support vector machine among the supervised learning techniques, and association rule, expectation maximization and k-means among the unsupervised learning techniques.
Originality/value
The originality of this paper is to survey the related survey and review articles about data mining in business applications published within the past 10 years, in order to identify the most popular techniques.
Wei-Chao Lin, Shih-Wen Ke and Chih-Fong Tsai
Abstract
Purpose
This paper aims to introduce a prototype system called SAFQuery (Simple And Flexible Query interface). In many existing Web search interfaces, simple and advanced query processes are treated separately and cannot be issued interchangeably. In addition, after several rounds of queries for specific information needs, users might wish to re-examine the retrieval results of some previous queries or to slightly modify some of the queries issued before. However, it is often hard to remember which queries have been issued. These factors make the current Web search process neither very simple nor flexible.
Design/methodology/approach
In SAFQuery, the simple and advanced query strategies are integrated into a single interface, allowing users to easily formulate query specifications when needed within the same interface. Moreover, query history information is provided that displays past query specifications, which helps reduce users' memory load.
Findings
The authors' user evaluation experiments show that most users had a positive experience when using SAFQuery. Specifically, they found it easy to use and felt it simplified the Web search task.
Originality/value
The proposed prototype system provides simple and flexible Web search strategies. In particular, it allows users to easily issue simple and advanced queries interchangeably from one single query interface. In addition, users can easily re-issue previous queries without spending time recalling or re-typing them.
Chih‐Fong Tsai and Wei‐Chao Lin
Abstract
Purpose
Content‐based image retrieval suffers from the semantic gap problem: that images are represented by low‐level visual features, which are difficult to directly match to high‐level concepts in the user's mind during retrieval. To date, visual feature representation is still limited in its ability to represent semantic image content accurately. This paper seeks to address these issues.
Design/methodology/approach
In this paper the authors propose a novel meta-feature representation method for scenery image retrieval. In particular, some class-specific distances (namely meta-features) between low-level image features are measured: for example, the distance between an image and its class centre, and the distances between the image and its nearest and farthest images in the same class.
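A minimal sketch of how such class-specific meta-features could be computed is given below, assuming NumPy; the hypothetical meta_features helper measures only the three example distances named above, and the paper's full method may use more.

```python
import numpy as np

def meta_features(X, y):
    """Per image: distance to its class centre, and distances to the nearest
    and farthest images of the same class (three meta-features)."""
    M = np.zeros((len(X), 3))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centre = X[idx].mean(axis=0)
        for i in idx:
            d = np.linalg.norm(X[idx] - X[i], axis=1)
            d_others = d[d > 0] if np.any(d > 0) else d  # exclude the image itself
            M[i] = [np.linalg.norm(X[i] - centre), d_others.min(), d_others.max()]
    return M

# Hypothetical low-level features (e.g. colour histograms) for six images.
X = np.random.default_rng(1).random((6, 16))
y = np.array([0, 0, 0, 1, 1, 1])
print(meta_features(X, y))   # columns: [centre distance, nearest, farthest]
```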
Findings
Three experiments based on 190 concrete, 130 abstract, and 610 categories in the Corel dataset show that the meta‐features extracted from both global and local visual features significantly outperform the original visual features in terms of mean average precision.
Originality/value
Compared with traditional local and global low‐level features, the proposed meta‐features have higher discriminative power for distinguishing a large number of conceptual categories for scenery image retrieval. In addition the meta‐features can be directly applied to other image descriptors, such as bag‐of‐words and contextual features.
Cheng-Che Shen, Ya-Han Hu, Wei-Chao Lin, Chih-Fong Tsai and Shih-Wen Ke
Abstract
Purpose
The purpose of this paper is to focus on examining the research impact of papers written with and without funding. Specifically, the citation analysis method is used to compare the general and funded papers published in two leading international conferences, which are ACM SIGIR and ACM SIGKDD.
Design/methodology/approach
The authors investigate the number of general and funded papers to see whether the number of funded papers is larger than the number of general papers. In addition, the total citations and the number of highly cited papers with and without funding are also compared.
Findings
The analysis results for the ACM SIGIR papers show that in most cases the number of funded papers is larger than the number of general papers. Moreover, the total citations, the average number of citations per paper, and the number of highly cited papers all reveal the superiority of funded papers over general papers. However, the findings are somewhat different for the ACM SIGKDD papers. This may be because ACM SIGIR began much earlier than ACM SIGKDD, which relates to the maturity of the research problems addressed in these two conferences.
Originality/value
The value of this paper is that it is the first attempt to examine the research impact of general and funded research papers by the citation analysis method. The research impact in other research areas can be further investigated with other analysis methods.
Shih-Wen Ke, Wei-Chao Lin, Chih-Fong Tsai and Ya-Han Hu
Abstract
Purpose
Conference publications are an important aspect of research activities. There are generally both oral presentations and poster sessions at large international conferences. One can hypothesise that, for the same conferences, papers presented in oral sessions should have a higher research impact than papers presented in poster sessions. However, there has been no related study examining the validity of this hypothesis. In other words, the difference in research impact between papers presented in oral sessions and those presented in poster sessions has not been discussed in the literature. Therefore, the purpose of this paper is to conduct a citation analysis to compare the research impact of papers presented in oral and poster sessions.
Design/methodology/approach
In this paper, data from three leading conferences in the field of computer vision are examined, namely CVPR (2011 and 2012), ICCV (2011) and ECCV (2012). Several types of citation-related statistics are collected, including the number of highly cited papers (i.e. high number of citations) presented in oral and poster sessions, the total citations of both types of papers, the average citations of oral and poster papers, and the average citations of each frequently cited paper of both types.
Findings
There are three main findings. First, a larger proportion of highly cited papers are from oral sessions than poster sessions. Second, the average number of citations per paper is larger for those presented in oral sessions than poster sessions. Third, the average number of citations for highly cited papers presented in oral sessions is not necessarily greater than for the ones presented in poster sessions.
Originality/value
The originality of this paper is that it is the first attempt to examine the differences in the citation impact of conference papers presented in oral and poster sessions. The findings of this study will allow future bibliometrics research to explore this issue further, over longer periods and in different fields.
Wei-Chao Lin, Chih-Fong Tsai and Shih-Wen Ke
Abstract
Purpose
In many research areas, there are a variety of different types of academic publications, including journals, magazines and conferences, which provide outlets for researchers to present their findings. Generally speaking, although there are differences in the reviewing criteria and publication processes of different publication types, in the same research area, there is certainly overlap in terms of the problems addressed and the audience for different publication types. Therefore, the research impacts of different publication types in the same research area should be moderately or highly correlated. The paper aims to discuss these issues.
Design/methodology/approach
To test this hypothesis, the authors examine the correlation coefficients of citation impact for different types of publications, in seven research areas of computer science, from 2000 to 2013. In particular, four related citation statistics are examined for each publication type: average citations per paper, average citations per year, average annual increase in individual h-index, and h-index.
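For concreteness, the sketch below computes two of these statistics, the h-index and the average citations per paper, from a hypothetical list of per-paper citation counts; it illustrates the measures themselves, not the authors' actual scripts or data.

```python
def h_index(citations):
    """Largest h such that at least h papers each have at least h citations."""
    for h, c in enumerate(sorted(citations, reverse=True), start=1):
        if c < h:
            return h - 1
    return len(citations)

cites = [42, 17, 9, 4, 1, 0]         # hypothetical per-paper citation counts
print(h_index(cites))                 # -> 4
print(sum(cites) / len(cites))        # average citations per paper
```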
Findings
The analysis results show only a partial correlation in terms of several specific citation measures for different publication types in the same research area. Moreover, the level of correlation of the citation impact between different publication types is different, depending on the research area.
Originality/value
The contribution of this paper is to investigate whether the research impact of different types of publications in the same area is correlated. The findings can help researchers and academics choose the most appropriate publication outlets.
Wei-Chao Lin, Chih-Fong Tsai and Shih-Wen Ke
Abstract
Purpose
Churn prediction is a very important task for successful customer relationship management. In general, churn prediction can be achieved by many data mining techniques. However, during data mining, dimensionality reduction (or feature selection) and data reduction are two important data preprocessing steps. In particular, the aims of feature selection and data reduction are to filter out irrelevant features and noisy data samples, respectively. The purpose of this paper is to examine the order in which these data preprocessing tasks should be performed so that the mining algorithm produces good quality mining results.
Design/methodology/approach
Based on a real telecom customer churn data set, seven differently preprocessed data sets, produced by performing feature selection and data reduction with different priorities, are used to train the artificial neural network as the churn prediction model.
Findings
The results show that performing data reduction first by self-organizing maps and feature selection second by principal component analysis allows the prediction model to provide the highest prediction accuracy. In addition, this order allows the prediction model to learn more efficiently, since the original features and data samples are reduced by 66 and 62 percent, respectively.
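The sketch below illustrates this best-performing order: data reduction by a self-organizing map, then feature reduction with principal component analysis, then a neural network churn model. It assumes the third-party minisom package alongside scikit-learn, and all sizes, hyperparameters and data are illustrative rather than the paper's.

```python
import numpy as np
from minisom import MiniSom
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                     # hypothetical customer features
y = (rng.random(1000) < 0.2).astype(int)       # hypothetical churn labels

# 1) Data reduction: keep one representative customer per activated SOM node.
som = MiniSom(6, 6, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 500)
reps = {}
for i, x in enumerate(X):
    reps.setdefault(som.winner(x), i)           # first sample mapped to each node
keep = sorted(reps.values())
X_red, y_red = X[keep], y[keep]

# 2) Feature reduction: project onto the leading principal components.
pca = PCA(n_components=0.95)                    # retain 95% of the variance
X_feat = pca.fit_transform(X_red)

# 3) Churn prediction with an artificial neural network.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_feat, y_red)
print(clf.score(X_feat, y_red))                 # training accuracy of the sketch
```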
Originality/value
The contribution of this paper is to identify the better order in which to perform the two important data preprocessing steps for telecom churn prediction.
Wei-Chao Yang, Guo-Zhi Li, E Deng, De-Hui Ouyang and Zhi-Peng Lu
Abstract
Purpose
Sustainable urban rail transit requires noise barriers. However, the durability of these barriers varies due to the differing aerodynamic impacts they experience. The purpose of this paper is to investigate the aerodynamic discrepancies of trains when they meet within two types of rectangular noise barriers, fully enclosed (FERNB) and semi-enclosed with vertical plates (SERNBVB). The research also considers the sensitivity to the scale ratio in these scenarios.
Design/methodology/approach
A 1:16 scaled moving model test was used to analyze the spatiotemporal patterns of, and discrepancies in, aerodynamic pressures during train meetings. Three-dimensional computational fluid dynamics models, with scale ratios of 1:1, 1:8 and 1:16, used the improved delayed detached eddy simulation turbulence model and the slip grid technique. The effects of the scale ratio on the aerodynamic pressure discrepancies between the two types of noise barriers were compared, and the underlying flow field mechanism was revealed. The goal is to establish the relationship between aerodynamic pressure at scale and at full scale.
Findings
The aerodynamic pressure on SERNBVB is influenced by the train's head and tail waves, whereas that on FERNB is affected by both the pressure wave and the head and tail waves. Notably, SERNBVB's aerodynamic pressure is more sensitive to changes in the scale ratio. As the scale ratio decreases, the aerodynamic pressure on the noise barrier gradually increases.
Originality/value
A train-meeting moving model test is conducted within the noise barrier. The aerodynamic discrepancies during train meetings within the two types of rectangular noise barriers are compared, and the relationship between scaled and full-scale aerodynamic pressures is established, taking the modeling scale ratio into account.
Dulcy M. Abraham and M.H. Joanne Yeh
Abstract
The Environmental Protection Bureau of Taiwan established the South Star Project in Kaohsiung, Taiwan, as a solution to two problems facing the city: the urgent need to dispose of industrial wastes and the need to increase land for the city. To reclaim land from the sea, breakwaters were constructed. The material used to construct the breakwaters was a mixture of furnace slag (waste from the steel industry) and fly ash (waste from power plants). After the breakwaters were constructed, the reclaimed land was used as a landfill for construction and public waste. In the future, this reclaimed land will be used for the development of a deepwater port or a sea airport. Construction of breakwaters is a very repetitive process, and any improvements made would help contractors reduce the duration of the operation, improve the efficiency of the process and thereby reduce costs. This paper discusses the process of breakwater construction and the utilization of industrial wastes for the concrete work on the project. Data collected from the first stage of the South Star Project are used in the modelling, simulation and analysis of the process, in order to examine the interaction between different resources.
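As a rough illustration of how such a repetitive process with shared resources can be simulated, the sketch below models barges delivering blocks to a single crane using the simpy discrete-event library; all activity names, durations and resource counts are hypothetical, not the project's data or the paper's actual model.

```python
import simpy

def place_block(env, name, crane, done):
    """One slag/fly-ash block: wait for the shared crane, place it, log time."""
    with crane.request() as req:
        yield req                      # queue for the crane
        yield env.timeout(2)           # placement takes ~2 time units
    done.append((name, env.now))

def barge_cycle(env, crane, done, n_blocks=10):
    for i in range(n_blocks):
        yield env.timeout(1)           # barge hauls the next block to site
        env.process(place_block(env, f"block-{i}", crane, done))

env = simpy.Environment()
crane = simpy.Resource(env, capacity=1)   # one crane: the likely bottleneck
done = []
env.process(barge_cycle(env, crane, done))
env.run(until=50)
print(done)  # completion times reveal queueing at the shared crane
```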