Search results | Emerald Insight

Open Access

Article

Publication date: 2 April 2024

Automated Dewey Decimal Classification of Swedish library metadata using Annif software

Koraljka Golub, Osma Suominen, Ahmed Taiye Mohammed, Harriet Aagaard and Olof Osterman

Abstract

Purpose

In order to estimate the value of semi-automated subject indexing in operative library catalogues, the study aimed to investigate five different automated implementations of an open source software package on a large set of Swedish union catalogue metadata records, with Dewey Decimal Classification (DDC) as the target classification system. It also aimed to contribute to the body of research on aboutness and related challenges in automated subject indexing and evaluation.

Design/methodology/approach

On a sample of over 230,000 records with close to 12,000 distinct DDC classes, the open source tool Annif, developed by the National Library of Finland, was applied in the following implementations: lexical algorithm, support vector classifier, fastText, Omikuji Bonsai and an ensemble approach combining the former four. A qualitative study involving two senior catalogue librarians and three students of library and information studies was also conducted to investigate the value and inter-rater agreement of automatically assigned classes, on a sample of 60 records.
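As a rough illustration of the ensemble idea in this abstract, the sketch below averages per-class scores from several backends and ranks the classes by their mean score. The backend names, the scores and the unweighted mean are all assumptions made for illustration; the abstract does not state how the ensemble combines its four sources.

```python
from collections import defaultdict


def average_ensemble(backend_scores):
    """Average per-class scores across backends (hypothetical helper).

    backend_scores maps a backend name to a dict of {ddc_class: score}.
    A class missing from a backend's output counts as score 0.0.
    """
    totals = defaultdict(float)
    for scores in backend_scores.values():
        for ddc_class, score in scores.items():
            totals[ddc_class] += score
    n = len(backend_scores)
    averaged = {c: total / n for c, total in totals.items()}
    # Highest mean score first; ties broken by class notation for determinism.
    return sorted(averaged.items(), key=lambda item: (-item[1], item[0]))


# Made-up scores for a single record; not data from the study.
example = {
    "lexical": {"025.4": 0.6, "006.3": 0.2},
    "svc": {"025.4": 0.5, "020": 0.3},
    "fasttext": {"006.3": 0.7, "025.4": 0.4},
    "omikuji": {"025.4": 0.9},
}
ranked = average_ensemble(example)
```

Here `ranked` lists the candidate classes from most to least supported, so a cataloguer could be shown the first few as suggestions.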

Findings

The ensemble approach gave the best results, reaching 66.82% accuracy on the three-digit DDC classification task. The qualitative study confirmed earlier studies reporting low inter-rater agreement but also pointed to the potential value of automatically assigned classes as additional access points in information retrieval.
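To make the three-digit accuracy figure concrete, the sketch below shows one way such a score could be computed: each DDC notation is truncated to its first three digits before the gold class and the top prediction are compared. The truncation rule, the function names and the sample notations are assumptions for illustration; the abstract does not describe the exact evaluation procedure.

```python
def three_digit_class(ddc_notation):
    """Return the first three digits of a DDC notation, e.g. '025.431' -> '025'."""
    return ddc_notation.strip()[:3]


def three_digit_accuracy(gold, predicted):
    """Share of records whose top prediction matches gold at three-digit level."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        three_digit_class(g) == three_digit_class(p)
        for g, p in zip(gold, predicted)
    )
    return hits / len(gold)


# Made-up notations; the last pair differs at the three-digit level.
gold = ["025.431", "006.35", "839.7", "020"]
predicted = ["025.04", "006.3", "839.8", "027"]
accuracy = three_digit_accuracy(gold, predicted)
```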

Originality/value

The paper presents an extensive study of automated classification in an operative library catalogue, accompanied by a qualitative study of automated classes. It demonstrates the value of applying semi-automated indexing in operative information retrieval systems.

Details

Journal of Documentation, vol. 80 no. 5

Type: Research Article

ISSN: 0022-0418

Article

Publication date: 23 August 2019

Artificial bee colony algorithm for feature selection and improved support vector machine for text classification

Janani Balakumar and S. Vijayarani Mohan

Abstract

Purpose

Owing to the huge volume of documents available on the internet, text classification becomes a necessary task to handle these documents. To achieve optimal text classification results, feature selection, an important stage, is used to curtail the dimensionality of text documents by choosing suitable features. The main purpose of this research work is to classify the personal computer documents based on their content.

Design/methodology/approach

This paper proposes a new algorithm for feature selection based on artificial bee colony (ABCFS) to enhance text classification accuracy. The proposed algorithm (ABCFS) is evaluated on real and benchmark data sets and compared with existing feature selection approaches such as information gain and the χ² statistic. To justify the efficiency of the proposed algorithm, the support vector machine (SVM) and an improved SVM classifier are used in this paper.

Findings

The experiment was conducted on real and benchmark data sets. The real data set was collected in the form of documents stored on a personal computer, and the benchmark data set was collected from the Reuters and 20 Newsgroups corpora. The results show that the proposed feature selection algorithm enhances text document classification accuracy.

Originality/value

This paper proposes a new ABCFS algorithm for feature selection, evaluates the efficiency of the ABCFS algorithm and improves the support vector machine. In this paper, the ABCFS algorithm is used to select the features from text (unstructured) documents. Although there is no text feature selection algorithm in the existing work, the ABCFS algorithm is used to select the data (structured) features. The proposed algorithm will classify the documents automatically based on their content.

Details

Information Discovery and Delivery, vol. 47 no. 3

Type: Research Article

ISSN: 2398-6247

Article

Publication date: 6 February 2017

Hybrid supervised clustering based ensemble scheme for text classification

Aytug Onan

Abstract

Purpose

The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification can be an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design.

Design/methodology/approach

An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, supervised hybrid clustering, which is based on cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversities can be provided. Each classifier is trained on the diversified training subsets and the predictions of individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naïve Bayes, logistic regression, support vector machines and C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks.
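The majority voting rule mentioned in this abstract can be sketched in a few lines. The example below covers only the combination step, not the cuckoo search or k-means clustering that builds the training subsets; the labels, the function names and the tie-breaking rule are assumptions made for this sketch.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most classifiers.

    Ties are broken in favour of the label that appears first in the input,
    which is an arbitrary choice made for this sketch.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(per_classifier_predictions):
    """Combine per-classifier prediction lists, one document at a time."""
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_predictions)]


# One row per classifier, one column per document (made-up labels).
votes = [
    ["sport", "politics", "tech"],
    ["sport", "tech", "tech"],
    ["politics", "politics", "sport"],
]
combined = ensemble_predict(votes)
```

With an odd number of classifiers and two classes a tie cannot occur; with more classes it can, which is why the tie rule is spelled out.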

Findings

The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification.

Originality/value

The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.

Details

Kybernetes, vol. 46 no. 2

Type: Research Article

ISSN: 0368-492X

Article

Publication date: 28 February 2023

A comparative analysis of text representation, classification and clustering methods over real project proposals

Meltem Aksoy, Seda Yanık and Mehmet Fatih Amasyali

Abstract

Purpose

When a large number of project proposals are evaluated to allocate available funds, grouping them based on their similarities is beneficial. Current approaches to group proposals are primarily based on manual matching of similar topics, discipline areas and keywords declared by project applicants. When the number of proposals increases, this task becomes complex and requires excessive time. This paper aims to demonstrate how to effectively use the rich information in the titles and abstracts of Turkish project proposals to group them automatically.

Design/methodology/approach

This study proposes a model that effectively groups Turkish project proposals by combining word embedding, clustering and classification techniques. The proposed model uses FastText, BERT and term frequency/inverse document frequency (TF/IDF) word-embedding techniques to extract terms from the titles and abstracts of project proposals in Turkish. The extracted terms were grouped using both the clustering and classification techniques. Natural groups contained within the corpus were discovered using k-means, k-means++, k-medoids and agglomerative clustering algorithms. Additionally, this study employs classification approaches to predict the target class for each document in the corpus. To classify project proposals, various classifiers, including k-nearest neighbors (KNN), support vector machines (SVM), artificial neural networks (ANN), classification and regression trees (CART) and random forest (RF), are used. Empirical experiments were conducted to validate the effectiveness of the proposed method by using real data from the Istanbul Development Agency.
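Of the three representations named in this abstract, TF/IDF is simple enough to sketch without any library. The example below uses relative term frequency and idf = log(N / df), which is one common variant and an assumption here; the study's exact weighting, tokenisation and Turkish-language preprocessing are not given in the abstract, and the sample texts are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for whitespace-tokenised documents.

    Uses relative term frequency and idf = log(N / df), one common variant.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Invented proposal titles, in English for readability.
docs = [
    "solar energy project",
    "solar panel project",
    "tourism project",
]
weights = tf_idf(docs)
```

A term that occurs in every document, such as "project" here, gets weight zero, while terms unique to one proposal receive the highest weights. The resulting vectors are the kind of input a clustering or classification algorithm would consume.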

Findings

The results show that the generated word embeddings can effectively represent proposal texts as vectors, and can be used as inputs for clustering or classification algorithms. Using clustering algorithms, the document corpus is divided into five groups. In addition, the results demonstrate that the proposals can easily be categorized into predefined categories using classification algorithms. SVM-Linear achieved the highest prediction accuracy (89.2%) with the FastText word embedding method. A comparison of manual grouping with automatic classification and clustering results revealed that both classification and clustering techniques have a high success rate.

Research limitations/implications

The proposed model automatically benefits from the rich information in project proposals and significantly reduces numerous time-consuming tasks that managers must perform manually. Thus, it eliminates the drawbacks of the current manual methods and yields significantly more accurate results. In the future, additional experiments should be conducted to validate the proposed method using data from other funding organizations.

Originality/value

This study presents the application of word embedding methods to effectively use the rich information in the titles and abstracts of Turkish project proposals. Existing research studies on the automatic grouping of proposals use traditional frequency-based methods for feature extraction to represent project proposals. Unlike previous research, this study employs two high-performing neural network-based textual feature extraction techniques to obtain terms representing the proposals: BERT as a contextual word embedding method and FastText as a static word embedding method. Moreover, to the best of our knowledge, there has been no research conducted on the grouping of project proposals in Turkish.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 16 no. 3

Type: Research Article

ISSN: 1756-378X

Article

Publication date: 27 April 2020

The classification of the image segmentation algorithms

Igor Georgievich Khanykov, Ivan Mikhajlovich Tolstoj and Dmitriy Konstantinovich Levonevskiy

Abstract

Purpose

The purpose of this paper is to analyse the classification of image segmentation algorithms (ISA), in support of further research and the design of new computer vision algorithms.

Design/methodology/approach

For the development of the required algorithms a three-stage flowchart is suggested. An algorithm of quasi-optimal segmentation is discussed as a possible implementation of the suggested flowchart. A new attribute is introduced reflecting the specific hierarchical algorithm group, which the proposed algorithm belongs to. The introduced attribute refines the overall classification scheme and the requirements for the algorithms under development.

Findings

Optimal approximation generation is a computationally intensive task. The computational complexity can be reduced using a hierarchical data framework and a set of auxiliary algorithms, contributing to overall quality improvement. Because hierarchical solutions are usually distinctly suboptimal, further optimization was applied to them. A new classification attribute proposed in this paper makes it possible to discover previously hidden «blank spots» by decomposing the two-tier ISA classification scheme. The new attribute also makes it possible to group algorithms that yield multiple partitions as output and assign them to a dedicated group.

Originality/value

The originality of the paper lies in the development of a high-level ISA classification, as well as in the introduction of a new classification attribute, pertinent to iterative algorithm groups and to hierarchically structured data presentation algorithms.

Details

International Journal of Intelligent Unmanned Systems, vol. 8 no. 2

Type: Research Article

ISSN: 2049-6427

Book part

Publication date: 30 September 2020

Use of Classification Algorithms in Health Care

Hera Khan, Ayush Srivastav and Amit Kumar Mishra

Abstract

A detailed description will be provided of the classification algorithms that have been widely used in the domain of medical science. The foundation will be laid by giving a comprehensive overview of the background and history of classification algorithms. This will be followed by an extensive discussion of the various classification techniques in machine learning (ML), concluding with their relevant applications in data analysis in medical science and health care. The opening part of the chapter will deal with the fundamentals required for a sound understanding of classification techniques in ML, comprising the underlying differences between unsupervised and supervised learning, followed by the basic terminology of classification and its history. Further, it will cover the types of classification algorithms, ranging from linear classifiers such as Logistic Regression and Naïve Bayes to Nearest Neighbour, Support Vector Machine, Tree-based Classifiers and Neural Networks, together with their respective mathematics. Ensemble algorithms such as Majority Voting, Boosting, Bagging and Stacking will also be discussed at length along with their relevant applications. Furthermore, the chapter will explain the areas of application of such classification algorithms in biomedicine and health care and their contribution to decision-making systems and predictive analysis. To conclude, the chapter aims to contribute to research and development by providing thorough insight into classification algorithms and their relevant applications in the healthcare sector.

Details

Big Data Analytics and Intelligence: A Perspective for Health Care

Type: Book

ISBN: 978-1-83909-099-8

Article

Publication date: 12 June 2017

Empirical study on the effect of using synthetic attributes on classification algorithms

Ali Hasan Alsaffar

Abstract

Purpose

The purpose of this paper is to present an empirical study on the effect of two synthetic attributes on popular classification algorithms, using data originating from student transcripts. The attributes represent past performance achievements in a course, which are defined as global performance (GP) and local performance (LP). GP of a course is an aggregated performance achieved by all students who have taken this course, and LP of a course is an aggregated performance achieved in the prerequisite courses by the student taking the course.
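The two synthetic attributes defined in this abstract can be sketched directly from their definitions. The example below assumes that "aggregated performance" means the arithmetic mean of grade points, which the abstract does not specify; the transcript rows, the course codes and the prerequisite map are hypothetical.

```python
from statistics import mean

# Hypothetical transcript rows: (student, course, grade points).
records = [
    ("ann", "CS101", 4.0),
    ("bob", "CS101", 3.0),
    ("ann", "MA101", 3.0),
    ("bob", "MA101", 2.0),
    ("cat", "CS101", 2.0),
]
# Hypothetical prerequisite map for the target course.
prerequisites = {"CS201": ["CS101", "MA101"]}


def global_performance(course, rows):
    """GP: mean grade over all students who have taken the course."""
    grades = [g for _, c, g in rows if c == course]
    return mean(grades) if grades else None


def local_performance(student, course, rows, prereqs):
    """LP: the student's mean grade over the course's prerequisites."""
    wanted = set(prereqs.get(course, []))
    grades = [g for s, c, g in rows if s == student and c in wanted]
    return mean(grades) if grades else None


gp_cs101 = global_performance("CS101", records)
lp_ann = local_performance("ann", "CS201", records, prerequisites)
lp_cat = local_performance("cat", "CS201", records, prerequisites)
```

Both values would then be appended to each student-course row as extra features before training a classifier. Returning `None` when no grades exist leaves the handling of missing values to the caller.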

Design/methodology/approach

The paper uses Educational Data Mining techniques to predict student performance in courses. It identifies the attributes that most influence the prediction of the final grade (performance) and reports the effect of the two suggested attributes on the classification algorithms. As a research paradigm, the paper follows the Cross-Industry Standard Process for Data Mining using the RapidMiner Studio software tool. Six classification algorithms are tested: C4.5 and CART Decision Trees, Naive Bayes, k-nearest neighbors, rule-based induction and support vector machines.

Findings

The outcomes of the paper show that the synthetic attributes improved the performance of the classification algorithms and were also ranked highly according to their influence on the target variable.

Originality/value

This paper proposes two synthetic attributes that are integrated into a real data set. The key motivation is to improve the quality of the data and make classification algorithms perform better. The paper also presents empirical results showing the effect of these attributes on selected classification algorithms.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 10 no. 2

Type: Research Article

ISSN: 1756-378X

Article

Publication date: 3 May 2019

Computer vision for automatic detection and classification of fabric defect employing deep learning algorithm

Pandia Rajan Jeyaraj and Edward Rajan Samuel Nadar

Abstract

Purpose

The purpose of this paper is to focus on the design and development of computer-aided fabric defect detection and classification employing an advanced learning algorithm.

Design/methodology/approach

To classify fabric defects quickly and effectively, the authors have considered a characteristic of texture, namely its colour. A deep convolutional neural network is formed to learn from the training phase of various defect data sets. In the testing phase, the authors have utilised a learning feature for defect classification.

Findings

The defect classification accuracy was improved by employing a deep learning algorithm. The authors have tested the defect classification accuracy on six different fabric materials and have obtained an average accuracy of 96.55 per cent with 96.4 per cent sensitivity and 0.94 success rate.

Practical implications

The authors evaluated the method using 20 different data sets collected from different raw fabrics. The authors also tested the algorithm on a standard data set provided by the Ministry of Textile. In the testing task, the authors obtained an average accuracy of 94.85 per cent, with six defects being successfully recognised by the proposed algorithm.

Originality/value

The quantitative value of the performance index shows the effectiveness of the developed classification algorithm. Moreover, the computational time for different fabric processing was presented to compare the computational range of the proposed algorithm with that of conventional fabric processing techniques. Hence, the proposed computer vision-based fabric defect detection system can be used for accurate defect detection and computer-aided analysis.

Details

International Journal of Clothing Science and Technology, vol. 31 no. 4

Type: Research Article

ISSN: 0955-6222

Article

Publication date: 31 July 2019

Combination classification method for customer relationship management

Zhe Zhang and Yue Dai

Abstract

Purpose

For classification problems of customer relationship management (CRM), the purpose of this paper is to propose a method with interpretability of the classification results that combines multiple decision trees based on a genetic algorithm.

Design/methodology/approach

In the proposed method, multiple decision trees are combined in parallel. Subsequently, a genetic algorithm is used to optimize the weight matrix in the combination algorithm.

Findings

The method is applied to customer credit rating assessment and customer response behavior pattern recognition. The results demonstrate that compared to a single decision tree, the proposed combination method improves the predictive accuracy and optimizes the classification rules, while maintaining interpretability of the classification results.

Originality/value

The findings of this study contribute to research methodologies in CRM. It specifically focuses on a new method with interpretability by combining multiple decision trees based on genetic algorithms for customer classification.

Details

Asia Pacific Journal of Marketing and Logistics, vol. 32 no. 5

Type: Research Article

ISSN: 1355-5855

Article

Publication date: 29 December 2023

ImageNet classification with Raspberry Pis: federated learning algorithms of local classifiers

Thanh-Nghi Do and Minh-Thu Tran-Nguyen

Abstract

Purpose

This study aims to propose novel edge device-tailored federated learning algorithms of local classifiers (stochastic gradient descent, support vector machines), namely, FL-lSGD and FL-lSVM. These algorithms are designed to address the challenge of large-scale ImageNet classification.

Design/methodology/approach

The authors’ FL-lSGD and FL-lSVM train in a parallel and incremental manner to build an ensemble local classifier on Raspberry Pis without requiring data exchange. The algorithms load small data blocks of the local training subset stored on the Raspberry Pi sequentially to train the local classifiers. The data block is split into k partitions using the k-means algorithm, and models are trained in parallel on each data partition to enable local data classification.

Findings

Empirical test results on the ImageNet data set show that the authors’ FL-lSGD and FL-lSVM algorithms with 4 Raspberry Pis (Quad core Cortex-A72, ARM v8, 64-bit SoC @ 1.5GHz, 4GB RAM) are faster than the state-of-the-art LIBLINEAR algorithm run on a PC (Intel(R) Core i7-4790 CPU, 3.6 GHz, 4 cores, 32GB RAM).

Originality/value

Efficiently addressing the challenge of large-scale ImageNet classification, the authors’ novel federated learning algorithms of local classifiers have been tailored to work on the Raspberry Pi. These algorithms can handle 1,281,167 images and 1,000 classes effectively.

Details

International Journal of Web Information Systems, vol. 20 no. 1

Type: Research Article

ISSN: 1744-0084
