Kamlesh Kumar Pandey and Diwakar Shukla
Abstract
Purpose
The K-means (KM) clustering algorithm is extremely sensitive to the selection of initial centroids, since the initial centroids of the clusters determine computational effectiveness, efficiency and local optima issues. Numerous initialization strategies have been proposed to overcome these problems through random or deterministic selection of initial centroids. The random initialization strategy suffers from local optimization issues and the worst clustering performance, while the deterministic initialization strategy incurs high computational cost. Big data clustering aims to reduce computation costs and improve cluster efficiency. The objective of this study is to achieve better initial centroids for big data clustering on business management data without random or deterministic initialization, thereby avoiding local optima and improving clustering efficiency and effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.
Design/methodology/approach
This study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm resolves the KM clustering problem through the probability density of each data point. It first identifies the most probable density data points using the mean and standard deviation of the dataset through the normal probability density. Thereafter, it determines the K initial centroids using sorting and linear systematic sampling heuristics.
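To make the initialization idea above concrete, the following Python sketch scores each point with a normal probability density built from the dataset mean and standard deviation, sorts the points by that score and draws K seeds by linear systematic sampling. It is a minimal illustration, not the authors' implementation; in particular, summing the per-feature densities and picking the midpoint of each sampling stratum are assumptions.

```python
# Hedged sketch of an NDPD-style K-means initialization: score each point by a
# normal probability density built from the dataset mean and standard deviation,
# sort by that score, then pick K seeds by linear systematic sampling.
# Summing per-feature densities is an assumption, not necessarily the authors'
# exact aggregation rule.
import numpy as np

def ndpd_style_init(X, k):
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-12            # avoid division by zero
    # Normal probability density of every value, feature by feature
    pdf = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    density = pdf.sum(axis=1)                 # assumed aggregation across features
    order = np.argsort(density)               # sort points by density score
    # Linear systematic sampling: one seed from each of k equal-sized strata
    step = len(X) // k
    seed_idx = order[[i * step + step // 2 for i in range(k)]]
    return X[seed_idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    centers = ndpd_style_init(X, k=5)
    print(centers.shape)                      # (5, 4)
```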
Findings
The performance of the proposed algorithm is compared with the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms through the Davies-Bouldin score, Silhouette coefficient, SD validity, S_Dbw validity, number of iterations and CPU time validation indices on eight real business datasets. The experimental evaluation demonstrates that the NDPDKM algorithm reduces iterations, local optima and computing costs, and improves cluster performance, effectiveness and efficiency with stable convergence compared with the other algorithms. The NDPDKM algorithm reduces the average computing time by up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and the average iterations by up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74%, with reference to the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms, respectively.
Originality/value
The KM algorithm is the most widely used partitional clustering approach in data mining techniques that extract hidden knowledge, patterns and trends for decision-making strategies in business data. Business analytics is one of the applications of big data clustering where KM clustering is useful for the various subcategories of business analytics such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching and other decision-making strategies in business.
Elham Amirizadeh and Reza Boostani
Abstract
Purpose
The aim of this study is to propose a deep neural network (DNN) method that uses side information to improve clustering results for big datasets; the authors also show that applying this information improves clustering performance and increases the speed of network training convergence.
Design/methodology/approach
In data mining, semisupervised learning is an interesting approach because good performance can be achieved with a small subset of labeled data; one reason is that data labeling is expensive, and semisupervised learning does not need all labels. One type of semisupervised learning is constrained clustering; this type of learning does not use class labels for clustering. Instead, it uses information about some pairs of instances (side information), which may be in the same cluster (must-link [ML]) or in different clusters (cannot-link [CL]). Constrained clustering has been studied extensively; however, few works have focused on constrained clustering for big datasets. In this paper, the authors present a constrained clustering method for big datasets that uses a DNN. The authors inject the constraints (ML and CL) into this DNN to promote clustering performance and call it constrained deep embedded clustering (CDEC). In this manner, an autoencoder was implemented to elicit informative low-dimensional features in the latent space, and the encoder network was then retrained using a proposed Kullback–Leibler divergence objective function, which captures the constraints, in order to cluster the projected samples. The proposed CDEC was compared with the adversarial autoencoder, constrained 1-spectral clustering and autoencoder + k-means on the well-known MNIST, Reuters-10k and USPS datasets, and their performance was assessed in terms of clustering accuracy. Empirical results confirmed the statistical superiority of CDEC over the counterparts in terms of clustering accuracy.
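As a rough sketch of the kind of objective described above (not the paper's exact formulation), the following Python (PyTorch) snippet combines a DEC-style Kullback–Leibler clustering term with an assumed penalty that pushes ML pairs toward the same soft assignment and CL pairs apart.

```python
# Hedged sketch of a DEC-style clustering loss with pairwise constraints, loosely
# following the CDEC idea: Student-t soft assignments q, a sharpened target p,
# KL(p || q) as the clustering term, plus an assumed penalty for ML/CL pairs.
import torch

def soft_assign(z, centers, alpha=1.0):
    # q_ij: Student-t kernel between embedding z_i and cluster center mu_j (as in DEC)
    dist2 = torch.cdist(z, centers) ** 2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Sharpened target distribution used by DEC: emphasize confident assignments
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def cdec_style_loss(z, centers, ml_pairs, cl_pairs, lam=0.1):
    q = soft_assign(z, centers)
    p = target_distribution(q).detach()
    kl = (p * (p.clamp_min(1e-10).log() - q.clamp_min(1e-10).log())).sum(dim=1).mean()
    # Assumed constraint penalty: ML pairs should share assignments, CL pairs should not
    ml_i, ml_j = ml_pairs[:, 0], ml_pairs[:, 1]
    cl_i, cl_j = cl_pairs[:, 0], cl_pairs[:, 1]
    ml_term = ((q[ml_i] - q[ml_j]) ** 2).sum(dim=1).mean()
    cl_term = (q[cl_i] * q[cl_j]).sum(dim=1).mean()
    return kl + lam * (ml_term + cl_term)

if __name__ == "__main__":
    z = torch.randn(64, 10, requires_grad=True)       # embeddings from an encoder
    centers = torch.randn(5, 10, requires_grad=True)  # cluster centers
    ml = torch.randint(0, 64, (20, 2))
    cl = torch.randint(0, 64, (20, 2))
    loss = cdec_style_loss(z, centers, ml, cl)
    loss.backward()
    print(float(loss))
```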
Findings
First, this is the first DNN-based constrained clustering method that uses side information to improve clustering performance without using labels in big, high-dimensional datasets. Second, the authors define a formula to inject side information into the DNN. Third, the proposed method improves clustering performance and network convergence speed.
Originality/value
Few works have focused on constrained clustering for big datasets; likewise, studies of DNNs for clustering with a specific loss function that simultaneously extracts features and clusters the data are rare. The method improves the performance of big data clustering without using labels, which is important because data labeling is expensive and time-consuming, especially for big datasets.
Nicola Castellano, Roberto Del Gobbo and Lorenzo Leto
Abstract
Purpose
The concept of productivity is central to performance management and decision-making, although it is complex and multifaceted. This paper aims to describe a methodology based on the use of Big Data in a cluster analysis combined with a data envelopment analysis (DEA) that provides accurate and reliable productivity measures in a large network of retailers.
Design/methodology/approach
The methodology is described using a case study of a leading kitchen furniture producer. More specifically, Big Data is used in a two-step analysis prior to the DEA to automatically cluster a large number of retailers into groups that are homogeneous in terms of structural and environmental factors and assess a within-the-group level of productivity of the retailers.
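A minimal sketch of the two-step logic, assuming synthetic data and a textbook input-oriented CCR DEA model rather than the paper's exact specification: retailers are first clustered on structural and environmental factors, and efficiency is then computed within each cluster.

```python
# Hedged sketch: cluster retailers on structural/environmental factors, then run
# an input-oriented CCR DEA within each cluster. All data are synthetic placeholders.
import numpy as np
from scipy.optimize import linprog
from sklearn.cluster import KMeans

def ccr_efficiency(inputs, outputs, o):
    """Input-oriented CCR efficiency of unit o; inputs (n, m), outputs (n, s)."""
    n, m = inputs.shape
    s = outputs.shape[1]
    c = np.r_[1.0, np.zeros(n)]                        # minimize theta
    # sum_j lambda_j * x_j <= theta * x_o   ->  -theta*x_o + X^T lambda <= 0
    A_in = np.hstack([-inputs[o][:, None], inputs.T])
    b_in = np.zeros(m)
    # sum_j lambda_j * y_j >= y_o           ->  -Y^T lambda <= -y_o
    A_out = np.hstack([np.zeros((s, 1)), -outputs.T])
    b_out = -outputs[o]
    res = linprog(c, A_ub=np.vstack([A_in, A_out]), b_ub=np.r_[b_in, b_out],
                  bounds=[(None, None)] + [(0, None)] * n, method="highs")
    return res.x[0]

rng = np.random.default_rng(1)
factors = rng.normal(size=(60, 3))                     # structural/environmental factors
inputs = rng.uniform(1, 10, size=(60, 2))              # e.g. staff, floor space
outputs = rng.uniform(1, 10, size=(60, 1))             # e.g. sales
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(factors)
for g in range(3):
    idx = np.where(groups == g)[0]
    scores = [ccr_efficiency(inputs[idx], outputs[idx], i) for i in range(len(idx))]
    print(f"cluster {g}: mean efficiency {np.mean(scores):.3f}")
```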
Findings
The proposed methodology helps reduce the heterogeneity among the units analysed, which is a major concern in DEA applications. The data-driven factorial and clustering technique allows for maximum within-group homogeneity and between-group heterogeneity by reducing subjective bias and dimensionality, which is embedded with the use of Big Data.
Practical implications
The use of Big Data in clustering applied to productivity analysis can provide managers with data-driven information about the structural and socio-economic characteristics of retailers' catchment areas, which is important in establishing potential productivity performance and optimizing resource allocation. The improved productivity indexes enable the setting of targets that are coherent with retailers' potential, which increases motivation and commitment.
Originality/value
This article proposes an innovative technique to enhance the accuracy of productivity measures through the use of Big Data clustering and DEA. To the best of the authors’ knowledge, no attempts have been made to benefit from the use of Big Data in the literature on retail store productivity.
Runhai Jiao, Shaolong Liu, Wu Wen and Biying Lin
Abstract
Purpose
The large volume of big data makes traditional clustering algorithms, which are usually designed for the entire data set, impractical. The purpose of this paper is to focus on incremental clustering, which divides data into a series of data chunks so that only a small amount of data needs to be clustered at a time. Few studies on incremental clustering algorithms address the problems of optimizing cluster center initialization for each data chunk and of selecting multiple passing points for each cluster.
Design/methodology/approach
By optimizing the initial cluster centers, the quality of the clustering results is improved for each data chunk, and thus the quality of the final clustering results is enhanced. Moreover, by selecting multiple passing points, more accurate information is passed down to improve the final clustering results. A method that solves these two problems is proposed and applied in the proposed algorithm, which is based on the streaming kernel fuzzy c-means (stKFCM) algorithm.
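The chunk-wise logic can be sketched as follows; the snippet substitutes plain k-means for the kernel fuzzy c-means used in the paper and treats the number of passing points per cluster as an assumed parameter, so it illustrates only the initialization and passing-point ideas.

```python
# Hedged sketch of incremental clustering: cluster one chunk at a time, initialize
# each chunk with the centers learned so far, and carry several "passing points"
# per cluster forward so later chunks keep some memory of earlier data.
import numpy as np
from sklearn.cluster import KMeans

def cluster_stream(chunks, k, points_per_cluster=3):
    centers, carried = None, np.empty((0, chunks[0].shape[1]))
    for chunk in chunks:
        data = np.vstack([carried, chunk])             # passed points join the new chunk
        if centers is None:
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        else:
            km = KMeans(n_clusters=k, init=centers, n_init=1).fit(data)
        centers = km.cluster_centers_
        reps = []                                       # multiple passing points per cluster
        for j in range(k):
            members = data[km.labels_ == j]
            d = np.linalg.norm(members - centers[j], axis=1)
            reps.append(members[np.argsort(d)[:points_per_cluster]])
        carried = np.vstack(reps)
    return centers

rng = np.random.default_rng(2)
stream = [rng.normal(size=(500, 2)) + rng.integers(0, 4) for _ in range(5)]
print(cluster_stream(stream, k=4))
```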
Findings
Experimental results show that the proposed algorithm achieves higher accuracy and better performance than the stKFCM algorithm.
Originality/value
This paper addresses the problem of improving the performance of incremental clustering by optimizing cluster center initialization and selecting multiple passing points. The paper analyzes the performance of the proposed scheme and demonstrates its effectiveness.
Philipp Max Hartmann, Mohamed Zaki, Niels Feldmann and Andy Neely
Abstract
Purpose
The purpose of this paper is to derive a taxonomy of business models used by start-up firms that rely on data as a key resource for business, namely data-driven business models (DDBMs). By providing a framework to systematically analyse DDBMs, the study provides an introduction to DDBM as a field of study.
Design/methodology/approach
To develop the taxonomy of DDBMs, business model descriptions of 100 randomly chosen start-up firms were coded using a DDBM framework derived from literature, comprising six dimensions with 35 features. Subsequent application of clustering algorithms produced six different types of DDBM, validated by case studies from the study’s sample.
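A minimal sketch of the coding-then-clustering step, using synthetic binary feature vectors and an illustrative Jaccard-distance hierarchical clustering rather than the study's exact algorithm:

```python
# Hedged sketch: each start-up is represented as a binary vector over the
# framework's 35 features, and hierarchical clustering groups firms with similar
# feature profiles into six business-model types. Jaccard distance with average
# linkage is an illustrative choice, not the paper's method.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
firms = rng.integers(0, 2, size=(100, 35))            # 100 firms coded on 35 binary features
dist = pdist(firms.astype(bool), metric="jaccard")
tree = linkage(dist, method="average")
labels = fcluster(tree, t=6, criterion="maxclust")    # cut into six business-model types
print(np.bincount(labels)[1:])                        # cluster sizes
```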
Findings
The taxonomy derived from the research consists of six different types of DDBM among start-ups. These types are characterised by a subset of six of nine clustering variables from the DDBM framework.
Practical implications
A major contribution of the paper is the designed framework, which stimulates thinking about the nature and future of DDBMs. The proposed taxonomy will help organisations to position their activities in the current DDBM landscape. Moreover, the framework and taxonomy may lead to a DDBM design toolbox.
Originality/value
This paper develops a basis for understanding how start-ups build business models to capture value from data as a key resource, adding a business perspective to the discussion of big data. By offering the scientific community a specific framework of business model features and a subsequent taxonomy, the paper provides reference points and serves as a foundation for future studies of DDBMs.
Narasimhulu K, Meena Abarna KT and Sivakumar B
Abstract
Purpose
The purpose of the paper is to study multiple viewpoints, which are required to access more informative similarity features among tweet documents and are useful for achieving robust tweet data clustering results.
Design/methodology/approach
Let “N” be the number of tweet documents for topic extraction. In the initial tweet pre-processing step, unwanted text, punctuation and other symbols are removed, and tokenization and stemming operations are performed. Bag-of-features are determined for the tweets, and the tweets are then modelled with the obtained bag-of-features during the process of topic extraction. Approximate topic features are extracted for every tweet document, and these sets of topic features of the N documents are treated as multi-viewpoints. The key idea of the proposed work is to use multi-viewpoints in the similarity feature computation. For example, with five tweet documents (here N = 5) defined in a projected space with five viewpoints, say v1, v2, v3, v4 and v5, the similarity features between two documents (viewpoints v1 and v2) are computed with respect to the other three viewpoints (v3, v4 and v5), unlike the single viewpoint in the traditional cosine metric.
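A minimal sketch of the multi-viewpoint cosine computation described above, using synthetic topic-feature vectors:

```python
# Hedged sketch of multi-viewpoint cosine similarity: the similarity between two
# documents is averaged over cosine values measured relative to each of the
# remaining documents (viewpoints), instead of the single-origin viewpoint of the
# ordinary cosine. Topic-feature vectors are synthetic placeholders.
import numpy as np

def multi_viewpoint_similarity(docs, i, j):
    others = [h for h in range(len(docs)) if h not in (i, j)]
    sims = []
    for h in others:
        a, b = docs[i] - docs[h], docs[j] - docs[h]    # re-express both docs w.r.t. viewpoint h
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.mean(sims))

rng = np.random.default_rng(4)
docs = rng.random((5, 8))                              # N = 5 tweets, 8 topic features each
print(multi_viewpoint_similarity(docs, 0, 1))
```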
Findings
The approach addresses healthcare problems with tweets data. Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of finding term frequency and inverse document frequency (TF–IDF) for unlabelled tweets.
Originality/value
Topic models play a crucial role in the classification of health-related tweets by finding topics (or health clusters) instead of finding TF-IDF for unlabelled tweets.
Jianfang Qi, Yue Li, Haibin Jin, Jianying Feng and Weisong Mu
Abstract
Purpose
The purpose of this study is to propose a new consumer value segmentation method for low-dimensional dense market datasets to quickly detect and cluster the most profitable customers for the enterprises.
Design/methodology/approach
In this study, the comprehensive segmentation bases (CSB) with richer meanings were obtained by introducing the weighted recency-frequency-monetary (RFM) model into the common segmentation bases (SB). Further, a new market segmentation method, the CSB-MBK algorithm, was proposed by integrating the CSB model and the mini-batch k-means (MBK) clustering algorithm.
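A minimal sketch of the CSB-MBK idea with synthetic data; the RFM weights and the composition of the common segmentation bases are illustrative assumptions, not the paper's calibrated values:

```python
# Hedged sketch: build weighted RFM scores, append them to the common segmentation
# bases, and cluster with mini-batch k-means. All data and weights are placeholders.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 5000
recency = rng.integers(1, 365, n)           # days since last purchase
frequency = rng.poisson(5, n)               # number of purchases
monetary = rng.gamma(2.0, 50.0, n)          # total spend
common_bases = rng.normal(size=(n, 3))      # e.g. demographic/behavioural attributes

rfm = np.column_stack([-recency, frequency, monetary])      # lower recency = better
rfm = StandardScaler().fit_transform(rfm)
weights = np.array([0.2, 0.3, 0.5])                         # assumed RFM weights
csb = np.hstack([StandardScaler().fit_transform(common_bases), rfm * weights])

labels = MiniBatchKMeans(n_clusters=4, batch_size=256, n_init=10,
                         random_state=0).fit_predict(csb)
print(np.bincount(labels))                                  # segment sizes
```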
Findings
The results show that the proposed CSB model can reflect consumers' contributions to a market as well as improve clustering performance. Moreover, the proposed CSB-MBK algorithm is demonstrably superior to the SB-MBK, CSB-KMA and CSB-Chameleon algorithms with respect to the Silhouette Coefficient (SC), the Calinski-Harabasz (CH) index and the average running time, and superior to the SB-MBK, RFM-MBK and WRFM-MBK algorithms in terms of inter-market value and characteristic differentiation.
Practical implications
This paper provides a tool for decision-makers and marketers to segment a market quickly, which can help them grasp consumers' activity, loyalty, purchasing power and other characteristics in a target market in a timely manner and achieve precision marketing.
Originality/value
This study is the first to introduce the CSB-MBK algorithm for identifying valuable customers through the comprehensive consideration of the clustering quality, consumer value and segmentation speed. Moreover, the CSB-MBK algorithm can be considered for applications in other markets.
Pethmi De Silva, Nuwan Gunarathne and Satish Kumar
Abstract
Purpose
The purpose of this study is to perform bibliometric analysis to systematically and comprehensively examine the current landscape of digital knowledge, integration and performance in the transformation of sustainability accounting, reporting and assurance.
Design/methodology/approach
This research uses a systematic literature review, following the Scientific Procedures and Rationales for Systematic Literature Review protocol and uses various bibliometric and performance analytical methods. These include annual scientific production analysis, journal analysis, keyword cooccurrence analysis, keyword clustering, knowledge gap analysis and future research direction identification to evaluate the existing literature thoroughly.
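One of these steps, keyword co-occurrence analysis followed by keyword clustering, can be sketched as follows, with invented keyword lists standing in for a real bibliographic export:

```python
# Hedged sketch of a keyword co-occurrence and clustering step: count how often
# keywords appear together across articles, turn counts into a distance, and
# cluster the keywords. The keyword lists are placeholders.
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

articles = [
    ["blockchain", "esg reporting", "assurance"],
    ["artificial intelligence", "life cycle assessment", "esg reporting"],
    ["iot", "esg reporting", "big data"],
    ["big data", "artificial intelligence", "sustainability accounting"],
]
terms = sorted({t for kws in articles for t in kws})
index = {t: i for i, t in enumerate(terms)}
cooc = np.zeros((len(terms), len(terms)))
for kws in articles:
    for a, b in combinations(set(kws), 2):
        cooc[index[a], index[b]] += 1
        cooc[index[b], index[a]] += 1

# Convert co-occurrence counts into a distance and cluster the keywords
dist = 1.0 / (1.0 + cooc)
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(dist[np.triu_indices(len(terms), 1)], method="average"),
                  t=3, criterion="maxclust")
for t, lab in zip(terms, labels):
    print(lab, t)
```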
Findings
The analysis reveals significant insights into the transformative impact of digital technologies on sustainability practices. Annual scientific production and journal analyses highlight key contributors to the adoption of digital technologies in sustainability accounting, reporting and assurance. Keyword cooccurrence analyses have identified key themes in sustainability accounting, reporting and assurance, highlighting the transformative role of digital technologies such as artificial intelligence (AI), blockchain, Internet of Things (IoT) and big data. These technologies enhance corporate accountability, transparency and sustainability by automating processes and improving data accuracy. The integration of these technologies supports environmental, social and governance (ESG) reporting, circular economy initiatives and strategic decision-making, fostering economic, social and environmental sustainability. Cluster-by-coupling analyses delve into nine broader clusters, revealing that IoT improves ESG report accuracy, eXtensible Business Reporting Language structures ESG data and AI enhances life cycle assessments and reporting authenticity. In addition, digital transformation impacts environmental performance, big data optimizes resource use and edge computing improves eco-efficiency. Furthermore, this study identifies avenues for future research to advance the understanding and implementation of digital technology in sustainability accounting, reporting and assurance practices.
Research limitations/implications
Academically, this research enriches the understanding of how digital technologies shape sustainability practices and identifies gaps in digital knowledge and integration. Practically, it provides actionable insights for organizations to improve sustainability reporting and performance by effectively leveraging these technologies. Policy-wise, the findings advocate for frameworks supporting the effective implementation of these technologies, ensuring alignment with global sustainability goals.
Originality/value
This study offers a detailed analysis of the performance and intellectual framework of research on implementing digital technology in sustainability accounting, reporting and assurance. It highlights the evolving research landscape and emphasizes the need for further investigation into how emerging technologies can be leveraged to achieve sustainability goals.
Gianluca Solazzo, Gianluca Elia and Giuseppina Passiante
Abstract
Purpose
This study aims to investigate the Big Social Data (BSD) paradigm, which still lacks a clear and shared definition, causing a lack of clarity and understanding about its beneficial opportunities for practitioners. In the knowledge management (KM) domain, a clear characterization of the BSD paradigm can lead to more effective and efficient KM strategies, processes and systems that leverage a huge amount of structured and unstructured data sources.
Design/methodology/approach
The study adopts a systematic literature review (SLR) methodology based on a mixed analysis approach (unsupervised machine learning and human-based) applied to 199 research articles on BSD topics extracted from Scopus and Web of Science. In particular, machine learning processing has been implemented by using topic extraction and hierarchical clustering techniques.
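A minimal sketch of that machine-learning step, with placeholder abstracts and NMF chosen as one common topic-extraction technique (the paper's exact models and parameters are not reproduced):

```python
# Hedged sketch: extract topics from article abstracts and hierarchically cluster
# the documents on their topic weights. Abstracts below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from scipy.cluster.hierarchy import linkage, fcluster

abstracts = [
    "big social data sources and properties for knowledge management",
    "value exploitation of social media data in organisations",
    "technologies for processing large unstructured social data streams",
    "knowledge management systems leveraging user generated content",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(abstracts)
topics = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(X)
labels = fcluster(linkage(topics, method="ward"), t=2, criterion="maxclust")
print(labels)
```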
Findings
The paper provides a threefold contribution: a conceptualization and a consensual definition of the BSD paradigm through the identification of four key conceptual pillars (i.e. sources, properties, technology and value exploitation); a characterization of the taxonomy of BSD data types that extends previous works on this topic; and a research agenda for future research on BSD and its applications from a KM perspective.
Research limitations/implications
The main limitations of the research lie in the list of articles considered for the literature review, which could be enlarged by considering further sources (in addition to Scopus and Web of Science), further languages (in addition to English) and/or further years (the review considers papers published until 2018). Research implications concern the development of a research agenda organized along five thematic issues, which can feed future research to deepen the BSD paradigm and explore linkages with the KM field.
Practical implications
Practical implications concern the usage of the proposed definition of BSD to purposefully design applications and services based on BSD in knowledge-intensive domains to generate value for citizens, individuals, companies and territories.
Originality/value
The original contribution concerns the definition of the BSD paradigm built through an SLR that combines machine learning processing and human-based processing. Moreover, the research agenda deriving from the study contributes to investigating the BSD paradigm in the wider domain of KM.
Manuel Pedro Rodríguez Bolívar and Laura Alcaide Muñoz
Abstract
Purpose
This study aims to conduct performance and clustering analyses with the help of the Digital Government Reference Library (DGRL) v16.6 database, examining the role of emerging technologies (ETs) in public services delivery.
Design/methodology/approach
VOSviewer and SciMAT techniques were used for clustering and mapping the use of ETs in public services delivery. Collecting documents from the DGRL v16.6 database, the paper uses text mining analysis to identify key terms and trends in e-Government research regarding ETs and public services.
Findings
The analysis indicates that all ETs are strongly linked to each other, except for blockchain technologies (due to their disruptive nature), which indicates that ETs can therefore be seen as cumulative knowledge. In addition, on the whole, the findings identify four stages in the evolution of ETs and their application to public services: the “electronic administration” stage, the “technological baseline” stage, the “managerial” stage and the “disruptive technological” stage.
Practical implications
The output of the present research will help to orient policymakers in the implementation and use of ETs, evaluating the influence of these technologies on public services.
Social implications
The research helps researchers to track research trends and uncover new paths on ETs and their implementation in public services.
Originality/value
Recent research has focused on the need to implement ETs for improving public services, which could help cities improve citizens' quality of life in urban areas. This paper contributes to expanding the knowledge about ETs and their implementation in public services, identifying trends and networks in research about these issues.