Fast semi-supervised evidential clustering
Introduction
Clustering is a knowledge discovery approach that groups objects according to a notion of similarity based on their characteristics. Clustering algorithms are divided into two families: hierarchical clustering, which builds a hierarchy of clusters, and partitional clustering, which generates disjoint subsets of the data. Partitional clustering methods include k-means and its variants, which try to minimize an intraclass inertia subject to constraints on the partition. The k-means algorithm, which assigns each object to a unique cluster, has proved useful in various domains such as the Internet of Things [1], time series [2], and recommender systems [3], [4]. However, the resulting partition, called a hard or crisp partition, cannot express uncertainty regarding the class membership of an object. Such information is particularly useful when, for instance, classes overlap. Modifications of the k-means algorithm have therefore been proposed in order to generate a soft partition. The most popular extension is Fuzzy c-means (FCM), which produces a probabilistic partition. It has been used in many applications [5], [6], [7], [8]. Other variants, such as Possibilistic c-means and Rough k-means, rely on possibility theory and rough set theory, respectively, to handle uncertainty more precisely. A more general variant, called evidential c-means (ECM) [9], is based on the theory of belief functions. It generates a credal partition that encompasses hard, probabilistic, and rough partitions [10]. The algorithm thus provides a rich representation of the uncertainties related to the data. As a consequence, it has been used in various applications [11], [12], [13], and several extensions of ECM have since been proposed. The RECM algorithm has been developed to handle dissimilarity data [14]. ECMdd is a medoid-based variant of ECM [15]. The CCM method considers meta-clusters to reduce misclassification [16], and the DEC algorithm extends CCM by providing a dynamic edited framework [17]. Evidential clustering also comprises EVCLUS [18], a method that searches for a credal partition minimizing the discrepancy between the pairwise distances of the objects and the degrees of conflict between their mass functions. A faster optimization of EVCLUS has been proposed in [19].
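To make the quantity used by EVCLUS concrete, the following minimal sketch computes the degree of conflict between two mass functions, i.e. the sum of the mass products over all pairs of focal sets with an empty intersection. The two-cluster frame, the helper name `conflict`, and the numerical masses are illustrative assumptions and do not come from the paper.

```python
from itertools import product


def conflict(m1, m2):
    """Degree of conflict: sum of m1(A) * m2(B) over pairs with A and B disjoint."""
    return sum(
        mass_a * mass_b
        for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items())
        if not (a & b)
    )


# Hypothetical frame of two clusters; focal sets are frozensets of cluster names.
W1, W2 = frozenset({"w1"}), frozenset({"w2"})
OMEGA = W1 | W2

m_x = {W1: 0.8, OMEGA: 0.2}  # object x: mostly committed to cluster w1
m_y = {W1: 0.7, OMEGA: 0.3}  # object y: similar commitment, no disjoint focal sets
m_z = {W2: 0.9, OMEGA: 0.1}  # object z: mostly committed to cluster w2

print(conflict(m_x, m_y))  # x and y are compatible
print(conflict(m_x, m_z))  # only the pair ({w1}, {w2}) contributes
```

In this toy setting, objects whose mass functions support the same cluster have no conflict, whereas objects supporting different clusters have a high one, which is the behaviour EVCLUS matches against pairwise distances.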
Clustering algorithms create groups with no other information than the characteristics of the data. However, it has been shown that introducing background knowledge in a clustering process can greatly improve the clustering solution [20], [21], [22], [23]. Such methods, called semi-supervised clustering or constrained clustering, express prior information as constraints. The best-known constraints are instance-level constraints: the must-link constraint, which specifies that two objects should be in the same class; the cannot-link constraint, which indicates that two objects are in different classes; and the label constraint, which directly assigns an object to a class. Recently, various semi-supervised evidential clustering algorithms have been proposed for pairwise (i.e. must-link and cannot-link) constraints [24], [25], [26], [27], [28] and for label constraints [27], [28], [29], [30]. The evidential framework is used to express imprecision in label constraints and, as in evidential clustering, generates a credal partition for any type of instance-level constraint. However, the integration of instance-level constraints increases the complexity of the optimization. Thus, [26] proposes a new optimization scheme for CEVCLUS [25], a version of EVCLUS handling pairwise constraints. This method, named k-CEVCLUS, iteratively optimizes the mass function of each instance using a quadratic programming solver. k-CEVCLUS has been generalized by NN-EVCLUS [28], which uses a neural network trained to find the mass functions that minimize the difference between the conflicts and the pairwise distances. The NN-EVCLUS algorithm copes with both pairwise and label constraints.
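The three instance-level constraints can be sketched as a simple check on a hard partition. The function name `violated_constraints`, its arguments, and the toy data are hypothetical; representing a label constraint as a set of allowed classes is an assumption made here to model an imprecise label.

```python
def violated_constraints(labels, must_link=(), cannot_link=(), label_constraints=None):
    """Return the instance-level constraints that a hard partition breaks."""
    label_constraints = label_constraints or {}
    violations = []
    for i, j in must_link:
        if labels[i] != labels[j]:  # the two objects should share a class
            violations.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:  # the two objects should be separated
            violations.append(("cannot-link", i, j))
    for i, allowed in label_constraints.items():
        if labels[i] not in allowed:  # assigned class is outside the allowed set
            violations.append(("label", i, labels[i]))
    return violations


labels = [0, 0, 1, 1, 2]  # hard partition of five objects into three classes
result = violated_constraints(
    labels,
    must_link=[(0, 1), (1, 2)],
    cannot_link=[(2, 3), (0, 4)],
    label_constraints={4: {2}, 0: {1, 2}},  # object 0 has an imprecise label
)
print(result)
```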
Although pairwise constraints are more general, label constraints are often available and provide more information. Thus, many k-means variants using label constraints have been proposed [20], [29], [31]. Among them, the SECM algorithm [29], [30] is the extension of ECM to fuzzy labels. Its goal is twofold: to guide the clustering algorithm towards a better solution and to take advantage of the rich information offered by the credal partition to make decisions.
Like other semi-supervised variants of k-means [20], the SECM algorithm adds a penalty term to the objective function in order to take the label constraints into account. This penalty term complicates the criterion and increases the computing time needed to solve it. We therefore propose to relax some constraints of the SECM objective function in order to create a new heuristic. The new algorithm, called SECM-h, speeds up the convergence of the minimization while keeping a clustering solution close to the one produced by the exact approach, i.e. SECM.
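The effect of a penalty term can be illustrated on a deliberately simplified case. The sketch below is not the SECM criterion: it assumes a hard partition, fixed centers, one-dimensional data, and a penalty that simply counts violated label constraints, weighted by a hypothetical coefficient `alpha`.

```python
def inertia(points, labels, centers):
    """Sum of squared distances between each point and its assigned center."""
    return sum((x - centers[k]) ** 2 for x, k in zip(points, labels))


def penalized_cost(points, labels, centers, known_labels, alpha):
    """Inertia plus alpha times the number of violated label constraints."""
    penalty = sum(1 for i, k in known_labels.items() if labels[i] != k)
    return inertia(points, labels, centers) + alpha * penalty


points = [0.0, 1.0, 2.0, 3.0]
centers = [0.5, 2.5]
known_labels = {1: 1}  # assumption: the object at index 1 is known to be in class 1

geometric = [0, 0, 1, 1]    # partition preferred by the inertia alone
constrained = [0, 1, 1, 1]  # partition that respects the label constraint

for alpha in (0.0, 5.0):
    print(
        alpha,
        penalized_cost(points, geometric, centers, known_labels, alpha),
        penalized_cost(points, constrained, centers, known_labels, alpha),
    )
```

With `alpha = 0` the purely geometric partition has the lower cost; with a large enough `alpha` the partition that satisfies the label constraint becomes preferable, at the price of a more complicated criterion.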
The rest of the paper is organized as follows. Section 2 recalls the necessary background on the theory of belief functions and the extensions of k-means leading to the evidential c-means. Then, the SECM algorithm is detailed. Section 3 describes the SECM-h algorithm, a new approach to minimize the SECM objective function. The interest of the method is demonstrated in Section 4 with experiments on real data sets. Section 5 concludes the paper and gives some perspectives.
Preliminaries
Evidential clustering is based on the Dempster-Shafer theory, also known as the theory of belief functions. To make this paper self-contained, this section recalls some important definitions and results of the Dempster-Shafer theory.
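As a reminder of the basic objects of the theory, the sketch below represents a mass function as a dictionary from focal sets to masses and computes the associated belief and plausibility of a subset, using their standard definitions. The three-cluster frame, the function names, and the masses are illustrative assumptions.

```python
def belief(m, subset):
    """Bel(A): total mass of the non-empty focal sets included in A."""
    return sum(v for focal, v in m.items() if focal and focal <= subset)


def plausibility(m, subset):
    """Pl(A): total mass of the focal sets that intersect A."""
    return sum(v for focal, v in m.items() if focal & subset)


# Hypothetical mass function over the frame {w1, w2, w3}; the masses sum to 1.
m = {
    frozenset({"w1"}): 0.5,              # evidence pointing precisely to w1
    frozenset({"w1", "w2"}): 0.3,        # evidence that cannot separate w1 and w2
    frozenset({"w1", "w2", "w3"}): 0.2,  # total ignorance
}

A = frozenset({"w1", "w2"})
B = frozenset({"w2"})
print(belief(m, A), plausibility(m, A))  # Bel(A) never exceeds Pl(A)
print(belief(m, B), plausibility(m, B))  # no focal set is included in {w2}
```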
Optimization scheme
The SECM problem discussed in the previous section can also be solved using a heuristic approach. Indeed, heuristic methods have been used in optimization problems to reduce the execution time of algorithms [20], [21]. Such methods usually involve a trade-off between execution time and optimality, completeness and/or accuracy.
As discussed in Section 2.3, the SECM problem has been solved using a Gauss-Seidel optimization method to find the 3-tuple () that minimizes the cost function (13).
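The principle of a Gauss-Seidel (block-wise alternating) scheme can be sketched on a simpler problem. The example below is an assumption-laden stand-in, not the SECM procedure: it minimizes the plain k-means inertia over two blocks of variables (labels, then centers) on one-dimensional data, whereas SECM alternates over three blocks and uses the cost function (13). The function name and the data are hypothetical.

```python
def alternate_minimization(points, centers, max_iter=100):
    """Block-wise minimization of the k-means inertia over (labels, centers)."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Block 1: centers fixed, assign each point to its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            for x in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers would not move either
        labels = new_labels
        # Block 2: labels fixed, each center becomes the mean of its points.
        for k in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:  # keep the previous center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return labels, centers


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
labels, centers = alternate_minimization(points, centers=[0.0, 1.0])
print(labels, centers)
```

Each block update can only decrease the criterion, which is why such schemes converge, but every additional block or penalty term adds work per iteration.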
Experimental protocol
To illustrate the effectiveness of the proposed clustering method, we carried out a series of experiments using data sets coming from the UCI repository [38]. The characteristics of the data sets are shown in Table 3.
Real labels are known for all data sets. LettersIJL refers to the Letters data set where only three classes are kept and the number of instances is reduced by randomly selecting 10% of them per class, as done in [22]. Since Column, Wdbc and Wine are a data sets
Conclusion
The SECM algorithm is a semi-supervised clustering variant of ECM based on the theory of belief functions. As such, it is capable of handling uncertainty and imprecision of the background knowledge and provides a credal partition that allows expressing imprecision and uncertainty regarding the assignment of objects to clusters. The SECM method, which is based on the minimization of an objective function, introduces prior knowledge in the form of label constraints. The optimization problem is
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (42)
Seira: an effective algorithm for IoT resource allocation problem. Comput. Commun. (2018)
A decomposition-clustering-ensemble learning approach for solar radiation forecasting. Sol. Energy (2018)
A scalable privacy-preserving recommendation scheme via bisecting k-means clustering. Inf. Process. Manag. (2013)
Novel centroid selection approaches for kmeans-clustering based recommender systems. Inf. Sci. (2015)
Application of fuzzy c-means clustering to PRTR chemicals uncovering their release and toxicity characteristics. Sci. Total Environ. (2018)
Semantic clustering fuzzy c means spectral model based comparative analysis of cardiac color ultrasound and electrocardiogram in patients with left ventricular heart failure and cardiomyopathy. Future Gener. Comput. Syst. (2019)
An improved fast fuzzy c-means using crow search optimization algorithm for crop identification in agricultural. Expert Syst. Appl. (2019)
ECM: an evidential version of the fuzzy c-means algorithm. Pattern Recognit. (2008)
RECM: relational evidential c-means algorithm. Pattern Recognit. Lett. (2009)
ECMdd: evidential c-medoids clustering with multiple prototypes. Pattern Recognit. (2016)
Dynamic evidential clustering algorithm. Knowl.-Based Syst.
Evidential clustering of large dissimilarity data. Knowl.-Based Syst.
CECM: constrained evidential c-means algorithm. Comput. Stat. Data Anal.
k-CEVCLUS: constrained evidential clustering of large dissimilarity data. Knowl.-Based Syst.
The transferable belief model. Artif. Intell.
3D magnetization inversion using fuzzy c-means clustering with application to geology differentiation. Geophysics
Evaluating and comparing soft partitions: an approach based on Dempster–Shafer theory. IEEE Trans. Fuzzy Syst.
Spatial evidential clustering with adaptive distance metric for tumor segmentation in FDG-PET images. IEEE Trans. Biomed. Eng.
ECTD: evidential clustering and case types detection for case base maintenance
An evidential clustering for collaborative filtering based on users' preferences
Credal c-means clustering method based on belief functions. Knowl.-Based Syst.