Fast semi-supervised evidential clustering

https://doi.org/10.1016/j.ijar.2021.03.008

Abstract

Semi-supervised clustering is a constrained clustering technique that organizes a collection of unlabeled data into homogeneous subgroups with the help of domain knowledge expressed as constraints. Most of these methods are variants of the popular k-means clustering algorithm and, as such, are based on the minimization of a criterion. Among existing semi-supervised clustering methods, Semi-supervised Evidential Clustering (SECM) deals with the problem of uncertain/imprecise labels and creates a credal partition. In this work, a new heuristic algorithm, called SECM-h, is presented. The proposed algorithm relaxes the constraints of SECM in such a way that the optimization problem can be solved using the Lagrangian method. Experimental results show that the proposed algorithm substantially reduces execution time while maintaining accuracy.

Introduction

Clustering is a knowledge discovery approach that aims at grouping objects according to a notion of similarity based on the characteristics of the objects. Clustering algorithms are divided into two families: hierarchical clustering, which builds a hierarchy of clusters, and partitional clustering, which generates disjoint subsets of the data. Partitional clustering methods include k-means and its variants, which try to minimize an intraclass inertia subject to constraints on the partition. The k-means algorithm, which assigns each object to a unique cluster, has been shown to be useful in various domains such as the Internet of Things [1], time series [2], and recommender systems [3], [4]. However, the obtained partition, called a hard or crisp partition, is unable to express uncertainties regarding the class membership of an object. Such information is particularly useful when, for instance, classes overlap. Thus, modifications of the k-means algorithm have been proposed in order to generate a soft partition. The most popular extension is the Fuzzy c-means (FCM), which produces a probabilistic partition. It has been used in many applications [5], [6], [7], [8]. Other variants, such as Possibilistic c-means and Rough k-means, use possibility theory and rough set theory, respectively, to handle uncertainties more precisely. A more general variant, called evidential c-means (ECM) [9], is based on the theory of belief functions. It generates a credal partition that encompasses hard, probabilistic, and rough partitions [10]. The algorithm provides a rich representation of the uncertainties related to the data. As a consequence, it has been used in various applications [11], [12], [13], and several extensions of ECM have since been proposed. The RECM algorithm has been developed to handle dissimilarity data [14]. ECMdd is a medoid-based variant of ECM [15]. The CCM method considers meta-clusters to reduce misclassification [16], and the DEC algorithm extends CCM by providing a dynamic edited framework [17]. Evidential clustering also comprises EVCLUS [18], a method that searches for a credal partition minimizing the discrepancy between the pairwise distances of the objects and the conflict between their mass functions. A faster optimization of EVCLUS has been proposed in [19].
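The relation between hard, probabilistic, and credal partitions mentioned above can be illustrated with a minimal sketch. It assumes, for illustration only, that a credal partition is stored as a list of mass functions, each being a dictionary from subsets of clusters (frozensets) to masses; the function name `partition_kind` and the cluster names `w1`, `w2` are hypothetical. The rough partition case is left out.

```python
def partition_kind(credal_partition):
    """Classify a credal partition by the shape of its mass functions.

    Each mass function is a dict {frozenset of clusters: mass}.
    (Hypothetical representation, chosen for this sketch only.)
    """
    focal_sets = [focal
                  for m in credal_partition
                  for focal, value in m.items() if value > 0]
    # A focal set that is not a singleton (a set of several clusters,
    # or the empty set) goes beyond hard and probabilistic partitions.
    if any(len(focal) != 1 for focal in focal_sets):
        return "credal"
    # All focal sets are singletons: the partition is hard when every
    # object puts its whole mass on one cluster.
    if all(max(m.values()) == 1.0 for m in credal_partition):
        return "hard"
    return "probabilistic"


hard = [{frozenset({"w1"}): 1.0}, {frozenset({"w2"}): 1.0}]
soft = [{frozenset({"w1"}): 0.7, frozenset({"w2"}): 0.3},
        {frozenset({"w2"}): 1.0}]
credal = [{frozenset({"w1"}): 0.6, frozenset({"w1", "w2"}): 0.4},
          {frozenset({"w2"}): 1.0}]

print(partition_kind(hard), partition_kind(soft), partition_kind(credal))
```

In the third example, the mass 0.4 assigned to {w1, w2} expresses that the first object may belong to either cluster without committing to one of them, which a hard or probabilistic partition cannot represent.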

Clustering algorithms create groups with no other information than the characteristics of the data. However, it has been shown that introducing some background knowledge into a clustering process can greatly improve the clustering solution [20], [21], [22], [23]. Such methods, called semi-supervised clustering or constrained clustering, express prior information as constraints. The best-known constraints are instance-level constraints: the must-link constraint, which specifies that two objects should be in the same class; the cannot-link constraint, which indicates that two objects are in different classes; and the label constraint, which directly assigns an object to a class. Recently, various semi-supervised evidential clustering algorithms have been proposed for pairwise (i.e. must-link and cannot-link) constraints [24], [25], [26], [27], [28] and for label constraints [29], [30], [27], [28]. The evidential framework is used to express imprecision for label constraints and, as in evidential clustering, generates a credal partition for any type of instance-level constraint. However, the integration of instance-level constraints implies a greater optimization complexity. Thus, [26] proposes a new optimization scheme for CEVCLUS [25], a version of EVCLUS handling pairwise constraints. This method, named k-CEVCLUS, iteratively optimizes the mass function of each instance using a quadratic programming solver. k-CEVCLUS has been generalized by NN-EVCLUS [28], which uses a neural network trained to find the mass functions that minimize the difference between the conflicts and the pairwise distances. The NN-EVCLUS algorithm copes with both pairwise and label constraints.
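The three instance-level constraints described above can be sketched as a simple check against a hard partition. This is an illustration only, not part of any of the cited algorithms: the function name `violated_constraints` and the data layout are assumptions, and a label constraint is represented here as a set of admissible classes so that an imprecise label can be expressed.

```python
def violated_constraints(labels, must_link, cannot_link, label_constraints):
    """List the constraints that a hard partition fails to satisfy.

    labels            -- list giving the cluster of each object
    must_link         -- pairs (i, j) that should share a cluster
    cannot_link       -- pairs (i, j) that should be in different clusters
    label_constraints -- dict {object index: set of admissible clusters}
    (Hypothetical layout, chosen for this sketch only.)
    """
    violations = []
    for i, j in must_link:
        if labels[i] != labels[j]:
            violations.append(("must-link", i, j))
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            violations.append(("cannot-link", i, j))
    for i, admissible in label_constraints.items():
        if labels[i] not in admissible:
            violations.append(("label", i))
    return violations


labels = [0, 0, 1, 1]
result = violated_constraints(
    labels,
    must_link=[(0, 1), (1, 2)],      # (1, 2) is violated
    cannot_link=[(0, 3), (2, 3)],    # (2, 3) is violated
    label_constraints={0: {0}, 3: {0, 2}},  # object 3 is violated
)
print(result)
```

A label constraint carries more information than a pairwise one in this representation: it names the admissible classes directly, whereas a must-link or cannot-link pair only relates two objects to each other.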

Although pairwise constraints are more general, label constraints are often available and provide more information. Thus, many k-means variants using label constraints have been proposed [20], [31], [29]. Among them, the SECM algorithm [29], [30] is the extension of ECM to fuzzy labels. Its goal is twofold: to guide the clustering algorithm towards a better solution and to take advantage of the rich information available in a credal partition to make decisions.

Like other semi-supervised variants of k-means [20], the SECM algorithm adds a penalty term to the objective function in order to take the label constraints into account. Such a penalty term complicates the criterion and increases the computing time needed to minimize it. Thus, we propose to relax some constraints of the SECM objective function in order to create a new heuristic. The new algorithm, called SECM-h, speeds up the convergence of the minimization while keeping the clustering solution close to the one produced by the exact approach, i.e. SECM.
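The idea of adding a penalty term to a k-means-type criterion can be sketched as follows. This is not the SECM objective function, which is not reproduced in this excerpt: the criterion below is a deliberately simplified stand-in (intraclass inertia plus a count of violated label constraints), and the names `penalized_inertia` and `alpha` are assumptions made for the illustration.

```python
def penalized_inertia(points, centers, assignment, known_labels, alpha):
    """Intraclass inertia plus a penalty for violated label constraints.

    points       -- list of coordinate tuples
    centers      -- list of coordinate tuples, one per cluster
    assignment   -- cluster index of each point
    known_labels -- dict {point index: required cluster} (label constraints)
    alpha        -- weight of the penalty term (hypothetical parameter)
    """
    inertia = sum(
        sum((x - c) ** 2 for x, c in zip(points[i], centers[assignment[i]]))
        for i in range(len(points))
    )
    penalty = sum(1 for i, label in known_labels.items()
                  if assignment[i] != label)
    return inertia + alpha * penalty


points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.5, 0.0), (4.0, 0.0)]
known_labels = {1: 1}  # point 1 is known to belong to cluster 1

# Geometrically natural assignment, which violates the label constraint.
cost_unconstrained = penalized_inertia(points, centers, [0, 0, 1],
                                       known_labels, alpha=2.0)
# Assignment that satisfies the constraint at the cost of a larger inertia.
cost_constrained = penalized_inertia(points, centers, [0, 1, 1],
                                     known_labels, alpha=2.0)
print(cost_unconstrained, cost_constrained)
```

With this small weight, violating the constraint is still cheaper than satisfying it; a larger `alpha` reverses the comparison. The weight therefore controls how strongly the prior knowledge overrides the geometry of the data.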

The rest of the paper is organized as follows. Section 2 recalls the necessary background on the theory of belief functions and the extensions of k-means leading to the evidential c-means. Then, the SECM algorithm is detailed. Section 3 describes the SECM-h algorithm, a new approach to minimizing the SECM criterion. The benefits of the method are demonstrated in Section 4 through experiments on real data sets. Section 5 concludes the paper and gives some perspectives.

Section snippets

Preliminaries

Evidential clustering is based on the Dempster-Shafer theory, also known as the theory of belief functions. To make this paper self-contained, this section recalls some important definitions and results of the Dempster-Shafer theory.
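As a reminder of the basic objects of the theory, the following minimal sketch computes the belief and plausibility of a set from a mass function, using the standard definitions: the belief of A sums the masses of the non-empty focal sets included in A, and the plausibility of A sums the masses of the focal sets that intersect A. The frame {w1, w2, w3} and the numerical masses are hypothetical.

```python
def belief(mass, subset):
    """bel(A): total mass of the non-empty focal sets included in A."""
    return sum(m for focal, m in mass.items() if focal and focal <= subset)


def plausibility(mass, subset):
    """pl(A): total mass of the focal sets that intersect A."""
    return sum(m for focal, m in mass.items() if focal & subset)


# Hypothetical mass function on the frame {w1, w2, w3}; masses sum to 1.
frame = frozenset({"w1", "w2", "w3"})
mass = {
    frozenset({"w1"}): 0.5,
    frozenset({"w1", "w2"}): 0.3,
    frame: 0.2,  # mass on the whole frame represents ignorance
}

A = frozenset({"w2"})
print(belief(mass, A), plausibility(mass, A))
```

Here no focal set is included in {w2}, so its belief is zero, while two focal sets intersect it, so its plausibility is 0.3 + 0.2. The gap between the two values reflects the imprecision of the evidence about w2.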

Optimization scheme

The SECM problem discussed in the previous section can also be solved using a heuristic approach. Indeed, heuristic methods have been used in optimization problems to reduce execution time [20], [21]. Such methods usually involve a trade-off between execution time and optimality, completeness and/or accuracy.

As discussed in Section 2.3, the SECM problem has been solved using a Gauss-Seidel optimization method to find the 3-tuple (V,S,M) that minimizes the cost function (13).
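The principle of such an alternate (block-wise) optimization can be sketched on a much simpler problem. The example below is not the SECM scheme: it uses plain hard k-means on one-dimensional data, with only two blocks of variables (assignments and centers) instead of the three blocks (V,S,M), and the function name `alternate_minimization` is an assumption made for the illustration. Each step minimizes the criterion with respect to one block while the other is held fixed.

```python
def alternate_minimization(points, initial_centers, max_iter=100):
    """Alternately update assignments and centers (hard k-means, 1-D)."""
    centers = list(initial_centers)  # work on a copy of the input
    assignment = None
    for _ in range(max_iter):
        # Block 1: centers fixed, assign each point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
            for x in points
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centers are too
        assignment = new_assignment
        # Block 2: assignments fixed, move each center to its cluster mean.
        for k in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == k]
            if members:  # leave the center of an empty cluster unchanged
                centers[k] = sum(members) / len(members)
    return centers, assignment


centers, assignment = alternate_minimization([0.0, 1.0, 9.0, 10.0],
                                             [0.0, 1.0])
print(centers, assignment)
```

Because each block update cannot increase the criterion, the sequence of criterion values is non-increasing, but the fixed point that is reached depends on the initialization and is in general only a local minimum.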

Experimental protocol

To illustrate the effectiveness of the proposed clustering method, we carried out a series of experiments using data sets coming from the UCI repository [38]. The characteristics of the data sets are shown in Table 3.

Real labels are known for all data sets. LettersIJL refers to the Letters data set where only three classes are kept and the number of instances is reduced by randomly selecting 10% of them per class, as it has been done in [22]. Since Column, Wdbc and Wine are a data sets

Conclusion

The SECM algorithm is a semi-supervised clustering variant of ECM based on the theory of belief functions. As such, it is capable of handling uncertainty and imprecision of the background knowledge and provides a credal partition that allows expressing imprecision and uncertainty regarding the assignment of objects to clusters. The SECM method, which is based on the minimization of an objective function, introduces prior knowledge in the form of label constraints. The optimization problem is

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (42)

  • Z. Zhang et al.

    Dynamic evidential clustering algorithm

    Knowl.-Based Syst.

    (2021)
  • T. Denœux et al.

    Evidential clustering of large dissimilarity data

    Knowl.-Based Syst.

    (2016)
  • V. Antoine et al.

    CECM: constrained evidential c-means algorithm

    Comput. Stat. Data Anal.

    (2012)
  • F. Li et al.

    k-CEVCLUS: constrained evidential clustering of large dissimilarity data

    Knowl.-Based Syst.

    (2018)
  • P. Smets et al.

    The transferable belief model

    Artif. Intell.

    (1994)
  • Y. Li et al.

    3D magnetization inversion using fuzzy c-means clustering with application to geology differentiation

    Geophysics

    (2016)
  • T. Denœux et al.

    Evaluating and comparing soft partitions: an approach based on Dempster–Shafer theory

    IEEE Trans. Fuzzy Syst.

    (2017)
  • C. Lian et al.

    Spatial evidential clustering with adaptive distance metric for tumor segmentation in FDG-PET images

    IEEE Trans. Biomed. Eng.

    (2017)
  • S.B. Ayed et al.

    ECTD: evidential clustering and case types detection for case base maintenance

  • R. Abdelkhalek et al.

    An evidential clustering for collaborative filtering based on users' preferences

  • Z. Liu et al.

    Credal c-means clustering method based on belief functions

    Knowl.-Based Syst.

    (2015)