Feature Selection and Semisupervised Fuzzy Clustering

Fuzzy Inf. Eng. (2009) 2:179-190
DOI 10.1007/s12543-009-0014-0
ORIGINAL ARTICLE

Yi-qing Kong · Shi-tong Wang
School of Information Technology, Jiangnan University, Wuxi 214122, P.R. China
e-mail: kongyiqing@163.com

Received: 18 December 2008 / Revised: 9 April 2009 / Accepted: 10 May 2009
© Springer and Fuzzy Information and Engineering Branch of the Operations Research Society of China

Abstract  Semisupervised fuzzy clustering plays an important role in discovering structure in data sets containing both labelled and unlabelled data. The proposed method learns the tasks of classification and feature selection through a generalized form of Fuzzy C-means. Experimental results on both synthetic and benchmark data sets illustrate appropriate feature selection and good classification accuracy.

Keywords  Semisupervised clustering · Feature selection

1. Introduction

The widespread use of computers and information technology has made data collection a routine task in many areas. The primary challenge in fields such as pattern recognition, data mining, decision support, and knowledge discovery is the extraction of potentially useful information through careful processing and analysis. In many pattern recognition and practical data mining applications, the core of the process lies in building classifiers, which are usually developed from expert knowledge or from training data. To design such classifiers, machine learning techniques apply learning algorithms that make effective use of the training data to achieve the best possible classification accuracy. Traditionally, machine learning is divided into two mechanisms: supervised learning and unsupervised learning.

1.1 Supervised Learning

Supervised learning assumes that the class labels of the data objects being analyzed are known. The process extracts rules that identify groups of objects with the same class label while differentiating among objects with different labels; these rules then form the basis for classifying new observations. Under this definition, supervised learning can be formalized as the problem of inferring a function $y = f(x)$ from a training set $D = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_n, y_n)\}$. Usually the inputs are $d$-dimensional vectors, $x_i = [x_{i,1}, \ldots, x_{i,d}]^{T} \in \mathbb{R}^{d}$. Supervised learning can be used for regression or for classification; in classification problems, $y$ is categorical (e.g., binary, $y \in \{1, 0\}$). The obtained function is evaluated by how accurately it performs on new data assumed to follow the same distribution as the training data. Usually the function $f(\cdot)$ is assumed to have a fixed structure and to depend on a set of parameters $\theta$, so we write $y = f(x, \theta)$, and the goal becomes to estimate $\theta$ from the training data.

Supervised learning can be formulated using either a generative approach or a discriminative approach. In the discriminative approach, the focus is on learning the function $y = f(x, \theta)$ directly from the training set. Well-known discriminative approaches include linear and generalized linear models, k-nearest neighbor classifiers, tree classifiers, feed-forward neural networks, support vector machines (SVM), and other kernel-based methods. In contrast, the generative approach involves estimating the joint probability density function $p(x, y)$ from the training set $D$.
In classification this is usually done by learning the so-called class-conditional densities $p(x|y)$ and the probability of each class $p(y)$ [2]. This can be done, for example, by representing the densities with a kernel estimator or a Gaussian mixture, from which the joint density can be estimated; optimal Bayesian decision rules can then be derived with the standard machinery of Bayesian decision theory.
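As a concrete illustration of this generative route (not part of the original paper), here is a minimal sketch of a Gaussian class-conditional classifier: one Gaussian per class plus class priors, combined by Bayes' rule. All function names are our own.

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """Fit one Gaussian class-conditional density p(x|y) per class plus priors p(y)."""
    classes = np.unique(y)
    params = {}
    for c in classes:
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),                    # p(y = c)
            "mean": Xc.mean(axis=0),                      # mean of p(x|y = c)
            "cov": np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1]),  # regularized covariance
        }
    return params

def log_gaussian(x, mean, cov):
    """Log density of a multivariate Gaussian at a single point x."""
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def predict(params, X):
    """Bayes rule: argmax over classes of log p(x|c) + log p(c)."""
    labels = list(params)
    scores = np.array([[log_gaussian(x, params[c]["mean"], params[c]["cov"])
                        + np.log(params[c]["prior"]) for c in labels] for x in X])
    return np.array(labels)[scores.argmax(axis=1)]
```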
1.2 Unsupervised Learning

Unsupervised learning (often simply called clustering) makes no assumptions about the category structure. Its goal is to discover a natural grouping in a set of objects without knowledge of any class labels; that is, unsupervised learning can be formalized as the problem of inferring a function $y = f(x)$ from a training set $D = \{x_1, \ldots, x_i, \ldots, x_n\}$. Cluster analysis partitions a given collection of patterns into groups of similar individuals and does not require any knowledge about the data.

Unsupervised learning is prevalent in any discipline that involves the analysis of multivariate data. It relies on objective criterion functions that define similarity or dissimilarity among objects; the aim is to find a structure that partitions the data into groups in which similar objects tend to fall into the same group and relatively distinct objects tend to fall into different groups. In principle, the more information we have about the problem, the better the clustering is expected to perform.

1.3 Semisupervised Learning and Related Work

As mentioned above, training data can be either labelled or unlabelled. Most existing supervised classification algorithms require sufficient training data so that the obtained classification model generalizes well; when the number of training samples per class decreases, classification accuracy degrades. Unfortunately, data in many domains, such as text classification, medical imaging, and remote sensing (land use), are largely unlabelled, so standard supervised methods, which require a supervising label for every training sample, are of limited use. Providing labels for even a significant sub-sample of a large database may be impractical, since labelling is usually done by human experts and is therefore expensive, time consuming, and error prone; labelled training data are consequently often very sparse. Obtaining unlabelled data, by contrast, is much easier, since it only involves collecting data known to belong to one of the classes without having to label them; unlabelled data are therefore often abundant. For example, in gene recognition problems it is easy to collect gene data, but very tedious and difficult to label each gene with its corresponding class. How to exploit unlabelled data in classification has therefore become a hot topic and an active research problem in recent years.

The general problem of exploiting unlabelled data in supervised learning leads to semisupervised learning, which uses both labelled and unlabelled examples to build a classifier [10-13]. The motivation is that unlabelled samples may help improve the estimation of the class-conditional densities, and theoretical analyses suggest that unlabelled data may indeed be helpful in classification [14-15].

A general analysis of semisupervised learning for probabilistic classifiers is given in [13]. Although several semisupervised learning techniques have shown an advantage when very few labelled data are available, they do not always outperform traditional supervised methods such as SVM once the amount of labelled data increases. Researchers have also reported cases in which adding unlabelled data degraded classifier performance compared with not using unlabelled data at all. We should therefore keep in mind that unlabelled data are not always helpful and may degrade performance in some cases [12], [15-17], which reveals some puzzling aspects of the problem. Moreover, there is still no clear answer regarding the generalization performance of semisupervised learning techniques.

Many current semisupervised learning algorithms use a generative model for the classifier and employ expectation-maximization (EM) for label estimation or parameter estimation. For example, mixtures of Gaussians, mixtures of experts, and naive Bayes have each been used as the generative model, with EM combining the labelled and unlabelled data for classification (a small illustrative sketch of this EM scheme is given at the end of this subsection). There are also other algorithms, such as transductive inference for support vector machines, which optimizes performance on a specific test set, and graph-based methods, in which a minimum cut on a graph built over the examples yields an optimal labelling of the unlabelled examples under a certain optimization criterion.

The methods mentioned above all approach semisupervised learning from the point of view of traditional supervised classification; in other words, they are traditional supervised classifiers extended to take unlabelled data into account. One can also approach semisupervised learning from the unsupervised clustering side: the classifier is designed with both labelled and unlabelled data, but mainly from the unlabelled data. Taking the labels of individual data points into account within a clustering framework leads to semisupervised clustering, also called partial supervision [1-3].

Semisupervised clustering processes labelled and unlabelled data at the same time. It takes advantage of clustering mechanisms and uses the knowledge contained in the labelled patterns to guide the unsupervised learning process. In this approach we must not only choose a suitable similarity measure but also constrain the solution with the labelled data. In this paper we are concerned with semisupervised clustering based on objective-function optimization: some patterns are assumed to be labelled, and these labelling hints are brought into the objective function. As fuzzy clustering [4] becomes more and more popular, we present the proposed algorithm, following [5], [6], [23], as an extension of Fuzzy C-means (FCM). The successful use of FCM lets us concentrate on the essence of the problem, so we will first review this well-known technique. The objective function of the proposed clustering algorithm consists of two components, one concerned with the supervised mechanism and one with the unsupervised mechanism.
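The EM scheme mentioned above can be sketched as follows. This is a generic illustration under the assumption of one Gaussian component per class, not the formulation of any specific method cited here; the function name semisupervised_em and the convention y = -1 for unlabelled points are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semisupervised_em(X, y, n_classes, n_iter=50):
    """Toy semisupervised EM for a Gaussian mixture with one component per class.
    y holds class indices for labelled points and -1 for unlabelled points."""
    n, d = X.shape
    means = X[np.random.choice(n, n_classes, replace=False)]
    covs = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * n_classes)
    priors = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iter):
        # E-step: soft responsibilities; labelled points are clamped to their class.
        R = np.column_stack([priors[k] * multivariate_normal.pdf(X, means[k], covs[k])
                             for k in range(n_classes)])
        R /= R.sum(axis=1, keepdims=True)
        labelled = y >= 0
        R[labelled] = np.eye(n_classes)[y[labelled]]
        # M-step: re-estimate priors, means, and covariances from all points.
        Nk = R.sum(axis=0)
        priors = Nk / n
        means = (R.T @ X) / Nk[:, None]
        for k in range(n_classes):
            diff = X - means[k]
            covs[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return means, covs, priors, R
```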
1.4 Feature Selection Consideration

The input to a machine learning algorithm can be a matrix containing the similarities and dissimilarities among pairs of points, where each point is described by a vector of attributes, called features. In this paper we also address the question of feature selection. In general, the more information we have about the features of each pattern, the better a machine learning algorithm is expected to perform, which suggests that we should use as many features as possible. In practice this is not the case: some features are simply noise and do not contribute to, or even degrade, the learning process. The task of selecting suitable features is known as feature selection.

Feature selection has been widely used in supervised learning [18-21], where the goal is to select the features that achieve the highest accuracy on test data. It is not widely used in unsupervised learning or clustering, one important reason being that without class labels it is hard to assess the relevance of individual features. In this paper we perform feature selection by adding a feature-weight variable; a similar idea of introducing feature weights into the objective function appears in [22]. Feature selection with feature weights and semisupervised clustering based on FCM have each been proposed by other researchers, but the combination of the two has not been used before, which motivates this research. We are thus concerned with the joint task of classification and feature selection through a generalized form of Fuzzy C-means.

1.5 Organization of the Paper

The material is arranged into five sections. The problem formulation, based on a demonstration, is presented in Section 2. Section 3 gives a detailed derivation of the algorithm and discusses an optimization scheme. Section 4 reports experimental results on synthetic two-dimensional data and on benchmark data sets. Conclusions are given in Section 5.

2. The Problem Description

The notation used throughout this study is fairly standard in fuzzy clustering; Table 1 gives a detailed description.
Table 1: Summary of Notations Used

  A = [a_ij]_{C×K}            transformation matrix
  a_ij                        the number of patterns x_n with x_n ∈ k_j and x_n ∈ c_i
  B = [b_ij]_{C×K}            binary cluster-class corresponding matrix; b_ij = 1 if c_i ∈ k_j, and 0 otherwise
  C                           the number of clusters
  c_i                         the cluster prototypes
  D                           the number of features
  d_in                        the distance between x_n and c_i
  K                           the number of classes
  k_j                         the classes
  l_n                         l_n = 1 if x_n is labelled, and 0 otherwise
  m_jn                        the difference between the binary membership degree {0, 1} of labelled x_n to k_j and the sum of the membership degrees of x_n to all clusters belonging to k_j
  N                           the number of data points
  U = [μ_in]_{C×N}            partition matrix
  μ_in                        the membership degree of x_n to c_i; the higher the value, the more significant the pattern in the cluster
  V^(i) = [υ^(i)_in]_{C×N}    partition matrix
  υ^(i)_in                    the cluster membership degree of x_n to c_i given by the labelled data
  V^(j) = [υ^(j)_jn]_{K×N}    partition matrix, V^(j) = B^T · V^(i)
  υ^(j)_jn                    the class membership degree of x_n to k_j given by the labelled data
  x_n                         the patterns
  y_jn                        y_jn = 1 if x_n is labelled and x_n ∈ k_j; 0 if x_n is labelled and x_n ∉ k_j; 1/K if x_n is unlabelled
  γ_1                         scaling factor that balances the supervised and unsupervised components of the objective function
  γ_2                         learning rate that controls the updating of membership degrees during stepwise optimization of the objective function

Fig.1: Ripley train data set.   Fig.2: FCM result on two clusters.

Given N data points to be clustered into C clusters, the usual choice is FCM, which amounts to optimizing an objective function. FCM is typical of unsupervised learning: we search for structure in the data by minimizing a certain objective function. To demonstrate the problem clearly, we first look at a simple example. The two-class problem is an important category in pattern recognition, so we analyze it first, using Ripley's synthetic data set [16]: it has two real-valued coordinates and two classes, each class being a bimodal mixture of two Gaussians, with 125 points per class. Figure 1 shows the Ripley training data set with its two classes, and Figure 2 shows the FCM clustering result with two clusters.

We can clearly see that FCM performs badly on this data with two clusters, and the question is why. Recall that cluster analysis partitions a given collection of patterns into groups of similar individuals, and the structure is discovered through similarity based on a certain distance measure. Traditionally, clustering does not require any knowledge about the data, which is attractive precisely because no labelled information is needed. From this example, however, we clearly see that unsupervised clustering does not perform well in discovering the "hidden" structure of the data set, which raises the question of how to utilize the labelled information. In recent research, semisupervised learning has attracted much attention for learning from both labelled and unlabelled data. We perform classification through semisupervised clustering based on the known class labels, as in [5], and from that we give an extension that handles classification together with feature weights.
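For reference, here is a minimal sketch of the standard FCM just discussed, with the fuzzification exponent fixed at 2 to match the squared memberships used in the objective below; this is an illustrative implementation with our own names, not the authors' code.

```python
import numpy as np

def fcm(X, C, m=2.0, n_iter=100, eps=1e-5, seed=0):
    """Standard Fuzzy C-means: alternate membership and prototype updates."""
    rng = np.random.default_rng(seed)
    N = len(X)
    U = rng.random((C, N))
    U /= U.sum(axis=0)                                   # memberships sum to 1 over clusters
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)            # prototype update
        dist = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=0)                                     # membership update
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, centers
```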
We formulate the problem of admitting information about the class of patterns. The class labelling of individual patterns is denoted by $y_{jn}$. To accommodate the information about class membership, we generalize the objective function so that it captures possible deviations between the labelling discovered by the clustering method (the result of clustering) and the class labels provided by $y_{jn}$. The objective function bringing together both labelled and unlabelled patterns assumes the form

$$J(U, C) = \sum_{i=1}^{C}\sum_{n=1}^{N} \mu_{in}^{2} d_{in}^{2} + \gamma_{1} \sum_{i=1}^{C}\sum_{n=1}^{N} \bigl(\mu_{in} - \upsilon_{in}^{(i)}\bigr)^{2} d_{in}^{2}, \qquad (1)$$

$$d_{in}^{2} = \sum_{d=1}^{D} \alpha_{d}^{2} (x_{nd} - c_{id})^{2}, \qquad (2)$$

$$\text{s.t.}\quad \sum_{i=1}^{C} \mu_{in} = 1 \ \forall n, \quad 0 < \sum_{n=1}^{N} \mu_{in} < N \ \forall i, \quad \upsilon_{in} \in [0, 1], \quad \sum_{d=1}^{D} \alpha_{d} = 1. \qquad (3)$$

The first component of Eq.(1) is identical to standard FCM, so we do not dwell on it. The second component requires some explanation: it is concerned with partial supervision. The scaling factor $\gamma_{1}$ helps establish a balance between the supervised and unsupervised data, and the terms $\upsilon_{in}^{(i)}$ help optimize the memberships of the labelled samples. Here $\mu_{in}$ denotes the cluster membership degree of $x_{n}$ to $c_{i}$ discovered by the clustering process, while $\upsilon_{in}^{(i)}$ is the cluster membership degree of $x_{n}$ to $c_{i}$ obtained by optimizing $Q(V)$, described below. The term $\mu_{in} - \upsilon_{in}^{(i)}$ in Eq.(1) captures the difference between the labelling produced by the clustering process and the class labels provided by the labelled data. Eq.(2) gives the distance between $x_{n}$ and $c_{i}$, with the scaling coefficients $\alpha_{d}$ added to accomplish feature selection. Intuitively, different values of $\alpha_{d}$ imply different importance of the features: the greater the value, the more important the feature, and $\alpha_{d} = 0$ means the $d$-th feature is ignored entirely. From Eq.(1) we can see that if we dropped the labelled-data term $\sum_{i=1}^{C}\sum_{n=1}^{N}(\mu_{in} - \upsilon_{in}^{(i)})^{2} d_{in}^{2}$ and the scaling coefficients $\alpha_{d}$, Eq.(1) would reduce to standard FCM. In other words, feature selection and semisupervised fuzzy clustering is a generalized FCM, and FCM is the special case obtained when $\gamma_{1} = 0$ and all feature weights $\alpha_{d}$ are equal. The terms $\upsilon_{in}^{(i)}$ in Eq.(1) are obtained by minimizing the objective function $Q$:

$$Q(V) = \sum_{j=1}^{K}\sum_{n=1}^{N} l_{n}\, m_{jn}^{2}, \qquad (4)$$

$$m_{jn} = y_{jn} - \upsilon_{jn}^{(j)}. \qquad (5)$$

Eq.(5) measures the difference between the class labels provided by $y_{jn}$ and the labelling results $\upsilon_{jn}^{(j)}$ produced by the proposed clustering. To minimize Eq.(4), a gradient method is used:

$$\upsilon_{in}^{(i)(t+1)} = \upsilon_{in}^{(i)(t)} + 2\gamma_{2}\, l_{n} \sum_{j=1}^{K} m_{jn} \cdot \begin{cases} 1, & \text{if } c_{i} \in k_{j}, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

The proposed algorithm consists of two phases:

Phase 1: Optimize $Q(V)$.
Phase 2: Optimize $J(U, C)$ with the $\upsilon_{in}^{(i)}$ obtained in Phase 1.
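To make the two phases concrete, the following sketch evaluates Eqs.(1)-(2) and performs the gradient step of Eq.(6). It follows our reconstruction of the formulas above; the function names and the clipping of memberships to [0, 1] (to respect Eq.(3)) are our own choices.

```python
import numpy as np

def weighted_dist2(X, centers, alpha):
    """Eq.(2): d_in^2 = sum_d alpha_d^2 (x_nd - c_id)^2, returned with shape (C, N)."""
    diff = X[None, :, :] - centers[:, None, :]           # (C, N, D)
    return np.einsum('cnd,d->cn', diff ** 2, alpha ** 2)

def objective_J(U, V_i, X, centers, alpha, gamma1):
    """Eq.(1): unsupervised FCM term plus the partial-supervision term."""
    d2 = weighted_dist2(X, centers, alpha)
    return np.sum(U**2 * d2) + gamma1 * np.sum((U - V_i)**2 * d2)

def update_V(V_i, B, y, l, gamma2):
    """One gradient step of Eq.(6) on the labelled-data memberships V^(i).
    V_i: (C, N); B: (C, K) binary cluster-class matrix; y: (K, N); l: (N,)."""
    V_j = B.T @ V_i                                      # class memberships, V^(j) = B^T V^(i)
    M = y - V_j                                          # Eq.(5): m_jn
    # For cluster i, sum m_jn over the classes j that cluster i belongs to.
    step = 2 * gamma2 * l[None, :] * (B @ M)             # (C, N)
    return np.clip(V_i + step, 0.0, 1.0)                 # keep memberships in [0, 1]
```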
3. The Clustering Algorithm

The problem splits into three parts: the calculation of the partition matrix, the calculation of the prototypes, and the determination of which clusters belong to which class. The minimization of $J(U, C)$ is carried out with respect to the partition matrix $U$ and the prototypes $C$. The optimization of the partition matrix has to respect the constraints imposed on it; these are incorporated by means of Lagrange multipliers, one for each data point, which converts the problem into an unconstrained optimization. Setting the derivatives of $J(U, C)$ to zero (in particular $\partial J(U, C)/\partial \mu_{st} = 0$) for a given cluster $s$, data point $t$, and feature $p$, where $s = 1, 2, \ldots, C$, $t = 1, 2, \ldots, N$, and $p = 1, 2, \ldots, D$, we get

$$c_{sp} = \frac{\sum_{n=1}^{N} \alpha_{p}^{2}\bigl(\mu_{sn}^{2} + \gamma_{1}(\mu_{sn} - \upsilon_{sn}^{(i)})^{2}\bigr)\, x_{np}}{\sum_{n=1}^{N} \alpha_{p}^{2}\bigl(\mu_{sn}^{2} + \gamma_{1}(\mu_{sn} - \upsilon_{sn}^{(i)})^{2}\bigr)}, \qquad (7)$$

$$\alpha_{p} = \frac{\Bigl(\sum_{d=1}^{D}\bigl(\sum_{i=1}^{C}\sum_{n=1}^{N}(\mu_{in}^{2} + \gamma_{1}(\mu_{in} - \upsilon_{in}^{(i)})^{2})(x_{nd} - c_{id})^{2}\bigr)^{-1}\Bigr)^{-1}}{\sum_{i=1}^{C}\sum_{n=1}^{N}(\mu_{in}^{2} + \gamma_{1}(\mu_{in} - \upsilon_{in}^{(i)})^{2})(x_{np} - c_{ip})^{2}}, \qquad (8)$$

$$\mu_{st} = \frac{1}{1+\gamma_{1}}\left[\frac{1 + \gamma_{1}\bigl(1 - \sum_{i=1}^{C}\upsilon_{it}^{(i)}\bigr)}{\sum_{i=1}^{C} d_{st}^{2}/d_{it}^{2}} + \gamma_{1}\upsilon_{st}^{(i)}\right], \quad \text{where } d_{in}^{2} = \sum_{d=1}^{D}\alpha_{d}^{2}(x_{nd} - c_{id})^{2}. \qquad (9)$$

Table 2: The Proposed Algorithm

  Step 1  Compute U^(0) by FCM.
  Step 2  Set (V^(i))^(0) = U^(0); compute A^(0) and B^(0); compute (V^(j))^(0) = (B^(0))^T (V^(i))^(0).
  Step 3  REPEAT compute (V^(i))^(t1+1) UNTIL |Q^(t1+1) − Q^(t1)| < ε_1.
  Step 4  REPEAT compute C^(t2), α^(t2), U^(t2) UNTIL |J^(t2+1) − J^(t2)| < ε_2.
  Step 5  Compute B^(t).

Two matrices in Table 1 are central to the proposed algorithm: the transformation matrix A and the corresponding matrix B. The element a_ij of A records the number of data points of class j that appear in cluster i. The expression i = arg max_i(μ_in) tells us which cluster data point x_n falls in: it computes the maximum over the membership grades (the elements of the partition matrix) and returns the index i at which this maximum occurs. This assignment rests on the intuitive argument that the maximal membership value determines the resulting cluster assignment (as is also done in FCM). For a labelled data point x_n, the expression j = arg(y_jn = 1) tells us which class x_n belongs to, namely class j. The cluster index and the class index together tell us that data point n of class j belongs to cluster i, and in this way the elements a_ij are accumulated one point at a time. From the transformation matrix A, the expression j = arg max_j(a_ij) tells us that cluster i belongs to class j, which yields the binary cluster-class corresponding matrix B.

The optimization proceeds iteratively by updating the partition matrix U, the prototype matrix C, and the feature weights α, using a standard termination criterion based on the maximal norm of the difference between successive partition matrices. The proposed algorithm is summarized in Table 2.
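The update formulas of Eqs.(7)-(9) and the construction of the matrices A and B can be sketched as follows. This is our illustrative reconstruction, with our own function names and with class labels passed as an index vector rather than as the matrix y_jn of Table 1.

```python
import numpy as np

def update_prototypes(X, U, V_i, alpha, gamma1):
    """Eq.(7): weighted prototype update; X is (N, D), U and V_i are (C, N)."""
    W = U**2 + gamma1 * (U - V_i)**2                     # combined weights, (C, N)
    return (W @ X) / W.sum(axis=1, keepdims=True)        # the alpha_p^2 factor cancels

def update_alpha(X, U, V_i, centers, gamma1):
    """Eq.(8): closed-form feature weights under the constraint sum_d alpha_d = 1."""
    W = U**2 + gamma1 * (U - V_i)**2                     # (C, N)
    diff2 = (X[None, :, :] - centers[:, None, :])**2     # (C, N, D)
    S = np.einsum('cn,cnd->d', W, diff2)                 # per-feature weighted scatter
    return (1.0 / S) / np.sum(1.0 / S)

def update_memberships(X, V_i, centers, alpha, gamma1):
    """Eq.(9): partition-matrix update with partial supervision."""
    diff = X[None, :, :] - centers[:, None, :]
    d2 = np.einsum('cnd,d->cn', diff**2, alpha**2) + 1e-12    # Eq.(2)
    ratio = d2 * np.sum(1.0 / d2, axis=0, keepdims=True)      # sum_i d_st^2 / d_it^2
    top = 1.0 + gamma1 * (1.0 - V_i.sum(axis=0, keepdims=True))
    return (top / ratio + gamma1 * V_i) / (1.0 + gamma1)

def cluster_class_matrices(U, y, labelled, K):
    """Build A (class counts per cluster) and the binary cluster-class matrix B."""
    C = U.shape[0]
    A = np.zeros((C, K))
    clusters = U.argmax(axis=0)                          # i = argmax_i mu_in
    for n in np.flatnonzero(labelled):
        A[clusters[n], y[n]] += 1                        # j = class of labelled x_n
    B = np.zeros((C, K))
    B[np.arange(C), A.argmax(axis=1)] = 1                # cluster i -> class argmax_j a_ij
    return A, B
```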
4. Experimental Results

4.1 Synthetic Data

We again use Ripley's synthetic data set. FCM alone does not perform well; however, clustering into four clusters and then combining them into two classes performs much better. Figures 3 and 4 show the proposed algorithm's clustering result with four clusters and the classification result obtained by combining the four related clusters into two classes based on the given 20 percent of labelled data. The result in Figure 4 is clearly much better than the two-class FCM result in Figure 2. The proposed algorithm is based on the objective function that brings together both labelled and unlabelled patterns: the unsupervised term accounts for the structure of the data, and the supervised term accounts for the classification information about the data.

Fig.3: The proposed algorithm clustering result.   Fig.4: The proposed algorithm classification result.   Fig.5: The rate of misclassification (in percent, "Error Rate in %") viewed as a function of γ_1.

Next we examine the impact of the level of supervision, reflected by the value of γ_1. From Eq.(1) we can observe that the result of the proposed algorithm is affected by the factor γ_1 of Eq.(1) and the factor γ_2 of Eq.(6). The scaling factor γ_1 balances the supervised and unsupervised data. The learning rate γ_2 usually lies in [0, 0.1]; low values generally allow smooth optimization of Eq.(6). We use γ_2 = 0.06 in the experiment, and the labelled rate of the predefined data set is 20 percent. Figure 5 shows the relationship between γ_1 and the misclassification rate in percent. For 20 percent labelled data, once γ_1 > 1, different values of γ_1 make little difference to the classification result.

4.2 Benchmark Data

For additional tests we use three well-known benchmark real-data problems: crabs, wine, and wdbc. The crabs data set comes from the book "Pattern Recognition and Neural Networks" [16]; the wine and wdbc data sets come from the UCI Machine Learning Repository [17]. We test our algorithm on these data sets with different characteristics (Table 3).

Table 3: Real World Data Sets

  Name    Full Name                             N     D    K
  crabs   Leptograpsus crabs                    200   5    2
  wine    Wine recognition                      178   13   2
  wdbc    Wisconsin diagnostic breast cancer    569   30   2

The Leptograpsus crabs data set (crabs) concerns determining the sex of crabs from geometric measurements. The wine recognition data set (wine) contains results of the chemical analysis of wines grown from different cultivars; the goal is to predict the type of a wine from its chemical composition. The Wisconsin diagnostic breast cancer data set (wdbc) is used to obtain a binary diagnosis (benign or malignant) from features extracted from cell nuclei in an image. Since these data sets were collected for supervised classification, the class labels are only partially used in our experiment and serve as the ground truth for evaluating the clustering results.

As stated before, once γ_1 > 1, different values of γ_1 make little difference to the classification result, so we use γ_1 = 1 and γ_2 = 0.06 in the experiment. The error rates in percent and their standard deviations achieved by the proposed method on these data sets at several levels of labelled data are shown in Table 4. The results are obtained by averaging over 20 random partitions into labelled and unlabelled samples; the error rates are the mean error rates obtained when the results are compared with the ground-truth labels, and the numbers in parentheses are the corresponding standard deviations.

Table 4: The Error Rate Results of the Proposed Algorithm

  Name    Percent of labelled data    The proposed algorithm    Using all the features
  crabs   80                          14.32 (7.87)              15.53 (9.31)
          60                          22.93 (6.37)              23.77 (8.43)
          40                          30.24 (4.50)              31.56 (5.60)
          20                          36.67 (3.30)              38.61 (3.78)
  wine    80                           5.42 (1.24)              13.79 (0.92)
          60                          12.19 (4.02)              17.67 (3.12)
          40                          19.78 (4.12)              22.75 (4.06)
          20                          25.01 (2.97)              26.76 (2.68)
  wdbc    80                           2.26 (0.49)               4.27 (0.43)
          60                           4.44 (0.76)               6.09 (0.53)
          40                           6.62 (0.52)               7.79 (0.45)
          20                           8.79 (0.50)               9.57 (0.43)

From Table 4 we can see that the proposed algorithm reduces the error rate compared with using all the features on all of these data sets. Regarding feature selection, our algorithm selected three of the five variables in the crabs data set, a result similar to that reported in [7].
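Assuming the helper sketches given earlier (fcm, update_V, update_prototypes, update_alpha, update_memberships, cluster_class_matrices), a rough driver in the spirit of Table 2 and of the experimental setting (γ_1 = 1, γ_2 = 0.06, 20 percent labelled data) might look as follows. The two-Gaussian toy data stand in for Ripley's set, and fixed iteration counts replace the ε stopping tests; this is our reconstruction, not the authors' code.

```python
import numpy as np

# Hypothetical driver following Table 2; assumes the sketches defined above are in scope.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-0.5, 0.4], 0.15, (125, 2)),   # toy stand-in for Ripley-like data
               rng.normal([0.4, 0.6], 0.15, (125, 2))])
true_y = np.repeat([0, 1], 125)

C, K, gamma1, gamma2 = 4, 2, 1.0, 0.06
labelled = rng.random(len(X)) < 0.2                        # 20 percent labelled
Y = np.full((K, len(X)), 1.0 / K)                          # y_jn = 1/K for unlabelled points
Y[:, labelled] = np.eye(K)[:, true_y[labelled]]            # one-hot columns for labelled points
l = labelled.astype(float)

U, centers = fcm(X, C)                                     # Step 1: initialize with FCM
V_i = U.copy()                                             # Step 2
A, B = cluster_class_matrices(U, true_y, labelled, K)
alpha = np.full(X.shape[1], 1.0 / X.shape[1])

for _ in range(50):                                        # Step 3: optimize Q(V)
    V_i = update_V(V_i, B, Y, l, gamma2)
for _ in range(50):                                        # Step 4: optimize J(U, C)
    centers = update_prototypes(X, U, V_i, alpha, gamma1)
    alpha = update_alpha(X, U, V_i, centers, gamma1)
    U = update_memberships(X, V_i, centers, alpha, gamma1)
A, B = cluster_class_matrices(U, true_y, labelled, K)      # Step 5: final cluster-class map
pred = B.argmax(axis=1)[U.argmax(axis=0)]                  # map each point's cluster to a class
print("training error:", np.mean(pred != true_y))
```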
5. Conclusions

In this study we have discussed the design of semisupervised clustering that handles both labelled and unlabelled data in the process of constructing pattern classification systems. We have addressed the addition of a feature-weight term to clustering and the classification of unlabelled data via clustering. The analysis of FCM with an unsuitable number of clusters shows that semisupervised clustering utilizing both labelled and unlabelled data offers improvements in cases where natural clusters are present in the considered problem.

In this research we combine feature selection based on feature weights with semisupervised clustering based on FCM, addressing the joint task of classification and feature selection through a generalized form of FCM. The semisupervised clustering takes both the structural and the classification information of the data into account. It should be noted, however, that if only a very limited amount of labelled data is available, the results may degrade, since the performance then depends heavily on the very few labelled samples.

Acknowledgments

This work was supported by the 2007 Cultivation Fund of the Key Scientific and Technical Innovation Project of the Ministry of Education of China.

References

1. Bensaid AM, Hall LO, Bezdek JC, Clarke LP (1996) Partially supervised clustering for image segmentation. Pattern Recognition 29:859-871
2. Pedrycz W (1985) Algorithms of fuzzy clustering with partial supervision. Pattern Recognition Letters 3:13-20
3. Pedrycz W, Waletzky J (1997) Fuzzy clustering with partial supervision. IEEE Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 27:787-795
4. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum, New York
5. Bouchachia A, Pedrycz W (2006) Enhancement of fuzzy clustering by mechanisms of partial supervision. Fuzzy Sets and Systems 157:1733-1759
6. Pedrycz W, Vukovich G (2004) Fuzzy clustering with supervision. Pattern Recognition 37:1339-1349
7. Krishnapuram B, Hartemink AJ, Carin L, Figueiredo MAT (2004) A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence 26:1105-1111
8. Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. Proceedings of the Conference on Computational Learning Theory 92-100
9. Miller DJ, Uyar H (1997) A mixture of experts classifier with learning based on both labelled and unlabelled data. Proceedings of the Conference on Neural Information Processing Systems 571-577
10. Nigam K, McCallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning 1-34
11. Shashahani B, Landgrebe D (1994) The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing 32:1087-1095
12. Castelli V, Cover T (1995) On the exponential value of labeled samples. Pattern Recognition Letters 16:105-111
13. Cohen I, Cozman FG, Sebe N, Cirelo MC, Huang TS (2004) Semisupervised learning of classifiers: theory, algorithms, and their application to human-computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence 26:1553-1567
14. Cozman FG, Cohen I (2001) Unlabeled data can degrade classification performance of generative classifiers. Hewlett Packard Technical Report 1-16
15. Inoue M, Ueda N (2001) HMMs for both labeled and unlabeled time series data. Proceedings of the IEEE Workshop on Neural Networks for Signal Processing 93-102
16. Ripley B (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge, U.K.
17. UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn
18. Blum A, Langley P (1997) Selection of relevant features and examples in machine learning. Artificial Intelligence 97:245-271
19. Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence 19:153-157
20. Kohavi R, John G (1997) Wrappers for feature subset selection. Artificial Intelligence 97:273-324
21. Koller D, Sahami M (1996) Toward optimal feature selection. Proceedings of the 13th International Conference on Machine Learning 284-292
22. Huang J, Ng M, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 27:1-12
23. Bouchachia A, Pedrycz W (2003) A semisupervised clustering algorithm for data exploration. Proceedings of the Fuzzy Systems Association World Congress 328-337
