Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Iterative Cluster Analysis of Protein Interaction Data

Iterative Cluster Analysis of Protein Interaction Data Vol. 21 no. 3 2005, pages 364–378 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bti021 Iterative Cluster Analysis of Protein Interaction Data 1 2 2,∗ Vicente Arnau , Sergio Mars and Ignacio Marín 1 2 Departamento de Informática and Departamento de Genética, Universidad de Valencia, Burjassot 46100, Valencia, Spain Received on March 12, 2004; revised on September 7, 2004; accepted on September 7, 2004 Advance Access publication September 16, 2004 ABSTRACT protein–protein interaction data has created a similar need Motivation: Generation of fast tools of hierarchical clustering for methods to efficiently explore the complex graphs of to be applied when distances among elements of a set are con- interconnected proteins that often faithfully represent com- strained, causing frequent distance ties, as happens in protein plex metabolic functions of a cell (Schwikowski et al., 2000; interaction data. Drees et al., 2001; reviewed in Salwinski and Eisenberg, 2003; Results: We present in this work the program UVCLUSTER, Bader et al., 2003). that iteratively explores distance datasets using hierarch- Among the best-known and most powerful methods of ical clustering. Once the user selects a group of proteins, classification are those involving hierarchical clustering UVCLUSTER converts the set of primary distances among (reviewed in Everitt et al., 2001). Elements are progressi- them (i.e. the minimum number of steps, or interactions, vely classified into sets either by sequentially putting them required to connect two proteins) into secondary distances together in non-overlapping classes (agglomerative methods) that measure the strength of the connection between each or by progressively dividing the full set of elements into pair of proteins when the interactions for all the proteins in the smaller groups (divisive methods). Many different types of group are considered. We show that this novel strategy has hierarchical clustering methods exist that depend on how the advantages over conventional clustering methods to explore distances among elements are evaluated to build (or split) protein–protein interaction data. UVCLUSTER easily incorpor- the groups, and there is considerable interest in determin- ates the information of the largest available interaction data- ing when some of those methods perform better than others sets to generate comprehensive primary distance tables. The (e.g. Gibbons and Roth, 2002). Hierarchical clustering is one versatility, simplicity of use and high speed of UVCLUSTER of the most common methods of classification used in biology. on standard personal computers suggest that it can be a Among many others, common uses of hierarchical cluster- benchmark analytical tool for interactome data analysis. ing techniques are the classification of organisms of different Availability: The program is available upon request from populations or species according to quantitative similarities the authors, free for academic users. Additional information (numerical taxonomy), the ordering, according to sequence available at http://www.uv.es/∼genomica/UVCLUSTER similarity, of sets of genes or proteins and, more recently, the Contact: ignacio.marin@uv.es determination of sets of genes with similar profiles of expres- sion according to microarray-derived data (reviewed in Nei 1 INTRODUCTION and Kumar, 2000; Quackenbush, 2001; Felsenstein, 2004). In this study, we explore the use of hierarchical clustering for The extraction of relevant information from massive amounts the functional classification of proteins using protein–protein of biological data is becoming crucial in the post-genomic interaction data. Our approach has two steps. First, we meas- era. Current efforts are focused on the generation of tools ure the distance among any two proteins in a protein–protein able to allow the classification of large amounts of similar, interaction network according to the minimum number of correlated or interconnected elements and retrieve from that steps required to connect them, where each step is a known, classification useful patterns or regularities that can be later physical protein–protein interaction. Second, we use those explored either in the laboratory or in silico. Thus, in the distances to classify the proteins in groups using hierarchical last years we have witnessed an ever-increasing interest in clustering. This approach is thus in principle very similar to the development of classification tools for gene expression those used for other types of data and would seem quite easy data obtained from microarray analysis (Quackenbush, 2001; to implement. However, protein–protein interaction data have Gibbons and Roth, 2002). The recent generation of massive special features that make the use of hierarchical clustering methods particularly problematic. To whom correspondence should be addressed. 364 Bioinformatics vol. 21 issue 3 © Oxford University Press 2004; all rights reserved. Hierarchical clustering in proteomics An often overlooked aspect of hierarchical cluster analysis of 7.14. This distance should become smaller when more is the fact that it has serious intrinsic problems when used interactions are determined. Thus, recently Wilhelm et al. with datasets in which the distances among many elements (2003) described a S.cerevisiae dataset involving about 20 000 are identical (the ‘ties in proximity’ problem; Backeljau et al., interactions and with an average distance for the proteins of the 1996; Takezaki, 1998; MacCuish et al., 2001). Ties generate largest connected component equal to only 2.57. Similarly, in multiple mathematically equivalent solutions when hierarch- a recent analysis in which 15 210 interactions were considered ical clustering is performed. In favorable cases, when ties are [data obtained from the January 2004 release of the DIP data- very rare, we would anyway expect all alternative clustering base; Xenarios et al. (2002)], we found that all proteins that solutions to be very similar. The simplest solution in those can be connected have distances that range from 1 to 12. When cases is to solve the ties arbitrarily, for example, in ways that we assigned a conventional value of distance equal to 24 (i.e. are dependent on the order of data input. However, when ties twice the largest detected for any connected proteins) to all are frequent, the number of alternative solutions and the dif- those pairs of proteins that cannot be connected by any path, ferences among those alternatives increase. Backeljau et al. we found an average distance of 4.97 for 4721 S.cerevisiae (1996) summarized the effect of ties on the performance of proteins. This value is of course an overestimate, because in several commonly used computer programs, concluding that the future more/all proteins are expected to be found connec- none of them was able to correctly confront this problem. The ted, and with shorter distances. In any case, for DIP data, we relative importance of ties was further analyzed by Takezaki found that there are 4721 × 4720/2 = 11.1 millions of dis- (1998), which determined the likelihood of ties and their effect tances among proteins that are constrained to a short range of on bootstrap tests for some small simulated datasets, conclud- possible distance values, mostly between 1 and 12, so it can ing that ties have only occasionally a significant influence. be deduced that the S.cerevisiae interactome dataset includes However, these two works analyzed small distance matrices millions of identical distances. Very recent data for animal based on nucleotide sequences, in which ties are relatively species, such as Drosophila melanogaster or Caenorhabditis rare and often caused by the effect of rounding decimal num- elegans are more limited (i.e. the average number of inter- bers. Ties are much more frequent in other types of data, most actions per protein is lower than in S.cerevisiae). However, especially when distances are constrained to a narrow range the general pattern of small world properties with constrained of values. Most especially, protein–protein interaction data distance values is similarly observed (Giot et al., 2003; Li generate one of the datasets in which ties are most prominent. et al., 2004). Currently, the most complete interactome data available in The ties in proximity problem has led most authors inter- eukaryotes involves the protein interaction network of the ested in analyzing protein interaction data to focus their efforts yeast Saccharomyces cerevisiae. In addition to results derived on generating tools, which are based on graph theory and from a large number of small-scale, directed studies, massive designed to detect clusters of highly or similarly connec- non-directed protein–protein interaction data for S.cerevisiae ted proteins (e.g. Bader and Hogue, 2003; Bu et al., 2003; were generated either by using two-hybrid assays (Uetz et al., Goldberg and Roth, 2003; Spirin and Mirny, 2003; Gagneur 2000; Ito et al., 2001) or by pulling down protein complexes et al., 2004; Pereira-Leal et al., 2004; Przulj et al., 2004). using a tagged subunit as bait (Gavin et al., 2002; Ho et al., These approaches are very interesting because they may effi- 2002). Analysis of the large datasets generated using these ciently detect functionally relevant modules. However, they approaches demonstrated that the S.cerevisiae interactome has have the obvious limitation of not allowing a general view ‘small world’ properties (Wagner, 2001; see review by Albert of the relationships among proteins selected with criteria and Barabási, 2002). We can define the distance between different from their degree of connection. In this study, we two proteins as the minimum number of direct interactions describe a novel, alternative strategy of analysis that allows among proteins in the dataset that are required in order to to explore the characteristic type of complex data that suffers connect them (a parameter often called ‘minimal path length’ the ties in proximity problem using hierarchical clustering. or simply ‘path length’; see Barabási and Oltvai, 2004). For In a significant precedent to our work, Rives and Galitski a ‘small world’ interaction dataset, it is found that many dis- (2003) circumvented the problem by using not the distances tances are identical because most proteins are either directly among elements but the correlation coefficient obtained when connected (i.e. they are part of the same protein complex or the distances of two proteins are measured against all the interact in two-hybrid assays; distance = 1) or separated by members in a database (see also Prinz et al., 2004 for a just a few steps, each one involving a physical interaction related procedure). Our method is different, being based on between two proteins or protein complexes. That interactome obtaining multiple, equally valid solutions and evaluating the data have small world properties was already anticipated distances among elements according to those multiple solu- using a small fraction of the information currently available tions. This strategy is implemented in a program that we have (just 899 interactions) by Wagner (2001), which described called UVCLUSTER, which is extremely simple to use and in the S.cerevisiae protein network a group of 466 pro- fast enough to analyze large datasets involving hundreds of teins connected by an average distance (average path length) proteins on standard PC computers. UVCLUSTER allows the 365 V.Arnau et al. determination of the relative degree of proximity of groups of proteins that can be defined by the user in several ways, generating outputs that can be easily explored or converted into dendrograms. We demonstrate that UVCLUSTER-based analyses have obvious advantages over conventional hierarch- ical clustering and that UVCLUSTER can be used to explore interactome data in order to detect new proteins relevant to any chosen biological process. 2 SYSTEM AND METHODS The UVCLUSTER program (available upon request from the authors, free for academic users) is written in C and its versions have been compiled and tested on Windows and Linux operating systems. The flow chart of the program is shown in Figure 1. UVCLUSTER analyses begin by import- ing a text file containing a dataset of direct protein–protein interactions. This dataset can be then filtered in two different ways. First, the user can create a list of proteins and then select one of two options: (1) use only those interactions between pairs of proteins in the list; or (2) exclude all interactions involving any protein in the list. After this first filter, a table with the selected group of interactions can be saved for further analyses. Then the user can apply a second filter by selecting a cutoff value for the maximum/minimum number of interac- tions that a protein may have to be included in the analysis. This second filter allows to exclude proteins with a number of interactions lower than a particular threshold (i.e. poorly connected proteins) or higher than the cutoff value (i.e. it eliminates ‘hubs’, highly connected proteins). After these two filters, the UVCLUSTER program generates with the remain- ing proteins a matrix of primary distances (d ; equivalent to their shortest path or minimal path length) among them. If two proteins cannot be connected directly or indirectly, we have followed the convention of assigning a distance that is twice the longest distance among connected proteins in the dataset Fig. 1. UVCLUSTER flow chart. (modifying this convention does not alter any of the results that we will present below). The tables of primary distances gener- ated by UVCLUSTER can be saved for further analyses. Thus, The second alternative involves providing the program with in other rounds of analysis, the user may decide whether to a list of proteins. Then, the program generates a subtable of import a new file to generate another primary distance table or primary distances, but only including the proteins in the list to directly use a table that is already available. This is useful, for which interaction data are available. because a primary distance table obtained from all the interac- Once the subtable of primary distances has been generated tion data that exist for a species is generated that can be used by any of these two methods, UVCLUSTER iteratively uses many times in different analyses. agglomerative hierarchical clustering on the primary distances Once a primary distance table has been generated or simply dataset to generate N alternative, equally valid, clustering loaded, UVCLUSTER allows the user to further select groups solutions. The value of N must be chosen by the user before of proteins to be analyzed, again in two different ways. A first starting the analysis after considering the speed of the com- option is to choose a single protein (‘network center’) while puter, number of proteins to analyze and the precision of establishing a cutoff distance value. In this case, the program analysis required (see below for a discussion). The solutions explores the primary distance table, by selecting all the pro- are generated in three steps: (1) random sampling of the ele- teins with primary distances from the selected one that are ments of the dataset; (2) elements in the dataset are clustered equal or lower than the cutoff value and generates a subtable according to group average linkage (Everitt et al., 2001); and containing only those proteins together with their distances. (3) the agglomerative process is finished at a certain point, 366 Hierarchical clustering in proteomics defined by the user that must select the value for a global exploration of the results, especially for those users that are ‘stopping rule’ parameter, the Affinity Coefficient (AC). AC is not familiar with MEGA or other similar packages. Finally, defined as follows: the third UVCLUSTER output file contains a graphical rep- resentation of the data in PGM (Portable GreyMap) format. To AC = 100[(P − C )/(P − 1)], (1) m m m generate the PGM file, proteins are ordered according to the where C (cluster mean) is the average of the distances for results described in the second UVCLUSTER output file. To all elements included in the clusters and P (partition mean) facilitate the analysis of the PGM figure, a fourth output file is is the average value of distances for the whole set of selected generated that contains a list of names that corresponds to the proteins. If AC = 100, then C = 1, meaning that only pro- order in which proteins are shown in that figure. The PGM rep- teins with distance equal to 1 can be clustered together. Using resentation (see Arnau and Marín, 2003) is a square formed AC < 100 relaxes the conditions, allowing that proteins sep- by K smaller color-coded squares, where K is the num- arated by distances higher than 1 be put together in the same ber of proteins analyzed. Shades of gray indicate the degree cluster. It therefore favors the generation of larger clusters, of interaction between each pair of proteins, with light gray although including proteins often indirectly connected. We corresponding to close proteins and black to distantly con- generally use values of AC ranging from 50 (highly relaxed) nected proteins. PGM format files can be read using freeware to 100 (maximally strict) in our analyses. If many proteins are programs such as IrfanView 3.85 (www.irfanview.com). directly connected, it may be useful to increase the AC value, while if distances among proteins are large, AC values may be 3 ALGORITHM decreased to facilitate the detection of potentially interesting The dataset of direct protein–protein interactions must be clusters. written as a text file, in such a way that each line of the Once the dataset of N alternative solutions is obtained, file contains the names of two interacting proteins separ- UVCLUSTER evaluates in how many of them each pair of ated by a tab. This is the format most commonly used elements appear together in the same cluster and generates to express protein–protein interaction data in public data- an output file containing a table with secondary distances (d ) bases. Therefore, massive amounts of data can be very among the elements. The secondary distance between two ele- quickly imported into UVCLUSTER. Primary distances are ments is defined as the number of solutions in which those two calculated from this file of direct interactions using Floyd’s elements do not appear together in the same cluster divided algorithm (Floyd, 1962). by the total number of solutions (N ). Therefore, iterative res- The main algorithm of the program, that characterizes the ampling of the primary distance data allows to establish how iterative clustering method, can be described as follows: likely it is for each pair of elements to be clustered together when many alternative, equally good, clustering solutions are generated. This is the key of the strategy implemented Import table with d values in UVCLUSTER. Secondary distances establish the strength Select N , AC values of the connection between two elements, relative to all the Repeat_from k = 1 elements in the analyzed dataset [an idea first developed Random ordering of elements; in Arnau and Marín (2003), although using a less appropriate Agglomerative hierarchical clustering using group clustering method]. Ties in secondary distances will be very average linkage (d , AC); rare, due to the complex nature of the connections among ele- Increment counters according to the solution found; ments. This means that the secondary distance dataset may k = k + 1; then be analyzed by conventional (i.e. non-iterative) methods To k = N to obtain a rigorous classification of the elements. Generate d values corrected by N To facilitate the exploration of the results, UVCLUSTER Export files containing: (a) tables of d , d values; generates four output files. The first one contains the tables of (b) UPGMA clustering with d values and primary and secondary distances among the chosen elements (c) graphical output in PGM format. plus the values of several significant parameters used in the analyses, such as AC, C and P . This first file also contains a m m table of secondary distances suitable to be copied to a text file and directly imported into MEGA 2.1 (Kumar et al., 2001). In 4 IMPLEMENTATION MEGA, the secondary distance data can be used to generate 4.1 Speed and performance dendrograms, using conventional methods of clustering such as UPGMA or Neighbor-joining. In the second UVCLUSTER No matter how complex or extensive the raw interaction data output file, the results of an agglomerative hierarchical clus- are, they can be quickly converted into primary distance tering using UPGMA performed with the secondary distance data by UVCLUSTER and then saved for later analyses. In data are detailed. This file may be useful for a preliminary addition, the speed of generation of primary distance tables 367 V.Arnau et al. allows us to easily update any comprehensive species-specific the solution generated will necessarily fail to detect the two table, every time a new information is available. For example, natural clusters. Figure 2C shows a UPGMA tree obtained the S.cerevisiae data available in the January 2004 release of using the secondary distances generated by UVCLUSTER the DIP database (4721 proteins, 15 210 interactions) can be with N = 10 000 and AC = 100. In this case, distances converted into a primary distance table in about 14 min on a among units 1–3 or 9–11 are equal to zero. It can be seen that standard PC computer (Intel Pentium IV 2.8 GHz processor Figure 2C closely corresponds to Figure 2A: the two clusters with 512 MB RAM memory). appear as discrete entities with units 4 and 8 being the ones Once a table with primary distances has been generated, most closely connected to the rest. secondary distance data are also obtained very quickly. Time It is interesting to develop an index to summarize the validity depends on the selected AC value. Smaller AC values that pro- of the clusters detected, in order to be able to compare differ- long the clustering process, require longer time. Therefore, to ent cluster results. Such an index would help in determining establish the speed of UVCLUSTER analyses, we have per- that the solution shown in Figure 2C is better than that shown formed tests with AC values ranging from 50 to 100. First, in Figure 2B. Many rules have been proposed to evaluate the by using a set of 34 elements and 561 primary distances different partitions of a dataset into clusters [see review by (see Section 4.3), and running on the same standard com- Gordon, 1999, pp. 60–65 and 185–204). However, we think puter detailed above, UVCLUSTER obtained the values of that none of these methods is fully convenient when applied secondary distances with a number of iterations, N = 10 000 to proteomic interaction data. They all use the whole distance in less than 2 s. Time increased linearly with the number of dataset for cluster validation (i.e. they take into account all iterations, so a similar analysis with N = 100 000 required distances, irrespectively of them implying direct interactions 9(AC = 100)to13(AC = 50) s. Similarly, for a set of or simply shortest paths among distant elements). Evaluation 150 randomly chosen elements (11 175 primary distance val- of the whole distance dataset is essential for clustering, but ues), an analysis with N = 10 000 took from 9 (AC = 100) validation of the clusters should take into account two fea- to 125 (AC = 50) s. Finally, the largest dataset that we tures of this particular type of information, which demonstrate have used composed of 500 randomly chosen proteins (i.e. that a distance equal to one is qualitatively different from the 124 750 distances), which was analyzed (N = 10 000), in 23 other distances. First, distances equal to one imply a strong to 160 min, again depending on the AC value used. functional link: two proteins directly interact in the cell, and UVCLUSTER results can be directly explored, considering therefore must be at the same time in the same place and the files that the program automatically generates containing often functioning together. On the contrary, when larger than either a UPGMA cluster analysis of secondary distances or a one, two identical distances may be radically different from graphical PGM representation. However, for advanced users, a cellular point of view. For example, a distance of two may and especially for analyses including up to a few hundreds mean that two proteins are part of a complex together with a of proteins, we think that the simplest and most efficient way third protein that physically interacts with both of them, or, to explore the results involves the generation of dendrograms alternatively, may mean that in different moments of the cell based on secondary distances. Ties in secondary distances are cycle (or in different tissues, if we refer to a multicellular very rare, facilitating the generation of non-ambiguous trees. organism), there are transient interactions of both proteins For this reason, UVCLUSTER also generates outputs com- with a third one. In the first case, all three proteins may patible with MEGA 2.1 (Kumar et al., 2001). This is one of function as a single unit, while in the second, the cellular the most used program packages in the molecular evolution roles of the two proteins may be radically different. A second and phylogenetics fields and includes several standard meth- consideration is that the protein interaction datasets are incom- ods to build dendrograms. Moreover, it can be obtained free plete. As we already commented above for S.cerevisiae, each from the authors (www.megasoftware.net). new experiment diminishes the distance among proteins. In fact, comparisons among the experiments performed so far 4.2 UVCLUSTER analysis of synthetic graphs in S.cerevisiae found a low degree of congruence (Bader and A simple model (Figure 2) demonstrates the advantages of the Hogue, 2002; von Mering et al., 2002), suggesting that we iterative strategy implemented in UVCLUSTER. Figure 2A are still quite far away from having a complete dataset of shows a graph with 11 elements connected by a total of 16 the protein interactions for any species. This means that it is interactions. Two clusters (units 1–4 and 8–11) are obvious. reasonable to expect that, in the future, many more direct inter- Figure 2B shows a UPGMA tree obtained using primary dis- actions will be found. In this sense, distances larger than one tances. Input order to obtain this tree followed the numeric may be considered to be overestimates of the true distances. value (1, 2, ... ,11) assigned to the elements. The solution All these considerations lead us to suggest that cluster shown in Figure 2B clearly fails to detect the two clusters, results from proteomic interaction data may be conveniently closely connecting units 4 and 5. This type of error is caused evaluated considering just the number of intracluster and inter- by the ties. If ties are solved in such a way that, by chance, cluster direct interactions. If we have a dataset in which K units 4 and 5 (or, alternatively, 7 and 8) become clustered, elements are connected by direct interactions, thus forming 368 Hierarchical clustering in proteomics Fig. 2. A synthetic example. (A) Graph of 11 elements connected by 16 interactions. Two clusters (units 1–4 and 8–11) are observed. (B) UPGMA-based tree using primary distances derived from the graph in (A). Notice the lack of correspondence with the graph. (C) UPGMA- based tree using the secondary distances obtained using UVCLUSTER with AC = 100 and 10 000 iterations. The topology very closely corresponds to the graph shown in (A). an undirected graph, we can define the following parameters. find a number M that corresponds to the maximum possible First, F will be the maximum possible number of direct inter- number of intracluster direct interactions: actions among elements [F = K(K − 1)/2]. Let us also call n the number of actual direct interactions discovered among M = k (k − 1)/2, (2) i i the K elements. For a particular partition into clusters, we can i−1 369 V.Arnau et al. where c is the number of clusters in the partition and k is the This particular set of proteins was selected for two main number of elements in each cluster. Finally, we will call p the reasons. First, it comprises a highly explored (i.e. most total number of direct intracluster interactions actually known. likely without false positive interactions) and very com- We suggest to evaluate the partitions found in proteomic data pact set of elements, in which obvious clusters or groups by selecting as best the one that minimizes the cumulative cannot be detected. This is shown in Figure 3, obtained hypergeometric distribution that follows: using PIVOT (Orlev et al., 2004), a program that provides a minimal-energy-based layout that highlights clustering M F −M Min (M , n) among units. However, whether clusters are present in j n−j . (3) Figure 3 is unclear. This group of proteins may thus be j =p n a good test to determine whether standard methods of clustering may detect any particular organization in the This is equivalent to select the partition that generates a dis- data, and how UVCLUSTER compares with these standard tribution that maximizes the proportion of intracluster direct procedures. Second, because it contains a few elements interactions while minimizing the proportion of intercluster that stand out as participating in processes that are related direct interactions. (through connections with actin function) but not part of We can apply this rule to the trees shown in Figure 2, the two main processes in which most elements particip- by using their topology in order to establish partitions with ate (actin patch assembly and patch-mediated endocytosis), a progressive number of clusters (c ≥ 2) (see Levenstien we can test whether we can discriminate these groups using et al., 2003 for a similar approach). Circles in the dichotomic UVCLUSTER. nodes have been numbered in Figure 2B and C accord- Figure 4 shows the results obtained with two different meth- ing to the number of clusters generated when we progress ods. First, we show in Figure 4A the tree generated by MEGA from left (largest distances) to right (smallest distances). We 2.1 that was obtained with UPGMA and the primary dis- found that the tree in Figure 2C has a minimum value of tances calculated from the graph shown in Figure 3. This Equation (3) in the particular case when c = 3. In that first option therefore corresponds to the analysis that can be case, we find that out of 16 total direct interactions, 14 are typically performed using standard clustering tools. Second, found within clusters and only 2 (units 4 and 5; units 7 and we show the tree obtained when the UPGMA algorithm 8) are found between clusters. According to the cumulative is applied to the set of secondary distances obtained using hypergeometric distribution described in Equation (3), the UVCLUSTER, with N = 10 000 and AC = 100 (Figure 4B). likelihood of finding by chance a distribution at least as asym- Both trees are quite different. As was obtained in the syn- −9 metric as the one defined by such partition is 3.17×10 . None thetic example shown above, the first tree is just one of the of the partitions obtained from the tree in Figure 2B has such many possible solutions that can be obtained by applying hier- −6 a small value (lowest value: 3.30 × 10 ). We can conclude, archical clustering to the primary distance dataset. When we as intuitively was evident, that UVCLUSTER analysis using evaluated the topologies shown in Figure 4A and B using secondary distances has provided a partition that is superior the hypergeometric-based parameters described above, we to those obtained by using the tree topology generated from found a partition (with c = 12) in the tree obtained using −27 the direct analysis of the primary distances. UVCLUSTER that is much more extreme (p = 1.79× 10 ) than the best value obtained from the direct analysis of the 4.3 UVCLUSTER exploration of the actin −23 primary distances (c = 10; p value = 1.80 × 10 ). This cytoskeleton of Saccharomyces cerevisiae optimal partition is shown also in Figure 4B. Because it To demonstrate the advantages that our program provides is known that one important problem with protein–protein when applied to real biological data, we first chose as a interaction data is the presence of false positives, we tested model a set of 34 S.cerevisiae proteins characterized by Drees whether the partition shown in Figure 4B is robust by ran- et al. (2001), which they described as 26 proteins participat- domly generating false protein–protein interactions that were ing in actin patch assembly and patch-mediated endocytosis added to the DIP database. Even when 7605 false interac- together with 8 proteins involved in other related processes, tions were added (making the randomly generated interactions such as cytokinesis (BNI1, BNR1), endocytosis (SVL3), con- 33% of all present in the modified database), the clusters trol of the morphogenesis checkpoint (SWE1, HSL7) or the were identically recovered when N = 10 000, AC = 100. CDC42 signaling pathway (CDC42, CLA4, GIC2) (Drees Only when 15 210 false interactions were added (50% of et al., 2001). Results shown in that study were updated all interactions are then spurious) we found a variation: by analyzing the DIP database, which we found contained clusters 2 and 7 appeared together. These results suggest all the information available in the literature. The graph of that, when provided a significant core of true interactions, protein–protein interactions shown in Figure 3 summarizes UVCLUSTER analyses may provide robust topologies even all currently (February 2004) available information for that in the presence of a substantial number of false positive set of proteins. interactions. 370 Hierarchical clustering in proteomics Fig. 3. Set of proteins involved in actin patch assembly and endocytosis according to Drees et al. (2001). Lines indicate direct interactions among them as found in the DIP database (January 2004 release). The figure was drawn using PIVOT (Orlev et al., 2004). We also wanted to check whether the partition obtained of proteins are involved according to Gene Ontology (GO) was biologically relevant. First, we considered the informa- terms. For the whole set of proteins, we found, as expected, tion provided by Drees et al. (2001). The proteins involved that the general GO process term defined as ‘actin cytoskel- (according to these authors) in actin patch assembly and eton organization and biogenesis’ included 20 out of the 34 function form clusters 1–5, 7, 11 and 12. Cluster 6 corres- proteins analyzed, with a probability of those proteins being −29 ponds to the three proteins described as involved in CDC42 together by chance equal to 1.03 × 10 . For the particular signaling. Cluster 8 includes the two proteins described by clusters including two or more proteins (because a limita- Drees et al. (2001) as involved in the morphogenesis check- tion of the program is that requires at least two proteins in point together with a third protein (APP1, called YNL094w a cluster, thus eliminating clusters 3, 4 and 12 in Figure 4B) in the study by Drees et al.) whose relationship with that we found the following results (considering only the GO terms process has not been determined. Cluster 9 contains the that gave the lowest probabilities of the cluster occurring endocytosis protein SVL3 plus a protein involved in actin by chance): clusters 1 and 7—actin cytoskeleton organiza- −11 polymerization and crosslinking to microtubules (CRN1). tion and biogenesis [p(cluster 1) = 2.65 × 10 ; p(cluster −6 Finally, cluster 10 contains the two proteins involved in 7) = 1.95 × 10 ]; cluster 6—Rho protein signal transduc- cytokinesis (BNI1, BNR1) plus PFY1, which plays sev- tion (equivalent to the definition by Drees et al. of ‘involved −8 eral roles in actin organization and polymerization. These to CDC42 signaling’; p = 1.51 × 10 ); cluster 8—G2/M ‘hybrid’ clusters were expected because the actin patch transition of mitotic cell cycle (again related to the assigna- proteins that closely connect with proteins involved in other tion of this group of proteins to the morphogenesis checkpoint −5 processes may contribute to linking these functions together by Drees et al.; p = 5.78 × 10 ); cluster 9—cell growth in the cell (Drees et al., 2001). As a second biological val- and/or maintenance (p = 0.091); cluster 10—response to −7 idation of the clusters, we performed searches using the osmotic stress (p = 5.06 × 10 ) and cluster 11—actin fil- −7 SGD Gene Ontology Term Finder (http://db.yeastgenome.org/ ament depolymerization (p = 7.55 × 10 ). Only clusters 2 cgi-bin/SGD/GO/goTermFinder) that allows, among other and 5 did not generate significant assignations according to features, to determine the most likely process in which a group GO terms. These results are interesting in two different ways. 371 V.Arnau et al. Fig. 4. Comparison of analyses using primary versus secondary distances from the dataset described in Figure 3. (A) UPGMA-based tree using the primary distances derived from Figure 3 data. Names in bold indicate deviations with respect to the tree in Figure 4B. (B) UPGMA- based tree using the secondary distances estimated using UVCLUSTER (AC = 100; N = 10 000). The optimal solution, according to the hypergeometric distribution of direct interactions, consists of 12 clusters, that are detailed on the right side. First, they demonstrate that proteins in most clusters can be may be significantly involved in actin patch assembly and significantly associated to particular cellular processes, thus function that did not appear in the description by Drees et al. confirming the biological significance of our results. Second, (2001). In order to do so, we calculated the average primary the fact that the processes assigned are generally different distance of each of the 4721 proteins in the DIP database for each cluster suggests that these clusters are composed against the 26 proteins that, according to Drees et al. (2001) of proteins involved in closely related but still different pro- were involved in these processes (the other 8, as we already cesses in vivo and that GO terms are precise enough so as to described above, often showed protein–protein interactions discriminate among those processes. with these 26, but functionally were less related). We found The possibility of including all available data for a par- that 19 of those 26 proteins were among the 40 proteins with ticular species using UVCLUSTER analyses allowed us to lowest average distances in the whole dataset (average dis- ask the question of whether there are other proteins that tance values ranging from 1.31 to 2.23) and even the worst 372 Hierarchical clustering in proteomics connected protein (TRM5; average distance with the rest suggesting members that could be missed when using other equal to 2.65) was in position 199 in our list. These results approaches. confirm the results obtained by Drees et al. (2001) using the 4.4 UVCLUSTER global analysis correlating gene updated information currently available: all the proteins con- expression and protein–protein interaction sidered by those authors indeed are in close proximity in the S.cerevisiae interactome graph. However, we concluded To demonstrate that UVCLUSTER can be used at a gen- that there are other proteins that are similarly or even better omic scale, we decided to compare UVCLUSTER-generated connected. Figure 5 contains a new UPGMA tree obtained results based on protein–protein interaction data with those using also the whole DIP interaction data, with AC = 100 derived from gene coexpression data obtained using microar- and N = 10 000, but including 38 other proteins poten- rays. There is good evidence, at least in yeast, that the products tially involved in actin patch assembly and function. These of highly coexpressed genes interact with a probability much 38 proteins have average distances lower than 2.27 to the higher than expected by chance (e.g. Kemmeren et al., 2002). 26 proteins in the original dataset. The close proximity to Thus, we can expect UVCLUSTER to recover, at least in part, actin and interspersion with the proteins of the original data- clusters of coexpressed genes using protein–protein interac- set of many of those newly added proteins can be easily tion data. To analyze coexpression, we used as a starting appreciated in Figure 5. It is also significant that duplicat- database the yeast ‘refined modules’ defined by (Bergmann ing the number of proteins only alters the relative position et al., 2004). These modules are groups of genes that show a of a few proteins of the original dataset. If we compare significant level of coexpression and that were obtained start- Figures 4B and 5, we can see that only YPR171W, SLA1, ing with a core of evolutionary conserved, coexpressed genes YHR133C, RVS167 and RVS161 appear in very different to which other genes with similar expression patterns were positions in both the trees. This result can be interpreted as a added Bergmann et al. (2004). These authors described eight new confirmation that the delimited clusters were quite reli- of those modules, including a total of 548 genes, as detailed in able. Adding more data in general only contributes to make Table 1. We decided to check to which extent UVCLUSTER the groups that were already observed larger. Similar results analyses may unearth those same modules using the available were obtained using AC = 50 (not shown). As a secondary protein–protein interaction data. We therefore took the list of biological validation of our results, we have also included 548 genes (that was reduced to 543 after eliminating 5 genes in Figure 5 some additional data. First, we add informa- that appeared in two or more modules) and analyzed whether tion for protein localization from Huh et al. (2003) and other the DIP database contains interaction data for their products. sources, as compiled in the MIPS database (Mewes et al., We found that the products of 376 of those genes indeed were 2002; http://mips.gsf.de/). Fifty-six proteins are described found in DIP. However, only 163 were involved in more than in MIPS as being localized to the actin cytoskeleton. Of one protein–protein interaction. These results demonstrate them, 24 are found in Figure 5, being 16 part of the ori- that the amount of information for the protein products of ginal data from Drees et al. (2001) while the other 8 are this set of genes is quite limited. We then used UVCLUSTER among the ones we have secondarily added to our tree using (AC = 100, N = 10 000) to determine clusters of proteins UVCLUSTER analyses (see details in Figure 5). Moreover, based on interactome data. Results are shown in Figure 6. another five UVCLUSTER-suggested proteins (LSB3, LSB5, A total of 162 proteins formed several well-defined clusters BEM1, ARP2, END3) are localized according to MIPS to the while the other 214 appeared isolated, due to the fact that they yeast cytoskeleton in a broader class that may also imply inter- did not present direct interactions with any other proteins in action with actin. Second, we include also information about the dataset (these last ones have been grouped together in a the GO term ‘actin cytoskeleton organization and biogenesis’ single branch in Figure 6). As shown in the figure, several of that, as we described above, includes most of the proteins in the clusters closely follow the results obtained by Bergmann our original dataset. The SGD Gene Ontology Term Finder et al. (2004). Thus, a cluster contains 56 rRNA processing assigns to that GO term the lowest probability among all proteins plus 14 ‘contaminant’ proteins involved in other −34 terms (p = 4.3 × 10 ) when all the proteins in Figure 5 processes. Another has 29 proteasome proteins with a single are included. Moreover, all other terms with low probability ‘contaminant’, a third one includes 17 MRP proteins with are related to it. In total, there are 28 proteins in Figure 5 two ‘contaminants’, etc. We became aware when considering that are included in the ‘actin cytoskeleton organization and these results that the information in DIP is indeed quite frag- biogenesis’ GO term. Of them, 8 are from the secondary data- mentary, which explains why some very characteristic clusters set obtained using UVCLUSTER. These results demonstrate are missing in Figure 6. For example, DIP does not con- that a significant fraction of the proteins suggested by the tain interaction data corresponding to cytoplasmic ribosomes UVCLUSTER study of the interactome are related to actin, (explaining why a large ribosomal cluster does not appear), or at least cytoskeleton, function. Thus, we can conclude that or to the small subunit of mitochondrial ribosomes (there- UVCLUSTER-based analyses may significantly contribute to fore, the 17 MRP proteins that appear together are all part delineate groups of closely integrated proteins in the cell, of the large subunit). Several of the modules (e.g. glycolysis, 373 V.Arnau et al. Fig. 5. Results of the analysis to discover new actin patch assembly and endocytosis proteins. In bold, the new proteins discovered. Italics refer to the eight proteins that Drees et al. (2001) considered as belonging to processes other than patch assembly and patch-mediated endocytosis. (a): Proteins that are localized to the actin cytoskeleton according to Huh et al. (2003). (b): Proteins assigned to the GO process ‘actin cytoskeleton organization and biogenesis’ according to the SGD database (see http://www.yeastgenome.org/GOContents.shtml). 374 Hierarchical clustering in proteomics Table 1. UVCLUSTER results using the refined modules by Bergmann et al. involved in a particular process, when provided with some (2004) preliminary information. For example, if we know that a group of proteins are acting on a relevant process, we can use them Module No. of No. of Size of main as ‘seeds’ to delimit, using UVCLUSTER, all the proteins genes in proteins in cluster defined that are closely connected to them in the interactome. There- module clusters by UVCLUSTER fore, we can obtain a clearer picture of the functions of the set defined by (% of total of interesting proteins (see Figure 5 and the related analyses UVCLUSTER genes in explained above). This approach can be seen as complement- module) ary to the one based on the detection of highly connected graphs that include a protein of interest developed by Bader rRNA processing 192 74 56 (29.2) Ribosomal protein 153 20 5 (3.3) and Hogue (2003; ‘directed mode’). Third, UVCLUSTER Mitochondrial ribosomal 76 19 17 (22.3) can be also used to establish groups of connected proteins protein even when some information is unavailable, by being able to Proteasome 40 31 29 (72.5) predict potential interactions. This can be done by using AC Peroxide 29 2 0 (0.0) values lower than 100, that is, allowing proteins that do not dir- Secreted protein 28 8 4 (14.3) Glycolysis 15 2 0 (0.0) ectly interact, but that are still quite close in the interactome, Heat shock 15 6 5 (33.3) to be clustered together. We have observed that diminishing the AC value has the effect of generating topologies that are worse than the one obtained with AC = 100 when eval- uated with the hypergeometric distribution-based parameter described above. This is due to the effect that the increased peroxide) are moreover formed in part by proteins that are permissiveness of the clustering process has on the secondary not expected to physically interact. Those results explain why distances, often allowing directly connected proteins to appear only some of the modules defined by Bergmann et al. (2004) in different clusters. However, this relaxation of the conditions actually are found in UVCLUSTER analyses (see Figure 6 may be of great interest if we assume that data are incomplete and Table 1 for the details). In any case, we can conclude that and some distances may be actually lower than those available. UVCLUSTER can be used for analysis involving hundreds of In those cases, relaxed analyses may suggest potential clusters proteins and, even in those complex cases, it recovers interest- of interacting proteins that would be missed by the strictest ing functional information based on interactome data. On the ones. It is significant that UVCLUSTER can also be used in other hand, the analysis just shown casts doubts on whether the opposite direction, that is, in order to detect false positive interaction data alone will ever provide a detailed picture of interactions. For example, promiscuous proteins, which are cell function. The relative incompleteness of the results we able to interact with many different partners, can be easily have just described can be conservatively explained suggest- detected using UVCLUSTER (with AC = 100) when they ing that the available information is very fragmentary and will appear as being equally distant to many other, totally unre- significantly improve in the future. However, it could also be lated, proteins (Arnau and Marín, 2003). This may be another explained by some/many highly integrated cellular processes, significant application of our program, because, as we com- which do not require direct protein–protein interactions. This mented above, the number of false positives is considered to would hinder the processes to be identified, no matter how be high for interaction data generated by non-directed, large- exhaustive the interactome data are. scale experiments (e.g. von Mering et al., 2002). As a final potential application, UVCLUSTER can be used to quickly 5 DISCUSSION explore whether proteins encoded by orthologous genes retain UVCLUSTER is a flexible tool for global exploration of pro- related functions in two species by comparing the relative tein function using interactome data that is based on iterative positions in both of their the interactomes. We have already hierarchical clustering. The strategy implemented in our pro- generated relevant information for some conserved genes by gram is related to permutation tests commonly used in other comparing the S.cerevisiae and D.melanogaster interactomes, contexts (reviewed by Felsenstein, 2004, pp. 359–363), which which will be presented elsewhere. are applied for the first time to the resolution of the ‘ties in UVCLUSTER has been designed keeping in mind the needs proximity’ problem that arises in clustering methods when of groups working in functional biology. Potential users of many distances are equal. We have shown that UVCLUSTER UVCLUSTER are those researchers interested in obtaining has four main strengths. First, it may easily discover and define in a short period of time significant information for parts of sets of closely linked proteins. Secondary distance data among the interactome of a species, for example, including all the the elements of a set delineate their relationships in a way that proteins involved in the particular biological process that they their direct, primary distances cannot do (see Figures 2 and 4). are studying. Therefore, analyses involving tens to hundreds Second, UVCLUSTER may be used to discover proteins of elements, similar to the ones we have shown above are 375 V.Arnau et al. Fig. 6. UVCLUSTER results for the refined modules defined by Bergmann et al. (2004). See text for details. expected to be the most frequently performed. It is important the number of elements to generate reliable secondary distance that, for our datasets, we have found that secondary distance tables. For example, when we repeatedly analyzed the actin results do not significantly change by increasing N above a dataset with N = 100, the topology shown in Figure 4B certain threshold. Our experience with the program suggests (obtained when N = 10 000) appeared only in 7 out of 10 that it is reasonable to use a value of N of at least ten times cases (in the other three, clusters 1 and 2 appeared fused 376 Hierarchical clustering in proteomics together). However, 10 repetitions with N = 500 generated Bader,G.D. and Hogue,C.W.V. (2002) Analyzing yeast protein–protein interaction data obtained from different the same clusters shown in Figure 4B. This recommendation sources. Nat. Biotechnol., 20, 991–997. of using at least 10 times the number of elements sharply Bader,G.D. and Hogue,C.W.V. (2003) An automated method for contrasts with the suggestions of several well-known com- finding molecular complexes in large protein interaction net- puter programs, which suggest to perform about 10 replicates works. BMC Bioinformatics, 4,2. when there are ties [summarized in Backeljau et al. (1996) Bader,G.D., Heilbut,A., Andrews,B., Tyers,M., Hughes,T. and and recently confirmed by our group]. Boone,C. (2003) Functional genomics and proteomics: charting UVCLUSTER analyses involving up to 1000 elements (e.g. a multidimensional map of the yeast cell. Trends Cell Biol., 13, 1000 proteins) can be, as the times described previously 344–356. demonstrate, easily performed on standard computer equip- Barabási,A.L. and Oltvai,Z.N. (2004) Network biology: under- ment. However, our general guideline to select the number of standing the cell’s functional organization. Nat. Rev. Genet., iterations suggests that UVCLUSTER, in its current imple- 5, 101–113. Bergmann,S., Ihmels,J. and Barkai,S. (2004) Similarities and dif- mentation, has severe limitations to analyze whole-proteome ferences in genome-wide expression data of six organisms. PLoS data. For example, for the whole S.cerevisiae dataset, which Biol., 2, 0085–0093. is defined by about 11 millions of primary distances, our Bu,D., Zhao,Y., Cai,L., Xue,H., Zhu,X., Lu,H., Zhang,J., Sun,S., guideline suggests performing analyses with N values of at Ling,L., Zhang,N., Li,G. and Chen,R. (2003) Topological least 50 000. Such a huge analysis is evidently not feas- structure analysis of the protein–protein interaction network in ible on standard PC equipment (estimated time: 6100 h, budding yeast. Nucl. Acids Res., 31, 2443–2450. with AC = 100), showing that a parallel implementation of Drees,B.L., Sundin,B., Brazeau,E., Caviston,J.P., Chen,G.C., UVCLUSTER, whose development is already in progress Guo,W., Kozminski,K.G., Lau,M.W., Moskow,J.J., Tong,A. et al. by our group, is essential for analyses involving very large (2001) A protein interaction map for cell polarity development. datasets. J. Cell Biol., 154, 549–571. Everitt,B.S., Landau,S. and Leese,M. (2001) Cluster Analysis, 4th Finally, it is obvious that UVCLUSTER may also be used to edn. Arnold, London. analyze information other than protein interaction data. Any Felsenstein,J. (2004) Inferring Phylogenies. Sinauer Associates, Inc. type of information that can be converted into primary dis- Sunderland, MA. tances and that suffers from the ties in proximity problem Floyd,R.W. (1962) Algorithm 97—Shortest path. Commun. can be advantageously analyzed using UVCLUSTER. Typical ACM, 5, 345. examples in genomics are the analyses of connections among Gagneur,J., Krause,R., Bouwmeester,T. and Casari,G. (2004) protein domains, in which distances are defined depending on Modular decomposition of protein–protein interaction networks. the combination of domains found in the set of proteins of one Genome Biol., 5, R57. or multiple species (e.g. Mott et al., 2002; Ye and Godzik, Gavin,A.C., Bösche,M., Krause,R., Grandi,P., Marzioch,M., 2004) or the analysis of distances estimated from genetic Bauer,A., Schultz,J., Rick,J.M., Michon,A.M., Cruciat,C.M. et al. (2002) Functional organization of the yeast proteome by interaction screens (Tong et al., 2004). Other fields, as the ana- systematic analysis of protein complexes. Nature, 415, 141–147. lysis of paper citations or coauthorships (Albert and Barabási, Gibbons,F.D. and Roth,F.P. (2002) Judging the quality of gene 2002) could also benefit from the use of UVCLUSTER. expression-based clustering methods using gene annotation. Genome Res., 12, 1574–1581. ACKNOWLEDGEMENTS Giot,L., Bader,J.S., Brouwer,C., Chaudhuri,A., Kuang,B., Li,Y., Research supported by the Spanish Ministry of Science Hao,Y.L., Ooi,C.E., Godwin,B., Vitols,E. et al. (2003) A protein and Technology (MCYT; projects GEN2001-4851-C06- interaction map of Drosophila melanogaster. Science, 302, 02 [NEUROGENOMICA/CAGEPEP] and SAF2003-09506), 1727–1736. Goldberg,D.S. and Roth,F.P. (2003) Assessing experimentally Fundació La Caixa (01/080-00) and Generalitat Valenciana derived interactions in a small world. Proc. Natl Acad. Sci., USA, (GV04B-141). S.M. is the recipient of an FPU predoctoral 100, 4372–4376. fellowship from the MCYT. Gordon,A.D. (1999) Classification, 2nd edn. Chapman and Hall/CRC, Boca Ratón, FL. REFERENCES Ho,Y., Gruhler,A., Hellbut,A., Bader,G.D., Moore,L., Adams,S.L., Albert,R. and Barabási,A.L. (2002) Statistical mechanics of complex Millar,A., Taylor,P., Bennett,K., Boutllier,K. et al. (2002) networks. Rev. Modern Phys., 74, 47–97. Systematic identification of protein complexes in Saccharomyces Arnau,V. and Marín,I. (2003) A hierarchical clustering strategy and cerevisiae by mass spectrometry. Nature, 415, 180–183. its application to proteomic interaction data. Lect. Notes Comput. Huh,W.K., Falvo,J.V., Gerke,L.C., Carroll,A.S., Howson,R.W., Sci., 2652, 62–69. Weissman,J.S. and O’Shea,E.K. (2003) Global analysis of protein Backeljau,T., De Bruyn,L., De Wolf,H., Jordaens,K., Van Dongen,S. localization in budding yeast. Nature, 245, 686–691. and Winnepenninckx,B. (1996) Multiple UPGMA and Neighbor- Ito,T., Chiba,T., Ozawa,R., Yoshida,M., Hattori,M. and Sakaki,Y. joining trees and the performances of some computer packages. (2001) A comprehensive two-hybrid analysis to explore the yeast Mol. Biol. Evol., 13, 309–313. protein interactome. Proc. Natl Acad. Sci., USA, 98, 4569–4574. 377 V.Arnau et al. Kemmeren,P., van Berkum,N.L., Vilo,J., Bijma,T., Donders,R., Rives,A.W. and Galitski,T. (2003) Modular organization of cellular Brazma,A. and Holstege,F.C. (2002) Protein interaction veri- networks. Proc. Natl Acad. Sci., USA, 100, 1128–1133. fication and functional annotation by integrated analysis of Salwinski,L. and Eisenberg,D. (2003) Computational methods of genome-scale data. Mol. Cell, 9, 1133–1143. analysis of protein–protein interactions. Curr. Opin. Struct. Biol., Kumar,S., Tamura,K., Jakobsen,I.B. and Nei,M. (2001) 13, 377–382. MEGA2: molecular evolutionary genetics analysis software. Schwikowski,B., Uetz,P. and Fields,S. (2000) A network of Bioinformatics, 17, 1244–1245. protein–protein interactions in yeast. Nat. Biotechnol., 18, Levenstien,M.A., Yang,Y and Ott,J. (2003) Statistical significance 1257–1261. for hierarchical clustering in genetic association and microarray Spirin,V. and Mirny,L.A. (2003) Protein complexes and functional expression studies. BMC Bioinformatics, 4, 62. modules in molecular networks. Proc. Natl Acad. Sci., USA, 100, Li,S., Armstrong,C.M., Bertin,N., Ge,H., Milstein,S., Boxem.M., 12123–12128. Vidalain,P.O., Han,J.D.J., Chesneau,A., Hao,T. et al. (2004) Takezaki,N. (1998) Tie trees generated by distance methods A map of the interactome network of the metazoan C.elegans. of phylogenetic reconstruction. Mol. Biol. Evol., 15, Science, 303, 540–543. 727–737. MacCuish,J., Nicolaou,C. and MacCuish,N.E. (2001) Ties in Tong,A.H.Y., Lesage,G., Bader,G.D., Ding,H., Xu,H., Xin,X., proximity and clustering compounds. J. Chem. Inf. Comput. Sci., Young,J., Berriz,G.F., Brost,R.L., Chang,M. et al. (2004) Global 41, 134–146. mapping of the yeast genetic interaction network. Science, 303, Mewes,H.W., Frishman,D., Guldener,U., Mannhaupt,G., Mayer,K., 808–813. Uetz,P., Giot,L., Cagney,G., Mansfield,T.A., Judson,R.S., Mokrejs,M., Morgenstern,B., Munsterkotter,M., Rudd,S. and Knight,J.R., Lockshon,D., Narayan,V., Srinivasan,M., Weil,B. (2002) MIPS: a database for genomes and protein Pochart,P. et al. (2000) A comprehensive analysis of sequences. Nucl. Acids Res., 30, 31–34. protein–protein interactions in Saccharomyces cerevisiae. Mott,R., Schultz,J, Bork,P. and Ponting,C.P. (2002) Predicting Nature, 403, 623–627. protein cellular localization using a domain projection method. von Mering,C., Krause,R., Snel,B., Cornell,M., Oliver,S.G., Genome Res., 12, 1168–1174. Fields,S. and Bork,P. (2002) Comparative assessment of large- Nei,M. and Kumar,S. (2000) Molecular Evolution and scale data sets of protein–protein interactions. Nature, 417, Phylogenetics. Oxford University Press, New York. 399–403. Orlev,N., Shamir,R. and Shiloh,Y. (2004) PIVOT: protein interac- Wagner,A. (2001) The yeast protein interaction network evolves tions visualization tool. Bioinformatics, 20, 424–425. rapidly and contains few redundant duplicate genes. Mol. Biol. Pereira-Leal,J.B., Enright,A.J. and Ouzounis,C.A. (2004) Detection Evol., 18, 1283–1292. of functional modules from protein interaction networks. Proteins, Wilhelm,T., Nasheuer,H.P. and Huang,S. (2003) Physical and 54, 49–57. functional modularity of the protein network in yeast. Mol. Cell. Prinz,S., Avila-Campillo,I., Aldridge,C., Srinivasan,A., Proteom., 2, 292–298. Dimitrov,K., Siegel,A.F. and Galitski,T. (2004) Control of Xenarios,I., Salwinski,L., Duan,X.J., Higney,P., Kim,S.M. and yeast filamentous-form growth by modules in an integrated Eisenberg,D. (2002) DIP, the Database of Interacting Proteins: a molecular network. Genome Res., 14, 380–390. research tool for studying cellular networks of protein interactions. Przulj,N., Wigle,D.A. and Jurisica,I. (2004) Functional topology in Nucl. Acids Res., 30, 303–305. a network of protein interactions. Bioinformatics, 20, 340–348. Ye,Y. and Godzik,A. (2004) Comparative analysis of protein domain Quackenbush,J. (2001) Computational analysis of microarray data. Nat. Rev. Genet., 2, 418–427. organization. Genome Res., 14, 343–353. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

Iterative Cluster Analysis of Protein Interaction Data

Bioinformatics , Volume 21 (3): 15 – Sep 16, 2004

Loading next page...
 
/lp/oxford-university-press/iterative-cluster-analysis-of-protein-interaction-data-FHNCoM4fAR

References (43)

Publisher
Oxford University Press
Copyright
Bioinformatics vol. 21 issue 3 © Oxford University Press 2005; all rights reserved.
eISSN
1367-4811
DOI
10.1093/bioinformatics/bti021
pmid
15374873
Publisher site
See Article on Publisher Site

Abstract

Vol. 21 no. 3 2005, pages 364–378 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bti021 Iterative Cluster Analysis of Protein Interaction Data 1 2 2,∗ Vicente Arnau , Sergio Mars and Ignacio Marín 1 2 Departamento de Informática and Departamento de Genética, Universidad de Valencia, Burjassot 46100, Valencia, Spain Received on March 12, 2004; revised on September 7, 2004; accepted on September 7, 2004 Advance Access publication September 16, 2004 ABSTRACT protein–protein interaction data has created a similar need Motivation: Generation of fast tools of hierarchical clustering for methods to efficiently explore the complex graphs of to be applied when distances among elements of a set are con- interconnected proteins that often faithfully represent com- strained, causing frequent distance ties, as happens in protein plex metabolic functions of a cell (Schwikowski et al., 2000; interaction data. Drees et al., 2001; reviewed in Salwinski and Eisenberg, 2003; Results: We present in this work the program UVCLUSTER, Bader et al., 2003). that iteratively explores distance datasets using hierarch- Among the best-known and most powerful methods of ical clustering. Once the user selects a group of proteins, classification are those involving hierarchical clustering UVCLUSTER converts the set of primary distances among (reviewed in Everitt et al., 2001). Elements are progressi- them (i.e. the minimum number of steps, or interactions, vely classified into sets either by sequentially putting them required to connect two proteins) into secondary distances together in non-overlapping classes (agglomerative methods) that measure the strength of the connection between each or by progressively dividing the full set of elements into pair of proteins when the interactions for all the proteins in the smaller groups (divisive methods). Many different types of group are considered. We show that this novel strategy has hierarchical clustering methods exist that depend on how the advantages over conventional clustering methods to explore distances among elements are evaluated to build (or split) protein–protein interaction data. UVCLUSTER easily incorpor- the groups, and there is considerable interest in determin- ates the information of the largest available interaction data- ing when some of those methods perform better than others sets to generate comprehensive primary distance tables. The (e.g. Gibbons and Roth, 2002). Hierarchical clustering is one versatility, simplicity of use and high speed of UVCLUSTER of the most common methods of classification used in biology. on standard personal computers suggest that it can be a Among many others, common uses of hierarchical cluster- benchmark analytical tool for interactome data analysis. ing techniques are the classification of organisms of different Availability: The program is available upon request from populations or species according to quantitative similarities the authors, free for academic users. Additional information (numerical taxonomy), the ordering, according to sequence available at http://www.uv.es/∼genomica/UVCLUSTER similarity, of sets of genes or proteins and, more recently, the Contact: ignacio.marin@uv.es determination of sets of genes with similar profiles of expres- sion according to microarray-derived data (reviewed in Nei 1 INTRODUCTION and Kumar, 2000; Quackenbush, 2001; Felsenstein, 2004). In this study, we explore the use of hierarchical clustering for The extraction of relevant information from massive amounts the functional classification of proteins using protein–protein of biological data is becoming crucial in the post-genomic interaction data. Our approach has two steps. First, we meas- era. Current efforts are focused on the generation of tools ure the distance among any two proteins in a protein–protein able to allow the classification of large amounts of similar, interaction network according to the minimum number of correlated or interconnected elements and retrieve from that steps required to connect them, where each step is a known, classification useful patterns or regularities that can be later physical protein–protein interaction. Second, we use those explored either in the laboratory or in silico. Thus, in the distances to classify the proteins in groups using hierarchical last years we have witnessed an ever-increasing interest in clustering. This approach is thus in principle very similar to the development of classification tools for gene expression those used for other types of data and would seem quite easy data obtained from microarray analysis (Quackenbush, 2001; to implement. However, protein–protein interaction data have Gibbons and Roth, 2002). The recent generation of massive special features that make the use of hierarchical clustering methods particularly problematic. To whom correspondence should be addressed. 364 Bioinformatics vol. 21 issue 3 © Oxford University Press 2004; all rights reserved. Hierarchical clustering in proteomics An often overlooked aspect of hierarchical cluster analysis of 7.14. This distance should become smaller when more is the fact that it has serious intrinsic problems when used interactions are determined. Thus, recently Wilhelm et al. with datasets in which the distances among many elements (2003) described a S.cerevisiae dataset involving about 20 000 are identical (the ‘ties in proximity’ problem; Backeljau et al., interactions and with an average distance for the proteins of the 1996; Takezaki, 1998; MacCuish et al., 2001). Ties generate largest connected component equal to only 2.57. Similarly, in multiple mathematically equivalent solutions when hierarch- a recent analysis in which 15 210 interactions were considered ical clustering is performed. In favorable cases, when ties are [data obtained from the January 2004 release of the DIP data- very rare, we would anyway expect all alternative clustering base; Xenarios et al. (2002)], we found that all proteins that solutions to be very similar. The simplest solution in those can be connected have distances that range from 1 to 12. When cases is to solve the ties arbitrarily, for example, in ways that we assigned a conventional value of distance equal to 24 (i.e. are dependent on the order of data input. However, when ties twice the largest detected for any connected proteins) to all are frequent, the number of alternative solutions and the dif- those pairs of proteins that cannot be connected by any path, ferences among those alternatives increase. Backeljau et al. we found an average distance of 4.97 for 4721 S.cerevisiae (1996) summarized the effect of ties on the performance of proteins. This value is of course an overestimate, because in several commonly used computer programs, concluding that the future more/all proteins are expected to be found connec- none of them was able to correctly confront this problem. The ted, and with shorter distances. In any case, for DIP data, we relative importance of ties was further analyzed by Takezaki found that there are 4721 × 4720/2 = 11.1 millions of dis- (1998), which determined the likelihood of ties and their effect tances among proteins that are constrained to a short range of on bootstrap tests for some small simulated datasets, conclud- possible distance values, mostly between 1 and 12, so it can ing that ties have only occasionally a significant influence. be deduced that the S.cerevisiae interactome dataset includes However, these two works analyzed small distance matrices millions of identical distances. Very recent data for animal based on nucleotide sequences, in which ties are relatively species, such as Drosophila melanogaster or Caenorhabditis rare and often caused by the effect of rounding decimal num- elegans are more limited (i.e. the average number of inter- bers. Ties are much more frequent in other types of data, most actions per protein is lower than in S.cerevisiae). However, especially when distances are constrained to a narrow range the general pattern of small world properties with constrained of values. Most especially, protein–protein interaction data distance values is similarly observed (Giot et al., 2003; Li generate one of the datasets in which ties are most prominent. et al., 2004). Currently, the most complete interactome data available in The ties in proximity problem has led most authors inter- eukaryotes involves the protein interaction network of the ested in analyzing protein interaction data to focus their efforts yeast Saccharomyces cerevisiae. In addition to results derived on generating tools, which are based on graph theory and from a large number of small-scale, directed studies, massive designed to detect clusters of highly or similarly connec- non-directed protein–protein interaction data for S.cerevisiae ted proteins (e.g. Bader and Hogue, 2003; Bu et al., 2003; were generated either by using two-hybrid assays (Uetz et al., Goldberg and Roth, 2003; Spirin and Mirny, 2003; Gagneur 2000; Ito et al., 2001) or by pulling down protein complexes et al., 2004; Pereira-Leal et al., 2004; Przulj et al., 2004). using a tagged subunit as bait (Gavin et al., 2002; Ho et al., These approaches are very interesting because they may effi- 2002). Analysis of the large datasets generated using these ciently detect functionally relevant modules. However, they approaches demonstrated that the S.cerevisiae interactome has have the obvious limitation of not allowing a general view ‘small world’ properties (Wagner, 2001; see review by Albert of the relationships among proteins selected with criteria and Barabási, 2002). We can define the distance between different from their degree of connection. In this study, we two proteins as the minimum number of direct interactions describe a novel, alternative strategy of analysis that allows among proteins in the dataset that are required in order to to explore the characteristic type of complex data that suffers connect them (a parameter often called ‘minimal path length’ the ties in proximity problem using hierarchical clustering. or simply ‘path length’; see Barabási and Oltvai, 2004). For In a significant precedent to our work, Rives and Galitski a ‘small world’ interaction dataset, it is found that many dis- (2003) circumvented the problem by using not the distances tances are identical because most proteins are either directly among elements but the correlation coefficient obtained when connected (i.e. they are part of the same protein complex or the distances of two proteins are measured against all the interact in two-hybrid assays; distance = 1) or separated by members in a database (see also Prinz et al., 2004 for a just a few steps, each one involving a physical interaction related procedure). Our method is different, being based on between two proteins or protein complexes. That interactome obtaining multiple, equally valid solutions and evaluating the data have small world properties was already anticipated distances among elements according to those multiple solu- using a small fraction of the information currently available tions. This strategy is implemented in a program that we have (just 899 interactions) by Wagner (2001), which described called UVCLUSTER, which is extremely simple to use and in the S.cerevisiae protein network a group of 466 pro- fast enough to analyze large datasets involving hundreds of teins connected by an average distance (average path length) proteins on standard PC computers. UVCLUSTER allows the 365 V.Arnau et al. determination of the relative degree of proximity of groups of proteins that can be defined by the user in several ways, generating outputs that can be easily explored or converted into dendrograms. We demonstrate that UVCLUSTER-based analyses have obvious advantages over conventional hierarch- ical clustering and that UVCLUSTER can be used to explore interactome data in order to detect new proteins relevant to any chosen biological process. 2 SYSTEM AND METHODS The UVCLUSTER program (available upon request from the authors, free for academic users) is written in C and its versions have been compiled and tested on Windows and Linux operating systems. The flow chart of the program is shown in Figure 1. UVCLUSTER analyses begin by import- ing a text file containing a dataset of direct protein–protein interactions. This dataset can be then filtered in two different ways. First, the user can create a list of proteins and then select one of two options: (1) use only those interactions between pairs of proteins in the list; or (2) exclude all interactions involving any protein in the list. After this first filter, a table with the selected group of interactions can be saved for further analyses. Then the user can apply a second filter by selecting a cutoff value for the maximum/minimum number of interac- tions that a protein may have to be included in the analysis. This second filter allows to exclude proteins with a number of interactions lower than a particular threshold (i.e. poorly connected proteins) or higher than the cutoff value (i.e. it eliminates ‘hubs’, highly connected proteins). After these two filters, the UVCLUSTER program generates with the remain- ing proteins a matrix of primary distances (d ; equivalent to their shortest path or minimal path length) among them. If two proteins cannot be connected directly or indirectly, we have followed the convention of assigning a distance that is twice the longest distance among connected proteins in the dataset Fig. 1. UVCLUSTER flow chart. (modifying this convention does not alter any of the results that we will present below). The tables of primary distances gener- ated by UVCLUSTER can be saved for further analyses. Thus, The second alternative involves providing the program with in other rounds of analysis, the user may decide whether to a list of proteins. Then, the program generates a subtable of import a new file to generate another primary distance table or primary distances, but only including the proteins in the list to directly use a table that is already available. This is useful, for which interaction data are available. because a primary distance table obtained from all the interac- Once the subtable of primary distances has been generated tion data that exist for a species is generated that can be used by any of these two methods, UVCLUSTER iteratively uses many times in different analyses. agglomerative hierarchical clustering on the primary distances Once a primary distance table has been generated or simply dataset to generate N alternative, equally valid, clustering loaded, UVCLUSTER allows the user to further select groups solutions. The value of N must be chosen by the user before of proteins to be analyzed, again in two different ways. A first starting the analysis after considering the speed of the com- option is to choose a single protein (‘network center’) while puter, number of proteins to analyze and the precision of establishing a cutoff distance value. In this case, the program analysis required (see below for a discussion). The solutions explores the primary distance table, by selecting all the pro- are generated in three steps: (1) random sampling of the ele- teins with primary distances from the selected one that are ments of the dataset; (2) elements in the dataset are clustered equal or lower than the cutoff value and generates a subtable according to group average linkage (Everitt et al., 2001); and containing only those proteins together with their distances. (3) the agglomerative process is finished at a certain point, 366 Hierarchical clustering in proteomics defined by the user that must select the value for a global exploration of the results, especially for those users that are ‘stopping rule’ parameter, the Affinity Coefficient (AC). AC is not familiar with MEGA or other similar packages. Finally, defined as follows: the third UVCLUSTER output file contains a graphical rep- resentation of the data in PGM (Portable GreyMap) format. To AC = 100[(P − C )/(P − 1)], (1) m m m generate the PGM file, proteins are ordered according to the where C (cluster mean) is the average of the distances for results described in the second UVCLUSTER output file. To all elements included in the clusters and P (partition mean) facilitate the analysis of the PGM figure, a fourth output file is is the average value of distances for the whole set of selected generated that contains a list of names that corresponds to the proteins. If AC = 100, then C = 1, meaning that only pro- order in which proteins are shown in that figure. The PGM rep- teins with distance equal to 1 can be clustered together. Using resentation (see Arnau and Marín, 2003) is a square formed AC < 100 relaxes the conditions, allowing that proteins sep- by K smaller color-coded squares, where K is the num- arated by distances higher than 1 be put together in the same ber of proteins analyzed. Shades of gray indicate the degree cluster. It therefore favors the generation of larger clusters, of interaction between each pair of proteins, with light gray although including proteins often indirectly connected. We corresponding to close proteins and black to distantly con- generally use values of AC ranging from 50 (highly relaxed) nected proteins. PGM format files can be read using freeware to 100 (maximally strict) in our analyses. If many proteins are programs such as IrfanView 3.85 (www.irfanview.com). directly connected, it may be useful to increase the AC value, while if distances among proteins are large, AC values may be 3 ALGORITHM decreased to facilitate the detection of potentially interesting The dataset of direct protein–protein interactions must be clusters. written as a text file, in such a way that each line of the Once the dataset of N alternative solutions is obtained, file contains the names of two interacting proteins separ- UVCLUSTER evaluates in how many of them each pair of ated by a tab. This is the format most commonly used elements appear together in the same cluster and generates to express protein–protein interaction data in public data- an output file containing a table with secondary distances (d ) bases. Therefore, massive amounts of data can be very among the elements. The secondary distance between two ele- quickly imported into UVCLUSTER. Primary distances are ments is defined as the number of solutions in which those two calculated from this file of direct interactions using Floyd’s elements do not appear together in the same cluster divided algorithm (Floyd, 1962). by the total number of solutions (N ). Therefore, iterative res- The main algorithm of the program, that characterizes the ampling of the primary distance data allows to establish how iterative clustering method, can be described as follows: likely it is for each pair of elements to be clustered together when many alternative, equally good, clustering solutions are generated. This is the key of the strategy implemented Import table with d values in UVCLUSTER. Secondary distances establish the strength Select N , AC values of the connection between two elements, relative to all the Repeat_from k = 1 elements in the analyzed dataset [an idea first developed Random ordering of elements; in Arnau and Marín (2003), although using a less appropriate Agglomerative hierarchical clustering using group clustering method]. Ties in secondary distances will be very average linkage (d , AC); rare, due to the complex nature of the connections among ele- Increment counters according to the solution found; ments. This means that the secondary distance dataset may k = k + 1; then be analyzed by conventional (i.e. non-iterative) methods To k = N to obtain a rigorous classification of the elements. Generate d values corrected by N To facilitate the exploration of the results, UVCLUSTER Export files containing: (a) tables of d , d values; generates four output files. The first one contains the tables of (b) UPGMA clustering with d values and primary and secondary distances among the chosen elements (c) graphical output in PGM format. plus the values of several significant parameters used in the analyses, such as AC, C and P . This first file also contains a m m table of secondary distances suitable to be copied to a text file and directly imported into MEGA 2.1 (Kumar et al., 2001). In 4 IMPLEMENTATION MEGA, the secondary distance data can be used to generate 4.1 Speed and performance dendrograms, using conventional methods of clustering such as UPGMA or Neighbor-joining. In the second UVCLUSTER No matter how complex or extensive the raw interaction data output file, the results of an agglomerative hierarchical clus- are, they can be quickly converted into primary distance tering using UPGMA performed with the secondary distance data by UVCLUSTER and then saved for later analyses. In data are detailed. This file may be useful for a preliminary addition, the speed of generation of primary distance tables 367 V.Arnau et al. allows us to easily update any comprehensive species-specific the solution generated will necessarily fail to detect the two table, every time a new information is available. For example, natural clusters. Figure 2C shows a UPGMA tree obtained the S.cerevisiae data available in the January 2004 release of using the secondary distances generated by UVCLUSTER the DIP database (4721 proteins, 15 210 interactions) can be with N = 10 000 and AC = 100. In this case, distances converted into a primary distance table in about 14 min on a among units 1–3 or 9–11 are equal to zero. It can be seen that standard PC computer (Intel Pentium IV 2.8 GHz processor Figure 2C closely corresponds to Figure 2A: the two clusters with 512 MB RAM memory). appear as discrete entities with units 4 and 8 being the ones Once a table with primary distances has been generated, most closely connected to the rest. secondary distance data are also obtained very quickly. Time It is interesting to develop an index to summarize the validity depends on the selected AC value. Smaller AC values that pro- of the clusters detected, in order to be able to compare differ- long the clustering process, require longer time. Therefore, to ent cluster results. Such an index would help in determining establish the speed of UVCLUSTER analyses, we have per- that the solution shown in Figure 2C is better than that shown formed tests with AC values ranging from 50 to 100. First, in Figure 2B. Many rules have been proposed to evaluate the by using a set of 34 elements and 561 primary distances different partitions of a dataset into clusters [see review by (see Section 4.3), and running on the same standard com- Gordon, 1999, pp. 60–65 and 185–204). However, we think puter detailed above, UVCLUSTER obtained the values of that none of these methods is fully convenient when applied secondary distances with a number of iterations, N = 10 000 to proteomic interaction data. They all use the whole distance in less than 2 s. Time increased linearly with the number of dataset for cluster validation (i.e. they take into account all iterations, so a similar analysis with N = 100 000 required distances, irrespectively of them implying direct interactions 9(AC = 100)to13(AC = 50) s. Similarly, for a set of or simply shortest paths among distant elements). Evaluation 150 randomly chosen elements (11 175 primary distance val- of the whole distance dataset is essential for clustering, but ues), an analysis with N = 10 000 took from 9 (AC = 100) validation of the clusters should take into account two fea- to 125 (AC = 50) s. Finally, the largest dataset that we tures of this particular type of information, which demonstrate have used composed of 500 randomly chosen proteins (i.e. that a distance equal to one is qualitatively different from the 124 750 distances), which was analyzed (N = 10 000), in 23 other distances. First, distances equal to one imply a strong to 160 min, again depending on the AC value used. functional link: two proteins directly interact in the cell, and UVCLUSTER results can be directly explored, considering therefore must be at the same time in the same place and the files that the program automatically generates containing often functioning together. On the contrary, when larger than either a UPGMA cluster analysis of secondary distances or a one, two identical distances may be radically different from graphical PGM representation. However, for advanced users, a cellular point of view. For example, a distance of two may and especially for analyses including up to a few hundreds mean that two proteins are part of a complex together with a of proteins, we think that the simplest and most efficient way third protein that physically interacts with both of them, or, to explore the results involves the generation of dendrograms alternatively, may mean that in different moments of the cell based on secondary distances. Ties in secondary distances are cycle (or in different tissues, if we refer to a multicellular very rare, facilitating the generation of non-ambiguous trees. organism), there are transient interactions of both proteins For this reason, UVCLUSTER also generates outputs com- with a third one. In the first case, all three proteins may patible with MEGA 2.1 (Kumar et al., 2001). This is one of function as a single unit, while in the second, the cellular the most used program packages in the molecular evolution roles of the two proteins may be radically different. A second and phylogenetics fields and includes several standard meth- consideration is that the protein interaction datasets are incom- ods to build dendrograms. Moreover, it can be obtained free plete. As we already commented above for S.cerevisiae, each from the authors (www.megasoftware.net). new experiment diminishes the distance among proteins. In fact, comparisons among the experiments performed so far 4.2 UVCLUSTER analysis of synthetic graphs in S.cerevisiae found a low degree of congruence (Bader and A simple model (Figure 2) demonstrates the advantages of the Hogue, 2002; von Mering et al., 2002), suggesting that we iterative strategy implemented in UVCLUSTER. Figure 2A are still quite far away from having a complete dataset of shows a graph with 11 elements connected by a total of 16 the protein interactions for any species. This means that it is interactions. Two clusters (units 1–4 and 8–11) are obvious. reasonable to expect that, in the future, many more direct inter- Figure 2B shows a UPGMA tree obtained using primary dis- actions will be found. In this sense, distances larger than one tances. Input order to obtain this tree followed the numeric may be considered to be overestimates of the true distances. value (1, 2, ... ,11) assigned to the elements. The solution All these considerations lead us to suggest that cluster shown in Figure 2B clearly fails to detect the two clusters, results from proteomic interaction data may be conveniently closely connecting units 4 and 5. This type of error is caused evaluated considering just the number of intracluster and inter- by the ties. If ties are solved in such a way that, by chance, cluster direct interactions. If we have a dataset in which K units 4 and 5 (or, alternatively, 7 and 8) become clustered, elements are connected by direct interactions, thus forming 368 Hierarchical clustering in proteomics Fig. 2. A synthetic example. (A) Graph of 11 elements connected by 16 interactions. Two clusters (units 1–4 and 8–11) are observed. (B) UPGMA-based tree using primary distances derived from the graph in (A). Notice the lack of correspondence with the graph. (C) UPGMA- based tree using the secondary distances obtained using UVCLUSTER with AC = 100 and 10 000 iterations. The topology very closely corresponds to the graph shown in (A). an undirected graph, we can define the following parameters. find a number M that corresponds to the maximum possible First, F will be the maximum possible number of direct inter- number of intracluster direct interactions: actions among elements [F = K(K − 1)/2]. Let us also call n the number of actual direct interactions discovered among M = k (k − 1)/2, (2) i i the K elements. For a particular partition into clusters, we can i−1 369 V.Arnau et al. where c is the number of clusters in the partition and k is the This particular set of proteins was selected for two main number of elements in each cluster. Finally, we will call p the reasons. First, it comprises a highly explored (i.e. most total number of direct intracluster interactions actually known. likely without false positive interactions) and very com- We suggest to evaluate the partitions found in proteomic data pact set of elements, in which obvious clusters or groups by selecting as best the one that minimizes the cumulative cannot be detected. This is shown in Figure 3, obtained hypergeometric distribution that follows: using PIVOT (Orlev et al., 2004), a program that provides a minimal-energy-based layout that highlights clustering M F −M Min (M , n) among units. However, whether clusters are present in j n−j . (3) Figure 3 is unclear. This group of proteins may thus be j =p n a good test to determine whether standard methods of clustering may detect any particular organization in the This is equivalent to select the partition that generates a dis- data, and how UVCLUSTER compares with these standard tribution that maximizes the proportion of intracluster direct procedures. Second, because it contains a few elements interactions while minimizing the proportion of intercluster that stand out as participating in processes that are related direct interactions. (through connections with actin function) but not part of We can apply this rule to the trees shown in Figure 2, the two main processes in which most elements particip- by using their topology in order to establish partitions with ate (actin patch assembly and patch-mediated endocytosis), a progressive number of clusters (c ≥ 2) (see Levenstien we can test whether we can discriminate these groups using et al., 2003 for a similar approach). Circles in the dichotomic UVCLUSTER. nodes have been numbered in Figure 2B and C accord- Figure 4 shows the results obtained with two different meth- ing to the number of clusters generated when we progress ods. First, we show in Figure 4A the tree generated by MEGA from left (largest distances) to right (smallest distances). We 2.1 that was obtained with UPGMA and the primary dis- found that the tree in Figure 2C has a minimum value of tances calculated from the graph shown in Figure 3. This Equation (3) in the particular case when c = 3. In that first option therefore corresponds to the analysis that can be case, we find that out of 16 total direct interactions, 14 are typically performed using standard clustering tools. Second, found within clusters and only 2 (units 4 and 5; units 7 and we show the tree obtained when the UPGMA algorithm 8) are found between clusters. According to the cumulative is applied to the set of secondary distances obtained using hypergeometric distribution described in Equation (3), the UVCLUSTER, with N = 10 000 and AC = 100 (Figure 4B). likelihood of finding by chance a distribution at least as asym- Both trees are quite different. As was obtained in the syn- −9 metric as the one defined by such partition is 3.17×10 . None thetic example shown above, the first tree is just one of the of the partitions obtained from the tree in Figure 2B has such many possible solutions that can be obtained by applying hier- −6 a small value (lowest value: 3.30 × 10 ). We can conclude, archical clustering to the primary distance dataset. When we as intuitively was evident, that UVCLUSTER analysis using evaluated the topologies shown in Figure 4A and B using secondary distances has provided a partition that is superior the hypergeometric-based parameters described above, we to those obtained by using the tree topology generated from found a partition (with c = 12) in the tree obtained using −27 the direct analysis of the primary distances. UVCLUSTER that is much more extreme (p = 1.79× 10 ) than the best value obtained from the direct analysis of the 4.3 UVCLUSTER exploration of the actin −23 primary distances (c = 10; p value = 1.80 × 10 ). This cytoskeleton of Saccharomyces cerevisiae optimal partition is shown also in Figure 4B. Because it To demonstrate the advantages that our program provides is known that one important problem with protein–protein when applied to real biological data, we first chose as a interaction data is the presence of false positives, we tested model a set of 34 S.cerevisiae proteins characterized by Drees whether the partition shown in Figure 4B is robust by ran- et al. (2001), which they described as 26 proteins participat- domly generating false protein–protein interactions that were ing in actin patch assembly and patch-mediated endocytosis added to the DIP database. Even when 7605 false interac- together with 8 proteins involved in other related processes, tions were added (making the randomly generated interactions such as cytokinesis (BNI1, BNR1), endocytosis (SVL3), con- 33% of all present in the modified database), the clusters trol of the morphogenesis checkpoint (SWE1, HSL7) or the were identically recovered when N = 10 000, AC = 100. CDC42 signaling pathway (CDC42, CLA4, GIC2) (Drees Only when 15 210 false interactions were added (50% of et al., 2001). Results shown in that study were updated all interactions are then spurious) we found a variation: by analyzing the DIP database, which we found contained clusters 2 and 7 appeared together. These results suggest all the information available in the literature. The graph of that, when provided a significant core of true interactions, protein–protein interactions shown in Figure 3 summarizes UVCLUSTER analyses may provide robust topologies even all currently (February 2004) available information for that in the presence of a substantial number of false positive set of proteins. interactions. 370 Hierarchical clustering in proteomics Fig. 3. Set of proteins involved in actin patch assembly and endocytosis according to Drees et al. (2001). Lines indicate direct interactions among them as found in the DIP database (January 2004 release). The figure was drawn using PIVOT (Orlev et al., 2004). We also wanted to check whether the partition obtained of proteins are involved according to Gene Ontology (GO) was biologically relevant. First, we considered the informa- terms. For the whole set of proteins, we found, as expected, tion provided by Drees et al. (2001). The proteins involved that the general GO process term defined as ‘actin cytoskel- (according to these authors) in actin patch assembly and eton organization and biogenesis’ included 20 out of the 34 function form clusters 1–5, 7, 11 and 12. Cluster 6 corres- proteins analyzed, with a probability of those proteins being −29 ponds to the three proteins described as involved in CDC42 together by chance equal to 1.03 × 10 . For the particular signaling. Cluster 8 includes the two proteins described by clusters including two or more proteins (because a limita- Drees et al. (2001) as involved in the morphogenesis check- tion of the program is that requires at least two proteins in point together with a third protein (APP1, called YNL094w a cluster, thus eliminating clusters 3, 4 and 12 in Figure 4B) in the study by Drees et al.) whose relationship with that we found the following results (considering only the GO terms process has not been determined. Cluster 9 contains the that gave the lowest probabilities of the cluster occurring endocytosis protein SVL3 plus a protein involved in actin by chance): clusters 1 and 7—actin cytoskeleton organiza- −11 polymerization and crosslinking to microtubules (CRN1). tion and biogenesis [p(cluster 1) = 2.65 × 10 ; p(cluster −6 Finally, cluster 10 contains the two proteins involved in 7) = 1.95 × 10 ]; cluster 6—Rho protein signal transduc- cytokinesis (BNI1, BNR1) plus PFY1, which plays sev- tion (equivalent to the definition by Drees et al. of ‘involved −8 eral roles in actin organization and polymerization. These to CDC42 signaling’; p = 1.51 × 10 ); cluster 8—G2/M ‘hybrid’ clusters were expected because the actin patch transition of mitotic cell cycle (again related to the assigna- proteins that closely connect with proteins involved in other tion of this group of proteins to the morphogenesis checkpoint −5 processes may contribute to linking these functions together by Drees et al.; p = 5.78 × 10 ); cluster 9—cell growth in the cell (Drees et al., 2001). As a second biological val- and/or maintenance (p = 0.091); cluster 10—response to −7 idation of the clusters, we performed searches using the osmotic stress (p = 5.06 × 10 ) and cluster 11—actin fil- −7 SGD Gene Ontology Term Finder (http://db.yeastgenome.org/ ament depolymerization (p = 7.55 × 10 ). Only clusters 2 cgi-bin/SGD/GO/goTermFinder) that allows, among other and 5 did not generate significant assignations according to features, to determine the most likely process in which a group GO terms. These results are interesting in two different ways. 371 V.Arnau et al. Fig. 4. Comparison of analyses using primary versus secondary distances from the dataset described in Figure 3. (A) UPGMA-based tree using the primary distances derived from Figure 3 data. Names in bold indicate deviations with respect to the tree in Figure 4B. (B) UPGMA- based tree using the secondary distances estimated using UVCLUSTER (AC = 100; N = 10 000). The optimal solution, according to the hypergeometric distribution of direct interactions, consists of 12 clusters, that are detailed on the right side. First, they demonstrate that proteins in most clusters can be may be significantly involved in actin patch assembly and significantly associated to particular cellular processes, thus function that did not appear in the description by Drees et al. confirming the biological significance of our results. Second, (2001). In order to do so, we calculated the average primary the fact that the processes assigned are generally different distance of each of the 4721 proteins in the DIP database for each cluster suggests that these clusters are composed against the 26 proteins that, according to Drees et al. (2001) of proteins involved in closely related but still different pro- were involved in these processes (the other 8, as we already cesses in vivo and that GO terms are precise enough so as to described above, often showed protein–protein interactions discriminate among those processes. with these 26, but functionally were less related). We found The possibility of including all available data for a par- that 19 of those 26 proteins were among the 40 proteins with ticular species using UVCLUSTER analyses allowed us to lowest average distances in the whole dataset (average dis- ask the question of whether there are other proteins that tance values ranging from 1.31 to 2.23) and even the worst 372 Hierarchical clustering in proteomics connected protein (TRM5; average distance with the rest suggesting members that could be missed when using other equal to 2.65) was in position 199 in our list. These results approaches. confirm the results obtained by Drees et al. (2001) using the 4.4 UVCLUSTER global analysis correlating gene updated information currently available: all the proteins con- expression and protein–protein interaction sidered by those authors indeed are in close proximity in the S.cerevisiae interactome graph. However, we concluded To demonstrate that UVCLUSTER can be used at a gen- that there are other proteins that are similarly or even better omic scale, we decided to compare UVCLUSTER-generated connected. Figure 5 contains a new UPGMA tree obtained results based on protein–protein interaction data with those using also the whole DIP interaction data, with AC = 100 derived from gene coexpression data obtained using microar- and N = 10 000, but including 38 other proteins poten- rays. There is good evidence, at least in yeast, that the products tially involved in actin patch assembly and function. These of highly coexpressed genes interact with a probability much 38 proteins have average distances lower than 2.27 to the higher than expected by chance (e.g. Kemmeren et al., 2002). 26 proteins in the original dataset. The close proximity to Thus, we can expect UVCLUSTER to recover, at least in part, actin and interspersion with the proteins of the original data- clusters of coexpressed genes using protein–protein interac- set of many of those newly added proteins can be easily tion data. To analyze coexpression, we used as a starting appreciated in Figure 5. It is also significant that duplicat- database the yeast ‘refined modules’ defined by (Bergmann ing the number of proteins only alters the relative position et al., 2004). These modules are groups of genes that show a of a few proteins of the original dataset. If we compare significant level of coexpression and that were obtained start- Figures 4B and 5, we can see that only YPR171W, SLA1, ing with a core of evolutionary conserved, coexpressed genes YHR133C, RVS167 and RVS161 appear in very different to which other genes with similar expression patterns were positions in both the trees. This result can be interpreted as a added Bergmann et al. (2004). These authors described eight new confirmation that the delimited clusters were quite reli- of those modules, including a total of 548 genes, as detailed in able. Adding more data in general only contributes to make Table 1. We decided to check to which extent UVCLUSTER the groups that were already observed larger. Similar results analyses may unearth those same modules using the available were obtained using AC = 50 (not shown). As a secondary protein–protein interaction data. We therefore took the list of biological validation of our results, we have also included 548 genes (that was reduced to 543 after eliminating 5 genes in Figure 5 some additional data. First, we add informa- that appeared in two or more modules) and analyzed whether tion for protein localization from Huh et al. (2003) and other the DIP database contains interaction data for their products. sources, as compiled in the MIPS database (Mewes et al., We found that the products of 376 of those genes indeed were 2002; http://mips.gsf.de/). Fifty-six proteins are described found in DIP. However, only 163 were involved in more than in MIPS as being localized to the actin cytoskeleton. Of one protein–protein interaction. These results demonstrate them, 24 are found in Figure 5, being 16 part of the ori- that the amount of information for the protein products of ginal data from Drees et al. (2001) while the other 8 are this set of genes is quite limited. We then used UVCLUSTER among the ones we have secondarily added to our tree using (AC = 100, N = 10 000) to determine clusters of proteins UVCLUSTER analyses (see details in Figure 5). Moreover, based on interactome data. Results are shown in Figure 6. another five UVCLUSTER-suggested proteins (LSB3, LSB5, A total of 162 proteins formed several well-defined clusters BEM1, ARP2, END3) are localized according to MIPS to the while the other 214 appeared isolated, due to the fact that they yeast cytoskeleton in a broader class that may also imply inter- did not present direct interactions with any other proteins in action with actin. Second, we include also information about the dataset (these last ones have been grouped together in a the GO term ‘actin cytoskeleton organization and biogenesis’ single branch in Figure 6). As shown in the figure, several of that, as we described above, includes most of the proteins in the clusters closely follow the results obtained by Bergmann our original dataset. The SGD Gene Ontology Term Finder et al. (2004). Thus, a cluster contains 56 rRNA processing assigns to that GO term the lowest probability among all proteins plus 14 ‘contaminant’ proteins involved in other −34 terms (p = 4.3 × 10 ) when all the proteins in Figure 5 processes. Another has 29 proteasome proteins with a single are included. Moreover, all other terms with low probability ‘contaminant’, a third one includes 17 MRP proteins with are related to it. In total, there are 28 proteins in Figure 5 two ‘contaminants’, etc. We became aware when considering that are included in the ‘actin cytoskeleton organization and these results that the information in DIP is indeed quite frag- biogenesis’ GO term. Of them, 8 are from the secondary data- mentary, which explains why some very characteristic clusters set obtained using UVCLUSTER. These results demonstrate are missing in Figure 6. For example, DIP does not con- that a significant fraction of the proteins suggested by the tain interaction data corresponding to cytoplasmic ribosomes UVCLUSTER study of the interactome are related to actin, (explaining why a large ribosomal cluster does not appear), or at least cytoskeleton, function. Thus, we can conclude that or to the small subunit of mitochondrial ribosomes (there- UVCLUSTER-based analyses may significantly contribute to fore, the 17 MRP proteins that appear together are all part delineate groups of closely integrated proteins in the cell, of the large subunit). Several of the modules (e.g. glycolysis, 373 V.Arnau et al. Fig. 5. Results of the analysis to discover new actin patch assembly and endocytosis proteins. In bold, the new proteins discovered. Italics refer to the eight proteins that Drees et al. (2001) considered as belonging to processes other than patch assembly and patch-mediated endocytosis. (a): Proteins that are localized to the actin cytoskeleton according to Huh et al. (2003). (b): Proteins assigned to the GO process ‘actin cytoskeleton organization and biogenesis’ according to the SGD database (see http://www.yeastgenome.org/GOContents.shtml). 374 Hierarchical clustering in proteomics Table 1. UVCLUSTER results using the refined modules by Bergmann et al. involved in a particular process, when provided with some (2004) preliminary information. For example, if we know that a group of proteins are acting on a relevant process, we can use them Module No. of No. of Size of main as ‘seeds’ to delimit, using UVCLUSTER, all the proteins genes in proteins in cluster defined that are closely connected to them in the interactome. There- module clusters by UVCLUSTER fore, we can obtain a clearer picture of the functions of the set defined by (% of total of interesting proteins (see Figure 5 and the related analyses UVCLUSTER genes in explained above). This approach can be seen as complement- module) ary to the one based on the detection of highly connected graphs that include a protein of interest developed by Bader rRNA processing 192 74 56 (29.2) Ribosomal protein 153 20 5 (3.3) and Hogue (2003; ‘directed mode’). Third, UVCLUSTER Mitochondrial ribosomal 76 19 17 (22.3) can be also used to establish groups of connected proteins protein even when some information is unavailable, by being able to Proteasome 40 31 29 (72.5) predict potential interactions. This can be done by using AC Peroxide 29 2 0 (0.0) values lower than 100, that is, allowing proteins that do not dir- Secreted protein 28 8 4 (14.3) Glycolysis 15 2 0 (0.0) ectly interact, but that are still quite close in the interactome, Heat shock 15 6 5 (33.3) to be clustered together. We have observed that diminishing the AC value has the effect of generating topologies that are worse than the one obtained with AC = 100 when eval- uated with the hypergeometric distribution-based parameter described above. This is due to the effect that the increased peroxide) are moreover formed in part by proteins that are permissiveness of the clustering process has on the secondary not expected to physically interact. Those results explain why distances, often allowing directly connected proteins to appear only some of the modules defined by Bergmann et al. (2004) in different clusters. However, this relaxation of the conditions actually are found in UVCLUSTER analyses (see Figure 6 may be of great interest if we assume that data are incomplete and Table 1 for the details). In any case, we can conclude that and some distances may be actually lower than those available. UVCLUSTER can be used for analysis involving hundreds of In those cases, relaxed analyses may suggest potential clusters proteins and, even in those complex cases, it recovers interest- of interacting proteins that would be missed by the strictest ing functional information based on interactome data. On the ones. It is significant that UVCLUSTER can also be used in other hand, the analysis just shown casts doubts on whether the opposite direction, that is, in order to detect false positive interaction data alone will ever provide a detailed picture of interactions. For example, promiscuous proteins, which are cell function. The relative incompleteness of the results we able to interact with many different partners, can be easily have just described can be conservatively explained suggest- detected using UVCLUSTER (with AC = 100) when they ing that the available information is very fragmentary and will appear as being equally distant to many other, totally unre- significantly improve in the future. However, it could also be lated, proteins (Arnau and Marín, 2003). This may be another explained by some/many highly integrated cellular processes, significant application of our program, because, as we com- which do not require direct protein–protein interactions. This mented above, the number of false positives is considered to would hinder the processes to be identified, no matter how be high for interaction data generated by non-directed, large- exhaustive the interactome data are. scale experiments (e.g. von Mering et al., 2002). As a final potential application, UVCLUSTER can be used to quickly 5 DISCUSSION explore whether proteins encoded by orthologous genes retain UVCLUSTER is a flexible tool for global exploration of pro- related functions in two species by comparing the relative tein function using interactome data that is based on iterative positions in both of their the interactomes. We have already hierarchical clustering. The strategy implemented in our pro- generated relevant information for some conserved genes by gram is related to permutation tests commonly used in other comparing the S.cerevisiae and D.melanogaster interactomes, contexts (reviewed by Felsenstein, 2004, pp. 359–363), which which will be presented elsewhere. are applied for the first time to the resolution of the ‘ties in UVCLUSTER has been designed keeping in mind the needs proximity’ problem that arises in clustering methods when of groups working in functional biology. Potential users of many distances are equal. We have shown that UVCLUSTER UVCLUSTER are those researchers interested in obtaining has four main strengths. First, it may easily discover and define in a short period of time significant information for parts of sets of closely linked proteins. Secondary distance data among the interactome of a species, for example, including all the the elements of a set delineate their relationships in a way that proteins involved in the particular biological process that they their direct, primary distances cannot do (see Figures 2 and 4). are studying. Therefore, analyses involving tens to hundreds Second, UVCLUSTER may be used to discover proteins of elements, similar to the ones we have shown above are 375 V.Arnau et al. Fig. 6. UVCLUSTER results for the refined modules defined by Bergmann et al. (2004). See text for details. expected to be the most frequently performed. It is important the number of elements to generate reliable secondary distance that, for our datasets, we have found that secondary distance tables. For example, when we repeatedly analyzed the actin results do not significantly change by increasing N above a dataset with N = 100, the topology shown in Figure 4B certain threshold. Our experience with the program suggests (obtained when N = 10 000) appeared only in 7 out of 10 that it is reasonable to use a value of N of at least ten times cases (in the other three, clusters 1 and 2 appeared fused 376 Hierarchical clustering in proteomics together). However, 10 repetitions with N = 500 generated Bader,G.D. and Hogue,C.W.V. (2002) Analyzing yeast protein–protein interaction data obtained from different the same clusters shown in Figure 4B. This recommendation sources. Nat. Biotechnol., 20, 991–997. of using at least 10 times the number of elements sharply Bader,G.D. and Hogue,C.W.V. (2003) An automated method for contrasts with the suggestions of several well-known com- finding molecular complexes in large protein interaction net- puter programs, which suggest to perform about 10 replicates works. BMC Bioinformatics, 4,2. when there are ties [summarized in Backeljau et al. (1996) Bader,G.D., Heilbut,A., Andrews,B., Tyers,M., Hughes,T. and and recently confirmed by our group]. Boone,C. (2003) Functional genomics and proteomics: charting UVCLUSTER analyses involving up to 1000 elements (e.g. a multidimensional map of the yeast cell. Trends Cell Biol., 13, 1000 proteins) can be, as the times described previously 344–356. demonstrate, easily performed on standard computer equip- Barabási,A.L. and Oltvai,Z.N. (2004) Network biology: under- ment. However, our general guideline to select the number of standing the cell’s functional organization. Nat. Rev. Genet., iterations suggests that UVCLUSTER, in its current imple- 5, 101–113. Bergmann,S., Ihmels,J. and Barkai,S. (2004) Similarities and dif- mentation, has severe limitations to analyze whole-proteome ferences in genome-wide expression data of six organisms. PLoS data. For example, for the whole S.cerevisiae dataset, which Biol., 2, 0085–0093. is defined by about 11 millions of primary distances, our Bu,D., Zhao,Y., Cai,L., Xue,H., Zhu,X., Lu,H., Zhang,J., Sun,S., guideline suggests performing analyses with N values of at Ling,L., Zhang,N., Li,G. and Chen,R. (2003) Topological least 50 000. Such a huge analysis is evidently not feas- structure analysis of the protein–protein interaction network in ible on standard PC equipment (estimated time: 6100 h, budding yeast. Nucl. Acids Res., 31, 2443–2450. with AC = 100), showing that a parallel implementation of Drees,B.L., Sundin,B., Brazeau,E., Caviston,J.P., Chen,G.C., UVCLUSTER, whose development is already in progress Guo,W., Kozminski,K.G., Lau,M.W., Moskow,J.J., Tong,A. et al. by our group, is essential for analyses involving very large (2001) A protein interaction map for cell polarity development. datasets. J. Cell Biol., 154, 549–571. Everitt,B.S., Landau,S. and Leese,M. (2001) Cluster Analysis, 4th Finally, it is obvious that UVCLUSTER may also be used to edn. Arnold, London. analyze information other than protein interaction data. Any Felsenstein,J. (2004) Inferring Phylogenies. Sinauer Associates, Inc. type of information that can be converted into primary dis- Sunderland, MA. tances and that suffers from the ties in proximity problem Floyd,R.W. (1962) Algorithm 97—Shortest path. Commun. can be advantageously analyzed using UVCLUSTER. Typical ACM, 5, 345. examples in genomics are the analyses of connections among Gagneur,J., Krause,R., Bouwmeester,T. and Casari,G. (2004) protein domains, in which distances are defined depending on Modular decomposition of protein–protein interaction networks. the combination of domains found in the set of proteins of one Genome Biol., 5, R57. or multiple species (e.g. Mott et al., 2002; Ye and Godzik, Gavin,A.C., Bösche,M., Krause,R., Grandi,P., Marzioch,M., 2004) or the analysis of distances estimated from genetic Bauer,A., Schultz,J., Rick,J.M., Michon,A.M., Cruciat,C.M. et al. (2002) Functional organization of the yeast proteome by interaction screens (Tong et al., 2004). Other fields, as the ana- systematic analysis of protein complexes. Nature, 415, 141–147. lysis of paper citations or coauthorships (Albert and Barabási, Gibbons,F.D. and Roth,F.P. (2002) Judging the quality of gene 2002) could also benefit from the use of UVCLUSTER. expression-based clustering methods using gene annotation. Genome Res., 12, 1574–1581. ACKNOWLEDGEMENTS Giot,L., Bader,J.S., Brouwer,C., Chaudhuri,A., Kuang,B., Li,Y., Research supported by the Spanish Ministry of Science Hao,Y.L., Ooi,C.E., Godwin,B., Vitols,E. et al. (2003) A protein and Technology (MCYT; projects GEN2001-4851-C06- interaction map of Drosophila melanogaster. Science, 302, 02 [NEUROGENOMICA/CAGEPEP] and SAF2003-09506), 1727–1736. Goldberg,D.S. and Roth,F.P. (2003) Assessing experimentally Fundació La Caixa (01/080-00) and Generalitat Valenciana derived interactions in a small world. Proc. Natl Acad. Sci., USA, (GV04B-141). S.M. is the recipient of an FPU predoctoral 100, 4372–4376. fellowship from the MCYT. Gordon,A.D. (1999) Classification, 2nd edn. Chapman and Hall/CRC, Boca Ratón, FL. REFERENCES Ho,Y., Gruhler,A., Hellbut,A., Bader,G.D., Moore,L., Adams,S.L., Albert,R. and Barabási,A.L. (2002) Statistical mechanics of complex Millar,A., Taylor,P., Bennett,K., Boutllier,K. et al. (2002) networks. Rev. Modern Phys., 74, 47–97. Systematic identification of protein complexes in Saccharomyces Arnau,V. and Marín,I. (2003) A hierarchical clustering strategy and cerevisiae by mass spectrometry. Nature, 415, 180–183. its application to proteomic interaction data. Lect. Notes Comput. Huh,W.K., Falvo,J.V., Gerke,L.C., Carroll,A.S., Howson,R.W., Sci., 2652, 62–69. Weissman,J.S. and O’Shea,E.K. (2003) Global analysis of protein Backeljau,T., De Bruyn,L., De Wolf,H., Jordaens,K., Van Dongen,S. localization in budding yeast. Nature, 245, 686–691. and Winnepenninckx,B. (1996) Multiple UPGMA and Neighbor- Ito,T., Chiba,T., Ozawa,R., Yoshida,M., Hattori,M. and Sakaki,Y. joining trees and the performances of some computer packages. (2001) A comprehensive two-hybrid analysis to explore the yeast Mol. Biol. Evol., 13, 309–313. protein interactome. Proc. Natl Acad. Sci., USA, 98, 4569–4574. 377 V.Arnau et al. Kemmeren,P., van Berkum,N.L., Vilo,J., Bijma,T., Donders,R., Rives,A.W. and Galitski,T. (2003) Modular organization of cellular Brazma,A. and Holstege,F.C. (2002) Protein interaction veri- networks. Proc. Natl Acad. Sci., USA, 100, 1128–1133. fication and functional annotation by integrated analysis of Salwinski,L. and Eisenberg,D. (2003) Computational methods of genome-scale data. Mol. Cell, 9, 1133–1143. analysis of protein–protein interactions. Curr. Opin. Struct. Biol., Kumar,S., Tamura,K., Jakobsen,I.B. and Nei,M. (2001) 13, 377–382. MEGA2: molecular evolutionary genetics analysis software. Schwikowski,B., Uetz,P. and Fields,S. (2000) A network of Bioinformatics, 17, 1244–1245. protein–protein interactions in yeast. Nat. Biotechnol., 18, Levenstien,M.A., Yang,Y and Ott,J. (2003) Statistical significance 1257–1261. for hierarchical clustering in genetic association and microarray Spirin,V. and Mirny,L.A. (2003) Protein complexes and functional expression studies. BMC Bioinformatics, 4, 62. modules in molecular networks. Proc. Natl Acad. Sci., USA, 100, Li,S., Armstrong,C.M., Bertin,N., Ge,H., Milstein,S., Boxem.M., 12123–12128. Vidalain,P.O., Han,J.D.J., Chesneau,A., Hao,T. et al. (2004) Takezaki,N. (1998) Tie trees generated by distance methods A map of the interactome network of the metazoan C.elegans. of phylogenetic reconstruction. Mol. Biol. Evol., 15, Science, 303, 540–543. 727–737. MacCuish,J., Nicolaou,C. and MacCuish,N.E. (2001) Ties in Tong,A.H.Y., Lesage,G., Bader,G.D., Ding,H., Xu,H., Xin,X., proximity and clustering compounds. J. Chem. Inf. Comput. Sci., Young,J., Berriz,G.F., Brost,R.L., Chang,M. et al. (2004) Global 41, 134–146. mapping of the yeast genetic interaction network. Science, 303, Mewes,H.W., Frishman,D., Guldener,U., Mannhaupt,G., Mayer,K., 808–813. Uetz,P., Giot,L., Cagney,G., Mansfield,T.A., Judson,R.S., Mokrejs,M., Morgenstern,B., Munsterkotter,M., Rudd,S. and Knight,J.R., Lockshon,D., Narayan,V., Srinivasan,M., Weil,B. (2002) MIPS: a database for genomes and protein Pochart,P. et al. (2000) A comprehensive analysis of sequences. Nucl. Acids Res., 30, 31–34. protein–protein interactions in Saccharomyces cerevisiae. Mott,R., Schultz,J, Bork,P. and Ponting,C.P. (2002) Predicting Nature, 403, 623–627. protein cellular localization using a domain projection method. von Mering,C., Krause,R., Snel,B., Cornell,M., Oliver,S.G., Genome Res., 12, 1168–1174. Fields,S. and Bork,P. (2002) Comparative assessment of large- Nei,M. and Kumar,S. (2000) Molecular Evolution and scale data sets of protein–protein interactions. Nature, 417, Phylogenetics. Oxford University Press, New York. 399–403. Orlev,N., Shamir,R. and Shiloh,Y. (2004) PIVOT: protein interac- Wagner,A. (2001) The yeast protein interaction network evolves tions visualization tool. Bioinformatics, 20, 424–425. rapidly and contains few redundant duplicate genes. Mol. Biol. Pereira-Leal,J.B., Enright,A.J. and Ouzounis,C.A. (2004) Detection Evol., 18, 1283–1292. of functional modules from protein interaction networks. Proteins, Wilhelm,T., Nasheuer,H.P. and Huang,S. (2003) Physical and 54, 49–57. functional modularity of the protein network in yeast. Mol. Cell. Prinz,S., Avila-Campillo,I., Aldridge,C., Srinivasan,A., Proteom., 2, 292–298. Dimitrov,K., Siegel,A.F. and Galitski,T. (2004) Control of Xenarios,I., Salwinski,L., Duan,X.J., Higney,P., Kim,S.M. and yeast filamentous-form growth by modules in an integrated Eisenberg,D. (2002) DIP, the Database of Interacting Proteins: a molecular network. Genome Res., 14, 380–390. research tool for studying cellular networks of protein interactions. Przulj,N., Wigle,D.A. and Jurisica,I. (2004) Functional topology in Nucl. Acids Res., 30, 303–305. a network of protein interactions. Bioinformatics, 20, 340–348. Ye,Y. and Godzik,A. (2004) Comparative analysis of protein domain Quackenbush,J. (2001) Computational analysis of microarray data. Nat. Rev. Genet., 2, 418–427. organization. Genome Res., 14, 343–353.

Journal

BioinformaticsOxford University Press

Published: Sep 16, 2004

There are no references for this article.