Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations

Hybridization interactions between probesets in short oligo microarrays lead to spurious... Background: Microarrays measure the binding of nucleotide sequences to a set of sequence specific probes. This information is combined with annotation specifying the relationship between probes and targets and used to make inferences about transcript- and, ultimately, gene expression. In some situations, a probe is capable of hybridizing to more than one transcript, in others, multiple probes can target a single sequence. These 'multiply targeted' probes can result in non- independence between measured expression levels. Results: An analysis of these relationships for Affymetrix arrays considered both the extent and influence of exact matches between probe and transcript sequences. For the popular HGU133A array, approximately half of the probesets were found to interact in this way. Both real and simulated expression datasets were used to examine how these effects influenced the expression signal. It was found not only to lead to increased signal strength for the affected probesets, but the major effect is to significantly increase their correlation, even in situations when only a single probe from a probeset was involved. By building a network of probe-probeset-transcript relationships, it is possible to identify families of interacting probesets. More than 10% of the families contain members annotated to different genes or even different Unigene clusters. Within a family, a mixture of genuine biological and artefactual correlations can occur. Conclusion: Multiple targeting is not only prevalent, but also significant. The ability of probesets to hybridize to more than one gene product can lead to false positives when analysing gene expression. Comprehensive annotation describing multiple targeting is required when interpreting array data. Background arrays, which use a set of short (typically 25-mer) oligonu- Sources of noise in microarray experiments may be cleotide probes to target a transcript, hybridization condi- numerous [1,2], thus most researchers try to minimize its tions are carefully controlled with the aim of minimizing influence or estimate it through various quality control, the effect of CH due to non-specific binding [4]. In addi- normalization and outlier filtering procedures [3]. One tion, each Perfect Match (PM) probe is accompanied by a source of variation is cross-hybridization (CH), which Mismatch probe (MM), in which the middle residue has occurs when unintended sequences hybridize to a probe been changed. The intention is that this can be used to alongside the intended target. In the case of Affymetrix provide a measure of the level of CH associated with each Page 1 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 PM probe. A more detailed discussion of CH in short those two signals, not by their mean intensity (an example oligo arrays may be found in [5]. From October 2004, and further discussion of this can be found in the supple- Affymetrix also started to display brief summaries of cross- mental material). Many microarray data analysis tech- hybridization within their own NetAffx service [6]. niques rely on correlation analysis, with the majority of methodologies aiming to draw a distinction between In some circumstances, probes may match exactly to more genes that are, in some way, co-occurring, co-expressed or than one transcript. This is important because these correlated and those that do not follow a significant com- probes can no longer be identified with a unique tran- mon pattern. Methodologies such as hierarchical cluster- script, but are instead dependent on more than one gene ing [14,15] and relevance networks [16-18] make direct product. The situation is rendered somewhat more com- use of the Pearson correlation coefficient of expression plex by the fact that Affymetrix arrays use more than one values between probesets, whilst others (such as ANOVA probe (typically, 11 PM/MM pairs – together referred to as and more general linear models), are ultimately based on a "probeset") to target each transcript. Recently, several correlation-like principles. databases have been built to provide a mapping of Affymetrix probesets to known transcripts [7-10], to Results sequences from cDNA microarrays [11,12], or for apply- Affymetrix arrays use a series of probes to target a tran- ing algorithmic approaches to cross-platform or cross-spe- script. These probes are grouped together to form a cies comparisons [7]. A recent paper [13] presents a global probeset; expression processing algorithms such as MAS5 overview of the interpretation of GeneChip arrays, and [19], RMA [20] and GCRMA [21] combine the signals the need to update annotation to match the continued from each probe in a probeset to provide a single sum- evolution of genomic databases. The solution includes the mary value representing an estimate of the concentration redefinition of CDF files, similar to what was proposed of that transcript in solution. The issue of MT arises initially in [10], which may be sufficient in many cases. because certain probes are capable of hybridizing to more than one transcript, leading to non-independence, while The issue of 'multiply targeted' probes is important in other situations probes from more than one probeset because they have the potential to result in cross-talk are capable of hybridizing to a single transcript (Figure 1). between the probesets they are part of. If their effects are In general, these interactions combine to form a complex significant, and expression summarizing algorithms are lattice (Figure 2, see also Additional file 1). unable to control for them, then one outcome of this will be that otherwise unrelated probesets will appear corre- In this paper, we consider the extent and structure of these lated, since they are being driven by a shared signal. relationships, followed by an investigation of how much effect they have both on signal strength and on the corre- The ADAPT database [4] was used to investigate the extent lation between probesets. and significance of multiply-targeted probesets in Affyme- The prevalence of multiple targeting in oligo arrays trix expression data (see methods). Use is made of the fact that the platform's combination of short oligos and strict An analysis of the HG_U133A array reveals that many hybridization conditions, which are designed to maxi- transcripts (Ensembl: 7,257; RefSeq: 6,702) are matched mize binding to the PM probes whilst minimizing bind- with multiple probesets (i.e. case a in Fig. 1) while almost ing to the MM ones. This makes it viable to use in silico half (10,223) of the total 22,215 probesets (excluding methods to identify which probes are likely to bind with control probesets) show exact matches (with 1 or more 100% identity to which transcripts. We refer to cases of PM probes) to more than one Ensembl (9,460) or RefSeq exact matches between probe and transcript as Multiple (9,666) transcript (i.e. case b in Fig. 1). For comparison, Targeting (MT), to distinguish from the more general case 18,722 probesets were found to match to at least one well- of cross-hybridization, in which matches with less than known transcript. 100% identity may occur. The effect of MM probes is minimal: the number of MM Particular attention is directed at the influence MT can probes that can hybridize exactly to known transcripts is have on the apparent correlation between probesets' about 1,000 times smaller (Ensembl: 1,899 MM matches expression measurements. Since Pearson correlation is vs. over 1,956,000 PM matches, RefSeq: 1,962 MM vs. scale independent, it is not influenced by the overall mag- 1,922,000 PM) – most of them singleton matches to unre- nitude of either signal being compared, but rather on the lated sequences. Thus we exclude MM probes from subse- similarity in their shapes. Although it may seem counter- quent analyses. Since MM probes were not considered, intuitive, when two signals are superimposed, the amount and RMA makes no use of these probes in its computa- of correlation found between each of the original signals tions, RMA processed data is used for all calculations pre- and the combined one is driven by the relative variance of Page 2 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 a) b) c) probeset1 transcript1 transcript1 probeset A transcript probeset probeset B probeset C transcript2 transcript2 probeset2 MT moti Figure 1 fs MT motifs. The basic motifs of multiple targeting. a) PTP motif b) TPT motif c) a simple combination of both – PTPTP motif. The motifs form the basic building blocks of multiple targeting networks. The strength of relationship between a transcript and a probeset is dependent on the number of probes matching to the transcript. sented here, although similar effects were also observed non-specific. Similarly, "_s_at" probesets are identified as with MAS5 processing. potentially targeting different gene family members or splice variants. The analysis shows that many of the Affymetrix probeset names are supposed to identify probesets associated with MT are not identified in this way probesets that are associated with multiple targeting. In and are simply annotated " _at" (2,189 according to particular, those marked "_x_at" are identified as being Ensembl matches; 1,496 for RefSeq). These numbers are LGL graph o Figure 2 f MT LGL graph of MT. a) LGL graph of all probeset-transcript relationships in HG_U133A array b) and c) are close-up views of regions in a) Page 3 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 likely to be underestimates because ADAPT was built probesets, while edges represent matches between tran- using only well characterized sequences. Thus, a signifi- scripts and probesets, labelled with the number of match- cant number of the standard "_at" probesets are involved ing probes involved in the interaction. Such graphs are in MT. informative because so many probesets have the potential to be involved in MT (almost half for HGU133A arrays). Structures of multiple targeting in oligo arrays Since Affymetrix arrays measure the binding of cRNA Basic motifs sequences to sequence-specific probes, the searches used The two basic building blocks of MT interaction networks to define MT help catalogue which binding events are pos- are Probeset-Transcript-Probeset (PTP) motifs (Figure 1a), sible. Knowledge of MT interactions is important because and Transcript-Probeset-Transcript (TPT) motifs (Figure it begins to describe what is actually being measured in a 1b). Depending on the robustness of the analysis algo- microarray experiment. rithms used to process array data, the presence of either motif can be expected to lead to non-independence Figure 2 shows one such graph, laid out using LGL [22]. between the expression profiles of the participating Edges attached to RefSeq transcripts are painted red, probesets. Ensembl ones, green. Blue is used to mark the strength of MT, with intensity corresponding to the number of A search for both types of motifs confirms the prevalence matching probes. The LGL graph, when magnified, shows of MT in oligo arrays. Table 1 summarizes the rates of a set of disconnected families -detached sub-graphs of var- occurrence of both motifs for a variety of Affymetrix ious complexity. Thus, almost all of the MT relationships arrays. The PTP motif is especially common – it involves are local ones. almost half the probesets on the HG_U133A array and over a third of those on the HG_U133Plus2. Generally, To build families, the database was queried to identify all the more recent arrays have a larger proportion of PTP motifs. Then, a simple search algorithm used to iden- probesets involved in MT. tify the maximal graph that can be reached from a starting probeset using the identified motifs. Probesets that are Families of related probesets not involved in any PTP motifs result in trivial families Probesets may be involved in multiple PTP and TPT that consist of just a single probeset. An additional step is motifs, resulting in an MT-network. This can be expressed used to eliminate "hub probesets", as described below. as a graph in which nodes represent transcripts and Table 1: Summarization of PTP and TPT motifs for various Affymetrix arrays PTP motifs TPT motifs array p-set pairs probesets % probesets transcripts probesets % probesets hgu133Plus2 78504 18182 33.29% 11576 7285 13.34% hgu133a 54762 9666 43.51% 9288 5289 23.81% hgu133b 6218 3197 14.16% 3241 1526 6.76% hgu95c 70156 1620 12.88% 3201 928 7.38% hgu95b 2930 1729 13.77% 2256 1042 8.30% hgu95e 2984 1515 12.05% 2113 897 7.13% hgu95d 1444 612 4.87% 1765 502 3.99% hgu95a 15320 3880 31.32% 7309 3085 24.91% moe430_2 30552 14234 31.61% 4995 2439 5.42% moe430a 21686 9808 43.35% 4142 1929 8.53% moe430b 4224 2647 11.76% 1155 517 2.30% mgu74av2 17142 2843 22.89% 3221 1184 9.53% mgu74bv2 3606 2302 18.55% 1175 521 4.20% mgu74cv2 1158 636 5.36% 888 243 2.05% Rat230_2 7134 4818 15.52% 2821 1011 3.26% rae230a 2812 2056 12.96% 2404 792 4.99% rae230b 1792 1293 8.46% 581 226 1.48% rgu34b 908 680 7.79% 681 181 2.07% rgu34c 1382 769 8.81% 745 243 2.78% rgu34a 14818 3373 38.59% 2302 752 8.60% rnu34a 1050 600 47.51% 171 81 6.41% rtu34a 1130 442 45.47% 255 120 12.35% Page 4 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 For HG_U133A arrays, this process results in the identifi- Real data, same transcript cation of 3,859 families containing at least 2 probesets Figure 1 draws a distinction between transcripts that share (for examples – see Additional file 2). The mean number a probeset, and probesets that share a transcript. The first of probesets in the family is not high -about 2.56. Interest- case (PTP, la) is relatively trivial: we should expect to see ingly, 429 families (involving a total of 1,529 probesets), correlation between these probesets. The extent of the were found in which family members were annotated to excessive correlation is confirmed by Figure 3, which different genes. Importantly, these families were not sim- shows the distribution of the Pearson correlation coeffi- ply comprised of "_x_at" probesets: 456 were annotated cient calculated between every probeset pair on the array. "_at" and 497 – "_s_at". The resultant distribution is almost normal, with a slight A full list of MT families is included in the supplementary displacement ( r = 0.02, for Gene Atlas data processed by data (see Additional file 3), along with an applet that RMA, for other datasets the mean is comparably small). allows the exploration of these families, attached to exem- By contrast, when only multiply targeted probesets are plary expression data (see Additional file 4). considered (as in Figure 1a), the distribution is strongly distorted towards positive values ( r = 0.55). Thus, as Hub probesets expected, probesets targeting the same transcript show There is a group of probesets (not always annotated by Affymetrix as " _x_at") that match a large number of tran- much higher correlation than those that are not linked in scripts, usually with a small number of probes. They may be called "hub" probesets, because their expression com- bines signals from many available transcripts. In the net- work of probeset-transcript relationships, hub probesets ● often join together smaller families of probesets, often many at a time. A typical example of a hub probeset is ● "221992_at" which matches to 44 RefSeq or Ensembl ● ● transcripts, with an average 3.18 probes per match, or "210524_x_at" (127 matches, 1.5 probe on average). Hubs were selected for the family search algorithm described above if the average number of matching probes was less than 3 and the total number of transcripts greater than 30, or if the total number of transcript matches was greater than 70. This resulted in 277 hub ● ● probesets being selected, allowing the granularity of fam- ilies to be kept to a reasonable level (also see Table 2 for ● ●● ● ● ● ● hub selection criteria). ● ● ● ● ● ● ●● ● ● ● ● ● Quantitation of the effect of multiple targeting −1.0 −0.5 0.0 0.5 1.0 Probes found by the database searches to target multiple correlation value transcripts, generally have a higher measured signal than those that target unique transcripts. For example, the aver- In Figure 3 fluence of MT on correlation between probesets age measured expression level in the Gene Atlas data is Influence of MT on correlation between probesets. The distribution of Pearson correlation for all probeset pairs 16% higher for multiply targeted PM probes and over (black) vs. MT probeset pairs (red). Data from 50 arrays 80% higher when the PM – MM difference for individual from Gene Atlas processed with RMA. The global (black) PM:MM probe pairs is considered. curve represents correlation for 1 million random probeset pairs, while the MT curve (red) was drawn using all probeset These numbers refer to differences in the raw probe inten- pairs from over 110,000 PTP motifs that occur in the sities, which are subsequently grouped into probesets and HG_U133A array. The peak of the MT curve close to a cor- processed by an expression summary tool such as MAS5 relation of 1 may be explained by a group of probesets having or RMA. The following sections investigate whether these almost constantly high signal. Most of these are 'hub' changes at the probe level are carried through to the MAS5 probesets as defined in the text. The green distribution is a or RMA processed expression summaries, and the influ- normal distribution with the same mean ( = 0.018) and stand- ence they have on Pearson correlation. ard deviation as the black one. It can be seen that the global distribution is very close to normal. Page 5 of 14 (page number not for citation purposes) freq 0.00 0.05 0.10 0.15 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 Table 2: Number of hub probesets and hub probesets not annotated x_at depending on the condition of the number of matching transcripts Transcripts matched – more than: hubs non-x_at hubs 10 1548 630 15 1020 289 20 762 144 25 613 96 30 531 73 35 450 56 40 404 47 45 378 42 50 334 29 this way. Similar results were also seen with MAS5 and first set. In this way, the case shown in Figure 1b was sim- GCRMA processed data (not shown). Importantly, this ulated – i.e., a probeset hybridizing to two different tran- effect is not confined to probesets in which 11/11 probes scripts (one with all the probes matching, the other with a match. Figure 4 shows the distribution of Pearson correla- varying number of matches). The second group of probesets was produced by randomly selecting up to 500 tion for probesets in which only a subset of probes are probesets. Variance filtering was performed to ensure that involved in MT. It can be seen that even a single matching at least one of the transcripts had an expression profile probe can result in increased correlation. This is surprising that varied. Since Pearson correlation is not dependent on given that oligo array data processing methods such as the mean intensity of the signals, but rather the similarity MAS5 and RMA are designed to be robust against outliers – a single probe behaving differently from its peers may not be expected to have a large influence on the data. This is investigated in more detail below. Simulation data ● ● Intensity ● ● ● Figure 1b shows a situation where the expression level of ● ● a probeset might be expected to be driven by two different ● transcripts. Since there is no independent estimate availa- ble for the expression levels of the individual transcripts ●● ● ● ● involved in TPT motifs, simulation experiments were per- ● ● ● formed to mimic the effect by artificially spiking raw expression data. ●● Figure 5 shows the results of one such simulation, ● ● designed to consider the effects of the presence of an addi- ● ●● ● ● tional transcript in equal abundance to the intended tar- ●● ● ● ●● ●● ●●● ●●●● ●●●● ●● ●●● ●●● ●● ●● ●● get. It can be seen that as the number of spiked probes increases, the signal becomes more pronounced. As previ- −1.0 −0.5 0.0 0.5 1.0 ously observed with real data, a single matching probe can correlation value have a significant influence on the computed expression level. Even when the expression level is relatively high the Effect of MT in real data on Figure 4 correlation between probesets signal from only 2 probes can be sufficient to lead to Effect of MT in real data on correlation between apparent differential expression. Even so, the largest fold probesets. Distribution of Pearson correlation for MT-asso- changes are generally restricted to the lower intensity ciated probesets. Curves correspond to the number of inter- probesets, indicating that both MAS5 and RMA do a good acting probes in the PTP motif: orange – 1 probe, magenta – job at reducing the effects of outliers. up to 3 probes, blue – up to 7 probes, green – all MT probeset pairs. The peak at the correlation close to 1 is due Correlation to hub probesets that generally have high intensity and match In a second simulation experiment, spiking was achieved to many transcripts with single probes. by adding the signals from a second set of probes to the Page 6 of 14 (page number not for citation purposes) freq 0.00 0.05 0.10 0.15 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 ● ● ● ● 8 ● 10 5 ● 6 7 7 ● ● 4 10 ● 7 ● 10 10 9 6 8 9 8 ● 9 9 9 9 9 4 10 10 9 10 8 10 4 ● 9 88 68 6 ● 10 10 10 9 77 10 9 10 6710 79 7 7 7 6 ● 10 9 4 10 8 9 ● 9 9 910 10 9 10 ● 9 10 10 7 10 8 9 108 9 2 ●● ● ● 10 10 6 9 10 9 4 4 ● ● 7 7 10 6 5 10 6 10 6 10 6 7 9 97 7 77 6 9 8 710 7 4 1 ● 7 9 4 8 99 99 8 57 ● ● 7 9 8 7 5 10 6 6 8 10 10 7 ● 7 8 910 6 9 6 910 8 45 5 8 ● 710 98 664 47 4 5 10 9 6 8 75 9 10 4 2 ● 7 39 8 106 59 6 5 7 7 8910 6 6 4 7 9 8 5 6 5 9 8 7 1 9 10 10 9 7 1010 10 79 6 5 8 6 4 4 1 ● 8 97 9 9 8 9 3 3 2 3 2 7 678 3 2 9 8 8 5 4 66 7 5 4 3 5 797 57 10 2 33 6 4 5 4 6 10 35 1 6 5 5 ● ● 10 8 610 7 4 9 5 3 45 7 7 8 6 86 4 46 5 5 6 6 3 ● 6 6 6 45 8 3 1 ● 6 4 7 4 131 1 ● 6 34 26 21 5 4 1 ● 6 4 8 6 4 1 2 2 1 10 62 5 7 3 3 2 2 ● 10 5 5 6 1 ● 6 78 43 3 6 63 5 5 1 ● 6 4 5 2 1 4 4 3 4 1 1 2 1 5 1 54 4 6 44 3 3 2 4 5 32 ● 4 1 1 5 2 4 2 1 3 ● 3 1 2 3 41 1 1 1 1 3 2 ● 1 43 4 2 1 1 2 1 ● ● 4 2 22 11 3 3 4 1 1 1 2 3 3 54 2 2 3 1 1 2 321 ● 1 24 ● 2 1 1 ● 2 1 2 2 ● 3 2 1 ● 124 ● ● ● ● 32 3 1 3 1 222 4 4 ● 1 12 2 ● 3 2 1 32 3 ● 3 3 ● 1 22 1 3 2 ● ● ● 3 ● 21 ● ● ● ● ● ● ●● ● ● 3 ●● ●● ●● ●●● ●●●● ● ● ●●●● 4 6 8 10 12 −1.0 −0.5 0.0 0.5 1.0 before correlation value Simu Figure 5 lation experiment – fold change Variance fi Figure 6 ltering of spikes and target probesets Simulation experiment – fold change. Change in meas- Variance filtering of spikes and target probesets. The ured signal intensity following spiking to simulate the pres- distribution of correlation for data generated as in Figure 5, ence of an additional hybridizing transcript in equal but grouped according to variance. Green – high variance abundance to the intended target. Numbers denote the probeset plus high variance spiking, blue – high variance quantity of probes modified in a probeset. Axes are log . Even probeset plus low variance spiking, magenta – low variance a single spiked probe can result in significant change of the probeset plus low variance spiking, cyan – low variance intensity. Fold changes can be seen even for high intensity probeset plus high variance spiking. Red – correlation before target probesets. spiking. The effects of multiple targeting on correlation is most pronounced when the intended target is of low vari- ance, however even in the case of high variance targets, cor- relation is likely to be influenced. of their shapes, filtering was performed on variance, not intensity. Pearson correlation, r was calculated between sion level are expected to be generally small. This is con- each member of the first list and its corresponding partner firmed in both the real and simulation datasets. in the second. Before spiking, the two sets should be uncorrelated; spiking is expected to increase correlation. However, even when overall changes in intensity are min- As in the real data, the signal from the spiked probes con- imal, increase in Pearson correlation can still be high. This tributes significantly to the correlation, even when only a is because Pearson correlation is driven by similarity in small number of probes are involved. It can be seen from profile shape, not intensity; small amounts of stray signal Figure 6 that even when high variance probesets are the can lead to large increases in r, even if the overall mean recipients of additional spiked signals changes in r are between probesets are very different. Since Pearson corre- possible. Thus, the effects are not restricted to probesets lation centers each variable about its mean, and scales it with low signal (see also Additional files 5, 6, 7, 8). by its standard deviation, correlation is entirely depend- ent on the relative shape and variance of the two signals, Intensity vs. correlation not their overall intensity. When two signals, a and b are Both real and artificial datasets demonstrate that MT can compared to their sum, s, the signal that is most correlated have a significant effect on correlation, even when only a with s depends not on their relative sizes, but on their rel- small proportion of probesets are involved in the interac- ative variance. This is counter-intuitive but important to tion. Algorithms such as RMA and MAS5 successfully recognize when considering the effects of interacting sig- employ robust averaging techniques (such as median pol- nals on correlation (see Additional file 9). ishing or a Tukey's biweight) to reduce the effect of out- liers. Thus, when only a small number of probes in a This effect can be demonstrated by varying the amount of probeset are involved in MT, changes in measured expres- contribution made by the spiking probes (f f -see Meth- Page 7 of 14 (page number not for citation purposes) after 468 10 12 freq 0.00 0.05 0.10 0.15 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ●●●●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●●● ●●● ● ● ● ● ● ● ● ● ●●● ● ● ● ●●●●● ● ●●●●●● ● ● ●●●●●● ● ● ● ●● ●●● ●●●● ● ● ● ● ● ●● ●●● ●●● ● ● ● ● ●●● ●●● ● ● ●●●● ●●● ● ● ● ●● ● ●●● ●●●●●●●●● ● ● ● ●●● ●●●●●● ●●● ● ● ● ●●●●● ●● ●● ● ●●● ● ●● ● ● ●●●● ●●● ● ● ● ●●● ●●● ●●● ● ● ●● ● ● ●●● ●●●●●●●●●● ● ● ●● ●●●● ●●●●●●●● ● ● ●●● ● ● ● ● ●●● ●● ●●●●●●●●●●●●● ● ● ●● ●●● ●● ●● ●●● ●● ● ● ● ● ●●● ●●●●●● ●●●● ● ● ● ●●●●●●●●●● ●●●●●● ●● ●● ●●●●●● ●● ●●● ●● ● ● ●●●●●●●●● ● ● ●● ● ●●● ●●●● ● ●● ● ●●●● ●●●●● ●●●●●●● ● ●● ● ●●●●●●●●●●● ● ●● ● ● ●●●● ●●●●●● ● ● ●●● ●●●●●●●● ●●●●● ● ● ● ●●●●●●●●●●●● ●●●● ● ● ●●●●● ●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●● ● ●●●●●● ●●● ●● ●●● ●● ●●●●● ● ● ●●●●●● ●●● ● ●● ●●● ●●●● ● ● ●● ● ● ● ●●● ●● ● ●● 456789 −1.0 −0.5 0.0 0.5 1.0 before value ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ●●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ●●● ●●● ●● ● ●●● ● ●● ● ● ●● ● ● ●●● ● ●●●● ● ● ●● ● ● ● ●●● ● ●● ● ●●● ●● ●● ● ● ● ● ● ● ● ● ● ● ●●●● ● ●● ● ● ●● ●●●● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ●●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ●●● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●●●●● ●● ● ●●●●●● ● ● ●●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ●●●● ● ● ● ● ● ●● ●● ●● ●●● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●●● ●● ● ●●● ● ●● ● ● ●●● ● ● ● ●● ●●● ● ●●● ●● ●●● ● ● ●●● ●● ● ●● ●● ●●●● ●●● ● ● ●● ● ● ●● ● ● ●●●● ● ●● ●● ● ● ● ●●●●● ● ●● ● ●●● ●● ●●● ●● ● ● ● ● ●● ●●● ●●●● ● ● ●● ● ●●●● ●● ●● ● ● ● ● ● ● ● ●●● ● ●●● ● ●● ●●● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●● ●●● ●● 456789 −1.0 −0.5 0.0 0.5 1.0 before value ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ●●● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●●● ●● ● ● ● ●●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●●●●●●● ●● ● ● ●● ● ●●● ● ● ●●●●●●● ●●●● ● ●● ●● ● ● ●● ●● ●●●● ●●●● ●●● ● ●●●● ●●●●● ● ● ●● ● ●● ● ●● ● ● ● ● ●●●●● ● ●●●● ●● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ●● ● ● ●●●●● ● ●●● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ●● ●● ● ● ● ●● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ● ● ●● ● ●●● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ● ● ●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●●●●● ●● ● ● ●● ●●● ●●●●●● ● ● ● ● ●● ● ● ●● ● ● ● ● ●●●●● ● ● ●●● ●●● ●● 468 10 −1.0 −0.5 0.0 0.5 1.0 before value Infl Figure 7 uence of the level of spiking on RMA expression values and distributions of correlation Influence of the level of spiking on RMA expression values and distributions of correlation. 1st row: f f = 0.05, 2nd row: f f = 0.2, 3rd row: f f = 1. 1st column scatterplot of the signal after RMA versus the signal before spiking, 2nd column dis- tortion of the correlation distribution – changes after spiking. 500 randomly selected targets and spikes. Even small amount of stray signal may significantly influence correlation. For f f = 0.2, the fold change is not so much affected, but the effect on cor- relation distortion is almost as large as for f f = 1. ods) to the resulting value. Figure 7 shows that even when Together Figures 6 and 7 show that increases in correla- only 5% of the spike signal was present, the influence on tion are not simply confined to those cases in which a Pearson correlation can still be large, even though the large and varying signal is being added to a low-variance resultant fold change is generally small. probesets. Page 8 of 14 (page number not for citation purposes) after after after 468 10 468 468 freq freq freq 0.00 0.06 0.12 0.00 0.06 0.12 0.00 0.06 0.12 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 In situations where probesets are already strongly corre- make a distinction between probable real (i.e. biological) lated, the addition of extra signal due to cross hybridiza- and probably artefactual (i.e. MT driven) clusters. tion with another transcript might be expected to reduce correlation. Spiking experiments found this to be the case Discussion (data not shown). Interestingly, however, even though It is clear that multiple targeting is an important artefact there are occasions where r is reduced by multiple target- within microarray data: nearly half of all probesets on the ing, the general tendency is towards significantly HG_U133A array are associated with MT. When real increased correlation (as shown in Figure 3; similar figure expression data are considered, it can be seen that these for simulation experiments – see Additional file 10). probesets are significantly more correlated than would be expected by chance. These results are also supported by False positive rates will also be raised because otherwise simulation experiments, using datasets derived from real absent probesets with signals resulting only from back- experimental data, that allow MT to be considered in a ground levels of non-specific hybridization can experi- more controlled framework. MT can lead to increased cor- ence additional, structured, signal due to exact matches to relation between associated probesets, even when only a transcripts other than the intended target. small proportion of their probes are involved. Although expression summary algorithms are successful at reducing Functional homogeneity and spurious correlations in the effects of outlier probes, the do not remove them com- families of probesets pletely, and small amounts of stray signal can still have a Analysis of MT-families shows that out of the 3,859 significant influence on correlation. The reason for this shown in Figure 2, 395 contained probesets annotated apparent paradox is the scale-invariance of Pearson corre- (using the BioConductor annaffy package [3]) to 2 or lation; absolute signal is not important. What is impor- more UniGene clusters. When gene symbols are consid- tant are the variance and (effectively) the relative ered, annotation becomes even more ambiguous: 429 similarity in shape of the expression profiles. For this rea- families contained transcripts annotated to different son, particular care must be taken when analysing expres- genes. Thus, even though the majority of families are sion data using correlation-based approaches. The homogenous with respect to UniGene and gene symbols situation is also further complicated by the fact that MT – some 10–15% (depending on size of the family and occurs at a probe level – adding additional signal to indi- source of annotation) may be annotated to different vidual probes within a probeset – but correlation is calcu- genes. This translates to about 1000 probesets. lated after normalization and expression summarization using an algorithm such as RMA or MAS5. This additional As we have shown using both real and artificial data, MT complexity makes it difficult to reliably predict what will leads to increased correlation. A consequence of this is happen when signals are combined. However, empirical that probesets associated by MT should be drawn closer to data (Figure 6) show that influence on correlation is one another in dendrograms, such as those used to cluster dependent on the relative variance of the two probesets probesets for visualisation using heatmaps. For example, being combined. As expected, high variance spiked probes the heatmap in Figure 8 was created using three groups of generally have more of an effect than low variance spikes, probesets. The first (annotated to the genes RPS29,HFL- but interestingly, adding low variance spikes to low vari- B5,EIF4A1,RPL36A and RPL18), was identified in [23] as ance data (the magenta line in Figure 6) has more of an discriminating between standard-risk and high-risk TEL- effect than adding high variance spikes to low variance AML1 cytogenetic abnormalities. Non of these probesets data (the cyan line). This is likely to be a consequence of are associated by MT and can thus be considered to form the expression summarization and normalization that is a "well behaving" biological family. The second set (anno- imposed on the data. tated to TUBB6, TUBB2 and TUBB3) constitute another biological family, but they are also associated by MT to One consequence of MT is that because it serves to add each other, as well as to other tubulin genes. This family structure to otherwise random probesets with no genuine thus represents a mixed biological – MT family. The third signal, it can lead to the detection of false positives unless group of probesets are associated with RPS10, but also to the presence of cross-matching probesets is known. Anal- numerous pseudogene transcripts. This group represents ysis of the intensity distributions for MT and non-MT an "MT family" where the relationship is expected to be probesets shows a considerable degree of overlap (see artefactual. These three sets of probesets were added to a Additional file 11). This means that MT probesets cannot further set of randomly selected probesets, to act as "back- be removed simply by filtering on intensity. In fact, ground", and then clustered. The MT-family, the tubulins because MT generally increases signal strength, such filter- and the biological family are found as separate clusters ing might actually serve to enrich for MT probes. (the MT-family with even closer links than others), dem- onstrating that the hierarchical clustering is unable to Page 9 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 IBSP : 201411_s_at RALA : 214435_x_at C8orf70 : 214961_at MT>: RPS10 : 211542_x_at MT>: RPS10 : 200817_x_at MT>: FLJ20294 : 200095_x_at MT>: LOC388885 : 216505_x_at BIO>>> :RPL18 : 200022_at BIO>>> :RPL36A : 201406_at BIO>>> :RPS29 : 201094_at BIO>>> :hfl−B5 : 202231_at BCL9 : 212625_at TUB>>>>>: TUBB2 : 213726_x_a TUB>>>>>: TUBB2 : 208977_x_a TUB>>>>>: TUBB3 : 213476_x_a BIO>>> :EIF4A1 : 201530_x_at MT>: RPS10 : 214001_x_at RALA : 207370_at TUB>>>>>: TUBB2 : 209372_x_a QPCTL : 220657_at STX10 : 220438_at KLHL11 : 205308_at KIAA0774 : 217030_at TUB>>>>>: TUBB6 : 209191_at PLEKHB2 : 204129_at −1 fold change from mean 1 −0.5 0.5 H Figure 8 eatmap example Heatmap example. Heatmap and hierarchical clustering of 3 families of probesets (MT-driven, tubulin and a functional one), plus randomly chosen non-MT probesets. The clustering does not make any distinction between functional and MT families – is grouping them together in very similar way. MT is ultimately a sequence-based event; it occurs when probesets (and, via annotation, genes) with correlated two sequences show 100% identity across the 25bp tar- expression profiles, and to use these relationships to infer geted by a probe. At the level of a probeset, this is most functional similarities. Since sequence similarity is itself likely to occur when transcripts show a high degree of often the basis by which common function is inferred sequence similarity. The relationships is troublesome, [24], sequence similarity combined with MT has the because a major use of expression data is to identify potential to become a self-fulfilling prophecy. Page 10 of 14 (page number not for citation purposes) HR25.CEL IR16.CEL IR17.CEL SR06.CEL SR08.CEL IR21.CEL SR03.CEL SR07.CEL HR22.CEL HR26.CEL SR02.CEL SR04.CEL SR11.CEL HR28.CEL HR30.CEL HR31.CEL IR15.CEL IR19.CEL IR14.CEL IR20.CEL SR05.CEL IR18.CEL SR09.CEL HR29.CEL HR23.CEL HR24.CEL HR27.CEL SR12.CEL SR01.CEL SR10.CEL SR13.CEL BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 A search of the database found that about 5% of family Conclusion members contained probesets annotated to different Cross hybridization between probesets is a significant genes. Thus, the chances of finding a spurious functional effect that has real consequences for the interpretation of relationship due to MT between a pair of randomly microarray data. It may cause a variety of problems during selected genes is small. However, this is optimistic, analysis including false positives and negatives, and gen- because microarray analysis generally involves filtering to erally increased correlation between multiply-targeted produce a set of significant probesets (either by magni- probesets. Although the results presented here are for tude of change, or by statistical confidence). The result of Affymetrix arrays, it is reasonable to expect similar effects such filtering is to enrich the final 'hit list' not only for real to occur with other expression-based technologies. The biological effects, but also for anything else that is consist- use of short oligos and strict hybridization conditions ent, including biochemical or sequence based artefacts makes it possible to perform the in silico searches required such as MT. This is illustrated by the heatmap in the Figure to identify MT within Affymetrix data. However, CH is not 8; MT families fall into separate clusters against a back- exclusive to any one platform, and similar behaviour is ground of randomly selected probesets. likely to be seen elsewhere. Expression summary algo- rithms must correct not only for variation across arrays, One possible solution to MT is to redefine probesets so but also for variation between individual probes within a that probes targeting the same transcript are placed into probeset. This is generally performed using some kind of larger probesets representing the entire sequence, as pro- robust averaging procedure, but even small amounts of posed by [10]. This is the approach taken also by [13], but stray signal can lead to high correlations between authors conclude that "under many circumstances it is not probesets. Although algorithms such as RMA and MAS5 possible to generate transcript-specific probe sets for genes do a very good job of significantly reducing the influence with multiple transcripts based on probes available on the of outlier probes, they do not always remove it completely current generation of GeneChips". Thus they may be used – and this is manifested by significantly increased correla- to make distinction at the level of genes, but not at the tion between probesets, even when only a small subset of level of transcripts or splice variants – MT with all its con- probes are involved. sequences, still exists. There is a compromise to be made between generalisation and maintaining the ability to Many of the issues described above can be avoided with resolve subtle differences between transcripts and, for more detailed annotation. Often the terms 'gene', 'tran- example, splice variants. script' and 'probeset' are used interchangeably. This is dangerous, because the relationship is not one-one-one, The issue becomes more significant with the new genera- and the existence of MT networks can lead to apparent tion of microarrays such as the Affymetrix exon array [25] biological relationships that are, in fact, artefactual. that deliberately use multiple probesets to distinguish Expression data that is presented simply as a gene list is between individual transcripts from within a set of splice difficult to interpret properly, since the complexities of the variants expressed by a particular gene. The result is a interaction networks implicit within the data are lost. The many-many relationship between gene, transcript and community should ensure that the actual probeset IDs are probeset. always available alongside gene names or transcript acces- sions. This allows the graph structures associated with Annotation schemes that attempt to compress these gene-transcript-probeset mappings to be explored where many-many relationships into a one-to-one mapping lose necessary and used to fully interpret the complexities of the complexities inherent in the system. Grouping gene expression data. together a probeset that targets more than one transcript with probesets that target one or more individual tran- Methods scripts, results in MT occurring between the new probeset Graph rendering and all the other transcripts it shares probes with. From MT networks and interaction graphs were produced by one perspective, many these issues are simply down to extracting data from ADAPT and redirecting the output for annotation. The apparent aretefacts in the data only exist visualization to LGL [22], and our own visualization soft- because the probeset annotations do not accurately reflect ware. As global layouts of graphs such as LGL are static, the transcripts they bind to. thus not interactive, because the number of vertices is too big for efficient real-time rendering, an applet was devel- With all solutions, including those that attempt to solve oped for fast and flexible analyses of individual families. the problem by aggregating probes into larger probesets, These small, local graphs within the applet were realized annotation is crucial, since inaccuracies will arise unless with the JUNG API [26]. all the many-many relationships that occur within the data are represented explicitly. Page 11 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 Databases, experimental data sources and data processing lated data were batch normalized using RMA and com- ADAPT [4] is a database of mappings between Affymetrix pared to the original un-spiked data (again batch probesets, transcripts and genes. It is populated by search- normalized using RMA, separately from the first set). In all ing all probe sequences for exact matches to transcript simulation experiments spiking for the selected probesets data taken from RefSeq (Release 11 at the time of writing) was carried out across the entire set of arrays. [27] and Ensembl (V30 at the time of writing) [28]. For RefSeq, both "known" and "model" sequences are used; In the second experiment, a set of 500 probesets was for Ensembl, ADAPT uses those assigned "known", selected, as before. A second set with the same number of "novel" or "pseudo" status. Both databases are used probesets was then chosen at random. These probesets because they employ different methods to predict tran- were selected from a subset of the probesets available, script/gene sequences. generated by filtering the expression data on variance. In this way, both sets could be sampled from probesets with The ADAPT database was queried (using SQL and Rdbi- specifically high, average or low variance of expression. PqSQL database link to R) to extract a set of tables describ- High and low variance are defined as the top or bottom ing all possible MT links between probesets and 2000 probesets, sorted by of variance, excluding the 100 transcripts, excluding anti-sense strand matches. The most extreme ones. Each probeset in the second list was probesets may match transcripts with anything from 1 to used to supply data for the probeset in the first list; 16 matching probes. The tables implicitly define an between 1 and 11 probes were chosen and the probe unconnected graph (see Figure 2 and supplemental file 1), intensities from the second list added to the correspond- and form the basis for all subsequent explorations of MT. ing probes in the first list. Various levels of influence were In order to consider the strength of the MT effect and its applied, adding a specific proportion of one probeset sig- consequences on expression studies, data from ADAPT nal to another: PM1 = PM + f f * PM , where for f 1after 1before 2 were combined with expression data from experiments f ranging from 0.05 to 1. In this way, cross-hybridization generated using the HG-U133A array. Expression levels between probesets in the first list, and the transcripts rep- were produced using MAS5 and RMA, as implemented in resented by the probesets in the second list, was simu- Bioconductor (packages affy and simpleaffy [3,29,30]). lated. Results were analogous when experiments were repeated Abbreviations with MAS5. All plots presented were generated using the ADAPT – "A Database of Affymetrix Probesets and Tran- Novartis Gene Atlas [31] dataset. Similar results were seen scripts" with both leukaemia [23] and sarcoma [32] datasets – publicly available from ArrayExpress. CH – cross-hybridization Pearson correlation was calculated for all the pairs of LGL – Large Graph Layout probesets found to be targeting the same transcript. The distribution of correlation coefficients was calculated for MAS5 – MicroArray Suite – Affymetrix algorithm (MAS all probeset pairs and for all pairs where one of the 5.0) probesets matches to a transcript with less than a specified number of probes. MM – mismatch probe Simulation data MT – multiple targeting A subset of 50 HG_U133A arrays from Gene Atlas V2 was used as the basis for simulation experiments designed to PM – perfect match probe explore how the number of MT probes influences expres- sion measurements from RMA processed data. PTP – probeset-transcript-probeset network motif Spiking was conducted as follows: prior to expression RMA – Robust Multichip Average algorithm summary generation using RMA, 500 probesets were selected at random to be spiked and 500 (at random) to TPT – transcript-probeset-transcript network motif act as a source of spiking data. No filtering was applied to these probesets. Probesets were randomly paired, and Authors' contributions between 1 and 10 probe-pairs selected for each probeset MO developed the concept of interactions in families of (again at random). The signals from the spike-sources probesets, carried out database and statistical analyses and were added to the original signals for the spike-targets. In drafted the manuscript, CM conceived the study on this way TPT motifs were simulated. The resulting simu- probes alignments to transcript and its implications, Page 12 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 [http://www.biomedcentral.com/content/supplementary/1471- supervised and participated in its design and helped to 2105-7-276-S6.pdf] draft the manuscript. Additional File 7 Additional material Spiking experiment, signal filtering 3. As Additional file 5, but high inten- sity spikes added to low targets. Click here for file Additional File 1 [http://www.biomedcentral.com/content/supplementary/1471- Graph of MT families for HGU133A array. Animated GIF, 3D visualiza- 2105-7-276-S7.pdf] tion of MT-families in the array. Click here for file Additional File 8 [http://www.biomedcentral.com/content/supplementary/1471- Spiking experiment, signal filtering 4. As Additional file 5, but low inten- 2105-7-276-S1.gif] sity spikes added to low targets. Click here for file Additional File 2 [http://www.biomedcentral.com/content/supplementary/1471- Examples of 3 families of probesets and transcripts. Screenshots from the 2105-7-276-S8.pdf] applet. Big nodes signify probesets (green – positive detection call), small magenta ones – transcripts. The width of edges is proportional to the quan- Additional File 9 tity of MT probes. Probes are marked with a name, annotation in Affyme- R code for correlation experiment. A simple experiment to consider the trix or BioConductor and expression values. Presented are families relationship between correlation cooefficient and variance using 10,000 associated mainly with PAX8, RUNX1/RPL22 and tubulins. randomly generated cases. Click here for file Click here for file [http://www.biomedcentral.com/content/supplementary/1471- [http://www.biomedcentral.com/content/supplementary/1471- 2105-7-276-S2.tiff] 2105-7-276-S9.R] Additional File 3 Additional File 10 List of MT families for HGU133A array. The CSV file lists all the discov- Influence of specific number of spiked probes on correlation. Changes in ered MT probeset families along with their gene-level annotations accord- Pearson correlation following spiking to simulate MT between probesets. ing to Affymetrix and BioConductor. Red plot – correlation before spiking, orange – 1 spiked probe per probeset, Click here for file magenta – up to 3 probes, blue – up to 7 probes, green – all probes spiked. [http://www.biomedcentral.com/content/supplementary/1471- As in the case of real data – even a single probe may influence the distri- 2105-7-276-S3.csv] bution of correlation, however in that case there are no effects of biological similarity – that's why the effect exists, but is smallest for single probes. Additional File 4 Click here for file Applet for families exploration. http://bioinformatics.picr.man.ac.uk/ [http://www.biomedcentral.com/content/supplementary/1471- adaptnet. An applet for browsing graphs of MT-families in the HGU133A 2105-7-276-S10.pdf] array. Big nodes represent HGU133A probesets: green ones have "Present" detection call, pink ones "Absent". They are labelled with Additional File 11 Affymetfix and BioConductor annotations, detection call and expression Distribution of the expression signal for MT and non-MT probesets after value in the experiment. Small magenta nodes represent transcripts. There processing with RMA and MAS5. The plots (normalized distributions of is a possibility to add Exon 1.0ST probesets (blue) to the graph. The width summarized expression values) indicate a slight increase in the high signal of edges is proportional to the number of matching probes. The applet is values for MT probesets (blue) against non-MT probesets (green). intended for online use – it is connected to an application server and Click here for file ADAPT database. [http://www.biomedcentral.com/content/supplementary/1471- Click here for file 2105-7-276-S11.pdf] [http://www.biomedcentral.com/content/supplementary/1471- 2105-7-276-S4.txt] Additional File 5 Spiking experiment, signal filtering 1. Scatter plot and correlation distri- Acknowledgements bution, generated as in Figure 7, but filtered by average signal intensity. This work was funded by Cancer Research UK. Low intensity: the 10% probesets with lowest mean signal. High intensity: the 10% probesets with highest mean signal. Low intensity spikes added Zhi Cheng Wang and Tim Yates maintain and manage the ADAPT database. to high targets. The plots in Additional files 5, 6, 7, 8 prove that with any We thank Stuart Pepper, Francesca Buffa and Claire Wilson for useful dis- sort of signal intensity filtering, the shift in correlation coefficient occurs. cussions. Click here for file [http://www.biomedcentral.com/content/supplementary/1471- References 2105-7-276-S5.pdf] 1. Zakharkin S, Kim K, Mehta T, Chen L, Barnes S, Scheirer K, Parrish R, Allison D, Page G: Sources of variation in Affymetrix micro- Additional File 6 array experiments. BMC Bioinformatics 2005, 6:214. 2. Nimgaonkar A, Sanoudou D, Butte A, Haslett J, Kunkel L, Beggs A, Spiking experiment, signal filtering 2. As Additional file 5, but high inten- Kohane I: Reproducibility of gene expression across genera- sity spikes added to high targets. tions of Affymetrix microarrays. BMC Bioinformatics 2003, 4:27. Click here for file Page 13 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 3. Wilson CL, Miller CJ: Simpleaffy: a BioConductor package for genomes, transcripts and proteins. Nucleic Acids Res 2005, Affymetrix Quality Control and data analysis. Bioinformatics 33(Database issue):D501-504. 2005, 21(18):3683-3685. 28. Birney E, Andrews T, Bevan P, Caccamo M, Chen Y, Clarke L, Coates 4. Leong HS, Yates T, Wilson C, Miller CJ: ADAPT: a database of G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez X, affymetrix probesets and transcripts. Bioinformatics 2005, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz H, Iyer V, Jekosch K, 21(10):2552-2553. Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker 5. Wu C, Carta R, Zhang L: Sequence dependence of cross-hybrid- G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae ization on short oligo microarrays. Nucl Acids Res 2005, M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, 33(9):e84. Stalker J, Storey R, Ureta-Vidal A, Woodwark K, Cameron G, Durbin 6. Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, R, Cox A, Hubbard T, M C: An overview of Ensembl. Genome Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and Research 2004, 14:925-8. annotations. Nucleic Acids Research 2003, 31:82-86. 29. Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis 7. Mecham BH, Klus GT, Strovel J, Augustus M, Byrne D, Bozso P, Wet- B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus more DZ, Mariani TJ, Kohane IS, Szallasi Z: Sequence-matched S, Irizarry R, Leisch F, Li C, Maechler M, Rossini A, Sawitzki G, Smith probes produce increased cross-platform consistency and C, Smyth G, Tierney L, Yang J, Zhang J: Bioconductor: open soft- more reproducible biological results in microarray-based ware development for computational biology and bioinfor- gene expression measurements. Nucl Acids Res 2004, 32(9):e74. matics. Genome Biology 2004, 5(10R80 [http://genomebiology.com/ 8. Mecham BH, Wetmore DZ, Szallasi Z, Sadovsky Y, Kohane I, Mariani 2004/5/10/R80]. TJ: Increased measurement accuracy for sequence-verified 30. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy – analysis of microarray probes. Physiol Genomics 2004, 18(3):308-315. Affymetrix Gene Chip data at the probe level. Bioinformatics 9. Harbig J, Sprinkle R, Enkemann SA: A sequence-based identifica- 2004, 20(3):307-315. tion of the genes detected by probesets on the Affymetrix 31. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, U133 plus 2.0 array. Nucl Acids Res 2005, 33(3):e31. Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch 10. Gautier L, Moller M, Friis-Hansen L, Knudsen S: Alternative map- JB: A gene atlas of the mouse and human protein-encoding ping of probes to genes for Affymetrix chips. BMC Bioinformat- transcriptomes. PNAS 2004, 101(166062-6067 [http:// ics 2004, 5:111. www.pnas.org/cgi/content/abstract/101/16/6062]. 11. Carter S, Eklund A, Mecham B, Kohane I, Szallasi Z: Redefinition of 32. Wang H, Trotter M, Lagos D, Bourboulia D, Henderson S, Makinen Affymetrix probe sets by sequence overlap with cDNA T, Elliman S, Flanagan A, Alitalo K, C B: Kaposi sarcoma herpesvi- microarray probes reduces cross-platform inconsistencies in rus-induced cellular reprogramming contributes to the lym- cancer-associated gene expression measurements. BMC Bio- phatic endothelial gene expression in Kaposi sarcoma. Nature informatics 2005, 6:107. Genetics 2004, 36(7):687-93. 12. Consortium GO: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 2004:D258-D261. 13. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/ transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33(20):el75. 14. Shannon W, Culverhouse R, Duncan J: Analyzing microarray data using cluster analysis. Pharmacogenomics 2003, 4:41-52. 15. Sherlock G: Analysis of large-scale gene expression data. Brief- ings in Bioinformatics 2001, 2(4):350-362. 16. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chem- otherapeutic susceptibility using relevance networks. PNAS 2000, 97(22):12182-12186. 17. Butte AJ, Kohane I: Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements. Pac Symp Biocomput 2000:418-429. 18. Stuart J, Segal E, Koller D, Kim S: A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Science 2003, 302:249-255. 19. Affymetrix: Statistical Algorithms Description Document. 20. Irizarry R, Bolstad B, Collin F, Cope L, Hobbs B, Speed T: Summa- ries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):el5. 21. Wu Z, Irizarry R, Gentleman R, Murillo F, Spencer F: A Model Based Background Adjustment for Oligonucleotide Expres- sion Arrays. Technical Report. John Hopkins University. Depart- ment of Biostatistics Working Papers 2003. 22. Adai AT, Date SV, Wieland S, Marcotte EM: LGL: Creating a Map Publish with Bio Med Central and every of Protein Function with an Algorithm for Visualizing Very scientist can read your work free of charge Large Biological Networks. Journal of Molecular Biology 2004, 340:179-190. "BioMed Central will be the most significant development for 23. Teuffel O, Dettling M, Cario G, Stanulla M, Schrappe M, Buehlmann P, disseminating the results of biomedical researc h in our lifetime." Niggli F, Schaefer B: Gene Expression Profiles and Risk Stratifi- Sir Paul Nurse, Cancer Research UK cation in Childhood Acute Leukemia. Haematologica 2004, 89:801-808. Your research papers will be: 24. Attwood T, Miller C: Progress in bioinformatics and the impor- available free of charge to the entire biomedical community tance of being earnest. Biotechnol Annu Rev 2002, 8:1-54. 25. Affymetrix: Exon Probeset Annotations and Transcript Clus- peer reviewed and published immediately upon acceptance ter Groupings. 2005. cited in PubMed and archived on PubMed Central 26. O'Madadhain J, Fisher D, Smyth P: Analysis and Visualization of Network Data using JUNG. Journal of Statistical Software in press. yours — you keep the copyright 27. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence BioMedcentral (RefSeq): a curated non-redundant sequence database of Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 14 of 14 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations

BMC Bioinformatics , Volume 7 (1) – Jun 2, 2006

Loading next page...
 
/lp/springer-journals/hybridization-interactions-between-probesets-in-short-oligo-Peil2Q5x29

References (42)

Publisher
Springer Journals
Copyright
Copyright © 2006 by Okoniewski and Miller; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-7-276
pmid
16749918
Publisher site
See Article on Publisher Site

Abstract

Background: Microarrays measure the binding of nucleotide sequences to a set of sequence specific probes. This information is combined with annotation specifying the relationship between probes and targets and used to make inferences about transcript- and, ultimately, gene expression. In some situations, a probe is capable of hybridizing to more than one transcript, in others, multiple probes can target a single sequence. These 'multiply targeted' probes can result in non- independence between measured expression levels. Results: An analysis of these relationships for Affymetrix arrays considered both the extent and influence of exact matches between probe and transcript sequences. For the popular HGU133A array, approximately half of the probesets were found to interact in this way. Both real and simulated expression datasets were used to examine how these effects influenced the expression signal. It was found not only to lead to increased signal strength for the affected probesets, but the major effect is to significantly increase their correlation, even in situations when only a single probe from a probeset was involved. By building a network of probe-probeset-transcript relationships, it is possible to identify families of interacting probesets. More than 10% of the families contain members annotated to different genes or even different Unigene clusters. Within a family, a mixture of genuine biological and artefactual correlations can occur. Conclusion: Multiple targeting is not only prevalent, but also significant. The ability of probesets to hybridize to more than one gene product can lead to false positives when analysing gene expression. Comprehensive annotation describing multiple targeting is required when interpreting array data. Background arrays, which use a set of short (typically 25-mer) oligonu- Sources of noise in microarray experiments may be cleotide probes to target a transcript, hybridization condi- numerous [1,2], thus most researchers try to minimize its tions are carefully controlled with the aim of minimizing influence or estimate it through various quality control, the effect of CH due to non-specific binding [4]. In addi- normalization and outlier filtering procedures [3]. One tion, each Perfect Match (PM) probe is accompanied by a source of variation is cross-hybridization (CH), which Mismatch probe (MM), in which the middle residue has occurs when unintended sequences hybridize to a probe been changed. The intention is that this can be used to alongside the intended target. In the case of Affymetrix provide a measure of the level of CH associated with each Page 1 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 PM probe. A more detailed discussion of CH in short those two signals, not by their mean intensity (an example oligo arrays may be found in [5]. From October 2004, and further discussion of this can be found in the supple- Affymetrix also started to display brief summaries of cross- mental material). Many microarray data analysis tech- hybridization within their own NetAffx service [6]. niques rely on correlation analysis, with the majority of methodologies aiming to draw a distinction between In some circumstances, probes may match exactly to more genes that are, in some way, co-occurring, co-expressed or than one transcript. This is important because these correlated and those that do not follow a significant com- probes can no longer be identified with a unique tran- mon pattern. Methodologies such as hierarchical cluster- script, but are instead dependent on more than one gene ing [14,15] and relevance networks [16-18] make direct product. The situation is rendered somewhat more com- use of the Pearson correlation coefficient of expression plex by the fact that Affymetrix arrays use more than one values between probesets, whilst others (such as ANOVA probe (typically, 11 PM/MM pairs – together referred to as and more general linear models), are ultimately based on a "probeset") to target each transcript. Recently, several correlation-like principles. databases have been built to provide a mapping of Affymetrix probesets to known transcripts [7-10], to Results sequences from cDNA microarrays [11,12], or for apply- Affymetrix arrays use a series of probes to target a tran- ing algorithmic approaches to cross-platform or cross-spe- script. These probes are grouped together to form a cies comparisons [7]. A recent paper [13] presents a global probeset; expression processing algorithms such as MAS5 overview of the interpretation of GeneChip arrays, and [19], RMA [20] and GCRMA [21] combine the signals the need to update annotation to match the continued from each probe in a probeset to provide a single sum- evolution of genomic databases. The solution includes the mary value representing an estimate of the concentration redefinition of CDF files, similar to what was proposed of that transcript in solution. The issue of MT arises initially in [10], which may be sufficient in many cases. because certain probes are capable of hybridizing to more than one transcript, leading to non-independence, while The issue of 'multiply targeted' probes is important in other situations probes from more than one probeset because they have the potential to result in cross-talk are capable of hybridizing to a single transcript (Figure 1). between the probesets they are part of. If their effects are In general, these interactions combine to form a complex significant, and expression summarizing algorithms are lattice (Figure 2, see also Additional file 1). unable to control for them, then one outcome of this will be that otherwise unrelated probesets will appear corre- In this paper, we consider the extent and structure of these lated, since they are being driven by a shared signal. relationships, followed by an investigation of how much effect they have both on signal strength and on the corre- The ADAPT database [4] was used to investigate the extent lation between probesets. and significance of multiply-targeted probesets in Affyme- The prevalence of multiple targeting in oligo arrays trix expression data (see methods). Use is made of the fact that the platform's combination of short oligos and strict An analysis of the HG_U133A array reveals that many hybridization conditions, which are designed to maxi- transcripts (Ensembl: 7,257; RefSeq: 6,702) are matched mize binding to the PM probes whilst minimizing bind- with multiple probesets (i.e. case a in Fig. 1) while almost ing to the MM ones. This makes it viable to use in silico half (10,223) of the total 22,215 probesets (excluding methods to identify which probes are likely to bind with control probesets) show exact matches (with 1 or more 100% identity to which transcripts. We refer to cases of PM probes) to more than one Ensembl (9,460) or RefSeq exact matches between probe and transcript as Multiple (9,666) transcript (i.e. case b in Fig. 1). For comparison, Targeting (MT), to distinguish from the more general case 18,722 probesets were found to match to at least one well- of cross-hybridization, in which matches with less than known transcript. 100% identity may occur. The effect of MM probes is minimal: the number of MM Particular attention is directed at the influence MT can probes that can hybridize exactly to known transcripts is have on the apparent correlation between probesets' about 1,000 times smaller (Ensembl: 1,899 MM matches expression measurements. Since Pearson correlation is vs. over 1,956,000 PM matches, RefSeq: 1,962 MM vs. scale independent, it is not influenced by the overall mag- 1,922,000 PM) – most of them singleton matches to unre- nitude of either signal being compared, but rather on the lated sequences. Thus we exclude MM probes from subse- similarity in their shapes. Although it may seem counter- quent analyses. Since MM probes were not considered, intuitive, when two signals are superimposed, the amount and RMA makes no use of these probes in its computa- of correlation found between each of the original signals tions, RMA processed data is used for all calculations pre- and the combined one is driven by the relative variance of Page 2 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 a) b) c) probeset1 transcript1 transcript1 probeset A transcript probeset probeset B probeset C transcript2 transcript2 probeset2 MT moti Figure 1 fs MT motifs. The basic motifs of multiple targeting. a) PTP motif b) TPT motif c) a simple combination of both – PTPTP motif. The motifs form the basic building blocks of multiple targeting networks. The strength of relationship between a transcript and a probeset is dependent on the number of probes matching to the transcript. sented here, although similar effects were also observed non-specific. Similarly, "_s_at" probesets are identified as with MAS5 processing. potentially targeting different gene family members or splice variants. The analysis shows that many of the Affymetrix probeset names are supposed to identify probesets associated with MT are not identified in this way probesets that are associated with multiple targeting. In and are simply annotated " _at" (2,189 according to particular, those marked "_x_at" are identified as being Ensembl matches; 1,496 for RefSeq). These numbers are LGL graph o Figure 2 f MT LGL graph of MT. a) LGL graph of all probeset-transcript relationships in HG_U133A array b) and c) are close-up views of regions in a) Page 3 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 likely to be underestimates because ADAPT was built probesets, while edges represent matches between tran- using only well characterized sequences. Thus, a signifi- scripts and probesets, labelled with the number of match- cant number of the standard "_at" probesets are involved ing probes involved in the interaction. Such graphs are in MT. informative because so many probesets have the potential to be involved in MT (almost half for HGU133A arrays). Structures of multiple targeting in oligo arrays Since Affymetrix arrays measure the binding of cRNA Basic motifs sequences to sequence-specific probes, the searches used The two basic building blocks of MT interaction networks to define MT help catalogue which binding events are pos- are Probeset-Transcript-Probeset (PTP) motifs (Figure 1a), sible. Knowledge of MT interactions is important because and Transcript-Probeset-Transcript (TPT) motifs (Figure it begins to describe what is actually being measured in a 1b). Depending on the robustness of the analysis algo- microarray experiment. rithms used to process array data, the presence of either motif can be expected to lead to non-independence Figure 2 shows one such graph, laid out using LGL [22]. between the expression profiles of the participating Edges attached to RefSeq transcripts are painted red, probesets. Ensembl ones, green. Blue is used to mark the strength of MT, with intensity corresponding to the number of A search for both types of motifs confirms the prevalence matching probes. The LGL graph, when magnified, shows of MT in oligo arrays. Table 1 summarizes the rates of a set of disconnected families -detached sub-graphs of var- occurrence of both motifs for a variety of Affymetrix ious complexity. Thus, almost all of the MT relationships arrays. The PTP motif is especially common – it involves are local ones. almost half the probesets on the HG_U133A array and over a third of those on the HG_U133Plus2. Generally, To build families, the database was queried to identify all the more recent arrays have a larger proportion of PTP motifs. Then, a simple search algorithm used to iden- probesets involved in MT. tify the maximal graph that can be reached from a starting probeset using the identified motifs. Probesets that are Families of related probesets not involved in any PTP motifs result in trivial families Probesets may be involved in multiple PTP and TPT that consist of just a single probeset. An additional step is motifs, resulting in an MT-network. This can be expressed used to eliminate "hub probesets", as described below. as a graph in which nodes represent transcripts and Table 1: Summarization of PTP and TPT motifs for various Affymetrix arrays PTP motifs TPT motifs array p-set pairs probesets % probesets transcripts probesets % probesets hgu133Plus2 78504 18182 33.29% 11576 7285 13.34% hgu133a 54762 9666 43.51% 9288 5289 23.81% hgu133b 6218 3197 14.16% 3241 1526 6.76% hgu95c 70156 1620 12.88% 3201 928 7.38% hgu95b 2930 1729 13.77% 2256 1042 8.30% hgu95e 2984 1515 12.05% 2113 897 7.13% hgu95d 1444 612 4.87% 1765 502 3.99% hgu95a 15320 3880 31.32% 7309 3085 24.91% moe430_2 30552 14234 31.61% 4995 2439 5.42% moe430a 21686 9808 43.35% 4142 1929 8.53% moe430b 4224 2647 11.76% 1155 517 2.30% mgu74av2 17142 2843 22.89% 3221 1184 9.53% mgu74bv2 3606 2302 18.55% 1175 521 4.20% mgu74cv2 1158 636 5.36% 888 243 2.05% Rat230_2 7134 4818 15.52% 2821 1011 3.26% rae230a 2812 2056 12.96% 2404 792 4.99% rae230b 1792 1293 8.46% 581 226 1.48% rgu34b 908 680 7.79% 681 181 2.07% rgu34c 1382 769 8.81% 745 243 2.78% rgu34a 14818 3373 38.59% 2302 752 8.60% rnu34a 1050 600 47.51% 171 81 6.41% rtu34a 1130 442 45.47% 255 120 12.35% Page 4 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 For HG_U133A arrays, this process results in the identifi- Real data, same transcript cation of 3,859 families containing at least 2 probesets Figure 1 draws a distinction between transcripts that share (for examples – see Additional file 2). The mean number a probeset, and probesets that share a transcript. The first of probesets in the family is not high -about 2.56. Interest- case (PTP, la) is relatively trivial: we should expect to see ingly, 429 families (involving a total of 1,529 probesets), correlation between these probesets. The extent of the were found in which family members were annotated to excessive correlation is confirmed by Figure 3, which different genes. Importantly, these families were not sim- shows the distribution of the Pearson correlation coeffi- ply comprised of "_x_at" probesets: 456 were annotated cient calculated between every probeset pair on the array. "_at" and 497 – "_s_at". The resultant distribution is almost normal, with a slight A full list of MT families is included in the supplementary displacement ( r = 0.02, for Gene Atlas data processed by data (see Additional file 3), along with an applet that RMA, for other datasets the mean is comparably small). allows the exploration of these families, attached to exem- By contrast, when only multiply targeted probesets are plary expression data (see Additional file 4). considered (as in Figure 1a), the distribution is strongly distorted towards positive values ( r = 0.55). Thus, as Hub probesets expected, probesets targeting the same transcript show There is a group of probesets (not always annotated by Affymetrix as " _x_at") that match a large number of tran- much higher correlation than those that are not linked in scripts, usually with a small number of probes. They may be called "hub" probesets, because their expression com- bines signals from many available transcripts. In the net- work of probeset-transcript relationships, hub probesets ● often join together smaller families of probesets, often many at a time. A typical example of a hub probeset is ● "221992_at" which matches to 44 RefSeq or Ensembl ● ● transcripts, with an average 3.18 probes per match, or "210524_x_at" (127 matches, 1.5 probe on average). Hubs were selected for the family search algorithm described above if the average number of matching probes was less than 3 and the total number of transcripts greater than 30, or if the total number of transcript matches was greater than 70. This resulted in 277 hub ● ● probesets being selected, allowing the granularity of fam- ilies to be kept to a reasonable level (also see Table 2 for ● ●● ● ● ● ● hub selection criteria). ● ● ● ● ● ● ●● ● ● ● ● ● Quantitation of the effect of multiple targeting −1.0 −0.5 0.0 0.5 1.0 Probes found by the database searches to target multiple correlation value transcripts, generally have a higher measured signal than those that target unique transcripts. For example, the aver- In Figure 3 fluence of MT on correlation between probesets age measured expression level in the Gene Atlas data is Influence of MT on correlation between probesets. The distribution of Pearson correlation for all probeset pairs 16% higher for multiply targeted PM probes and over (black) vs. MT probeset pairs (red). Data from 50 arrays 80% higher when the PM – MM difference for individual from Gene Atlas processed with RMA. The global (black) PM:MM probe pairs is considered. curve represents correlation for 1 million random probeset pairs, while the MT curve (red) was drawn using all probeset These numbers refer to differences in the raw probe inten- pairs from over 110,000 PTP motifs that occur in the sities, which are subsequently grouped into probesets and HG_U133A array. The peak of the MT curve close to a cor- processed by an expression summary tool such as MAS5 relation of 1 may be explained by a group of probesets having or RMA. The following sections investigate whether these almost constantly high signal. Most of these are 'hub' changes at the probe level are carried through to the MAS5 probesets as defined in the text. The green distribution is a or RMA processed expression summaries, and the influ- normal distribution with the same mean ( = 0.018) and stand- ence they have on Pearson correlation. ard deviation as the black one. It can be seen that the global distribution is very close to normal. Page 5 of 14 (page number not for citation purposes) freq 0.00 0.05 0.10 0.15 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 Table 2: Number of hub probesets and hub probesets not annotated x_at depending on the condition of the number of matching transcripts Transcripts matched – more than: hubs non-x_at hubs 10 1548 630 15 1020 289 20 762 144 25 613 96 30 531 73 35 450 56 40 404 47 45 378 42 50 334 29 this way. Similar results were also seen with MAS5 and first set. In this way, the case shown in Figure 1b was sim- GCRMA processed data (not shown). Importantly, this ulated – i.e., a probeset hybridizing to two different tran- effect is not confined to probesets in which 11/11 probes scripts (one with all the probes matching, the other with a match. Figure 4 shows the distribution of Pearson correla- varying number of matches). The second group of probesets was produced by randomly selecting up to 500 tion for probesets in which only a subset of probes are probesets. Variance filtering was performed to ensure that involved in MT. It can be seen that even a single matching at least one of the transcripts had an expression profile probe can result in increased correlation. This is surprising that varied. Since Pearson correlation is not dependent on given that oligo array data processing methods such as the mean intensity of the signals, but rather the similarity MAS5 and RMA are designed to be robust against outliers – a single probe behaving differently from its peers may not be expected to have a large influence on the data. This is investigated in more detail below. Simulation data ● ● Intensity ● ● ● Figure 1b shows a situation where the expression level of ● ● a probeset might be expected to be driven by two different ● transcripts. Since there is no independent estimate availa- ble for the expression levels of the individual transcripts ●● ● ● ● involved in TPT motifs, simulation experiments were per- ● ● ● formed to mimic the effect by artificially spiking raw expression data. ●● Figure 5 shows the results of one such simulation, ● ● designed to consider the effects of the presence of an addi- ● ●● ● ● tional transcript in equal abundance to the intended tar- ●● ● ● ●● ●● ●●● ●●●● ●●●● ●● ●●● ●●● ●● ●● ●● get. It can be seen that as the number of spiked probes increases, the signal becomes more pronounced. As previ- −1.0 −0.5 0.0 0.5 1.0 ously observed with real data, a single matching probe can correlation value have a significant influence on the computed expression level. Even when the expression level is relatively high the Effect of MT in real data on Figure 4 correlation between probesets signal from only 2 probes can be sufficient to lead to Effect of MT in real data on correlation between apparent differential expression. Even so, the largest fold probesets. Distribution of Pearson correlation for MT-asso- changes are generally restricted to the lower intensity ciated probesets. Curves correspond to the number of inter- probesets, indicating that both MAS5 and RMA do a good acting probes in the PTP motif: orange – 1 probe, magenta – job at reducing the effects of outliers. up to 3 probes, blue – up to 7 probes, green – all MT probeset pairs. The peak at the correlation close to 1 is due Correlation to hub probesets that generally have high intensity and match In a second simulation experiment, spiking was achieved to many transcripts with single probes. by adding the signals from a second set of probes to the Page 6 of 14 (page number not for citation purposes) freq 0.00 0.05 0.10 0.15 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 ● ● ● ● 8 ● 10 5 ● 6 7 7 ● ● 4 10 ● 7 ● 10 10 9 6 8 9 8 ● 9 9 9 9 9 4 10 10 9 10 8 10 4 ● 9 88 68 6 ● 10 10 10 9 77 10 9 10 6710 79 7 7 7 6 ● 10 9 4 10 8 9 ● 9 9 910 10 9 10 ● 9 10 10 7 10 8 9 108 9 2 ●● ● ● 10 10 6 9 10 9 4 4 ● ● 7 7 10 6 5 10 6 10 6 10 6 7 9 97 7 77 6 9 8 710 7 4 1 ● 7 9 4 8 99 99 8 57 ● ● 7 9 8 7 5 10 6 6 8 10 10 7 ● 7 8 910 6 9 6 910 8 45 5 8 ● 710 98 664 47 4 5 10 9 6 8 75 9 10 4 2 ● 7 39 8 106 59 6 5 7 7 8910 6 6 4 7 9 8 5 6 5 9 8 7 1 9 10 10 9 7 1010 10 79 6 5 8 6 4 4 1 ● 8 97 9 9 8 9 3 3 2 3 2 7 678 3 2 9 8 8 5 4 66 7 5 4 3 5 797 57 10 2 33 6 4 5 4 6 10 35 1 6 5 5 ● ● 10 8 610 7 4 9 5 3 45 7 7 8 6 86 4 46 5 5 6 6 3 ● 6 6 6 45 8 3 1 ● 6 4 7 4 131 1 ● 6 34 26 21 5 4 1 ● 6 4 8 6 4 1 2 2 1 10 62 5 7 3 3 2 2 ● 10 5 5 6 1 ● 6 78 43 3 6 63 5 5 1 ● 6 4 5 2 1 4 4 3 4 1 1 2 1 5 1 54 4 6 44 3 3 2 4 5 32 ● 4 1 1 5 2 4 2 1 3 ● 3 1 2 3 41 1 1 1 1 3 2 ● 1 43 4 2 1 1 2 1 ● ● 4 2 22 11 3 3 4 1 1 1 2 3 3 54 2 2 3 1 1 2 321 ● 1 24 ● 2 1 1 ● 2 1 2 2 ● 3 2 1 ● 124 ● ● ● ● 32 3 1 3 1 222 4 4 ● 1 12 2 ● 3 2 1 32 3 ● 3 3 ● 1 22 1 3 2 ● ● ● 3 ● 21 ● ● ● ● ● ● ●● ● ● 3 ●● ●● ●● ●●● ●●●● ● ● ●●●● 4 6 8 10 12 −1.0 −0.5 0.0 0.5 1.0 before correlation value Simu Figure 5 lation experiment – fold change Variance fi Figure 6 ltering of spikes and target probesets Simulation experiment – fold change. Change in meas- Variance filtering of spikes and target probesets. The ured signal intensity following spiking to simulate the pres- distribution of correlation for data generated as in Figure 5, ence of an additional hybridizing transcript in equal but grouped according to variance. Green – high variance abundance to the intended target. Numbers denote the probeset plus high variance spiking, blue – high variance quantity of probes modified in a probeset. Axes are log . Even probeset plus low variance spiking, magenta – low variance a single spiked probe can result in significant change of the probeset plus low variance spiking, cyan – low variance intensity. Fold changes can be seen even for high intensity probeset plus high variance spiking. Red – correlation before target probesets. spiking. The effects of multiple targeting on correlation is most pronounced when the intended target is of low vari- ance, however even in the case of high variance targets, cor- relation is likely to be influenced. of their shapes, filtering was performed on variance, not intensity. Pearson correlation, r was calculated between sion level are expected to be generally small. This is con- each member of the first list and its corresponding partner firmed in both the real and simulation datasets. in the second. Before spiking, the two sets should be uncorrelated; spiking is expected to increase correlation. However, even when overall changes in intensity are min- As in the real data, the signal from the spiked probes con- imal, increase in Pearson correlation can still be high. This tributes significantly to the correlation, even when only a is because Pearson correlation is driven by similarity in small number of probes are involved. It can be seen from profile shape, not intensity; small amounts of stray signal Figure 6 that even when high variance probesets are the can lead to large increases in r, even if the overall mean recipients of additional spiked signals changes in r are between probesets are very different. Since Pearson corre- possible. Thus, the effects are not restricted to probesets lation centers each variable about its mean, and scales it with low signal (see also Additional files 5, 6, 7, 8). by its standard deviation, correlation is entirely depend- ent on the relative shape and variance of the two signals, Intensity vs. correlation not their overall intensity. When two signals, a and b are Both real and artificial datasets demonstrate that MT can compared to their sum, s, the signal that is most correlated have a significant effect on correlation, even when only a with s depends not on their relative sizes, but on their rel- small proportion of probesets are involved in the interac- ative variance. This is counter-intuitive but important to tion. Algorithms such as RMA and MAS5 successfully recognize when considering the effects of interacting sig- employ robust averaging techniques (such as median pol- nals on correlation (see Additional file 9). ishing or a Tukey's biweight) to reduce the effect of out- liers. Thus, when only a small number of probes in a This effect can be demonstrated by varying the amount of probeset are involved in MT, changes in measured expres- contribution made by the spiking probes (f f -see Meth- Page 7 of 14 (page number not for citation purposes) after 468 10 12 freq 0.00 0.05 0.10 0.15 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ●●●●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ●●● ●●● ● ● ● ● ● ● ● ● ●●● ● ● ● ●●●●● ● ●●●●●● ● ● ●●●●●● ● ● ● ●● ●●● ●●●● ● ● ● ● ● ●● ●●● ●●● ● ● ● ● ●●● ●●● ● ● ●●●● ●●● ● ● ● ●● ● ●●● ●●●●●●●●● ● ● ● ●●● ●●●●●● ●●● ● ● ● ●●●●● ●● ●● ● ●●● ● ●● ● ● ●●●● ●●● ● ● ● ●●● ●●● ●●● ● ● ●● ● ● ●●● ●●●●●●●●●● ● ● ●● ●●●● ●●●●●●●● ● ● ●●● ● ● ● ● ●●● ●● ●●●●●●●●●●●●● ● ● ●● ●●● ●● ●● ●●● ●● ● ● ● ● ●●● ●●●●●● ●●●● ● ● ● ●●●●●●●●●● ●●●●●● ●● ●● ●●●●●● ●● ●●● ●● ● ● ●●●●●●●●● ● ● ●● ● ●●● ●●●● ● ●● ● ●●●● ●●●●● ●●●●●●● ● ●● ● ●●●●●●●●●●● ● ●● ● ● ●●●● ●●●●●● ● ● ●●● ●●●●●●●● ●●●●● ● ● ● ●●●●●●●●●●●● ●●●● ● ● ●●●●● ●●●●●●●●● ● ●●●●●●●●●●●●●●●●●●●● ● ●●●●●● ●●● ●● ●●● ●● ●●●●● ● ● ●●●●●● ●●● ● ●● ●●● ●●●● ● ● ●● ● ● ● ●●● ●● ● ●● 456789 −1.0 −0.5 0.0 0.5 1.0 before value ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ● ●●● ●●●● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ●●● ●●● ●● ● ●●● ● ●● ● ● ●● ● ● ●●● ● ●●●● ● ● ●● ● ● ● ●●● ● ●● ● ●●● ●● ●● ● ● ● ● ● ● ● ● ● ● ●●●● ● ●● ● ● ●● ●●●● ● ●● ● ● ● ●● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ● ●●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ●●● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ●●●●● ●● ● ●●●●●● ● ● ●●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ●●●● ● ● ● ● ● ●● ●● ●● ●●● ● ●● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ●●●● ●● ● ●●● ● ●● ● ● ●●● ● ● ● ●● ●●● ● ●●● ●● ●●● ● ● ●●● ●● ● ●● ●● ●●●● ●●● ● ● ●● ● ● ●● ● ● ●●●● ● ●● ●● ● ● ● ●●●●● ● ●● ● ●●● ●● ●●● ●● ● ● ● ● ●● ●●● ●●●● ● ● ●● ● ●●●● ●● ●● ● ● ● ● ● ● ● ●●● ● ●●● ● ●● ●●● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●● ●●● ●● 456789 −1.0 −0.5 0.0 0.5 1.0 before value ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ●●● ● ●● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ●●● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●●● ●● ● ● ● ●●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●●●●●●● ●● ● ● ●● ● ●●● ● ● ●●●●●●● ●●●● ● ●● ●● ● ● ●● ●● ●●●● ●●●● ●●● ● ●●●● ●●●●● ● ● ●● ● ●● ● ●● ● ● ● ● ●●●●● ● ●●●● ●● ● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ●● ●●● ● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ●● ● ● ●●●●● ● ●●● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ●● ●● ● ● ● ●● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●● ● ● ●● ● ●●● ●● ●● ● ●● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ●● ● ● ● ● ●● ● ● ● ●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ●●●●● ●● ● ● ●● ●●● ●●●●●● ● ● ● ● ●● ● ● ●● ● ● ● ● ●●●●● ● ● ●●● ●●● ●● 468 10 −1.0 −0.5 0.0 0.5 1.0 before value Infl Figure 7 uence of the level of spiking on RMA expression values and distributions of correlation Influence of the level of spiking on RMA expression values and distributions of correlation. 1st row: f f = 0.05, 2nd row: f f = 0.2, 3rd row: f f = 1. 1st column scatterplot of the signal after RMA versus the signal before spiking, 2nd column dis- tortion of the correlation distribution – changes after spiking. 500 randomly selected targets and spikes. Even small amount of stray signal may significantly influence correlation. For f f = 0.2, the fold change is not so much affected, but the effect on cor- relation distortion is almost as large as for f f = 1. ods) to the resulting value. Figure 7 shows that even when Together Figures 6 and 7 show that increases in correla- only 5% of the spike signal was present, the influence on tion are not simply confined to those cases in which a Pearson correlation can still be large, even though the large and varying signal is being added to a low-variance resultant fold change is generally small. probesets. Page 8 of 14 (page number not for citation purposes) after after after 468 10 468 468 freq freq freq 0.00 0.06 0.12 0.00 0.06 0.12 0.00 0.06 0.12 BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 In situations where probesets are already strongly corre- make a distinction between probable real (i.e. biological) lated, the addition of extra signal due to cross hybridiza- and probably artefactual (i.e. MT driven) clusters. tion with another transcript might be expected to reduce correlation. Spiking experiments found this to be the case Discussion (data not shown). Interestingly, however, even though It is clear that multiple targeting is an important artefact there are occasions where r is reduced by multiple target- within microarray data: nearly half of all probesets on the ing, the general tendency is towards significantly HG_U133A array are associated with MT. When real increased correlation (as shown in Figure 3; similar figure expression data are considered, it can be seen that these for simulation experiments – see Additional file 10). probesets are significantly more correlated than would be expected by chance. These results are also supported by False positive rates will also be raised because otherwise simulation experiments, using datasets derived from real absent probesets with signals resulting only from back- experimental data, that allow MT to be considered in a ground levels of non-specific hybridization can experi- more controlled framework. MT can lead to increased cor- ence additional, structured, signal due to exact matches to relation between associated probesets, even when only a transcripts other than the intended target. small proportion of their probes are involved. Although expression summary algorithms are successful at reducing Functional homogeneity and spurious correlations in the effects of outlier probes, the do not remove them com- families of probesets pletely, and small amounts of stray signal can still have a Analysis of MT-families shows that out of the 3,859 significant influence on correlation. The reason for this shown in Figure 2, 395 contained probesets annotated apparent paradox is the scale-invariance of Pearson corre- (using the BioConductor annaffy package [3]) to 2 or lation; absolute signal is not important. What is impor- more UniGene clusters. When gene symbols are consid- tant are the variance and (effectively) the relative ered, annotation becomes even more ambiguous: 429 similarity in shape of the expression profiles. For this rea- families contained transcripts annotated to different son, particular care must be taken when analysing expres- genes. Thus, even though the majority of families are sion data using correlation-based approaches. The homogenous with respect to UniGene and gene symbols situation is also further complicated by the fact that MT – some 10–15% (depending on size of the family and occurs at a probe level – adding additional signal to indi- source of annotation) may be annotated to different vidual probes within a probeset – but correlation is calcu- genes. This translates to about 1000 probesets. lated after normalization and expression summarization using an algorithm such as RMA or MAS5. This additional As we have shown using both real and artificial data, MT complexity makes it difficult to reliably predict what will leads to increased correlation. A consequence of this is happen when signals are combined. However, empirical that probesets associated by MT should be drawn closer to data (Figure 6) show that influence on correlation is one another in dendrograms, such as those used to cluster dependent on the relative variance of the two probesets probesets for visualisation using heatmaps. For example, being combined. As expected, high variance spiked probes the heatmap in Figure 8 was created using three groups of generally have more of an effect than low variance spikes, probesets. The first (annotated to the genes RPS29,HFL- but interestingly, adding low variance spikes to low vari- B5,EIF4A1,RPL36A and RPL18), was identified in [23] as ance data (the magenta line in Figure 6) has more of an discriminating between standard-risk and high-risk TEL- effect than adding high variance spikes to low variance AML1 cytogenetic abnormalities. Non of these probesets data (the cyan line). This is likely to be a consequence of are associated by MT and can thus be considered to form the expression summarization and normalization that is a "well behaving" biological family. The second set (anno- imposed on the data. tated to TUBB6, TUBB2 and TUBB3) constitute another biological family, but they are also associated by MT to One consequence of MT is that because it serves to add each other, as well as to other tubulin genes. This family structure to otherwise random probesets with no genuine thus represents a mixed biological – MT family. The third signal, it can lead to the detection of false positives unless group of probesets are associated with RPS10, but also to the presence of cross-matching probesets is known. Anal- numerous pseudogene transcripts. This group represents ysis of the intensity distributions for MT and non-MT an "MT family" where the relationship is expected to be probesets shows a considerable degree of overlap (see artefactual. These three sets of probesets were added to a Additional file 11). This means that MT probesets cannot further set of randomly selected probesets, to act as "back- be removed simply by filtering on intensity. In fact, ground", and then clustered. The MT-family, the tubulins because MT generally increases signal strength, such filter- and the biological family are found as separate clusters ing might actually serve to enrich for MT probes. (the MT-family with even closer links than others), dem- onstrating that the hierarchical clustering is unable to Page 9 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 IBSP : 201411_s_at RALA : 214435_x_at C8orf70 : 214961_at MT>: RPS10 : 211542_x_at MT>: RPS10 : 200817_x_at MT>: FLJ20294 : 200095_x_at MT>: LOC388885 : 216505_x_at BIO>>> :RPL18 : 200022_at BIO>>> :RPL36A : 201406_at BIO>>> :RPS29 : 201094_at BIO>>> :hfl−B5 : 202231_at BCL9 : 212625_at TUB>>>>>: TUBB2 : 213726_x_a TUB>>>>>: TUBB2 : 208977_x_a TUB>>>>>: TUBB3 : 213476_x_a BIO>>> :EIF4A1 : 201530_x_at MT>: RPS10 : 214001_x_at RALA : 207370_at TUB>>>>>: TUBB2 : 209372_x_a QPCTL : 220657_at STX10 : 220438_at KLHL11 : 205308_at KIAA0774 : 217030_at TUB>>>>>: TUBB6 : 209191_at PLEKHB2 : 204129_at −1 fold change from mean 1 −0.5 0.5 H Figure 8 eatmap example Heatmap example. Heatmap and hierarchical clustering of 3 families of probesets (MT-driven, tubulin and a functional one), plus randomly chosen non-MT probesets. The clustering does not make any distinction between functional and MT families – is grouping them together in very similar way. MT is ultimately a sequence-based event; it occurs when probesets (and, via annotation, genes) with correlated two sequences show 100% identity across the 25bp tar- expression profiles, and to use these relationships to infer geted by a probe. At the level of a probeset, this is most functional similarities. Since sequence similarity is itself likely to occur when transcripts show a high degree of often the basis by which common function is inferred sequence similarity. The relationships is troublesome, [24], sequence similarity combined with MT has the because a major use of expression data is to identify potential to become a self-fulfilling prophecy. Page 10 of 14 (page number not for citation purposes) HR25.CEL IR16.CEL IR17.CEL SR06.CEL SR08.CEL IR21.CEL SR03.CEL SR07.CEL HR22.CEL HR26.CEL SR02.CEL SR04.CEL SR11.CEL HR28.CEL HR30.CEL HR31.CEL IR15.CEL IR19.CEL IR14.CEL IR20.CEL SR05.CEL IR18.CEL SR09.CEL HR29.CEL HR23.CEL HR24.CEL HR27.CEL SR12.CEL SR01.CEL SR10.CEL SR13.CEL BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 A search of the database found that about 5% of family Conclusion members contained probesets annotated to different Cross hybridization between probesets is a significant genes. Thus, the chances of finding a spurious functional effect that has real consequences for the interpretation of relationship due to MT between a pair of randomly microarray data. It may cause a variety of problems during selected genes is small. However, this is optimistic, analysis including false positives and negatives, and gen- because microarray analysis generally involves filtering to erally increased correlation between multiply-targeted produce a set of significant probesets (either by magni- probesets. Although the results presented here are for tude of change, or by statistical confidence). The result of Affymetrix arrays, it is reasonable to expect similar effects such filtering is to enrich the final 'hit list' not only for real to occur with other expression-based technologies. The biological effects, but also for anything else that is consist- use of short oligos and strict hybridization conditions ent, including biochemical or sequence based artefacts makes it possible to perform the in silico searches required such as MT. This is illustrated by the heatmap in the Figure to identify MT within Affymetrix data. However, CH is not 8; MT families fall into separate clusters against a back- exclusive to any one platform, and similar behaviour is ground of randomly selected probesets. likely to be seen elsewhere. Expression summary algo- rithms must correct not only for variation across arrays, One possible solution to MT is to redefine probesets so but also for variation between individual probes within a that probes targeting the same transcript are placed into probeset. This is generally performed using some kind of larger probesets representing the entire sequence, as pro- robust averaging procedure, but even small amounts of posed by [10]. This is the approach taken also by [13], but stray signal can lead to high correlations between authors conclude that "under many circumstances it is not probesets. Although algorithms such as RMA and MAS5 possible to generate transcript-specific probe sets for genes do a very good job of significantly reducing the influence with multiple transcripts based on probes available on the of outlier probes, they do not always remove it completely current generation of GeneChips". Thus they may be used – and this is manifested by significantly increased correla- to make distinction at the level of genes, but not at the tion between probesets, even when only a small subset of level of transcripts or splice variants – MT with all its con- probes are involved. sequences, still exists. There is a compromise to be made between generalisation and maintaining the ability to Many of the issues described above can be avoided with resolve subtle differences between transcripts and, for more detailed annotation. Often the terms 'gene', 'tran- example, splice variants. script' and 'probeset' are used interchangeably. This is dangerous, because the relationship is not one-one-one, The issue becomes more significant with the new genera- and the existence of MT networks can lead to apparent tion of microarrays such as the Affymetrix exon array [25] biological relationships that are, in fact, artefactual. that deliberately use multiple probesets to distinguish Expression data that is presented simply as a gene list is between individual transcripts from within a set of splice difficult to interpret properly, since the complexities of the variants expressed by a particular gene. The result is a interaction networks implicit within the data are lost. The many-many relationship between gene, transcript and community should ensure that the actual probeset IDs are probeset. always available alongside gene names or transcript acces- sions. This allows the graph structures associated with Annotation schemes that attempt to compress these gene-transcript-probeset mappings to be explored where many-many relationships into a one-to-one mapping lose necessary and used to fully interpret the complexities of the complexities inherent in the system. Grouping gene expression data. together a probeset that targets more than one transcript with probesets that target one or more individual tran- Methods scripts, results in MT occurring between the new probeset Graph rendering and all the other transcripts it shares probes with. From MT networks and interaction graphs were produced by one perspective, many these issues are simply down to extracting data from ADAPT and redirecting the output for annotation. The apparent aretefacts in the data only exist visualization to LGL [22], and our own visualization soft- because the probeset annotations do not accurately reflect ware. As global layouts of graphs such as LGL are static, the transcripts they bind to. thus not interactive, because the number of vertices is too big for efficient real-time rendering, an applet was devel- With all solutions, including those that attempt to solve oped for fast and flexible analyses of individual families. the problem by aggregating probes into larger probesets, These small, local graphs within the applet were realized annotation is crucial, since inaccuracies will arise unless with the JUNG API [26]. all the many-many relationships that occur within the data are represented explicitly. Page 11 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 Databases, experimental data sources and data processing lated data were batch normalized using RMA and com- ADAPT [4] is a database of mappings between Affymetrix pared to the original un-spiked data (again batch probesets, transcripts and genes. It is populated by search- normalized using RMA, separately from the first set). In all ing all probe sequences for exact matches to transcript simulation experiments spiking for the selected probesets data taken from RefSeq (Release 11 at the time of writing) was carried out across the entire set of arrays. [27] and Ensembl (V30 at the time of writing) [28]. For RefSeq, both "known" and "model" sequences are used; In the second experiment, a set of 500 probesets was for Ensembl, ADAPT uses those assigned "known", selected, as before. A second set with the same number of "novel" or "pseudo" status. Both databases are used probesets was then chosen at random. These probesets because they employ different methods to predict tran- were selected from a subset of the probesets available, script/gene sequences. generated by filtering the expression data on variance. In this way, both sets could be sampled from probesets with The ADAPT database was queried (using SQL and Rdbi- specifically high, average or low variance of expression. PqSQL database link to R) to extract a set of tables describ- High and low variance are defined as the top or bottom ing all possible MT links between probesets and 2000 probesets, sorted by of variance, excluding the 100 transcripts, excluding anti-sense strand matches. The most extreme ones. Each probeset in the second list was probesets may match transcripts with anything from 1 to used to supply data for the probeset in the first list; 16 matching probes. The tables implicitly define an between 1 and 11 probes were chosen and the probe unconnected graph (see Figure 2 and supplemental file 1), intensities from the second list added to the correspond- and form the basis for all subsequent explorations of MT. ing probes in the first list. Various levels of influence were In order to consider the strength of the MT effect and its applied, adding a specific proportion of one probeset sig- consequences on expression studies, data from ADAPT nal to another: PM1 = PM + f f * PM , where for f 1after 1before 2 were combined with expression data from experiments f ranging from 0.05 to 1. In this way, cross-hybridization generated using the HG-U133A array. Expression levels between probesets in the first list, and the transcripts rep- were produced using MAS5 and RMA, as implemented in resented by the probesets in the second list, was simu- Bioconductor (packages affy and simpleaffy [3,29,30]). lated. Results were analogous when experiments were repeated Abbreviations with MAS5. All plots presented were generated using the ADAPT – "A Database of Affymetrix Probesets and Tran- Novartis Gene Atlas [31] dataset. Similar results were seen scripts" with both leukaemia [23] and sarcoma [32] datasets – publicly available from ArrayExpress. CH – cross-hybridization Pearson correlation was calculated for all the pairs of LGL – Large Graph Layout probesets found to be targeting the same transcript. The distribution of correlation coefficients was calculated for MAS5 – MicroArray Suite – Affymetrix algorithm (MAS all probeset pairs and for all pairs where one of the 5.0) probesets matches to a transcript with less than a specified number of probes. MM – mismatch probe Simulation data MT – multiple targeting A subset of 50 HG_U133A arrays from Gene Atlas V2 was used as the basis for simulation experiments designed to PM – perfect match probe explore how the number of MT probes influences expres- sion measurements from RMA processed data. PTP – probeset-transcript-probeset network motif Spiking was conducted as follows: prior to expression RMA – Robust Multichip Average algorithm summary generation using RMA, 500 probesets were selected at random to be spiked and 500 (at random) to TPT – transcript-probeset-transcript network motif act as a source of spiking data. No filtering was applied to these probesets. Probesets were randomly paired, and Authors' contributions between 1 and 10 probe-pairs selected for each probeset MO developed the concept of interactions in families of (again at random). The signals from the spike-sources probesets, carried out database and statistical analyses and were added to the original signals for the spike-targets. In drafted the manuscript, CM conceived the study on this way TPT motifs were simulated. The resulting simu- probes alignments to transcript and its implications, Page 12 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 [http://www.biomedcentral.com/content/supplementary/1471- supervised and participated in its design and helped to 2105-7-276-S6.pdf] draft the manuscript. Additional File 7 Additional material Spiking experiment, signal filtering 3. As Additional file 5, but high inten- sity spikes added to low targets. Click here for file Additional File 1 [http://www.biomedcentral.com/content/supplementary/1471- Graph of MT families for HGU133A array. Animated GIF, 3D visualiza- 2105-7-276-S7.pdf] tion of MT-families in the array. Click here for file Additional File 8 [http://www.biomedcentral.com/content/supplementary/1471- Spiking experiment, signal filtering 4. As Additional file 5, but low inten- 2105-7-276-S1.gif] sity spikes added to low targets. Click here for file Additional File 2 [http://www.biomedcentral.com/content/supplementary/1471- Examples of 3 families of probesets and transcripts. Screenshots from the 2105-7-276-S8.pdf] applet. Big nodes signify probesets (green – positive detection call), small magenta ones – transcripts. The width of edges is proportional to the quan- Additional File 9 tity of MT probes. Probes are marked with a name, annotation in Affyme- R code for correlation experiment. A simple experiment to consider the trix or BioConductor and expression values. Presented are families relationship between correlation cooefficient and variance using 10,000 associated mainly with PAX8, RUNX1/RPL22 and tubulins. randomly generated cases. Click here for file Click here for file [http://www.biomedcentral.com/content/supplementary/1471- [http://www.biomedcentral.com/content/supplementary/1471- 2105-7-276-S2.tiff] 2105-7-276-S9.R] Additional File 3 Additional File 10 List of MT families for HGU133A array. The CSV file lists all the discov- Influence of specific number of spiked probes on correlation. Changes in ered MT probeset families along with their gene-level annotations accord- Pearson correlation following spiking to simulate MT between probesets. ing to Affymetrix and BioConductor. Red plot – correlation before spiking, orange – 1 spiked probe per probeset, Click here for file magenta – up to 3 probes, blue – up to 7 probes, green – all probes spiked. [http://www.biomedcentral.com/content/supplementary/1471- As in the case of real data – even a single probe may influence the distri- 2105-7-276-S3.csv] bution of correlation, however in that case there are no effects of biological similarity – that's why the effect exists, but is smallest for single probes. Additional File 4 Click here for file Applet for families exploration. http://bioinformatics.picr.man.ac.uk/ [http://www.biomedcentral.com/content/supplementary/1471- adaptnet. An applet for browsing graphs of MT-families in the HGU133A 2105-7-276-S10.pdf] array. Big nodes represent HGU133A probesets: green ones have "Present" detection call, pink ones "Absent". They are labelled with Additional File 11 Affymetfix and BioConductor annotations, detection call and expression Distribution of the expression signal for MT and non-MT probesets after value in the experiment. Small magenta nodes represent transcripts. There processing with RMA and MAS5. The plots (normalized distributions of is a possibility to add Exon 1.0ST probesets (blue) to the graph. The width summarized expression values) indicate a slight increase in the high signal of edges is proportional to the number of matching probes. The applet is values for MT probesets (blue) against non-MT probesets (green). intended for online use – it is connected to an application server and Click here for file ADAPT database. [http://www.biomedcentral.com/content/supplementary/1471- Click here for file 2105-7-276-S11.pdf] [http://www.biomedcentral.com/content/supplementary/1471- 2105-7-276-S4.txt] Additional File 5 Spiking experiment, signal filtering 1. Scatter plot and correlation distri- Acknowledgements bution, generated as in Figure 7, but filtered by average signal intensity. This work was funded by Cancer Research UK. Low intensity: the 10% probesets with lowest mean signal. High intensity: the 10% probesets with highest mean signal. Low intensity spikes added Zhi Cheng Wang and Tim Yates maintain and manage the ADAPT database. to high targets. The plots in Additional files 5, 6, 7, 8 prove that with any We thank Stuart Pepper, Francesca Buffa and Claire Wilson for useful dis- sort of signal intensity filtering, the shift in correlation coefficient occurs. cussions. Click here for file [http://www.biomedcentral.com/content/supplementary/1471- References 2105-7-276-S5.pdf] 1. Zakharkin S, Kim K, Mehta T, Chen L, Barnes S, Scheirer K, Parrish R, Allison D, Page G: Sources of variation in Affymetrix micro- Additional File 6 array experiments. BMC Bioinformatics 2005, 6:214. 2. Nimgaonkar A, Sanoudou D, Butte A, Haslett J, Kunkel L, Beggs A, Spiking experiment, signal filtering 2. As Additional file 5, but high inten- Kohane I: Reproducibility of gene expression across genera- sity spikes added to high targets. tions of Affymetrix microarrays. BMC Bioinformatics 2003, 4:27. Click here for file Page 13 of 14 (page number not for citation purposes) BMC Bioinformatics 2006, 7:276 http://www.biomedcentral.com/1471-2105/7/276 3. Wilson CL, Miller CJ: Simpleaffy: a BioConductor package for genomes, transcripts and proteins. Nucleic Acids Res 2005, Affymetrix Quality Control and data analysis. Bioinformatics 33(Database issue):D501-504. 2005, 21(18):3683-3685. 28. Birney E, Andrews T, Bevan P, Caccamo M, Chen Y, Clarke L, Coates 4. Leong HS, Yates T, Wilson C, Miller CJ: ADAPT: a database of G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez X, affymetrix probesets and transcripts. Bioinformatics 2005, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz H, Iyer V, Jekosch K, 21(10):2552-2553. Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker 5. Wu C, Carta R, Zhang L: Sequence dependence of cross-hybrid- G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae ization on short oligo microarrays. Nucl Acids Res 2005, M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, 33(9):e84. Stalker J, Storey R, Ureta-Vidal A, Woodwark K, Cameron G, Durbin 6. Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, R, Cox A, Hubbard T, M C: An overview of Ensembl. Genome Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and Research 2004, 14:925-8. annotations. Nucleic Acids Research 2003, 31:82-86. 29. Gentleman R, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis 7. Mecham BH, Klus GT, Strovel J, Augustus M, Byrne D, Bozso P, Wet- B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus more DZ, Mariani TJ, Kohane IS, Szallasi Z: Sequence-matched S, Irizarry R, Leisch F, Li C, Maechler M, Rossini A, Sawitzki G, Smith probes produce increased cross-platform consistency and C, Smyth G, Tierney L, Yang J, Zhang J: Bioconductor: open soft- more reproducible biological results in microarray-based ware development for computational biology and bioinfor- gene expression measurements. Nucl Acids Res 2004, 32(9):e74. matics. Genome Biology 2004, 5(10R80 [http://genomebiology.com/ 8. Mecham BH, Wetmore DZ, Szallasi Z, Sadovsky Y, Kohane I, Mariani 2004/5/10/R80]. TJ: Increased measurement accuracy for sequence-verified 30. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy – analysis of microarray probes. Physiol Genomics 2004, 18(3):308-315. Affymetrix Gene Chip data at the probe level. Bioinformatics 9. Harbig J, Sprinkle R, Enkemann SA: A sequence-based identifica- 2004, 20(3):307-315. tion of the genes detected by probesets on the Affymetrix 31. Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, U133 plus 2.0 array. Nucl Acids Res 2005, 33(3):e31. Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch 10. Gautier L, Moller M, Friis-Hansen L, Knudsen S: Alternative map- JB: A gene atlas of the mouse and human protein-encoding ping of probes to genes for Affymetrix chips. BMC Bioinformat- transcriptomes. PNAS 2004, 101(166062-6067 [http:// ics 2004, 5:111. www.pnas.org/cgi/content/abstract/101/16/6062]. 11. Carter S, Eklund A, Mecham B, Kohane I, Szallasi Z: Redefinition of 32. Wang H, Trotter M, Lagos D, Bourboulia D, Henderson S, Makinen Affymetrix probe sets by sequence overlap with cDNA T, Elliman S, Flanagan A, Alitalo K, C B: Kaposi sarcoma herpesvi- microarray probes reduces cross-platform inconsistencies in rus-induced cellular reprogramming contributes to the lym- cancer-associated gene expression measurements. BMC Bio- phatic endothelial gene expression in Kaposi sarcoma. Nature informatics 2005, 6:107. Genetics 2004, 36(7):687-93. 12. Consortium GO: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 2004:D258-D261. 13. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/ transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33(20):el75. 14. Shannon W, Culverhouse R, Duncan J: Analyzing microarray data using cluster analysis. Pharmacogenomics 2003, 4:41-52. 15. Sherlock G: Analysis of large-scale gene expression data. Brief- ings in Bioinformatics 2001, 2(4):350-362. 16. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS: Discovering functional relationships between RNA expression and chem- otherapeutic susceptibility using relevance networks. PNAS 2000, 97(22):12182-12186. 17. Butte AJ, Kohane I: Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements. Pac Symp Biocomput 2000:418-429. 18. Stuart J, Segal E, Koller D, Kim S: A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Science 2003, 302:249-255. 19. Affymetrix: Statistical Algorithms Description Document. 20. Irizarry R, Bolstad B, Collin F, Cope L, Hobbs B, Speed T: Summa- ries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):el5. 21. Wu Z, Irizarry R, Gentleman R, Murillo F, Spencer F: A Model Based Background Adjustment for Oligonucleotide Expres- sion Arrays. Technical Report. John Hopkins University. Depart- ment of Biostatistics Working Papers 2003. 22. Adai AT, Date SV, Wieland S, Marcotte EM: LGL: Creating a Map Publish with Bio Med Central and every of Protein Function with an Algorithm for Visualizing Very scientist can read your work free of charge Large Biological Networks. Journal of Molecular Biology 2004, 340:179-190. "BioMed Central will be the most significant development for 23. Teuffel O, Dettling M, Cario G, Stanulla M, Schrappe M, Buehlmann P, disseminating the results of biomedical researc h in our lifetime." Niggli F, Schaefer B: Gene Expression Profiles and Risk Stratifi- Sir Paul Nurse, Cancer Research UK cation in Childhood Acute Leukemia. Haematologica 2004, 89:801-808. Your research papers will be: 24. Attwood T, Miller C: Progress in bioinformatics and the impor- available free of charge to the entire biomedical community tance of being earnest. Biotechnol Annu Rev 2002, 8:1-54. 25. Affymetrix: Exon Probeset Annotations and Transcript Clus- peer reviewed and published immediately upon acceptance ter Groupings. 2005. cited in PubMed and archived on PubMed Central 26. O'Madadhain J, Fisher D, Smyth P: Analysis and Visualization of Network Data using JUNG. Journal of Statistical Software in press. yours — you keep the copyright 27. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence BioMedcentral (RefSeq): a curated non-redundant sequence database of Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 14 of 14 (page number not for citation purposes)

Journal

BMC BioinformaticsSpringer Journals

Published: Jun 2, 2006

There are no references for this article.