KEGG: Kyoto Encyclopedia of Genes and Genomes
Kanehisa, Minoru;Goto, Susumu
© 2000 Oxford University Press
Nucleic Acids Research, 2000, Vol. 28, No. 1 27–30

Minoru Kanehisa* and Susumu Goto
Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
Received September 29, 1999; Accepted October 4, 1999

*To whom correspondence should be addressed. Tel: +81 774 38 3270; Fax: +81 774 38 3269; Email: firstname.lastname@example.org

ABSTRACT

KEGG is a knowledge base for systematic analysis of gene functions, linking genomic information with higher order functional information. The genomic information is stored in the GENES database, which is a collection of gene catalogs for all the completely sequenced genomes and some partial genomes with up-to-date annotation of gene functions. The higher order functional information is stored in the PATHWAY database, which contains graphical representations of cellular processes, such as metabolism, membrane transport, signal transduction and cell cycle. The PATHWAY database is supplemented by a set of ortholog group tables for the information about conserved subpathways (pathway motifs), which are often encoded by positionally coupled genes on the chromosome and which are especially useful in predicting gene functions. A third database in KEGG is LIGAND for the information about chemical compounds, enzyme molecules and enzymatic reactions. KEGG provides Java graphics tools for browsing genome maps, comparing two genome maps and manipulating expression maps, as well as computational tools for sequence comparison, graph comparison and path computation. The KEGG databases are daily updated and made freely available (http://www.genome.ad.jp/kegg/).

INTRODUCTION

While the genome sequencing projects rapidly determine gene catalogs for an increasing number of organisms, functional annotation of individual genes is still largely incomplete. KEGG (Kyoto Encyclopedia of Genes and Genomes) is an effort to link genomic information with higher order functional information by computerizing current knowledge on cellular processes and by standardizing gene annotations. Generally speaking, the biological function of the living cell is a result of many interacting molecules; it cannot be attributed to just a single gene or a single molecule (1). The functional assignment in KEGG is a process of linking a set of genes in the genome with a network of interacting molecules in the cell, such as a pathway or a complex, representing a higher order biological function.

The KEGG project was initiated in May 1995 under the Human Genome Program of the Ministry of Education, Science, Sports and Culture in Japan (2). All the data in KEGG and associated software tools are made available as part of the Japanese GenomeNet service (3). KEGG consists of three databases: PATHWAY for representation of higher order functions in terms of the network of interacting molecules, GENES for the collection of gene catalogs for all the completely sequenced genomes and some partial genomes, and LIGAND (4) for the collection of chemical compounds in the cell, enzyme molecules and enzymatic reactions. The overall architecture of the KEGG system is basically the same as previously reported (5). The user may enter the KEGG system top-down starting from the pathway (functional) information or bottom-up starting from the genomic information at the KEGG table of contents page (http://www.genome.ad.jp/kegg/kegg2.html).

GENOMIC INFORMATION

GENES database

The current status of the KEGG databases is summarized in Table 1. During the past year, we have made every effort to keep up with the data increase of complete genome sequences, and also with the imminent data explosion of gene expression profiles. The number of GENES entries for just 29 species (human, mouse, Drosophila, Arabidopsis, Schizosaccharomyces pombe and 24 completely sequenced genomes) totals ~110 000 entries, which is already larger than the number of entries in the well-annotated SWISS-PROT database (6). The GENES database contains the bare minimum information for each gene as shown in Table 2, but it is intended to be a resource containing up-to-date, standardized descriptions of gene functions. GENES also serves as a gateway to a number of other resources containing more detailed information.

Table 1. The summary of KEGG release 12.0 (October 1999)

Database                 Content
PATHWAY                  2706 entries for pathway diagrams constructed from 143 manually drawn diagrams
GENES                    110 018 entries in 24 complete genomes and 12 partial genomes
LIGAND                   5645 entries in the COMPOUND section
                         3705 entries in the ENZYME section
                         5207 reactions in the REACTION section

Auxiliary data           Content
Ortholog group table     61 tables
Genome map               23 complete genomes and one partial genome
Comparative genome map   23 × 23 complete genome comparisons
Expression map           Four sets of expression maps
Gene catalog             53 catalogs
Molecular catalog        Eight catalogs
Disease catalog          Three catalogs

Table 2. The data content of the GENES database entry

Field         Content                          Links                             Data source
ENTRY         Entry identifier                 LinkDB database                   GenBank or original database
              (gene accession number)
NAME          Gene names and                                                     GenBank or original database
              alternative names
DEFINITION    Annotation of gene function      LIGAND/ENZYME database,           GenBank, original database,
                                               SWISS-PROT database and           SWISS-PROT and KEGG
                                               PubMed database
CLASS         Classification of genes          KEGG/PATHWAY database             KEGG
              according to the KEGG pathways
POSITION      Chromosomal position             KEGG/GENOME map                   GenBank
DBLINKS       Outside links                    Original databases and
                                               NCBI Entrez database
CODON_USAGE   Codon usage                                                        Computed
AASEQ         Amino acid sequence              see footnote                      GenBank or original database
NTSEQ         Nucleotide sequence                                                GenBank or original database

Footnote: Computational links are available including sequence similarity searches (FASTA and BLAST), motif search (MOTIF), membrane protein predictions (SOSUI and TSEG), and cellular localization site prediction (PSORT).

We developed various computational tools for the maintenance of the GENES database, especially for extraction of information from GenBank, which is not a trivial task, and for assisting systematic annotation of gene functions. The overall flow of both computerized and manual processes is illustrated in Figure 1. A web-based annotation tool is used, together with other computational tools, to assign EC numbers, to assign ortholog identifiers, to incorporate new experimental evidence from literature, and to annotate predictions based on pathway construction. As described below, the ortholog identifiers will be used as primary keys for automatic matching of genes in the genome and gene products in the pathway.

Figure 1. Procedures used to organize and annotate the GENES database.

The backbone retrieval system for the GENES database is the DBGET/LinkDB system (7), but there are additional ways of accessing this database. One is the Java-based genome map browser for graphical manipulation of gene positions on the chromosome. The other is what we call the hierarchical text browser for handling functional hierarchy of gene catalogs.

Gene expression profiles

Here we report another Java graphics browser, the expression map browser, for analysis of gene expression profiles obtained by cDNA microarray or oligonucleotide array experiments. The vast amount of data generated by such functional genomics experiments are likely to contain valuable information, which will supplement genomic sequence information toward understanding higher biological functions of the cell. A preliminary version of the expression map browser is linked to both the KEGG pathway data and the genome map data, so that the user may examine if, for example, a group of co-regulated genes are also correlated in the pathway or are encoded in a cluster of genes on the chromosome.

PATHWAY INFORMATION

PATHWAY database

Currently the best organized part of the KEGG/PATHWAY database is metabolism, which is represented by ~90 graphical diagrams for the reference metabolic pathways. Each reference pathway can be viewed as a network of enzymes or a network of EC numbers. Once enzyme genes are identified in the genome based on sequence similarity and positional correlation of genes, and the EC numbers are properly assigned, organism-specific pathways can be constructed computationally by correlating genes in the genome with gene products (enzymes) in the reference pathways according to the matching EC numbers. We are trying to extend this mechanism to include various regulatory pathways, such as signal transduction, cell cycle and apoptosis. There are, however, two major problems in automating the construction of regulatory pathways.

Because the metabolic pathway, especially for intermediary metabolism, is well conserved among most organisms from mammals to bacteria, it is possible to manually draw one reference pathway and then to computationally generate many organism-specific pathways. In contrast, the regulatory pathways are far more divergent and are difficult to combine into common reference pathway diagrams. Thus, we basically draw a pathway diagram separately for each organism. At the same time, we are trying to identify groups of organisms that share common pathways or assemblies and whose diagrams may be combined. Examples include one common apoptosis pathway diagram for human and mouse, and three ribosome assembly diagrams separately for bacteria, archaea and eukaryotes.

The other related problem is the absence of proper identifiers for functions in the regulatory pathways. The EC numbers in the metabolic pathways play roles as identifiers of the nodes (enzymes) and also as keys for linking with the genomic information. We are preparing for the introduction of the ortholog identifiers to extend such capabilities of the EC numbers. The ortholog identifiers will be used to identify nodes (proteins) in the regulatory pathways and also to link with the genomic information. In addition, the ortholog identifiers will replace the EC numbers in the metabolic pathways in order to distinguish multiple genes that match one EC number, for example, different subunits of an enzyme complex or different genes expressed under different conditions.

Ortholog group tables

Orthologs are identified in KEGG not only by sequence similarity of individual genes but also by examining if all constituent members are found for a functional group, such as a conserved subpathway or a molecular complex. The KEGG ortholog group table is a representation of three features: whether an organism contains a complete set of genes that constitutes a functional group, whether those genes are physically coupled on the chromosome, and what are orthologous genes among different organisms. Currently there are 61 ortholog group tables, which contain, for example, a gene cluster in the genome coding for a functionally related enzyme cluster in the metabolic pathway. In KEGG such correlated clusters are first detected by a heuristic graph comparison algorithm, and then manually edited and compiled into the ortholog group tables. There are two types of graph comparisons that we use: genome–pathway and genome–genome comparisons (1). An ortholog group table is a composite of such pairwise comparisons, representing a conserved portion of the pathway, or what we call a pathway motif.

Generalized protein–protein interaction

The KEGG pathway representation focuses on the network of gene products, mostly proteins but including functional RNAs. As illustrated in Figure 2, the metabolic pathway is a network of indirect protein–protein interactions, which is actually a network of enzyme–enzyme relations. In contrast, the regulatory pathway often consists of direct protein–protein interactions, such as binding and phosphorylation, and another class of indirect protein–protein interactions, which are relations of transcription factors and transcribed gene products via gene expressions. The generalized protein–protein interaction network that includes these three types of interactions is an abstract network, but it is especially useful to link with genomic information because the nodes (gene products) of this network can be directly correlated with the nodes (genes) in the genome. With this concept of generalized protein–protein interaction network, we are expanding the collection of manually drawn reference pathway diagrams.

Figure 2. The generalized protein–protein interaction includes an indirect protein–protein interaction by two successive enzymes, a direct protein–protein interaction, and another indirect protein–protein interaction by gene expression. The nodes of the generalized protein–protein interaction network are gene products, which can be directly correlated with genes in the genome.

AVAILABILITY

All the data in KEGG and associated analysis tools are provided as part of the Japanese GenomeNet service (3) at http://www.genome.ad.jp/

The Internet version of KEGG in GenomeNet can be accessed at the following address: http://www.genome.ad.jp/kegg/

For strictly academic research purposes at academic institutions the KEGG mirror server package may be installed. The package, which also includes a minimal set of DBGET/LinkDB, can be obtained from the KEGG anonymous FTP site: ftp://kegg.genome.ad.jp/

The mirror package runs on a Solaris, IRIX or Linux machine. The individual databases PATHWAY, GENES and LIGAND can also be mirrored or obtained by anonymous FTP.

ACKNOWLEDGEMENTS

We thank the present and past KEGG project members for their excellent work. We also thank Kotaro Shiraishi for developing the KEGG annotation tool and other useful programs. This work was supported by a Grant-in-Aid for Scientific Research on the Priority Area ‘Genome Science’ from the Ministry of Education, Science, Sports and Culture of Japan. The computational resource was provided by the Supercomputer Laboratory, Institute for Chemical Research, Kyoto University.

REFERENCES

1. Kanehisa,M. (1999) Post-Genome Informatics. Oxford University Press, Oxford, UK.
2. Kanehisa,M. (1997) Trends Genet., 13, 375–376.
3. Kanehisa,M. (1997) Trends Biochem. Sci., 22, 442–444.
4. Goto,S., Nishioka,T. and Kanehisa,M. (2000) Nucleic Acids Res., 28, 380–382.
5. Ogata,H., Goto,S., Sato,K., Fujibuchi,W., Bono,H. and Kanehisa,M. (1999) Nucleic Acids Res., 27, 29–34.
6. Bairoch,A. and Apweiler,R. (1999) Nucleic Acids Res., 27, 49–54. See also this issue: Nucleic Acids Res. (2000) 28, 45–48.
7. Fujibuchi,W., Goto,S., Migimatsu,H., Uchiyama,I., Ogiwara,A., Akiyama,Y. and Kanehisa,M. (1998) Pac. Symp. Biocomput. 1998, 683–694.
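
Illustrative sketch (not part of the original article). The PATHWAY database section describes constructing organism-specific pathways by correlating genes in the genome with enzymes in a reference pathway according to matching EC numbers. The short Python example below is one way to picture that matching step. It is a minimal sketch under stated assumptions, not the KEGG implementation: the function name build_organism_pathway, the gene identifiers and the choice of EC numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical reference pathway: each node is identified by an EC number.
# The EC numbers are real enzyme classes, but their selection here is illustrative.
REFERENCE_PATHWAY = ["2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"]

# Hypothetical gene catalog for one organism: gene identifier -> assigned EC numbers.
GENE_CATALOG = {
    "geneA": ["2.7.1.1"],
    "geneB": ["5.3.1.9"],
    "geneC": ["2.7.1.11"],
    "geneD": ["2.7.1.11"],  # a second gene that matches the same EC number
    "geneE": ["1.1.1.1"],   # an enzyme gene outside this reference pathway
}


def build_organism_pathway(reference_pathway, gene_catalog):
    """Map every EC-number node of the reference pathway to the organism's genes."""
    genes_by_ec = defaultdict(list)
    for gene, ec_numbers in gene_catalog.items():
        for ec in ec_numbers:
            genes_by_ec[ec].append(gene)
    # Nodes with no matching gene keep an empty list, i.e. a gap in the pathway.
    return {ec: sorted(genes_by_ec.get(ec, [])) for ec in reference_pathway}


pathway = build_organism_pathway(REFERENCE_PATHWAY, GENE_CATALOG)
missing = [ec for ec, genes in pathway.items() if not genes]

for ec, genes in pathway.items():
    print(ec, genes if genes else "no gene assigned")
```

In this toy data two genes map to the single node 2.7.1.11. That is the kind of ambiguity the article gives as a reason for replacing EC numbers with ortholog identifiers, since an EC number alone cannot distinguish different subunits or differently expressed genes.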
Nucleic Acids Research, Oxford University Press, Volume 28 (1), Jan 1, 2000