The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences

Published online 1 November 2021
Nucleic Acids Research, 2022, Vol. 50, Database issue D543–D552
https://doi.org/10.1093/nar/gkab1038

Yasset Perez-Riverol1,*, Jingwen Bai1, Chakradhar Bandla1, David García-Seisdedos1, Suresh Hewapathirana1, Selvakumar Kamatchinathan1, Deepti J. Kundu1, Ananth Prakash1, Anika Frericks-Zipper2,3, Martin Eisenacher2,3, Mathias Walzer1, Shengbo Wang1, Alvis Brazma1 and Juan Antonio Vizcaíno1,*

1European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, D-44801 Bochum, Germany and 3Ruhr University Bochum, Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, 44801 Bochum, Germany

Received September 11, 2021; Revised October 12, 2021; Editorial Decision October 13, 2021; Accepted October 14, 2021

*To whom correspondence should be addressed. Tel: +44 1223 492686; Email: juan@ebi.ac.uk
Correspondence may also be addressed to Yasset Perez-Riverol. Tel: +44 1223 492513; Email: yperez@ebi.ac.uk

© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world’s largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of datasets submitted to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.

INTRODUCTION

Data sharing in the public domain has become the standard for proteomics researchers. The growth in recent years has been remarkable and, as a result, the number of proteomics datasets deposited every year in open public repositories is now comparable to transcriptomics (1). Since 2004, the PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) at the European Bioinformatics Institute (EMBL-EBI, Hinxton, Cambridge, UK) has enabled public data deposition of mass spectrometry (MS)-based proteomics data, providing access to the experimental data described in scientific publications (2). Since then, and especially in recent years, PRIDE Archive (the archival component of PRIDE) has become the largest repository for proteomics data sharing worldwide (2,3).

PRIDE stores datasets coming from all proteomics experimental approaches, with a focus on discovery-driven techniques such as data dependent acquisition (DDA) and data independent acquisition (DIA) bottom-up proteomics, but also top-down proteomics and MS imaging, among others. For each dataset submitted to PRIDE Archive, the MS raw files (output files from the mass spectrometers) and the processed results (at least peptide/protein identification results; quantification information is optional) must be provided. In addition, each dataset in PRIDE Archive can contain peptide/protein quantitation result files, the mass spectra as peak list files, the searched protein sequence databases or spectral libraries, programming scripts, and any other technical and/or biological metadata provided by the data submitters (4). Within the Proteomics Standards Initiative (PSI) organization, the PRIDE team has led the creation and implementation of multiple standard open file formats such as mzTab (5), mzIdentML (6) and mzML (7) to store, process and visualize the proteomics data deposited.

The stand-alone ProteomeXchange (PX) Submission tool (8) allows researchers to perform data submissions to PRIDE Archive, while PRIDE Inspector (9) enables users to review the dataset before, during, and after it has been deposited in the resource. After the submission is completed, different pipelines perform the validation and quality assessment of the reported results and store the data into multiple databases to enable data access and visualization in the PRIDE Archive web interface (https://www.ebi.ac.uk/pride/archive) and also programmatically via the PRIDE Application Programming Interface (API, https://www.ebi.ac.uk/pride/ws/archive/v2/). In recent years, PRIDE Archive has been moving its visualization components from desktop-based applications (e.g., PRIDE Inspector) to Restful APIs and web-based interfaces. All submitted files are available to download via FTP or the Aspera file transfer protocol.

PRIDE resources have two main missions for the proteomics community: (i) support data deposition and quality assessment of submitted proteomics experiments, to help reproducible research; and (ii) promote and facilitate the reuse of public proteomics data, and disseminate high-quality proteomics evidences into added-value resources, including Ensembl (10), UniProt (11) and Expression Atlas (12).

The PRIDE database was one of the founders of the PX consortium in 2011 (3,8). PX defines the guidelines for data submission and dissemination of public proteomics data worldwide. As of 2021, the resources PeptideAtlas (13), including its related resource PASSEL (PeptideAtlas SRM Experiment Library) (14), MassIVE (15), jPOST (16), iProX (17) and Panorama Public (18) are the active members of the consortium. PX coordinates the release of accession numbers for every submitted dataset and a set of services for providing unified access to publicly available datasets (http://proteomecentral.proteomexchange.org/cgi/GetDataset), including specific data types such as mass spectra, using Universal Spectrum Identifiers (19) (http://proteomecentral.proteomexchange.org/usi/). Additionally, in 2017, PRIDE became an ELIXIR (http://www.elixir-europe.org) core data resource (20) and ELIXIR deposition database, recognizing its key role in the life sciences.

In this manuscript, we will summarize the main PRIDE-related developments in the last three years, since the previous Nucleic Acids Research (NAR) database update manuscript was published (2). We will discuss PRIDE Archive first but will also provide updated information about the PRIDE-related tools and other ongoing activities, including the updates in the PRIDE Spectra Archive and PRIDE Peptidome. Additionally, we will report on the work performed to disseminate and integrate proteomics data in other EMBL-EBI resources.

CURRENT STATUS OF THE PRIDE ECOSYSTEM: RESOURCES AND TOOLS

The PRIDE database ecosystem (https://www.ebi.ac.uk/pride/) is composed of a comprehensive set of libraries, desktop tools, databases, large-scale pipelines, Restful APIs and web applications (Figure 1). A set of open-source Java libraries, including jmzTab (21), jmzIdentML (22), ms-data-core-api (23) and the protein inference algorithms toolkit (PIA) (24,25), supported and maintained by the PRIDE team, allows users to read, validate, process, and store proteomics data encoded in PSI open file formats. PRIDE Archive pipelines (2) perform a set of validation and quality checks to make sure the deposited files are semantically valid, and that the metadata provided during the submission is correct, in addition to moving the submitted datasets into the EMBL-EBI production filesystem.

When a given dataset is made public, a group of post-submission pipelines parses the peptides and proteins identified in the dataset (if the dataset is a ‘complete’ submission (4)) and indexes them into an Apache Solr and MongoDB-based infrastructure, enabling datasets to be searched by the identified peptides and proteins. The PRIDE Spectra Archive and PRIDE Peptidome provide access to the mass spectra identified in the PRIDE Archive and to a condensed view of high-quality identified peptides across PRIDE Archive datasets, respectively. All data from PRIDE Archive and related resources are served through the PRIDE Restful API and the web application.

Figure 1. Schema of the PRIDE resources ecosystem. PRIDE Archive users must provide the raw files, the processed results files, and metadata about every given dataset. Standard file formats (for processed result files) can be provided for ‘Complete’ submissions. A group of open-source libraries is used by the PX Submission tool and the PRIDE pipelines to validate, assess the quality of the reported peptides and proteins, and store the information (metadata, peptides/proteins and spectra) into multiple databases. The PRIDE Peptidome resource selects high-quality peptides across all the datasets in PRIDE Archive. All the data from PRIDE Archive and PRIDE Peptidome is served to external users such as Ensembl and UniProt through the PRIDE API and PRIDE web interface. Additionally, proteomics quantitative datasets are reanalyzed and integrated into Expression Atlas.

Data submission

The PRIDE Archive guidelines for data submission, including the required data files and metadata, have not changed substantially in recent years, in parallel to PX requirements. Previous publications (2,4) explain in detail the main formats supported, the type of submissions (‘complete’ or ‘partial’), and the required metadata for each dataset. Complete submissions are those where the processed results are submitted in the PSI standard file formats mzIdentML or mzTab. A web tutorial explaining the main steps of the submission process is available at https://www.ebi.ac.uk/training/online/courses/pride-quick-tour/.

In 2019, complete submissions containing quantitative information based on the PRIDE XML file format were discontinued and replaced by mzTab-based complete submissions. mzTab (5) is a PSI tab-delimited format that supports the representation of not only identification results but also quantitative results and post-translational modification (PTM) localization information. Since 2019, Mascot (26), MaxQuant (27) and OpenMS (28) can export the resulting identification/quantification results into mzTab. Since 2020, overall, 240 and 30 dataset submissions have been performed using mzTab generated from Mascot and MaxQuant, respectively. Recently, the MaxQuant and PRIDE teams worked together to enable the novel tool MaxDIA (29) to export results from DIA approaches to mzTab.

Minor improvements have been made to the PX Submission tool, including performance improvements in the OLS Dialog (30) component, which allows searching for ontology/controlled vocabulary terms in the Ontology Lookup Service (https://www.ebi.ac.uk/ols/index). As a key point, file checksums are now computed during the submission and validated by the PRIDE pipelines to ensure the integrity of the submitted files. Two additional improvements have been implemented as part of the submission process: (i) adding information about dataset licenses; and (ii) submission of sample metadata and experimental design information using the newly developed file format MAGE-TAB for proteomics.

Datasets licenses

Licenses for datasets stored in PX resources had not been originally defined or agreed upon (3). In 2020, PX partners decided to move towards a default Creative Commons CC0 license as a minimum level for each dataset, making datasets globally available without any restrictions. PRIDE used to follow the EMBL-EBI ‘Terms of use’ (https://www.ebi.ac.uk/about/terms-of-use). The CC0 license can only be ensured for newly submitted datasets since 2020. It is expected that for PRIDE, a CC0 license will be the default one in the foreseeable future, in parallel to the policy in other EMBL-EBI resources.

MAGE-TAB for proteomics: improving sample metadata and experimental design

For every dataset submitted to PRIDE Archive, general metadata about the study must be provided, including the title, submitters’ details, dataset description, sample and data protocols, instrument, and the associated publication once it is published (2,4,8). It has been highlighted multiple times (31–33) how the lack of appropriate metadata at the sample level, including the experimental design (e.g. sample treatment, fractionation steps, etc.), prevents a more streamlined reuse of the available data, especially in the case of reanalyses of quantitative proteomics datasets. MAGE-TAB for proteomics (34), an extension of the original MAGE-TAB format used in transcriptomics (35), has been recently proposed to capture the sample metadata and the experimental design for proteomics experiments (Figure 2).

MAGE-TAB for proteomics has two main components: the Investigation Description Format (IDF) and the Sample and Data Relationship Format (SDRF). The IDF contains the general description of the study, which is the same information annotated with the PX Submission tool. Users therefore do not need to provide it upon submission. The SDRF-Proteomics format includes the representation of the experimental design, and the relationship between the samples analyzed in the experiment and the MS data files (raw files). SDRF-Proteomics is a tab-delimited format where each column is a property of the sample or the data file. Each row corresponds to the relation between a sample and a data file, and each cell is the value of the property for the sample or the data file (34) (https://github.com/bigbio/proteomics-metadata-standard).

SDRF-Proteomics files can now be added manually by the user by selecting ‘EXPERIMENTAL DESIGN’ as the file type during the submission. Once the data arrives at PRIDE, a BioSample database accession is requested for each sample.

Figure 2. PRIDE Archive users can now provide SDRF-Proteomics files to represent the experimental design and the relationship between the samples analyzed and the instrument raw files.
The samples included in the SDRF-Proteomics files are submitted to BioSamples, each receiving a unique accession number. In addition, the PRIDE web interface represents the information contained in SDRF-Proteomics files in an ‘Experimental Design’ table, including all samples and data files.

Each accession is added into the BioSample resource (36) (e.g. https://www.ebi.ac.uk/biosamples/samples/SAMEA7710319) via the PRIDE Archive pipelines. In addition, the corresponding experimental design table (e.g. https://www.ebi.ac.uk/pride/archive/projects/PXD000792) (Figure 2) can be accessed through the PRIDE Archive web interface. As of September 2021, more than 130 public datasets have been re-annotated by third parties (33) and the resulting information is available via PRIDE Archive (https://www.ebi.ac.uk/pride/archive?keyword=sdrf.tsv).

PRIDE Archive web interface and Restful API: accessing proteomics evidences

The PRIDE Restful API (https://www.ebi.ac.uk/pride/ws/archive/v2/) can be used to query and access all the data in PRIDE resources. By using the API it is possible, for example, to query and find datasets by their date of publication, the proteins that have been identified, or the name of a data file within the study (e.g., https://www.ebi.ac.uk/pride/ws/archive/v2/search/projects?keyword=Subject1_FACS145_B_C10). A powerful query language allows users to combine multiple keywords (properties of the project) into an SQL-based query to search datasets. A Python package and tool (https://github.com/PRIDE-Archive/pridepy) have been developed to programmatically interact with the PRIDE Archive Restful API. The package provides a data model for all the data structures provided by the API and also includes functionality to query each endpoint in the API (see https://github.com/PRIDE-Archive/pridepy#examples).

The PRIDE Archive web interface provides visualization components that allow users to search, find and inspect all the dataset information. A large number of the features from PRIDE Inspector have been moved into the PRIDE web interface, enabling the inspection of the peptide/protein evidences and the spectra identified in each complete submission (Figure 3). In the results exploration viewer, users can explore the identification results, including the protein coverage in the identified proteins and the mass spectra that are part of each PSM (Peptide Spectrum Match) (Figure 3, https://www.ebi.ac.uk/pride/archive/projects/PXD008613/results?reportedAccession=SPTB2_HUMAN&assayAccession=83415). It is important to highlight that these features are only available for complete submissions.

Figure 3. The PRIDE web interface provides functionality to assess the quality of each Complete submission, including components to: (A) visualize the sequence coverage of a particular protein; and (B) visualize the spectrum used to identify a given peptide.

PRIDE Spectra Archive: accessing and visualizing all spectra for complete submissions

The public availability of, and direct access to, mass spectra data create the opportunity for scientists to directly assess whether, e.g., a novel peptide evidence, PTM, or single amino acid variant (SAAV) is supported by a good-quality and well-annotated mass spectrum (19,37). PSI and PX partners have recently created a novel mechanism to uniquely resolve each mass spectrum in public proteomics resources. The Universal Spectrum Identifier (USI) enables greater transparency of spectral evidence, making it more ‘FAIR’ (Findable, Accessible, Interoperable, and Reusable), with more than 1 billion USI identifications from over 3 billion spectra already available through PX repositories (19).
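To make the idea of resolving a single spectrum through a USI more concrete, the sketch below splits an identifier into its parts. It assumes the colon-separated layout given in the PSI USI specification (the prefix mzspec, the dataset collection, the MS run name, the index type, the index and an optional interpretation). The example string, the USI class and the parse_usi function are illustrative names for this sketch and are not part of any PRIDE library; run names that themselves contain colons are not handled.

```python
from typing import NamedTuple, Optional


class USI(NamedTuple):
    collection: str                # dataset accession, e.g. a PX identifier
    run: str                       # MS run (raw file) name
    index_type: str                # how the spectrum is addressed, e.g. "scan"
    index: str                     # spectrum number within the run
    interpretation: Optional[str]  # optional peptide/charge annotation


def parse_usi(usi: str) -> USI:
    """Split a USI string, assuming the layout described in the lead-in."""
    # Split at most five times so that any colons inside the optional
    # interpretation stay together in the last part.
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a USI in the assumed layout: {usi!r}")
    _, collection, run, index_type, index = parts[:5]
    interpretation = parts[5] if len(parts) == 6 else None
    return USI(collection, run, index_type, index, interpretation)


# Illustrative identifier, used here only as a string to parse.
example = "mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2"
print(parse_usi(example))
```

Because the dataset accession, run name and spectrum index are all carried in the identifier itself, a resource that receives such a string has enough information to locate the spectrum without any additional lookup table.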
The PRIDE Spectra Archive (https://www.ebi.ac.uk/pride/archive/spectra) provides access to over 540 million PSMs (as of September 2021) originally submitted to PRIDE Archive. Users can search by peptide sequences and USIs, enabling them to find specific PSMs from complete submitted datasets. A list of PSMs is shown after the search, including peptide sequences, PTMs, search engine scores, charges, and two additional columns that highlight whether or not the PSM has passed the original analysis threshold and the PRIDE internal pipelines thresholds, for example, PSM false discovery rate (FDR) <0.1 computed using the PIA algorithm (24,25). The accession column in the result table provides a direct link to the project result page, where users can check all the results for a given dataset.

PRIDE Peptidome: a condensed view of peptide evidences across PRIDE Archive

PRIDE Peptidome (https://www.ebi.ac.uk/pride/peptidome/) is a resource that groups all PSMs by peptide sequence and the corresponding protein accession. Until recently, the grouping was performed using a spectrum clustering approach (38). However, this approach presented major challenges because each spectrum needed to be compared with every other spectrum, prompting performance challenges, due to the continuous and remarkable growth in the amount of submitted data. Although spectrum clustering algorithms have been recently improved using deep-learning models to avoid all the comparisons between all the spectra in the data (39,40), applications of these novel algorithms in large-scale data repositories have not yet been implemented.

Instead of spectrum clustering, a novel platform and algorithm (https://github.com/bigbio/sparkms) have been used to select the best peptide evidence for each peptide and protein combination. The best peptide is selected based on two rules: (i) the peptide passes the peptide FDR threshold for the assay; and (ii) the peptide sequence is longer than seven amino acids. sparkMS (https://github.com/bigbio/sparkms) used Spark (https://spark.apache.org/) and PySpark to group millions of PSMs in less than 6 hours, which enabled the analysis of such a large-scale amount of data.

The PRIDE Peptidome web interface enables users to search by peptide sequence and protein accession numbers (e.g. https://www.ebi.ac.uk/pride/peptidome/peptidesearch?keyword=SPTB2_HUMAN). The search table shows the sequence for each peptide, the protein accession, the number of PSMs across PRIDE Archive, the number of datasets where this peptide has been identified and the best posterior error probability (PEP), as computed by PIA (25). When a given peptide-protein combination is selected, the peptide viewer shows the sequence, the spectrum that justifies the best scored PSM, the list of all PTMs identified, and the corresponding tissues and diseases where the peptide was identified (e.g. https://www.ebi.ac.uk/pride/peptidome/peptidedetails?keyword=DASVAEAWLLGQEPYLSSR&proteinAccession=SPTB2_HUMAN).

PRIDE ARCHIVE SUBMISSION STATISTICS

As of 1 August 2021, PRIDE Archive stored 23 168 datasets, compared to the 10 100 datasets available in August 2018 (2), which means that 56.4% of the data in PRIDE Archive has been submitted in the last 3 years. Figure 4 shows the distribution of submissions by month, species, and disease in PRIDE Archive since 2012, and the cumulative size of PRIDE Archive data in terabytes.

In 2019, PRIDE Archive received 314 datasets per month on average, 436 during 2020, and so far in 2021, this number has grown to 499 datasets on average (Figure 4A), which affirms the large and increasing demand for PRIDE and its growing prominence. At the time of writing, PRIDE hosts ~83% of all PX datasets, coming from >8 000 research groups, from 66 countries. The proportion of submitted datasets that are now publicly available is currently 64%, reflecting an improvement of around 8% when compared with 2019. With this aim in mind, the team has developed multiple mechanisms to detect datasets already published that have not been reported to PRIDE by the original submitters. As a concrete example, submitters can report via the PRIDE web interface datasets that already have a corresponding manuscript published, if the dataset is still private. The size of PRIDE Archive data has doubled from 2019 to 2021 (Figure 4B). As a result, PRIDE Archive is the third-largest omics archive at EMBL-EBI, only exceeded by the genomics resources ENA (European Nucleotide Archive) and EGA (European Genome-phenome Archive) (41).

As of September 2021, the majority of data in PRIDE Archive (including both public and private datasets) are human datasets (including cell lines) (39.1%), followed by mouse (13.7%), Saccharomyces cerevisiae (2.8%), Arabidopsis thaliana (2.7%), Rattus norvegicus (2.5%) and Escherichia coli (2.3%). Whereas most of the datasets come from model organisms, overall, datasets coming from >3 224 different taxonomy identifiers are stored in PRIDE Archive (Supplementary File S1).

The distribution of submitted datasets by tissue and disease is more heterogeneous (Figure 4C and D), with ‘cell-culture (non-specific tissue)’ and ‘disease-free (healthy/normal samples)’ being the most predominant annotations. Altogether, cancer is the most studied disease, followed by Alzheimer’s and Parkinson’s disease. Importantly, as of September 2021, more than 180 COVID-19 related datasets have been submitted to PRIDE Archive. These datasets, once they become publicly available, are integrated into the EMBL-EBI resource COVID-19 Data Portal (https://www.covid19dataportal.org/), enabling researchers to access all public data at EMBL-EBI resources in a unified interface (42).

Figure 4. (A) Number of submitted datasets to PRIDE Archive per month (from the beginning of PX in 2012 till August 2021); (B) cumulative size of PRIDE Archive data since 2012; (C) number of submitted datasets per species or taxonomy identifier (as of August 2021). All species that had less than 100 datasets are grouped in one category; (D) distribution of the number of submitted datasets to PRIDE Archive per annotated disease.

PRIDE ARCHIVE AS A HUB OF MS EVIDENCES

Proteomics researchers are increasingly reusing public data from PRIDE (and other PX resources) for a broad range of purposes. For instance, recent resources that have been started by reusing mostly PRIDE public datasets include OpenProt (43), MatrisomeDB (44), Scop3P (45) and ProteomeHD (46). Additionally, as just one among many examples of high-profile data reuse, PRIDE datasets are routinely reanalyzed in the context of the Human Proteome Project (47). Figure 5A shows the increase in volumes of data downloaded from PRIDE Archive since 2013. Recently, PRIDE has started to track the reuse of public PRIDE datasets in publications. This information (if applicable) is available in the dataset web page when clicking on the term ‘Dataset reuses’. Figure 5B shows the increase in manuscripts (including pre-prints) published per year, where PRIDE datasets are reused.

Figure 5. (A) Volumes of PRIDE Archive data downloads per year, from 2013 to 2020. (B) Number of manuscripts (including pre-prints) per year (2013–2021), where datasets from PRIDE Archive are reused. The figures from 2021 are estimated at the end of the year, according to the existing data at the end of September. It should be noted that the figures represent an underestimation since they only include those manuscripts that could be tracked successfully.

Rather than creating new resources, for sustainability reasons, our in-house focus has been on disseminating and integrating PRIDE proteomics data into added-value EMBL-EBI resources such as UniProt (11), Ensembl (10), and Expression Atlas (12). Additionally, we have just started the first steps of the work required to disseminate and integrate metaproteomics data into MGnify (48), an EMBL-EBI resource for the analysis, archiving, and browsing of metagenomic and metatranscriptomic data.

The dissemination of public proteomics data into different resources has different goals depending on each specific resource but can be grouped in three main categories: (i) provide aggregated peptide/protein evidences as originally submitted to PRIDE Archive, in the case of UniProt and Ensembl; (ii) provide peptide/protein evidences, variant sequences and PTM information from reanalyzed datasets to UniProt, Ensembl and, in the near future, to MGnify. In this case, an open analysis pipeline is used, including well-defined quality control metrics; and (iii) provide quantitative protein expression information into Expression Atlas, using data coming from reanalyzed datasets.

In-house data reuse: proteogenomics reanalysis integration with Ensembl

Since 2019, PRIDE has started to provide peptide evidences to Ensembl using the ‘TrackHub’ registry (2). More than 4 million canonical peptide sequences, coming from 184 PRIDE public datasets, have been disseminated into Ensembl ‘TrackHubs’, which are available at https://ftp.pride.ebi.ac.uk/pride/data/proteogenomics/latest/archive/.

Some obvious benefits of integrating genomics and proteomics data in genome browsers include linking somatic variants and MS evidences and/or gene sequences and PTMs. Recently, we developed a group of tools and workflows to enable large-scale reanalysis of public proteomics data to identify non-canonical peptides (49). Using custom proteogenomics databases created with pgdb (https://github.com/nf-core/pgdb) and pypgatk (https://github.com/bigbio/py-pgatk), we have managed to identify 43 501 non-canonical peptides and 786 variant peptide sequences in four public datasets.

In-house data reuse: data dissemination into UniProt

Aggregated high-quality evidences (as submitted to PRIDE Archive) are linked to UniProt, enabling users to check whether one particular protein has been detected in PRIDE Archive. As part of an ongoing effort, we are currently aiming to link all peptide evidences from PRIDE Peptidome to populate the UniProt ProtVista viewer (50).

Additionally, we are currently working on the development of infrastructure to reanalyse in a reliable manner, store, visualize and disseminate PTM data (starting with phosphorylation) from PRIDE into UniProt. This is taking place in the context of the ‘PTMeXchange’ project, in collaboration with the PeptideAtlas team and the University of Liverpool. Prior to this more systematic effort, we reanalysed 112 human phospho-enriched datasets, generated from 104 different human cell types or tissues (51). Using a machine learning approach, some of the information generated from the reanalysis, together with other sequence features, was used to create a single functional score for human phosphosites.

In-house data reuse: integration of quantitative analyses in Expression Atlas

More than 65 quantitative datasets have been annotated and reanalysed, and the corresponding results have already been, or are being, integrated into Expression Atlas at the moment of writing. Most of them are DDA label-free datasets, involving cell lines and tumor samples (52), and baseline tissue datasets coming from human, mouse and rat samples. MaxQuant was used as the analysis software in all cases. Additionally, ten SWATH-MS DIA datasets coming mainly from cell line and human tumor samples have also already been re-analysed and integrated into Expression Atlas. In this case, an in-house open analysis pipeline based on OpenSWATH (https://github.com/PRIDE-reanalysis/DIA-reanalysis) was developed and used for the re-analysis (53). These datasets constituted a pilot project to study the feasibility of performing a systematic reanalysis and integration of DIA datasets. Expression Atlas users can now access proteomics expression information more comprehensively in the same interface as gene expression, providing an effective manner of integrating the results of transcriptomics and proteomics experiments.

DISCUSSION AND FUTURE PLANS

Data deposition and dissemination have changed the proteomics community since the creation of PX almost 10 years ago. Most proteomics journals now require authors to deposit their data in a PX resource, which has enabled better reproducibility and traceability of the claims reported in a given manuscript. The proteomics community is now widely embracing open data policies, an opposite scenario to the situation just a few years ago. At the same time, public proteomics data are being increasingly reused with multiple applications (1). We next outline some of the main working areas for PRIDE in the near future.

First of all, PRIDE is raising the bar of metadata annotation for all submitted datasets. MAGE-TAB for proteomics has been created with the aim that every submitted dataset provides information about the sample and the experimental design. The improvement in the annotation is also required to facilitate further data reuse by third parties. We expect that, gradually, the SDRF-Proteomics component will be made required for every dataset submission, once the community has become familiar with the file format and with the mandatory information that needs to be provided. Multiple materials (https://github.com/bigbio/proteomics-metadata-standard/wiki), including examples and video tutorials, have been made available to better understand the file format and how it can be submitted to PRIDE Archive.

With the growing importance of clinical proteomics, i.e. in the context of multi-omics studies, another important area is the management of clinically sensitive human proteomics data. Ethical issues in proteomics are starting to be discussed and are becoming increasingly relevant. A community-driven white paper on the topic has been recently published describing the current state-of-the-art (54). Addressing ethical issues for genomics and transcriptomics data led to processes to control who may access the data, so-called ‘controlled access’. Resources supporting the storage and dissemination of controlled access DNA/RNA sequencing datasets include the EGA and others internationally such as dbGAP (USA) and the Japanese Genotype-phenotype Archive. At present, all data in PRIDE (and in all PX resources) is fully open. Therefore, there is an increasing number of clinically sensitive human datasets that cannot be made available via PRIDE due to ethical-related issues (55). To address this problem, we will be working on developing a tailored infrastructure for sensitive human proteomics data, and on all the related policies. Additionally, in the context of data archiving activities, we plan to improve the support for cross-linking data, as outlined in (56), and to provide better data integration for structural proteomics datasets between PRIDE Archive and the Protein Data Bank (PDB).

As shown above, we are already working on developing open and reproducible data analysis pipelines for different flavours of proteomics workflows (e.g., DDA, DIA, proteogenomics) (49,53,57). The main rationale is to make possible the use of that software in cloud infrastructures so that in the future the pipelines can be used by the community in the cloud using software container technologies (58). In addition, we aim to increasingly perform in-house data reuse (including data re-analysis) and disseminate high-quality proteomics data from PRIDE into the already mentioned added-value resources (Ensembl, UniProt, Expression Atlas, and MGnify in the near future). In this context, we will also work on improving the PRIDE Archive infrastructure to store dataset reanalyses appropriately, linking them to the relevant resources. One aim is to further develop data dissemination and integration practices also involving resources outside of EMBL-EBI.

Finally, we invite parties interested in PRIDE-related developments to follow the PRIDE Twitter account (@pride_ebi). For regular announcements of all the new publicly available datasets, users can follow the PX Twitter account (@proteomexchange).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

ACKNOWLEDGEMENTS

We would like to thank all the members of the PRIDE Scientific Advisory Board during the period 2019 to 2021, namely Ruedi Aebersold, Jurgen Cox, Pedro Cutillas, Concha Gil, Juri Rappsilber and Hans Vissers. Finally, we would like to thank all data submitters and collaborators for their contributions.

FUNDING

Wellcome [208391/Z/17/Z]; BBSRC grants ‘Proteomics DIA’ [BB/P024599/1], ‘PTMeXchange’ [BB/S01781X/1], ‘GRAPPA’ [BB/T019670/1]; UK-Japan Partnership award [BB/N022440/1]; NIH ‘Proteomics Standards’ grant [R24 GM127667-01]; EU H2020 project EPIC-XS [823839]; Open Targets [OTAR-043]; Luxembourg National Re- 15.
Choi,M., Carver,J., Chiva,C., Tzouros,M., Huang,T., Tsai,T.H., Pullman,B., Bernhardt,O.M., Huttenhain,R., Teo,G.C. et al. (2020) search Fund [C19/BM/13684739]; several ELIXIR Imple- MassIVE.quant: a community resource of quantitative mass mentation Studies and EMBL core funding; M.E. and spectrometry-based proteomics datasets. Nat. Methods, 17, 981–984. A.F.-Z. would like to acknowledge funding from de.NBI, a 16. Moriya,Y., Kawano,S., Okuda,S., Watanabe,Y., Matsumoto,M., project of the German Federal Ministry of Education and Takami,T., Kobayashi,D., Yamanouchi,Y., Araki,N., Yoshizawa,A.C. et al. (2019) The jPOST environment: an integrated proteomics data Research (BMBF) [FKZ 031 A 534A]; Center for Protein repository and database. Nucleic. Acids. Res., 47, D1218–D1224. Diagnostics (PPRODI), a grant of the Ministry of Inno- 17. Ma,J., Chen,T., Wu,S., Yang,C., Bai,M., Shu,K., Li,K., Zhang,G., vation, Science and Research of North-Rhine Westphalia, Jin,Z., He,F. et al. (2019) iProX: an integrated proteome resource. Germany. Funding for open access charge: Wellcome. Nucleic Acids Res., 47, D1211–D1217. 18. Sharma,V., Eckels,J., Schilling,B., Ludwig,C., Jaffe,J.D., Conflict of interest statement. None declared. MacCoss,M.J. and MacLean,B. (2018) Panorama public: a public repository for quantitative data sets processed in skyline. Mol. Cell. REFERENCES Proteomics, 17, 1239–1244. 19. Deutsch,E.W., Perez-Riverol,Y., Carver,J., Kawano,S., Mendoza,L., 1. Perez-Riverol,Y., Zorin,A., Dass,G., Vu,M.T., Xu,P., Glont,M., Van Den Bossche,T., Gabriels,R., Binz,P.A., Pullman,B., Sun,Z. et al. Vizcaino,J.A., Jarnuczak,A.F., Petryszak,R., Ping,P. et al. (2019) (2021) Universal Spectrum Identifier for mass spectra. Nat. Methods, Quantifying the impact of public omics data. Nat. Commun., 10, 18, 768–770. 20. Drysdale,R., Cook,C.E., Petryszak,R., Baillie-Gerritsen,V., 2. 
Perez-Riverol,Y., Csordas,A., Bai,J., Bernal-Llinares,M., Barlow,M., Gasteiger,E., Gruhl,F., Haas,J., Lanfear,J., Lopez,R. Hewapathirana,S., Kundu,D.J., Inuganti,A., Griss,J., Mayer,G., et al. (2020) The ELIXIR Core Data Resources: fundamental Eisenacher,M. et al. (2019) The PRIDE database and related tools infrastructure for the life sciences. Bioinformatics, 36, 2636–2642. and resources in 2019: improving support for quantification data. 21. Xu,Q.W., Griss,J., Wang,R., Jones,A.R., Hermjakob,H. and Nucleic Acids Res., 47, D442–D450. Vizcaino,J.A. (2014) jmzTab: a java interface to the mzTab data 3. Deutsch,E.W., Bandeira,N., Sharma,V., Perez-Riverol,Y., Carver,J.J., standard. Proteomics, 14, 1328–1332. Kundu,D.J., Garcia-Seisdedos,D., Jarnuczak,A.F., Hewapathirana,S., 22. Reisinger,F., Krishna,R., Ghali,F., Rios,D., Hermjakob,H., Pullman,B.S. et al. (2020) The ProteomeXchange consortium in 2020: Vizcaino,J.A. and Jones,A.R. (2012) jmzIdentML API: a Java enabling ‘big data’ approaches in proteomics. Nucleic Acids Res., 48, interface to the mzIdentML standard for peptide and protein D1145–D1152. identification data. Proteomics, 12, 790–794. 4. Ternent,T., Csordas,A., Qi,D., Gomez-Baena,G., Beynon,R.J., 23. Perez-Riverol,Y., Uszkoreit,J., Sanchez,A., Ternent,T., Del Toro,N., Jones,A.R., Hermjakob,H. and Vizcaino,J.A. (2014) How to submit Hermjakob,H., Vizcaino,J.A. and Wang,R. (2015) ms-data-core-api: MS proteomics data to ProteomeXchange via the PRIDE database. an open-source, metadata-oriented library for computational Proteomics, 14, 2233–2241. proteomics. Bioinformatics, 31, 2903–2905. 5. Griss,J., Jones,A.R., Sachsenberg,T., Walzer,M., Gatto,L., Hartler,J., 24. Uszkoreit,J., Perez-Riverol,Y., Eggers,B., Marcus,K. and Thallinger,G.G., Salek,R.M., Steinbeck,C., Neuhauser,N. et al. Eisenacher,M. (2019) Protein inference using PIA workflows and PSI (2014) The mzTab data exchange format: communicating standard file formats. J. Proteome Res., 18, 741–747. 
mass-spectrometry-based proteomics and metabolomics experimental 25. Uszkoreit,J., Maerkens,A., Perez-Riverol,Y., Meyer,H.E., Marcus,K., results to a wider audience. Mol. Cell. Proteomics, 13, 2765–2775. Stephan,C., Kohlbacher,O. and Eisenacher,M. (2015) PIA: an 6. Vizcaino,J.A., Mayer,G., Perkins,S., Barsnes,H., Vaudel,M., intuitive protein inference engine with a web-based user interface. J. Perez-Riverol,Y., Ternent,T., Uszkoreit,J., Eisenacher,M., Fischer,L. Proteome Res., 14, 2988–2997. et al. (2017) The mzIdentML Data Standard Version 1.2, Supporting 26. Perkins,D.N., Pappin,D.J., Creasy,D.M. and Cottrell,J.S. (1999) Advances in Proteome Informatics. Mol. Cell. Proteomics, 16, Probability-based protein identification by searching sequence 1275–1285. databases using mass spectrometry data. Electrophoresis, 20, 7. Martens,L., Chambers,M., Sturm,M., Kessner,D., Levander,F., 3551–3567. Shofstahl,J., Tang,W.H., Rompp,A., Neumann,S., Pizarro,A.D. et al. 27. Cox,J. and Mann,M. (2008) MaxQuant enables high peptide (2011) mzML–a community standard for mass spectrometry data. identification rates, individualized p.p.b.-range mass accuracies and Mol. Cell. Proteomics, 10, R110 000133. proteome-wide protein quantification. Nat. Biotechnol., 26, 8. Vizcaino,J.A., Deutsch,E.W., Wang,R., Csordas,A., Reisinger,F., 1367–1372. Rios,D., Dianes,J.A., Sun,Z., Farrah,T., Bandeira,N. et al. (2014) 28. Pfeuffer,J., Sachsenberg,T., Alka,O., Walzer,M., Fillbrunn,A., ProteomeXchange provides globally coordinated proteomics data Nilse,L., Schilling,O., Reinert,K. and Kohlbacher,O. (2017) submission and dissemination. Nat. Biotechnol., 32, 223–226. OpenMS–a platform for reproducible analysis of mass spectrometry 9. Perez-Riverol,Y., Xu,Q.W., Wang,R., Uszkoreit,J., Griss,J., data. J. Biotechnol., 261, 142–148. Sanchez,A., Reisinger,F., Csordas,A., Ternent,T., Del-Toro,N. et al. 29. 
Sinitcyn,P., Hamzeiy,H., Salinas Soto,F., Itzhak,D., McCarthy,F., (2016) PRIDE Inspector Toolsuite: moving toward a universal Wichmann,C., Steger,M., Ohmayer,U., Distler,U., visualization tool for proteomics data standard formats and quality Kaspar-Schoenefeld,S. et al. (2021) MaxDIA enables library-based assessment of ProteomeXchange datasets. Mol. Cell. Proteomics, 15, and library-free data-independent acquisition proteomics. Nat. 305–317. Biotechnol., https://doi.org/10.1038/s41587-021-00968-7. 10. Yates,A.D., Achuthan,P., Akanni,W., Allen,J., Allen,J., 30. Perez-Riverol,Y., Ternent,T., Koch,M., Barsnes,H., Vrousgou,O., Alvarez-Jarreta,J., Amode,M.R., Armean,I.M., Azov,A.G., Jupp,S. and Vizcaino,J.A. (2017) OLS client and OLS dialog: open Bennett,R. et al. (2020) Ensembl 2020. Nucleic Acids Res., 48, source tools to annotate public omics datasets. Proteomics, 17, D682–D688. 11. UniProt,C. (2021) UniProt: the universal protein knowledgebase in 31. Mischak,H., Apweiler,R., Banks,R.E., Conaway,M., Coon,J., 2021. Nucleic Acids Res., 49, D480–D489. Dominiczak,A., Ehrich,J.H., Fliser,D., Girolami,M., Hermjakob,H. 12. Papatheodorou,I., Moreno,P., Manning,J., Fuentes,A.M., George,N., et al. (2007) Clinical proteomics: a need to define the field and to Fexova,S., Fonseca,N.A., Fullgrabe,A., Green,M., Huang,N. et al. begin to set adequate standards. Proteomics Clin Appl, 1, 148–156. (2020) Expression Atlas update: from tissues to single cells. Nucleic 32. Griss,J., Perez-Riverol,Y., Hermjakob,H. and Vizcaino,J.A. (2015) Acids Res., 48, D77–D83. Identifying novel biomarkers through data mining-a realistic 13. Deutsch,E.W., Lam,H. and Aebersold,R. (2008) PeptideAtlas: a scenario? Proteomics Clin. Appl., 9, 437–443. resource for target selection for emerging targeted proteomics 33. Perez-Riverol,Y. and European Bioinformatics Community for Mass, workflows. EMBO Rep., 9, 429–434. S. (2020) Toward a sample metadata standard in public proteomics 14. 
Farrah,T., Deutsch,E.W., Kreisberg,R., Sun,Z., Campbell,D.S., repositories. J. Proteome Res., 19, 3906–3909. Mendoza,L., Kusebauch,U., Brusniak,M.Y., Huttenhain,R., 34. Dai,C., Fullgrabe,A., Pfeuffer,J., Solovyeva,E.M., Deng,J., Schiess,R. et al. (2012) PASSEL: the PeptideAtlas SRMexperiment Moreno,P., Kamatchinathan,S., Kundu,D.J., George,N., Fexova,S. library. Proteomics, 12, 1170–1175. D552 Nucleic Acids Research, 2022, Vol. 50, Database issue et al. (2021) A proteomics sample metadata representation for 47. Omenn,G.S., Lane,L., Overall,C.M., Cristea,I.M., Corrales,F.J., multiomics integration and big data analysis. Nat. Commun., 12, Lindskog,C., Paik,Y.K., Van Eyk,J.E., Liu,S., Pennington,S.R. et al. 5854. (2020) Research on the human proteome reaches a major milestone: 35. Rayner,T.F., Rocca-Serra,P., Spellman,P.T., Causton,H.C., Farne,A., >90% of predicted human proteins now credibly detected, according Holloway,E., Irizarry,R.A., Liu,J., Maier,D.S., Miller,M. et al. (2006) to the HUPO human proteome project. J. Proteome Res., 19, A simple spreadsheet-based, MIAME-supportive format for 4735–4746. microarray data: MAGE-TAB. BMC Bioinformatics, 7, 489. 48. Mitchell,A.L., Almeida,A., Beracochea,M., Boland,M., Burgin,J., 36. Gostev,M., Faulconbridge,A., Brandizi,M., Fernandez-Banet,J., Cochrane,G., Crusoe,M.R., Kale,V., Potter,S.C., Richardson,L.J. Sarkans,U., Brazma,A. and Parkinson,H. (2012) The BioSample et al. (2020) MGnify: the microbiome analysis resource in 2020. Database (BioSD) at the European Bioinformatics Institute. Nucleic Nucleic Acids Res., 48, D570–D578. Acids Res., 40, D64–D70. 49. Umer,H.M., Zhu,Y., Pfeuffer,J., Sachsenberg,T., Lehtio,J ¨ ., Branca,R. 37. Schmidt,T., Samaras,P., Dorfer,V., Panse,C., Kockmann,T., and Perez-Riverol,Y. (2021) Generation of ENSEMBL-based Bichmann,L., van Puyvelde,B., Perez-Riverol,Y., Deutsch,E.W., proteogenomics databases boosts the identification of non-canonical Kuster,B. et al. 
(2021) Universal spectrum explorer: a standalone peptides. bioRxiv doi: https://doi.org/10.1101/2021.06.08.447496,09 (web-)application for cross-resource spectrum comparison. J. June 2021, preprint: not peer reviewed. Proteome Res., 20, 3388–3394. 50. Watkins,X., Garcia,L.J., Pundir,S., Martin,M.J. and UniProt,C. 38. Griss,J., Perez-Riverol,Y., Lewis,S., Tabb,D.L., Dianes,J.A., (2017) ProtVista: visualization of protein sequence annotations. Del-Toro,N., Rurik,M., Walzer,M.W., Kohlbacher,O., Hermjakob,H. Bioinformatics, 33, 2040–2041. et al. (2016) Recognizing millions of consistently unidentified spectra 51. Ochoa,D., Jarnuczak,A.F., Vieitez,C., Gehre,M., Soucheray,M., across hundreds of shotgun proteomics datasets. Nat. Methods, 13, Mateus,A., Kleefeldt,A.A., Hill,A., Garcia-Alonso,L., Stein,F. et al. 651–656. (2020) The functional landscape of the human phosphoproteome. 39. Qin,C., Luo,X., Deng,C., Shu,K., Zhu,W., Griss,J., Hermjakob,H., Nat. Biotechnol., 38, 365–373. Bai,M. and Perez-Riverol,Y. (2021) Deep learning embedder method 52. Jarnuczak,A.F., Najgebauer,H., Barzine,M., Kundu,D.J., and tool for mass spectra similarity search. J. Proteomics, 232, Ghavidel,F., Perez-Riverol,Y., Papatheodorou,I., Brazma,A. and 104070. Vizcaino,J.A. (2021) An integrated landscape of protein expression in 40. Bittremieux,W., Laukens,K., Noble,W.S. and Dorrestein,P.C. (2021) human cancer. Sci Data, 8, 115. 53. Walzer,M., Garc´ıa-Seisdedos,D., Prakash,A., Brack,P., Crowther,P., Large-scale tandem mass spectrum clustering using fast nearest Graham,R.L., George,N., Mohammed,S., Moreno,P., neighbor searching. Rapid Commun. Mass Spectrom., e9153, Papathedourou,I. et al. (2021) Implementing the re-use of public DIA https://doi.org/10.1002/rcm.9153. proteomics datasets: from the PRIDE database to Expression Atlas. 41. Cook,C.E., Stroe,O., Cochrane,G., Birney,E. and Apweiler,R. 
(2020) bioRxiv doi: https://doi.org/10.1101/2021.06.08.447493, 09 June The European Bioinformatics Institute in 2020: building a global 2021, preprint: not peer reviewed. infrastructure of interconnected data resources for the life sciences. 54. Bandeira,N., Deutsch,E.W., Kohlbacher,O., Martens,L. and Nucleic Acids Res., 48, D17–D23. Vizcaino,J.A. (2021) Data management of sensitive human 42. Harrison,P.W., Lopez,R., Rahman,N., Allen,S.G., Aslam,R., proteomics data: current practices, recommendations, and Buso,N., Cummins,C., Fathy,Y., Felix,E., Glont,M. et al. (2021) The perspectives for the future. Mol. Cell. Proteomics, 20, 100071. COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 55. Keane,T.M., O’Donovan,C. and Vizca´ıno,J.A. (2021) The growing research through rapid open access data sharing. Nucleic Acids Res., 49, W619–W623. need for controlled data access models in clinical proteomics and 43. Brunet,M.A., Lucier,J.F., Levesque,M., Leblanc,S., Jacques,J.F., metabolomics. Nat. Commun., 12, 5787. Al-Saedi,H.R.H., Guilloy,N., Grenier,F., Avino,M., Fournier,I. et al. 56. Leitner,A., Bonvin,A., Borchers,C.H., Chalkley,R.J., (2021) OpenProt 2021: deeper functional annotation of the coding Chamot-Rooke,J., Combe,C.W., Cox,J., Dong,M.Q., Fischer,L., potential of eukaryotic genomes. Nucleic Acids Res., 49, D380–D388. Gotze,M. et al. (2020) Toward increased reliability, transparency, and 44. Shao,X., Taha,I.N., Clauser,K.R., Gao,Y.T. and Naba,A. (2020) accessibility in cross-linking mass spectrometry. Structure, 28, MatrisomeDB: the ECM-protein knowledge database. Nucleic Acids 1259–1268. Res., 48, D1136–D1144. 57. Bai,J., Bandla,C., Guo,J., Vera Alvarez,R., Bai,M., Vizcaino,J.A., 45. Ramasamy,P., Turan,D., Tichshenko,N., Hulstaert,N., Moreno,P., Gruning,B., Sallou,O. and Perez-Riverol,Y. (2021) Vandermarliere,E., Vranken,W. and Martens,L. 
(2020) Scop3P: a BioContainers Registry: searching bioinformatics and proteomics comprehensive resource of human phosphosites within their full tools, packages, and containers. J. Proteome Res., 20, 2056–2061. context. J. Proteome Res., 19, 3478–3486. 58. Perez-Riverol,Y. and Moreno,P. (2020) Scalable data analysis in 46. Kustatscher,G., Grabowski,P., Schrader,T.A., Passmore,J.B., proteomics and metabolomics using BioContainers and workflows Schrader,M. and Rappsilber,J. (2019) Co-regulation map of the engines. Proteomics, 20, e1900147. human proteome enables identification of protein functions. Nat. Biotechnol., 37, 1361–1371. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press



Publisher
Oxford University Press
Copyright
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.
ISSN
0305-1048
eISSN
1362-4962
DOI
10.1093/nar/gkab1038

clustering algorithms have been recently improved using The PRIDE Spectra Archive (https://www.ebi.ac.uk/ deep-learning models to avoid all the comparisons between pride/archive/spectra) provides access to over 540 million all the spectra in the data (39,40), applications of these PSMs (as of September 2021) originally submitted to novel algorithms in large-scale data repositories have not PRIDE Archive. Users can search by peptide sequences and yet been implemented. USIs, enabling them to find specific PSMs from complete Instead of spectrum clustering, a novel platform and submitted datasets. A list of PSMs is shown after the search, algorithm (https://github.com/bigbio/sparkms) have been including peptide sequences, PTMs, search engine scores, used to select the best-peptide evidence for each peptide and charges, and two additional columns that highlight whether protein combination. The best peptide is selected based on the PSM has passed or not the original analysis thresh- two rules: (i) the peptide passes the peptide FDR thresh- old and PRIDE internal pipelines thresholds––for example, old for the assay; and (ii) the peptide sequence is longer PSM false discovery rate (FDR) <0.1 computed using the than seven amino acids. The sparkMS (https://github.com/ PIA algorithm (24,25). The accession column in the result bigbio/sparkms) used Spark (https://spark.apache.org/)and table provides a direct link to the project result page, where PySpark to group millions of PSMs in less than 6 hours, users can check all the results for a given dataset. which enabled the data analysis of such a large-scale amount of data. The PRIDE Peptidome web interface enables users PRIDE Peptidome: a condensed view of peptide evidences to search by peptide sequence and protein accession across PRIDE Archive numbers (e.g. https://www.ebi.ac.uk/pride/peptidome/ peptidesearch?keyword=SPTB2 HUMAN). 
The search PRIDE Peptidome (https://www.ebi.ac.uk/pride/ table shows the sequence for each peptide, protein acces- peptidome/) is a resource that groups all PSMs by peptide sion, the number of PSMs across PRIDE Archive, the sequence and the corresponding protein accession. Until number of datasets where this peptide has been identified recently, the grouping was performed using a spectrum and the best posterior error probability (PEP), as computed clustering approach (38). However, this approach pre- by PIA (25). When a given peptide-protein combination sented major challenges because each spectrum needed to is selected, the peptide viewer shows the sequence, the be compared between each other, prompting performance D548 Nucleic Acids Research, 2022, Vol. 50, Database issue spectrum that justifies the best scored PSM, the list of of purposes. For instance, recent resources that have been all PTMs identified, and the corresponding tissues and started by reusing mostly PRIDE public datasets include diseases where the peptide was identified (e.g. https:// OpenProt (43), MatrisomeDB (44), Scop3P (45) and Pro- www.ebi.ac.uk/pride/peptidome/peptidedetails?keyword= teomeHD (46). Additionally, as just one among many ex- DASVAEAWLLGQEPYLSSR&proteinAccession= amples of high-profile data reuse, PRIDE datasets are rou- SPTB2 HUMAN). tinely reanalyzed in the context of the Human Proteome Project (47). Figure 5A shows the increase in volumes of PRIDE ARCHIVE SUBMISSION STATISTICS data downloaded from PRIDE Archive since 2013. Re- cently, PRIDE has started to track the reuse of public As of 1 August 2021, PRIDE Archive stored 23 168 PRIDE datasets in publications. This information (if ap- datasets––compared to the 10 100 datasets available on Au- plicable) is available in the dataset web page when click- gust 2018 (2)––, which means that 56.4% of the data in ing on the term ‘Dataset reuses’. Figure 5B shows the in- PRIDE Archive has been submitted in the last 3 years. 
crease in manuscripts (including pre-prints) published per Figure 4 shows the distribution of submissions by month, year, where PRIDE datasets are reused. species, and disease in PRIDE Archive since 2012, and the Rather than in the creation of new resources, for sustain- cumulative size of PRIDE Archive data in terabytes. ability reasons, our focus in-house has been put in dissemi- In 2019, PRIDE Archive received 314 datasets per month nating and integrating PRIDE proteomics data into added- on average, 436 during 2020, and so far in 2021, this number value EMBL-EBI resources such as UniProt (11), Ensembl has grown to 499 datasets on average (Figure 4A), which af- (10), and Expression Atlas (12). Additionally, we have just firms the increasing huge demand and growing prominence started in the first steps of the work required to dissemi- of PRIDE. At the time of writing, PRIDE hosts∼83% of all nate and integrate metaproteomics data into MGnify (48), PX datasets, coming from >8 000 research groups, from 66 an EMBL-EBI resource for the analysis, archiving, and countries. The number of submitted datasets that are now browsing of metagenomic and metatranscriptomic data. publicly available is currently 64%, reflecting an improve- The dissemination of public proteomics data into differ- ment of around 8% when compared with 2019. With this ent resources has different goals depending on each specific aim in mind, the team has developed multiple mechanisms resource but can be grouped in three main categories: (i) to detect datasets already published that have not been re- provide aggregated peptide/protein evidences as originally ported to PRIDE by the original submitters. 
As a concrete submitted to PRIDE Archive, in the case of UniProt and example, submitters can report via the PRIDE web inter- Ensembl; (ii) provide peptide/protein evidences, variant se- face datasets that have already a corresponding manuscript quences and PTM information from reanalyzed datasets to published, if the dataset is still private. The size of PRIDE UniProt, Ensembl and in the near future, to MGNify. In Archive data has doubled from 2019 to 2021 (Figure 4B). As this case, an open analysis pipeline is used, including well- a result, PRIDE Archive is the third-largest omics Archive defined quality control metrics; and (iii) provide quantita- at EMBL-EBI only exceeded by the genomics resources tive protein expression information into Expression Atlas, ENA (European Nucleotide Archive) and EGA (European using data coming from reanalyzed datasets. Genome-phenome Archive) (41). As of September 2021, the majority of data in PRIDE Archive (including both public and private datasets) are In-house data reuse: proteogenomics reanalysis integration human datasets (including cell lines) (39.1%), followed with Ensembl by mouse (13.7%), Saccharomyces cerevisiae (2.8%), Ara- Since 2019, PRIDE has started to provide peptide evidences bidopsis thaliana (2.7%), Rattus norvegicus (2.5%) and Es- to Ensembl using the ‘TrackHub’ registry (2). More than cherichia coli (2.3%). Whereas most of the datasets come 4 million canonical peptide sequences, coming from 184 from model organisms, overall, datasets coming from PRIDE public datasets, have been disseminated into En- >3 224 different taxonomy identifiers are stored in PRIDE sembl ‘TrackHubs’ which are available at https://ftp.pride. Archive (Supplementary File S1). ebi.ac.uk/pride/data/proteogenomics/latest/archive/. 
The number of submitted datasets split by tissues and Some obvious benefits of integrating genomics and pro- diseases are more heterogeneous (Figure 4C and D), be- teomics data in genome browsers include linking somatic ing ‘cell-culture (non-specific tissue)’, and ‘disease-free variants and MS evidences and/or gene sequences and (healthy/normal samples)’ the most predominant anno- PTMs. Recently, we developed a group of tools and work- tations. Altogether, cancer is the most studied disease flows to enable large-scale reanalysis of public proteomics followed by Alzheimer’s and Parkinson’s disease. Impor- data to identify non-canonical peptides (49). Using cus- tantly, as of September 2021, more than 180 COVID-19 tom proteogenomics databases created with pgdb (https: related datasets have been submitted to PRIDE Archive. //github.com/nf-core/pgdb) and the pypgatk (https://github. These datasets, once they become publicly available, are com/bigbio/py-pgatk) we have managed to identify 43 501 integrated into the EMBL-EBI resource COVID-19 Data non-canonical peptides and 786 variant peptide sequences Portal (https://www.covid19dataportal.org/), enabling re- in four public datasets. searchers to access all public data at EMBL-EBI resources in a unified interface ( 42). In-house data reuse: data dissemination into UniProt PRIDE ARCHIVE AS A HUB OF MS EVIDENCES Aggregated high-quality evidences (as submitted to PRIDE Proteomics researchers are increasingly reusing public data Archive) are linked to UniProt enabling users to check from PRIDE (and other PX resources) for a broad range whether one particular protein has been seen detected in Nucleic Acids Research, 2022, Vol. 50, Database issue D549 Figure 4. (A) Number of submitted datasets to PRIDE Archive per month (from the beginning of PX in 2012 till August 2021); (B) cumulative size of PRIDE Archive data since 2012; (C) number of submitted datasets per species or taxonomy identifier (as of August 2021). 
All species that had less than 100 datasets are grouped in one category; (D) distribution of the number of submitted datasets to PRIDE Archive per annotated disease. PRIDE Archive. As part of an ongoing effort, we are cur- of writing. Most of them are DDA label-free datasets, in- rently aiming to link all peptide evidences from PRIDE Pep- volving cell lines and tumor samples (52), and baseline tidome to populate the UniProt ProtVista viewer (50). tissue datasets coming from human, mouse and rat sam- Additionally, we are currently working in the develop- ples. MaxQuant was used as the analysis software in all ment of infrastructure to reanalyse in a reliable manner, cases. Additionally, ten SWATH-MS DIA datasets coming store, visualize and disseminate PTM data (starting with mainly from cell line and human tumor samples have also phosphorylation) from PRIDE into UniProt. This is taking already been re-analysed and integrated into Expression At- place in the context of the ‘PTMeXchange’ project, in col- las. In this case, an in-house open analysis pipeline based laboration with the PeptideAtlas team and the University of on OpenSWATH (https://github.com/PRIDE-reanalysis/ Liverpool. Previously to this more systematic effort, we re- DIA-reanalysis) was developed and used for the re-analysis analysed 112 human phospho-enriched datasets, generated (53). These datasets constituted a pilot project to study the from 104 different human cell types or tissues (51). Using a feasibility of performing a systematic reanalysis and inte- machine learning approach, some of the generated informa- gration of DIA datasets. Expression Atlas users can now tion from the reanalysis together with other sequence fea- access more comprehensively proteomics expression infor- tures were used to create a single functional score for human mation in the same interface as gene expression, providing phosphosites. 
an effective manner of integrating the results of transcrip- tomics and proteomics experiments. In-house data reuse: integration of quantitative analyses in Expression Atlas DISCUSSION AND FUTURE PLANS More than 65 quantitative datasets have been annotated, Data deposition and dissemination have changed the pro- reanalysed and the corresponding results have already or teomics community since the creation of PX almost 10 years are being integrated into Expression Atlas at the moment ago. Most of the proteomics journals require nowadays the D550 Nucleic Acids Research, 2022, Vol. 50, Database issue Addressing ethical issues for genomics and transcriptomics data led to processes to control who may access the data, so-called ‘controlled access’. Resources supporting the stor- age and dissemination of controlled access DNA/RNA se- quencing datasets include the EGA and others internation- ally such as dbGAP (USA) and the Japanese Genotype- phenotype Archive. At present, all data in PRIDE (and in all PX resources) is fully open. Therefore, there is an increas- ing number of clinical sensitive human datasets that cannot be made available via PRIDE due to ethical-related issues (55). To address this problem, we will be working in de- veloping a tailored infrastructure for sensitive human pro- teomics data, and in all the related policies. Additionally, in the context of data archiving activities, we plan to improve the support for cross-linking data - as outlined here (56)- and to provide better data integration for structural pro- teomics datasets between PRIDE Archive and the Protein Data Bank (PDB). As shown above, we are already working on developing open and reproducible data analysis pipelines for different flavours of proteomics workflows (e.g., DDA, DIA, pro- teogenomics) (49,53,57). The main rationale is to make pos- sible the use of that software in cloud infrastructures so that in the future the pipelines can be used by the community in Figure 5. 
(A) Volumes of PRIDE Archive data downloads per year, from the cloud using software container technologies (58). In ad- 2013 to 2020. (B) Number of manuscripts (including pre-prints) per year dition, we aim to increasingly perform in-house data reuse (2013–2021), where datasets from PRIDE Archive are reused. The figures from 2021 are estimated at the end of the year, according to the existing (including data re-analysis) and disseminate high-quality data at the end of September. It should be noted that the figures represent proteomics data from PRIDE into the already mentioned an underestimation since they only include those manuscripts that could added-value resources (Ensembl, UniProt, Expression At- be tracked successfully. las, and MGnify in the near future). In this context, we will also work in improving the PRIDE Archive infrastructure authors to deposit their data in a PX resource, which has en- to store dataset reanalyses appropriately, linking them to abled a better reproducibility and traceability of the claims the relevant resources. One aim is to further develop data reported in a given manuscript. The proteomics community dissemination and integration practices also involving re- is now widely embracing open data policies, an opposite sce- sources outside of EMBL-EBI. nario to the situation just a few years ago. At the same time, To finalize, we invite interested parties in PRIDE- public proteomics data are being increasingly reused with related developments to follow the PRIDE Twitter account multiple applications (1). We next outline some of the main (@pride ebi). For regular announcements of all the new working areas for PRIDE in the near future. publicly available datasets, users can follow the PX Twitter account (@proteomexchange). First of all, PRIDE is raising the bar of metadata an- notation for all submitted datasets. 
MAGE-TAB for pro- teomics has been created with the aim that every submitted SUPPLEMENTARY DATA dataset provides information about the sample and the ex- perimental design. The improvement in the annotation is Supplementary Data are available at NAR Online. also required to facilitate further data reuse for third par- ties. We expect that, gradually, the SDRF-Proteomics com- ACKNOWLEDGEMENTS ponent will be made required for every dataset submission, after the community understands and get a full idea of the We would like to thank all the members of the PRIDE file format and of the mandatory information that needs to Scientific Advisory Board during the period 2019 to 2021, be provided. Multiple materials (https://github.com/bigbio/ namely Ruedi Aebersold, Jurgen Cox, Pedro Cutillas, Con- proteomics-metadata-standard/wiki), including examples cha Gil, Juri Rappsilber and Hans Vissers. Finally, we and video tutorials, have been made available to better un- would like to thank all data submitters and collaborators derstand the file format and how it can be submitted to for their contributions. PRIDE Archive. With the growing importance of clinical proteomics, FUNDING i.e. in the context of multi-omics studies, another impor- tant area is the management of clinical sensitive human Wellcome [208391/Z/17/Z]; BBSRC grants ‘Proteomics proteomics data. Ethical issues in proteomics are start- DIA’ [BB/P024599/1], ‘PTMeXchange’ [BB/S01781X/1], ing to be discussed and becoming increasingly relevant. A ‘GRAPPA’ [BB/T019670/1]; UK-Japan Partnership award community-driven white paper on the topic has been re- [BB/N022440/1]; NIH ‘Proteomics Standards’ grant [R24 cently published describing the current state-of-the-art (54). GM127667-01]; EU H2020 project EPIC-XS [823839]; Nucleic Acids Research, 2022, Vol. 50, Database issue D551 Open Targets [OTAR-043]; Luxembourg National Re- 15. 
Choi,M., Carver,J., Chiva,C., Tzouros,M., Huang,T., Tsai,T.H., Pullman,B., Bernhardt,O.M., Huttenhain,R., Teo,G.C. et al. (2020) search Fund [C19/BM/13684739]; several ELIXIR Imple- MassIVE.quant: a community resource of quantitative mass mentation Studies and EMBL core funding; M.E. and spectrometry-based proteomics datasets. Nat. Methods, 17, 981–984. A.F.-Z. would like to acknowledge funding from de.NBI, a 16. Moriya,Y., Kawano,S., Okuda,S., Watanabe,Y., Matsumoto,M., project of the German Federal Ministry of Education and Takami,T., Kobayashi,D., Yamanouchi,Y., Araki,N., Yoshizawa,A.C. et al. (2019) The jPOST environment: an integrated proteomics data Research (BMBF) [FKZ 031 A 534A]; Center for Protein repository and database. Nucleic. Acids. Res., 47, D1218–D1224. Diagnostics (PPRODI), a grant of the Ministry of Inno- 17. Ma,J., Chen,T., Wu,S., Yang,C., Bai,M., Shu,K., Li,K., Zhang,G., vation, Science and Research of North-Rhine Westphalia, Jin,Z., He,F. et al. (2019) iProX: an integrated proteome resource. Germany. Funding for open access charge: Wellcome. Nucleic Acids Res., 47, D1211–D1217. 18. Sharma,V., Eckels,J., Schilling,B., Ludwig,C., Jaffe,J.D., Conflict of interest statement. None declared. MacCoss,M.J. and MacLean,B. (2018) Panorama public: a public repository for quantitative data sets processed in skyline. Mol. Cell. REFERENCES Proteomics, 17, 1239–1244. 19. Deutsch,E.W., Perez-Riverol,Y., Carver,J., Kawano,S., Mendoza,L., 1. Perez-Riverol,Y., Zorin,A., Dass,G., Vu,M.T., Xu,P., Glont,M., Van Den Bossche,T., Gabriels,R., Binz,P.A., Pullman,B., Sun,Z. et al. Vizcaino,J.A., Jarnuczak,A.F., Petryszak,R., Ping,P. et al. (2019) (2021) Universal Spectrum Identifier for mass spectra. Nat. Methods, Quantifying the impact of public omics data. Nat. Commun., 10, 18, 768–770. 20. Drysdale,R., Cook,C.E., Petryszak,R., Baillie-Gerritsen,V., 2. 
Perez-Riverol,Y., Csordas,A., Bai,J., Bernal-Llinares,M., Barlow,M., Gasteiger,E., Gruhl,F., Haas,J., Lanfear,J., Lopez,R. Hewapathirana,S., Kundu,D.J., Inuganti,A., Griss,J., Mayer,G., et al. (2020) The ELIXIR Core Data Resources: fundamental Eisenacher,M. et al. (2019) The PRIDE database and related tools infrastructure for the life sciences. Bioinformatics, 36, 2636–2642. and resources in 2019: improving support for quantification data. 21. Xu,Q.W., Griss,J., Wang,R., Jones,A.R., Hermjakob,H. and Nucleic Acids Res., 47, D442–D450. Vizcaino,J.A. (2014) jmzTab: a java interface to the mzTab data 3. Deutsch,E.W., Bandeira,N., Sharma,V., Perez-Riverol,Y., Carver,J.J., standard. Proteomics, 14, 1328–1332. Kundu,D.J., Garcia-Seisdedos,D., Jarnuczak,A.F., Hewapathirana,S., 22. Reisinger,F., Krishna,R., Ghali,F., Rios,D., Hermjakob,H., Pullman,B.S. et al. (2020) The ProteomeXchange consortium in 2020: Vizcaino,J.A. and Jones,A.R. (2012) jmzIdentML API: a Java enabling ‘big data’ approaches in proteomics. Nucleic Acids Res., 48, interface to the mzIdentML standard for peptide and protein D1145–D1152. identification data. Proteomics, 12, 790–794. 4. Ternent,T., Csordas,A., Qi,D., Gomez-Baena,G., Beynon,R.J., 23. Perez-Riverol,Y., Uszkoreit,J., Sanchez,A., Ternent,T., Del Toro,N., Jones,A.R., Hermjakob,H. and Vizcaino,J.A. (2014) How to submit Hermjakob,H., Vizcaino,J.A. and Wang,R. (2015) ms-data-core-api: MS proteomics data to ProteomeXchange via the PRIDE database. an open-source, metadata-oriented library for computational Proteomics, 14, 2233–2241. proteomics. Bioinformatics, 31, 2903–2905. 5. Griss,J., Jones,A.R., Sachsenberg,T., Walzer,M., Gatto,L., Hartler,J., 24. Uszkoreit,J., Perez-Riverol,Y., Eggers,B., Marcus,K. and Thallinger,G.G., Salek,R.M., Steinbeck,C., Neuhauser,N. et al. Eisenacher,M. (2019) Protein inference using PIA workflows and PSI (2014) The mzTab data exchange format: communicating standard file formats. J. Proteome Res., 18, 741–747. 
mass-spectrometry-based proteomics and metabolomics experimental 25. Uszkoreit,J., Maerkens,A., Perez-Riverol,Y., Meyer,H.E., Marcus,K., results to a wider audience. Mol. Cell. Proteomics, 13, 2765–2775. Stephan,C., Kohlbacher,O. and Eisenacher,M. (2015) PIA: an 6. Vizcaino,J.A., Mayer,G., Perkins,S., Barsnes,H., Vaudel,M., intuitive protein inference engine with a web-based user interface. J. Perez-Riverol,Y., Ternent,T., Uszkoreit,J., Eisenacher,M., Fischer,L. Proteome Res., 14, 2988–2997. et al. (2017) The mzIdentML Data Standard Version 1.2, Supporting 26. Perkins,D.N., Pappin,D.J., Creasy,D.M. and Cottrell,J.S. (1999) Advances in Proteome Informatics. Mol. Cell. Proteomics, 16, Probability-based protein identification by searching sequence 1275–1285. databases using mass spectrometry data. Electrophoresis, 20, 7. Martens,L., Chambers,M., Sturm,M., Kessner,D., Levander,F., 3551–3567. Shofstahl,J., Tang,W.H., Rompp,A., Neumann,S., Pizarro,A.D. et al. 27. Cox,J. and Mann,M. (2008) MaxQuant enables high peptide (2011) mzML–a community standard for mass spectrometry data. identification rates, individualized p.p.b.-range mass accuracies and Mol. Cell. Proteomics, 10, R110 000133. proteome-wide protein quantification. Nat. Biotechnol., 26, 8. Vizcaino,J.A., Deutsch,E.W., Wang,R., Csordas,A., Reisinger,F., 1367–1372. Rios,D., Dianes,J.A., Sun,Z., Farrah,T., Bandeira,N. et al. (2014) 28. Pfeuffer,J., Sachsenberg,T., Alka,O., Walzer,M., Fillbrunn,A., ProteomeXchange provides globally coordinated proteomics data Nilse,L., Schilling,O., Reinert,K. and Kohlbacher,O. (2017) submission and dissemination. Nat. Biotechnol., 32, 223–226. OpenMS–a platform for reproducible analysis of mass spectrometry 9. Perez-Riverol,Y., Xu,Q.W., Wang,R., Uszkoreit,J., Griss,J., data. J. Biotechnol., 261, 142–148. Sanchez,A., Reisinger,F., Csordas,A., Ternent,T., Del-Toro,N. et al. 29. 
Sinitcyn,P., Hamzeiy,H., Salinas Soto,F., Itzhak,D., McCarthy,F., (2016) PRIDE Inspector Toolsuite: moving toward a universal Wichmann,C., Steger,M., Ohmayer,U., Distler,U., visualization tool for proteomics data standard formats and quality Kaspar-Schoenefeld,S. et al. (2021) MaxDIA enables library-based assessment of ProteomeXchange datasets. Mol. Cell. Proteomics, 15, and library-free data-independent acquisition proteomics. Nat. 305–317. Biotechnol., https://doi.org/10.1038/s41587-021-00968-7. 10. Yates,A.D., Achuthan,P., Akanni,W., Allen,J., Allen,J., 30. Perez-Riverol,Y., Ternent,T., Koch,M., Barsnes,H., Vrousgou,O., Alvarez-Jarreta,J., Amode,M.R., Armean,I.M., Azov,A.G., Jupp,S. and Vizcaino,J.A. (2017) OLS client and OLS dialog: open Bennett,R. et al. (2020) Ensembl 2020. Nucleic Acids Res., 48, source tools to annotate public omics datasets. Proteomics, 17, D682–D688. 11. UniProt,C. (2021) UniProt: the universal protein knowledgebase in 31. Mischak,H., Apweiler,R., Banks,R.E., Conaway,M., Coon,J., 2021. Nucleic Acids Res., 49, D480–D489. Dominiczak,A., Ehrich,J.H., Fliser,D., Girolami,M., Hermjakob,H. 12. Papatheodorou,I., Moreno,P., Manning,J., Fuentes,A.M., George,N., et al. (2007) Clinical proteomics: a need to define the field and to Fexova,S., Fonseca,N.A., Fullgrabe,A., Green,M., Huang,N. et al. begin to set adequate standards. Proteomics Clin Appl, 1, 148–156. (2020) Expression Atlas update: from tissues to single cells. Nucleic 32. Griss,J., Perez-Riverol,Y., Hermjakob,H. and Vizcaino,J.A. (2015) Acids Res., 48, D77–D83. Identifying novel biomarkers through data mining-a realistic 13. Deutsch,E.W., Lam,H. and Aebersold,R. (2008) PeptideAtlas: a scenario? Proteomics Clin. Appl., 9, 437–443. resource for target selection for emerging targeted proteomics 33. Perez-Riverol,Y. and European Bioinformatics Community for Mass, workflows. EMBO Rep., 9, 429–434. S. (2020) Toward a sample metadata standard in public proteomics 14. 
Farrah,T., Deutsch,E.W., Kreisberg,R., Sun,Z., Campbell,D.S., Mendoza,L., Kusebauch,U., Brusniak,M.Y., Huttenhain,R., Schiess,R. et al. (2012) PASSEL: the PeptideAtlas SRMexperiment library. Proteomics, 12, 1170–1175.
repositories. J. Proteome Res., 19, 3906–3909.
34. Dai,C., Fullgrabe,A., Pfeuffer,J., Solovyeva,E.M., Deng,J., Moreno,P., Kamatchinathan,S., Kundu,D.J., George,N., Fexova,S. et al. (2021) A proteomics sample metadata representation for multiomics integration and big data analysis. Nat. Commun., 12, 5854.
35. Rayner,T.F., Rocca-Serra,P., Spellman,P.T., Causton,H.C., Farne,A., Holloway,E., Irizarry,R.A., Liu,J., Maier,D.S., Miller,M. et al. (2006) A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics, 7, 489.
36. Gostev,M., Faulconbridge,A., Brandizi,M., Fernandez-Banet,J., Sarkans,U., Brazma,A. and Parkinson,H. (2012) The BioSample Database (BioSD) at the European Bioinformatics Institute. Nucleic Acids Res., 40, D64–D70.
37. Schmidt,T., Samaras,P., Dorfer,V., Panse,C., Kockmann,T., Bichmann,L., van Puyvelde,B., Perez-Riverol,Y., Deutsch,E.W., Kuster,B. et al. (2021) Universal spectrum explorer: a standalone (web-)application for cross-resource spectrum comparison. J. Proteome Res., 20, 3388–3394.
38. Griss,J., Perez-Riverol,Y., Lewis,S., Tabb,D.L., Dianes,J.A., Del-Toro,N., Rurik,M., Walzer,M.W., Kohlbacher,O., Hermjakob,H. et al. (2016) Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods, 13, 651–656.
39. Qin,C., Luo,X., Deng,C., Shu,K., Zhu,W., Griss,J., Hermjakob,H., Bai,M. and Perez-Riverol,Y. (2021) Deep learning embedder method and tool for mass spectra similarity search. J. Proteomics, 232, 104070.
40. Bittremieux,W., Laukens,K., Noble,W.S. and Dorrestein,P.C. (2021) Large-scale tandem mass spectrum clustering using fast nearest neighbor searching. Rapid Commun. Mass Spectrom., e9153, https://doi.org/10.1002/rcm.9153.
41. Cook,C.E., Stroe,O., Cochrane,G., Birney,E. and Apweiler,R. (2020) The European Bioinformatics Institute in 2020: building a global infrastructure of interconnected data resources for the life sciences. Nucleic Acids Res., 48, D17–D23.
42. Harrison,P.W., Lopez,R., Rahman,N., Allen,S.G., Aslam,R., Buso,N., Cummins,C., Fathy,Y., Felix,E., Glont,M. et al. (2021) The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing. Nucleic Acids Res., 49, W619–W623.
43. Brunet,M.A., Lucier,J.F., Levesque,M., Leblanc,S., Jacques,J.F., Al-Saedi,H.R.H., Guilloy,N., Grenier,F., Avino,M., Fournier,I. et al. (2021) OpenProt 2021: deeper functional annotation of the coding potential of eukaryotic genomes. Nucleic Acids Res., 49, D380–D388.
44. Shao,X., Taha,I.N., Clauser,K.R., Gao,Y.T. and Naba,A. (2020) MatrisomeDB: the ECM-protein knowledge database. Nucleic Acids Res., 48, D1136–D1144.
45. Ramasamy,P., Turan,D., Tichshenko,N., Hulstaert,N., Vandermarliere,E., Vranken,W. and Martens,L. (2020) Scop3P: a comprehensive resource of human phosphosites within their full context. J. Proteome Res., 19, 3478–3486.
46. Kustatscher,G., Grabowski,P., Schrader,T.A., Passmore,J.B., Schrader,M. and Rappsilber,J. (2019) Co-regulation map of the human proteome enables identification of protein functions. Nat. Biotechnol., 37, 1361–1371.
47. Omenn,G.S., Lane,L., Overall,C.M., Cristea,I.M., Corrales,F.J., Lindskog,C., Paik,Y.K., Van Eyk,J.E., Liu,S., Pennington,S.R. et al. (2020) Research on the human proteome reaches a major milestone: >90% of predicted human proteins now credibly detected, according to the HUPO human proteome project. J. Proteome Res., 19, 4735–4746.
48. Mitchell,A.L., Almeida,A., Beracochea,M., Boland,M., Burgin,J., Cochrane,G., Crusoe,M.R., Kale,V., Potter,S.C., Richardson,L.J. et al. (2020) MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res., 48, D570–D578.
49. Umer,H.M., Zhu,Y., Pfeuffer,J., Sachsenberg,T., Lehtiö,J., Branca,R. and Perez-Riverol,Y. (2021) Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides. bioRxiv doi: https://doi.org/10.1101/2021.06.08.447496, 09 June 2021, preprint: not peer reviewed.
50. Watkins,X., Garcia,L.J., Pundir,S., Martin,M.J. and UniProt,C. (2017) ProtVista: visualization of protein sequence annotations. Bioinformatics, 33, 2040–2041.
51. Ochoa,D., Jarnuczak,A.F., Vieitez,C., Gehre,M., Soucheray,M., Mateus,A., Kleefeldt,A.A., Hill,A., Garcia-Alonso,L., Stein,F. et al. (2020) The functional landscape of the human phosphoproteome. Nat. Biotechnol., 38, 365–373.
52. Jarnuczak,A.F., Najgebauer,H., Barzine,M., Kundu,D.J., Ghavidel,F., Perez-Riverol,Y., Papatheodorou,I., Brazma,A. and Vizcaino,J.A. (2021) An integrated landscape of protein expression in human cancer. Sci Data, 8, 115.
53. Walzer,M., García-Seisdedos,D., Prakash,A., Brack,P., Crowther,P., Graham,R.L., George,N., Mohammed,S., Moreno,P., Papatheodorou,I. et al. (2021) Implementing the re-use of public DIA proteomics datasets: from the PRIDE database to Expression Atlas. bioRxiv doi: https://doi.org/10.1101/2021.06.08.447493, 09 June 2021, preprint: not peer reviewed.
54. Bandeira,N., Deutsch,E.W., Kohlbacher,O., Martens,L. and Vizcaino,J.A. (2021) Data management of sensitive human proteomics data: current practices, recommendations, and perspectives for the future. Mol. Cell. Proteomics, 20, 100071.
55. Keane,T.M., O’Donovan,C. and Vizcaíno,J.A. (2021) The growing need for controlled data access models in clinical proteomics and metabolomics. Nat. Commun., 12, 5787.
56. Leitner,A., Bonvin,A., Borchers,C.H., Chalkley,R.J., Chamot-Rooke,J., Combe,C.W., Cox,J., Dong,M.Q., Fischer,L., Gotze,M. et al. (2020) Toward increased reliability, transparency, and accessibility in cross-linking mass spectrometry. Structure, 28, 1259–1268.
57. Bai,J., Bandla,C., Guo,J., Vera Alvarez,R., Bai,M., Vizcaino,J.A., Moreno,P., Gruning,B., Sallou,O. and Perez-Riverol,Y. (2021) BioContainers Registry: searching bioinformatics and proteomics tools, packages, and containers. J. Proteome Res., 20, 2056–2061.
58. Perez-Riverol,Y. and Moreno,P. (2020) Scalable data analysis in proteomics and metabolomics using BioContainers and workflows engines. Proteomics, 20, e1900147.

Journal

Nucleic Acids Research, Oxford University Press

Published: Nov 1, 2021
