TY - JOUR
T1 - Discrimination of non-protein-coding transcripts from protein-coding mRNA
AU - Frith, Martin C.
AU - Bailey, Timothy L.
AU - Kasukawa, Takeya
AU - Mignone, Flavio
AU - Kummerfeld, Sarah K.
AU - Madera, Martin
AU - Sunkara, Sirisha
AU - Furuno, Masaaki
AU - Bult, Carol J.
AU - Quackenbush, John
AU - Kai, Chikatoshi
AU - Kawai, Jun
AU - Carninci, Piero
AU - Hayashizaki, Yoshihide
AU - Pesole, Graziano
AU - Mattick, John S.
N1 - Funding Information:
We are grateful to Jinfeng Liu for providing the Swiss-Prot dataset, and to Julian Gough, Ken Pang and Pär Engstrom for helpful advice. This work was funded by a Research Grant for the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government to Y.H., a Research Grant for Advanced and Innovational Research Program in Life Science to Y.H., a grant of the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology, Japan to Y.H., a Grant for the Strategic Programs for R&D of RIKEN to Y.H., and Research Grants for Preventure Program C of the Japan Science and Technology Agency (JST) to Y.H. J.S.M. and T.L.B. are supported by the Queensland State Government and the Australian Research Council. M.C.F. is a University of Queensland Postdoctoral Fellow, and J.S.M. is a Federation Fellow of the Australian Research Council.
PY - 2006
Y1 - 2006
AB - Several recent studies indicate that mammals and other organisms produce large numbers of RNA transcripts that do not correspond to known genes. It has been suggested that these transcripts do not encode proteins, but may instead function as RNAs. However, discrimination of coding and non-coding transcripts is not straightforward, and different laboratories have used different methods, whose ability to perform this discrimination is unclear. In this study, we examine ten bioinformatic methods that assess protein-coding potential and compare their ability and congruency in the discrimination of non-coding from coding sequences, based on four underlying principles: open reading frame size, sequence similarity to known proteins or protein domains, statistical models of protein-coding sequence, and synonymous versus non-synonymous substitution rates. Despite these different approaches, the methods show broad concordance, suggesting that coding and non-coding transcripts can, in general, be reliably discriminated, and that many of the recently discovered extra-genic transcripts are indeed non-coding. Comparison of the methods indicates reasons for unreliable predictions, and approaches to increase confidence further. Conversely and surprisingly, our analyses also provide evidence that as much as ∼10% of entries in the manually curated protein database Swiss-Prot are erroneous translations of actually non-coding transcripts.
KW - Bioinformatics
KW - Proteome
KW - Transcriptome
KW - mRNA
KW - ncRNA
UR - http://www.scopus.com/inward/record.url?scp=33646478817&partnerID=8YFLogxK
U2 - 10.4161/rna.3.1.2789
DO - 10.4161/rna.3.1.2789
M3 - Article
SN - 1547-6286
VL - 3
SP - 40
EP - 48
JO - RNA Biology
JF - RNA Biology
IS - 1
ER -