This page provides a comprehensive comparison of programs devoted to the prediction of non-coding RNAs. From a computational viewpoint, non-coding RNAs lack a simple statistical signal in their primary sequence, such as the codon bias of protein-coding genes. This makes it difficult to detect novel non-coding RNAs in a genomic sequence. Comparative methods turn out to be the most promising approach: only structures that are conserved throughout evolution are likely to be biologically significant.
So far, the study includes four programs. All require a set of aligned sequences as input.
ddbRNA - reference, program
It searches for common stems in the multiple alignment in a greedy
fashion. The assessment of the significance of the conserved
structure is based on shuffled alignments.
QRNA - reference, program
QRNA is a supervised learning method whose key idea is to test the pattern
of substitutions observed in the alignment against three models: non-coding
RNA, protein-coding and a null hypothesis (position-independent). QRNA
is restricted to pairwise alignments.
MSARi - reference, program
It employs a distribution-mixture method to detect the conserved
common stems. It also allows for small variations between positions of
complementary base pairs in the alignment. The current version is
restricted to alignments composed of at least 10 sequences.
RNAz - reference, program
RNAz uses a structure inference method to compute the minimum free energy
of a consensus structure. The distribution of energies for equivalent randomized
alignments, provided by an SVM (Support Vector Machine), is then used to classify
the alignment.
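Both ddbRNA and RNAz judge a conserved structure against what randomized alignments would yield. The Python sketch below illustrates that idea only; it is not the code of either program. The toy score (longest_common_stem), the plain column shuffle and all function names are assumptions made for illustration. In particular, RNAz obtains the background distribution from an SVM rather than by sampling, and the exact shuffling procedure of ddbRNA is not described on this page.

```python
import random
import statistics

# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(col_i, col_j):
    """True if every sequence can form a base pair between the two columns."""
    return all((a, b) in PAIRS for a, b in zip(col_i, col_j))


def longest_common_stem(alignment):
    """Toy score (an assumption): longest stem shared by all aligned sequences."""
    cols = list(zip(*alignment))
    best = 0
    for i in range(len(cols)):
        for j in range(i + 4, len(cols)):
            k = 0
            # Extend the stem inwards, keeping at least 3 unpaired loop columns.
            while i + k < j - k - 3 and can_pair(cols[i + k], cols[j - k]):
                k += 1
            best = max(best, k)
    return best


def shuffle_columns(alignment, rng):
    """Return a copy of the alignment with its columns in random order."""
    cols = list(zip(*alignment))
    rng.shuffle(cols)
    return ["".join(row) for row in zip(*cols)]


def empirical_p_value(observed, null_scores):
    """Fraction of randomized scores at least as high as the observed one."""
    hits = sum(score >= observed for score in null_scores)
    return (1 + hits) / (1 + len(null_scores))


def z_score(observed, null_scores):
    """Standardized distance to the null mean (null scores must not be constant)."""
    return (observed - statistics.mean(null_scores)) / statistics.stdev(null_scores)


# Two aligned toy sequences with one compensatory mutation (G-C replaced by C-G).
alignment = ["GGGGAAAACCCC", "GGCGAAAACGCC"]
rng = random.Random(0)
null = [longest_common_stem(shuffle_columns(alignment, rng)) for _ in range(200)]
observed = longest_common_stem(alignment)
print(observed, empirical_p_value(observed, null))
```

For a stem score, higher is better and the empirical p-value is the natural summary. For a free energy, lower is better, so a strongly negative z_score is what would indicate an unusually stable structure.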
Feel free to contact us if you want us to incorporate another non-coding RNA detection method!
Benchmark sequences
We used three kinds of data sets:
families of non-coding RNAs, coming from Rfam and the microRNA Registry
[see full data]
randomized alignments, that were generated from the alignments of non-coding RNAs by conservative shuffling and then re-aligned (see shuffle-aln.pl for more information on the shuffling techniques).
families of coding regions of homologous mRNAs, which are not supposed to share a global conserved structure
[see full data]
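The randomized alignments in the second data set rely on conservative shuffling. The authoritative procedure is the one implemented in shuffle-aln.pl; the Python sketch below only assumes that "conservative" means, at a minimum, that columns are exchanged only with columns showing the same gap pattern, so that each sequence keeps its gaps in place and its base composition. The function names are hypothetical, and any further constraint applied by the real script (for example on column conservation) is not reproduced here.

```python
import random
from collections import defaultdict


def gap_pattern(column, gap="-"):
    """Signature of a column: which sequences carry a gap at this position."""
    return tuple(ch == gap for ch in column)


def conservative_shuffle(alignment, rng):
    """Permute columns only within groups that share the same gap pattern."""
    columns = list(zip(*alignment))
    groups = defaultdict(list)
    for index, column in enumerate(columns):
        groups[gap_pattern(column)].append(index)

    shuffled = list(columns)
    for indices in groups.values():
        permuted = indices[:]
        rng.shuffle(permuted)
        # Move each column to another position of the same group.
        for target, source in zip(indices, permuted):
            shuffled[target] = columns[source]
    return ["".join(row) for row in zip(*shuffled)]


alignment = ["ACG-UAGC", "AC-GUAGG", "ACGGU-GC"]
for row in conservative_shuffle(alignment, random.Random(1)):
    print(row)
```

Because whole columns are moved, the pairwise identity of the alignment is unchanged, while any signal that depends on the order of columns (such as stems) is destroyed.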
Alignments
All methods deal with aligned sequences.
We built pairwise alignments with Blast, Needle, Clustalw, Dialign2 and T-coffee.
Alignments for groups of 3, 5 and 10 sequences were generated with Clustalw,
Dialign2 and T-coffee. We obtained more than 80 000 alignments.
For each family we generated all pairwise, 3-wise and 10-wise alignments exhaustively. Because the number of possible 5-sequence alignments is very high in many families, only a random subset of them was chosen in each family. All alignment methods were used with their default parameters, except Blast (open gap = 5, gap extension = 2, mismatch = 2, word size = 11 and e-value limit = 100.0).
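To see why only the 5-sequence groups needed sampling, it helps to count the subsets. The Python sketch below does this for a hypothetical family of 12 sequences (the family size, the cap of 100 subsets and the function name are illustrative assumptions, not values from the benchmark) and draws a random subset when the exhaustive list would be too long.

```python
import itertools
import math
import random


def subsets_to_align(seq_ids, k, max_subsets=None, rng=None):
    """Return all k-sequence subsets, or a random sample if there are too many."""
    total = math.comb(len(seq_ids), k)
    all_subsets = itertools.combinations(seq_ids, k)
    if max_subsets is None or total <= max_subsets:
        return list(all_subsets)
    rng = rng or random.Random()
    # Materializing every subset is acceptable for a sketch; a very large
    # family would call for sampling index tuples directly instead.
    return rng.sample(list(all_subsets), max_subsets)


family = [f"seq{i}" for i in range(12)]  # hypothetical family of 12 sequences
for k in (2, 3, 5, 10):
    print(k, math.comb(len(family), k))
# prints: 2 66, 3 220, 5 792, 10 66 (one pair per line)

sampled = subsets_to_align(family, 5, max_subsets=100, rng=random.Random(0))
print(len(sampled))  # 100
```

For a family only slightly larger than 10 sequences, the count peaks around the middle group sizes, which is consistent with 5-sequence groups being the ones that had to be sampled.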
Results
Results for all data sets (classified according to the identity percentage)
More information on results for non-coding RNA families
More information on results for mRNA families
Sensitivity vs Specificity. The first general observation is that
sensitivity is the main limitation common to all methods. RNAz clearly outperforms the other
methods. The overall specificity is good: it is very high for
shuffled sequences (except for MSARi) and still good for mRNA
sequences (even though RNAz gives relatively mediocre results here, which
is probably the price to pay for its higher sensitivity). QRNA is the
most selective method on mRNA data: the use of a coding model seems
to be a good choice to improve the sensitivity/specificity balance.
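As a reminder of how the two measures are computed, here is a minimal Python sketch. It assumes that alignments of non-coding RNA families are the positives and that shuffled or mRNA alignments are the negatives; the function name and the toy labels are illustrative and are not benchmark data.

```python
def sensitivity_specificity(is_ncrna, predicted_ncrna):
    """Compute (sensitivity, specificity) from true labels and predictions."""
    pairs = list(zip(is_ncrna, predicted_ncrna))
    tp = sum(truth and pred for truth, pred in pairs)          # ncRNA, detected
    fn = sum(truth and not pred for truth, pred in pairs)      # ncRNA, missed
    tn = sum(not truth and not pred for truth, pred in pairs)  # control, rejected
    fp = sum(not truth and pred for truth, pred in pairs)      # control, accepted
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative alignment")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 4 ncRNA alignments followed by 6 control alignments.
truth = [True] * 4 + [False] * 6
calls = [True, True, False, False, False, False, False, False, False, True]
sens, spec = sensitivity_specificity(truth, calls)
print(sens, round(spec, 3))  # 0.5 0.833
```

A method that rejects almost everything scores a high specificity at no cost, which is why the two values have to be read together.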
The accuracy is strongly related to the quality of the input alignment.
All methods usually provide better results with Clustalw alignments,
even on pairwise alignments, because Clustalw alignments contain few gaps. Graphical results are available here.
The accuracy is closely related to the conservation of the sequences.
Sensitivity is better on alignments with an average conservation
between 60% and 95%. Accuracy may be very low (less than 0.45) when the identity
percentage is poor, even for thermodynamically stable, well-conserved structures
such as human microRNAs. High conservation without compensatory
mutations is a cause of error for ddbRNA, QRNA and MSARi. Graphical results are available here.
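The identity percentage used to classify the alignments can be computed in several ways. The Python sketch below shows one common convention, stated here as an assumption since this page does not give the exact formula: for each pair of sequences, identical residues are counted over the columns where at least one of the two sequences is not a gap, and the values are averaged over all pairs. The function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical residues, ignoring columns where both are gaps."""
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # a column empty in both sequences carries no information
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def average_identity(alignment):
    """Mean pairwise identity over all pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


print(pairwise_identity("ACGU", "ACGA"))  # 0.75
print(round(average_identity(["ACGU", "ACGA", "ACGU"]), 3))  # 0.833
```

Under this convention a gap facing a residue counts as a mismatch, so gap-rich alignments get a lower identity than under conventions that skip every gapped column.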
Abundance of data can damage sensitivity.
Neither ddbRNA nor RNAz is able to detect non-coding RNAs in alignments
of 10 sequences. MSARi is an exception because it is tuned for alignments
containing at least 10 sequences and is more permissive in its search for
common stems. However, the overall accuracy of MSARi appears to be low
on our test data. Graphical results are available here.