This page provides a comprehensive comparison of programs devoted to the prediction of non-coding RNAs. From a computational viewpoint, non-coding RNAs lack a simple statistical signal in their primary sequence, such as the codon bias of protein-coding genes. This makes it difficult to detect novel non-coding RNAs in a genomic sequence. Comparative methods appear to be the most promising approach: only structures that are conserved through evolution are likely to be biologically significant.

So far, the study includes four programs. All require a set of aligned sequences as input.

ddbRNA - reference, program
It searches for common stems in the multiple alignment in a greedy fashion. The assessment of the significance of the conserved structure is based on shuffled alignments.

QRNA - reference, program
QRNA is a supervised learning method whose key idea is to test the pattern of substitutions observed in the alignment against three models: non-coding RNA, protein-coding, and a null hypothesis (position-independent). QRNA is restricted to pairwise alignments.

MSARi - reference, program
It employs a distribution-mixture method to detect the conserved common stems. It also allows for small variations between positions of complementary base pairs in the alignment. The current version is restricted to alignments composed of at least 10 sequences.

RNAz - reference, program
RNAz uses a structure inference method to compute the minimum free energy of a consensus structure. The distribution of energies for equivalent randomized alignments, provided by an SVM (Support Vector Machine), is then used to classify the alignment.
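
Both ddbRNA and RNAz judge a candidate by comparing its score with scores obtained on randomized alignments. The sketch below illustrates that general idea only: it is not the code of either program, and the function names and the convention that a lower score (for example a folding energy) is better are assumptions made for the example.

```python
import statistics


def z_score(observed, shuffled_scores):
    """Standard score of the observed value against the shuffled controls."""
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)  # sample standard deviation
    return (observed - mean) / sd


def empirical_p_value(observed, shuffled_scores):
    """Fraction of controls scoring at least as low as the observed value.

    Assumes lower is better (e.g. a free energy). The +1 terms keep the
    estimate away from exactly zero when no control beats the observation.
    """
    hits = sum(1 for s in shuffled_scores if s <= observed)
    return (hits + 1) / (len(shuffled_scores) + 1)
```

A strongly negative z-score, or a small empirical p-value, suggests that the observed structure is more stable than expected from the randomized controls.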

Feel free to contact us if you would like us to include another non-coding RNA detection method!

Benchmark sequences

We used three kinds of data sets:

families of non-coding RNAs, taken from Rfam and the microRNA Registry
[see full data]

randomized alignments, that were generated from the alignments of non-coding RNAs by conservative shuffling and then re-aligned (see shuffle-aln.pl for more information on the shuffling techniques).

families of coding regions of homologous mRNAs, which are not expected to share a globally conserved structure
[see full data]
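
To give an idea of what shuffling an alignment means, here is a minimal sketch that permutes whole columns, which keeps each sequence's composition and the content of every column while destroying the order that a conserved structure depends on. It is a simplified illustration and does not reproduce shuffle-aln.pl, whose conservative procedure is described with the script; the function name and the `seed` parameter are hypothetical. The re-alignment step applied to the benchmark data is not shown.

```python
import random


def shuffle_alignment_columns(rows, seed=None):
    """Return a copy of the alignment with its columns in random order.

    rows: list of equal-length strings, one per aligned sequence.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)
    order = list(range(len(rows[0])))
    rng.shuffle(order)  # one permutation, applied to every row
    return ["".join(row[i] for i in order) for row in rows]
```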

Alignments

All methods take aligned sequences as input. We built pairwise alignments with Blast, Needle, Clustalw, Dialign2 and T-coffee. Alignments for groups of 3, 5 and 10 sequences were generated with Clustalw, Dialign2 and T-coffee. In total, we obtained more than 80 000 alignments.

For each family, we generated all pairwise, 3-wise and 10-wise alignments exhaustively. Because the number of possible alignments containing 5 sequences is very high in many families, only a random subset of them was chosen in each family. All alignment methods were used with their default parameters, except Blast (open gap = 5, gap extension = 2, mismatch = 2, word size = 11 and e-value limit = 100.0).
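
The selection of sequence groups described above can be sketched as follows: enumerate every group of k sequences, and fall back to a random sample when there are too many. The function name, the `max_subsets` cap and the fixed seed are assumptions for illustration; the actual sample sizes used in the benchmark are not stated here.

```python
import itertools
import random


def alignment_subsets(seq_ids, k, max_subsets=None, seed=0):
    """Return groups of k sequence identifiers to be aligned together.

    All combinations are returned unless max_subsets is given and exceeded,
    in which case a random sample of that size is drawn without replacement.
    """
    combos = list(itertools.combinations(seq_ids, k))
    if max_subsets is None or len(combos) <= max_subsets:
        return combos
    return random.Random(seed).sample(combos, max_subsets)
```

Listing every combination before sampling is simple but memory-hungry; for very large families, drawing random groups directly would be a more economical choice.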

Results

Results for all data sets (classified according to the identity percentage)

More information on results for non-coding RNA families

More information on results for mRNA families
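
Since the results are classified by identity percentage, it may help to see one common way of computing it. The sketch below averages the identity over all pairs of sequences in an alignment; the treatment of gaps (columns where both sequences have a gap are ignored, a gap facing a residue counts as a mismatch) is an assumption, and the definition used for the benchmark may differ.

```python
from itertools import combinations

GAP = "-"


def pairwise_identity(a, b):
    """Percentage of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # column empty in both sequences: not counted
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_pairwise_identity(rows):
    """Average of pairwise_identity over all pairs of rows."""
    pairs = list(combinations(rows, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
```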


Sensitivity vs Specificity. The first general observation is that sensitivity is the main limitation common to all methods. RNAz clearly outperforms the other methods in this respect. The overall specificity is good: it is very high for shuffled sequences (except for MSARi) and still good for mRNA sequences (even if RNAz gives relatively mediocre results here, which may be the price to pay for its higher sensitivity). QRNA is the most selective method on mRNA data: the use of a coding model seems to be a good choice to improve the sensitivity/specificity balance.
The accuracy is strongly related to the quality of the input alignment. All methods usually provide better results with Clustalw alignments, even on pairwise alignments, because Clustalw alignments contain few gaps. Graphical results are available here.
The accuracy is closely related to the conservation of the sequences. Sensitivity is better on alignments with an average conservation between 60% and 95%. Accuracy may be very low (less than 0.45) when the identity percentage is poor, even for thermodynamically stable, well-conserved structures such as human microRNAs. High conservation without compensatory mutations is a cause of error for ddbRNA, QRNA and MSARi. Graphical results are available here.
Abundance of data can damage sensitivity. Neither ddbRNA nor RNAz is able to detect non-coding RNAs with alignments of 10 sequences. MSARi is an exception because it is tuned for alignments containing at least 10 sequences and is more permissive when searching for common stems. However, the overall accuracy of MSARi appears to be low on our test data. Graphical results are available here.
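
For readers who want to recompute the two measures discussed above from raw predictions, here is a minimal sketch using the standard definitions: sensitivity is the fraction of true non-coding RNA alignments that are detected, and specificity is the fraction of negative alignments (shuffled or mRNA) that are correctly rejected. The function name and the boolean encoding are assumptions for the example.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for paired boolean sequences.

    labels[i] is True when alignment i is a real non-coding RNA;
    predictions[i] is True when the program reports it as one.
    """
    tp = fn = tn = fp = 0
    for truth, called in zip(labels, predictions):
        if truth and called:
            tp += 1
        elif truth:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec
```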