*******************************************************
* *
* G A R D E N I A *
* *
*******************************************************
1. WHAT IS IT ?
Gardenia is a program that compares RNA secondary structures,
without pseudoknots. The resulting alignment takes into account both
the primary structure and the secondary structure.
2. HOW TO INSTALL IT ?
First you must extract the .zip archive file. It will create a main
directory called 'gardenia/' and three subdirectories:
headers/ contains the header files of the source code
sample/ contains a list of sample RNA sequences
src/ contains the source code and the make file
Go to 'src' directory, and type 'make'. The compilation operation
creates an executable file named 'gardenia'. The program is written in
ANSI C, and you need a C compiler to build it. The name of the C
compiler may be modified in the 'makefile' file that is in the 'src'
directory.
3. HOW TO RUN IT ?
3.1 Synopsis
------------
gardenia inputFile [-S scoreFile] [-o outFile] [-M matrixFile] [-lefghcnms] [--fasta] [--eq]
Input sequences should be given in one file. Gardenia recognizes
bracket-dot format, which is a fasta-like format. The secondary
structure accompanying the sequence is described in the third line:
Each base pair is denoted by a pair of brackets, and each free base is
denoted by a dot.
> sequence 1
AUCGGCGGACGCC
.(....)((..))
> sequence 2
CGGCCAAUCCGGCUU
(((.....)))....
By default, the multiple alignment is displayed in Clustal format, and
the scoring scheme comes from the 'score.txt' file. See Section 4.3 to
have more information on the syntax of this file.
More information on the multiple alignment algorithm is available in
Section 4.1. Gardenia also offers a series of pairwise alignment
algorithms. By default, pairwise alignment uses affine gap penalties.
Alternative algorithms are described in Section 4.2.
3.2 Options
-----------
General options: to be used both for multiple or pairwise alignment.
-S specify the score scheme in an external file. If no file name
is given, the file 'score.txt' in used. See Section 4.3 to
have more information on the syntax of this file.
-c add constraints coming from the primary structure in the alignment
--fasta output in fasta format
-o specify the output file in which to save the alignment
-M use RIBOSUM scoring matrices.
-s similar structures. This option should be used when the input sequences
are similar. It speeds up the alignment algorithm.
--eq This option allows you to give external constraints that will guide the
alignment process. For that, each sequence in the input file shoud contain
a line with a one character identifier for each position of the sequence,
or a blank character when there is no constraint.
> sequence 1
AUCGGCGGACGCC
.(....)((..))
# # RR
Constraints correspond approximately to equivalence classes. A
position with constraint # can be matched only with a
position of constraint # or with a position with no constraint.
All these options are mutually compatible.
Additional options: for pairwise alignment only.
-e select the edit distance method for comparing the structures
(see algorithm #2 below)
-f select the edit distance method with full edit operations
for comparing the structures (see algorithm #3 below)
-g select the tree alignment method for comparing the structures
(see algorithm #4 below)
-h select the tree alignment method with full edit operation
for comparing the structures (see algorithm #5 below)
-l local comparison
-m multiple alignment (no affine gap penalty)
-n compute only the value of the optimal score, not the mapping.
These options are valid provided that your input file contains exactly
two sequences. They are all compatible with general options. Options
-e, -h, -g, -h and -m are mutually incompatible. Option -l is compatible
with options -e, -f, -g, -h and -m, but not yet fully implemented.
4 FURTHER EXPLANATION ON THE COMPARISON METHODS AND THE SCORING SYSTEM
4.1 Multiple alignment
----------------------
The method relies on the dynamic programming algorithm for pairwise
comparison, as described in Algorithm #1 below. RNA sequences are
encoded as arc-annotated sequences, and a multiple alignment for a set
of arc-annotated sequences is a nested common supersequence. The edit
scheme incorporates evolutionary operations concerning free bases
(base substitution, base deletion) and base pairs (arc-mismatch,
arc-removing, arc-breaking, arc-altering), originally defined in [6]. We take a heuristic approach and use a progressive
method. It starts with constructing all pairwise alignments to
determine the degree of similarity for each pair of sequences. Then it
combines sequences into a multiple alignment by an ascending
hierarchical clustering. Pairwise alignment of supersequences rely on
the same algorithm as paiwise alignments for arc-annotated sequences.
This is made possible because supersequences can be viewed as a nested
arc-annotated sequences on an extended alphabet. The score of one node
is its SP (sum-of-pairs) score.
4.2 Pairwise alignment
----------------------
Gardenia provides seven different algorithms for comparing two
RNA structures given as input. We briefly present the distinctive
features of each of them.
Algorithm #0 - Alignment of arc-annotated sequences with affine
gap penalty.
This algorithm searches for the optimal nested common superstructure
of the two structures given as input, such as described in [1]. This
is our favorite method. It can explicitly deal with all edit
operations: base-match, base-mismatch, base-deletion, arc-match,
arc-mismatch, arc-removing, arc-breaking and arc-altering, and is
still very fast. Affine gap costs are also allowed (which is not
described in [1]). Concerning edit operation on free bases, gap
penalties apply in he same way as with sequence alignment. Concerning
edit operation on base pairings, gap opening penalty applies when the
general shape of the structure is modified by the edit operation with
the creation of a new helix. See below.
Gap opening: creation of a new helix
(((...))) (((...)))
--------- .........
Gap extension: modification of an existing helix
(((...))) (((...))) (((...))) (((...)))
-((...))- .((...)). ((-...-)) ((.....))
Algorithm #0.1 - Local alignment of arc-annotated sequences(option -l).
This is a variation of algorithm #0 that performs local comparison
instead of global comparison. The best local alignment is the best
alignment between any two closed subsequences of the input sequences.
Option -l is also compatible with Algorithms #1, #2, #3, #4 and #5.
Algorithm #1 - Alignment of arc-annotated sequences with linear
gap penalty (option -m)
This is the same algorithm as algorithm #0, but without affine gap
cost (so this is exactly algorithm described in [1]). This algorithm
is also used for multiple alignment construction.
Algorithm #2 - Usual tree edit distance (option -e)
Here, RNA structures are represented as ordered trees: each base pair
is an internal node, and each free base is a leaf. This model is
described in further detail in [2]. It authorizes only base-match,
base-mismatch, base-deletion, arc-match, arc-mismatch and arc-removing
operations. The implementation of the edit distance algorithm is based
on Zhang-Shasha's algorithm [3] using the edit graph structure introduced
in [4].
Algorithm #3 - Tree edit distance with enriched tree encoding (option -f)
The difference with algorithm #2 lies in the construction of the tree
for the RNA sequence. Each base pair is encoded by three nodes:
an inner node for the arc and two leaves for the nucleotides (the
same encoding is used in [4]). This model allows one to simulate
arc-breaking operations (it consists in deleting the inner node), and
arc-altering operations (it consists in deleting it and one of its
children). With edit edistance, the method may be slow, and gives
permissive alignments.
Algorithm #4 - Tree alignment (option -g)
This algorithm is based on the tree alignment, such as described in [6].
Algorithm #5 - Tree alignment with enriched tree encoding (option -h)
This last method combines the tree alignment algorithm (#4) with the tree
encoding of algorithm #3. This solution is very similar to RNAforester [5].
[1] Alignment of RNA secondary structures. G. Blin, A. Denise, S. Dulucq, C. Herrbach,
H. Touzet, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2008)
[2] Computing similarity between RNA structures. B. Ma, L. Wang,and K. Zhang,
Theoretical Computer Sciences 276(1-2): 111-132 (2002)
[3] Simple fast algorithms for the editing distance between trees and related
problems. K. Zhang and D. Shasha,SIAM Journal of Computing, 18(6)} :1245-1262 (1989)
[4] Comparing similar ordered trees in linear-time. H. Touzet,
Journal of Discrete Algorithms, to appear, available on WWW
[5] Local Similarity in RNA Secondary Structures. M. Höchsmann, T. Töller,
R. Giegerich, S. Kurtz, IEEE Bioinformatics Conference 2003 (CSB 2003):
159-168
[6] Alignment of trees - an alternative to tree edit. T. Jiang, L. Wang, K.Zhang
Theoretical Computer Science, 143-1: 137-148 (1995)
4.3 Scoring system
-------------------
For all methods, you can specify your own score scheme in an external file
with -S option. This file should contain one line per edit operation, in the
following order.
base-match
base-mismatch
base-deletion
arc-match
arc-mismatch(1) - one base renaming
arc-mismatch(2) - two base renamings
arc-removing
arc-altering(1) - no base renaming
arc-altering(2) - one base renaming
arc-breaking(1) - no base renaming
arc-breaking(2) - one base renaming
arc-breaking(3) - two base renamings
gap-opening
Look at the 'score.txt' file in the 'src' directory to see an example.
Only algorithms #0 and #0.1 use all edit operations with arbitratry
score penalties. Multiple alignment and algorithm #1 do not require
any value for gap opening, and algorithms #2,#3,#4 and #5 do not
require values for arc-altering and arc-breaking operations. Indeed,
algorithms #2 and #4 do not involve such operations. For algorithms #3
and #5, scores are mutually dependent. They fulfill the following
relations:
arc-removing = arc-breaking(1) + 2 base-deletion
arc-altering = arc-breaking(1) + 1 base-deletion+ 1 base-match/mismatch
arc-breaking(2)= arc-breaking(1) + 1 base-mismatch
arc-breaking(3)= arc-breaking(1) + 2 base-mismatch
So be careful when using it. For pairwise alignment, if you find it
complicated, use algorithm #0 for global comparison or #0.1 for local
comparison.
-------------------------------------------------------------------------------
Thank you for using Gardenia.
For questions or for bug reports, please contact Helene Touzet
(helene.touzet@univ-lille1.fr)