miRkwood small RNA-seq

How to prepare my input files?

The input of miRkwood is a set of reads produced by deep sequencing of small RNAs and mapped to a reference genome. Typically, read lengths should range between 15 nt and 35 nt. The user is required to upload a BED file that contains the positions of all mapped sequence tags. This file can be obtained from the raw sequencing data in a few simple steps on your computer, described below. If you are new to miRkwood, you might also want to test it with the sample BED file provided below.

Remove adapter sequences

This can be performed with cutadapt, for example.

cutadapt -a AACCGGTT -o output.fastq input.fastq
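To check that trimming worked, it can help to count how many reads still contain the adapter before and after running cutadapt. The following is a minimal sketch using only awk. The function name count_adapter_reads is hypothetical, and it assumes a standard FASTQ layout with four lines per record.

```shell
# Count reads in a FASTQ file whose sequence still contains a given adapter.
# Assumes 4 lines per record (sequences are not wrapped over several lines).
count_adapter_reads() {
    awk -v adapter="$1" 'NR % 4 == 2 && index($0, adapter) > 0 { n++ }
                         END { print n + 0 }' "$2"
}

# Example (file names taken from the cutadapt command above):
# count_adapter_reads AACCGGTT input.fastq
# count_adapter_reads AACCGGTT output.fastq
```

A count close to zero on output.fastq suggests that the adapter given to cutadapt matches the one used in the library.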

Run quality control

The aim of this step is to discard sequences that are too short or too long, and to remove or trim low-quality sequences. This can be achieved with prinseq, for example with the following command line.

prinseq-lite.pl -fastq <short_reads_file.fastq> -min_len 18 -max_len 25 -noniupac -min_qual_mean 25 -trim_qual_right 20 -ns_max_n 0

Only sequences between 18 and 25 nt long, with a mean quality of at least 25 (Phred score) and composed exclusively of the nucleotides A, C, G and T, are retained. Sequences are trimmed by quality score from the 3'-end, with a threshold of 20.
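A quick way to verify the length filter is to look at the read-length distribution of the filtered file. The sketch below is an illustration only. The function name fastq_length_hist and the file name filtered_reads.fastq are assumptions, and the FASTQ file is assumed to have four lines per record.

```shell
# Print a read-length histogram (length <TAB> count) for a FASTQ file.
# Assumes 4 lines per record (sequences are not wrapped over several lines).
fastq_length_hist() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) printf "%d\t%d\n", len, count[len] }' "$1" | sort -n
}

# Example (hypothetical file name):
# fastq_length_hist filtered_reads.fastq
```

After filtering with -min_len 18 -max_len 25, every length in the first column should fall between 18 and 25.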

Map the trimmed reads to the reference genome

The goal of this step is to generate an alignment file (SAM or BAM) that contains the alignments of the expressed reads to the reference genome.

We recommend exact matching. For this, you can use Bowtie with the following parameters. Any other read mapper can also do the job.

bowtie -v 0 -f/q --all --best --strata -S <genome> <reads> > output.sam  

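The reads file must be in FASTA, FASTQ, or colorspace-FASTA format: use -f for FASTA reads or -q for FASTQ reads. The genome file must be in FASTA format and indexed with bowtie-build beforehand, since Bowtie takes the base name of the index as its <genome> argument.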

The list of assemblies accepted by miRkwood is given in Section "Select an assembly" on the help page.

Convert the BAM file into a BED file

For this step, you should use our custom script mirkwood-bam2bed.pl (download the script). mirkwood-bam2bed.pl is a Perl script that requires SAMtools to be installed. In practice, the BED file is up to 10 times smaller than the BAM file and up to XXX times smaller than the set of raw reads, while retaining all the information needed to conduct the analysis. This significantly reduces the bandwidth needed to upload the data to the miRkwood server.

mirkwood-bam2bed.pl --in /input/file --bed /output/file --min X --max Y

--in : path to your input file (format BAM or SAM)
--bed : path to your output BED file
--min : keep only reads with length ≥ min (default 18)
--max : keep only reads with length ≤ max (default 25)
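Since the script depends on Perl and SAMtools, a short check before the conversion can avoid a failed run. This is a minimal sketch. The helper name require_tool and the file names in the usage comment are hypothetical.

```shell
# Succeed if a command is available on PATH; otherwise print a message and fail.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "Missing required tool: $1" >&2
    return 1
}

# Hypothetical usage before running the conversion:
# require_tool perl && require_tool samtools && \
#     mirkwood-bam2bed.pl --in output.sam --bed output.bed --min 18 --max 25
```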

The generated BED file has the following syntax.

1    18092    18112    SRR051927.5475072    1    -
1    18094    18118    SRR051927.2544175    2    +
1    18096    18119    SRR051927.3033336    1    +
1    18100    18124    SRR051927.172198     9    +

In this file, each line describes a unique read. The fields are, from left to right: name of the chromosome, start position, end position, read identifier, number of occurrences of the read in the data, and strand. Positions follow the BED numbering convention: the first base of the chromosome is position 0 (0-based numbering) and the feature does not include the end position. The length of a read is therefore the end position minus the start position.
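Because BED intervals are 0-based and exclude the end position, the length of each read is simply the third field minus the second. The sketch below uses this to summarise a BED file before upload. The function name bed_summary is hypothetical, and the file is assumed to have the six whitespace-separated fields described above.

```shell
# Summarise a BED file: number of unique reads, total occurrences,
# and shortest and longest read length.
# Read length is end - start because BED intervals are 0-based and half-open.
bed_summary() {
    awk 'NF >= 6 {
             len = $3 - $2
             if (n == 0 || len < min) min = len
             if (n == 0 || len > max) max = len
             n++
             total += $5
         }
         END { printf "unique=%d total=%d min_len=%d max_len=%d\n", n, total, min, max }' "$1"
}

# Example (hypothetical file name):
# bed_summary output.bed
```

On the four example lines shown above, this prints unique=4 total=13 min_len=20 max_len=24. Read lengths outside the --min and --max values given to mirkwood-bam2bed.pl would indicate a problem with the conversion.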

You are now ready to use miRkwood small RNA-seq on your data.

Run miRkwood