DRAFT

[for original scripts, see this webpage]

This webpage provides **only extra experimental datasets and scripts** related to section 4.2, "Coverage sensitivity and alignment-free distance for sequence comparison".

For the **main datasets and scripts used in the paper**, please see the original index.html first.

These experiments are not included (or even mentioned) in the paper, but since they are more complete than the ones presented in section 4.2, we share them here:

We provide additional scripts to compute the correlation measure **without using a sampling procedure**:

- The first method is based on the **full enumeration** of the alignment sequences of size 32.

  It means that $2^{32}$ alignments are generated for each seed; each alignment *i* is evaluated on one of the two criteria, namely its *coverage* or *multihit* value $y_i$, and its percentage of identity $p_i$ is measured too. The Pearson correlation coefficient is computed using this cumulative (and single-pass) formula, where the sums run over the $n$ alignments:

  $$r = \frac{n \sum p_i y_i - \sum p_i \sum y_i}{\sqrt{n \sum p_i^2 - \left(\sum p_i\right)^2}\;\sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}$$

  […] *multihit* and *coverage* criteria, even using optimized SIMD SSE2 code (otherwise it takes more than 4 minutes). This code is now only of interest for debugging the following one:

- The second method is a full **dynamic programming** algorithm based on the language recognized by the *coverage automaton* (or the *multihit automaton*) of the seed:

  - it is possible, by using a *counting semi-ring* (and not a *probabilistic semi-ring*), to know **how many** alignments have a coverage (or multihit) value of *y* (for any *y* from 0 to the length *l* of the alignment considered),
  - it is then possible, by intersecting the previous automaton with an automaton that counts the number *x* of matches (two states), to know **how many** alignments have **at the same time**:
    - a coverage (or multihit) value of *y*, for any *y* in [0...*l*]
    - a percentage of identity of *p = x/l*, for any *x* in [0...*l*]

  […] 〈*x,y*〉. In practice, between 5 and 20 seeds can be computed this way per second. Note that coverage is about twice as slow to compute as multihit.
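The two methods can be illustrated with a minimal Python sketch. It is not the code distributed on this page, and it rests on several assumptions: a seed is written as a string of `1` (must-match) and `0` (joker) characters, an alignment is a sequence of 0/1 symbols (mismatch/match), and only the *multihit* criterion is handled, taken here to be the number of positions where the seed hits. Instead of the multihit automaton, the hypothetical function `joint_counts_dp` uses the last *s*-1 alignment symbols as its state, which is a simplified stand-in. Counts are added rather than probabilities multiplied, which is the role of the counting semi-ring. The Pearson coefficient is then obtained from the counts with the single-pass formula; since it is invariant under positive rescaling, the sketch accumulates *x* instead of *p = x/l* to keep the sums exact.

```python
from collections import defaultdict
from itertools import product
from math import sqrt


def is_hit(seed, window):
    """True if every '1' position of `seed` faces a match (1) in `window`."""
    return all(window[j] == 1 for j, c in enumerate(seed) if c == "1")


def joint_counts_dp(seed, l):
    """Return {(x, y): number of alignments of length l with x matches and y hits}."""
    s = len(seed)
    # State: (last s-1 alignment symbols, x, y) -> number of prefixes reaching it.
    states = {((), 0, 0): 1}
    for _ in range(l):
        nxt = defaultdict(int)
        for (suffix, x, y), count in states.items():
            for b in (0, 1):  # 0 = mismatch, 1 = match
                window = suffix + (b,)
                hit = len(window) == s and is_hit(seed, window)
                new_suffix = window[1:] if len(window) == s else window
                nxt[(new_suffix, x + b, y + int(hit))] += count
        states = nxt
    counts = defaultdict(int)
    for (_, x, y), count in states.items():
        counts[(x, y)] += count
    return dict(counts)


def joint_counts_enum(seed, l):
    """Same table by full enumeration of the 2^l alignments (small l only)."""
    s = len(seed)
    counts = defaultdict(int)
    for aln in product((0, 1), repeat=l):
        y = sum(is_hit(seed, aln[i:i + s]) for i in range(l - s + 1))
        counts[(sum(aln), y)] += 1
    return dict(counts)


def pearson_from_counts(counts):
    """Single-pass Pearson coefficient between x and y, weighted by the counts."""
    n = sx = sy = sxx = syy = sxy = 0
    for (x, y), c in counts.items():
        n += c
        sx += c * x
        sy += c * y
        sxx += c * x * x
        syy += c * y * y
        sxy += c * x * y
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy))


# Example seed "1101" (weight 3, span 4) on alignments of length 32.
print(pearson_from_counts(joint_counts_dp("1101", 32)))
```

On short alignments the enumeration can be used to check the dynamic programming table, which mirrors the debugging role described above. The coverage criterion would need a richer state than this sketch keeps.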

**All** the seeds of weight *w* from 2 to **8**, span *s* from *w* to *w+4*, single or pair of seeds, have been considered. The correlation with the percentage of identity has been computed for each seed, on both the multihit and coverage criteria.
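As an illustration of the seed space described here, the following Python sketch lists single seeds by weight and span. It assumes the usual spaced-seed convention, which this page does not state: a seed is a pattern of `1` (must-match) and `0` (joker) symbols, its weight is the number of `1`s, and its first and last positions are both `1`. The function name `seeds` is hypothetical, and whether a seed and its mirror image were counted separately in the experiments is not specified.

```python
from itertools import combinations


def seeds(weight, span):
    """Yield every seed pattern with `weight` ones over `span` positions.

    Assumption: the first and last positions are always '1'.
    """
    if weight < 2 or span < weight:
        return
    for ones in combinations(range(1, span - 1), weight - 2):
        pattern = ["0"] * span
        pattern[0] = pattern[-1] = "1"
        for i in ones:
            pattern[i] = "1"
        yield "".join(pattern)


# Single seeds of weight 2 to 8 and span w to w+4.
single_seeds = [
    seed
    for w in range(2, 9)
    for s in range(w, w + 5)
    for seed in seeds(w, s)
]
print(single_seeds[:6])
```

Pairs of seeds could be built from such a list with `itertools.combinations`, but which pairs were actually evaluated is not detailed here.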

Plots of the *multihit* (x-axis) vs *coverage* (y-axis) correlation coefficient for each seed are provided below.

Varying the *minimal percentage of identity* required for
an alignment animates the plots; the alignment length is
fixed to 32 here.

**All** the seeds of weight *w* from 2 to **9**, span *s* from *w* to *w+4*, single or pair of seeds, have been considered. The correlation with the percentage of identity has been computed for each seed, on both the multihit and coverage criteria.

Plots of the *multihit* (x-axis) vs *coverage*
(y-axis) correlation coefficient for each seed are provided
below.

Varying the *minimal percentage of identity* required for an alignment animates the plot; the alignment length is fixed to 32 here.

Varying the *alignment length* animates the plot; colors are given for some minimal percentages of identity.