Heuristics for multiobjective multiple sequence alignment

Background Aligning multiple sequences arises in many tasks in Bioinformatics. However, the alignments produced by the current software packages are highly dependent on the parameters setting, such as the relative importance of opening gaps with respect to the increase of similarity. Choosing only one parameter setting may provide an undesirable bias in further steps of the analysis and give too simplistic interpretations. In this work, we reformulate multiple sequence alignment from a multiobjective point of view. The goal is to generate several sequence alignments that represent a trade-off between maximizing the substitution score and minimizing the number of indels/gaps in the sum-of-pairs score function. This trade-off gives to the practitioner further information about the similarity of the sequences, from which she could analyse and choose the most plausible alignment. Methods We introduce several heuristic approaches, based on local search procedures, that compute a set of sequence alignments, which are representative of the trade-off between the two objectives (substitution score and indels). Several algorithm design options are discussed and analysed, with particular emphasis on the influence of the starting alignment and neighborhood search definitions on the overall performance. A perturbation technique is proposed to improve the local search, which provides a wide range of high-quality alignments. Results and conclusions The proposed approach is tested experimentally on a wide range of instances. We performed several experiments with sequences obtained from the benchmark database BAliBASE 3.0. To evaluate the quality of the results, we calculate the hypervolume indicator of the set of score vectors returned by the algorithms. The results obtained allow us to identify reasonably good choices of parameters for our approach. Further, we compared our method in terms of correctly aligned pairs ratio and columns correctly aligned ratio with respect to reference alignments. Experimental results show that our approaches can obtain better results than TCoffee and Clustal Omega in terms of the first ratio.


Background
Multiple sequence alignment (MSA) is of central importance to bioinformatics. This technique is useful to compare new sequences with other genomic sequences, unveiling their shared information and their significant differences. MSA methods are mainly essential to analyse biological sequences and to design applications in structure modeling, functional prediction, phylogenetic analysis and sequence database searching [1]. Currently, these approaches are also used in applications, for example, to compare protein structures, to predict protein mutations and interactions or to reconstruct phylogenetic trees [2,3]. Moreover, these tools have found their place in medicine as well, mainly in the context of genetic screening and genetic engineering [4].
Most of the known alignment approaches have diverse optimization functions, along with assorted heuristics to search for the optimum alignment. These techniques consider a weighted sum formulation that maximizes the substitution score and penalizes indels/gaps. This is the usual procedure even in the case of pairwise sequence alignment [5]. However, the way of setting up the weights is very often not trivial. Moreover, analyzing only one alignment may lead to too simplistic interpretations of the data [6].
A multiobjective formulation of sequence alignment provides the practitioner a set of alignments that represents the trade-off between decreasing the number of gaps and increasing similarity. In bioinformatics, this formulation and algorithms can be found already for pairwise sequence (DNA/Protein) alignment [7][8][9][10]. Abbasi et al. [7] present dynamic programming algorithms to compute the optimal set of alignments by treating the number of indels/gaps and the scores for (mis)matches/substitution as separate objectives. They also apply this method to analyze the construction of phylogenetic trees. Taneda [11] describes a heuristic approach for pairwise RNA sequence alignment that incorporates RNA structure information to approximate a set of optimal alignments. Schnattinger et al. [12] extend the work of Taneda by computing the optimal set. They treat the sequence alignment and the consensus structure calculation as separate objectives and solve both problems simultaneously with a dynamic programming approach. An extensive review about other problems in bioinformatics that are formulated as multiobjective optimization problems is explained in Handl et al. [13].
Few work has been done on multiobjective MSA (MMSA), which is much harder to be solved from a computational point of view. Only very recently, MSA has been treated as multiobjective optimization problem. Ortuño et al. [14] used a multiobjective evolutionary algorithm based on the NSGA-II to optimize sum-of-pairs score, total columns and number of gaps. In another article [15], the authors extended the previous work by applying further biological features and considering different objectives such as strike score, non-gaps percentage and totally conserved columns. Likewise, Soto et al. [16] applied a multiobjective evolutionary algorithm to optimize pre-aligned sequences by considering entropy and the metric Metal.
In this article, we propose to approach MMSA with heuristic methods, extending the formulation given in Abbasi et al. [7] for an arbitrary number of sequences. The approach considers the sum-of-pairs score vector by maximizing the substitution score, based on a substitution matrix, and minimizing the number of indels/gaps. This article is an extended version of a conference article [17] providing a more thorough experimental analysis of the algorithms proposed.

Methods
We first give a definition of multiple sequence alignment, followed by its multiobjective counterpart. Then, a local search strategy is proposed.

Multiple sequence alignment
In MSA, homogeneous characters of a group of sequences are aligned together in columns. The following definition introduces the problem of MSA in a more mathematical form [18].

Definition 1
Multiple sequence alignment. Let A 1 = (a 1,1 , . . . , a 1,n 1 ), . . . , A m = (a m,1 , . . . , a m,n m ) be m strings over an alphabet . Let ′ − ′ � ∈ be an indel symbol, let ′ = ∪ {−} and let denote the empty string. Let h : (� ′ ) * � → � * be a homomorphism defined by h(a) = a for all a ∈ , and h(−) = . A multiple sequence alignment φ of ( over the alphabet ′ , such that the following conditions are satisfied: A way of evaluating the quality of an alignment φ is to score its columns by the sum-ofpairs score function. The score of a column j = 1, . . . , ℓ is defined as: where the score sc(b i,j , b k,j ), for b i,j , b k,j ∈ , comes from a substitution matrix used for scoring pairwise sequence alignments (such as PAM and BLOSUM). Indels are scored by defining sc( ′ − ′ , ·) = sc(·, ′ − ′ ) = W d , where W d is the weight of an indel, and sc( ′ − ′ , ′ − ′ ) = 0. The score for alignment φ is computed as

Multiobjective multiple sequence alignment
Many problems of practical relevance are frequently characterized by different objectives that simultaneously have to be taken into account when it comes to solve the problem. In most cases, these objectives conflict with each other and optimizing the problem for a specific objective might compromise other objectives. An appropriate approach to a multiobjective problem is to obtain a set of solutions in such a way that each of which cannot be improved in one objective without deteriorating at least one of the others [19].
Traditionally, approaches for solving sequence alignment are performed with a single objective function, which is based on a weighted sum of matches, mismatches, insertions and deletions. However, there is no agreement on how to specify weights for these parameters. In a multiobjective formulation, the practitioner does not have to define weights and can get access to much more information [10].
Let φ be an alignment of m sequences (A 1 , . . . , A m ), as defined in the previous section. We define the following two score functions where the score s(b i,j , b k,j ), for b i,j , b k,j ∈ is obtained from a substitution matrix and , and 0, otherwise. The multiobjective score sum-of-pairs for alignment φ is Given two alignments φ and φ ′ , , with at least one strict inequality. An alignment φ * is Pareto optimal if there exists no other alignment φ such that BSP(φ) ≥ BSP(φ * ). The set of all Pareto optimal alignments is called Pareto optimal alignment set. The image of a Pareto optimal alignment in the score space is a non-dominated score and the set of all non-dominated scores is called non-dominated score set.
Although pairwise sequence alignment can be solved efficiently, that is, the running time to find the non-dominated score set is a polynomial function of the size of the sequences, this is no longer the case for the multiple counterpart for an arbitrary number of sequences. Thus, the goal, in practice, is to find an approximation to the non-dominated score set in a reasonable amount of time. In this article we explore the application of heuristic methods, in particular, local search algorithms.

Pareto local search
A local search algorithm starts from a feasible solution and searches locally for better neighbors to replace the current one. This neighborhood search is repeated until no improvement is found anymore and the algorithm stops in a local optimum. In the context of MMSA, the neighborhood function associates a set of feasible alignments N (φ) to every feasible alignment φ. An alignment φ is a Pareto local optimum if there exists no Pareto local search (PLS) [20] is a generic local search framework for multiobjective optimization problems that follows the principle above by keeping the best alignments into a special data structure, known as the archive. Each neighboring alignment that is non-dominated with respect to the alignments in the archive, is added to it. The algorithm "naturally" terminates once there is no neighbor of any alignment in the archive that is not dominated (a local optimal set). See [20,21] for more details about this approach for multiobjective combinatorial optimization problems and its successful application to scheduling and graph problems.
In the following we describe some options that were considered to apply PLS to MMSA: neighborhood function and starting alignment.

Neighborhood
We consider a k-block neighborhood for this problem, which consists of exchanging a substring of at most k characters in with indels in a gap, that is, a contiguous sequence of indels. In the following, we establish the conditions for two alignments to be k-block neighbors. Let A 1 = (a 1,1 , . . . , a 1,n 1 ), ..., A m = (a m,1 , . . . , a m,n m ) denote m strings and let φ B and φ C be two alignments, where φ B = B 1 , . . . , B m and φ C = C 1 , . . . , C m . The alignments φ B and φ C are k-block neighbors if and only if the following conditions hold: Conditions (i) and (ii) state that both alignments should have the same size, and differ only in the jth string, respectively. In addition, condition (iii) and (iv) state that at most k characters from string A j do not occupy the same position in both alignments, and that those characters are contiguous. For illustration purpose, consider the following alignment: We list all possible 2-block neighbors as follows: It is possible to visit all k-block neighbors of an alignment φ in a straightforward way. Given the first string of an alignment φ, consider the leftmost substring of size one that contains only a character and an indel to its right. Then, exchange it with every indel in its right and stop when no indel is found. In case the substring has also indels to its left, repeat the same exchange procedure. If it is still possible, increase the size of the substring by one and repeat the same moves to its right and to its left; repeat the overall procedure until reaching a substring of size k. Then, consider the next substring of size one to the right and repeat the same procedure as above. Note that each move generates a new k-block neighbor of alignment φ. In order to maintain feasibility, the columns that contain only indels are deleted from the alignment.

Starting alignment
The starting solution may have a strong impact on the overall performance of the local search. We considered the three following possibilities for PLS: (i) Rand: A random feasible alignment, which is obtained by inserting indels randomly into the strings, except in the largest one. This initialization option tends to generate a feasible alignment with a small number of indels. (ii) Clust: A feasible alignment obtained from program Clustal Omega [22] (available at http://www.clustal.org/omega). (iii) Tcoffee: A feasible alignment obtained from program Tcoffee [23] (available at http://www.tcoffee.org/Projects/tcoffee/).
We choose Tcoffee since it is one of the best consistency based methods that outperforms the others existing programs in terms of accuracy and Clustal Omega since it outperforms Tcoffee whenever sequences with large N/C terminal extensions exist [24]. We used the default parameters suggested for these two programs.

Pareto iterated local search
The local search framework described in the previous section may get trapped in a local optima set of poor quality. One possibility for escaping from local optima is to perturb one or more alignments in the archive and restart the local search from those alignments. Pareto iterated local search (PILS) implements this search behavior and its main steps are presented in Algorithm 1. Similar to the single objective counterpart (see [25]), four components have to be specified in PILS: GenerateInitialSolution, which generates an initial set of alignments

Discussion and results
We performed several experiments in order to understand the effect of different parameters of PLS and PILS and to establish functional relationships between algorithm performance and instance features, such as the number of sequences and their sizes. Furthermore, we compare the quality of the alignments produced by our approaches with those produced by TCoffee and Clustal Omega. The implementations were coded in C and compiled with GCC version 4.6.1 with the −O3 compiler option. Experiments were performed in a cluster with 16 nodes, each one comprising an Intel Core i7 CPU with 4-core and 2 GB Ram, with operating system Ubuntu 11.10. Except the compiler option, no other code optimization technique was used in the experiments. For this experimental analysis, we have chosen sequences obtained from the benchmark database BAliBASE 3.0 [26]. In our substitution matrix, each cell value is 1 when the both residues are equal, and −1 otherwise. BAliBASE 3.0 is a well-known benchmark for evaluating the multiple sequence alignment algorithms. The data set Reference 1 consists of equidistant family sequences with two subgroups: RV11 and RV12. RV11 contains 38 data sets with less than 20 % residue identity between groups and RV12 contains 44 data sets with residue identity between 20 and 40 %. Reference 2 (RV20) contains 41 alignments comprising family sequences with more than 40 % similarity and a highly divergent orphan sequence. Reference 3, with 30 data sets, contains subfamilies such that the sequences within a given subfamily share more than 40 % identity, but any two sequences from different subfamilies share less than 20 % identity. The reference set RV40, with 49 data sets, contains sequences that are composed of groups with N/C-terminal extensions. The reference set RV50 contains 16 data sets with large internal insertions. We tested our approaches on all the instances from the sets RV11 and RV20, composed by sequences with different sizes and varying percentage identities.
Since the heuristics proposed in this paper are stochastic, we ran each variant 10 times for each instance and recorded the contents of the archive for each run, namely the approximate set, once the termination criterion was met. To evaluate the quality of each approximate set, we computed its hypervolume indicator value. This indicator measures the area that is dominated by the approximate set, bounded by a reference point [27]; see an in-depth discussion about the hypervolume indicator in relation with other performance assessment methods in Zitzler et al. [28]. Moreover, it is known that the hypervolume is maximized when the optimal set is found [28]. Figure 1 illustrates the hypervolume indicator (shaded area) for a given approximate set (black points) and a given reference point (white point). We have chosen, as the reference point for each instance, the minimum substitution score minus one and the maximum number of indels plus one that were obtained from the runs of all variants. We merged all approximate sets produced from all runs for a given instance, extracted the non-dominated scores and computed the hypervolume indicator value of the resulting set, namely, the reference hypervolume indicator value, which is then used as a reference value to evaluate the relative performance of each approach. Then, for each approximate Indels Substitution score set, we computed its hypervolume indicator value and the percentage of this value with respect to the reference hypervolume indicator value; the larger the value, the better is the approximate set in terms of this indicator.
We analysed the effect of different k-values in the k-block neighborhood, as well as the effect of the three starting alignments on the performance of PLS. For a given instance, let Min denote the length of the smallest string; we consider k ∈ {Min, Min/2, Min/4, Min/8, Min/16}. PLS terminates once it is not possible to find non-dominated neighboring alignments or the time limit of 5 minutes of CPU-time is reached. Tables 1 and 2 report the results obtained for PLS for the benchmark sets RV11 and RV20. The results were averaged over 10 runs, for each of the five k values and the three starting alignments. Column id corresponds to the instance id from the two benchmark sets (BB1100*.tfa and BB200*.tfa, where * denotes the id); column m gives the number of sequences; columns Min and Max correspond to the length of the smallest and the largest sequence, respectively. The instances are lexicographically ordered in terms of the number of sequences and the length of the smallest sequence. The values in italics correspond to the best result obtained for each instance. The last column Init shows the hypervolume indicator obtained for the non-dominated score of the two starting alignments obtained with Clustal Omega and Tcoffee.
The results clearly indicate the best performance of PLS when starting with alignments Clust and Tcoffee, instead of a feasible random alignment. Also, the best k-block value strongly depends on the number of sequences and, to a smaller extent, to their sizes. As a rule, for the given cut-off time, large (small) k values achieve better performance on problems with a smaller (larger) number of sequences. Moreover, column Init indicates that PLS strongly improves upon the two starting alignments. Tables 3 and 4 report results obtained for PILS on the same benchmark sets, averaged over 10 runs, for k-block size equals to 2; we recall that we use Clust and Tcoffee as starting alignments. The results reported in the tables do not include the CPUtime taken to compute the starting alignments. PILS always terminates once the time limit of 5 minutes of CPU-time is reached. In this case, the limit time for each PLS run within PILS is 5/(P + 1) minutes. Column PLS max gives the best value of PLS from Tables 1 and 2. The best results for each sequence set are shown in italics. In most of the instances, PILS improves over PLS, although the best performance may depend on the number of sequences and their size: a small (large) number of perturbations gives better performance on larger (smaller) sequences and larger (smaller) number of sequences.
We complemented the analysis to gain insight into the absolute quality of the alignments produced by PILS, when compared to the reference alignments available in the BAliBASE benchmark. We rely on the correctly aligned pairs (SP) and columns correctly aligned (TC) measures, as performed in [15]. These ratios can be computed by the program BAliScore which is available by ftp from ftp-igbmc.u-strasbg.fr/pub/BAli-BASE [29]. We used seven different data sets from BAliBASE with specific features chosen from this benchmark; see Table 5. Columns Reference and id correspond to the reference and id number of the data set in the BAliBASE benchmark; columns m, We ran PILS 10 times on these data sets, and for each collection of runs, we chose the best alignment produced in terms of BSP s score. We computed the SP and TC ratios by using this alignment and the reference alignment available for each chosen data set in BAliBASE. For calculation of the SP ratio, we used the substitution matrix PAM 250. Moreover, for comparison purpose, we performed the same procedure for the alignments produced by Clustal Omega and Tcoffee.   Tables 6 and 7 show the results obtained with SP and TC ratios, respectively. The italics face values represent the best ratio. From Table 6, it can be observed that PILS obtained the best value for all the tested data sets. In Table 7, the best TC ratio is either from Tcoffee or Clustal Omega. Note that a null TC value was obtained by the alignment of Clustal Omega in RV12 id 20 and RV20 id 20, and by TCoffee in set from RV20 and id 1.