Wednesday, October 29, 2014
Uncertainty in multiple sequence alignments
It is well known that reticulations in phylogenetic networks can reflect variation in data sets from many sources, not only gene flow during evolutionary history. These other sources are presumably unwanted in the analysis when they are due to estimation errors. Such errors include incorrect data, inappropriate sampling, and model mis-specification.
For molecular data, one of the more obvious sources of model mis-specification is an incorrect multiple sequence alignment. This reflects wrong assessments of primary homology among the characters, so that the wrong residues are aligned in the columns. This particular issue seems not to have been addressed in the network literature in any systematic way.
However, it is obviously rather important. After all, who needs a phylogenetic network that reflects mis-alignment rather than evolutionary history? One approach to this issue would be to have some sort of measurement of our confidence in the alignment columns, which could be taken into account when the network is constructed.
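As a rough illustration of how column confidence might be taken into account, the sketch below down-weights unreliable columns when computing a pairwise distance, which could then feed a distance-based network method. It is not taken from any of the programs discussed in this post. The function name `weighted_p_distance`, the assumption that confidences lie between 0 and 1, and the choice to skip gapped columns are all illustrative.

```python
from typing import Sequence


def weighted_p_distance(seq_a: str, seq_b: str,
                        weights: Sequence[float], gap: str = "-") -> float:
    """Proportion of differing sites, each column weighted by its confidence.

    Hypothetical helper: weights are assumed to be per-column confidence
    scores in [0, 1], supplied by some alignment-reliability measure.
    """
    if not (len(seq_a) == len(seq_b) == len(weights)):
        raise ValueError("sequences and weights must have the same length")
    total = 0.0  # summed weight of the columns that are compared
    diff = 0.0   # summed weight of the columns where the residues differ
    for a, b, w in zip(seq_a, seq_b, weights):
        if a == gap or b == gap:
            continue  # ignore columns where either sequence has a gap
        total += w
        if a != b:
            diff += w
    return diff / total if total > 0 else float("nan")


# Two toy sequences; columns 3 and 5 (0-based) are considered doubtful.
confidence = [1.0, 1.0, 1.0, 0.2, 1.0, 0.5]
d_weighted = weighted_p_distance("ACGTAC", "ACGAAT", confidence)
d_plain = weighted_p_distance("ACGTAC", "ACGAAT", [1.0] * 6)
```

Here both differences fall in low-confidence columns, so the weighted distance (0.7 / 4.7) is less than half the unweighted one (2 / 6).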
One practical problem with this approach is that there has been a veritable cottage industry developing such measurements, which would need to be assessed for their suitability. So, I thought that I might list some of them here, along with a brief description of what they measure. The list is extensive but not necessarily exhaustive: it consists of those for which there was at some stage a computer program (there are others that have never been named). Most of the methods are designed specifically for amino-acid sequences, so not all of them can be used for nucleotides.
There are basically two types of measurement: (1) quantitative scoring schemes, which provide a reliability score for each aligned position, and (2) selection schemes, which select a subset of the aligned positions as being reliably aligned. So, I have divided the list roughly into these two groups, with the scoring schemes first and the selection schemes second, each in chronological order.
Dopazo J (1997) A new index to find regions showing an unexpected variability or conservation in sequence alignments. Computer Applications in the Biosciences 13: 313-317.
— evolutionary index is based on conservativeness of amino acid differences as predicted from nucleotide differences
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.
— quality is based on conservativeness of amino acid differences
Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
— score represents consistency among global and local alignments
Pei J, Grishin NV (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 17: 700-712.
— conservation is based on weighted entropy
Redelings BD, Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny. Systematic Biology 54: 401-418.
— approximate probability that the letter is homologous to the ancestral residue in its column
Lassmann T, Sonnhammer EL (2005) Automatic assessment of alignment quality. Nucleic Acids Research 33: 7120-7128.
— consistency based on overlap of alignments from several programs
Landan G, Graur D (2007) Heads or tails: a simple reliability check for multiple sequence alignments. Molecular Biology and Evolution 24: 1380-1383.
— measures uncertainty due to co-optimal alignments
Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast Statistical Alignment. PLoS Computational Biology 5: e1000392.
— several scores based on HMM consistency, certainty, expected accuracy, expected sensitivity, expected specificity
Penn O, Privman E, Landan G, Graur D, Pupko T (2010) An alignment confidence score capturing robustness to guide tree uncertainty. Molecular Biology and Evolution 27: 1759-1767.
— robustness to guide tree uncertainty
Kim J, Ma J (2011) PSAR: measuring multiple sequence alignment reliability by probabilistic sampling. Nucleic Acids Research 39: 6359-6368.
— agreement with probabilistic sampling of suboptimal alignments
Wu M, Chatterji S, Eisen JA (2012) Accounting for alignment uncertainty in phylogenomics. PLoS One 7: e30288.
— uses a pair hidden Markov model of sequence evolution to calculate the posterior probability that the residues of a column are correctly aligned
Chang J-M, Di Tommaso P, Notredame C (2014) TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Molecular Biology and Evolution 31: 1625-1637.
— transitive consistency score is an extended version of the COFFEE scoring scheme
Martin MJ, González-Candelas F, Sobrino F, Dopazo J (1995) A method for determining the position and size of optimal sequence regions for phylogenetic analysis. Journal of Molecular Evolution 41: 1128-1138.
— locates the smallest blocks with similar pairwise genetic distances to the whole alignment
Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17: 540-552.
— selected blocks are based on conservation of identity
Löytynoja A, Milinkovitch MC (2001) SOAP, cleaning multiple alignments from unstable blocks. Bioinformatics 17: 573-574.
— stability is measured with respect to variation in the Clustal gap-opening and gap-extension penalties
Thompson JD, Plewniak F, Ripp R, Thierry J-C, Poch O (2001) Towards a reliable objective function for multiple sequence alignments. Journal of Molecular Biology 314: 937-951.
— normalized mean distance is based on pairwise distances
Cline M, Hughey R, Karplus K (2002) Predicting reliable regions in protein sequence alignments. Bioinformatics 18: 306-314.
— uses information from near-optimal alignments
Lawrence CJ, Zmasek CM, Dawe RK, Malmberg RL (2004) LumberJack: a heuristic tool for sequence alignment exploration and phylogenetic inference. Bioinformatics 20: 1977-1979.
— identifies blocks whose phylogenetic tree is most similar to that of the whole alignment
Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, Stadler PF (2008) Noisy: identification of problematic columns in multiple sequence alignments. Algorithms in Molecular Biology 3: 7.
— identification of phylogenetically uninformative homoplastic sites from compatibilities in a circular split system
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25: 1972-1973.
— proportion of sequences with a gap, level of amino acid similarity, level of consistency across different (user-provided) alignments
Blouin C, Perry S, Lavell A, Susko E, Roger AJ (2009) Reproducing the manual annotation of multiple sequence alignments using a SVM classifier. Bioinformatics 25: 3093-3098.
— support vector machine reproduces manual annotations from other alignments
Criscuolo A, Gribaldo S (2010) BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology 10: 210.
— calculates entropy-like scores weighted by similarity matrices
Kück P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wägele JW, Misof B (2010) Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Frontiers in Zoology 7: 10.
— consensus profiles identify dominating patterns of nonrandom similarity
Rajan V (2013) A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments. Molecular Biology and Evolution 30: 689-712.
— compatible subsplits define clusters of sites which are then removed based on evolutionary rate
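
To make the distinction between the two kinds of measurement concrete, here is a minimal sketch using a toy alignment. It computes a per-column score (the scoring-scheme idea) and then keeps only the columns that pass a threshold (the selection-scheme idea). It does not reproduce any of the programs listed above: the unweighted Shannon entropy, the function names `column_scores` and `select_columns`, and the threshold values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def column_scores(alignment, gap="-"):
    """Return one (gap_fraction, entropy) pair per alignment column."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):  # iterate over columns, not sequences
        residues = [c for c in column if c != gap]
        gap_fraction = (n - len(residues)) / n
        entropy = 0.0  # Shannon entropy (bits) of the non-gap residues
        for count in Counter(residues).values():
            p = count / len(residues)
            entropy -= p * math.log2(p)
        scores.append((gap_fraction, entropy))
    return scores


def select_columns(scores, max_gap=0.5, max_entropy=1.0):
    """Indices of columns whose gap fraction and entropy are both low enough.

    The default thresholds are arbitrary illustrative values.
    """
    return [i for i, (g, h) in enumerate(scores)
            if g <= max_gap and h <= max_entropy]


toy_alignment = [
    "ACG-T",
    "ACGAT",
    "AT--T",
    "AGC-A",
]
scores = column_scores(toy_alignment)
kept = select_columns(scores)
```

With these thresholds the invariant first column is kept, the second is dropped for being too variable (entropy 1.5 bits), and the fourth is dropped because three of its four sequences have a gap. The real programs differ mainly in what replaces the simple entropy and gap scores used here.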