Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling

Sushma Reddy; Rebecca T. Kimball; Akanksha Pandey; Peter A. Hosner; Michael J. Braun; Shannon J. Hackett; Kin Lan Han; John Harshman; Christopher J. Huddleston; Sarah Kingston; Ben D. Marks; Kathleen J. Miglia; William S. Moore; Frederick H. Sheldon; Christopher C. Witt; Tamaki Yuri; Edward L. Braun

doi:10.1093/sysbio/syx041

Research output: Contribution to journal › Article › peer-review

187 Scopus citations

Abstract

Phylogenomics, the use of large-scale data matrices in phylogenetic analyses, has been viewed as the ultimate solution to the problem of resolving difficult nodes in the tree of life. However, it has become clear that analyses of these large genomic data sets can also result in conflicting estimates of phylogeny. Here, we use the early divergences in Neoaves, the largest clade of extant birds, as a "model system" to understand the basis for incongruence among phylogenomic trees. We were motivated by the observation that trees from two recent avian phylogenomic studies exhibit conflicts. Those studies used different strategies: 1) collecting many characters [∼42 mega base pairs (Mbp) of sequence data] from 48 birds, sometimes including only one taxon for each major clade; and 2) collecting fewer characters (∼0.4 Mbp) from 198 birds, selected to subdivide long branches. However, the studies also used different data types: the taxon-poor data matrix comprised 68% non-coding sequences, whereas coding exons dominated the taxon-rich data matrix. This difference raises the question of whether the primary reason for incongruence is the number of sites, the number of taxa, or the data type. To test among these alternative hypotheses, we assembled a novel, large-scale data matrix comprising 90% non-coding sequences from 235 bird species. Although increased taxon sampling appeared to have a positive impact on phylogenetic analyses, the most important variable was data type. Indeed, by analyzing different subsets of the taxa in our data matrix, we found that increased taxon sampling actually resulted in increased congruence with the tree from the previous taxon-poor study (which had a majority of non-coding data) instead of the taxon-rich study (which largely used coding data). We suggest that the observed differences in the estimates of topology for these studies reflect data-type effects due to violations of the models used in phylogenetic analyses, some of which may be difficult to detect. If incongruence among trees estimated using phylogenomic methods largely reflects problems with model fit, developing more "biologically-realistic" models is likely to be critical for efforts to reconstruct the tree of life.

Original language	English (US)
Pages (from-to)	857-879
Number of pages	23
Journal	Systematic Biology
Volume	66
Issue number	5
DOIs	https://doi.org/10.1093/sysbio/syx041
State	Published - Sep 1 2017
Externally published	Yes

Bibliographical note

Funding Information:
This work was supported by the U.S. National Science Foundation [grants DEB-0228682 and DEB-1118823 to R.T.K. and E.L.B.; DEB-0228675 to S.J.H.; DEB-0228688 to F.H.S.; DEB-0733029 to M.J.B.; and DEB-0228617 to W.S.M.].

Publisher Copyright:
© The Author(s) 2017.

Keywords

Birds
GTR model
Neoaves
coding exons
model fit
non-coding DNA
phylogenomics
taxon sampling

Cite this

Reddy, S., Kimball, R. T., Pandey, A., Hosner, P. A., Braun, M. J., Hackett, S. J., Han, K. L., Harshman, J., Huddleston, C. J., Kingston, S., Marks, B. D., Miglia, K. J., Moore, W. S., Sheldon, F. H., Witt, C. C., Yuri, T., & Braun, E. L. (2017). Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling. Systematic Biology, 66(5), 857-879. https://doi.org/10.1093/sysbio/syx041

Reddy, S, Kimball, RT, Pandey, A, Hosner, PA, Braun, MJ, Hackett, SJ, Han, KL, Harshman, J, Huddleston, CJ, Kingston, S, Marks, BD, Miglia, KJ, Moore, WS, Sheldon, FH, Witt, CC, Yuri, T & Braun, EL 2017, 'Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling', Systematic Biology, vol. 66, no. 5, pp. 857-879. https://doi.org/10.1093/sysbio/syx041

@article{e73a3e33ca7546b4b98926b5bb2c14ae,
title = "Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling",
keywords = "Birds, GTR model, Neoaves, coding exons, model fit, non-coding DNA, phylogenomics, taxon sampling",
author = "Sushma Reddy and Kimball, {Rebecca T.} and Akanksha Pandey and Hosner, {Peter A.} and Braun, {Michael J.} and Hackett, {Shannon J.} and Han, {Kin Lan} and John Harshman and Huddleston, {Christopher J.} and Sarah Kingston and Marks, {Ben D.} and Miglia, {Kathleen J.} and Moore, {William S.} and Sheldon, {Frederick H.} and Witt, {Christopher C.} and Tamaki Yuri and Braun, {Edward L.}",
note = "Funding Information: This work was supported by the U.S. National Science Foundation [grants DEB-0228682 and DEB-1118823 to R.T.K. and E.L.B.; DEB-0228675 to S.J.H.; DEB-0228688 to F.H.S.; DEB-0733029 to M.J.B.; and DEB-0228617 to W.S.M.]. Publisher Copyright: {\textcopyright} The Author(s) 2017.",
year = "2017",
month = sep,
day = "1",
doi = "10.1093/sysbio/syx041",
language = "English (US)",
volume = "66",
pages = "857--879",
journal = "Systematic Biology",
issn = "1063-5157",
publisher = "Oxford University Press",
number = "5",
}

TY - JOUR
T1 - Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling
AU - Reddy, Sushma
AU - Kimball, Rebecca T.
AU - Pandey, Akanksha
AU - Hosner, Peter A.
AU - Braun, Michael J.
AU - Hackett, Shannon J.
AU - Han, Kin Lan
AU - Harshman, John
AU - Huddleston, Christopher J.
AU - Kingston, Sarah
AU - Marks, Ben D.
AU - Miglia, Kathleen J.
AU - Moore, William S.
AU - Sheldon, Frederick H.
AU - Witt, Christopher C.
AU - Yuri, Tamaki
AU - Braun, Edward L.
N1 - Funding Information: This work was supported by the U.S. National Science Foundation [grants DEB-0228682 and DEB-1118823 to R.T.K. and E.L.B.; DEB-0228675 to S.J.H.; DEB-0228688 to F.H.S.; DEB-0733029 to M.J.B.; and DEB-0228617 to W.S.M.]. Publisher Copyright: © The Author(s) 2017.
PY - 2017/9/1
Y1 - 2017/9/1
KW - Birds
KW - GTR model
KW - Neoaves
KW - coding exons
KW - model fit
KW - non-coding DNA
KW - phylogenomics
KW - taxon sampling
UR - http://www.scopus.com/inward/record.url?scp=85028512724&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85028512724&partnerID=8YFLogxK
U2 - 10.1093/sysbio/syx041
DO - 10.1093/sysbio/syx041
M3 - Article
C2 - 28369655
AN - SCOPUS:85028512724
SN - 1063-5157
VL - 66
SP - 857
EP - 879
JO - Systematic Biology
JF - Systematic Biology
IS - 5
ER -
