Signatures of gene transfer in the parallel evolution of osmotrophic specialization in eukaryotes
Species tree reconstruction and molecular clock analysesThe species tree used in the reconciliation analyses was constructed in the following manner. Molecular dating of the species tree was performed with Phylobayes 4.1c, again under the CAT + GTR + G4 model and using the same alignment and the topology obtained in the species tree reconstruction process. This analysis required the species tree (see section above) and a sample of gene trees for every gene family cluster. 3e, we aggregated dated transfer events into two types: HGTs to osmotrophic groups originating from (1) osmotrophic groups or (2) other donors. We prioritized a balanced representation of as much phylogenetic diversity as possible, not only within the osmotrophic groups, but also among different clades from Amorphea and Diaphoretickes, the two divisions of eukaryotes in which the four osmotrophic groups belong to.
Protein sequences datasets
We prepared a species protein sequence dataset specifically for this study. As described in the main text, the four transitions towards osmotrophy occurred in the eukaryotic groups Opisthokonta ((1) Fungi and (2) Teretsoporea) and Stramenopiles ((3) Pseudofungi and (4) Labyrinthulea). Opisthokonta and Stramenopiles belong, respectively, to the major eukaryotic divisions Amorphea and Diaphoretickes. Taking this into consideration, the dataset employed in this study includes 1,363,672 protein sequences from 86 eukaryotic species within both divisions of eukaryotes. The use of proteome datasets derived from genome projects was prioritized over proteome predictions derived from only transcriptome data. Proteomes from transcriptomic projects were only used for groups with very few genomic data available (for example, Glaucophyta), to expand the representation of certain osmotrophic groups (for example, Teretosporea), or to sample clades that branch close to the osmotrophic groups in the eukaryotic phylogeny (for example, Opalazoa). Owing to computational constraints, we prioritized proteome data from a broad and balanced taxonomic representation within the eukaryotic divisions Amorphea and Diaphoretickes. Both eukaryotic divisions include most of the known eukaryotic diversity, and the availability of genomic data from the groups that branch outside both divisions is poor, and in many cases, the phylogenetic position of these taxa is poorly resolved. The purpose of this study was to explore gene content changes that occurred near the emergence of the osmotrophic groups and during their diversification. Since the osmotrophic groups branch late within the Amorphea and Diaphoreticles divisions, we expected that the absence of species representation outside both divisions would not be problematic for our analyses. Still, we tested this hypothesis, in addition to the impact of extending the taxonomic representation of some groups, by re-running some of our main analyses under an extended dataset (see the Methods section ‘Re-analyses with an extended dataset’ below).
Gene family clusters
We aligned the dataset proteins in an all-against-all fashion with BLASTP76 version 2.10.1+, using an E-value of 1 × 10−3, enabling the soft_masking option and keeping the 500 best-scoring targets per each query. BLASTP alignments were then used as input for the MCL77 clustering software version 14-137, which groups genes into clusters on the basis of a sequence similarity network constructed from the all-against-all BLASTP alignments. MCL was run with an inflation value of -I 2, using the −log 10 -transformed BLASTP E-values as a similarity metric. MCL output clusters with fewer than four members were discarded for downstream analyses. Clusters with >750 members were also discarded as they are unlikely to produce reliable alignments for phylogenetic inferences. The cluster definition file and the sequences included in the clusters are available in Supplementary Data.
Osmoclusters
We identified gene family clusters that are potentially important for osmotrophic specialization on the basis of the logic that these should be overrepresented in copy number in the osmotrophic groups compared with other species. For every gene family cluster tested (we excluded clusters found in only one species), we performed four statistical tests (Mann–Whitney U test, using the mannwhitneyu function from SciPy78 1.10.1), one per osmotrophic group, to evaluate whether the copy number distribution of the given osmotrophic group (for example, Fungi) in the cluster is significantly greater than that of the background set of species (the background species were all the eukaryotes sampled in our study that do not belong to any of the four osmotrophic groups). For each osmotrophic group, the overrepresentation tests were performed on those clusters that are represented in at least one species from the group. In total, we identified 4,764, 13,816, 4,676 and 6,731 families overrepresented in Fungi, Teretosporea, Labyrinthulea and Pseudofungi, respectively, compared with the background set of species (adjusted P value <0.05; adjusted P values were computed with the Benjamini–Hochberg procedure using the fdrcorrection function from statsmodels79 0.13.5). Among these families overrepresented in at least one osmotrophic group, 189 of them are overrepresented in three or more of them, that is, in at least one osmotrophic group from Amorphea and at least one osmotrophic group from Diaphoretickes. We refer to this set of clusters as osmoclusters. We also performed an additional round of the same statistical analyses but excluding the photosynthetic species from the background set of species when performing the overrepresentation tests. Note that, sensu stricto, we did not consider as osmoclusters the overrepresented clusters resulting from this second round of analyses, since these are likely to be shared gene content signatures that are unlikely to be specific to the osmotrophic groups but also shared with the photosynthetic groups (see ‘Osmoclusters—genomic convergence between the osmotrophic groups’ section in the Results). The results from the two rounds of statistical tests are available in Supplementary Data.
Functional annotation of the gene family clusters and metabolic pathways
To produce functional annotations of the clusters (related to Fig. 2b,c and Fig. 4d), we first annotated all the sequences contained in the clusters. To annotate enzyme information and functional category information, we run eggNOG-mapper80 version 2.0.1 on local mode. Among the retrieved annotations, we used the Cluster of Orthologous Groups (COG) functional categories81 and the KEGG Orthology Groups82. We considered as enzymes those KEGG Orthology Groups with associated Enzyme Commission numbers. Once individual sequences were annotated, for each cluster, we transferred the annotations made for their sequence cluster members by weightening every annotation found in at least one member of the cluster by the relative fraction of cluster members for which that specific annotation was detected, as performed in ref. 63. When transferring sequence annotations to clusters, sequences that were not annotated as enzymes were categorized as ‘No_enzyme’ before aggregating annotations on a cluster level. Regarding the transferring COG functional category annotations to clusters, for sequences with more than one COG category annotated, the contribution of each category was first divided by the amount of COG categories annotated for that sequence. All COG categories representing known functional information were considered.
We controlled for the potential interaction between cluster annotations and cluster size (number of sequence representatives of a cluster) when comparing the fraction of sequences annotated as enzymes in osmoclusters versus other clusters. In particular, we grouped gene family clusters into size bins ([1,2), [2,3), [3,4), [4,5), [5,10), [10,15), [15,20), [20,30), [30,50), [50,75), [75,100), [100,125), [125,150), [150,200), [200,300), [300,400), [400,500), [500,700), [700,1,000), [1,000,∞)), and performed 1,000 simulations, randomly sampling clusters from the set of clusters that are not osmocluster (non-osmoclusters) from every size bin so that in every simulation each bin is equally represented in the osmoclusters and in the subsampled set of clusters. During the cluster sampling process, only clusters that are not among osmoclusters, or clusters that were not previously sampled in the simulation round were considered. We then computed the average of the 1,000 values retrieved from each simulation to retrieve a single value representative of the whole range of interactions. The relative fraction of sequences with enzyme annotations corresponds to the opposite of the fraction of sequences without enzymatic annotations (85% in the non-osmoclusters clusters and 72% in osmoclusters). KEGG PATHWAY maps (related to Fig. 2c) were reconstructed using the online tool https://www.genome.jp/kegg/mapper/reconstruct.html (details in Supplementary Information 1 and 2).
Species tree reconstruction and molecular clock analyses
The species tree used in the reconciliation analyses was constructed in the following manner. We first explored the filtered clusters with the aim of identifying a reliable subset from which to construct a supermatrix for the phylogenetic inference. In particular, we targeted those clusters having broad taxonomic distributions but at the same time few sequences per species to minimize paralogy problems. This led to 214 clusters that are present in at least 78 of the 86 species sampled and with a maximum of 100 sequences per cluster. Of these, we kept the 130 showing the highest ‘mean residue score’ from the heads or tails approach, which has been proposed as a measure of uncertainty in multiple sequence alignments83.
Before being concatenated into a supermatrix, these 130 clusters were cleaned of inparalogs and potential outparalogs (false cluster members), leaving one sequence per taxon. First, to remove inparalogs, in the cases when a given taxa had more than one representative in the cluster, we kept for each taxa the representative that aligned on average with the highest average BLAST score to the rest of the members of that cluster. By doing this, we should expect to retain the less divergent paralogs while removing the more divergent ones. Then, to remove potential outparalogs, we ranked the remaining members of each taxon in all the clusters by the mean BLAST score-based similarity that each member shows to the other members of the corresponding cluster, and then we excluded the 20 most divergent members for every taxon. By doing this, we expect to remove most, if not all, the outparalogs at the expense of having discarded only a very minor fraction of the dataset.
After the cleaning process, clusters were aligned separately before being concatenated into a single supermatrix (alignment and alignment trimming were also carried out with MAFFT84 version 7.407 using the ‘-linsi’ option, and with trimAl85 1.2rev59 using the ‘-gappyout’ option, respectively). The resulting supermatrix includes 86 taxa and 31,888 positions, with a median of 13.08% missing data. The phylogenetic inference was performed using Phylobayes-MPI86 version 1.8 under the CAT + GTR + Gamma(4) model, with two chains running for more than 13,000 generations until an effective size larger than 50 was reached for every parameter represented in the trace file. A consensus tree was built from the two chains with a burn-in of 25% (Extended Data Fig. 5). The sampled species tree topologies by each chain are available in Supplementary Data. The consensus tree recovered (Extended Data Fig. 5) is consistent with the current view of the eukaryotic species tree87, and the topology demonstrates robust support with high posterior probability values retrieved throughout, including the deepest parts of the tree (Extended Data Fig. 5). In addition, the consensus-tree topology was found to be better in the gene tree–species tree reconciliation analyses than three other alternative topologies, each of which included alternative topological resolutions for poorly supported bipartitions within Basidiomycota, Ochrophyta and Haptophyta (Supplementary Table 9). In particular, we performed different reconciliation analyses, each time using an alternative species tree, and the consensus-tree topology obtained the highest value for the sum of log-likelihoods (Supplementary Table 9). We thus used consensus-tree topology for the molecular dating and for the reconciliation analyses based on the original, pre-extended taxon sampling (see the Methods section ‘Re-analyses with an extended dataset’ below).
Molecular dating of the species tree was performed with Phylobayes 4.1c, again under the CAT + GTR + G4 model and using the same alignment and the topology obtained in the species tree reconstruction process. We ran two chains for each of the following relaxed molecular clock models (log-normal auto-correlated88 and uncorrelated gamma89) until at least 10,000 trees were sampled by each chain (30% of burn-in). All chains were run, imposing a set of fossil calibrations (Supplementary Table 7) under the soft constraints option and using a birth–death prior on divergence times. The sampled chronograms are available in Supplementary Data.
Ancestral gene content estimation by phylogenetic reconciliation
We inferred the gene contents of every ancestral lineage in our phylogeny by means of gene tree–species tree reconciliation41. This analysis required the species tree (see section above) and a sample of gene trees for every gene family cluster. Before performing gene tree sampling for every cluster, we first incorporated non-eukaryotic homologues into the clusters, as performed in ref. 63. We employed DIAMOND90 version 0.9.24 with the -more-sensitive parameter and an E-value threshold of 1 × 10−5. First, all eukaryotic sequences from the protein dataset of this study (euk_db) were aligned against the reference non-eukaryotic database used in ref. 63, which included the UniProt reference proteomes for Bacteria, Archaea and Virus (release 2016_02). This constituted the forward alignment approach. Next, sequences from prok_db that aligned to euk_db were subjected to a reverse alignment back against euk_db. Alignments with query and target coverages below 75% were discarded. In addition, any hit where the highest-scoring euk_db target for a given prok_db query belonged to a different cluster than the best-scoring euk_db query for that same prok_db sequence in the forward alignment was also removed. Following these filtering steps, we incorporated only the best-scoring prok_db query for each euk_db target sequence into the clusters (for example, if a cluster contained 300 euk_db sequences and the same prok_db sequence was the top hit for all of them, only that single prok_db sequence was added, resulting in a cluster with 300 euk_db sequences and 1 prok_db sequence). After having incorporated the non-eukaryotic homologues into the gene family clusters, we proceeded to sample gene trees for each cluster. We first performed multiple sequence alignments using MAFFT version 7.407 (‘-linsi’ option). Alignments were trimmed with trimAl 1.2rev59 (‘-gappyout’ option). The 1,000 optimized ultrafast bootstrap replicates were then sampled from each trimmed alignment using IQ-TREE91 version 1.6.12 under the LG + G4 model (Shimodaira–Hasegawa approximate likelihood-ratio test (SH-aLRT) bootstrap values were also computed). The sampled bootstrap replicates were used as gene tree samples to be reconciled against the species tree using the phylogenetic reconciliation software ALE41 (ALEml_undated from ALE version 1.0). In addition, in ref. 63, the species tree incorporated a pseudospecies branch representing all non-eukaryotic sequences, and transfers to eukaryotes from non-eukaryotes and transfers between eukaryotes were considered in the reconciliation model (Extended Data Fig. 6).
Dating the genomic content of osmoclusters over time
We computed the mean osmocluster count per million years (My) time unit in the evolutionary path going from the root of our eukaryotic tree to the extant representatives of each osmotrophic group. In particular, for each osmotrophic group, we first computed the mean osmoclusters count in every branch found between the terminal nodes of the group and the root of the tree (both included). From this, we computed the mean osmocluster count for every time unit by performing the weighted arithmetic mean of the mean osmoclusters’ count of every branch under consideration, using as weights the fraction of post-burn-in chronograms supporting the existence of every lineage (branch) in the corresponding My time unit.
Dating transfer signal inferred with ALE for a subset of osmoclusters
The output of ALE includes lineage-to-lineage transfer information. We collected this information for the subset of osmoclusters in which our manual screening identified signatures of HGT cases (see ‘Manually supervised identification of osmoclusters with putative HGT patterns’ section). Even in gene trees with topologies that are strongly suggestive of HGT patterns, sometimes it becomes challenging to establish the lineages that were involved in the HGT events owing to uncertainty, particularly if transfers occurred deep in evolutionary time (for example, ref. 36). In such cases, when there are different combinations of lineages that could have been involved in the HGT as donors–receptors, probabilistic reconciliation methods such as ALE can help in dealing with uncertainty in a probabilistic manner. ALE samples gene trees that have been reconciled against the species tree. From the sampled reconciliation scenarios, we can obtain the probability of pairs of donor–receptor lineages to have been involved in a HGT by taking into account the distribution of gene tree topologies consistent with the bootstrap samples obtained during the gene tree reconstruction step. By doing so, we could add up the results from the phylogenetic reconciliation analyses on the 40 osmoclusters in which we previously identified potential HGT topologies (see ‘Manually supervised identification of osmoclusters with putative HGT patterns’), and we aggregated the signal to get insights into an approximate distribution of the HGT signal over evolutionary time. For that, we dated the transfer events inferred with ALE (Fig. 3e) using the dates provided by the molecular dating analysis of the species tree.
The logic behind dating transfer events is the following. Every transfer must have happened in the period of time in which the respective donor and receptor lineages (branches) coexisted/overlapped in the chronogram. The dating software reports ages for nodes, not for branches. In a chronogram, every branch is delimited by two nodes (here referred to as birth and death nodes), and the age of a branch is defined by the ages of the respective birth and death nodes (here referred to as birth and death ages). Note that the model of ALE considers the possibility that transfers can come either from sampled lineages or from unsampled/extinct donor lineages descending from sampled lineages92. Because of this, when ALE outputs a given lineage as a transfer donor (for example, the lineage corresponding to the immediate ancestor of Fungi), this implies that the transfer could have occurred at any time between the birth of the donor lineage until the present. For example, if ALE reports as donor lineage the last common ancestor of Fungi, this means that the receptor could have acquired the gene either from the immediate branch preceding this node or from any unsampled/extinct lineage that may have diverged from the immediate branch preceding and became extinct at some point. Therefore, transfer ages are not constrained by the death age of the donor branches. In fact, they are delimited by two values: (1) the age of the death of the receptor lineage and (2) the birth age of either the donor or the receptor (whichever is the youngest).
On the basis of the information mentioned above, we dated transfers with the following approach: for every chronogram, we evaluated every branch in the role of donor and in the role of receptor. In the role of donor, a branch existed in the period of time delimited by its birth age and current time (0 My). In the role of receptor, a branch existed in the period of time delimited by its birth and death ages. We then checked the period of time in which every possible donor–receptor pair could have coexisted in every chronogram. After inspecting all chronograms, every possible donor–receptor pair has a value for every time unit (million years, from 0 My to the oldest root age found in the chronograms), which corresponds to the relative fraction of chronograms supporting the possibility that the two lineages could have coexisted in that specific time. We then normalized the distribution of values obtained for every donor–receptor pair by dividing the value of every time unit by the sum of values of all time units. Every transfer event was then dated according to the normalized distribution obtained for the corresponding donor–receptor pair. In particular, we obtained the probability of every transfer to have occurred in every million year time unit by multiplying the corresponding transfer frequency value (output from ALE) by the normalized value obtained for that time unit for the corresponding donor–receptor pair.
For Fig. 3e, we aggregated dated transfer events into two types: HGTs to osmotrophic groups originating from (1) osmotrophic groups or (2) other donors. In both cases, the osmotrophic lineages included all extant representatives and their preceding ancestral lineages. We also ranked the total HGT signal of these 40 osmoclusters against 100 background sets of 40 clusters, which were randomly sampled to maintain the same representation of cluster size bins as in the 40 osmoclusters (cluster size bins are defined in the ‘Functional annotation of the gene family clusters and metabolic pathways’ section). This yielded 101 values per time unit (one corresponding to the 40 osmoclusters and one for each background set of clusters), allowing us to identify the relative rank of the 40 osmoclusters at each timepoint (Supplementary Table 12). This analysis was performed for the 1,500–0 Ma period, as the transfer signal identified is residual before that period (Fig. 3e).
Re-analyses with an extended dataset
The availability of protein species data, on the one hand, and the computational requirements of the analyses performed, on the other, are two important factors influencing the size and the set of species to be included in a given comparative genomics analyses. Regarding the latter point, the all-versus-all BLASTP alignments used to build the gene family clusters, the relaxed molecular clock analyses used to obtain age estimates for the osmotrophic groups and the gene tree–species tree reconciliation analyses used to estimate ancestral gene contents are particularly computationally taxing. We prioritized a balanced representation of as much phylogenetic diversity as possible, not only within the osmotrophic groups, but also among different clades from Amorphea and Diaphoretickes, the two divisions of eukaryotes in which the four osmotrophic groups belong to. Our taxon sampling allowed the execution of state-of-the-art methods; for example, we could employ BLASTP, the baseline all-against-all protein alignment tool, we could run sophisticated phylogenetic models for the inference of the species topology and the relaxed molecular clock analysis and we could perform ancestral state reconstruction by employing maximum-likelihood phylogenetic reconciliation tools. We explored whether we could identify signatures that the patterns seen in our results could have been driven to some extent by a limited representation in our taxon sampling. For that, we re-ran those analyses that are less computationally prohibitive with an extended taxon sampling that incorporates 35 additional species’ proteomes for a total of 121 represented proteomes. Apart from providing an expanded taxonomic coverage of previously represented groups, this extended dataset also incorporated proteome data from Discoba, Metamonada and Malawimonadida, all of which branch outside the Amorphea and Diaphoretickes divisions of eukaryotes (Supplementary Table 8). Proteome data for the extended dataset was selected from ‘The Comparative Set’, which consists of 196 species chosen by the developers of EukProt (version 3) on the basis of BUSCO completion and phylogenetic importance93. Among the set of species that could be selected, we discarded those with reported evidence of contamination61. We also avoided incorporating proteomes coming from transcriptomic data that showed high values of sequence redundancy according to the duplicated BUSCO score. For the re-analyses run under this extended taxon sampling (Extended Data Figs. 3 and 4), we had to incorporate sequences from the extended proteomes into the original gene family clusters, as well as incorporate the extended species set into the reconstructed species tree. Regarding the extension of the clusters, we did so by performing BLASTP alignments of the extended proteomes against a joined database with the rest of the species’ proteomes, (parameters: -evalue 1e-3 -max_target_seqs 250), assigning each sequence to the previously defined gene family clusters on the basis of the best hit criteria. Regarding the species tree extension, we identified the most likely phylogenetic placements of the extended taxa by performing a comprehensive screening of the available literature (Supplementary Table 8).
Manually supervised identification of osmoclusters with putative HGT patterns
We performed a thorough evaluation of each osmocluster, one by one, to retrieve a curated subset of osmoclusters showing putative HGT patterns in their tree topologies. When HGT takes place, it is expected to leave signatures in the genome. The younger the HGT event is, the higher the degree of conservation of HGT signatures will be in the genomes of extant species, and vice versa. For old HGT events, it can become challenging to identify the exact identity of the donor and receptor lineages involved, but there may still be incongruent phylogenetic signals supporting the occurrence of the HGT. The evaluation of HGT cases in osmoclusters consisted of checking in parallel three types of information for each osmocluster: (1) in the gene tree (poorly supported bipartitions with <80% SH-aLRT bootstrap value were collapsed), we searched for clades involving sequences from distantly related osmotrophic groups (that is, osmotrophic groups from Amorphea branching closely related to osmotrophic groups from Diaphoretickes; these are less likely to be explained by gene losses, which are pervasive in evolution43,94). Gene trees incorporated sequences from the extended species proteomes; (2) in the taxonomic distribution of the cluster, the more biased the distribution is towards osmotrophic groups, the less likely is that the observed distribution could be explained by differential duplication and loss patterns rather than by HGT; (3) sequence similarity searches between each sequence in the osmocluster and every sequence included in our study were also considered as gene-tree-independent evidence. This third source of information is important to validate that the potential HGT trends in the gene tree are not artefacts derived from errors in the gene family clustering or in the phylogenetic reconstruction processes. Alignments for (3) were performed with DIAMOND version 2.0.14.152, aligning every query sequence against a concatenated database of the eukaryotic and non-eukaryotic sequences included in our study, including the extended eukaryotic species proteomes (-evalue 1e-3 -more-sensitive). See Supplementary Information 3 for a detailed explanation on how these three sources of information were crossed to evaluate putative HGT patterns in osmoclusters, including a specific report for each osmocluster case identified.
Semi-automated quantification of HGT topologies in all clusters
Beyond osmoclusters, we also explored the whole set of gene family clusters for highly supported HGT clades involving sequences from distantly related osmotrophic groups. We used the ETE95 version 3 toolkit to search for clades including only sequences from two osmotrophic groups, each belonging to a distinct eukaryotic division: Pseudofungi–Fungi, Pseudofungi–Teretosporea, Labyrinthulea–Fungi and Labyrinthulea–Teretosporea. We did not consider pairs of osmotrophic groups’ pairs belonging to the same eukaryotic division to minimize chances of false-positive HGT inferences that could be confounded by complex duplication and loss patterns. Before clade inspection, poorly supported clades from the phylogenies (<80% SH-aLRT bootstrap support) were collapsed and turned into polytomies. A total of 1,267 clades in 1,029 clusters passed this automated filter step. In 442 of these 1,267 clades, we found at least one sequence from each of the two osmotrophic groups involved in the clade that performed the best hit in the Diamond alignments carried out for the previous section to a sequence belonging to the other osmotrophic group included in the clade (ignoring in any case hits involving sequences belonging to the same taxonomic category). To provide a specific example, in a clade incorporating sequences from Pseudofungi and Fungi, at least one sequence from Pseudofungi must return its best-alignment score against a sequence from Fungi (excluding hits from Pseudofungi), and in addition, at least one sequence from Fungi must return a best-alignment score against a sequence from Pseudofungi (excluding hits from Fungi). The remaining 442 clades, representing 398 clusters, were manually inspected in detail, exploring (i) the topology of the gene tree for the corresponding cluster and (ii) the taxonomic distribution of the best hits in Diamond alignments for the sequences from the two osmotrophic groups involved in the clade. After a case-by-case inspection, 175 clades from 172 clusters were considered as strongly suggestive of HGTs involving taxa from distantly related osmotrophic groups. Overall, 166 of these clades representing 163 clusters were validated after a re-analysis, which consisted of recomputing the gene trees and the Diamond alignments on the basis of the extended taxon sampling (Supplementary Information 4). On the basis of a rigorous inspection of the taxonomic composition of the sister branches to the HGT clades, and of the taxonomic patterns of the Diamond alignment results, among the 166 cases of HGT, we could confidently establish which osmotrophic group acted as the HGT donor for 62 cases. For these 62 cases, when there was uncertainty in the specific lineages involved as HGT donors–receptors within each osmotrophic group, HGT counts were fractionated and assigned to each plausible donor–receptor scenario identified.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
© All Rights Reserved.