HOW TO: Gene Family Search

(Go to gene family search)

This page lets you search for gene families by legume gene family ID (e.g., legfed_v1_0.L_2951WH) or by words from the family description (e.g., iron homeostasis or chlorophyll binding protein). Family descriptions are derived from homology-based functional analysis of the hidden Markov model (HMM) representing the known sequences in the family. They include InterPro and Gene Ontology identifiers, which may also be used as search terms. You may leave any search field blank if it does not matter to you, or specify several criteria together (all filters are combined with AND). The gene families were calculated in 2018 as part of the LegumeFederation project. The full family set is available here.
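To make the AND behaviour of the search fields concrete, here is a minimal sketch in Python. It is not the site's actual implementation: the GeneFamily record, the matches and search functions, the case-insensitive substring matching, and the demo.* family IDs and descriptions are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class GeneFamily:
    family_id: str
    description: str


def matches(family, family_id="", description=""):
    """Return True when the family satisfies every non-blank criterion."""
    # A blank field places no constraint on the result.
    if family_id and family_id.lower() not in family.family_id.lower():
        return False
    if description and description.lower() not in family.description.lower():
        return False
    return True


def search(families, family_id="", description=""):
    """Keep only the families that pass all supplied filters (logical AND)."""
    return [f for f in families if matches(f, family_id, description)]


# Hypothetical records for illustration; not real family data.
FAMILIES = [
    GeneFamily("demo.F_0001", "iron homeostasis regulator"),
    GeneFamily("demo.F_0002", "chlorophyll binding protein"),
    GeneFamily("demo.F_0003", "iron transporter"),
]

# One criterion: every family whose description mentions "iron".
print([f.family_id for f in search(FAMILIES, description="iron")])
# Two criteria: both must hold, so the result narrows.
print([f.family_id for f in search(FAMILIES, family_id="F_0003", description="iron")])
```

Leaving both fields blank returns every family in this sketch, which mirrors the idea that a blank field is simply ignored.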

The result table

The resulting table shows the Family ID and description along with the membership counts in different species. The table can be sorted by clicking on the column headers (clicking a header again will reverse the order of the sort).

The phylogram

The phylogram page displays a phylogenetic tree whose branch lengths are proportional to the amount of character change; the tree includes both legumes and selected non-legumes. You can also display the tree as a circular dendrogram. When the tree contains a large number of genes, this view does not show fine detail, but it gives a concise overview of the tree topology and of how gene content from different species is distributed across it. The Taxa and Legend button toggles the display of a diagram listing the taxa in the tree and illustrating their relative abundances (helpful for very large families). The legend is interactive: clicking the legend entry for a taxon filters the set of taxa displayed in the tree.

The analysis

The gene families at LegumeInfo were calculated as part of the NSF LegumeFederation project, as follows (to be described in more detail in a paper anticipated for late 2018). Gene pairs were filtered by per-species Ks values and then clustered using Markov clustering. Within each family, the sequence match score of each sequence was compared with the median score for the family to identify outliers. The remaining sequences were re-clustered and added to the HMM set. All sequences were then searched against all HMMs, realigned, re-screened relative to the median match score, and finally used to generate alignments and phylogenetic trees (using RAxML). The trees are rooted, when possible, using the closest outgroup from among five outgroup species: Arabidopsis thaliana, Prunus persica, Cucumis sativus, Solanum lycopersicum, and Vitis vinifera.
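The median-based outlier screen described above can be illustrated with a short Python sketch. The text does not give the actual cutoff or the scoring details, so the min_fraction parameter, its value of 0.4, the screen_by_median function, and the sequence names and scores below are all hypothetical.

```python
from statistics import median


def screen_by_median(scores, min_fraction=0.4):
    """Split sequence IDs into (kept, outliers) by score relative to the family median.

    scores maps sequence ID -> match score for one family.
    min_fraction is a hypothetical cutoff; the real pipeline's value is not stated.
    """
    family_median = median(scores.values())
    cutoff = min_fraction * family_median
    kept = sorted(seq for seq, score in scores.items() if score >= cutoff)
    outliers = sorted(seq for seq, score in scores.items() if score < cutoff)
    return kept, outliers


# Hypothetical match scores for the members of one family.
family_scores = {"seqA": 410.0, "seqB": 395.0, "seqC": 402.0, "seqD": 120.0}

kept, outliers = screen_by_median(family_scores)
print("kept:", kept)
print("outliers:", outliers)
```

Using the median rather than the mean keeps the reference score stable when a family contains a few very poorly matching sequences, which is presumably why a median-relative screen suits this step.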

Note: The current gene families at LegumeInfo replace (as of May 2018) an older set of families that were built on the Phytozome 10.2 Angiosperm-level gene family models.