This page presents general guidelines regarding locus annotations. Please contact Ian Blaby (iblaby@bnl.gov) for any correspondence regarding future gene names.

The text below is an excerpt from:

Blaby IK, Blaby-Haas CE, Tourasse N, Hom EF, Lopez D, Aksoy M, Grossman A, Umen J, Dutcher S, Porter M, King S, Witman GB, Stanke M, Harris EH, Goodstein D, Grimwood J, Schmutz J, Vallon O, Merchant SS, Prochnik S (2014) The Chlamydomonas genome project: a decade on. Trends Plant Sci. 19:672-80

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4185214/

 

To name or not to name?

Over-annotation in databases, whether of an automated origin, or user-initiated, is common and detrimental: errors can proliferate as computer algorithms map data to new genomes [39]. We therefore propose that genes should only be named (i.e. given what geneticists formally call a gene symbol, such as ODA11 or RBCS2) if one of the following is true: (1) A function or involvement in a specific biological process is associated with a publication. In this case, a PubMed ID (PMID) or other citation should accompany the gene symbol, which should be included in the Phytozome Description. (2) A gene is associated with a high-throughput screen or global study, e.g. proteomes of flagella resulting in the naming of flagellar associated proteins (FAP) or the conserved green-lineage (CGL) associated genes. (3) The gene function is confidently predicted by a rigorous bioinformatic study. Indeed, annotation by investigators with extensive knowledge of a particular pathway has been very valuable [40].

If the above criteria are not met, then a gene symbol should not be created. This includes genes encoding proteins with poor similarity to sequences in other organisms (forcing an annotation) or for which the naming is only based on a single conserved domain. In a similar vein, genes should not be named on the basis of homology to proteins involved in a process that does not (or has not been shown to) exist in Chlamydomonas. For example, the protein encoded by Cre02.g116900 displays high similarity to small hydrophilic plant seed proteins in Arabidopsis. In the absence of seed production, this protein clearly cannot perform this function in Chlamydomonas, and therefore should not be named after the Arabidopsis gene ATEM1. Genes without an assigned symbol should be referred to by their locus ID, since every locus has a unique and stable ID. To distinguish between a gene and an encoded protein, we suggest italicizing locus IDs (Crex.gyyyyyy) and non-italicizing proteins (Crex.gyyyyyy).
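The naming rule above can be sketched as a small decision function. This is only an illustration: the `Evidence` fields, the function names, and the locus ID pattern (inferred solely from the example Cre02.g116900) are assumptions, not part of any Phytozome tool.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern, inferred only from the example "Cre02.g116900".
LOCUS_ID = re.compile(r"Cre\d{2}\.g\d{6}")


@dataclass
class Evidence:
    """Hypothetical record of the three naming criteria."""
    citation: Optional[str] = None          # (1) PMID or other citation
    high_throughput_study: bool = False     # (2) screen or global study
    rigorous_prediction: bool = False       # (3) rigorous bioinformatic study


def may_be_named(evidence: Evidence) -> bool:
    """A gene symbol is allowed if at least one criterion is met."""
    return (
        bool(evidence.citation)
        or evidence.high_throughput_study
        or evidence.rigorous_prediction
    )


def display_name(locus_id: str, symbol: Optional[str], evidence: Evidence) -> str:
    """Return the gene symbol if it is justified, otherwise the locus ID."""
    if not LOCUS_ID.fullmatch(locus_id):
        raise ValueError(f"unexpected locus ID format: {locus_id!r}")
    if symbol and may_be_named(evidence):
        return symbol
    return locus_id


# The example from the text: similarity alone does not justify a symbol,
# so the gene is referred to by its locus ID.
print(display_name("Cre02.g116900", "ATEM1", Evidence()))
```

Falling back to the locus ID keeps every gene addressable by a unique and stable identifier even when no symbol is warranted.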

How to devise a gene symbol

Gene nomenclature guidelines have been established by the Chlamydomonas community (http://www.chlamy.org/nomenclature.html), but are not always strictly followed. We hereafter recall the basic rules, and when it is acceptable to depart from them.

  1. The preferred format for gene symbols in C. reinhardtii is a 3–5 letter root, in uppercase for nuclear genes, or lower case for organelle genes; this is followed by a number denoting isoform, or occasionally subunits (although for historically named genes, a combination of letters or numbers has been used and can denote numbered mutants recovered in a genetic screen. Alternatively, the gene symbol, including a number, has on occasion been maintained exactly from the orthologous gene of another organism). In general, 3 letters is preferred, but may not always be possible (for example when using an Arabidopsis gene name, which does not conform to a 3-letter standard, the name should not be abbreviated). The root should indicate or abbreviate some aspect of function or phenotype. For example, GPD1-GPD4 encode 4 isoforms of glycerol-3-phosphate dehydrogenase, ASA1-ASA9 encode the 9 Chlorophyceae-specific subunits of the mitochondrial ATP synthase, and ACLA1 and ACLB1 encode ATP citrate lyase subunits A and B. For historical reasons, some names depart from this scheme, for example HSP70A, HSP70B, HSP70C encode three isoforms of HSP70. Nuclear genes for photosynthesis will retain their cyanobacterial name, followed by a number to denote isoform, unless several isoforms exist (for example RBCS1-RBCS2, PSBP1-PSBP9).
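As a rough illustration of the preferred format, the following sketch classifies a symbol as a nuclear-style symbol, an organelle-style symbol, or a departure from the scheme. The regular expressions and the function name `classify_symbol` are assumptions that encode only the "3–5 letter root plus number" rule; historically accepted names will deliberately fall outside it.

```python
import re

# Assumed encodings of the preferred format: a 3-5 letter root, then a number.
NUCLEAR = re.compile(r"[A-Z]{3,5}[0-9]+")    # uppercase root: nuclear genes
ORGANELLE = re.compile(r"[a-z]{3,5}[0-9]+")  # lowercase root: organelle genes


def classify_symbol(symbol: str) -> str:
    """Return 'nuclear', 'organelle', or 'nonstandard' for a gene symbol."""
    if NUCLEAR.fullmatch(symbol):
        return "nuclear"
    if ORGANELLE.fullmatch(symbol):
        return "organelle"
    return "nonstandard"


for s in ["GPD1", "ASA9", "ACLA1", "HSP70A", "LF5"]:
    print(s, classify_symbol(s))
```

A "nonstandard" result does not mean a symbol is invalid: HSP70A and LF5 are historical names of the kind the guidelines explicitly tolerate, so a check like this is better used to flag symbols for review than to reject them.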

To make nomenclature more intuitive, gene symbols can be adapted from those of orthologs in other organisms where characterized orthologs exist. This will ensure related gene symbols across organisms, simplifying comparisons between organisms and retrieval of associated literature.

  1. Potential confusion should be avoided by confirming the proposed gene symbol is not already in use in Chlamydomonas. The authors of this manuscript are available to help researchers verify this. Ideally, it should also not be used in another organism for a different function. The global gene hunter tool (http://www.yeastgenome.org/help/community/global-gene-hunter) enables six databases to be searched simultaneously for this purpose. The Gene database (http://www.ncbi.nlm.nih.gov/gene), at the National Center for Biotechnology Information (NCBI), is also useful for this purpose and can be used to trace gene name roots across different organisms.
  2. Historically, many genes were discovered following genetic studies of mutants named on the basis of a phenotype, or expression or localization studies (e.g. LF5 mutants have long flagella, LCI5 is low-CO2 inducible). Whenever informative of function, these names are preferred as the primary gene symbol over names describing molecular functions. Alternative gene symbols are stored as aliases in Phytozome, allowing the gene to be found if any of its symbols is used as a search term. This effectively links genes to all related literature and vice versa.
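The two points above, checking that a proposed symbol is not already in use and finding a gene through any of its aliases, can be sketched with a simple lookup table. The locus IDs and symbols below are made-up placeholders, and the functions are hypothetical; this does not reflect how Phytozome stores or searches aliases.

```python
from typing import Dict, List, Optional


def build_index(records: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every symbol (primary or alias) to its locus ID.

    `records` maps a locus ID to its symbols, primary symbol first.
    Raises ValueError if one symbol is attached to two different loci.
    """
    index: Dict[str, str] = {}
    for locus_id, symbols in records.items():
        for symbol in symbols:
            if symbol in index and index[symbol] != locus_id:
                raise ValueError(f"{symbol} is used by {index[symbol]} and {locus_id}")
            index[symbol] = locus_id
    return index


def find_locus(index: Dict[str, str], term: str) -> Optional[str]:
    """Look up a gene by any of its symbols; None if the term is unknown."""
    return index.get(term)


def is_available(index: Dict[str, str], proposed: str) -> bool:
    """True if the proposed symbol is not yet used as a primary symbol or alias."""
    return proposed not in index


# Placeholder data for illustration only (not real annotations).
records = {
    "Cre00.g000001": ["ABC1", "XYZ3"],  # primary symbol, then one alias
    "Cre00.g000002": ["DEF2"],
}
index = build_index(records)
print(find_locus(index, "XYZ3"))    # found through its alias
print(is_available(index, "ABC1"))  # already in use
```

Matching is exact and case-sensitive here, because case distinguishes nuclear from organelle symbols in the preferred format.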