Gene Function Prediction



Our research focuses on the discovery of new biomolecular mechanisms from biological and medical data, especially the functional characterization of as yet uncharacterized genes and pathways with theoretical and computational methods. Dramatic recent improvements in nucleic acid sequencing technologies enhance the prospect of general availability of genomes from patients and patient-specific pathogens, as well as of gene expression data. This development has profound implications for life science research and biomedical applications. As biomolecular sequencing is becoming the most informative, most accurate and most readily available research technology in the life sciences, sequence analysis and sequence-based structure and function prediction will be more important than ever.

Typically, a project starts with sets of uncharacterized sequences, expression profiles or other types of omics data associated with known phenotypes, for which the driving biomolecular mechanism is sought. Most of the work is carried out with internal and/or external collaborators, including partners in clinics (e.g., 34863649, 34903892; numbers here and below are PubMed IDs) and in the biotech/pharma industry (e.g., MeshBio in Singapore). The work with clinical data motivated us to discuss the problem of access to patient data for biomedical research from three different perspectives: those of patients, clinicians and researchers (33717311). Large-scale sequencing of populations leads to new challenges in biodata analysis (36335097).

The unknown function of genomic regions will plague mankind for at least a century to come (22849370, 30265449). Though it is generally believed that full human genome sequencing was a watershed event in human history that boosted biomedical research, biomolecular mechanism discovery and life science applications, there is no sign that biomolecular mechanism discovery happens at a faster pace than before. The opposite is true: researchers in the field of genome annotation see a persisting, substantial body of genes that are functionally insufficiently characterized or not characterized at all (for example, ~10,000 protein-coding genes in the human genome), despite the availability of full genome sequences. A survey of the biomedical literature shows that the number of newly reported protein functions grew steadily until 2000, but the trend then reversed into a dramatic decline, even though the annual number of new life science publications doubled between 2000 and 2017. So, the group is active on fertile ground with a lot of discovery potential.

Applications reach into medical data analysis, natural product research and rare disease research. Our success stories include the discovery of the SET domain methyltransferases (10949293), ATGL (15550674), kleisins (12667442), and many new protein domain functions and functional sequence patterns (for example, in the GPI lipid anchor biosynthesis pathway, the peptide synthetase activity of GPAA1; 24743167). We discovered BindGPILA, a new membrane-embedded protein domain that has been evolutionarily multiplied in the proteins of the GPI lipid anchor pathway (29764287). It functions as the unit for recognizing, binding and stabilizing the GPI lipid anchor in a modification-competent form. An E. coli pangenome study revealed a tailocin specific to the pandemic ST131.

Recently, we ventured into research on cellular aging and the discovery of compounds and conditions that delay cellular aging processes (35705837, 35269484).

Together with collaborators, we discovered that the dysfunction of the human gene SUGCT contributes to gut microbiota dysbiosis, leading to age-dependent pathological changes in kidney, liver, and adipose tissue (31722069). We contributed to the development of AllerCatPro, a tool that predicts the allergenic potential of proteins based on the similarity of their 3D structure as well as their amino acid sequence to a data set of known protein allergens (30657872). Both projects were carried out in collaboration with S. Maurer-Stroh’s team.

In several cases, this research effort has involved the development of algorithms and software for the analysis of biomolecular sequence, omics, clinical and other life science data. Examples are prediction tools for post-translational modifications (GPI lipid anchoring, myristoylation, prenylation, phosphorylation) and for the subcellular localization of proteins (e.g., 20221930, 19029837, and 15575971), the ANNOTATOR software suite for protein function discovery from sequence (27115649), and NSC, the highly cited molecular surface computation algorithm (J. Comp. Chem. 16, pp. 273-284).

The collaboration with the G. Grüber crystallography lab (NTU, Singapore) resulted in a string of discoveries on the structure, the catalytic mechanism and the significance of the sequence architecture of the AhpF/AhpC alkyl hydroperoxide reductase complex, based on studies of mutated versions of AhpF/AhpC (31047989 and references therein).

The team is involved in both academic and industry-funded projects in collaboration with the A*STAR Natural Organism Library (teams of Ng Siew Bee, Y. Kanagasundaram, and P. Arumugam) (29979661). Recently, an analog of anthracimycin, an antibiotic so far known to be produced only by Streptomyces species, was predicted and verified to be produced by Nocardiopsis kunsanensis, a non-Streptomyces actinobacterial microorganism (29805716). Together with the BII NOL team, we discovered a new cyclic lipodepsipeptide with antifungal activity, BII-Rafflesfungin, which is produced by the fungus Phoma sp. F3723 (31088369). We identified a biosynthetic gene cluster compatible with the production of this new compound and proposed a mechanism for its biosynthesis.

The figure corresponds to Figure 4 in the publication (PubMed ID 33436046). It illustrates the spatial localization of the most conserved sequence motifs M1 (red), M2 (orange), M3 (yellow), M4 (green), M5 (blue), M6 (violet) and M7 (pink), all shown in ball mode, in the human proteins TMTC1/2/3/4 against the background of the structural cartoon of the whole protein model. DPM is shown as blackish sticks and the divalent metal ion as a reddish sphere.

We show the case of TMTC1; the figures for the other TMTCs look very similar. The existence of a strongly conserved DPM-binding site together with all elements of the active site indicates that the TMTCs are enzymatically active sugar transferases belonging to the GT-C/PMT superfamily. The DUF1736 segment, the loop between TM7 and TM8, is critical for catalysis and for binding of the lipid-linked sugar moiety. Together with the available indirect experimental data, this leads us to conclude that the TMTCs are not only part of an O-mannosylation pathway in the endoplasmic reticulum of higher eukaryotes but are, in fact, the sought-after mannosyltransferases themselves.


Figure 1. The most conserved sequence motifs of TMTC1/2/3/4 proteins come together spatially in model structures of the TMTCs and can be rationalized as a dolichylphosphomannose (DPM) binding site.


 A*STAR Senior Fellow, Senior Principal Investigator EISENHABER Frank
 Senior Principal Investigator EISENHABER Birgit
 Assistant Principal Investigator ALFATAH Mohammad
 Research Manager TANTOSO Erwin
 Senior Bioinformatics Specialist KUCHIBATHLA Durga
 Post-Doctoral Research Fellow ARSHIA Naaz


Selected Publications