Scientifically, the Gene Function Prediction group focuses on predicting the molecular and cellular functions of genes and proteins through the theoretical analysis of biomolecular sequences, expression profiles, and other high-throughput omics data. Most of the work is carried out with internal and/or external collaborators, including partners in clinics and in the biotech/pharma industry. In addition, the group provides support for other teams in BII and serves as an organizational fallback for staff involved in various collaborations, software development and incubation activities that do not readily fit into other existing PI-led teams.
Dramatic recent improvements in nucleic acid sequencing technologies enhance the prospect that genomes from patients and patient-specific pathogens, as well as gene expression data, will become generally available. This development has profound implications for life science research and biomedical applications. Sequencing is becoming one of the most informative research technologies in the life sciences; consequently, sequence analysis and sequence-based structure and function prediction will be more important than ever.
At the same time, however, researchers in the field of genome annotation observe that, despite the availability of full genome sequences, a substantial body of genes remains functionally insufficiently characterized or not characterized at all (for example, ~10,000 protein-coding genes in the human genome). A survey of the biomedical literature shows that the number of newly reported protein functions grew steadily until 2000, after which the trend reversed into a dramatic decline. Over the last decade, the fastest-growing set of genes has been the set that was already well characterized. Meanwhile, the annual number of life science publications doubled between 2000 and 2017.
With regard to new gene functions, we discovered BindGPILA, a new membrane-embedded protein domain that has been evolutionarily multiplied in the proteins of the GPI lipid anchor pathway. It functions as the unit that recognizes, binds and stabilizes the GPI lipid anchor in a modification-competent form. Recently, we discovered that the mitochondrial gene SUGCT contributes to gut microbiota dysbiosis, leading to age-dependent pathological changes in kidney, liver, and adipose tissue. We contributed to the development of AllerCatPro, a tool that predicts the allergenic potential of proteins based on the similarity of both their 3D structure and their amino acid sequence to a data set of known protein allergens. Both projects,  and , were carried out in collaboration with S. Maurer-Stroh.
The collaboration with the G. Grüber crystallography lab (NTU, Singapore) resulted in a string of discoveries about the structure, the catalytic mechanism and the significance of the sequence architecture of the AhpF/AhpC alkyl hydroperoxide reductase complex, obtained by studying mutated versions of AhpF/AhpC [5-7].
The team is involved in both academic and industry-funded projects in collaboration with the BII Natural Organism Library (teams of Ng Siew Bee, Y. Kanagasundaram, and P. Arumugam). Recently, an analog of Anthracimycin, an antibiotic so far known to be produced only by Streptomyces species, was predicted and verified to be produced by Nocardiopsis kunsanensis, a non-Streptomyces actinobacterial microorganism. Together with the BII NOL team, we discovered BII-Rafflesfungin, a new cyclic lipodepsipeptide with antifungal activity that is produced by the fungus Phoma sp. F3723. We identified a biosynthetic gene cluster compatible with the production of this new compound (Figure 1) and proposed a mechanism for its biosynthesis (Figure 2).
Figure 1: The predicted biosynthetic gene cluster BIIRfg.
A. Gene organization of BIIRfg. Biosynthetic genes are highlighted in red; the direction of each arrow corresponds to that of the reading frame. The orfs neighboring the NRPS and PKS genes are labelled a, b, c, d, e, f, g, h, i, j and k.
B. Domain structure of the NRPS gene: The green ovals labelled An (n = 1..8) represent adenylation domains. The orange/yellow ovals labelled Cn (n = 1..8) represent condensation domains. The last condensation domain is labelled CT and is shown as a red oval. The pink ovals labelled En (n = 1,2) represent epimerization domains. The cyan ovals labelled PCPn (n = 1..9) represent peptidyl carrier protein (PCP) domains. The constituent modules in the cluster are numbered from 1 to 10.
C. Domain structure of the PKS gene: The T1-PKS module consists of a beta-ketoacyl synthase (KS) domain shown in green, an acyltransferase (AT) domain in orange, a dehydratase (DH) domain in magenta, a methyltransferase (cMT) domain in grey, an enoyl reductase (ER) domain in cyan, a ketoreductase (KR) domain in yellow and an acyl carrier protein (ACP) domain in red.
The team continues to be successful in attracting grants (BE: NRF-CRP17-2017-03 (green and sustainable pharmaceutical manufacturing via biocatalysis); FE: CITI – Cancer Immunotherapy Imaging). The development of the data management system TIMS (Translational Informatics Management System) has led to multiple collaborations and grants with both academic and commercial entities. Currently, the group manages the data for joint BII-PRISM projects (GEMINI – gastric cancer dataset, and ATTRaCT – clinical data analysis for heart failure patients), the metadata for the SG10K project (National Precision Medicine Programme), and the data for the CaLiBRe (Cancer Liquid Biopsy for Real-Time Diagnostics and Early Intervention) project. The work with clinical data motivated us to discuss the problem of access to patient data for biomedical research from three different perspectives: patients’, clinicians’ and researchers’.
Figure 2: Proposed pathway for the biosynthesis of BII-Rafflesfungin. The lipid part of the compound, β-hydroxy-γ-methyl hexadecanoic acid (HMHDA), is assembled by the BIIRfg_PKS cluster. We propose that the lipid moiety is released from the PKS module, activated to form an acyladenylate by orf-i, a predicted AMP-dependent ligase, and subsequently loaded onto the first PCP domain of the BIIRfg_NRPS gene to initiate peptide synthesis. The CT domain, the last condensation domain of BIIRfg_NRPS, terminates peptide synthesis and releases the cyclic lipodepsipeptide BII-Rafflesfungin.