Although bioinformatics has advanced considerably since its advent nearly half a century ago, many active areas of research in the field are likely to produce novel insights within the next few decades. Here are the top 20 research areas in computational biology:
- Rapid structural and topological clustering of proteins → protein-protein interactions in high-throughput experiments for stability
- Determining the structure of macromolecular assemblies and complexes
- Simulating realistic oligomeric systems over time
- Prediction of unknown molecular structures (e.g. protein folding problem)
- A platform for a complete analysis of genome-genome comparison
- Simulating genetic networks and their sensitivity to stoichiometric and kinetic interactions
- Building understanding of speciation from a molecular perspective
- Accurate biological structure prediction from minimal data
- A complete model that predicts RNA or alternative splicing from a primary transcript
- Designing small molecule inhibitors of proteins for targeted gene therapy
- Finding protein-protein, protein-RNA, and protein-DNA recognition codes
- Developing gene ontologies and an effective way to describe the functions of any gene or protein
- Complete construction of orthologous and paralogous groups of genes
- Integrating observations across multi-omic and multi-level data to discover environmental models for analyzing organismal interaction
- Building understanding of protein evolution with regards to function
- A method of rapidly assessing polymorphic genetic variations
- A predictive model of when transcription initiates and terminates
- Modeling signal transduction pathways to predict cellular response to external stimuli
- Simulating membrane structure and dynamic structure over time
- Effective methods to educate bioinformaticians within tertiary education
Rapid structural and topological clustering of proteins
Protein clustering is useful in bioinformatics analysis because it can reveal the hierarchical organization of protein interactions and support drug-target discovery. Current tools for analyzing proteins revolve around machine learning clustering algorithms. For instance, one paper discusses network-based clustering and similarity-based clustering as ways to reveal novel functional details about a protein. Existing tools in this space include RAFTS3G and kClust.
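As a rough illustration of similarity-based clustering, here is a minimal sketch that groups sequences by shared k-mers. It is not the algorithm used by RAFTS3G or kClust; the function names, the Jaccard measure, the threshold, and the toy sequences are all assumptions chosen for illustration.

```python
from itertools import combinations


def kmer_set(sequence, k=3):
    """Return the set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sequences(sequences, k=3, threshold=0.5):
    """Single-linkage clustering: link any pair whose k-mer Jaccard
    similarity is at least `threshold`, then return connected components."""
    names = list(sequences)
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {name: kmer_set(seq, k) for name, seq in sequences.items()}
    for a, b in combinations(names, 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy, made-up peptide sequences (hypothetical data, for illustration only).
toy = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "MKTAYIAKQRQISFVKAHFSRQ",
    "protC": "GSHMLEDPVDAFKLWNNRTGG",
}
clusters = cluster_sequences(toy, k=3, threshold=0.5)
```

Comparing every pair of sequences scales quadratically, which is one reason dedicated tools rely on faster filtering steps for large datasets.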
Determining the structure of macromolecular assemblies and complexes
A macromolecular complex is a stable assembly of two or more macromolecules, such as proteins and carbohydrates, whose components work together. This research area primarily focuses on particular subsystems and how their structure drives certain eukaryotic processes. Due to the ad-hoc nature of these complexes, no large-scale tools have been created, but targeted studies of specific proteins, such as actomyosin and actin, have been carried out.
Simulating realistic oligomeric systems over time
Oligomeric systems are molecules that consist of similar or repeating units, such as polypeptide chains. Simulating these systems lets bioinformaticians model macromolecule interactions as they unfold over time; however, progress is limited by the extensive computing capacity needed. One paper in the field analyzes Aβ peptides and simulates them at various experimental concentrations and timescales.
Prediction of unknown molecular structures
Although similar to Topic #2, prediction seeks the most probable state of a structure, while determination aims for an exact answer. Such problems include folding a protein from its primary or secondary structure while minimizing free energy. A prominent player in the space is DeepMind with its AlphaFold protein folding model.
A platform for a complete analysis of genome-genome comparison
Full genome-genome analysis tools are currently available, but require updating as new genomic analyses come out. Subtools within these platforms include read alignment, quantification, variant calling, trimming, quality control metrics, and evolutionary analysis, among others. Platforms that are currently available include geneCo and CoGe.
Simulating genetic networks and their sensitivity to stoichiometric and kinetic interactions
Network analysis is a new and evolving field; common biological networks include regulatory networks, metabolic networks, signaling networks, protein networks, and co-expression networks. Stochastic and probabilistic methods exist for traversing these structures based on predefined rules, but analyzing every node in the graph may take polynomial or higher time. One application of this work involves understanding how mutations propagating through regulatory networks affect oncogenesis.
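To make stochastic simulation and kinetic sensitivity a little more concrete, here is a minimal sketch using Gillespie's direct method on a single-gene birth-death model (constant production, first-order degradation). The model, the function names, and all rate values are illustrative assumptions, not taken from any particular study.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0=0, t_end=50.0, seed=0):
    """Simulate mRNA copy number under constant production (k_prod)
    and first-order degradation (k_deg * n) with Gillespie's direct method."""
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        rate_prod = k_prod
        rate_deg = k_deg * n
        total = rate_prod + rate_deg
        if total == 0:
            break  # no reaction can fire
        t += rng.expovariate(total)  # waiting time until the next reaction
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its rate.
        if rng.random() * total < rate_prod:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


def time_averaged_count(times, counts, t_end):
    """Time-weighted mean copy number over [0, t_end]."""
    total = 0.0
    for i, n in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += n * (t_next - times[i])
    return total / t_end


# Sensitivity check: same production rate, two hypothetical degradation rates.
# The deterministic steady state is k_prod / k_deg (100 versus 10 here).
slow_t, slow_n = gillespie_birth_death(k_prod=10.0, k_deg=0.1, t_end=200.0, seed=1)
fast_t, fast_n = gillespie_birth_death(k_prod=10.0, k_deg=1.0, t_end=200.0, seed=1)
avg_slow = time_averaged_count(slow_t, slow_n, 200.0)
avg_fast = time_averaged_count(fast_t, fast_n, 200.0)
```

Real regulatory networks couple many such reactions together, which is where the computational cost mentioned above starts to grow.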
Building understanding of speciation from a molecular perspective
Speciation is the formation of new species in the course of evolution. Although evolutionary history has been extensively mapped out, inferring speciation events from the evolution of genes will prove useful for discovering new adaptations; one paper uses inference methods to identify speciation events from a gene tree.
Accurate biological structure prediction from minimal data
Similar to Topic #4, we look at biological structure prediction, but in the context of minimal data, in which case probabilistic models must be used. Real-world data is often flawed or limited in size, so deriving inferences from such data will be useful in many clinical and research applications. One such application involves predicting residue-residue contacts in proteins from sequence data alone (code available on GitHub).
A complete model that predicts RNA or alternative splicing from a primary transcript
RNA splicing modifies an RNA transcript by removing introns and joining exons; locating these introns can help computational biologists determine the secondary and tertiary structure of the mature transcript starting from a pre-mRNA. Neural networks have been applied to splice junction prediction in an effort to discover intronic mutations that can cause disease.
Designing small molecule inhibitors of proteins for targeted gene therapy
Designing small molecule inhibitors has major implications in oncogenetics, since small molecules can be used to obstruct major enzymes that act as signals for further cancer cell development; the practice ties in with targeted gene therapy and requires bioinformatics to determine the optimal binding site on a protein. Research has already been done on designing small-molecule inhibitors for understanding and protecting against myosin dysfunction.
Finding protein-protein, protein-RNA, and protein-DNA recognition codes
Protein-DNA recognition codes refer to amino acid residues on a protein surface that show specificity for particular DNA sequences, and similar definitions hold true for the other combinations. Research has been done at a ‘chemical’ level, looking at amino acid side chains and DNA bases, and at a ‘stereochemical’ level, looking at DNA-binding motifs found in transcription factors. One paper combined these two sets of rules to create a unique recognition code for proteins.
Developing gene ontologies and an effective way to describe the functions of any gene or protein
Gene ontology refers to finding representations for genes and gene attributes across all species, and major headway has been made through the Gene Ontology knowledge base. It provides knowledge of the biological domain from a molecular standpoint (e.g. catalytic, transporter), a cellular standpoint (e.g. ribosomal), and a broad biological standpoint (e.g. DNA repair, signal transduction).
Complete construction of orthologous and paralogous groups of genes
Orthologous genes are genes that diverged after a speciation event (see Topic #7 for more about speciation), while paralogous genes diverged after a gene duplication event within a genome. Current methods of identifying orthologous genes involve phylogenetic analysis based on sequence conservation, which can prove computationally expensive. One paper presents newer tools for orthology identification, discussing how to detect remotely conserved orthologs and conduct accurate orthology inference, among other topics.
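A cheaper heuristic that is often used in place of full phylogenetic analysis is the reciprocal best hit: two genes from different species are treated as candidate orthologs if each is the other's top-scoring match. The sketch below assumes pairwise similarity scores have already been computed (for example, by a sequence alignment tool); the function names, gene names, and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit.
    `scores` has the shape {query: {subject: similarity_score}}."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between genes of species A and species B.
a_vs_b = {
    "a1": {"b1": 250, "b2": 90},
    "a2": {"b1": 120, "b2": 80},
    "a3": {"b3": 60},
}
b_vs_a = {
    "b1": {"a1": 245, "a2": 118},
    "b2": {"a1": 88, "a2": 79},
    "b3": {"a3": 61},
}
candidate_orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` finds `b1` as its best match, but `b1` prefers `a1`, so that pair is not reported; situations like this can indicate paralogs and are part of why tree-based methods remain in use.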
Integrating observations across multi-omic and multi-level data to discover environmental models for analyzing organismal interaction
A major challenge is that, even after each omic dataset has been analyzed separately, cross-omic inferences are still needed to draw large-scale conclusions about the organism. This field, called multiomics, looks across the genome, transcriptome, epigenome, metabolome, and microbiome to understand the mechanisms that underlie biological processes and functions. One review paper discusses the field with regard to data integration.
Building understanding of protein evolution with regards to function
Beyond the gene ontologies discussed in Topic #12, we now look at how evolution plays a role in the function of a protein, and how a protein’s structure changes over time as a result of evolution. Two approaches can be taken: a biophysics perspective (looking at changes in chemical structure) and a bioinformatics perspective (looking at changes in large-scale organismal data). A recent paper on the role of functional conservation in protein evolution found minimal changes in orthologous protein sequences over long periods of time.
A method of rapidly assessing polymorphic genetic variations
Current tools for understanding genetic variation rely on linkage disequilibrium for Mendelian disease and GWAS/eQTL studies for non-Mendelian disease. Current directions include finding other ways to analyze non-Mendelian disease, determining cross-correlations between various GWAS through two-sample Mendelian Randomization, and optimizing existing statistical tests to find conserved regions. One tool called MAPS takes image data to run a rapid assessment of genetic variants.
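Linkage disequilibrium between two biallelic loci is commonly summarized by the coefficient D and the squared correlation r². Here is a small sketch that computes both from haplotype counts; the function name and the allele labels (A/a at the first locus, B/b at the second) are assumptions made for illustration.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Compute D and r^2 for two biallelic loci from haplotype counts.

    D   = p_AB - p_A * p_B
    r^2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes observed")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    # r^2 is undefined when either locus is monomorphic.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical counts: alleles always travel together (complete LD) ...
d_linked, r2_linked = linkage_disequilibrium(50, 0, 0, 50)
# ... versus alleles that combine independently (no LD).
d_indep, r2_indep = linkage_disequilibrium(25, 25, 25, 25)
```

In practice, haplotype counts have to be estimated from genotype data first, a step this sketch leaves out.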
A predictive model of when transcription initiates and terminates
Genomic elements to analyze when looking at transcription include transcription start sites, promoter regions, and termination sites; other biological components include transcription inhibitors and RNA polymerase activation enzymes. Some tools in the space include ADAPT-CAGE, which analyzes capping sites, and TIPR, a sequence-based ML model that identifies gene transcription start sites.
Modeling signal transduction pathways to predict cellular response to external stimuli
Factors to look for include hydrophobicity, posttranslational modification, transmembrane structure, secondary/tertiary structure, signal peptide structure, subcellular localization, and physical/chemical properties. Some studies use Hidden Markov Models to analyze CpG islands, methylated sites, and ligand-receptor pairs. No tools currently exist for large-scale analysis, since results depend primarily on cell type.
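To show what a Hidden Markov Model for CpG islands can look like, here is a minimal two-state sketch decoded with the Viterbi algorithm. The state names and every probability below are hand-picked assumptions for illustration, not parameters fitted to real data, and the input is assumed to be an uppercase string over A, C, G, and T.

```python
import math

STATES = ("island", "background")
# Hypothetical, hand-picked parameters (not fitted to any dataset).
START = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(sequence):
    """Return the most probable state path for a DNA sequence (log-space)."""
    if not sequence:
        return []
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES
    }
    backpointers = []
    for base in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy sequence: an AT-rich stretch followed by a CG-rich stretch.
path = viterbi("ATATATAT" + "CGCGCGCGCG")
```

With these parameters, the decoded path labels the AT-rich prefix as background and the CG-rich suffix as island. Published models typically use more states and parameters estimated from annotated genomes.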
Simulating membrane structure and dynamic structure over time
As with the simulations discussed previously, membrane structure changes over time and involves thousands, if not millions, of biological components that must function together; thus, simulation proves computationally taxing. Nevertheless, biomembrane simulations are being developed to study behavior involving molecular composition and structure. One paper goes into detail about biological membranes and their various components.
Effective methods to educate bioinformaticians within tertiary education
There are many problems with the current system of bioinformatics education. First, there are numerous niche areas in bioinformatics, which require new educational programs to address them. Second, no single bioinformatics curriculum can teach everything there is to know about the subject. Third, bioinformatics content reaches relatively few undergraduate students and computer science majors. One paper goes into greater detail about these and other issues.