Active Research Projects

Dr. Yaohang Li

Department of Computer Science

Old Dominion University


Novel Sampling Approaches in Protein Structure Modeling

Accurately modeling the structure of a protein or protein complex is a grand challenge with broad economic and scientific impact. One of the key obstacles is the absence of a reliable sampling method that can efficiently explore the tremendously large protein conformation space. This project investigates efficient sampling approaches that can lead to prediction of high-resolution protein structures with accuracy and reliability not currently achievable in computational protein modeling. The rationale is to integrate various physics- and knowledge-based scoring functions via multi-scoring functions sampling to explore the complex protein conformation space. The research work includes:
1. establishing computational models for multi-scoring functions sampling in protein structure modeling, with theoretically and mathematically rigorous justification;
2. designing novel sampling algorithms to efficiently explore the large protein conformation space;
3. applying the sampling algorithms to important protein modeling applications, including ab initio protein folding and protein-protein docking; and
4. developing a resource-efficient protein modeling programming paradigm.

The following figure shows the structure-score distribution of 849 non-dominated solutions obtained for an 11-residue loop in protein 153l (residues 154:164) using our sampling algorithm. Multi-scoring functions sampling has led to diversified, non-dominated solutions satisfying the three statistics- or physics-based scoring functions. Among these solutions, a solution cluster close to the native conformation (< 0.5 Å) emerges. As expected, the near-native solutions do not yield the minimum score under any single scoring function, but lie on the Pareto-optimal front of the multi-scoring function space.
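The non-dominated filtering behind this multi-objective view can be illustrated with a small sketch (not the project's actual algorithm). Here each row holds one conformation's scores under three hypothetical scoring functions, lower being better; a row is kept only if no other row is at least as good everywhere and strictly better somewhere:

```python
import numpy as np

def non_dominated(scores):
    """Return indices of non-dominated rows (lower is better in every column).

    A conformation is dominated if another one scores at least as well on
    every scoring function and strictly better on at least one.
    """
    keep = []
    for i in range(scores.shape[0]):
        others = np.delete(scores, i, axis=0)
        dominated = np.any(
            np.all(others <= scores[i], axis=1) &
            np.any(others < scores[i], axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical scores of four conformations under three scoring functions.
scores = np.array([
    [1.0, 2.0, 3.0],   # best on function 1
    [2.0, 1.0, 3.0],   # best on function 2
    [3.0, 3.0, 3.0],   # dominated by both rows above
    [0.5, 2.5, 3.5],   # a different trade-off; non-dominated
])
front = non_dominated(scores)   # indices on the Pareto-optimal front
```

Note that, as in the figure, no single member of the front minimizes all scoring functions at once; each represents a different trade-off among them.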

This work is funded under NSF Computing and Communication Foundations: CAREER: Novel Sampling Approaches for Protein Modeling Applications.


Monte Carlo Methods for Big Data Analysis

Recent years have witnessed a dramatic increase of data in many fields of science and engineering, driven by advances in sensors, mobile devices, biotechnology, digital communication, and internet applications. These massive, continuously growing, complex, diverse, distributed data sets are referred to as "big data." Big data touches every aspect of our lives. On one hand, big data provides a rich information source that enables us to gain important insights in various scientific and engineering domains at a scale and level never possible before; successfully addressing the big data challenge can lead to broad scientific and economic impacts. On the other hand, the growth of big data has outpaced our capability to process, analyze, and understand these datasets, and most traditional data processing approaches fail to scale to big data.

For extremely large data sets, statistical sampling is often the only viable approach. Our research seeks to develop novel Monte Carlo methods that can "smartly" and adaptively sample large data sets to rapidly extract important patterns and knowledge.

Recovering "ODU" with ~20% of Samples
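As a minimal illustration of single-pass sampling from a stream too large to store (not the project's own adaptive method), classic reservoir sampling draws a uniform random sample of k items in one pass without knowing the stream length in advance:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Draw a uniform random k-item sample from a stream of unknown length
    in a single pass (Algorithm R): item i replaces a random reservoir
    slot with probability k / (i + 1)."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # uniform in [0, i]
            if j < k:
                sample[j] = item          # replace with prob. k / (i + 1)
    return sample

# One pass over a million-item stream, keeping only 20 items in memory.
sample = reservoir_sample(range(10**6), 20, rng=random.Random(0))
```

The key property is that memory stays at k items regardless of how large the stream grows, which is what makes sampling viable when the data cannot be held at once.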

Monte Carlo Methods for Large-Scale Linear Algebra

Solving linear systems with large coefficient matrices underlies many big data applications. The Monte Carlo methods for linear systems, originally proposed by Ulam, von Neumann, and their colleagues in the 1950s, were once deemed inefficient compared to deterministic solvers by the numerical analysis community, but have regained attention. Random sampling in Monte Carlo methods offers natural solutions to many challenges in solving linear systems with large matrices. We recently revisited and analyzed the Ulam-von Neumann algorithm, clarified a 60-year-long-standing confusion in the literature about its convergence, and derived a necessary and sufficient convergence condition.

Our research interest is to obtain good approximations of very large-scale linear systems by developing novel Monte Carlo sampling methods that reduce the number of accesses to coefficient matrix elements, accelerate convergence, and even detect potential soft faults.
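The flavor of the Ulam-von Neumann approach can be sketched with absorbing random walks on a toy system x = Hx + b. This simplified version assumes H is elementwise nonnegative with every row sum below 1, so H's entries serve directly as transition probabilities and the walk weight stays 1; the general algorithm instead uses importance weights h_kl / p_kl:

```python
import random
import numpy as np

def uvn_component(H, b, i, n_walks=50000, rng=None):
    """Estimate x[i] for x = H @ x + b by absorbing random walks.

    From state k, move to state l with probability H[k][l] and stop
    (absorb) with the leftover probability 1 - sum(H[k]).  Each walk
    accumulates b at every visited state; the fixed-point identity
    E[X_i] = b_i + sum_j H[i][j] E[X_j] makes the average unbiased.
    """
    rng = rng or random.Random()
    n = len(b)
    total = 0.0
    for _ in range(n_walks):
        k = i
        while k is not None:
            total += b[k]
            r, acc, nxt = rng.random(), 0.0, None
            for l in range(n):
                acc += H[k][l]
                if r < acc:
                    nxt = l
                    break
            k = nxt           # nxt is None => walk absorbed
    return total / n_walks

# Toy 2x2 system with row sums 0.4 (well inside the convergence regime).
H = [[0.1, 0.3], [0.2, 0.2]]
b = [1.0, 2.0]
x_exact = np.linalg.solve(np.eye(2) - np.array(H), b)      # deterministic answer
x0_est = uvn_component(H, b, 0, rng=random.Random(1))      # Monte Carlo estimate
```

Note that each component estimate touches only the matrix rows actually visited by the walks, which hints at why such methods can reduce accesses to coefficient matrix elements on very large systems.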

This project is funded by the ODU Office of Research: Toward Solutions to Big Data Challenges in Multiple Disciplinary Applications.


Protein Loop Structure Prediction

Accurate protein loop structure modeling is important in structural biology for its wide applications, including determining surface loop regions in homology modeling, defining segments in NMR spectroscopy experiments, designing antibodies, and modeling ion channels. Our ultimate goal is to obtain computational loop models at experimental resolution.

Methods we developed include:
New Modeling Potential: Backbone Statistical Potential from Local Sequence-Structure Interactions in Protein Loops
New Sampling Method: Multi-scoring Sampling Methods on Distance-, Torsion-, and Physics-based Scoring Functions
New Decoy Ranking Method: Pareto-optimal Consensus Method

In nearly 70% of 87 long-loop benchmark targets (10-13 residues), the top-ranked decoys predicted by our current server are at subangstrom resolution. In more than 80% of these targets, at least one of the top-5-ranked decoys is at subangstrom resolution. The following figures show our prediction results for two protein loops.

1CNV(110:122)                                 1RCF(122:132)

We are collaborating with Dr. Eric Jakobsson and Ionel Rata of National Center for Supercomputing Applications (NCSA) at University of Illinois, Urbana-Champaign, in this project.

This work is funded under NSF Computing and Communication Foundations: SGER: A Novel Multi-Scoring Functions Sampling Approach to Improve Protein Modeling Resolution and Its Applications in Protein Loop Structure Prediction (CCF-0829382).


High Resolution ab initio Protein Folding

Successful ab initio protein structure prediction depends on three efforts: (1) formulating an accurate and sensitive scoring function that can lead the search process to the global minimum of the protein folding energy landscape; (2) devising efficient moves (conformation changes) toward the native conformation; and (3) developing a global optimization algorithm that can efficiently escape deep local minima and converge to the global energy minimum. Among these, building an accurate and sensitive scoring function is the most important. However, like many other computational biology problems, developing a sensitive and accurate scoring function is a very difficult, even formidable, job. In reality, even though many scoring functions based on various criteria, such as energy, statistics, secondary structure, loops, or contact pairs, have been proposed, there currently exists no reliable and general scoring function that can always drive a search to a native fold, and no reliable and general global optimization method under these scoring functions that can sample the conformation space adequately to guarantee a significant fraction of near-natives (< 3.0 Å RMSD from the experimental structure).
We seek to develop novel Monte Carlo approaches to address the insensitivity of existing protein folding scoring functions and to perform convergence analysis of the Monte Carlo sampling process. We expect to develop protein modeling software tools that improve the resolution of ab initio protein structure prediction. I am collaborating with Dr. Andrey Gorin at Oak Ridge National Laboratory and Dr. Charlie Strauss at Los Alamos National Laboratory.
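As a toy illustration of a Monte Carlo global optimizer escaping local minima (not the project's actual method or energy function), the following simulated-annealing sketch minimizes a rugged one-dimensional landscape: a quadratic bowl overlaid with a sinusoidal ripple that creates many local minima:

```python
import math
import random

def anneal(energy, x0, steps=20000, t0=2.0, t_min=1e-3, step=0.5, rng=None):
    """Metropolis Monte Carlo with geometric cooling.

    Uphill moves are accepted with probability exp(-dE / t), which lets
    the chain climb out of local minima while the temperature is high;
    the best state ever visited is tracked and returned.
    """
    rng = rng or random.Random()
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    cool = (t_min / t0) ** (1.0 / steps)   # geometric cooling factor
    for _ in range(steps):
        x_new = x + rng.uniform(-step, step)
        e_new = energy(x_new)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
        t *= cool
    return best_x, best_e

# Toy rugged landscape: bowl centered at x = 2 plus a ripple of local minima.
def toy_energy(x):
    return (x - 2.0) ** 2 + 0.3 * math.sin(8.0 * x)

x_best, e_best = anneal(toy_energy, x0=-3.0, rng=random.Random(0))
```

A plain greedy descent started at x0 = -3 would stall in the first ripple minimum it meets; the temperature schedule is what allows the chain to keep moving toward the global basin near x = 2.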

This work is partially supported by ORAU/ORNL Ralph E. Powe Junior Faculty Enhancement Award and ORAU/ORNL Summer Faculty Participation Program.


Sensor Grid

Many different types of sensors are employed to monitor the environmental attributes contributing to climate change. For example, seismic sensors are deployed to monitor seismic activity under the ocean, while temperature and sea-level sensors monitor changes in ocean temperature and sea level. While individual sensors provide some insight into ongoing events, it is very important to consider the signals from different sensors collectively when detecting climate change events. Patterns detected from individual sensors may look normal in isolation; however, the temporal relations among them across geospatially distributed sensors may better indicate an important class of global events that are not apparent from individual stream analysis. Mining individual stream data has been the subject of a large number of studies, but studies on mining spatio-temporal patterns across multivariate stream data are very limited. In this project, we intend to address the following three issues:
1. How can the heterogeneous, geographically distributed sensors within a sensor grid be located, accessed, filtered, and integrated for a particular study?
2. How can the large amount of data be collected and analyzed?
3. How can a multitude of analysis components, such as statistical, clustering, visualization, and classification tools, be applied to mine the sensor data?
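As a minimal sketch of detecting a temporal relation between two sensor streams (synthetic data, not the project's pipeline), one can scan a lagged normalized cross-correlation to recover the delay between correlated signals:

```python
import numpy as np

def best_lag(a, b, max_lag):
    """Scan normalized cross-correlation over lags in [-max_lag, max_lag]
    and return (lag, correlation) at the peak.  Under this convention a
    negative lag means stream b trails stream a."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    best = (0, -np.inf)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[lag:], b[:len(b) - lag]
        else:
            x, y = a[:len(a) + lag], b[-lag:]
        r = float(np.mean(x * y))
        if r > best[1]:
            best = (lag, r)
    return best

# Synthetic example: stream b is stream a delayed by 5 samples plus noise,
# standing in for two geographically separated sensors seeing one event.
rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = np.roll(a, 5) + 0.1 * rng.standard_normal(500)
lag, r = best_lag(a, b, max_lag=20)
```

Real sensor-grid mining must of course handle many heterogeneous streams at once, but a strong off-zero correlation peak like this one is the kind of cross-stream temporal relation that single-stream analysis cannot reveal.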
This project is supported by National Oceanic & Atmospheric Administration (NOAA) through the ISET center at North Carolina A&T State University. I am collaborating with Mark Govett at NOAA, Vincent Freeh at North Carolina State University, and Albert Esterline at NCAT on this project.
