Active Research Projects

Dr. Yaohang Li

Department of Computer Science

Old Dominion University

 


Novel Sampling Approaches in Protein Structure Modeling

Accurately modeling protein and protein complex structures is considered a grand challenge with broad economic and scientific impact. One of the key obstacles is the absence of a reliable sampling method that can efficiently explore the tremendously large protein conformation space. This project investigates efficient sampling approaches that can lead to the prediction of high-resolution protein structures with accuracy and reliability not currently achievable in computational protein modeling. The rationale is to integrate various physics- and knowledge-based scoring functions via multi-scoring functions sampling to explore the complex protein conformation space. The research work includes 1) establishing computational models for multi-scoring functions sampling in protein structure modeling with theoretically and mathematically rigorous justification; 2) designing novel sampling algorithms to efficiently explore the large protein conformation space; 3) applying the sampling algorithms to important protein modeling applications, including ab initio protein folding and protein-protein docking; and 4) developing a resource-efficient protein modeling programming paradigm.

The following figure shows the structure-score distribution of 849 non-dominated solutions obtained for an 11-residue protein loop, 153l(154:164), using our sampling algorithm. Multi-scoring functions sampling yields diverse, non-dominated solutions under the three statistics- or physics-based scoring functions. Among these solutions, a cluster close to the native conformation (< 0.5 Å) emerges. As expected, the near-native solutions do not attain the minimum score in any single scoring function, but they lie on the Pareto-optimal front of the multi-scoring function space.
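To make the notion of a non-dominated solution concrete, here is a minimal sketch of a Pareto filter over score vectors, assuming lower scores are better. The function names and the score triples are hypothetical illustrations, not the project's software or data.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every score and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of score vectors that no other vector dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]

# Made-up score triples for five decoys (lower is better in each score).
decoys = [
    (1.0, 5.0, 3.0),  # best in score 1
    (4.0, 1.0, 3.5),  # best in score 2
    (5.0, 4.0, 1.0),  # best in score 3
    (2.0, 2.0, 2.0),  # best in no single score, yet not dominated
    (3.0, 3.0, 3.0),  # dominated by the decoy above
]
front = non_dominated(decoys)
print(front)
```

The fourth decoy illustrates the point made above: it is not the minimum of any individual score, yet it stays on the Pareto-optimal front. The pairwise comparison is quadratic in the number of decoys, which is acceptable for a sketch of this size.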

This work is funded under NSF Computing and Communication Foundations: CAREER: Novel Sampling Approaches for Protein Modeling Applications, CCF-0845702.
National Science Foundation

 

Protein Loop Structure Prediction

Accurate protein loop structure modeling is important in structural biology because of its wide applications, including determining the surface loop regions in homology modeling, defining segments in NMR spectroscopy experiments, designing antibodies, and modeling ion channels. Our ultimate goal is to obtain computational loop models at experimental resolution.

Methods we developed include:
New Modeling Potential: Backbone Statistical Potential from Local Sequence-Structure Interactions in Protein Loops
New Sampling Method: Multi-scoring Sampling Methods on Distance-, Torsion-, and Physics-based Scoring Functions
New Decoy Ranking Method: Pareto-optimal Consensus Method

In nearly 70% of 87 long-loop benchmark targets (10 to 13 residues), the top-ranked decoys predicted by our current server are within subangstrom resolution. In more than 80% of these targets, at least one of the top five ranked decoys is within subangstrom resolution. The following figures show our prediction results for two protein loops.

Figure panels: 1CNV(110:122) and 1RCF(122:132)

We are collaborating on this project with Dr. Eric Jakobsson and Ionel Rata of the National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign.

This work is funded under NSF Computing and Communication Foundations: SGER: A Novel Multi-Scoring Functions Sampling Approach to Improve Protein Modeling Resolution and Its Applications in Protein Loop Structure Prediction, CCF-0829382.
National Science Foundation

 

High Resolution ab initio Protein Folding

Successful ab initio protein structure prediction depends on three efforts: (1) formulating an accurate and sensitive scoring function that can lead the search process to the global minimum in the protein folding energy landscape; (2) devising efficient moves (conformation changes) toward the native conformation; and (3) developing a global optimization algorithm that can efficiently escape from deep local minima and converge to the global energy minimum. Among these three efforts, building an accurate and sensitive scoring function is the most important. However, as with many other computational biology problems, developing a sensitive and accurate scoring function is a difficult, even formidable, task. Although many scoring functions based on various criteria, such as energy, statistics, secondary structure, loops, or contact pairs, have been proposed, there is currently no reliable and general scoring function that can always drive a search to a native fold, and there is no reliable and general global optimization method under these scoring functions that can sample the conformation space adequately to guarantee a significant fraction of near-natives (< 3.0 Å RMSD from the experimental structure).
We seek to develop a novel Monte Carlo approach that addresses the insensitivity of existing protein folding scoring functions, and to perform convergence analysis of the Monte Carlo sampling process. We expect to develop protein modeling software tools that improve the resolution of ab initio protein structure prediction. I am collaborating with Dr. Andrey Gorin at Oak Ridge National Laboratory and Dr. Charlie Strauss at Los Alamos National Laboratory.

This work is partially supported by the ORAU/ORNL Ralph E. Powe Junior Faculty Enhancement Award and the ORAU/ORNL Summer Faculty Participation Program.
Oak Ridge Associated Universities.

 

Sensor Grid

Many different types of sensors are employed to monitor the environmental attributes that contribute to climate change. For example, seismic sensors are deployed to monitor seismic activity under the ocean, while temperature sensors and sea level sensors monitor changes in ocean temperature and sea level. While individual sensors provide some insight into ongoing events, it is very important to consider the signals from different sensors collectively when detecting climate change events. Patterns detected by individual sensors may look normal in isolation. However, the temporal relations among them across geospatially distributed sensors may better indicate an important class of global events that is not apparent from the analysis of individual streams. Mining individual stream data has been the subject of a large number of studies. However, studies on mining spatio-temporal patterns across multivariate stream data are very limited. In this project, we intend to address the following three issues:
1. How can the heterogeneous, geographically distributed sensors within a sensor grid be located, accessed, filtered, and integrated for a particular study?
2. How can the large amount of data be collected and analyzed?
3. How can a multitude of analysis components, such as statistical, clustering, visualization, and classification tools, be applied to mine the sensor data?
I am collaborating with Mark Govett at NOAA, Vincent Freeh at North Carolina State University, and Albert Esterline at North Carolina A&T State University (NCAT) on this project.

This work is partially supported by the National Oceanic & Atmospheric Administration (NOAA) through the ISET Center at North Carolina A&T State University.
 

 

Markov Chain Monte Carlo

Collaborating with Drs. Andrey Gorin and Vladimir Protopopescu at Oak Ridge National Laboratory, we have developed two new stochastic global optimization/sampling methods.
The first is called Accelerated Simulated Tempering (AST), which aims to accelerate the simulated tempering scheme by executing random walks on a temperature ladder with various transition step sizes. By suitably choosing the length of the transition steps, the accelerated scheme enables the search process to execute large jumps and escape entrapment in local minima, while retaining the capability to explore local details whenever warranted. Our simulations confirm the expected improvements and show that the accelerated simulated tempering scheme converges to the target distribution much faster than Geyer and Thompson’s simulated tempering algorithm and exhibits accuracy comparable to the simulated annealing method.
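As a rough illustration of a random walk on a temperature ladder with mixed step sizes, the following sketch uses the standard simulated tempering acceptance rule with ladder jumps of one or two rungs. It is not the AST algorithm itself: the toy energy function, the ladder, the untuned weights, and the step-size set are all assumptions chosen for illustration.

```python
import math
import random

def temperature_move(level, energy, betas, log_weights, step_sizes, rng):
    """One Metropolis move on the temperature ladder; returns the new level."""
    new = level + rng.choice(step_sizes) * rng.choice((-1, 1))
    if new < 0 or new >= len(betas):
        return level  # reject off-ladder proposals, which keeps the proposal symmetric
    # Log acceptance ratio for the joint density w_i * exp(-beta_i * E(x))
    log_ratio = -(betas[new] - betas[level]) * energy + (log_weights[new] - log_weights[level])
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return new
    return level

def energy(x):
    return (x * x - 1.0) ** 2  # toy double-well landscape (assumption)

rng = random.Random(0)
betas = [0.5, 1.0, 2.0, 4.0, 8.0]   # inverse temperatures, hot to cold (assumption)
log_weights = [0.0] * len(betas)    # untuned weights (assumption)
x, level = 1.0, 0
visits = [0] * len(betas)
for _ in range(2000):
    # Configuration move at the current temperature
    y = x + rng.uniform(-0.5, 0.5)
    delta = betas[level] * (energy(y) - energy(x))
    if delta <= 0 or rng.random() < math.exp(-delta):
        x = y
    # Ladder move that may jump one or two rungs at a time
    level = temperature_move(level, energy(x), betas, log_weights, (1, 2), rng)
    visits[level] += 1
```

In practice the weights would need tuning so that all ladder levels are visited reasonably often; with the flat weights used here the walk is expected to spend most of its time at the colder levels.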
The second method is a population-based Markov Chain Monte Carlo approach, the so-called hybrid PT/SA scheme, which combines the Parallel Tempering (PT) and Simulated Annealing (SA) methods and is suitable for large-scale parallel computing systems. Within the hybrid PT/SA scheme, a composite system with multiple conformations evolves in parallel on a temperature ladder with various transition step sizes. The SA process uses a cooling scheme to lead the temperature values in the temperature ladder down to the target temperature. The PT scheme is employed to reduce the equilibration relaxation time of the composite system at a particular temperature ladder configuration in the SA process. The hybrid PT/SA method can reduce the waiting time in deep local minima and thus samples high-dimensional, complicated objective function landscapes more efficiently. Our preliminary results confirm the expected performance improvements of the hybrid PT/SA method over PT and parallel SA configured with the same temperature ladder, transition step sizes, and (for parallel SA) cooling scheme.
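The two ingredients of such a hybrid can be sketched as follows: the standard parallel tempering acceptance probability for exchanging two replicas, and a cooling step that pulls every rung of the ladder toward the target temperature. The geometric cooling rule and the function names are assumptions made for illustration, not the scheme's actual schedule.

```python
import math

def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Standard parallel tempering acceptance for exchanging replicas i and j."""
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def cool_ladder(temps, target, rate=0.9):
    """Shrink each temperature's excess over the target by a constant factor (assumed schedule)."""
    return [target + (t - target) * rate for t in temps]

# A hot replica (beta = 1) holding the lower energy always swaps with the colder one.
p_always = swap_probability(1.0, 2.0, 1.0, 3.0)
# The reverse situation is accepted only with probability exp(-2).
p_rare = swap_probability(1.0, 2.0, 3.0, 1.0)
# One cooling step with a made-up ladder and target temperature 1.0.
ladder = cool_ladder([4.0, 2.0, 1.0], target=1.0, rate=0.5)
```

Between cooling steps, replica exchanges at the current ladder configuration let conformations trapped at low temperature escape through the hotter rungs, which is the role the PT scheme plays in the description above.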

 

Grid-based Monte Carlo and Quasi-Monte Carlo

Monte Carlo applications are widely perceived as computationally intensive but naturally parallel. Therefore, they can be executed effectively on the grid using the dynamic bag-of-work model. We improve the efficiency of the subtask-scheduling scheme by using an N-out-of-M strategy, and we develop a Monte Carlo-specific lightweight checkpoint technique, which improves the performance of Monte Carlo grid computing. We also enhance the trustworthiness of Monte Carlo grid-computing applications by utilizing the statistical nature of Monte Carlo and by cryptographically validating intermediate results using the random number generator already in use in the Monte Carlo application. Together, these techniques lead to a high-performance grid-computing infrastructure that is capable of providing trustworthy Monte Carlo computation services. These techniques can also be extended to quasi-Monte Carlo applications. Dr. Michael Mascagni is my key collaborator in this project.
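The N-out-of-M idea can be sketched on a single machine: launch M independent subtasks and combine the first N that finish, so that slow or failed workers do not hold up the result. The example below uses a thread pool as a stand-in for grid nodes and a toy estimate of pi as the Monte Carlo job; the function names, seeds, and sample counts are assumptions for illustration only.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def pi_subtask(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples) if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits, samples

def n_out_of_m(n, m, samples_per_task):
    """Launch m independent subtasks and combine the first n that finish."""
    hits = total = done = 0
    with ThreadPoolExecutor(max_workers=m) as pool:
        futures = [pool.submit(pi_subtask, seed, samples_per_task) for seed in range(m)]
        for fut in as_completed(futures):
            h, s = fut.result()
            hits, total, done = hits + h, total + s, done + 1
            if done == n:
                for other in futures:
                    other.cancel()  # only stops subtasks that have not started yet
                break
    return 4.0 * hits / total, done

estimate, used = n_out_of_m(n=8, m=10, samples_per_task=20000)
print(f"pi is roughly {estimate:.3f} from {used} of 10 subtasks")
```

Because every subtask draws the same number of independent samples, discarding the slowest ones does not bias the estimate; it only reduces the total sample count from M to N subtasks' worth.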

 

Biologically-inspired Methods

Biological systems are remarkably adaptive and robust in complex real-world environments. For this reason, in this project we are exploring biologically-inspired approaches to system adaptation, fault tolerance, and reconfiguration. Our approach is both theory and application driven. On the theoretical side, we will explore the phenomena of self-adaptation and reconfiguration/organization in natural biological systems, and then develop a theoretical framework for self-reconfigurable systems. On the application side, we will design and evaluate biomimetic mechanisms and algorithms for future metamorphic autonomous systems. This project is funded by the National Science Foundation RISE program. More information can be found at the cooperative research center website.