At SMLC, we are broadly interested in machine learning methods that further our understanding of proteins and structural biology, and that automate the collection and analysis of data from instruments at NYSBC. These areas touch on many aspects of machine learning and computer vision, including image segmentation, object detection, object reconstruction, few-shot and semi-supervised learning, geometric deep learning, generative modeling and language modeling, and invariant and disentangled representation learning.
Machine learning for cryoEM/cryoET
We see three major areas where machine learning is poised to change the way we approach cryoEM and cryoET:
1) Automation of data collection. ML can enable automated microscope targeting and collection by combining segmentation of low- and medium-magnification images with online learning that chooses targets to optimize data quality and collection speed. Increasing throughput and lowering costs will make cryoEM more accessible to the wider biology community and drive new discoveries.
2) Improved analysis. Unlocking new insights from existing and future data is critical for asking deeper biological questions and for understanding protein structural biology more fully. Better pre-trained object detection models, deep generative models with interpretable latent variables, and automatic micrograph and tomogram segmentation through unsupervised representation learning are core ML developments that will improve our ability to understand biology through cryoEM. Data efficiency can also be improved by incorporating strong data-driven priors.
3) Integration with external data sources. The cryoEM pipeline conventionally operates in a silo and does not integrate directly with other information sources until final atomic models are produced. ML offers the potential to directly integrate other information sources as rich prior knowledge for improving cryoEM data analysis.
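As an illustration of the online learning idea in item 1, target selection could be framed as a multi-armed bandit: candidate targets are grouped (for example by a segmentation-derived class), and the collector learns which groups tend to yield good data. The sketch below uses Thompson sampling with a Beta-Bernoulli model. It is a minimal example under stated assumptions, not a description of any deployed pipeline: the class `BetaBernoulliSelector`, the function `simulate`, the grouping of targets, and the binary "good exposure" signal are all hypothetical.

```python
import random


class BetaBernoulliSelector:
    """Thompson sampling over hypothetical groups of candidate targets."""

    def __init__(self, n_groups, seed=0):
        # Uniform Beta(1, 1) prior on each group's chance of a good exposure.
        self.successes = [1.0] * n_groups
        self.failures = [1.0] * n_groups
        self.rng = random.Random(seed)

    def choose(self):
        # Draw one plausible quality per group and pick the best draw.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, group, good):
        # Record the observed outcome for the group that was imaged.
        if good:
            self.successes[group] += 1.0
        else:
            self.failures[group] += 1.0


def simulate(true_quality, n_rounds=2000, seed=0):
    """Simulate collection; true_quality is an assumed per-group success rate."""
    selector = BetaBernoulliSelector(len(true_quality), seed=seed)
    env = random.Random(seed + 1)  # stands in for the microscope outcome
    counts = [0] * len(true_quality)
    for _ in range(n_rounds):
        group = selector.choose()
        good = env.random() < true_quality[group]
        selector.update(group, good)
        counts[group] += 1
    return counts


if __name__ == "__main__":
    # Three hypothetical target groups with different success rates.
    print(simulate([0.2, 0.5, 0.8]))
```

In this toy setting the selector concentrates its exposures on the group with the highest assumed success rate while still occasionally sampling the others. A real system would replace the binary outcome with a measured quality score and would also need to account for collection time.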
Machine learning for protein structure and function
Language models are a powerful new development for understanding and making predictions about biological sequences. Increasing model size and compute power continue to deliver improved performance. However, proteins have intrinsic properties that are not naturally encoded by existing deep learning models. Augmenting unsupervised representations with protein-specific properties such as structure and function offers one route, already successful, toward richer representations and novel biology. We are interested in purpose-built models with natural inductive biases suited to the physical nature of proteins.