Software
Our lab develops research software for analyzing large-scale biological datasets. Lab members developed or contributed to the tools below. Also see our GitHub organization.
METL: Mutational Effect Transfer Learning
Mutational Effect Transfer Learning is a protein language model that unites advanced machine learning and biophysical modeling. It learns representations of protein sequences based on large-scale molecular simulations of the relationships between protein sequence, structure, and energetics. Those representations are powerful for subsequent protein engineering tasks. Sam Gelman led the development.
References: Gelman et al. 2024
- METL on GitHub
- METL Zenodo archive
- METL simulations on GitHub
- METL simulations Zenodo archive
- METL pretrained models on GitHub
- METL pretrained models Zenodo archive
- METL resources on GitHub
- METL resources Zenodo archive
Manubot: Manuscripts, open and automated
Manubot is a workflow for writing and distributing scholarly manuscripts. It uses GitHub to coordinate large-scale collaborative writing and automates many aspects of the writing process, such as citation. Daniel Himmelstein led the development with many collaborators.
References: Himmelstein et al. 2019, Rando et al. 2021, Manubot website
ChemLML: Chemical Language Model Linker
ChemLML is a text-based model that generates chemicals with desired properties by combining pretrained large language models with molecule representation models. Yifan Deng led the development.
References: Deng et al. 2024
nn4dms: Neural networks for deep mutational scanning data
The neural networks for deep mutational scanning data project is a deep learning framework for learning protein sequence-function relationships. The software supports retraining the models described in our manuscript, training models on new sequence-function examples, or using the pre-trained models to predict function scores for new sequence variants. Sam Gelman led the development.
References: Gelman et al. 2021
MPAC: Multi-omic Pathway Analysis of Cancer
MPAC learns activity signatures of entities in biological pathways from multi-omic data, specifically DNA copy number alterations and gene expression. Peng Liu led the development.
References: Liu et al. 2024
- MPAC on Bioconductor
- Shiny app on GitHub
- Shiny app demo
- MPAC Zenodo archive
- Shiny app Zenodo archive
SSPS: Sparse Signaling Pathway Sampling
SSPS learns signaling pathway structures from time series protein phosphorylation data. Its statistical model is implemented in the Gen probabilistic programming language. David Merrell led the development.
References: Merrell and Gitter 2020
SINGE: Single-cell Inference of Networks using Granger Ensembles
SINGE adopts Granger Causality to reconstruct transcriptional regulatory networks from pseudotime-ordered single-cell RNA-seq data. It uses a specialized form of Granger Causality to smooth the irregularly-spaced expression data and builds ensembles of many individual candidate networks. Atul Deshpande led the development.
References: Deshpande et al. 2022
TPS: Temporal Pathway Synthesizer
TPS uses protein-protein interactions and time series phosphorylation data to infer signaling pathway structures. It synthesizes (generates) candidate pathways that are consistent with logical constraints. For instance, a protein activated late in a stimulus response cannot control a protein activated earlier in the response. In addition, all proteins in the pathway must be connected to the source(s) of stimulation. Ali Köksal led the development in collaboration with several other groups.
References: Köksal et al. 2018
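The two logical constraints described for TPS can be illustrated with a small check on a candidate pathway. This is only a sketch and not the TPS implementation: the function names, the toy proteins, and the activation times are hypothetical, and each protein is assumed to have a single known activation time.

```python
from collections import deque


def timing_violations(edges, activation_time):
    """Return directed edges whose regulator is activated later than its target."""
    return [
        (u, v)
        for u, v in edges
        if u in activation_time
        and v in activation_time
        and activation_time[u] > activation_time[v]
    ]


def unreachable_from_sources(edges, sources, proteins):
    """Return proteins that no source can reach along directed edges."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
    seen = set(sources)
    queue = deque(sources)
    while queue:  # breadth-first search from the stimulation sources
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(set(proteins) - seen)


# Hypothetical toy data: minutes after stimulation at which each protein responds.
activation_time = {"EGFR": 0, "A": 2, "B": 5, "C": 10}
proteins = ["EGFR", "A", "B", "C"]
candidate = [("EGFR", "A"), ("A", "B"), ("C", "B")]

bad_edges = timing_violations(candidate, activation_time)
disconnected = unreachable_from_sources(candidate, ["EGFR"], proteins)
print(bad_edges, disconnected)
```

In this toy candidate, the edge from C to B breaks the timing constraint, and C is not connected to the source, so the candidate would be rejected on both counts.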
LPWC: Lag Penalized Weighted Correlation
LPWC is a clustering algorithm specialized for time series data. Unlike general clustering methods, it detects related temporal patterns that occur at similar times even if they are not perfectly synchronized. Thevaa Chandereng led the development.
References: Chandereng and Gitter 2020
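The idea of rewarding similar temporal patterns while penalizing large time shifts can be sketched as follows. This is a simplified illustration, not the LPWC package's code: it assumes evenly spaced time points, uses an ordinary Pearson correlation, and the exponential penalty with the `penalty` parameter is an assumption made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def lag_penalized_score(x, y, max_lag=2, penalty=0.5):
    """Return the best penalized correlation and the lag that achieves it.

    A positive lag means y trails x by that many time points.
    """
    best_score, best_lag = -math.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[: len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[: len(y) + lag]
        # Down-weight the correlation as the shift grows.
        score = math.exp(-penalty * lag ** 2) * pearson(xs, ys)
        if score > best_score:
            best_score, best_lag = score, lag
    return best_score, best_lag


# Hypothetical profiles: y repeats the pulse in x one time point later.
x = [0, 1, 4, 1, 0, 0]
y = [0, 0, 1, 4, 1, 0]
score, lag = lag_penalized_score(x, y)
print(score, lag)
```

The resulting pairwise scores could then feed a standard clustering routine, so that profiles with the same shape and a small delay end up in the same cluster.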
ML4Bio: Machine Learning for Biologists
ML4Bio is a Python package used to introduce machine learning concepts to a biology audience in a workshop format. It focuses on classification and wraps the scikit-learn Python package. The workshop includes example datasets and guides to the machine learning pipeline and different classifiers. Chris Magnano, Fangzhou Mu, and Milica Cvetkovic led the development.
References: Magnano et al. 2022
Omics Integrator
Omics Integrator is a suite of tools for integrating and building network models from multiple types of omic data (transcriptomic, epigenomic, proteomic, genomic, etc.). The Garnet module combines epigenomic and transcriptomic data to determine which transcription factors are relevant in a biological condition. The Forest module uses the prize-collecting Steiner forest (PCSF) algorithm to connect proteins of interest in a protein-protein interaction network, which may optionally include transcription factors from Garnet. The software was primarily developed by Ernest Fraenkel's lab.
References: Tuncbag et al. 2016
SDREM: Signaling and Dynamic Regulatory Events Miner
SDREM reconstructs the signaling pathways and transcriptional regulatory networks that cells use to respond to external stimuli. It takes as input time series gene expression data following stimulation, a list of proteins that initially detect the stimulation, and optional prior knowledge about the relevance of other proteins on the signaling pathway. These condition-specific data are combined with generic protein-protein interactions and protein-DNA interactions (e.g. from ChIP-seq, ChIP-chip, or inferred from DNA binding motifs). The resulting model predicts which transcription factors control the response, when they are active, and how they are activated by upstream signaling pathways. The Gitter et al. 2015 reference below is a step-by-step guide for using the SDREM software (PDF available upon request). MT-SDREM is a multi-task learning extension of SDREM that jointly models multiple responses. The software was developed with Ziv Bar-Joseph's lab.
References: Gitter et al. 2013a, Gitter et al. 2013b, Jain et al. 2014, and Gitter et al. 2015
MEO: Maximum Edge Orientation
MEO orients an undirected graph by finding the edge directions that maximize the high-confidence connections between a set of source nodes and a set of target nodes. This approach can be used to find signaling pathways embedded in a protein-protein interaction network given their starting points (e.g. receptors) and end points (e.g. transcription factors). The software was developed with Ziv Bar-Joseph's lab.
References: Gitter et al. 2011
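The edge orientation problem can be made concrete with a brute-force sketch on a tiny graph. This is not the MEO software, which must handle networks far too large to enumerate. The scoring rule here is an assumption for illustration: each directed source-to-target path of at most `max_len` edges contributes the product of its edge confidences, and the toy network and weights are hypothetical.

```python
from itertools import product


def path_score(directed, weights, sources, targets, max_len):
    """Sum the confidence of all simple source-to-target paths with <= max_len edges."""
    successors = {}
    for u, v in directed:
        successors.setdefault(u, []).append(v)
    total = 0.0

    def extend(node, visited, weight, length):
        nonlocal total
        if node in targets:
            total += weight  # a path ends at the first target it reaches
            return
        if length == max_len:
            return
        for nxt in successors.get(node, []):
            if nxt not in visited:
                edge_weight = weights[frozenset((node, nxt))]
                extend(nxt, visited | {nxt}, weight * edge_weight, length + 1)

    for source in sources:
        extend(source, {source}, 1.0, 0)
    return total


def best_orientation(undirected, weights, sources, targets, max_len=3):
    """Try every orientation of the edges and keep the highest-scoring one."""
    best_score, best_edges = -1.0, None
    for flips in product([False, True], repeat=len(undirected)):
        directed = [(v, u) if flip else (u, v)
                    for (u, v), flip in zip(undirected, flips)]
        score = path_score(directed, weights, sources, targets, max_len)
        if score > best_score:
            best_score, best_edges = score, directed
    return best_score, best_edges


# Hypothetical network: receptor S, transcription factor T, intermediates A and B.
undirected = [("S", "A"), ("A", "T"), ("S", "B"), ("B", "T"), ("A", "B")]
confidences = [0.9, 0.8, 0.5, 0.4, 0.7]
weights = {frozenset(e): w for e, w in zip(undirected, confidences)}

score, oriented = best_orientation(undirected, weights, {"S"}, {"T"})
print(score, oriented)
```

Enumerating all orientations doubles in cost with every added edge, so this only works for toy examples; it is meant to show what is being maximized, not how to maximize it at scale.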
DREM 2.0: Dynamic Regulatory Events Miner
DREM identifies the transcription factors that drive temporal changes in gene expression by predicting which regulators cause groups of genes that are co-expressed up until a particular time point to diverge. It integrates time series gene expression data with protein-DNA interactions (e.g. from ChIP-seq, ChIP-chip, or inferred from DNA binding motifs). DREM 2.0 extends the original DREM by supporting protein-DNA interactions that change over time, incorporating motif finding, and improving the visualization. The software was developed with Jason Ernst and Ziv Bar-Joseph's labs.
References: Ernst et al. 2007 and Schulz et al. 2012
Multi-PCSF: Multi-Sample Prize-Collecting Steiner Forest
Multi-PCSF extends the PCSF algorithm to jointly model multiple samples or patients. PCSF combines scores on proteins with a weighted protein-protein interaction network to identify low cost connections between high-scoring proteins. The multi-sample extension learns networks for all samples simultaneously, constraining the networks to be similar for different samples. The software was developed with Ernest Fraenkel's lab.
References: Gitter et al. 2014
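The trade-off that PCSF makes between connecting high-scoring proteins and paying for interactions can be sketched by scoring a few candidate subnetworks. This is a simplified single-sample illustration, not the Multi-PCSF or Omics Integrator code: the objective form (edge costs plus `beta` times the prizes left out), the `beta` parameter name, and the toy network are assumptions for the example, and no optimization is performed.

```python
def pcsf_objective(selected_edges, edge_cost, prize, beta=1.0):
    """Cost of the chosen edges plus a penalty for every prize left out."""
    included = {node for edge in selected_edges for node in edge}
    cost = sum(edge_cost[frozenset(edge)] for edge in selected_edges)
    missed = sum(p for node, p in prize.items() if node not in included)
    return cost + beta * missed


# Hypothetical toy network: X has no prize but links the two high-prize proteins.
prize = {"A": 5.0, "B": 4.0, "C": 0.5}
edge_cost = {
    frozenset(("A", "X")): 1.0,
    frozenset(("X", "B")): 1.0,
    frozenset(("B", "C")): 2.0,
}

candidates = {
    "empty": [],
    "A-X-B": [("A", "X"), ("X", "B")],
    "A-X-B-C": [("A", "X"), ("X", "B"), ("B", "C")],
}
scores = {name: pcsf_objective(edges, edge_cost, prize)
          for name, edges in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy case the lowest objective keeps A and B connected through the unscored protein X and leaves out C, whose small prize does not justify its edge cost. A multi-sample version would add a term that rewards similar networks across samples.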