Gitter Lab - Software

Software

Our lab develops research software for analyzing large-scale biological datasets. Lab members developed or contributed to the tools below. Also see our GitHub organization.

METL: Mutational Effect Transfer Learning

Mutational Effect Transfer Learning is a protein language model that unites advanced machine learning and biophysical modeling. It learns representations of protein sequences based on large-scale molecular simulations of the relationships between protein sequence, structure, and energetics. Those representations are powerful for subsequent protein engineering tasks. Sam Gelman led the development.

References: Gelman et al. 2024

Manubot: Manuscripts, open and automated

Manubot is a workflow for writing and distributing scholarly manuscripts. It uses GitHub to coordinate large-scale collaborative writing and automates many aspects of the writing process, such as citation. Daniel Himmelstein led the development with many collaborators.

References: Himmelstein et al. 2019, Rando et al. 2021, Manubot website

Assay2Mol

Assay2Mol uses a large language model to generate chemicals by retrieving relevant biochemical assays from PubChem to use as context. Yifan Deng led the development.

References: Deng et al. 2025

ChemLML: Chemical Language Model Linker

ChemLML is a text-based model for generating chemicals with desired properties. It combines pretrained large language models with molecule representation models. Yifan Deng led the development.

References: Deng et al. 2024

nn4dms: Neural networks for deep mutational scanning data

The neural networks for deep mutational scanning data project is a deep learning framework for learning protein sequence-function relationships. The software supports retraining the models described in our manuscript, training models on new sequence-function examples, or using the pre-trained models to predict function scores for new sequence variants. Sam Gelman led the development.

References: Gelman et al. 2021

MPAC: Multi-omic Pathway Analysis of Cancer

MPAC learns activity signatures of entities in biological pathways from multi-omic data, specifically DNA copy number alterations and gene expression. Peng Liu led the development.

References: Liu et al. 2024

SSPS: Sparse Signaling Pathway Sampling

SSPS learns signaling pathway structures from time series protein phosphorylation data. Its statistical model is implemented in the Gen probabilistic programming language. David Merrell led the development.

References: Merrell and Gitter 2020

SINGE: Single-cell Inference of Networks using Granger Ensembles

SINGE adopts Granger Causality to reconstruct transcriptional regulatory networks from pseudotime-ordered single-cell RNA-seq data. It uses a specialized form of Granger Causality to smooth the irregularly spaced expression data and builds ensembles of many individual candidate networks. Atul Deshpande led the development.

References: Deshpande et al. 2022

TPS: Temporal Pathway Synthesizer

TPS uses protein-protein interactions and time series phosphorylation data to infer signaling pathway structures. It synthesizes (generates) candidate pathways that are consistent with logical constraints. For instance, a protein activated late in a stimulus response cannot control a protein activated earlier in the response. In addition, all proteins in the pathway must be connected to the source(s) of stimulation. Ali Köksal led the development in collaboration with several other groups.
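
The two constraints described above can be illustrated with a minimal Python sketch. This is not the TPS implementation or its API: the function `is_consistent`, the protein names, and the activation times are hypothetical, and "connected to the source(s)" is assumed here to mean reachable from a source along directed edges.

```python
from collections import deque


def is_consistent(edges, activation_time, sources):
    """Check a candidate directed pathway against two illustrative constraints.

    edges: list of (regulator, target) pairs (hypothetical candidate pathway)
    activation_time: dict mapping each protein to its first activation time
    sources: proteins that directly receive the stimulation
    """
    # Constraint 1: a protein activated later cannot control one activated earlier.
    for regulator, target in edges:
        if activation_time[regulator] > activation_time[target]:
            return False

    # Constraint 2 (assumed meaning): every protein is reachable from a source.
    adjacency = {}
    for regulator, target in edges:
        adjacency.setdefault(regulator, []).append(target)
    reached = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in reached:
                reached.add(neighbor)
                queue.append(neighbor)
    return all(protein in reached for protein in activation_time)


# Hypothetical example: a source S and two downstream proteins.
times = {"S": 0, "P1": 2, "P2": 5}
print(is_consistent([("S", "P1"), ("P1", "P2")], times, ["S"]))
print(is_consistent([("S", "P2"), ("P2", "P1")], times, ["S"]))
```

In this toy example the first candidate satisfies both constraints, while the second is rejected because P2 is activated after P1 yet is placed upstream of it.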

References: Köksal et al. 2018

LPWC: Lag Penalized Weighted Correlation

LPWC is a clustering algorithm specialized for time series data. Unlike general clustering methods, it detects related temporal patterns that occur at similar times even if they are not perfectly synchronized. Thevaa Chandereng led the development.
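
The idea of rewarding similar patterns while discouraging large time shifts can be sketched in a few lines of Python. This is an illustration only, not the LPWC package or its exact scoring function: the Gaussian-style penalty `exp(-penalty * lag**2)`, the function names, and the parameter defaults are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def lag_penalized_correlation(x, y, max_lag=2, penalty=0.5):
    """Return (best_score, best_lag) over integer lags in [-max_lag, max_lag].

    A positive lag means y is delayed relative to x. The penalty term is an
    assumed form chosen for illustration; it shrinks scores at larger lags.
    """
    best_score, best_lag = -math.inf, 0
    n = len(x)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: n - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: n + lag]
        score = math.exp(-penalty * lag ** 2) * pearson(a, b)
        if score > best_score:
            best_score, best_lag = score, lag
    return best_score, best_lag


# Hypothetical profiles: y shows the same pulse as x, one time point later.
x = [0, 1, 4, 1, 0, 0, 0]
y = [0, 0, 1, 4, 1, 0, 0]
score, lag = lag_penalized_correlation(x, y)
print(lag, round(score, 3))
```

A clustering method could then use such penalized scores as a similarity measure between time series, so that slightly shifted profiles still group together.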

References: Chandereng and Gitter 2020

ML4Bio: Machine Learning for Biologists

ML4Bio is a Python package used to introduce machine learning concepts to a biology audience in a workshop format. It focuses on classification and wraps the scikit-learn Python package. The workshop includes example datasets and guides to the machine learning pipeline and different classifiers. Chris Magnano, Fangzhou Mu, and Milica Cvetkovic led the development.

References: Magnano et al. 2022

Omics Integrator

Omics Integrator is a suite of tools for integrating and building network models from multiple types of omic data (transcriptomic, epigenomic, proteomic, genomic, etc.). The Garnet module combines epigenomic and transcriptomic data to determine which transcription factors are relevant in a biological condition. The Forest module uses the prize-collecting Steiner forest (PCSF) algorithm to connect proteins of interest in a protein-protein interaction network, which may optionally include transcription factors from Garnet. The software was primarily developed by Ernest Fraenkel's lab.

References: Tuncbag et al. 2016

SDREM: Signaling and Dynamic Regulatory Events Miner

SDREM reconstructs the signaling pathways and transcriptional regulatory networks that cells use to respond to external stimuli. It takes as input time series gene expression data following stimulation, a list of proteins that initially detect the stimulation, and optional prior knowledge about the relevance of other proteins on the signaling pathway. These condition-specific data are combined with generic protein-protein interactions and protein-DNA interactions (e.g. from ChIP-seq, ChIP-chip, or inferred from DNA binding motifs). The resulting model predicts which transcription factors control the response, when they are active, and how they are activated by upstream signaling pathways. The Gitter et al. 2015 reference below is a step-by-step guide for using the SDREM software (PDF available upon request). MT-SDREM is a multi-task learning extension of SDREM that jointly models multiple responses. The software was developed with Ziv Bar-Joseph's lab.

References: Gitter et al. 2013a, Gitter et al. 2013b, Jain et al. 2014, and Gitter et al. 2015

MEO: Maximum Edge Orientation

MEO orients an undirected graph by finding the edge directions that maximize the high-confidence connections between a set of source nodes and a set of target nodes. This approach can be used to find signaling pathways embedded in a protein-protein interaction network given their starting points (e.g. receptors) and end points (e.g. transcription factors). The software was developed with Ziv Bar-Joseph's lab.
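
To make the orientation problem concrete, here is a brute-force Python sketch on a tiny graph. It is a deliberate simplification and not the MEO algorithm or software: it ignores edge confidence and path length, and simply counts how many source-target pairs are joined by a directed path. The function names and node labels are hypothetical.

```python
from itertools import product


def reachable(start, directed_edges):
    """Return the set of nodes reachable from start along directed edges."""
    adjacency = {}
    for u, v in directed_edges:
        adjacency.setdefault(u, []).append(v)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def best_orientation(undirected_edges, sources, targets):
    """Try every orientation and keep one that connects the most source-target pairs.

    Exhaustive search over 2**len(undirected_edges) orientations, so this is
    only feasible for toy graphs.
    """
    best_score, best_edges = -1, None
    for flips in product([False, True], repeat=len(undirected_edges)):
        directed = [
            (v, u) if flip else (u, v)
            for (u, v), flip in zip(undirected_edges, flips)
        ]
        score = sum(
            1 for s in sources for t in targets if t in reachable(s, directed)
        )
        if score > best_score:
            best_score, best_edges = score, directed
    return best_score, best_edges


# Hypothetical network: two receptors (S1, S2), one intermediate (A), one target (T1).
edges = [("A", "S1"), ("S2", "A"), ("T1", "A")]
score, oriented = best_orientation(edges, sources=["S1", "S2"], targets=["T1"])
print(score, sorted(oriented))
```

Exhaustive enumeration grows exponentially with the number of edges, which is why a realistic protein-protein interaction network requires a dedicated optimization method rather than this kind of search.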

References: Gitter et al. 2011

DREM 2.0: Dynamic Regulatory Events Miner

DREM identifies the transcription factors that drive temporal changes in gene expression by predicting which regulators cause groups of genes that are co-expressed up until a particular time point to diverge. It integrates time series gene expression data with protein-DNA interactions (e.g. from ChIP-seq, ChIP-chip, or inferred from DNA binding motifs). DREM 2.0 extends the original DREM by supporting protein-DNA interactions that change over time, incorporating motif finding, and improving the visualization. The software was developed with Jason Ernst and Ziv Bar-Joseph's labs.

References: Ernst et al. 2007 and Schulz et al. 2012

Multi-PCSF: Multi-Sample Prize-Collecting Steiner Forest

Multi-PCSF extends the PCSF algorithm to jointly model multiple samples or patients. PCSF combines scores on proteins with a weighted protein-protein interaction network to identify low cost connections between high-scoring proteins. The multi-sample extension learns networks for all samples simultaneously, constraining the networks to be similar for different samples. The software was developed with Ernest Fraenkel's lab.
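
The trade-off between protein scores (prizes) and edge costs can be shown with a short Python sketch of a classic prize-collecting objective: pay the cost of every selected edge plus the prize of every protein left out. This is a simplified single-sample illustration, not the Multi-PCSF or Omics Integrator objective, which may include additional terms. The function name, the `beta` weight, and the toy network are assumptions.

```python
def pcsf_objective(prizes, edge_costs, selected_edges, beta=1.0):
    """Score a candidate subnetwork; lower is better.

    prizes: dict mapping protein -> non-negative prize (score)
    edge_costs: dict mapping (protein, protein) -> cost of using that interaction
    selected_edges: interactions included in the candidate subnetwork
    beta: assumed weight on the prizes of proteins that are left out
    """
    covered = {node for edge in selected_edges for node in edge}
    missed_prize = sum(p for node, p in prizes.items() if node not in covered)
    total_cost = sum(edge_costs[edge] for edge in selected_edges)
    return beta * missed_prize + total_cost


# Hypothetical toy network: A and C score highly, B is an unscored connector.
prizes = {"A": 5, "B": 0, "C": 4, "D": 1}
edge_costs = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 3}

print(pcsf_objective(prizes, edge_costs, []))
print(pcsf_objective(prizes, edge_costs, [("A", "B"), ("B", "C")]))
print(pcsf_objective(prizes, edge_costs, list(edge_costs)))
```

In this toy case, connecting A and C through the low-cost connector B scores better than selecting nothing or including the expensive edge to the low-prize protein D.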

References: Gitter et al. 2014