Run the GRAVITY Pipeline¶
The pancreatic endocrinogenesis CSV is the real demo dataset used by the smoke test and tutorials. It is the preprocessed input file linked from cellDancer’s pancreatic endocrinogenesis case study:
https://guangyuwanglab2021.github.io/cellDancer_website/notebooks/case_study_pancreas.html
Download the CSV archive from:
https://drive.google.com/file/d/16hV9t66edOgjCmoBuEfekS3ijtL1fYNc/view?usp=sharing
Save it as data/PancreaticEndocrinogenesis_cell_type_u_s.csv, then run the
two-stage pipeline via PipelineConfig and run_pipeline. The example below
uses the same pancreas reference layout as the smoke test.
The bundled prior network archives are stored under prior_data/. The pancreas
demo uses the NicheNet-derived mouse network, prior_data/nichenet_mouse.zip;
for human datasets, switch the same config field to
prior_data/nichenet_human.zip. The archives follow the prior-network
processing described by CEFCON: start from NicheNet’s integrated gene
interaction network, remove cell-cell ligand-receptor interactions, use the
unweighted integrated network, and represent undirected edges as bidirectional
directed edges.
from gravity import PipelineConfig, run_pipeline
cfg = PipelineConfig(
raw_counts="data/PancreaticEndocrinogenesis_cell_type_u_s.csv",
workdir="gravity_outputs_pancreas",
prior_network="prior_data/nichenet_mouse.zip",
gene_order_path="data/pancreas/reference_checkpoints/pancreas_genes.txt",
stage1_pretrained_checkpoint="data/pancreas/reference_checkpoints/pancreas_stage1.ckpt",
stage2_pretrained_checkpoint="data/pancreas/reference_checkpoints/pancreas_stage2.ckpt",
accelerator="gpu",
devices=1,
batch_size=16,
stage1_lr=1e-6,
stage2_lr=1e-4,
make_plot=True,
plot_genes=["GCG", "INS2"],
stage1_epochs=6,
stage2_epochs=4,
)
outputs = run_pipeline(cfg)
print(outputs)
The expected output is a dictionary containing paths to combine.csv, stage
checkpoints, future_positions.npy, stage CSV files, attention exports, and
velocity plots for the selected genes. The scripted smoke test uses the same
layout:
python gravity/smoke_test.py
Expected runtime depends on the hardware and epoch settings. The complete
pancreas demo is intended for a CUDA workstation and typically runs on the
order of tens of minutes. A CPU-only desktop is not the recommended target for
the full demo; for a desktop sanity check, reduce GRAVITY_STAGE1_EPOCHS and
GRAVITY_STAGE2_EPOCHS.
Key tips:
Use unique
workdirnames per experiment to avoid overwriting checkpoints.Set
devicesandstrategyto match your hardware. Start withdevices=1; usedevices=[0,1],strategy="ddp"only for multi-GPU runs.Reduce
batch_sizeor providegene_subsetwhen GPU memory is limited.Pass
gene_order_pathwhen using pretrained/reference checkpoints; checkpoint tensors are aligned by gene index, not only by gene name.For checkpoint-based pancreas reproduction, pass both
stage1_pretrained_checkpointandstage2_pretrained_checkpoint; this runs inference/export from the provided weights instead of retraining.The unsupervised and contrastive objectives are learning-rate sensitive. For reference-style runs, use
stage1_lr < 1e-5and tunestage2_lrwithin1e-3to1e-5.
The resulting dictionary contains paths to combine.csv, stage checkpoints, future_positions.npy, and attention exports.
For pancreatic endocrinogenesis examples, pretrained reference weights are
available in data/pancreas/reference_checkpoints/. The matching large
reference exports are named pancreas_stage1_reference.csv and
pancreas_stage2_reference.csv; keep them outside git or distribute them
separately. Use
gene_order_path="data/pancreas/reference_checkpoints/pancreas_genes.txt"
together with the two pretrained checkpoint fields for provided pancreas
checkpoint reproduction.
Background links:
NicheNet: https://www.nature.com/articles/s41592-019-0667-5
CEFCON: https://www.nature.com/articles/s41467-023-44103-3
cellDancer input layout and pancreas demo: https://www.nature.com/articles/s41587-023-01728-5