Convert AnnData (.h5ad) to GRAVITY CSV

GRAVITY expects a cellDancer-style long-format CSV containing spliced and unspliced counts, embedding coordinates, and optional cluster labels. Run gravity.export_intermediate_from_h5ad once per dataset to produce this CSV.

from gravity import export_intermediate_from_h5ad

export_intermediate_from_h5ad(
    input_h5ad="data/postprocessed.h5ad",
    output_csv="data/PancreaticEndocrinogenesis_cell_type_u_s.csv",
    retain_genes=["GCG", "INS2"],
    n_top_genes=1000,
    embed_key="X_umap",
    celltype_key="celltype",
    overwrite=True,
)

The helper performs three steps:

  1. Reads the AnnData file and checks that the spliced and unspliced layers exist.

  2. Runs the preprocessing steps needed to populate the spliced/unspliced count table.

  3. Exports a cellDancer-style CSV with embedding coordinates and optional cluster labels.
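
To make the long-format layout concrete, here is a minimal standard-library sketch that flattens small in-memory matrices into one row per (gene, cell) pair. It is not GRAVITY's exporter. The column names follow the cellDancer input convention, and it is an assumption that export_intermediate_from_h5ad emits exactly these names. The helper names to_long_rows and write_long_csv are hypothetical.

```python
import csv
import io

# Column names follow the cellDancer input convention; whether GRAVITY's
# exporter emits exactly these names is an assumption.
COLUMNS = ["cellIndex", "gene_name", "unsplice", "splice",
           "cellID", "clusters", "embedding1", "embedding2"]


def to_long_rows(cell_ids, genes, unspliced, spliced, embedding, clusters=None):
    """Yield one row per (gene, cell) pair.

    unspliced and spliced are cells x genes nested lists, embedding holds one
    (x, y) pair per cell, and clusters is an optional list of labels per cell.
    """
    for g, gene in enumerate(genes):
        for c, cell_id in enumerate(cell_ids):
            yield {
                "cellIndex": c,
                "gene_name": gene,
                "unsplice": unspliced[c][g],
                "splice": spliced[c][g],
                "cellID": cell_id,
                "clusters": clusters[c] if clusters is not None else "",
                "embedding1": embedding[c][0],
                "embedding2": embedding[c][1],
            }


def write_long_csv(handle, rows):
    """Write the rows with a header line in the fixed column order."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Toy data: two cells, two genes (values are made up for illustration).
cell_ids = ["cell_a", "cell_b"]
genes = ["GCG", "INS2"]
unspliced = [[0.1, 0.0], [0.3, 0.2]]
spliced = [[1.0, 0.5], [0.8, 0.0]]
embedding = [(1.5, -2.0), (0.2, 3.1)]
clusters = ["Alpha", "Beta"]

buffer = io.StringIO()
write_long_csv(buffer, to_long_rows(cell_ids, genes, unspliced, spliced,
                                    embedding, clusters))
long_csv_text = buffer.getvalue()
```

Each gene contributes one block of rows covering every cell, so the file grows as genes x cells. This is why the export is worth running only once per dataset.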

Keep large generated CSV files out of git, and document their expected paths so that later pipeline runs can reuse them instead of repeating the AnnData preprocessing.
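
One way to reuse an existing export is to check for the file before calling the exporter. The sketch below is only an illustration of that pattern: ensure_csv is a hypothetical helper, not part of GRAVITY, and it treats any non-empty file at the documented path as valid without inspecting its contents.

```python
from pathlib import Path
from typing import Callable


def ensure_csv(output_csv: str, export: Callable[[str], None]) -> bool:
    """Run export(output_csv) only when the file is missing or empty.

    Returns True if the export ran, False if an existing file was reused.
    """
    path = Path(output_csv)
    if path.is_file() and path.stat().st_size > 0:
        return False  # reuse the cached CSV
    path.parent.mkdir(parents=True, exist_ok=True)
    export(str(path))
    return True
```

In practice, export would be a small wrapper that calls export_intermediate_from_h5ad with the arguments shown earlier and the given output path. If the export settings change, delete the cached CSV so that it is regenerated.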

Gene Order

By default, GRAVITY preserves the gene order found in the exported CSV. This is fine for training a new model, but a pretrained checkpoint requires the same gene index order that was used during the run that produced it. If you plan to reuse a checkpoint, keep the checkpoint’s genes.txt file and pass it later as gene_order_path when running the pipeline.

For the provided pancreas reference checkpoints, use:

gene_order_path = "data/pancreas/reference_checkpoints/pancreas_genes.txt"

The gene set alone is not sufficient for checkpoint reproduction; the order also matters because model weights and attention matrices are indexed by gene position.
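
The difference between matching the gene set and matching the gene order can be checked before loading a checkpoint. The following sketch rests on two assumptions: that the gene order file lists one gene name per line, and that the CSV's gene order is the order in which genes first appear in its gene column. All three function names are hypothetical and not part of GRAVITY.

```python
def read_gene_order(path):
    """Read one gene name per line, skipping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def first_appearance_order(gene_column):
    """Collapse a long-format gene column to unique genes in first-seen order."""
    return list(dict.fromkeys(gene_column))


def compare_gene_order(csv_genes, checkpoint_genes):
    """Classify how the CSV gene order relates to the checkpoint's order."""
    if list(csv_genes) == list(checkpoint_genes):
        return "match"
    if set(csv_genes) == set(checkpoint_genes):
        return "same_set_different_order"
    return "different_genes"


# Toy example: same three genes, but the CSV lists INS2 before GCG.
checkpoint_genes = ["GCG", "INS2", "SST"]
exported_column = ["INS2", "INS2", "GCG", "GCG", "SST", "SST"]
status = compare_gene_order(first_appearance_order(exported_column),
                            checkpoint_genes)
```

Here status is "same_set_different_order". This is the case where position-indexed weights would silently line up with the wrong genes unless gene_order_path is supplied.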