Tutorial

The deepcell-types model predicts cell types from multiplexed spatial proteomic images. There are three main inputs to the cell type prediction pipeline:

  1. The multiplexed image in channels-first format,

  2. A whole-cell segmentation mask for the image, and

  3. A mapping of the channel index to the marker expression name.

Each of these components will be covered in further detail in this tutorial.
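As a rough orientation before working with real data, the sketch below builds small synthetic stand-ins for the three inputs and checks that they agree with one another. The array sizes and marker names are made up for illustration; only the layout (channels-first image, 2D integer label mask, one name per channel) is taken from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Multiplexed image, channels-first: (channels, height, width)
img = rng.integers(0, 2**16, size=(3, 64, 64), dtype=np.uint16)

# 2. Whole-cell segmentation mask: (height, width), where 0 is background
#    and each positive integer identifies one cell
mask = np.zeros((64, 64), dtype=np.uint32)
mask[5:15, 5:15] = 1
mask[30:45, 20:40] = 2

# 3. Channel index -> marker name: position i names img[i]
chnames = ["DAPI", "CD45", "CD3"]  # hypothetical marker names

# Consistency checks between the three inputs
assert img.ndim == 3
assert mask.shape == img.shape[1:]
assert len(chnames) == img.shape[0]
```

The same three checks are worth running on real data before starting a long inference job.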

Example datasets

This tutorial makes use of the spatial proteomic data available on the HuBMAP data portal. Users are encouraged to explore the portal for data of interest. For convenience, a subset of the publicly available spatial proteomic data has been converted to a remote zarr archive. The datasets in the zarr archive reflect the original HuBMAP indexing scheme (i.e. HBM###_????_###, where # indicates a number and ? indicates an upper-case alphabetical character).
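The naming scheme described above can be written as a regular expression, which may be handy for filtering keys when looping over the archive. This is a minimal sketch; the helper name `is_hubmap_key` is hypothetical and not part of any library.

```python
import re

# HBM###_????_### : three digits, four upper-case letters, three digits
HUBMAP_KEY = re.compile(r"HBM[0-9]{3}_[A-Z]{4}_[0-9]{3}")


def is_hubmap_key(name: str) -> bool:
    """Return True if `name` follows the archive's dataset naming scheme."""
    return HUBMAP_KEY.fullmatch(name) is not None


print(is_hubmap_key("HBM994_PDJN_987"))
print(is_hubmap_key("HBM994_pdjn_987"))
```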

Interacting with the zarr mirror of the HuBMAP data requires a few additional dependencies:

pip install "zarr>2" s3fs rich

Note

The HuBMAP data mirror uses zarr format v3 and therefore requires zarr>2 to be installed.

Exploring the archive

import zarr

z = zarr.open_group(
    store="s3://deepcelltypes-demo-datasets/hubmap.zarr",
    mode="r",
    storage_options={
        "anon": True,
        "client_kwargs": dict(region_name="us-east-1"),
    },
)

High-level structure of the data archive:

z.tree()
/
├── HBM267_BZKT_867
│   ├── image (19, 9072, 9408) uint16
│   └── segmentations
│       ├── cellsam (9072, 9408) uint32
│       └── torch_mesmer (9072, 9408) uint32
├── HBM685_PCCJ_427
│   ├── image (54, 9510, 9993) uint16
│   └── segmentations
│       ├── cellsam (9510, 9993) uint32
│       └── torch_mesmer (9510, 9993) uint32
└── HBM994_PDJN_987
    ├── image (37, 2048, 2048) int16
    └── segmentations
        ├── cellsam (2048, 2048) uint32
        └── torch_mesmer (2048, 2048) uint32

A more detailed look at the datasets:

import pandas as pd  # for nice html rendering

summary = pd.DataFrame.from_dict(
    {k:
        {
            "tissue": z[k].attrs["tissue"],
            "technology": z[k].attrs["modality"],
            "Num ch.": z[k]["image"].shape[0],
            "shape": z[k]["image"].shape[1:],
        }
        for k in z.group_keys()
    },
    orient="index",
)

summary.sort_index()
                   tissue     technology  Num ch.  shape
HBM267_BZKT_867    spleen     codex       19       (9072, 9408)
HBM685_PCCJ_427    intestine  codex       54       (9510, 9993)
HBM994_PDJN_987    uterus     mibi        37       (2048, 2048)

In the interest of minimizing network bandwidth, we’ll use the HBM994_PDJN_987 dataset to demonstrate the deepcell-types inference pipeline.

k = "HBM994_PDJN_987"
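To see why this dataset is the lightest choice, the sketch below estimates the uncompressed in-memory size of each image from the shapes and dtypes in the archive listing. Note the assumption: this is the size after loading into memory, and the number of bytes actually transferred depends on how the archive is chunked and compressed, which is not covered here.

```python
import math

import numpy as np

# (shape, dtype) of each image array, copied from the archive listing above
images = {
    "HBM267_BZKT_867": ((19, 9072, 9408), "uint16"),
    "HBM685_PCCJ_427": ((54, 9510, 9993), "uint16"),
    "HBM994_PDJN_987": ((37, 2048, 2048), "int16"),
}


def uncompressed_nbytes(shape, dtype):
    """Number of bytes needed to hold the full array in memory."""
    return math.prod(shape) * np.dtype(dtype).itemsize


sizes = {name: uncompressed_nbytes(*spec) for name, spec in images.items()}

# Print from smallest to largest
for name, nbytes in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name}: {nbytes / 2**30:.2f} GiB")

smallest = min(sizes, key=sizes.get)
```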

Dataset anatomy

As noted above, the cell-type prediction pipeline requires the multiplexed image, the channel-name mapping, and a segmentation mask for the image. The multiplexed image is stored in the image array for each dataset, and the channel mapping is stored under the key "channels" in the image metadata. Note that these two inputs are derived directly from the corresponding datasets on the HuBMAP data portal.

ds = z[k]
img = ds["image"][:]  # Load data into memory
chnames = ds["image"].attrs.get("channels")

# Sanity check: ensure that channel name list is the same size as the number of
# channels in the image
len(chnames) == img.shape[0]
True

Another useful piece of metadata (when available) is the pixel size of the image, in microns per pixel. While not strictly required, this can improve predictions by reducing variability in image scaling. This information is stored in the image metadata.

mpp = ds["image"].attrs["mpp"]
mpp
0.6
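For intuition, the pixel size links pixel coordinates to physical units. The short sketch below only illustrates that conversion for this dataset; it does not show how deepcell-types uses `mpp` internally, and the 10-micron cell diameter is a hypothetical value chosen for the example.

```python
mpp = 0.6  # microns per pixel, as read from the image metadata
height_px, width_px = 2048, 2048  # spatial shape of HBM994_PDJN_987

# Physical extent of the field of view
height_um = height_px * mpp
width_um = width_px * mpp
print(f"Field of view: {height_um:.1f} x {width_um:.1f} microns")

# How many pixels a cell of a given physical size spans at this resolution
cell_diameter_um = 10.0  # hypothetical cell size, for illustration only
cell_diameter_px = cell_diameter_um / mpp
print(f"A {cell_diameter_um:.0f} micron cell spans about {cell_diameter_px:.1f} pixels")
```

The same cell would span a different number of pixels in an image acquired at another resolution, which is the kind of scale variability the pixel size helps account for.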

Running the cell-type prediction pipeline

Note

Both cellSAM and deepcell-types models can in principle be run on CPUs, but it is strongly recommended that users make use of GPU-capable machines when running cell segmentation/cell-type prediction workflows.

The final input is a segmentation mask. deepcell-types has been intentionally designed for flexibility on this front to better integrate into existing spatial-omics workflows. For convenience, two pre-computed segmentation masks are stored in the data archive: one computed by Mesmer (available at ds["segmentations/torch_mesmer"]) and one computed by CellSAM (available at ds["segmentations/cellsam"]).

In this tutorial, we will demonstrate how to use one of these models to construct a full cell-type inference pipeline.

Cell segmentation with cellSAM

In order to use cellSAM, it must be installed in the environment, e.g.

pip install git+https://github.com/vanvalenlab/cellSAM.git

import numpy as np
from cellSAM.cellsam_pipeline import cellsam_pipeline

For convenience, channels corresponding to nuclear markers and a whole-cell marker are stored in the dataset metadata.

Note

Nuclear markers are typically unambiguous. The whole-cell channel selection on the other hand is less well-defined. Users are encouraged to try different channels or combinations of channels for improved whole-cell segmentation results. The membrane_channel selection in the metadata is arbitrary and provided for convenience.

# Extract channels for segmentation
nuc, mem = ds.attrs["nuclear_channel"], ds.attrs["membrane_channel"]
im = np.stack(
    [img[chnames.index(nuc)], img[chnames.index(mem)]],
    axis=-1,
).squeeze()
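Following the note above about trying combinations of channels, one plausible way to build a composite whole-cell channel is to normalize each candidate channel and take a pixel-wise maximum. This is a sketch of one option rather than a recommendation from the cellSAM authors; the helper name `combine_channels` and the toy marker names are hypothetical.

```python
import numpy as np


def combine_channels(img, chnames, names):
    """Max-project several min-max-normalized channels into one 2D image."""
    planes = []
    for name in names:
        ch = img[chnames.index(name)].astype(np.float32)
        lo, hi = ch.min(), ch.max()
        # Guard against constant channels to avoid division by zero
        planes.append((ch - lo) / (hi - lo) if hi > lo else np.zeros_like(ch))
    return np.max(planes, axis=0)


# Toy data standing in for a real multiplexed image
rng = np.random.default_rng(0)
toy_img = rng.integers(0, 1000, size=(3, 32, 32)).astype(np.uint16)
toy_names = ["H3", "CD45", "ECAD"]  # marker names used only for illustration

mem = combine_channels(toy_img, toy_names, ["CD45", "ECAD"])
print(mem.shape, mem.dtype)
```

The result is a float image scaled to [0, 1], so the nuclear channel would need the same normalization before the two are stacked together.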

CellSAM expects multiplexed data in a particular format. See the cellsam docs for details.

# Format for cellsam
seg_img = np.zeros((*im.shape[:-1], 3), dtype=im.dtype)
seg_img[..., 1:] = im

Finally, run the segmentation pipeline:

mask = cellsam_pipeline(
    seg_img,
    block_size=512,
    low_contrast_enhancement=False,
    use_wsi=True,
    gauge_cell_size=False,
)
Total blocks: 16
16it [00:30,  1.92s/it]
100%|█████████████████████████████████████████████████| 24/24 [00:00<00:00, 492.26it/s]

# Sanity check: the segmentation mask should have the same (H, W) dimensions
# as the input image
mask.shape == img.shape[1:]
True

Let’s perform a bit of post-processing to ensure that the segmentation mask (represented as a label image) is sequential.

import skimage

mask, _, _ = skimage.segmentation.relabel_sequential(mask)
mask = mask.astype(np.uint32)
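To make the effect of sequential relabeling concrete, here is a NumPy-only sketch of the same idea on a toy label image. It is an illustration of the concept, not the scikit-image implementation, and it assumes non-negative integer labels with 0 as background.

```python
import numpy as np


def relabel_sequential_sketch(labels):
    """Map the nonzero labels of `labels` onto 1..N, keeping 0 as background."""
    values = np.unique(labels)
    values = values[values != 0]  # nonzero labels, sorted ascending
    # Lookup table from old label value to new sequential label
    lut = np.zeros(int(labels.max()) + 1, dtype=np.uint32)
    lut[values] = np.arange(1, len(values) + 1, dtype=np.uint32)
    return lut[labels]


# Toy mask with gaps in the label numbering (3, 7, 12)
toy = np.array([[0, 3, 3], [7, 7, 0], [0, 12, 12]])
relabeled = relabel_sequential_sketch(toy)
print(relabeled)
```

After relabeling, the largest label equals the number of cells, which is what allows predictions to be matched to cells by position later on.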

Visualizing results

Note

Multiplexed images and their analysis products are extremely information dense; users are strongly encouraged to run the tutorials locally and use napari for interactive visualization.

import napari
nim = napari.Viewer(show=True)  # Opens the viewer window; use show=False for headless runs (e.g. CI)

# Compute contrast limits
cl = [(np.min(ch), np.max(ch)) for ch in img]

# Visualize multiplex image
nim.add_image(img, channel_axis=0, name=chnames, contrast_limits=cl);

# Add segmentation mask
mask_lyr = nim.add_labels(mask, name="CellSAM segmentation")
mask_lyr.contour = 3  # Relatively thick borders for static viz
# For static rendering - can safely be ignored if running notebook interactively
from pathlib import Path

screenshot_path = Path("../_static/_generated")
screenshot_path.mkdir(parents=True, exist_ok=True)
nim.screenshot(
    path=screenshot_path / "napari_img_and_segmentation.png",
    canvas_only=False,
);
Napari window of multiplexed image and computed segmentation mask
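The min/max contrast limits used above can be dominated by a handful of very bright pixels. A common alternative is to clip at percentiles instead. The sketch below compares the two on synthetic data; the 1st and 99.9th percentiles are arbitrary choices for illustration and would need tuning per dataset.

```python
import numpy as np

# Synthetic two-channel image with one saturated outlier pixel in channel 0
rng = np.random.default_rng(0)
toy_img = rng.integers(0, 100, size=(2, 64, 64)).astype(np.uint16)
toy_img[0, 0, 0] = 60000

# Min/max limits: stretched by the single outlier
cl_minmax = [(int(ch.min()), int(ch.max())) for ch in toy_img]

# Percentile limits: largely insensitive to isolated extreme values
cl_robust = [
    tuple(float(v) for v in np.percentile(ch, (1, 99.9))) for ch in toy_img
]

print("min/max:   ", cl_minmax[0])
print("percentile:", cl_robust[0])
```

A list like `cl_robust` has the same structure as `cl` above (one `(low, high)` pair per channel).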

Cell-type inference with deepcell-types

We now have all the necessary components to run the cell-type inference pipeline.

import deepcell_types

To run the inference pipeline, you will need to download a trained model. See Models for details.

# Model & system-specific configuration
model = "deepcell-types_2025-06-09"

# NOTE: if you do not have a cuda-capable GPU, try "cpu"
device = "cuda:0"
# NOTE: For machines with many cores & large RAM (e.g. GPU nodes), consider
# increasing for better performance.
num_data_loader_threads = 1
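Rather than hard-coding these values, the device and worker count can be picked at runtime. The sketch below is one way to do so, assuming PyTorch is the backend (as the device string suggests); the cap of 4 workers is an arbitrary illustrative choice, not a recommendation from the deepcell-types authors.

```python
import os

try:
    import torch

    # Use the first CUDA device when one is visible, otherwise the CPU
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
except ImportError:
    # torch is not installed in this environment; fall back to the CPU
    device = "cpu"

# Use a few data-loader workers on multi-core machines (cap chosen arbitrarily)
num_data_loader_threads = min(4, os.cpu_count() or 1)

print(device, num_data_loader_threads)
```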

With the system all configured, we can now run the pipeline:

cell_types = deepcell_types.predict(
    img,
    mask,
    chnames,
    mpp,
    model_name=model,
    device_num=device,
    num_workers=num_data_loader_threads,
)
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel CD14 is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel CD80 is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel DCSIGN is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel ECAD is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel FOXP3 is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel GALECTIN9 is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel GRB is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel H3 is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel HLAG is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel HO1 is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel LCK is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel TIGIT is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel TIM3 is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel TRYPTASE is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel VIM is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/deepcell_types/dataset.py:42: UserWarning: Channel INOS is not in the channel mapping. This channel will be masked out.
  warnings.warn(
/home/administrator/repos/deepcell-types/dct13-env/lib/python3.13/site-packages/torch/nn/modules/transformer.py:505: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  output = torch._nested_tensor_from_mask(
100%|█████████████████████████████████████████████| 1646/1646 [00:06<00:00, 246.86it/s]
(inference): 7it [00:13,  1.92s/it]

Predictions are returned as a list of strings, ordered by the cell indices in the segmentation mask. Since we relabeled the mask sequentially above, it’s straightforward to make this mapping explicit:

idx_to_pred = dict(enumerate(cell_types, start=1))

pd.DataFrame.from_dict(  # For nice table rendering
    idx_to_pred, orient="index", columns=["Cell type"]
)
Cell type
1 Myofibroblast
2 Myofibroblast
3 Endothelial
4 Stellate
5 Myofibroblast
... ...
1642 NK
1643 EVT
1644 Myofibroblast
1645 NK
1646 EVT

1646 rows × 1 columns

Depending on the subsequent analysis you wish to perform, it may be convenient to group the cells by their predicted cell-type:

from collections import defaultdict

# Convert the 1-1 `cell: type` mapping to a 1-many `type: list-of-cells` mapping
labels_by_celltype = defaultdict(list)
for idx, ct in idx_to_pred.items():
    labels_by_celltype[ct].append(idx)

Here’s the distribution of predicted cell types for this tissue:

from pprint import pprint

print(f"Total number of cells: {(num_cells := np.max(mask))}")

pprint(
    {
        k: f"{len(v)} ({100 * len(v) / num_cells:02.2f}%)"
        for k, v in labels_by_celltype.items()
    },
    sort_dicts=False,
)
Total number of cells: 1646
{'Myofibroblast': '100 (6.08%)',
 'Endothelial': '43 (2.61%)',
 'Stellate': '9 (0.55%)',
 'NK': '311 (18.89%)',
 'Fibroblast': '340 (20.66%)',
 'Macrophage': '191 (11.60%)',
 'SmoothMuscle': '54 (3.28%)',
 'NKT': '9 (0.55%)',
 'EVT': '485 (29.47%)',
 'Epithelial': '82 (4.98%)',
 'Melanocyte': '4 (0.24%)',
 'CD4T': '8 (0.49%)',
 'CD8T': '4 (0.24%)',
 'HSEC': '5 (0.30%)',
 'Microglial': '1 (0.06%)'}
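The summary above is listed in order of first appearance. If a frequency-sorted view is preferred, `collections.Counter` offers a compact alternative. This sketch uses a short toy list in place of the real `cell_types` output.

```python
from collections import Counter

# Toy predictions standing in for the output of the inference step
toy_cell_types = ["NK", "EVT", "NK", "Fibroblast", "EVT", "EVT"]

counts = Counter(toy_cell_types)
total = sum(counts.values())

# most_common() yields (cell type, count) pairs, most frequent first
for ct, n in counts.most_common():
    print(f"{ct:<12} {n:>4} ({100 * n / total:.2f}%)")
```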

Visualizing the results

There are many ways to visualize the cell-type prediction data, each with their own advantages and disadvantages. One way is to add an independent layer for each predicted cell type. The advantage of this approach is that individual layers can be toggled to focus on a particular cell type during interactive visualization.

# Regionprops to extract slices corresponding to each individual cell mask
props = skimage.measure.regionprops(mask)
prop_dict = {p.label: p for p in props}

# Create a binary mask layer for each celltype and populate it
# using the regionprops
# NOTE: use descriptive loop names to avoid shadowing the dataset key `k`
for ct, labels in labels_by_celltype.items():
    ctmask = np.zeros_like(mask, dtype=np.uint8)
    for idx in labels:
        p = prop_dict[idx]
        ctmask[p.slice][p.image] = 1
    mask_lyr = nim.add_labels(ctmask, name=f"{ct} ({len(labels)})")
    mask_lyr.colormap = napari.utils.DirectLabelColormap(
        color_dict={None: (0, 0, 0), 1: np.random.rand(3)}
    )
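For small images, the per-type binary masks can also be built without regionprops by testing label membership directly with `np.isin`. The sketch below shows this on toy data; it compares every pixel against the label list, so it may be slower than the slice-based approach above on large images with many cells.

```python
import numpy as np

# Toy label image and grouping standing in for the real mask and predictions
toy_mask = np.array([[0, 1, 1], [2, 2, 0], [3, 3, 3]], dtype=np.uint32)
toy_labels_by_celltype = {"NK": [1, 3], "EVT": [2]}

# One binary mask per cell type: 1 where the pixel belongs to a cell of that type
ctmasks = {
    ct: np.isin(toy_mask, labels).astype(np.uint8)
    for ct, labels in toy_labels_by_celltype.items()
}

print(ctmasks["NK"])
```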
# For static rendering - can safely be ignored if running notebook interactively
from pathlib import Path

screenshot_path = Path("../_static/_generated")
screenshot_path.mkdir(parents=True, exist_ok=True)
nim.screenshot(
    path=screenshot_path / "napari_celltype_layers.png",
    canvas_only=False,
);
Napari window of multiplexed image with celltype predictions
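As an alternative to one layer per cell type, all predictions can be packed into a single label image in which each pixel stores an integer code for its cell type. The sketch below does this with a lookup table on toy data. It assumes the mask is labeled sequentially from 1 and that the predictions are ordered by label, as in this tutorial; the code assignment (alphabetical, starting at 1) is an arbitrary choice.

```python
import numpy as np

# Toy label image and predictions for labels 1..3
toy_mask = np.array([[0, 1, 1], [2, 2, 0], [3, 3, 3]], dtype=np.uint32)
toy_cell_types = ["NK", "EVT", "NK"]

# Assign each cell type an integer code; 0 is reserved for background
type_names = sorted(set(toy_cell_types))
type_code = {name: i for i, name in enumerate(type_names, start=1)}

# Lookup table: index = cell label, value = cell-type code
lut = np.zeros(int(toy_mask.max()) + 1, dtype=np.uint8)
lut[1:] = [type_code[ct] for ct in toy_cell_types]

# Indexing the table with the mask converts cell labels to type codes
type_img = lut[toy_mask]
print(type_code)
print(type_img)
```

The resulting array can be added to a viewer as a single labels layer, trading the ability to toggle individual cell types for a more compact layer list.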