
Train a machine learning model on a collection#

Here, we iterate over the artifacts within a collection to train a machine learning model at scale.

import lamindb as ln
馃挕 connected lamindb: testuser1/test-scrna
ln.settings.transform.stem_uid = "Qr1kIHvK506r"
ln.settings.transform.version = "1"
ln.track()
馃挕 notebook imports: lamindb==0.70.2 torch==2.2.2
馃挕 saved: Transform(uid='Qr1kIHvK506r5zKv', name='Train a machine learning model on a collection', key='scrna5', version='1', type='notebook', updated_at=2024-04-19 17:46:52 UTC, created_by_id=1)
馃挕 saved: Run(uid='8LDHMBx13nM0zxdTFD8M', transform_id=5, created_by_id=1)

Query our collection:

collection = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="2"
).one()
collection.describe()
Collection(uid='LGFtQ96jRdgVCPt3xErf', name='My versioned scRNA-seq collection', version='2', hash='HNR3VFV60_yqRnUka11E', visibility=1, updated_at=2024-04-19 17:46:31 UTC)

Provenance:
  馃搸 transform: Transform(uid='ManDYgmftZ8C5zKv', name='Standardize and append a batch of data', key='scrna2', version='1', type='notebook')
  馃搸 run: Run(uid='lLBAoqmUsZzBW7JkffZQ', started_at=2024-04-19 17:46:07 UTC, is_consecutive=True)
  馃搸 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
  馃搸 input_of (core.Run): ['2024-04-19 17:46:42 UTC']
Features:
  var: FeatureSet(uid='AMmAPnLnkAhv6ZwRny1p', n=36508, type='number', registry='bionty.Gene')
    'LINC01477', 'BPIFA1', 'BEX1', 'SCUBE3', 'COG5', 'DPH2', 'MIF', 'PARP8', 'ACOT7', 'CNRIP1', 'PSMC3IP', 'RRP1B', 'LCA5L', 'RAI1-AS1', 'DEPTOR', 'BSCL2', 'VIRMA-DT', 'CDRT4', 'MRGPRX2', 'TAS1R3', ...
  obs: FeatureSet(uid='iDA0HIaDIskc7pnYnQ1o', n=4, registry='core.Feature')
    馃敆 donor (12, core.ULabel): 'D496', 'A37', 'A31', '582C', '640C', 'A52', 'D503', 'A35', 'A36', 'A29', ...
    馃敆 tissue (17, bionty.Tissue): 'thoracic lymph node', 'ileum', 'mesenteric lymph node', 'lung', 'liver', 'blood', 'jejunal epithelium', 'spleen', 'caecum', 'thymus', ...
    馃敆 cell_type (40, bionty.CellType): 'CD14-positive, CD16-negative classical monocyte', 'cytotoxic T cell', 'regulatory T cell', 'alpha-beta T cell', 'plasmacytoid dendritic cell', 'lymphocyte', 'effector memory CD4-positive, alpha-beta T cell', 'megakaryocyte', 'gamma-delta T cell', 'naive B cell', ...
    馃敆 assay (3, bionty.ExperimentalFactor): '10x 5' v2', '10x 5' v1', '10x 3' v3'
Labels:
  馃搸 tissues (17, bionty.Tissue): 'thoracic lymph node', 'ileum', 'mesenteric lymph node', 'lung', 'liver', 'blood', 'jejunal epithelium', 'spleen', 'caecum', 'thymus', ...
  馃搸 cell_types (40, bionty.CellType): 'CD14-positive, CD16-negative classical monocyte', 'cytotoxic T cell', 'regulatory T cell', 'alpha-beta T cell', 'plasmacytoid dendritic cell', 'lymphocyte', 'effector memory CD4-positive, alpha-beta T cell', 'megakaryocyte', 'gamma-delta T cell', 'naive B cell', ...
  馃搸 experimental_factors (3, bionty.ExperimentalFactor): '10x 5' v2', '10x 5' v1', '10x 3' v3'
  馃搸 ulabels (12, core.ULabel): 'D496', 'A37', 'A31', '582C', '640C', 'A52', 'D503', 'A35', 'A36', 'A29', ...

Create a map-style dataset#

Let us create a map-style dataset using mapped(): a MappedCollection. This is what, for example, the PyTorch DataLoader expects as an input.

Under the hood, it performs a virtual join of the features of the underlying AnnData objects, which allows you to work with very large collections without loading them into memory.

You can either perform a virtual inner join:

with collection.mapped(obs_keys=["cell_type"], join="inner") as dataset:
    print(len(dataset.var_joint))
749

Or a virtual outer join:

dataset = collection.mapped(obs_keys=["cell_type"], join="outer")
len(dataset.var_joint)
36508
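The two join modes can be sketched with plain set operations on gene (var) names. The gene names below are made up for illustration; the real MappedCollection performs the join virtually, without materializing the data:

```python
# Hypothetical var names from two AnnData objects in a collection
var_names_1 = ["CD4", "CD8A", "MS4A1", "NKG7"]
var_names_2 = ["CD4", "CD8A", "FOXP3"]

# inner join: only genes present in every object
var_inner = set(var_names_1) & set(var_names_2)

# outer join: all genes seen in any object (missing values are zero-filled)
var_outer = set(var_names_1) | set(var_names_2)

print(sorted(var_inner))  # ['CD4', 'CD8A']
print(len(var_outer))     # 5
```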

This is compatible with a PyTorch DataLoader because it implements __getitem__ over a list of backed AnnData objects. For example, the cell at index 5 in the collection can be accessed like:

dataset[5]
{'X': array([ 0.   ,  0.   ,  0.   , ...,  0.   ,  0.   , -0.456], dtype=float32),
 '_store_idx': 0,
 'cell_type': 0}
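A map-style dataset is simply an object implementing __len__ and __getitem__. A minimal sketch of the idea, with toy lists standing in for backed AnnData files (this is an illustration, not the real MappedCollection implementation):

```python
class ToyMappedCollection:
    """Toy map-style dataset spanning several per-file stores."""

    def __init__(self, stores):
        # each inner list stands in for one backed AnnData file
        self.stores = stores

    def __len__(self):
        return sum(len(store) for store in self.stores)

    def __getitem__(self, idx):
        # resolve a global index to (store, local index)
        for store_idx, store in enumerate(self.stores):
            if idx < len(store):
                return {"X": store[idx], "_store_idx": store_idx}
            idx -= len(store)
        raise IndexError(idx)


ds = ToyMappedCollection([[10, 11, 12], [20, 21]])
print(len(ds))  # 5
print(ds[4])    # {'X': 21, '_store_idx': 1}
```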

The labels are encoded into integers:

dataset.encoders
{'cell_type': {'cytotoxic T cell': 0,
  'lymphocyte': 1,
  'plasmacytoid dendritic cell': 2,
  'alpha-beta T cell': 3,
  'regulatory T cell': 4,
  'effector memory CD4-positive, alpha-beta T cell': 5,
  'megakaryocyte': 6,
  'gamma-delta T cell': 7,
  'naive B cell': 8,
  'animal cell': 9,
  'germinal center B cell': 10,
  'non-classical monocyte': 11,
  'progenitor cell': 12,
  'CD4-positive helper T cell': 13,
  'CD8-positive, CD25-positive, alpha-beta regulatory T cell': 14,
  'CD38-positive naive B cell': 15,
  'alveolar macrophage': 16,
  'naive thymus-derived CD4-positive, alpha-beta T cell': 17,
  'CD4-positive, alpha-beta T cell': 18,
  'CD16-negative, CD56-bright natural killer cell, human': 19,
  'CD8-positive, alpha-beta memory T cell, CD45RO-positive': 20,
  'dendritic cell, human': 21,
  'classical monocyte': 22,
  'B cell, CD19-positive': 23,
  'mast cell': 24,
  'CD8-positive, alpha-beta memory T cell': 25,
  'plasmablast': 26,
  'conventional dendritic cell': 27,
  'plasma cell': 28,
  'effector memory CD4-positive, alpha-beta T cell, terminally differentiated': 29,
  'macrophage': 30,
  'memory B cell': 31,
  'T follicular helper cell': 32,
  'dendritic cell': 33,
  'naive thymus-derived CD8-positive, alpha-beta T cell': 34,
  'CD16-positive, CD56-dim natural killer cell, human': 35,
  'effector memory CD8-positive, alpha-beta T cell, terminally differentiated': 36,
  'group 3 innate lymphoid cell': 37,
  'mucosal invariant T cell': 38,
  'CD14-positive, CD16-negative classical monocyte': 39}}
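To map predicted integers back to label names, you can invert one of these encoder dicts. A sketch with a toy encoder (in practice you would invert an entry of dataset.encoders):

```python
# toy subset of a label encoder, as returned in dataset.encoders
encoder = {"cytotoxic T cell": 0, "lymphocyte": 1, "plasmacytoid dendritic cell": 2}

# build the inverse mapping: integer code -> label name
decoder = {code: label for label, code in encoder.items()}

print(decoder[1])  # lymphocyte
```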

Create a PyTorch DataLoader#

Let us use a weighted sampler:

from torch.utils.data import DataLoader, WeightedRandomSampler

# the label_key used for weights doesn't have to be among the obs_keys passed on init
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("cell_type"), num_samples=len(dataset)
)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)

We can now iterate through the data loader:

for batch in dataloader:
    pass
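A weighted sampler draws each cell with probability proportional to its weight, so weighting each cell by the inverse frequency of its label balances rare and common cell types in the batches. A stdlib sketch of that weighting scheme (assuming get_label_weights follows this inverse-frequency idea):

```python
from collections import Counter

# toy cell-type labels: one common class, one rare class
labels = ["T cell", "T cell", "T cell", "B cell"]
counts = Counter(labels)

# one weight per cell: 1 / count of its label
weights = [1.0 / counts[label] for label in labels]
print(weights)  # [0.333..., 0.333..., 0.333..., 1.0]

# the expected number of draws per label is now equal:
# 3 cells * 1/3 == 1 cell * 1.0
```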

Close the connections in MappedCollection:

dataset.close()
In practice, use a context manager, which closes the connections automatically:
with collection.mapped(obs_keys=["cell_type"]) as dataset:
    sampler = WeightedRandomSampler(
        weights=dataset.get_label_weights("cell_type"), num_samples=len(dataset)
    )
    dataloader = DataLoader(dataset, batch_size=128, sampler=sampler)
    for batch in dataloader:
        pass