
scRNA-seq

You’ll learn how to manage a growing number of scRNA-seq data shards as a single queryable collection.

Along the way, you’ll see how to create reports, leverage data lineage, and query individual data shards stored as files.

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE Census guide.

Here, you will:

  1. create an Artifact from an AnnData object and seed a growing Collection with it (scrna1/6, current page)

  2. append a new data batch (a new .h5ad file) and create a new version of this collection (scrna2/6)

  3. query & inspect artifacts by metadata individually (scrna3/6)

  4. load the joint collection into memory and save analytical results (scrna4/6)

  5. iterate over the collection, train a model, store a derived representation (scrna5/6)

  6. discuss converting a number of artifacts to a single TileDB SOMA store of the same data (scrna6/6)
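
The relationship between data shards and a versioned collection in the steps above can be pictured with a small toy model. This is a plain-Python sketch, not LaminDB's implementation: `ToyArtifact`, `ToyCollection`, and `append` are hypothetical names introduced only for illustration, and the size of the second batch is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArtifact:
    """Hypothetical stand-in for one data shard, e.g. a single .h5ad file."""

    key: str
    n_obs: int


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for one immutable version of a collection."""

    name: str
    version: str
    artifacts: tuple

    def append(self, artifact: ToyArtifact) -> "ToyCollection":
        # Appending never mutates an existing version; it returns a new one.
        next_version = str(int(self.version) + 1)
        return ToyCollection(self.name, next_version, self.artifacts + (artifact,))

    @property
    def n_obs(self) -> int:
        # Total number of observations across all shards of this version.
        return sum(a.n_obs for a in self.artifacts)


# Version 1 is seeded with a single shard (1648 cells, as in this guide).
v1 = ToyCollection(
    "My versioned scRNA-seq collection", "1", (ToyArtifact("batch1.h5ad", 1648),)
)
# A second shard arrives later; its size (70) is an invented number.
v2 = v1.append(ToyArtifact("batch2.h5ad", 70))

print(v1.version, len(v1.artifacts), v1.n_obs)
print(v2.version, len(v2.artifacts), v2.n_obs)
```

The point of the sketch is that version 1 stays intact and queryable after a new batch is appended, which is the behavior the following guides rely on.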

Setup

!lamin init --storage ./test-scrna --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-03-04 17:07:52 UTC)
✅ saved: Storage(uid='VTVxF6EC', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2024-03-04 17:07:52 UTC, created_by_id=1)
💡 loaded instance: testuser1/test-scrna
💡 did not register local instance on lamin.ai
import lamindb as ln
import bionty as bt

ln.settings.verbosity = "hint"
💡 lamindb instance: testuser1/test-scrna
ln.transform.stem_uid = "Nv48yAceNSh8"
ln.transform.version = "1"
ln.track()
💡 Assuming editor is Jupyter Lab.
💡 notebook imports: bionty==0.41.0 lamindb==0.68.0
💡 saved: Transform(uid='Nv48yAceNSh85zKv', name='scRNA-seq', short_name='scrna', version='1', type=notebook, updated_at=2024-03-04 17:07:55 UTC, created_by_id=1)
💡 saved: Run(uid='ypTwLmvw0MDHtxhOdyA1', run_at=2024-03-04 17:07:55 UTC, transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_ypTwLmvw0MDHtxhOdyA1.txt

Ingest an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CZ CELLxGENE.

By calling anndata_human_immune_cells(), we load a subsampled version of the collection from CZ CELLxGENE and pre-populate the corresponding LaminDB registries: Feature, ULabel, Gene, CellType, CellLine, ExperimentalFactor.

adata = ln.dev.datasets.anndata_human_immune_cells(populate_registries=True)
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

This AnnData object is standardized using the CZI single-cell-curation validator with the same public ontologies that underlie bionty. Because registries are pre-populated, validation passes.

Note

In the next guide, we’ll curate a non-standardized collection.

The gene registry provides metadata for each of the 36k genes measured in the AnnData:

bt.Gene.df()
id     uid           symbol       stable_id  ensembl_gene_id  ncbi_gene_ids     biotype         description                                       synonyms  organism_id  public_source_id  created_at                        updated_at                        created_by_id
36390  13TcSi2lpH6w  None         None       ENSG00000277196                    protein_coding  proline dehydrogenase 1                                     1            9                 2024-03-04 17:08:08.794651+00:00  2024-03-04 17:08:08.794659+00:00  1
36389  e6mZ4HkNLubl  None         None       ENSG00000278817                    protein_coding  None                                                        1            9                 2024-03-04 17:08:08.794552+00:00  2024-03-04 17:08:08.794561+00:00  1
36388  7lbvdjjwEPWE  None         None       ENSG00000276017                    protein_coding  None                                                        1            9                 2024-03-04 17:08:08.794453+00:00  2024-03-04 17:08:08.794462+00:00  1
36387  2gA5h8flLUoS  None         None       ENSG00000278633                    protein_coding  None                                                        1            9                 2024-03-04 17:08:08.794355+00:00  2024-03-04 17:08:08.794364+00:00  1
36386  78xgwwESCqbV  None         None       ENSG00000277836                    protein_coding  None                                                        1            9                 2024-03-04 17:08:08.794255+00:00  2024-03-04 17:08:08.794264+00:00  1
...    ...           ...          ...        ...              ...               ...             ...                                               ...       ...          ...               ...                               ...                               ...
5      1jEmSWWnEaEB  None         None       ENSG00000239945                    lncRNA          novel transcript                                            1            9                 2024-03-04 17:08:03.381236+00:00  2024-03-04 17:08:03.381245+00:00  1
4      38Y2cbnNr2Eo  None         None       ENSG00000238009                    lncRNA          novel transcript                                            1            9                 2024-03-04 17:08:03.381132+00:00  2024-03-04 17:08:03.381142+00:00  1
3      2C6vvj3UpzFH  OR4F5        None       ENSG00000186092  79501             protein_coding  olfactory receptor family 4 subfamily F member 5            1            9                 2024-03-04 17:08:03.381027+00:00  2024-03-04 17:08:03.381038+00:00  1
2      7jplpj8nX5SZ  FAM138A      None       ENSG00000237613  645520|124906933  lncRNA          family with sequence similarity 138 member A      F379      1            9                 2024-03-04 17:08:03.380918+00:00  2024-03-04 17:08:03.380930+00:00  1
1      6CIfseu2Ft4B  MIR1302-2HG  None       ENSG00000243485                    lncRNA          MIR1302-2 host gene                                         1            9                 2024-03-04 17:08:03.380775+00:00  2024-03-04 17:08:03.380815+00:00  1

36390 rows × 13 columns

When we create an Artifact object from an AnnData, we automatically link its features:

artifact = ln.Artifact.from_anndata(adata, description="Human immune cells from Conde22")
artifact
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/DmWBZLaQ5ROqUGbyx2cz.h5ad')
Artifact(uid='DmWBZLaQ5ROqUGbyx2cz', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
artifact.save()
✅ storing artifact 'DmWBZLaQ5ROqUGbyx2cz' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/DmWBZLaQ5ROqUGbyx2cz.h5ad'

Link the artifact to the features of the AnnData object:

artifact.features.add_from_anndata(
    var_field=bt.Gene.ensembl_gene_id,  # field to validate and link features in var index
    organism="human",  # or set globally: bt.settings.organism = "human"
)
💡 parsing feature names of X stored in slot 'var'
36390 terms (99.70%) are validated for ensembl_gene_id
113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
✅    linked: FeatureSet(uid='JDxC5yVPWtDWCDaZygPz', n=36390, type='number', registry='bionty.Gene', hash='gRQGj3QB8ZsIfXA1BjiL', created_by_id=1)
💡 parsing feature names of slot 'obs'
4 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='mu8Se2IFDv8QKp9mQhsR', n=4, registry='core.Feature', hash='zSzJsFYX3Pk2dn4CtNBd', created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
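
The report above splits the `var` index into identifiers that are found in the gene registry and identifiers that are not. The bookkeeping behind such a report can be sketched in plain Python, assuming exact string matching against a set of known IDs. The `validate` helper and the three-entry registry are hypothetical and only illustrate the idea; they aren't LaminDB's internal validation logic.

```python
def validate(values, registry):
    """Split values into (validated, not_validated) by membership in registry."""
    known = set(registry)
    validated = [v for v in values if v in known]
    not_validated = [v for v in values if v not in known]
    return validated, not_validated


# A tiny registry built from three Ensembl IDs shown in the gene table above.
registry = {"ENSG00000243485", "ENSG00000237613", "ENSG00000186092"}

# A var index with one ID that the log above reports as not validated.
var_index = [
    "ENSG00000243485",
    "ENSG00000237613",
    "ENSG00000186092",
    "ENSG00000269933",
]

validated, not_validated = validate(var_index, registry)
pct = 100 * len(validated) / len(var_index)
print(f"{len(validated)} terms ({pct:.2f}%) are validated")
print(f"{len(not_validated)} terms are not validated: {', '.join(not_validated)}")
```

Only validated identifiers end up in the linked feature set, which is why the `var` feature set above reports n=36390 although the AnnData has 36503 variables.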

The artifact has 2 linked feature sets, one for measured genes and one for measured metadata:

artifact.features
Features:
  var: FeatureSet(uid='JDxC5yVPWtDWCDaZygPz', n=36390, type='number', registry='bionty.Gene', hash='gRQGj3QB8ZsIfXA1BjiL', updated_at=2024-03-04 17:08:15 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='mu8Se2IFDv8QKp9mQhsR', n=4, registry='core.Feature', hash='zSzJsFYX3Pk2dn4CtNBd', updated_at=2024-03-04 17:08:16 UTC, created_by_id=1)
    🔗 cell_type (0, bionty.CellType): 
    🔗 assay (0, bionty.ExperimentalFactor): 
    🔗 tissue (0, bionty.Tissue): 
    🔗 donor (0, core.ULabel): 

Let’s now annotate the artifact with labels:

experimental_factors = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()

artifact.labels.add(organism.human, feature=features.organism)
artifact.labels.add(
    experimental_factors.single_cell_rna_sequencing, feature=features.assay
)
artifact.labels.add(adata.obs.cell_type, feature=features.cell_type)
artifact.labels.add(adata.obs.assay, feature=features.assay)
artifact.labels.add(adata.obs.tissue, feature=features.tissue)
artifact.labels.add(adata.obs.donor, feature=features.donor)
✅ linked new feature 'organism' together with new feature set FeatureSet(uid='X0bWqEpRUpvhCB8kzojh', n=1, registry='core.Feature', hash='mcCS9PYNWr62UH7Wq1oS', updated_at=2024-03-04 17:08:17 UTC, created_by_id=1)
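
Judging by the label counts reported further down (for example 12 donors for 1648 cells), passing a whole column such as `adata.obs.donor` annotates the artifact with the distinct values of that column rather than one label per cell. A plain-Python sketch of that reduction follows; `distinct_labels` is a hypothetical helper, and the three records are invented combinations of label values that appear in this guide.

```python
from collections import defaultdict


def distinct_labels(records, features):
    """Collect the distinct values per feature, preserving first-seen order."""
    labels = defaultdict(list)
    for record in records:
        for feature in features:
            value = record[feature]
            if value not in labels[feature]:
                labels[feature].append(value)
    return dict(labels)


# Invented per-cell metadata; a real AnnData stores this in adata.obs.
obs = [
    {"donor": "D496", "tissue": "blood", "cell_type": "classical monocyte"},
    {"donor": "621B", "tissue": "blood", "cell_type": "memory B cell"},
    {"donor": "D496", "tissue": "spleen", "cell_type": "classical monocyte"},
]

labels = distinct_labels(obs, ["donor", "tissue", "cell_type"])
for feature, values in labels.items():
    print(f"{feature} ({len(values)}): {values}")
```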

The artifact is now validated & queryable by everything we linked:

artifact.describe()
Artifact(uid='DmWBZLaQ5ROqUGbyx2cz', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-03-04 17:08:16 UTC)

Provenance:
  🗃️ storage: Storage(uid='VTVxF6EC', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2024-03-04 17:07:52 UTC, created_by_id=1)
  💫 transform: Transform(uid='Nv48yAceNSh85zKv', name='scRNA-seq', short_name='scrna', version='1', type=notebook, updated_at=2024-03-04 17:07:55 UTC, created_by_id=1)
  👣 run: Run(uid='ypTwLmvw0MDHtxhOdyA1', run_at=2024-03-04 17:07:55 UTC, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-03-04 17:07:52 UTC)
Features:
  var: FeatureSet(uid='JDxC5yVPWtDWCDaZygPz', n=36390, type='number', registry='bionty.Gene', hash='gRQGj3QB8ZsIfXA1BjiL', updated_at=2024-03-04 17:08:15 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='mu8Se2IFDv8QKp9mQhsR', n=4, registry='core.Feature', hash='zSzJsFYX3Pk2dn4CtNBd', updated_at=2024-03-04 17:08:16 UTC, created_by_id=1)
    🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  external: FeatureSet(uid='X0bWqEpRUpvhCB8kzojh', n=1, registry='core.Feature', hash='mcCS9PYNWr62UH7Wq1oS', updated_at=2024-03-04 17:08:17 UTC, created_by_id=1)
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 3' v3', '10x 5' v2', '10x 5' v1'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(
    artifact, name="My versioned scRNA-seq collection", version="1"
)
collection.save()
collection.labels.add_from(artifact)  # seed the initial labels of the collection
💡 transferring cell_type
💡 transferring assay
💡 transferring tissue
💡 transferring donor
💡 transferring organism

In version 1, the collection and its single artifact contain the same data. Both are still tracked and queryable independently through their own registries:

collection.describe()
Collection(uid='DmWBZLaQ5ROqUGbyx2cz', name='My versioned scRNA-seq collection', version='1', hash='9sXda5E7BYiVoDOQkTC0KB', visibility=1, updated_at=2024-03-04 17:08:18 UTC)

Provenance:
  💫 transform: Transform(uid='Nv48yAceNSh85zKv', name='scRNA-seq', short_name='scrna', version='1', type=notebook, updated_at=2024-03-04 17:07:55 UTC, created_by_id=1)
  👣 run: Run(uid='ypTwLmvw0MDHtxhOdyA1', run_at=2024-03-04 17:07:55 UTC, transform_id=1, created_by_id=1)
  📄 artifact: Artifact(uid='DmWBZLaQ5ROqUGbyx2cz', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-03-04 17:08:18 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-03-04 17:07:52 UTC)
Features:
  var: FeatureSet(uid='JDxC5yVPWtDWCDaZygPz', n=36390, type='number', registry='bionty.Gene', hash='gRQGj3QB8ZsIfXA1BjiL', updated_at=2024-03-04 17:08:15 UTC, created_by_id=1)
    'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'OR4F29', 'None', 'OR4F16', 'None', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C', 'None', ...
  obs: FeatureSet(uid='mu8Se2IFDv8QKp9mQhsR', n=4, registry='core.Feature', hash='zSzJsFYX3Pk2dn4CtNBd', updated_at=2024-03-04 17:08:16 UTC, created_by_id=1)
    🔗 cell_type (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    🔗 assay (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
    🔗 tissue (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    🔗 donor (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  external: FeatureSet(uid='X0bWqEpRUpvhCB8kzojh', n=1, registry='core.Feature', hash='mcCS9PYNWr62UH7Wq1oS', updated_at=2024-03-04 17:08:17 UTC, created_by_id=1)
    🔗 organism (1, bionty.Organism): 'human'
Labels:
  🏷️ organism (1, bionty.Organism): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  🏷️ cell_types (32, bionty.CellType): 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): '10x 3' v3', '10x 5' v2', '10x 5' v1', 'single-cell RNA sequencing'
  🏷️ ulabels (12, core.ULabel): 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...

Access the underlying artifact like so:

collection.artifact
Artifact(uid='DmWBZLaQ5ROqUGbyx2cz', suffix='.h5ad', accessor='AnnData', description='Human immune cells from Conde22', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-03-04 17:08:18 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)

See data lineage:

collection.view_lineage()
[Figure: lineage graph rendered by collection.view_lineage()]
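
One way to picture such a lineage graph is as a set of "derived from" edges that can be walked upstream. The sketch below is a toy model in plain Python, assuming the chain transform, run, artifact, collection suggested by the Provenance sections above. The node labels and the `upstream` helper are hypothetical and unrelated to how `view_lineage()` is implemented.

```python
# Each node maps to the nodes it was directly derived from.
parents = {
    "collection": ["artifact"],
    "artifact": ["run"],
    "run": ["transform"],
    "transform": [],
}


def upstream(node, parents):
    """Return all ancestors of node, nearest first, visiting each node once."""
    seen = []
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen


print(upstream("collection", parents))
print(upstream("run", parents))
```

Walking the edges from the collection reaches the artifact, the run that produced it, and the notebook (transform) behind that run.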