
Integrate scRNA-seq datasets#

scRNA-seq data integration combines data from several single-cell RNA sequencing experiments to uncover shared or distinct biological patterns.

Here, we query two scRNA-seq datasets by registered metadata, such as cell types, and then integrate them.

Setup#

!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
💡 loaded instance: testuser1/test-scrna

import lamindb as ln
import lnschema_bionty as lb
import anndata as ad
💡 loaded instance: testuser1/test-scrna (lamindb 0.54.2)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.54.2 lnschema_bionty==0.31.2
❗ record with similar name exist! did you mean to load it?
name       id              __ratio__
scRNA-seq  Nv48yAceNSh8z8  90.0
💡 Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='scrna2', version='0', type=notebook, updated_at=2023-09-26 15:22:44, created_by_id='DzTjkKse')
💡 Run(id='3PWYUNPHu3mR1SiDNzB6', run_at=2023-09-26 15:22:44, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')

Access #

Query files by provenance metadata#

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).search("scrna")
name                          id              __ratio__
Integrate scRNA-seq datasets  agayZTonayqAz8  90.0
scRNA-seq                     Nv48yAceNSh8z8  90.0
transform = ln.Transform.filter(id="Nv48yAceNSh8z8").one()
ln.File.filter(transform=transform).df()
id                    storage_id  key   suffix  accessor  description            version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
mRiMeuE6kiGVZd17JYY5  5VDhxnQV    None  .h5ad   AnnData   Conde22                None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  R6VHyzAPZWE010J0EfCw  None                2023-09-26 15:22:16  DzTjkKse
JbGzRmdZ1qkqC1uIhRFd  5VDhxnQV    None  .h5ad   AnnData   10x reference pbmc68k  None     660792    a2V0IgOjMRHsCeZH169UOQ  md5        Nv48yAceNSh8z8  R6VHyzAPZWE010J0EfCw  None                2023-09-26 15:22:39  DzTjkKse

Query files based on biological metadata#

assays = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
cell_types = lb.CellType.lookup()
query = ln.File.filter(
    experimental_factors=assays.single_cell_rna_sequencing,
    species=species.human,
    cell_types=cell_types.gamma_delta_t_cell,
)
query.df()
id                    storage_id  key   suffix  accessor  description            version  size      hash                    hash_type  transform_id    run_id                initial_version_id  updated_at           created_by_id
JbGzRmdZ1qkqC1uIhRFd  5VDhxnQV    None  .h5ad   AnnData   10x reference pbmc68k  None     660792    a2V0IgOjMRHsCeZH169UOQ  md5        Nv48yAceNSh8z8  R6VHyzAPZWE010J0EfCw  None                2023-09-26 15:22:39  DzTjkKse
mRiMeuE6kiGVZd17JYY5  5VDhxnQV    None  .h5ad   AnnData   Conde22                None     28049505  WEFcMZxJNmMiUOFrcSTaig  md5        Nv48yAceNSh8z8  R6VHyzAPZWE010J0EfCw  None                2023-09-26 15:22:16  DzTjkKse
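
Passing several keyword arguments to filter() narrows the result to files that carry all of the given labels, as is usual for Django-style filters. The toy below mimics that behavior in plain Python. The records and the filter_files helper are illustrative assumptions, not part of lamindb:

```python
# Toy stand-in for a file registry; records and label names are illustrative only.
files = [
    {
        "description": "dataset A",
        "species": {"human"},
        "cell_types": {"gamma-delta T cell", "monocyte"},
    },
    {
        "description": "dataset B",
        "species": {"human"},
        "cell_types": {"naive B cell"},
    },
    {
        "description": "dataset C",
        "species": {"mouse"},
        "cell_types": {"gamma-delta T cell"},
    },
]


def filter_files(records, **criteria):
    """Return records whose label sets contain every requested label (AND)."""
    return [
        record
        for record in records
        if all(label in record[field] for field, label in criteria.items())
    ]


# Hypothetical helper call: both conditions must hold for a record to match.
matches = filter_files(files, species="human", cell_types="gamma-delta T cell")
print([record["description"] for record in matches])
```

Only the record that is both human and annotated with gamma-delta T cells passes, which is why both real files appear in the query result above: each carries all three requested labels.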

Transform #

Compare gene sets#

Get file objects:

query = ln.File.filter()
file1, file2 = query.list()
file1.describe()
File(id='mRiMeuE6kiGVZd17JYY5', suffix='.h5ad', accessor='AnnData', description='Conde22', size=28049505, hash='WEFcMZxJNmMiUOFrcSTaig', hash_type='md5', updated_at=2023-09-26 15:22:16)

Provenance:
  🗃️ storage: Storage(id='5VDhxnQV', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-26 15:21:41, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-09-26 15:22:39, created_by_id='DzTjkKse')
  👣 run: Run(id='R6VHyzAPZWE010J0EfCw', run_at=2023-09-26 15:21:43, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:41)
Features:
  var: FeatureSet(id='uvZOKkBmvkqR3z9azMvj', n=36503, type='number', registry='bionty.Gene', hash='dnRexHCtxtmOU81_EpoJ', updated_at=2023-09-26 15:22:12, modality_id='8f5ws6Em', created_by_id='DzTjkKse')
    'FUT2', 'ZNF493', 'None', 'ESRRG', 'OR8A1', 'TMSB15B-AS1', 'ARHGEF3', 'None', 'NBR2', 'L3HYPDH', ...
  obs: FeatureSet(id='4gQLEeDIAYILFQMka6yJ', n=4, registry='core.Feature', hash='KViikKFECoDQO9CL-NyN', updated_at=2023-09-26 15:22:16, modality_id='AUT3hjtO', created_by_id='DzTjkKse')
    🔗 assay (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v2', '10x 5' v1', '10x 3' v3'
    🔗 donor (12, core.ULabel): 'A52', 'D496', 'A35', '621B', '637C', 'D503', '640C', '582C', 'A37', 'A29', ...
    🔗 cell_type (32, bionty.CellType): 'mucosal invariant T cell', 'CD8-positive, alpha-beta memory T cell', 'mast cell', 'progenitor cell', 'group 3 innate lymphoid cell', 'classical monocyte', 'lymphocyte', 'conventional dendritic cell', 'naive B cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', ...
    🔗 tissue (17, bionty.Tissue): 'blood', 'jejunal epithelium', 'thoracic lymph node', 'sigmoid colon', 'thymus', 'duodenum', 'caecum', 'lamina propria', 'liver', 'ileum', ...
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ tissues (17, bionty.Tissue): 'blood', 'jejunal epithelium', 'thoracic lymph node', 'sigmoid colon', 'thymus', 'duodenum', 'caecum', 'lamina propria', 'liver', 'ileum', ...
  🏷️ cell_types (32, bionty.CellType): 'mucosal invariant T cell', 'CD8-positive, alpha-beta memory T cell', 'mast cell', 'progenitor cell', 'group 3 innate lymphoid cell', 'classical monocyte', 'lymphocyte', 'conventional dendritic cell', 'naive B cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', ...
  🏷️ experimental_factors (4, bionty.ExperimentalFactor): 'single-cell RNA sequencing', '10x 5' v2', '10x 5' v1', '10x 3' v3'
  🏷️ ulabels (12, core.ULabel): 'A52', 'D496', 'A35', '621B', '637C', 'D503', '640C', '582C', 'A37', 'A29', ...
file1.view_flow()
_images/ec0c656d40cb77692fb9c87820a20444aad295db299e10d9efb64c519e9c4351.svg
file2.describe()
File(id='JbGzRmdZ1qkqC1uIhRFd', suffix='.h5ad', accessor='AnnData', description='10x reference pbmc68k', size=660792, hash='a2V0IgOjMRHsCeZH169UOQ', hash_type='md5', updated_at=2023-09-26 15:22:39)

Provenance:
  🗃️ storage: Storage(id='5VDhxnQV', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna', type='local', updated_at=2023-09-26 15:21:41, created_by_id='DzTjkKse')
  📔 transform: Transform(id='Nv48yAceNSh8z8', name='scRNA-seq', short_name='scrna', version='0', type='notebook', updated_at=2023-09-26 15:22:39, created_by_id='DzTjkKse')
  👣 run: Run(id='R6VHyzAPZWE010J0EfCw', run_at=2023-09-26 15:21:43, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:41)
Features:
  var: FeatureSet(id='MQPU18dOZrpTw3zImorV', n=754, type='number', registry='bionty.Gene', hash='WMDxN7253SdzGwmznV5d', updated_at=2023-09-26 15:22:39, modality_id='8f5ws6Em', created_by_id='DzTjkKse')
    'FGR', 'APTR', 'TPR', 'DNAJB1', 'FCN1', 'SELENOS', 'ANXA1', 'EBPL', 'JCHAIN', 'PRSS57', ...
  obs: FeatureSet(id='1Ee7GMdVnNaWFJWWFOuE', n=1, registry='core.Feature', hash='U42Q4GyIP0fFVXaGUk8v', updated_at=2023-09-26 15:22:39, modality_id='AUT3hjtO', created_by_id='DzTjkKse')
    🔗 cell_type (9, bionty.CellType): 'gamma-delta T cell', 'CD4-positive, alpha-beta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'B cell, CD19-positive', 'cytotoxic T cell', 'monocyte', 'dendritic cell', 'CD24-positive, CD4 single-positive thymocyte', 'CD16-positive, CD56-dim natural killer cell, human'
  external: FeatureSet(id='l1HNo9eWNzNKTcaxRBDR', n=2, registry='core.Feature', hash='yHmwwnoFJDU0_1t7TLmH', updated_at=2023-09-26 15:22:39, modality_id='AUT3hjtO', created_by_id='DzTjkKse')
    🔗 species (1, bionty.Species): 'human'
    🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
Labels:
  🏷️ species (1, bionty.Species): 'human'
  🏷️ cell_types (9, bionty.CellType): 'gamma-delta T cell', 'CD4-positive, alpha-beta T cell', 'CD8-positive, CD25-positive, alpha-beta regulatory T cell', 'B cell, CD19-positive', 'cytotoxic T cell', 'monocyte', 'dendritic cell', 'CD24-positive, CD4 single-positive thymocyte', 'CD16-positive, CD56-dim natural killer cell, human'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
file2.view_flow()
_images/3e26fd02f945d8d8b138d39fca5d5eb8799651fa4bd75eedf02822d126d8b48f.svg

Load files into memory:

file1_adata = file1.load()
file2_adata = file2.load()

The registered feature sets let us compute the shared genes without loading the files:

file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
len(shared_genes)
749
shared_genes.list("symbol")[:10]
['NDUFAF3',
 'CD82',
 'TEX264',
 'ARID4B',
 'F12',
 'CCDC167',
 'POLR1H',
 'EXOG',
 'LMAN2',
 'GNG7']
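
Conceptually, the shared genes are a set intersection of the two gene lists. The sketch below shows that idea on small, made-up gene sets (the membership of each symbol in genes_a and genes_b is hypothetical), then repeats the bookkeeping with the counts reported above: 754 genes in the second file's var feature set and 749 shared genes.

```python
# Hypothetical gene symbols; the real feature sets hold hundreds to tens of
# thousands of entries.
genes_a = {"CD82", "F12", "GNG7", "FUT2", "ESRRG"}
genes_b = {"CD82", "F12", "GNG7", "JCHAIN"}

shared = genes_a & genes_b  # genes measured in both datasets
only_b = genes_b - genes_a  # genes of B that have no counterpart in A

print(sorted(shared))
print(sorted(only_b))

# Counts taken from the outputs above.
n_file2, n_shared = 754, 749
n_missing = n_file2 - n_shared                 # file2 genes absent from file1
fraction_shared = round(n_shared / n_file2, 3)  # share of file2 genes kept
print(n_missing, fraction_shared)
```

Almost all genes of the smaller dataset are also present in the larger one, so restricting the analysis to shared genes discards only 5 of its 754 genes.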

Compare cell types#

file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell']

We can now subset the two datasets by shared cell types:

file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]

file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
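
The isin(...) call produces one boolean per cell, and indexing with that mask keeps only the matching cells. A plain-Python sketch of the same idea follows; the per-cell labels in cell_type are made up for illustration and are not taken from either file.

```python
# Shared cell type names as reported above.
shared_names = {
    "CD16-positive, CD56-dim natural killer cell, human",
    "gamma-delta T cell",
}

# Hypothetical per-cell annotations standing in for adata.obs["cell_type"].
cell_type = [
    "gamma-delta T cell",
    "monocyte",
    "CD16-positive, CD56-dim natural killer cell, human",
    "naive B cell",
    "gamma-delta T cell",
]

# Boolean mask: True where the cell's label is one of the shared names.
mask = [label in shared_names for label in cell_type]

# Keep only the positions (cells) where the mask is True.
kept_positions = [i for i, keep in enumerate(mask) if keep]

print(mask)
print(kept_positions)
```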

Concatenate the subset datasets. By default, ad.concat keeps only the variables (genes) present in all inputs (inner join):

adata_concat = ad.concat(
    [file1_adata_subset, file2_adata_subset],
    label="file",
    keys=[file1.description, file2.description],
)
adata_concat
AnnData object with n_obs × n_vars = 187 × 749
    obs: 'cell_type', 'file'
    obsm: 'X_umap'
adata_concat.obs.value_counts()
cell_type                                           file                 
CD16-positive, CD56-dim natural killer cell, human  Conde22                  114
gamma-delta T cell                                  Conde22                   66
                                                    10x reference pbmc68k      4
CD16-positive, CD56-dim natural killer cell, human  10x reference pbmc68k      3
dtype: int64
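
As a quick sanity check, the group sizes should add up to the number of observations in the concatenated object (187). The sketch below re-enters the counts from the value_counts() output by hand and totals them overall and per file; it does not touch the AnnData object itself.

```python
from collections import Counter

# Counts copied from the value_counts() output: (cell type, file) -> cells.
counts = {
    ("CD16-positive, CD56-dim natural killer cell, human", "Conde22"): 114,
    ("gamma-delta T cell", "Conde22"): 66,
    ("gamma-delta T cell", "10x reference pbmc68k"): 4,
    ("CD16-positive, CD56-dim natural killer cell, human", "10x reference pbmc68k"): 3,
}

# Total number of cells across all groups.
total_cells = sum(counts.values())

# Number of cells contributed by each source file.
per_file = Counter()
for (_cell_type, file_name), n_cells in counts.items():
    per_file[file_name] += n_cells

print(total_cells)
print(dict(per_file))
```

The total matches n_obs, and the per-file split (180 cells from Conde22 versus 7 from 10x reference pbmc68k) shows how unbalanced the two sources are, which is worth keeping in mind for any downstream comparison.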
# clean up test instance
!lamin delete --force test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna