Linking scRNA-seq data against Gene#

Hide code cell content
!lamin login testuser2
✅ Logged in with email testuser2@lamin.ai and id bKeW4T6E
import lamindb as ln
import lnschema_bionty as bt

ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser2
ℹ️ Added notebook: Transform(id='Nv48yAceNSh8', v='0', name='11-scrna', type=notebook, title='Linking scRNA-seq data against `Gene`', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 17, 18))
ℹ️ Added run: Run(id='H8XISAvITiHbM2Z0nxtn', transform_id='Nv48yAceNSh8', transform_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 17, 18))

Consider an scRNA-seq count matrix in form of an AnnData object in memory:

Hide code cell content
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()
adata
AnnData object with n_obs × n_vars = 1135 × 10000
    obs: 'Sample Characteristic[organism]', 'Sample Characteristic Ontology Term[organism]', 'Sample Characteristic[strain]', 'Sample Characteristic Ontology Term[strain]', 'Sample Characteristic[age]', 'Sample Characteristic Ontology Term[age]', 'Sample Characteristic[developmental stage]', 'Sample Characteristic Ontology Term[developmental stage]', 'Sample Characteristic[sex]', 'Sample Characteristic Ontology Term[sex]', 'Sample Characteristic[genotype]', 'Sample Characteristic Ontology Term[genotype]', 'Sample Characteristic[organism part]', 'Sample Characteristic Ontology Term[organism part]', 'Sample Characteristic[cell type]', 'Sample Characteristic Ontology Term[cell type]', 'Sample Characteristic[immunophenotype]', 'Sample Characteristic Ontology Term[immunophenotype]', 'Sample Characteristic[post analysis well quality]', 'Sample Characteristic Ontology Term[post analysis well quality]', 'Factor Value[single cell identifier]', 'Factor Value Ontology Term[single cell identifier]', 'Factor Value[cluster]', 'Factor Value Ontology Term[cluster]', 'Factor Value[inferred cell type - authors labels]', 'Factor Value Ontology Term[inferred cell type - authors labels]'

Check out the features of this dataset:

adata.var.head()
ENSMUSG00000104923
ENSMUSG00000079038
ENSMUSG00000069755
ENSMUSG00000040648
ENSMUSG00000113486

Configure a Gene reference for parsing features#

The features in this data object are genes and indexed by Ensembl gene ids. We’d like to link these features so that we can query the data by genes!

Features are often knowledge-based entities. Bionty provides several knowledge-based tables for basic biological entities.

Note

  • For an overview of knowledge tables, see: bionty.

  • For an overview of lookup identifiers, see: Bionty lookup.

Let’s first configure a gene reference we’d like to use for parsing features in this dataset:

Tip

We provide lookups for ids and species in bionty, so you don’t need to memorize the exact strings:


import bionty
import lnschema_bionty as bt

id = bionty.lookup.gene_id.ensembl_gene_id
species=bt.Species.lookup.mouse

By default, ensembl_gene_id is used, you may pass other ids available from bt.lookup.gene_id.

reference = bt.Gene(species="mouse")
reference
Entity of Gene
Species: mouse
Database: ensembl

Access ontology terms with 'Gene.df'

Parse features#

Now let’s parse the features from the data using the configured reference above:

features = ln.Features(data=adata, reference=reference)
✅ 10000 terms (100.0%) are mapped.
🔶 0 terms (0.0%) are not mapped.
ℹ️ Downloading species reference for the first time might take a while...


Here, all features were successfully (unambiguously) linked against their canonical reference in bionty.Gene.

This creates a feature set of type gene linked (indexed by its hash):

features
Features(id='2Mv3JtH-ScBVYHilbLaQ', type='gene', created_by='bKeW4T6E')

This feature set links records for 10k genes. Here are the first 3, all of which can be queried:

features.genes[:3]
[Gene(id='0VXkJH', ensembl_gene_id='ENSMUSG00000064345', symbol='mt-Nd2', gene_type='protein_coding', description='mitochondrially encoded NADH dehydrogenase 2 [Source:MGI Symbol;Acc:MGI:102500]', ncbi_gene_id=17717, mgi_id='MGI:102500', synonyms='URF2', species_id='NCBI_10090'),
 Gene(id='bC5pqp', ensembl_gene_id='ENSMUSG00000064351', symbol='mt-Co1', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase I [Source:MGI Symbol;Acc:MGI:102504]', ncbi_gene_id=17708, mgi_id='MGI:102504', synonyms='CCOI|COI|COX 1|COX1|CoxI|mt-Co1-201', species_id='NCBI_10090'),
 Gene(id='LCRSs1', ensembl_gene_id='ENSMUSG00000064354', symbol='mt-Co2', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase II [Source:MGI Symbol;Acc:MGI:102503]', ncbi_gene_id=17709, mgi_id='MGI:102503', synonyms='COXII', species_id='NCBI_10090')]

Hence, not just for Ensemble IDs, but also by gene symbol, NCBI ids, gene type, etc.

Track data with features (genes)#

Now we can track data together with the parsed features by passing features when instantiating File.

Tip

Linking features can also be made post instantiation via:


file = ln.File(adata, name="Mouse Lymph Node scRNA-seq")
file.features = features
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq", features=features)

The features can now be accessed via relationship to dobejct:

file.features[0].genes[:3]
[Gene(id='0VXkJH', ensembl_gene_id='ENSMUSG00000064345', symbol='mt-Nd2', gene_type='protein_coding', description='mitochondrially encoded NADH dehydrogenase 2 [Source:MGI Symbol;Acc:MGI:102500]', ncbi_gene_id=17717, mgi_id='MGI:102500', synonyms='URF2', species_id='NCBI_10090'),
 Gene(id='bC5pqp', ensembl_gene_id='ENSMUSG00000064351', symbol='mt-Co1', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase I [Source:MGI Symbol;Acc:MGI:102504]', ncbi_gene_id=17708, mgi_id='MGI:102504', synonyms='CCOI|COI|COX 1|COX1|CoxI|mt-Co1-201', species_id='NCBI_10090'),
 Gene(id='LCRSs1', ensembl_gene_id='ENSMUSG00000064354', symbol='mt-Co2', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase II [Source:MGI Symbol;Acc:MGI:102503]', ncbi_gene_id=17709, mgi_id='MGI:102503', synonyms='COXII', species_id='NCBI_10090')]

Add file and its linked features to the database:

ln.add(file)
File(id='iZ26MT0p56QN0flq69Pv', name='Mouse Lymph Node scRNA-seq', suffix='.h5ad', size=17341245, hash='Qprqj0O23197Ko-VobaZiw', source_id='H8XISAvITiHbM2Z0nxtn', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 17, 45))

Querying data by features#

See also

Basic queries:

Let us query gene records by symbol:

ln.select(bt.Gene, symbol="Actg1").df()
ensembl_gene_id symbol gene_type description ncbi_gene_id hgnc_id mgi_id omim_id synonyms species_id version
id
zDWsM1 ENSMUSG00000062825 Actg1 protein_coding actin, gamma, cytoplasmic 1 [Source:MGI Symbol... 11465 None MGI:87906 None Actl|E51 NCBI_10090 None

How do we retrieve data objects, in which this gene was a feature:

We could first query all feature sets that contain the gene, and then query File by that.

features = (
    ln.select(ln.Features, bt.Gene.symbol)
    .join(ln.Features.genes)
    .where(bt.Gene.symbol == "Actg1")
)
features.df().head()
core.features bionty.gene
id type created_by created_at symbol
0 2Mv3JtH-ScBVYHilbLaQ gene bKeW4T6E 2023-03-30 23:17:44 Actg1

Of course, we can also write everything in one statement:

files = (
    ln.select(ln.File)
    .join(ln.File.features)
    .join(ln.Features.genes)
    .where(bt.Gene.symbol == "Actg1")
)
files.df().head()
name suffix size hash source_id storage_id created_at updated_at
id
iZ26MT0p56QN0flq69Pv Mouse Lymph Node scRNA-seq .h5ad 17341245 Qprqj0O23197Ko-VobaZiw H8XISAvITiHbM2Z0nxtn 8Pj12JLb 2023-03-30 23:17:45 None