Linking scRNA-seq data against Gene
#
Show code cell content
!lamin login testuser2
✅ Logged in with email testuser2@lamin.ai and id bKeW4T6E
import lamindb as ln
import lnschema_bionty as bt
ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser2
ℹ️ Added notebook: Transform(id='Nv48yAceNSh8', v='0', name='11-scrna', type=notebook, title='Linking scRNA-seq data against `Gene`', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 17, 18))
ℹ️ Added run: Run(id='H8XISAvITiHbM2Z0nxtn', transform_id='Nv48yAceNSh8', transform_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 17, 18))
Consider an scRNA-seq count matrix in form of an AnnData
object in memory:
Show code cell content
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()
adata
AnnData object with n_obs × n_vars = 1135 × 10000
obs: 'Sample Characteristic[organism]', 'Sample Characteristic Ontology Term[organism]', 'Sample Characteristic[strain]', 'Sample Characteristic Ontology Term[strain]', 'Sample Characteristic[age]', 'Sample Characteristic Ontology Term[age]', 'Sample Characteristic[developmental stage]', 'Sample Characteristic Ontology Term[developmental stage]', 'Sample Characteristic[sex]', 'Sample Characteristic Ontology Term[sex]', 'Sample Characteristic[genotype]', 'Sample Characteristic Ontology Term[genotype]', 'Sample Characteristic[organism part]', 'Sample Characteristic Ontology Term[organism part]', 'Sample Characteristic[cell type]', 'Sample Characteristic Ontology Term[cell type]', 'Sample Characteristic[immunophenotype]', 'Sample Characteristic Ontology Term[immunophenotype]', 'Sample Characteristic[post analysis well quality]', 'Sample Characteristic Ontology Term[post analysis well quality]', 'Factor Value[single cell identifier]', 'Factor Value Ontology Term[single cell identifier]', 'Factor Value[cluster]', 'Factor Value Ontology Term[cluster]', 'Factor Value[inferred cell type - authors labels]', 'Factor Value Ontology Term[inferred cell type - authors labels]'
Check out the features of this dataset:
adata.var.head()
ENSMUSG00000104923 |
---|
ENSMUSG00000079038 |
ENSMUSG00000069755 |
ENSMUSG00000040648 |
ENSMUSG00000113486 |
Configure a Gene
reference for parsing features#
The features in this data object are genes and indexed by Ensembl gene ids. We’d like to link these features so that we can query the data by genes!
Features are often knowledge-based entities. Bionty provides several knowledge-based tables for basic biological entities.
Note
For an overview of knowledge tables, see:
bionty
.For an overview of lookup identifiers, see: Bionty lookup.
Let’s first configure a gene reference we’d like to use for parsing features in this dataset:
Tip
We provide lookups for ids and species in bionty, so you don’t need to memorize the exact strings:
import bionty
import lnschema_bionty as bt
id = bionty.lookup.gene_id.ensembl_gene_id
species=bt.Species.lookup.mouse
By default, ensembl_gene_id
is used, you may pass other ids available from bt.lookup.gene_id
.
reference = bt.Gene(species="mouse")
reference
Entity of Gene
Species: mouse
Database: ensembl
Access ontology terms with 'Gene.df'
Parse features#
Now let’s parse the features from the data using the configured reference above:
features = ln.Features(data=adata, reference=reference)
✅ 10000 terms (100.0%) are mapped.
🔶 0 terms (0.0%) are not mapped.
ℹ️ Downloading species reference for the first time might take a while...
Here, all features were successfully (unambiguously) linked against their canonical reference in bionty.Gene
.
This creates a feature set of type gene
linked (indexed by its hash):
features
Features(id='2Mv3JtH-ScBVYHilbLaQ', type='gene', created_by='bKeW4T6E')
This feature set links records for 10k genes. Here are the first 3, all of which can be queried:
features.genes[:3]
[Gene(id='0VXkJH', ensembl_gene_id='ENSMUSG00000064345', symbol='mt-Nd2', gene_type='protein_coding', description='mitochondrially encoded NADH dehydrogenase 2 [Source:MGI Symbol;Acc:MGI:102500]', ncbi_gene_id=17717, mgi_id='MGI:102500', synonyms='URF2', species_id='NCBI_10090'),
Gene(id='bC5pqp', ensembl_gene_id='ENSMUSG00000064351', symbol='mt-Co1', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase I [Source:MGI Symbol;Acc:MGI:102504]', ncbi_gene_id=17708, mgi_id='MGI:102504', synonyms='CCOI|COI|COX 1|COX1|CoxI|mt-Co1-201', species_id='NCBI_10090'),
Gene(id='LCRSs1', ensembl_gene_id='ENSMUSG00000064354', symbol='mt-Co2', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase II [Source:MGI Symbol;Acc:MGI:102503]', ncbi_gene_id=17709, mgi_id='MGI:102503', synonyms='COXII', species_id='NCBI_10090')]
Hence, not just for Ensemble IDs, but also by gene symbol, NCBI ids, gene type, etc.
Track data with features (genes)#
Now we can track data together with the parsed features by passing features
when instantiating File
.
Tip
Linking features can also be made post instantiation via:
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq")
file.features = features
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq", features=features)
The features can now be accessed via relationship to dobejct:
file.features[0].genes[:3]
[Gene(id='0VXkJH', ensembl_gene_id='ENSMUSG00000064345', symbol='mt-Nd2', gene_type='protein_coding', description='mitochondrially encoded NADH dehydrogenase 2 [Source:MGI Symbol;Acc:MGI:102500]', ncbi_gene_id=17717, mgi_id='MGI:102500', synonyms='URF2', species_id='NCBI_10090'),
Gene(id='bC5pqp', ensembl_gene_id='ENSMUSG00000064351', symbol='mt-Co1', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase I [Source:MGI Symbol;Acc:MGI:102504]', ncbi_gene_id=17708, mgi_id='MGI:102504', synonyms='CCOI|COI|COX 1|COX1|CoxI|mt-Co1-201', species_id='NCBI_10090'),
Gene(id='LCRSs1', ensembl_gene_id='ENSMUSG00000064354', symbol='mt-Co2', gene_type='protein_coding', description='mitochondrially encoded cytochrome c oxidase II [Source:MGI Symbol;Acc:MGI:102503]', ncbi_gene_id=17709, mgi_id='MGI:102503', synonyms='COXII', species_id='NCBI_10090')]
Add file and its linked features to the database:
ln.add(file)
File(id='iZ26MT0p56QN0flq69Pv', name='Mouse Lymph Node scRNA-seq', suffix='.h5ad', size=17341245, hash='Qprqj0O23197Ko-VobaZiw', source_id='H8XISAvITiHbM2Z0nxtn', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 17, 45))
Querying data by features#
Let us query gene records by symbol:
ln.select(bt.Gene, symbol="Actg1").df()
ensembl_gene_id | symbol | gene_type | description | ncbi_gene_id | hgnc_id | mgi_id | omim_id | synonyms | species_id | version | |
---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||
zDWsM1 | ENSMUSG00000062825 | Actg1 | protein_coding | actin, gamma, cytoplasmic 1 [Source:MGI Symbol... | 11465 | None | MGI:87906 | None | Actl|E51 | NCBI_10090 | None |
How do we retrieve data objects, in which this gene was a feature:
We could first query all feature sets that contain the gene, and then query File
by that.
features = (
ln.select(ln.Features, bt.Gene.symbol)
.join(ln.Features.genes)
.where(bt.Gene.symbol == "Actg1")
)
features.df().head()
core.features | bionty.gene | ||||
---|---|---|---|---|---|
id | type | created_by | created_at | symbol | |
0 | 2Mv3JtH-ScBVYHilbLaQ | gene | bKeW4T6E | 2023-03-30 23:17:44 | Actg1 |
Of course, we can also write everything in one statement:
files = (
ln.select(ln.File)
.join(ln.File.features)
.join(ln.Features.genes)
.where(bt.Gene.symbol == "Actg1")
)
files.df().head()
name | suffix | size | hash | source_id | storage_id | created_at | updated_at | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
iZ26MT0p56QN0flq69Pv | Mouse Lymph Node scRNA-seq | .h5ad | 17341245 | Qprqj0O23197Ko-VobaZiw | H8XISAvITiHbM2Z0nxtn | 8Pj12JLb | 2023-03-30 23:17:45 | None |