Track sample-level metadata#
We already saw how to link data objects to entities representing features during ingestion.
For sample-level metadata, the underlying schema is often more complicated, so linking is best done in a separate step after ingestion.
Here, we walk through this process.
import lamindb as ln
import lamindb.schema as lns
import lnschema_bionty as bt
ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser2
ℹ️ Added notebook: Transform(id='zMCvXplQ8kTk', v='0', name='14-link-samples', type=notebook, title='Track sample-level metadata', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 18, 5))
ℹ️ Added run: Run(id='c42EAlE765NP4ibZbhEa', transform_id='zMCvXplQ8kTk', transform_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 18, 5))
Samples, i.e., metadata associated with observations, are linked with the same approach post-ingestion.
We’ll need to lazily load relationships of objects, and hence we need to keep track of a session.
ss = ln.Session()
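Why an open session matters for lazy loading can be illustrated with a toy model. This is only a sketch and not lamindb's implementation: `ToySession` and `ToyFile` are hypothetical names, and the sole assumption is that related records are fetched on first access, which requires the session to still be open.

```python
class ToySession:
    """Hypothetical stand-in for a database session (not lamindb's API)."""

    def __init__(self, links):
        self._links = links  # maps a file name to its related readout names
        self.is_open = True

    def load_readouts(self, file_name):
        if not self.is_open:
            raise RuntimeError("session is closed; cannot load relationships")
        return list(self._links.get(file_name, []))

    def close(self):
        self.is_open = False


class ToyFile:
    """Hypothetical record whose relationships are loaded lazily."""

    def __init__(self, name, session):
        self.name = name
        self._session = session
        self._readouts = None  # nothing fetched yet

    @property
    def readouts(self):
        # Fetch on first access only, then reuse the cached list.
        if self._readouts is None:
            self._readouts = self._session.load_readouts(self.name)
        return self._readouts


session = ToySession({"Mouse Lymph Node scRNA-seq": ["single-cell RNA sequencing"]})
loaded = ToyFile("Mouse Lymph Node scRNA-seq", session)
unloaded = ToyFile("Mouse Lymph Node scRNA-seq", session)

first_access = loaded.readouts  # triggers the lazy load while the session is open
session.close()

cached = loaded.readouts  # still available: it was loaded before closing
try:
    unloaded.readouts  # never loaded, so this needs the now-closed session
    failed_after_close = False
except RuntimeError:
    failed_after_close = True
```

In this toy model, anything accessed while the session was open stays usable afterwards, whereas a first access after closing fails.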
Let’s first query an scRNA-seq dataset stored as an .h5ad file.
file = ss.select(ln.File, suffix=".h5ad").first()
file
[session open] File(id='iZ26MT0p56QN0flq69Pv', name='Mouse Lymph Node scRNA-seq', suffix='.h5ad', size=17341245, hash='Qprqj0O23197Ko-VobaZiw', source_id='H8XISAvITiHbM2Z0nxtn', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 17, 45))
For instance, let’s annotate an scRNA-seq dataset with its readout type (scRNA-seq), tissue, and species.
Readout#
scrnaseq = bt.Readout.lookup.single_cell_RNA_sequencing
scrnaseq
readout(index=7409, ontology_id='EFO:0008913', name='single-cell RNA sequencing')
readout = bt.Readout(name=scrnaseq.name)
readout
ℹ️ Downloading readout reference for the first time might take a while...
Readout(id='EFO:0008913', name='single-cell RNA sequencing', molecule='RNA assay', instrument='single cell sequencing', created_by='bKeW4T6E')
Link the readout against the data object.
file.readouts.append(readout)
Biosample#
biosample = lns.wetlab.Biosample(name="Mouse Lymph Node")
Species#
We already have mouse in the database, hence let’s just query it. No need to create a new record.
species = ln.select(bt.Species, name="mouse").one()
species
Species(id='NCBI_10090', name='mouse', taxon_id=10090, scientific_name='mus_musculus')
biosample.species = species
Tissue#
tissue_lookup = bt.Tissue.lookup
ℹ️ Downloading tissue reference for the first time might take a while...
tissue_lookup.lymph_node
tissue(ontology_id='UBERON:0000029', name='lymph node')
tissue = bt.Tissue(name=tissue_lookup.lymph_node.name)
tissue
Tissue(id='UBERON:0000029', ontology_id='UBERON:0000029', name='lymph node')
biosample.tissue = tissue
Link against file#
Link against the data object:
file.biosamples.append(biosample)
Add to the DB#
We can add everything to the DB in one transaction:
ss.add([readout, biosample])
[[session open] Readout(id='EFO:0008913', name='single-cell RNA sequencing', molecule='RNA assay', instrument='single cell sequencing', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 19, 33)),
[session open] Biosample(id='9Qg2tEfShXoCpDM22veR', name='Mouse Lymph Node', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 19, 33), species_id='NCBI_10090', tissue_id='UBERON:0000029')]
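As a rough illustration of what adding records "in one transaction" buys, here is a sketch using Python's built-in `sqlite3`. The table names, columns, and the ids `bs1` and `bs2` are assumptions made for this example and do not reflect lamindb's actual schema. The point is only that either all records are written or none of them are.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE readout (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE biosample (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    """
)


def add_all(conn, biosamples, readouts):
    """Insert all records in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO biosample VALUES (?, ?)", biosamples)
        conn.executemany("INSERT INTO readout VALUES (?, ?)", readouts)


# Succeeds: both records are committed together.
add_all(
    conn,
    biosamples=[("bs1", "Mouse Lymph Node")],  # placeholder id
    readouts=[("EFO:0008913", "single-cell RNA sequencing")],
)

# Fails: the readout id already exists, so the whole transaction is rolled
# back, including the biosample that was inserted just before the error.
try:
    add_all(
        conn,
        biosamples=[("bs2", "Another sample")],  # placeholder id
        readouts=[("EFO:0008913", "duplicate entry")],
    )
except sqlite3.IntegrityError:
    pass

n_biosamples = conn.execute("SELECT COUNT(*) FROM biosample").fetchone()[0]
n_readouts = conn.execute("SELECT COUNT(*) FROM readout").fetchone()[0]
```

After the failed second call, the database still holds exactly one biosample and one readout, so no half-written state is left behind.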
Let us close the session.
ss.close()
Tip
Manage closing the Session with a context manager instead of closing it manually!
With a context manager, the above would look like:
with ln.Session() as ss:
    # manipulate data
    ...
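A minimal sketch of why the context manager is the safer choice, using a hypothetical `ClosingSession` class rather than `ln.Session` itself: `__exit__` runs even when the body raises, so the session cannot be left open by accident.

```python
class ClosingSession:
    """Hypothetical session supporting the context-manager protocol."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions raised in the body


try:
    with ClosingSession() as toy_session:
        raise ValueError("something went wrong while manipulating data")
except ValueError:
    pass

closed_after_error = not toy_session.is_open
```

With a manual `close()` call at the end of the cell, the same error would have skipped the call and left the session open.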
Query for linked metadata#
ln.select(ln.File).where(
    ln.File.readouts,
    bt.Readout.name == scrnaseq.name,
).df()
id | name | suffix | size | hash | source_id | storage_id | created_at | updated_at |
---|---|---|---|---|---|---|---|---|
iZ26MT0p56QN0flq69Pv | Mouse Lymph Node scRNA-seq | .h5ad | 17341245 | Qprqj0O23197Ko-VobaZiw | H8XISAvITiHbM2Z0nxtn | 8Pj12JLb | 2023-03-30 23:17:45 | None |
ln.select(ln.File).join(ln.File.biosamples).where(
    lns.wetlab.Biosample.species, bt.Species.name == "mouse"
).df()
id | name | suffix | size | hash | source_id | storage_id | created_at | updated_at |
---|---|---|---|---|---|---|---|---|
iZ26MT0p56QN0flq69Pv | Mouse Lymph Node scRNA-seq | .h5ad | 17341245 | Qprqj0O23197Ko-VobaZiw | H8XISAvITiHbM2Z0nxtn | 8Pj12JLb | 2023-03-30 23:17:45 | None |
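Conceptually, the second query follows the links from files to biosamples to species. The sketch below spells out such a join with Python's built-in `sqlite3`. The table layout, including the `file_biosample` link table and the ids `f1`, `f2`, and `bs1`, is an assumption for illustration and not lamindb's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE species (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE biosample (
        id TEXT PRIMARY KEY,
        name TEXT,
        species_id TEXT REFERENCES species(id)
    );
    CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT);
    -- many-to-many link between files and biosamples
    CREATE TABLE file_biosample (
        file_id TEXT REFERENCES file(id),
        biosample_id TEXT REFERENCES biosample(id)
    );

    INSERT INTO species VALUES ('NCBI_10090', 'mouse'), ('NCBI_9606', 'human');
    INSERT INTO biosample VALUES ('bs1', 'Mouse Lymph Node', 'NCBI_10090');
    INSERT INTO file VALUES ('f1', 'Mouse Lymph Node scRNA-seq'), ('f2', 'Unlinked file');
    INSERT INTO file_biosample VALUES ('f1', 'bs1');
    """
)

# Select files that are linked to at least one biosample from the given species.
rows = conn.execute(
    """
    SELECT file.id, file.name
    FROM file
    JOIN file_biosample ON file_biosample.file_id = file.id
    JOIN biosample ON biosample.id = file_biosample.biosample_id
    JOIN species ON species.id = biosample.species_id
    WHERE species.name = ?
    """,
    ("mouse",),
).fetchall()
```

Only the file that is linked to a mouse biosample is returned; the unlinked file drops out because the inner joins find no matching rows for it.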
What’s in the database?#
Biological entities#
ln.view(schema="bionty")
******************
* module: bionty *
******************
CellMarker
id | name | ncbi_gene_id | gene_symbol | gene_name | uniprotkb_id | species_id |
---|---|---|---|---|---|---|
CM_CD127 | CD127 | 3575 | IL7R | interleukin 7 receptor | P16871 | NCBI_9606 |
CM_CD8 | CD8 | 925 | CD8A | CD8a molecule | P01732 | NCBI_9606 |
CM_CD3 | CD3 | None | None | None | None | NCBI_9606 |
CM_CD45RO | CD45RO | None | None | None | None | NCBI_9606 |
CM_CD57 | CD57 | 27087 | B3GAT1 | beta-1,3-glucuronyltransferase 1 | Q9P2W7 | NCBI_9606 |
CM_CCR5 | CCR5 | 1234 | CCR5 | C-C motif chemokine receptor 5 | P51681 | NCBI_9606 |
CM_KI67 | KI67 | None | None | None | None | NCBI_9606 |
CM_SSC-A | SSC-A | None | None | None | None | NCBI_9606 |
CcvhLu1g | FSC-H | None | None | None | None | NCBI_9606 |
GmKcBYj3 | FSC-A | None | None | None | None | NCBI_9606 |
Gene
id | ensembl_gene_id | symbol | gene_type | description | ncbi_gene_id | hgnc_id | mgi_id | omim_id | synonyms | species_id | version |
---|---|---|---|---|---|---|---|---|---|---|---|
bV57xq | ENSMUSG00000098950 | Gm28036 | protein_coding | predicted gene, 28036 [Source:MGI Symbol;Acc:M... | NaN | None | MGI:5547772 | None | None | NCBI_10090 | None |
Quh9bR | ENSMUSG00000063844 | Olfr1276 | protein_coding | olfactory receptor 1276 [Source:MGI Symbol;Acc... | 258390.0 | None | MGI:3031110 | None | GA_x6K02T2Q125-72308574-72309512|MOR245-10 | NCBI_10090 | None |
C3bx0M | ENSMUSG00000068647 | Olfr1278 | protein_coding | olfactory receptor 1278 [Source:MGI Symbol;Acc... | 258389.0 | None | MGI:3031112 | None | GA_x6K02T2Q125-72343713-72344654|MOR245-11 | NCBI_10090 | None |
4VzQC3 | ENSMUSG00000027157 | Potefam1 | protein_coding | POTE ankyrin domain family member 1 [Source:MG... | 67575.0 | None | MGI:1914825 | None | 4930430A15Rik|A26c3|Pote1|Potea | NCBI_10090 | None |
xwg7Tv | ENSMUSG00000025781 | Atp5c1 | protein_coding | ATP synthase, H+ transporting, mitochondrial F... | 11949.0 | None | MGI:1261437 | None | 1700094F02Rik|F1 gamma | NCBI_10090 | None |
04xxXD | ENSMUSG00000081719 | Gm13726 | processed_pseudogene | predicted gene 13726 [Source:MGI Symbol;Acc:MG... | NaN | None | MGI:3649259 | None | None | NCBI_10090 | None |
EwfwXl | ENSMUSG00000090059 | Olfr1062 | protein_coding | olfactory receptor 1062 [Source:MGI Symbol;Acc... | 259082.0 | None | MGI:3030896 | None | GA_x6K02T2Q125-47892992-47892045|MOR185-1 | NCBI_10090 | None |
7uT8Ig | ENSMUSG00000111306 | Olfr1065 | protein_coding | olfactory receptor 1065 [Source:MGI Symbol;Acc... | 258403.0 | None | MGI:3030899 | None | GA_x6K02T2Q125-47915274-47914333|MOR190-1 | NCBI_10090 | None |
3sPofk | ENSMUSG00000025314 | Ptprj | protein_coding | protein tyrosine phosphatase, receptor type, J... | 19271.0 | None | MGI:104574 | None | Byp|CD148|DEP-1|RPTPJ|Scc-1|Scc1 | NCBI_10090 | None |
OUwEUX | ENSMUSG00000085028 | Slc2a4rg-ps | transcribed_unitary_pseudogene | Slc2a4 regulator, pseudogene [Source:MGI Symbo... | 329584.0 | None | MGI:3651388 | None | None | NCBI_10090 | None |
Readout
id | efo_id | name | molecule | instrument | measurement | created_by | created_at | updated_at |
---|---|---|---|---|---|---|---|---|
EFO:0008913 | None | single-cell RNA sequencing | RNA assay | single cell sequencing | None | bKeW4T6E | 2023-03-30 23:19:33 | None |
Species
id | name | taxon_id | scientific_name |
---|---|---|---|
NCBI_10090 | mouse | 10090 | mus_musculus |
NCBI_9606 | human | 9606 | homo_sapiens |
Tissue
id | ontology_id | name |
---|---|---|
UBERON:0000029 | UBERON:0000029 | lymph node |
Wetlab#
ln.view(schema="wetlab")
******************
* module: wetlab *
******************
Biosample
id | name | created_by | created_at | updated_at | batch | species_id | tissue_id | cell_type_id | disease_id |
---|---|---|---|---|---|---|---|---|---|
9Qg2tEfShXoCpDM22veR | Mouse Lymph Node | bKeW4T6E | 2023-03-30 23:19:33 | None | None | NCBI_10090 | UBERON:0000029 | None | None |
# integrity checks
with ln.Session() as ss:
    mouselymph = ss.select(ln.File, name="Mouse Lymph Node scRNA-seq").one()
    mouselymph_hash = mouselymph.hash
    assert mouselymph_hash == "Qprqj0O23197Ko-VobaZiw"
    mouselymph_features_hash = mouselymph.features[0].id
    assert mouselymph_features_hash == "2Mv3JtH-ScBVYHilbLaQ"