Track sample-level metadata#

We already saw how to link data objects to entities representing features during ingestion.

For sample-level metadata, the underlying schema is often more complicated, so linking is best done in a separate step after ingestion.

Here, we walk through this process.

import lamindb as ln
import lamindb.schema as lns
import lnschema_bionty as bt

ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser2
ℹ️ Added notebook: Transform(id='zMCvXplQ8kTk', v='0', name='14-link-samples', type=notebook, title='Track sample-level metadata', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 18, 5))
ℹ️ Added run: Run(id='c42EAlE765NP4ibZbhEa', transform_id='zMCvXplQ8kTk', transform_v='0', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 18, 5))

Samples, i.e., metadata associated with observations, are linked with the same approach post-ingestion.

We’ll need to lazily load relationships of objects, and hence we need to keep track of a session.

ss = ln.Session()
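To illustrate why lazy loading needs an open session, here is a minimal sketch using only Python's built-in sqlite3. It is not how lamindb implements Session; `LazyFile` and the `file_readout` table are hypothetical names chosen for this example. Related rows are fetched on first attribute access, which fails once the connection has been closed.

```python
import sqlite3


class LazyFile:
    """Toy record whose related rows are fetched only on first access."""

    def __init__(self, conn, file_id):
        self._conn = conn      # the open "session" this record is bound to
        self.id = file_id
        self._readouts = None  # relationship not loaded yet

    @property
    def readouts(self):
        if self._readouts is None:
            # This query needs a live connection; sqlite3 raises
            # ProgrammingError if the connection has been closed.
            rows = self._conn.execute(
                "SELECT readout_name FROM file_readout WHERE file_id = ?",
                (self.id,),
            ).fetchall()
            self._readouts = [name for (name,) in rows]
        return self._readouts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_readout (file_id TEXT, readout_name TEXT)")
conn.execute("INSERT INTO file_readout VALUES ('f1', 'single-cell RNA sequencing')")

loaded = LazyFile(conn, "f1")
loaded_readouts = loaded.readouts  # first access triggers the query

unloaded = LazyFile(conn, "f1")
conn.close()                       # "session" closed before first access

try:
    unloaded.readouts
    lazy_load_failed = False
except sqlite3.ProgrammingError:
    lazy_load_failed = True
```

The record accessed while the connection was open keeps its cached values, whereas the one first accessed after closing cannot load its relationship.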

Let’s first query an scRNA-seq dataset stored as an .h5ad file.

file = ss.select(ln.File, suffix=".h5ad").first()
file
[session open] File(id='iZ26MT0p56QN0flq69Pv', name='Mouse Lymph Node scRNA-seq', suffix='.h5ad', size=17341245, hash='Qprqj0O23197Ko-VobaZiw', source_id='H8XISAvITiHbM2Z0nxtn', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 17, 45))

As an example, let’s annotate this scRNA-seq dataset with its readout type (scRNA-seq), the tissue, and the species.

Readout#

scrnaseq = bt.Readout.lookup.single_cell_RNA_sequencing

scrnaseq
readout(index=7409, ontology_id='EFO:0008913', name='single-cell RNA sequencing')

readout = bt.Readout(name=scrnaseq.name)

readout
ℹ️ Downloading readout reference for the first time might take a while...


Readout(id='EFO:0008913', name='single-cell RNA sequencing', molecule='RNA assay', instrument='single cell sequencing', created_by='bKeW4T6E')
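The lookup object used above exposes ontology terms as attributes, which makes them easy to find with tab completion. As a rough sketch of the idea (not Bionty's implementation; `Term`, `to_attribute`, and the two-entry term list are hypothetical stand-ins built from values shown on this page), term names can be normalized into valid Python identifiers and attached to a namespace:

```python
import re
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical mini-ontology; the real reference table is downloaded by the library.
Term = namedtuple("Term", ["ontology_id", "name"])
terms = [
    Term("EFO:0008913", "single-cell RNA sequencing"),
    Term("UBERON:0000029", "lymph node"),
]


def to_attribute(name):
    """Turn a free-text term name into a valid Python attribute name."""
    return re.sub(r"\W+", "_", name).strip("_")


# Each normalized name becomes an attribute pointing at its term.
lookup = SimpleNamespace(**{to_attribute(t.name): t for t in terms})
```

With this, `lookup.single_cell_RNA_sequencing.name` gives back the original term name, which can then be used to construct a record.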

Link the readout to the data object.

file.readouts.append(readout)
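Conceptually, appending a readout to file.readouts records a many-to-many link between the file and the readout. The sketch below shows one way such a link could be stored, using Python's built-in sqlite3; the table and column names are assumptions for illustration, not lamindb's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE readout (id TEXT PRIMARY KEY, name TEXT);
    -- One row per (file, readout) pair; the composite key prevents duplicate links.
    CREATE TABLE file_readout (
        file_id TEXT REFERENCES file(id),
        readout_id TEXT REFERENCES readout(id),
        PRIMARY KEY (file_id, readout_id)
    );
    """
)
conn.execute("INSERT INTO file VALUES ('iZ26MT0p56QN0flq69Pv', 'Mouse Lymph Node scRNA-seq')")
conn.execute("INSERT INTO readout VALUES ('EFO:0008913', 'single-cell RNA sequencing')")
# Conceptual equivalent of appending a readout to a file's list of readouts.
conn.execute("INSERT INTO file_readout VALUES ('iZ26MT0p56QN0flq69Pv', 'EFO:0008913')")

# Follow the link table to list the readouts of the file.
linked = conn.execute(
    """
    SELECT readout.name FROM readout
    JOIN file_readout ON file_readout.readout_id = readout.id
    WHERE file_readout.file_id = 'iZ26MT0p56QN0flq69Pv'
    """
).fetchall()
```

Keeping the link in its own table means one file can carry several readouts and one readout can annotate many files.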

Biosample#

biosample = lns.wetlab.Biosample(name="Mouse Lymph Node")

Species#

We already have mouse in the database, hence let’s just query it. No need to create a new record.

species = ln.select(bt.Species, name="mouse").one()

species
Species(id='NCBI_10090', name='mouse', taxon_id=10090, scientific_name='mus_musculus')

biosample.species = species
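The species query above ends in .one() because exactly one mouse record is expected. A minimal sketch of that contract, assuming it mirrors the common ORM behavior of raising when zero or several rows match (the helper `one` and the in-memory `species_table` are hypothetical and only stand in for the database):

```python
def one(rows):
    """Return the single element of rows, or raise if there are zero or several."""
    rows = list(rows)
    if len(rows) != 1:
        raise LookupError(f"expected exactly one record, found {len(rows)}")
    return rows[0]


# Stand-in for the Species table shown on this page.
species_table = [
    {"id": "NCBI_10090", "name": "mouse"},
    {"id": "NCBI_9606", "name": "human"},
]

mouse = one(r for r in species_table if r["name"] == "mouse")
```

Failing loudly on a missing or ambiguous match is useful here: it avoids silently attaching the wrong species to a biosample.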

Tissue#

tissue_lookup = bt.Tissue.lookup
ℹ️ Downloading tissue reference for the first time might take a while...


tissue_lookup.lymph_node
tissue(ontology_id='UBERON:0000029', name='lymph node')

tissue = bt.Tissue(name=tissue_lookup.lymph_node.name)
tissue
Tissue(id='UBERON:0000029', ontology_id='UBERON:0000029', name='lymph node')

biosample.tissue = tissue

Add to the DB#

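Since the species query further below joins on ln.File.biosamples, the biosample also needs to be linked to the data object, in the same way as the readout:

file.biosamples.append(biosample)

We can then add everything to the DB in one transaction: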

ss.add([readout, biosample])
[[session open] Readout(id='EFO:0008913', name='single-cell RNA sequencing', molecule='RNA assay', instrument='single cell sequencing', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 19, 33)),
 [session open] Biosample(id='9Qg2tEfShXoCpDM22veR', name='Mouse Lymph Node', created_by='bKeW4T6E', created_at=datetime.datetime(2023, 3, 30, 23, 19, 33), species_id='NCBI_10090', tissue_id='UBERON:0000029')]
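To make "one transaction" concrete: either all records are written or none are. The following sketch demonstrates this with Python's built-in sqlite3 rather than lamindb; the simplified tables are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readout (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE biosample (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.commit()

try:
    # Used as a context manager, the connection commits on success
    # and rolls back if the block raises.
    with conn:
        conn.execute("INSERT INTO readout VALUES ('EFO:0008913', 'single-cell RNA sequencing')")
        # Violates NOT NULL, so the readout insert above is rolled back too.
        conn.execute("INSERT INTO biosample VALUES ('b1', NULL)")
except sqlite3.IntegrityError:
    pass

readout_rows_after_failure = conn.execute("SELECT COUNT(*) FROM readout").fetchone()[0]

# Second attempt with valid data: both rows are committed together.
with conn:
    conn.execute("INSERT INTO readout VALUES ('EFO:0008913', 'single-cell RNA sequencing')")
    conn.execute("INSERT INTO biosample VALUES ('b1', 'Mouse Lymph Node')")

readout_rows_after_success = conn.execute("SELECT COUNT(*) FROM readout").fetchone()[0]
```

After the failed attempt the readout table is still empty, so the database never holds a readout without its matching biosample.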

Let us close the session.

ss.close()

Tip

Manage Session closing with a context manager instead of manually closing it!

With a context manager, the above would look like:

with ln.Session() as ss:
    ...  # manipulate data
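The benefit of the context manager is that the session is closed even if an error occurs inside the block. A minimal sketch of that mechanism, with a hypothetical `ToySession` class that only tracks whether it is open (it is not lamindb's Session):

```python
class ToySession:
    """Hypothetical stand-in for a session; only tracks open/closed state."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised in the block


# Normal exit: the session is closed when the block ends.
with ToySession() as ok_session:
    pass

# Exit through an exception: the session is still closed.
try:
    with ToySession() as failing_session:
        raise RuntimeError("something went wrong mid-session")
except RuntimeError:
    pass
```

Both `ok_session` and `failing_session` end up closed, which is what a manual close() call cannot guarantee when the code in between raises.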

Query for linked metadata#

ln.select(ln.File).where(
    ln.File.readouts,
    bt.Readout.name == scrnaseq.name,
).df()
id                    name                        suffix  size      hash                    source_id             storage_id  created_at           updated_at
iZ26MT0p56QN0flq69Pv  Mouse Lymph Node scRNA-seq  .h5ad   17341245  Qprqj0O23197Ko-VobaZiw  H8XISAvITiHbM2Z0nxtn  8Pj12JLb    2023-03-30 23:17:45  None

ln.select(ln.File).join(ln.File.biosamples).where(
    lns.wetlab.Biosample.species, bt.Species.name == "mouse"
).df()
id                    name                        suffix  size      hash                    source_id             storage_id  created_at           updated_at
iZ26MT0p56QN0flq69Pv  Mouse Lymph Node scRNA-seq  .h5ad   17341245  Qprqj0O23197Ko-VobaZiw  H8XISAvITiHbM2Z0nxtn  8Pj12JLb    2023-03-30 23:17:45  None
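The second query filters files by a property of a linked record (the species of the biosample), which requires following two relationships. As an illustration of what such a query amounts to in SQL, here is a sketch with Python's built-in sqlite3; the schema below is a simplified assumption, not lamindb's actual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE species (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE biosample (
        id TEXT PRIMARY KEY, name TEXT, species_id TEXT REFERENCES species(id)
    );
    CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE file_biosample (
        file_id TEXT REFERENCES file(id), biosample_id TEXT REFERENCES biosample(id)
    );

    INSERT INTO species VALUES ('NCBI_10090', 'mouse');
    INSERT INTO species VALUES ('NCBI_9606', 'human');
    INSERT INTO biosample VALUES ('9Qg2tEfShXoCpDM22veR', 'Mouse Lymph Node', 'NCBI_10090');
    INSERT INTO file VALUES ('iZ26MT0p56QN0flq69Pv', 'Mouse Lymph Node scRNA-seq');
    INSERT INTO file_biosample VALUES ('iZ26MT0p56QN0flq69Pv', '9Qg2tEfShXoCpDM22veR');
    """
)

# file -> link table -> biosample -> species, filtered on the species name.
query = """
    SELECT file.name
    FROM file
    JOIN file_biosample ON file_biosample.file_id = file.id
    JOIN biosample ON biosample.id = file_biosample.biosample_id
    JOIN species ON species.id = biosample.species_id
    WHERE species.name = ?
"""
mouse_files = conn.execute(query, ("mouse",)).fetchall()
human_files = conn.execute(query, ("human",)).fetchall()
```

Only files whose biosample is linked to the requested species are returned, so the human query comes back empty in this toy database.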

What’s in the database?#

Biological entities#

ln.view(schema="bionty")
******************
* module: bionty *
******************
CellMarker
id         name    ncbi_gene_id  gene_symbol  gene_name                         uniprotkb_id  species_id
CM_CD127   CD127   3575          IL7R         interleukin 7 receptor            P16871        NCBI_9606
CM_CD8     CD8     925           CD8A         CD8a molecule                     P01732        NCBI_9606
CM_CD3     CD3     None          None         None                              None          NCBI_9606
CM_CD45RO  CD45RO  None          None         None                              None          NCBI_9606
CM_CD57    CD57    27087         B3GAT1       beta-1,3-glucuronyltransferase 1  Q9P2W7        NCBI_9606
CM_CCR5    CCR5    1234          CCR5         C-C motif chemokine receptor 5    P51681        NCBI_9606
CM_KI67    KI67    None          None         None                              None          NCBI_9606
CM_SSC-A   SSC-A   None          None         None                              None          NCBI_9606
CcvhLu1g   FSC-H   None          None         None                              None          NCBI_9606
GmKcBYj3   FSC-A   None          None         None                              None          NCBI_9606

Gene
id      ensembl_gene_id     symbol       gene_type                       description                                        ncbi_gene_id  hgnc_id  mgi_id       omim_id  synonyms                                    species_id  version
bV57xq  ENSMUSG00000098950  Gm28036      protein_coding                  predicted gene, 28036 [Source:MGI Symbol;Acc:M...  NaN           None     MGI:5547772  None     None                                        NCBI_10090  None
Quh9bR  ENSMUSG00000063844  Olfr1276     protein_coding                  olfactory receptor 1276 [Source:MGI Symbol;Acc...  258390.0      None     MGI:3031110  None     GA_x6K02T2Q125-72308574-72309512|MOR245-10  NCBI_10090  None
C3bx0M  ENSMUSG00000068647  Olfr1278     protein_coding                  olfactory receptor 1278 [Source:MGI Symbol;Acc...  258389.0      None     MGI:3031112  None     GA_x6K02T2Q125-72343713-72344654|MOR245-11  NCBI_10090  None
4VzQC3  ENSMUSG00000027157  Potefam1     protein_coding                  POTE ankyrin domain family member 1 [Source:MG...  67575.0       None     MGI:1914825  None     4930430A15Rik|A26c3|Pote1|Potea             NCBI_10090  None
xwg7Tv  ENSMUSG00000025781  Atp5c1       protein_coding                  ATP synthase, H+ transporting, mitochondrial F...  11949.0       None     MGI:1261437  None     1700094F02Rik|F1 gamma                      NCBI_10090  None
04xxXD  ENSMUSG00000081719  Gm13726      processed_pseudogene            predicted gene 13726 [Source:MGI Symbol;Acc:MG...  NaN           None     MGI:3649259  None     None                                        NCBI_10090  None
EwfwXl  ENSMUSG00000090059  Olfr1062     protein_coding                  olfactory receptor 1062 [Source:MGI Symbol;Acc...  259082.0      None     MGI:3030896  None     GA_x6K02T2Q125-47892992-47892045|MOR185-1   NCBI_10090  None
7uT8Ig  ENSMUSG00000111306  Olfr1065     protein_coding                  olfactory receptor 1065 [Source:MGI Symbol;Acc...  258403.0      None     MGI:3030899  None     GA_x6K02T2Q125-47915274-47914333|MOR190-1   NCBI_10090  None
3sPofk  ENSMUSG00000025314  Ptprj        protein_coding                  protein tyrosine phosphatase, receptor type, J...  19271.0       None     MGI:104574   None     Byp|CD148|DEP-1|RPTPJ|Scc-1|Scc1            NCBI_10090  None
OUwEUX  ENSMUSG00000085028  Slc2a4rg-ps  transcribed_unitary_pseudogene  Slc2a4 regulator, pseudogene [Source:MGI Symbo...  329584.0      None     MGI:3651388  None     None                                        NCBI_10090  None

Readout
id           efo_id  name                        molecule   instrument              measurement  created_by  created_at           updated_at
EFO:0008913  None    single-cell RNA sequencing  RNA assay  single cell sequencing  None         bKeW4T6E    2023-03-30 23:19:33  None

Species
id          name   taxon_id  scientific_name
NCBI_10090  mouse  10090     mus_musculus
NCBI_9606   human  9606      homo_sapiens

Tissue
id              ontology_id     name
UBERON:0000029  UBERON:0000029  lymph node

Wetlab#

ln.view(schema="wetlab")
******************
* module: wetlab *
******************
Biosample
id                    name              created_by  created_at           updated_at  batch  species_id  tissue_id       cell_type_id  disease_id
9Qg2tEfShXoCpDM22veR  Mouse Lymph Node  bKeW4T6E    2023-03-30 23:19:33  None        None   NCBI_10090  UBERON:0000029  None          None

# integrity checks
with ln.Session() as ss:
    mouselymph = ss.select(ln.File, name="Mouse Lymph Node scRNA-seq").one()

    mouselymph_hash = mouselymph.hash
    assert mouselymph_hash == "Qprqj0O23197Ko-VobaZiw"

    mouselymph_features_hash = mouselymph.features[0].id
    assert mouselymph_features_hash == "2Mv3JtH-ScBVYHilbLaQ"
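The integrity check above compares a stored content hash against a known value. The 22-character file hash is consistent with an unpadded base64url encoding of a 128-bit digest such as MD5, but that is an assumption about how it is computed, not something this page states. Under that assumption, a comparable hash could be computed as in the following sketch (`content_hash` is a hypothetical helper):

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """Unpadded base64url encoding of the MD5 digest of data (assumed scheme)."""
    digest = hashlib.md5(data).digest()  # 16 bytes
    # 16 bytes encode to 24 base64 characters including "==" padding;
    # stripping the padding leaves 22 characters.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


original = content_hash(b"example file content")
unchanged = content_hash(b"example file content")
modified = content_hash(b"example file content!")
```

Identical content always yields the same hash, while any change to the bytes yields a different one, which is what makes such an assertion a useful guard against silently altered data.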