Annotate data#
While data is the primary information or raw facts that are collected and stored, metadata is the supporting information that provides context and meaning to that data.
LaminDB lets you annotate data with metadata in two ways: features and labels. (Also see tutorial)
This guide extends Quickstart to explain the details of annotating data.
Setup#
Let us create an instance that has lnschema_bionty mounted:
!lamin init --storage ./test-annotate --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:34:14 UTC)
✅ saved: Storage(uid='LvwXwaYK', root='/home/runner/work/lamindb/lamindb/docs/test-annotate', type='local', updated_at=2023-12-08 11:34:14 UTC, created_by_id=1)
💡 loaded instance: testuser1/test-annotate
💡 did not register local instance on hub
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
💡 lamindb instance: testuser1/test-annotate
lb.settings.organism = "human" # globally set organism
lb.settings.auto_save_parents = False # ignores ontological hierarchy
ln.settings.verbosity = "info"
Register a dataset#
Let’s use the same example data as in the Quickstart:
df = pd.DataFrame(
{"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
index=["sample1", "sample2", "sample3"],
)
In addition to the data, we also have two types of metadata as follows:
# observational metadata (1:1 correspondence with samples)
obs_meta = pd.DataFrame(
{
"cell_type": ["T cell", "T cell", "Monocyte"],
"tissue": ["capillary blood", "arterial blood", "capillary blood"],
},
index=["sample1", "sample2", "sample3"],
)
# external metadata (describes the entire dataset)
external_meta = {
"organism": "human",
"assay": "scRNA-seq",
"experiment": "EXP0001",
"project": "PRJ0001",
}
To store both data and observational metadata, we use an AnnData object:
# note that we didn't add external metadata to adata.uns, because we will use LaminDB to store it
adata = ad.AnnData(df, obs=obs_meta)
adata
AnnData object with n_obs × n_vars = 3 × 3
obs: 'cell_type', 'tissue'
Now let’s register the AnnData object without annotating with any metadata:
ln.track()
dataset = ln.Dataset(adata, name="my RNA-seq")
dataset.save()
💡 notebook imports: anndata==0.10.3 lamindb==0.63.4 lnschema_bionty==0.35.3 pandas==1.5.3
💡 saved: Transform(uid='sU0y1kF3igepz8', name='Annotate data', short_name='annotate', version='0', type=notebook, updated_at=2023-12-08 11:34:17 UTC, created_by_id=1)
💡 saved: Run(uid='EDK6zUUGvSfryV9Ty383', run_at=2023-12-08 11:34:17 UTC, transform_id=1, created_by_id=1)
... storing 'cell_type' as categorical
... storing 'tissue' as categorical
✅ storing file '4V0LW9MtlWQLkGjgy19A' at '/home/runner/work/lamindb/lamindb/docs/test-annotate/.lamindb/4V0LW9MtlWQLkGjgy19A.h5ad'
We don’t see any metadata in the registered dataset yet:
dataset.describe()
Dataset(uid='4V0LW9MtlWQLkGjgy19A', name='my RNA-seq', hash='jBNzT3fmTNEcfJ19FK2euw', visibility=1, updated_at=2023-12-08 11:34:17 UTC)
Provenance:
💫 transform: Transform(uid='sU0y1kF3igepz8', name='Annotate data', short_name='annotate', version='0', type=notebook, updated_at=2023-12-08 11:34:17 UTC, created_by_id=1)
👣 run: Run(uid='EDK6zUUGvSfryV9Ty383', run_at=2023-12-08 11:34:17 UTC, transform_id=1, created_by_id=1)
📄 file: File(uid='4V0LW9MtlWQLkGjgy19A', suffix='.h5ad', accessor='AnnData', description='See dataset 4V0LW9MtlWQLkGjgy19A', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:34:17 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:34:14 UTC)
Define features and labels#
Features and labels are records from their respective registries.
You can define them schema-less using Feature and ULabel registries, or schema-full using dedicated registries.
Define data features#
Data features refer to individual measurable properties or characteristics of a phenomenon being observed. In data analysis and machine learning, features are the input variables used to predict or classify an outcome.
Data features are often numeric, but can also be categorical. For example, in the case of gene expression data, the features are the expression levels of individual genes. They are often stored as columns in a data table (adata.var_names for AnnData objects).
Here we define them using the Gene registry:
data_features = lb.Gene.from_values(adata.var_names)
ln.save(data_features)
data_features
✅ created 3 Gene records from Bionty matching symbol: 'CD8A', 'CD4', 'CD14'
[Gene(uid='P4lun3ltWYs0', symbol='CD8A', ensembl_gene_id='ENSG00000153563', ncbi_gene_ids='925', biotype='protein_coding', description='CD8 subunit alpha [Source:HGNC Symbol;Acc:HGNC:1706]', synonyms='CD8|P32|CD8ALPHA', updated_at=2023-12-08 11:34:21 UTC, organism_id=1, bionty_source_id=9, created_by_id=1),
Gene(uid='GL7Rh969kHgO', symbol='CD4', ensembl_gene_id='ENSG00000010610', ncbi_gene_ids='920', biotype='protein_coding', description='CD4 molecule [Source:HGNC Symbol;Acc:HGNC:1678]', synonyms='', updated_at=2023-12-08 11:34:21 UTC, organism_id=1, bionty_source_id=9, created_by_id=1),
Gene(uid='apu1qTgmYDyi', symbol='CD14', ensembl_gene_id='ENSG00000170458', ncbi_gene_ids='929', biotype='protein_coding', description='CD14 molecule [Source:HGNC Symbol;Acc:HGNC:1628]', synonyms='', updated_at=2023-12-08 11:34:21 UTC, organism_id=1, bionty_source_id=9, created_by_id=1)]
Define metadata features#
Metadata features refer to descriptive or contextual information about the data. They don’t directly describe the content of the data but rather its characteristics.
In this example, the metadata features are “cell_type” and “tissue”, which describe observations (stored in adata.obs.columns), and “organism”, “assay”, “experiment”, and “project”, which describe the entire dataset.
Here we define them using the Feature registry:
# obs metadata features
obs_meta_features = ln.Feature.from_df(adata.obs)
ln.save(obs_meta_features)
obs_meta_features
[Feature(uid='7Dp01ydm9P2x', name='cell_type', type='category', updated_at=2023-12-08 11:34:21 UTC, created_by_id=1),
Feature(uid='ZJCkvCwZgdUd', name='tissue', type='category', updated_at=2023-12-08 11:34:21 UTC, created_by_id=1)]
# external metadata features
external_meta_features = [
ln.Feature(name=name, type="category") for name in external_meta.keys()
]
ln.save(external_meta_features)
external_meta_features
[Feature(uid='s3i27SReq7jn', name='organism', type='category', updated_at=2023-12-08 11:34:21 UTC, created_by_id=1),
Feature(uid='Ku8Q4liXttUa', name='assay', type='category', updated_at=2023-12-08 11:34:21 UTC, created_by_id=1),
Feature(uid='G6zprsmhgaOZ', name='experiment', type='category', updated_at=2023-12-08 11:34:21 UTC, created_by_id=1),
Feature(uid='Qlld6tbwQE5u', name='project', type='category', updated_at=2023-12-08 11:34:21 UTC, created_by_id=1)]
Define metadata labels#
Metadata labels are the categorical values of metadata features. They are more specific than features and are often used in classification.
In this example, the metadata labels of feature “cell_type” are “T cell” and “Monocyte”; those of feature “tissue” are “capillary blood” and “arterial blood”; the metadata label of feature “organism” is “human”; and so on.
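To make the feature/label relationship concrete, here is a minimal plain-Python sketch that derives the labels of each metadata feature from the example metadata. It does not call LaminDB; the helper name `labels_by_feature` is hypothetical and only illustrates the idea that labels are the distinct categorical values of a feature.

```python
# Plain-Python illustration only (no LaminDB calls).
# Observational metadata: one value per sample for each feature.
obs_meta = {
    "cell_type": ["T cell", "T cell", "Monocyte"],
    "tissue": ["capillary blood", "arterial blood", "capillary blood"],
}
# External metadata: one value per feature for the whole dataset.
external_meta = {
    "organism": "human",
    "assay": "scRNA-seq",
    "experiment": "EXP0001",
    "project": "PRJ0001",
}


def labels_by_feature(obs, external):
    """Map each feature name to its distinct labels, in first-seen order."""
    result = {}
    for feature, values in obs.items():
        # dict.fromkeys de-duplicates while preserving insertion order
        result[feature] = list(dict.fromkeys(values))
    for feature, value in external.items():
        # a dataset-level feature carries a single label
        result[feature] = [value]
    return result


labels = labels_by_feature(obs_meta, external_meta)
for feature, values in labels.items():
    print(f"{feature}: {values}")
```

In the guide itself, these raw strings are then mapped to records in registries, so that, for instance, “Monocyte” resolves to the ontology term “monocyte”.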
Let’s define them with their respective registries:
cell_types = lb.CellType.from_values(adata.obs["cell_type"])
ln.save(cell_types)
cell_types
✅ created 1 CellType record from Bionty matching name: 'T cell'
✅ created 1 CellType record from Bionty matching synonyms: 'Monocyte'
[CellType(uid='BxNjby0x', name='T cell', ontology_id='CL:0000084', synonyms='T-cell|T-lymphocyte|T lymphocyte', description='A Type Of Lymphocyte Whose Defining Characteristic Is The Expression Of A T Cell Receptor Complex.', updated_at=2023-12-08 11:34:22 UTC, bionty_source_id=21, created_by_id=1),
CellType(uid='YzV7Qgmj', name='monocyte', ontology_id='CL:0000576', description='Myeloid Mononuclear Recirculating Leukocyte That Can Act As A Precursor Of Tissue Macrophages, Osteoclasts And Some Populations Of Tissue Dendritic Cells.', updated_at=2023-12-08 11:34:22 UTC, bionty_source_id=21, created_by_id=1)]
tissues = lb.Tissue.from_values(adata.obs["tissue"])
ln.save(tissues)
tissues
✅ created 2 Tissue records from Bionty matching name: 'capillary blood', 'arterial blood'
[Tissue(uid='40Y1BZYd', name='capillary blood', ontology_id='UBERON:0013757', synonyms='portion of capillary blood|blood in capillary|portion of blood in capillary', description='A Blood That Is Part Of A Capillary.', updated_at=2023-12-08 11:34:24 UTC, bionty_source_id=25, created_by_id=1),
Tissue(uid='4qSHAbwT', name='arterial blood', ontology_id='UBERON:0013755', synonyms='portion of arterial blood|arterial blood|blood in artery', description='A Blood That Is Part Of A Artery.', updated_at=2023-12-08 11:34:24 UTC, bionty_source_id=25, created_by_id=1)]
organism = lb.Organism.from_bionty(name=external_meta["organism"])
organism.save()
organism
Organism(uid='EeBGvIYd', name='human', ontology_id='NCBITaxon:9606', scientific_name='homo_sapiens', updated_at=2023-12-08 11:34:24 UTC, bionty_source_id=1, created_by_id=1)
assay = lb.ExperimentalFactor.from_bionty(name=external_meta["assay"])
assay.save()
assay
✅ created 1 ExperimentalFactor record from Bionty matching synonyms: 'scRNA-seq'
ExperimentalFactor(uid='068T1Df6', name='single-cell RNA sequencing', ontology_id='EFO:0008913', synonyms='scRNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|single-cell RNA-seq', description='A Protocol That Provides The Expression Profiles Of Single Cells Via The Isolation And Barcoding Of Single Cells And Their Rna, Reverse Transcription, Amplification, Library Generation And Sequencing.', molecule='RNA assay', instrument='single cell sequencing', updated_at=2023-12-08 11:34:26 UTC, bionty_source_id=35, created_by_id=1)
experiment = ln.ULabel(name=external_meta["experiment"], description="An experiment")
experiment.save()
experiment
ULabel(uid='zO3AzrKC', name='EXP0001', description='An experiment', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
project = ln.ULabel(name=external_meta["project"], description="A project")
project.save()
project
ULabel(uid='rz6OXKUv', name='PRJ0001', description='A project', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
Annotate with features#
Non-external features are annotated when registering datasets with the .from_df or .from_anndata methods (see the “Annotate with labels stratified by metadata features” section below for adding external features):
dataset = ln.Dataset.from_anndata(
adata,
name="my RNA-seq",
field=lb.Gene.symbol, # the registry field to use for the data features
)
dataset.save()
💡 parsing feature names of X stored in slot 'var'
✅ 3 terms (100.00%) are validated for symbol
✅ linked: FeatureSet(uid='ldfeFgsaNa4zgKei0y1g', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', created_by_id=1)
💡 parsing feature names of slot 'obs'
✅ 2 terms (100.00%) are validated for name
✅ linked: FeatureSet(uid='eq9NxlLcTZSdjqTOV42S', n=2, registry='core.Feature', hash='Gneowc4mRTgy6p-nfyfb', created_by_id=1)
❗ returning existing file with same hash: File(uid='4V0LW9MtlWQLkGjgy19A', suffix='.h5ad', accessor='AnnData', description='See dataset 4V0LW9MtlWQLkGjgy19A', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:34:17 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
❗ returning existing dataset with same hash: Dataset(uid='4V0LW9MtlWQLkGjgy19A', name='my RNA-seq', hash='jBNzT3fmTNEcfJ19FK2euw', visibility=1, updated_at=2023-12-08 11:34:17 UTC, transform_id=1, run_id=1, file_id=1, created_by_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
This dataset is now annotated with features:
dataset.describe()
Dataset(uid='4V0LW9MtlWQLkGjgy19A', name='my RNA-seq', hash='jBNzT3fmTNEcfJ19FK2euw', visibility=1, updated_at=2023-12-08 11:34:26 UTC)
Provenance:
📔 transform: Transform(uid='sU0y1kF3igepz8', name='Annotate data', short_name='annotate', version='0', type='notebook', updated_at=2023-12-08 11:34:17 UTC, created_by_id=1)
👣 run: Run(uid='EDK6zUUGvSfryV9Ty383', run_at=2023-12-08 11:34:17 UTC, transform_id=1, created_by_id=1)
📄 file: File(uid='4V0LW9MtlWQLkGjgy19A', suffix='.h5ad', accessor='AnnData', description='See dataset 4V0LW9MtlWQLkGjgy19A', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:34:26 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:34:14 UTC)
Features:
var: FeatureSet(uid='ldfeFgsaNa4zgKei0y1g', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
'CD8A', 'CD4', 'CD14'
obs: FeatureSet(uid='eq9NxlLcTZSdjqTOV42S', n=2, registry='core.Feature', hash='Gneowc4mRTgy6p-nfyfb', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
cell_type (category)
tissue (category)
Two types of features are now annotated and organized as feature sets by slot:
“var”: data features
“obs”: observational metadata features
dataset.features
Features:
var: FeatureSet(uid='ldfeFgsaNa4zgKei0y1g', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
'CD8A', 'CD4', 'CD14'
obs: FeatureSet(uid='eq9NxlLcTZSdjqTOV42S', n=2, registry='core.Feature', hash='Gneowc4mRTgy6p-nfyfb', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
cell_type (category)
tissue (category)
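The slot organization shown above can be pictured as a mapping from slot name to the feature names found in that part of the data. The following plain-Python sketch rebuilds that mapping from the example data; it is an illustration of the concept, not how LaminDB stores feature sets, and the variable name `feature_sets` is only chosen for this example.

```python
# Plain-Python illustration only (no LaminDB calls).
data = {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]}
obs_meta = {
    "cell_type": ["T cell", "T cell", "Monocyte"],
    "tissue": ["capillary blood", "arterial blood", "capillary blood"],
}

feature_sets = {
    "var": list(data),      # data features: the measured variables (genes)
    "obs": list(obs_meta),  # observational metadata features
}

for slot, names in feature_sets.items():
    print(f"{slot}: {names}")
```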
Use slots to retrieve corresponding annotated features:
dataset.features["var"].df()
id | uid | symbol | stable_id | ensembl_gene_id | ncbi_gene_ids | biotype | description | synonyms | organism_id | bionty_source_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | P4lun3ltWYs0 | CD8A | None | ENSG00000153563 | 925 | protein_coding | CD8 subunit alpha [Source:HGNC Symbol;Acc:HGNC... | CD8|P32|CD8ALPHA | 1 | 9 | 2023-12-08 11:34:21.044601+00:00 | 1
2 | GL7Rh969kHgO | CD4 | None | ENSG00000010610 | 920 | protein_coding | CD4 molecule [Source:HGNC Symbol;Acc:HGNC:1678] |  | 1 | 9 | 2023-12-08 11:34:21.044676+00:00 | 1
3 | apu1qTgmYDyi | CD14 | None | ENSG00000170458 | 929 | protein_coding | CD14 molecule [Source:HGNC Symbol;Acc:HGNC:1628] |  | 1 | 9 | 2023-12-08 11:34:21.044744+00:00 | 1
dataset.features["obs"].df()
id | uid | name | type | unit | description | registries | synonyms | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---
1 | 7Dp01ydm9P2x | cell_type | category | None | None | None | None | 2023-12-08 11:34:21.059371+00:00 | 1
2 | ZJCkvCwZgdUd | tissue | category | None | None | None | None | 2023-12-08 11:34:21.059432+00:00 | 1
Annotate with labels#
If you simply want to tag a dataset with some descriptive labels, you can pass them to .labels.add. For example, let’s add the experiment label “EXP0001” and project label “PRJ0001” to the dataset:
dataset.labels.add(experiment)
dataset.labels.add(project)
The dataset is now annotated with the labels ‘EXP0001’ and ‘PRJ0001’:
dataset.describe()
Dataset(uid='4V0LW9MtlWQLkGjgy19A', name='my RNA-seq', hash='jBNzT3fmTNEcfJ19FK2euw', visibility=1, updated_at=2023-12-08 11:34:26 UTC)
Provenance:
📔 transform: Transform(uid='sU0y1kF3igepz8', name='Annotate data', short_name='annotate', version='0', type='notebook', updated_at=2023-12-08 11:34:17 UTC, created_by_id=1)
👣 run: Run(uid='EDK6zUUGvSfryV9Ty383', run_at=2023-12-08 11:34:17 UTC, transform_id=1, created_by_id=1)
📄 file: File(uid='4V0LW9MtlWQLkGjgy19A', suffix='.h5ad', accessor='AnnData', description='See dataset 4V0LW9MtlWQLkGjgy19A', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:34:26 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:34:14 UTC)
Features:
var: FeatureSet(uid='ldfeFgsaNa4zgKei0y1g', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
'CD8A', 'CD4', 'CD14'
obs: FeatureSet(uid='eq9NxlLcTZSdjqTOV42S', n=2, registry='core.Feature', hash='Gneowc4mRTgy6p-nfyfb', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
cell_type (category)
tissue (category)
Labels:
🏷️ ulabels (2, core.ULabel): 'EXP0001', 'PRJ0001'
To view all annotated labels:
dataset.labels
Labels:
🏷️ ulabels (2, core.ULabel): 'EXP0001', 'PRJ0001'
Since we didn’t specify which features the labels belong to, they are accessible only through the default accessor “.ulabels” of the ULabel registry.
As you may already notice, labels that belong to the same registry can be difficult to interpret without features.
dataset.ulabels.df()
id | uid | name | description | reference | reference_type | updated_at | created_by_id
---|---|---|---|---|---|---|---
1 | zO3AzrKC | EXP0001 | An experiment | None | None | 2023-12-08 11:34:26.139284+00:00 | 1
2 | rz6OXKUv | PRJ0001 | A project | None | None | 2023-12-08 11:34:26.158474+00:00 | 1
Annotate with labels stratified by metadata features#
For labels associated with metadata features, you can pass “feature” to .labels.add to stratify them by feature. (Another way to stratify labels is through ontological hierarchy, which is covered in the Quickstart.)
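The difference between plain tagging and feature-stratified labeling can be sketched with a small stand-in class. `LabelAnnotations` below is hypothetical and written only for this illustration; it is not the LaminDB API, although its `add` and `get` methods loosely mirror the `.labels.add` and `.labels.get` calls used in this guide.

```python
from collections import defaultdict


class LabelAnnotations:
    """Hypothetical stand-in for a dataset's label links (not the LaminDB API)."""

    def __init__(self):
        self._by_feature = defaultdict(list)  # feature name -> labels
        self._untagged = []                   # labels added without a feature

    def add(self, label, feature=None):
        # Choose the bucket, then add the label only if it is not there yet.
        target = self._untagged if feature is None else self._by_feature[feature]
        if label not in target:
            target.append(label)

    def get(self, feature):
        """Return the labels linked to one feature (empty if none)."""
        return list(self._by_feature.get(feature, []))

    def all_labels(self):
        """Return every distinct label, regardless of feature."""
        seen = dict.fromkeys(self._untagged)
        for labels in self._by_feature.values():
            seen.update(dict.fromkeys(labels))
        return list(seen)


annotations = LabelAnnotations()

# Plain tagging: both labels end up in one undifferentiated pool.
annotations.add("EXP0001")
annotations.add("PRJ0001")
untagged_view = annotations.all_labels()
before = annotations.get("experiment")

# Feature-stratified: each label is now retrievable by its feature.
annotations.add("EXP0001", feature="experiment")
annotations.add("PRJ0001", feature="project")
after = annotations.get("experiment")

print(untagged_view, before, after)
```

Without a feature, nothing distinguishes an experiment identifier from a project identifier; with one, a lookup by feature returns only the matching label.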
Let’s add the experiment label “EXP0001” and project label “PRJ0001” to the dataset again, this time specifying their features:
# an auto-complete object of registered features
features = ln.Feature.lookup()
dataset.labels.add(experiment, feature=features.experiment)
dataset.labels.add(project, feature=features.project)
✅ linked feature 'experiment' to registry 'core.ULabel'
✅ linked new feature 'experiment' together with new feature set FeatureSet(uid='BfNuWtRRVjPkojxV94yt', n=1, registry='core.Feature', hash='AXusq70dUIAHF-PiMRnQ', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
✅ linked feature 'project' to registry 'core.ULabel'
💡 no file links to it anymore, deleting feature set FeatureSet(uid='BfNuWtRRVjPkojxV94yt', n=1, registry='core.Feature', hash='AXusq70dUIAHF-PiMRnQ', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
✅ linked new feature 'project' together with new feature set FeatureSet(uid='OVg4oE93Z7tg2RRnVEBa', n=2, registry='core.Feature', hash='gJcv0-y8HW_LnugAxVQ5', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
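The log above shows the one-member feature set being deleted and replaced by a new two-member set with a different hash. One way to picture this is that a feature set is identified by its members, so changing the membership yields a different set. The sketch below illustrates that idea with a hash over sorted feature names. The hashing scheme and the function name `feature_set_hash` are assumptions made for illustration and are not LaminDB's actual implementation.

```python
import hashlib


def feature_set_hash(feature_names):
    """Illustrative identity for a set of features: depends on members, not order."""
    joined = ",".join(sorted(feature_names))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


one_member = feature_set_hash(["experiment"])
two_members = feature_set_hash(["experiment", "project"])
two_members_reordered = feature_set_hash(["project", "experiment"])

# Adding a member changes the identity, so a new set replaces the old one.
print(one_member != two_members)
# Under this assumed scheme, ordering does not matter.
print(two_members == two_members_reordered)
```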
You now see that a third feature set has been added to the dataset at slot “external”, and that the labels are stratified by two features:
dataset.describe()
Dataset(uid='4V0LW9MtlWQLkGjgy19A', name='my RNA-seq', hash='jBNzT3fmTNEcfJ19FK2euw', visibility=1, updated_at=2023-12-08 11:34:26 UTC)
Provenance:
📔 transform: Transform(uid='sU0y1kF3igepz8', name='Annotate data', short_name='annotate', version='0', type='notebook', updated_at=2023-12-08 11:34:17 UTC, created_by_id=1)
👣 run: Run(uid='EDK6zUUGvSfryV9Ty383', run_at=2023-12-08 11:34:17 UTC, transform_id=1, created_by_id=1)
📄 file: File(uid='4V0LW9MtlWQLkGjgy19A', suffix='.h5ad', accessor='AnnData', description='See dataset 4V0LW9MtlWQLkGjgy19A', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:34:26 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:34:14 UTC)
Features:
var: FeatureSet(uid='ldfeFgsaNa4zgKei0y1g', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
'CD8A', 'CD4', 'CD14'
obs: FeatureSet(uid='eq9NxlLcTZSdjqTOV42S', n=2, registry='core.Feature', hash='Gneowc4mRTgy6p-nfyfb', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
cell_type (category)
tissue (category)
external: FeatureSet(uid='OVg4oE93Z7tg2RRnVEBa', n=2, registry='core.Feature', hash='gJcv0-y8HW_LnugAxVQ5', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
🔗 experiment (1, core.ULabel): 'EXP0001'
🔗 project (1, core.ULabel): 'PRJ0001'
Labels:
🏷️ ulabels (2, core.ULabel): 'EXP0001', 'PRJ0001'
With feature-stratified labels, you can retrieve labels by feature:
dataset.labels.get(features.experiment).df()
id | uid | name | description | reference | reference_type | updated_at | created_by_id
---|---|---|---|---|---|---|---
1 | zO3AzrKC | EXP0001 | An experiment | None | None | 2023-12-08 11:34:26.139284+00:00 | 1
Note that adding feature-stratified labels will also allow you to retrieve labels with the default accessor of respective registries:
dataset.labels.add(assay, feature=features.assay)
✅ linked feature 'assay' to registry 'bionty.ExperimentalFactor'
💡 no file links to it anymore, deleting feature set FeatureSet(uid='OVg4oE93Z7tg2RRnVEBa', n=2, registry='core.Feature', hash='gJcv0-y8HW_LnugAxVQ5', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
✅ linked new feature 'assay' together with new feature set FeatureSet(uid='OO0dQDkFVWq9ueEKJS86', n=3, registry='core.Feature', hash='EKufQMjxwSZBuy5128gi', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
# access labels directly via default accessor "experimental_factors"
dataset.experimental_factors.df()
id | uid | name | ontology_id | abbr | synonyms | description | molecule | instrument | measurement | bionty_source_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 068T1Df6 | single-cell RNA sequencing | EFO:0008913 | None | scRNA-seq|single cell RNA sequencing|single-ce... | A Protocol That Provides The Expression Profil... | RNA assay | single cell sequencing | None | 35 | 2023-12-08 11:34:26.125552+00:00 | 1
# access labels via feature
dataset.labels.get(features.assay).df()
id | uid | name | ontology_id | abbr | synonyms | description | molecule | instrument | measurement | bionty_source_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 068T1Df6 | single-cell RNA sequencing | EFO:0008913 | None | scRNA-seq|single cell RNA sequencing|single-ce... | A Protocol That Provides The Expression Profil... | RNA assay | single cell sequencing | None | 35 | 2023-12-08 11:34:26.125552+00:00 | 1
Let’s finish annotating the remaining labels:
# labels of obs metadata features
dataset.labels.add(cell_types, feature=features.cell_type)
dataset.labels.add(tissues, feature=features.tissue)
# labels of external metadata features
dataset.labels.add(organism, feature=features.organism)
✅ linked feature 'cell_type' to registry 'bionty.CellType'
✅ linked feature 'tissue' to registry 'bionty.Tissue'
✅ linked feature 'organism' to registry 'bionty.Organism'
💡 no file links to it anymore, deleting feature set FeatureSet(uid='OO0dQDkFVWq9ueEKJS86', n=3, registry='core.Feature', hash='EKufQMjxwSZBuy5128gi', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
✅ linked new feature 'organism' together with new feature set FeatureSet(uid='8hIR1rmv5m3nItsMP0Sn', n=4, registry='core.Feature', hash='VRBwfJn0S5m6N-uo_QKR', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
Now you’ve annotated your dataset with all features and labels:
dataset.describe()
Dataset(uid='4V0LW9MtlWQLkGjgy19A', name='my RNA-seq', hash='jBNzT3fmTNEcfJ19FK2euw', visibility=1, updated_at=2023-12-08 11:34:26 UTC)
Provenance:
📔 transform: Transform(uid='sU0y1kF3igepz8', name='Annotate data', short_name='annotate', version='0', type='notebook', updated_at=2023-12-08 11:34:17 UTC, created_by_id=1)
👣 run: Run(uid='EDK6zUUGvSfryV9Ty383', run_at=2023-12-08 11:34:17 UTC, transform_id=1, created_by_id=1)
📄 file: File(uid='4V0LW9MtlWQLkGjgy19A', suffix='.h5ad', accessor='AnnData', description='See dataset 4V0LW9MtlWQLkGjgy19A', size=21224, hash='jBNzT3fmTNEcfJ19FK2euw', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:34:26 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:34:14 UTC)
Features:
var: FeatureSet(uid='ldfeFgsaNa4zgKei0y1g', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
'CD8A', 'CD4', 'CD14'
obs: FeatureSet(uid='eq9NxlLcTZSdjqTOV42S', n=2, registry='core.Feature', hash='Gneowc4mRTgy6p-nfyfb', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
🔗 cell_type (2, bionty.CellType): 'T cell', 'monocyte'
🔗 tissue (2, bionty.Tissue): 'capillary blood', 'arterial blood'
external: FeatureSet(uid='8hIR1rmv5m3nItsMP0Sn', n=4, registry='core.Feature', hash='VRBwfJn0S5m6N-uo_QKR', updated_at=2023-12-08 11:34:26 UTC, created_by_id=1)
🔗 organism (1, bionty.Organism): 'human'
🔗 assay (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
🔗 experiment (1, core.ULabel): 'EXP0001'
🔗 project (1, core.ULabel): 'PRJ0001'
Labels:
🏷️ organism (1, bionty.Organism): 'human'
🏷️ tissues (2, bionty.Tissue): 'capillary blood', 'arterial blood'
🏷️ cell_types (2, bionty.CellType): 'T cell', 'monocyte'
🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
🏷️ ulabels (2, core.ULabel): 'EXP0001', 'PRJ0001'
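The final summary lists the same labels twice: grouped by feature under “Features” and grouped by registry under “Labels”. The plain-Python sketch below reproduces both groupings from a flat list of links copied from the output above. It only illustrates the two views and does not call LaminDB; the names `links` and `group_by` are chosen for this example.

```python
# Each link: (feature name, registry, label), copied from the summary above.
links = [
    ("cell_type", "bionty.CellType", "T cell"),
    ("cell_type", "bionty.CellType", "monocyte"),
    ("tissue", "bionty.Tissue", "capillary blood"),
    ("tissue", "bionty.Tissue", "arterial blood"),
    ("organism", "bionty.Organism", "human"),
    ("assay", "bionty.ExperimentalFactor", "single-cell RNA sequencing"),
    ("experiment", "core.ULabel", "EXP0001"),
    ("project", "core.ULabel", "PRJ0001"),
]


def group_by(records, key_index):
    """Group labels (last tuple element) by the tuple element at key_index."""
    groups = {}
    for record in records:
        groups.setdefault(record[key_index], []).append(record[-1])
    return groups


by_feature = group_by(links, 0)   # the view shown under "Features"
by_registry = group_by(links, 1)  # the view shown under "Labels"

print(by_feature["experiment"])
print(by_registry["core.ULabel"])
```

The registry view merges “EXP0001” and “PRJ0001” into one group, while the feature view keeps them apart, which is why feature-stratified labels are easier to interpret.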
# clean up test instance
!lamin delete --force test-annotate
!rm -r test-annotate