Validate, standardize & annotate#

We’ll walk you through the following flow:

define validation criteria
validate & standardize metadata
save validated & annotated artifacts

!lamin init --storage ./test-annotate --schema bionty

import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

ln.settings.verbosity = "hint"

💡 connected lamindb: testuser1/test-annotate

Let’s start with a DataFrame that we’d like to validate:

df = pd.DataFrame({
    "temperature": [37.2, 36.3, 38.2],
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df

	temperature	cell_type	assay_ontology_id	donor
0	37.2	cerebral pyramidal neuron	EFO:0008913	D0001
1	36.3	astrocyte	EFO:0008913	D0002
2	38.2	oligodendrocyte	EFO:0008913	DOOO3

Validate and standardize metadata#

# define validation criteria for the categoricals
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}
# create an object to guide validation and annotation
annotate = ln.Annotate.from_df(df, categoricals=categoricals)
# validate
validated = annotate.validate()
validated

✅ added 3 records with Feature.name for columns: ['cell_type', 'assay_ontology_id', 'donor']

❗ 1 non-validated categories are not saved in Feature.name: ['temperature']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns

💡 mapping cell_type on CellType.name

❗    found 2 terms validated terms: ['astrocyte', 'oligodendrocyte']
      → save terms via .add_validated_from('cell_type')

❗    1 terms is not validated: 'cerebral pyramidal neuron'
      → save terms via .add_new_from('cell_type')

💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id

❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')

✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id

💡 mapping donor on ULabel.name

❗    3 terms are not validated: 'DOOO3', 'D0002', 'D0001'
      → save terms via .add_new_from('donor')

False
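
To make the log above concrete: validating a categorical column amounts to checking each unique value against the values of a registry field. The following is a minimal plain-Python sketch of that idea, not lamindb's implementation. The function `split_validated` and the set `registry_names` are hypothetical stand-ins for a lookup against `CellType.name`.

```python
def split_validated(values, registry_names):
    """Partition unique values into those found in a registry and those missing."""
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    validated = [v for v in unique if v in registry_names]
    non_validated = [v for v in unique if v not in registry_names]
    return validated, non_validated


# hypothetical registry content, for illustration only
registry_names = {"astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"}
cell_types = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]

validated, non_validated = split_validated(cell_types, registry_names)
print(validated)
print(non_validated)
```

The column as a whole counts as validated only when the second list is empty, which is why `validate()` returned `False` above.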

Validate using registries in another instance#

Sometimes you want to validate against registries that already exist in another instance, for example laminlabs/cellxgene.

This lets us transfer values that are missing from our own registries directly from that instance.

annotate = ln.Annotate.from_df(
    df, 
    categoricals=categoricals,
    using="laminlabs/cellxgene",  # pass the instance slug
)
annotate.validate()

❗ 1 non-validated categories are not saved in Feature.name: ['temperature']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns

💡 mapping cell_type on CellType.name

❗    found 2 terms validated terms: ['astrocyte', 'oligodendrocyte']
      → save terms via .add_validated_from('cell_type')

❗    1 terms is not validated: 'cerebral pyramidal neuron'
      → save terms via .add_new_from('cell_type')

💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id

❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')

✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id

💡 mapping donor on ULabel.name

❗    3 terms are not validated: 'DOOO3', 'D0002', 'D0001'
      → save terms via .add_new_from('donor')

False
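
Conceptually, passing `using` adds a second source to check: a value that is absent from the local registry but present in the other instance can be transferred rather than created from scratch. Below is a rough plain-Python sketch of that classification. The names `classify`, `local_names` and `remote_names` are hypothetical, and the sets are illustrative rather than real registry content.

```python
def classify(values, local_names, remote_names):
    """Label each unique value by where a matching record could come from."""
    result = {}
    for value in dict.fromkeys(values):  # unique values, first-seen order
        if value in local_names:
            result[value] = "validated locally"
        elif value in remote_names:
            result[value] = "transferable from other instance"
        else:
            result[value] = "not validated"
    return result


local_names = set()  # assumption: the local instance starts out empty
remote_names = {"astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"}

status = classify(
    ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    local_names,
    remote_names,
)
for value, label in status.items():
    print(f"{value}: {label}")
```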

Register new metadata labels#

Our current database instance is empty. Once you have populated its registries, you will rarely need to save new labels. You’ll mostly use your lamindb instance to validate and annotate incoming data.

annotate.add_validated_from(df.cell_type.name)

❗ 1 non-validated categories are not saved in CellType.name: ['cerebral pyramidal neuron']!
      → to lookup categories, use lookup().cell_type
      → to save, run .add_new_from('cell_type')

✅ added 2 records from laminlabs/cellxgene with CellType.name for cell_type: ['astrocyte', 'oligodendrocyte']

# use a lookup object to get the correct spelling of categories
# by default, it looks up the instance passed via `using`; pass "public" to use the public reference instead
lookup = annotate.lookup()
lookup

Lookup objects from the laminlabs/cellxgene:
 .cell_type
 .assay_ontology_id
 .donor
 .columns
 

Example:
    → categories = validator.lookup().cell_type
    → categories.alveolar_type_1_fibroblast_cell

cell_types = lookup[df.cell_type.name]

cell_types.cerebral_cortex_pyramidal_neuron

CellType(uid='2sgq6sE7', name='cerebral cortex pyramidal neuron', ontology_id='CL:4023111', description='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', updated_at=2023-11-28 22:37:06 UTC, public_source_id=48, created_by_id=1)
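
The attribute-style access above works because each category name is exposed under an identifier-friendly key. As an illustration only, here is a sketch of how such a mapping could be built. It assumes a simple "lowercase, non-word characters to underscores" rule, which is not necessarily the rule lamindb applies. `make_lookup` is a hypothetical helper and returns plain names rather than records.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Expose each name under an attribute-friendly key, e.g. 'a b' -> 'a_b'."""
    keys = {re.sub(r"\W+", "_", name.strip().lower()): name for name in names}
    return SimpleNamespace(**keys)


lookup_sketch = make_lookup(
    ["astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"]
)
print(lookup_sketch.cerebral_cortex_pyramidal_neuron)
```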

# fix the typo
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})

annotate.add_validated_from(df.cell_type.name)

✅ added 1 record from laminlabs/cellxgene with CellType.name for cell_type: ['cerebral cortex pyramidal neuron']
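
When the correct spelling is not known in advance, fuzzy matching against the known names is one way to find a candidate before fixing the value by hand. This sketch uses the standard library's `difflib`. It is an assumption about a possible workflow, not something the lookup object does, and `known_names` is an illustrative list.

```python
from difflib import get_close_matches

# illustrative list of registry names
known_names = ["astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"]

# ask for the single closest candidate to the non-validated term
suggestions = get_close_matches("cerebral pyramidal neuron", known_names, n=1)
print(suggestions)
```

A suggestion like this should still be reviewed by a person, since similar strings can name different cell types.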

# register non-validated terms
annotate.add_new_from(df.donor.name)

✅ added 3 records with ULabel.name for donor: ['D0001', 'D0002', 'DOOO3']

# validate again
validated = annotate.validate()
validated

✅ cell_type is validated against CellType.name

💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id

❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')

✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id

✅ donor is validated against ULabel.name

True

Validate an AnnData object#

Here we specify what to validate against: var_index for the variables and categoricals for the obs columns.

df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata

AnnData object with n_obs × n_vars = 3 × 5
    obs: 'temperature', 'cell_type', 'assay_ontology_id', 'donor'

annotate = ln.Annotate.from_anndata(
    adata, 
    var_index=bt.Gene.symbol,
    categoricals=categoricals, 
    organism="human",
)

❗ 1 non-validated categories are not saved in Feature.name: ['temperature']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns

✅ added 6 records from public with Gene.symbol for var_index: ['TCF7', 'PDCD1', 'PDCD1', 'CD3E', 'CD4', 'CD8A']

annotate.validate()

✅ var_index is validated against Gene.symbol

✅ cell_type is validated against CellType.name

💡 mapping assay_ontology_id on ExperimentalFactor.ontology_id

❗    found 1 terms validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')

✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id

✅ donor is validated against ULabel.name

True

annotate.add_validated_from("all")

💡 saving labels for 'cell_type'

💡 saving labels for 'assay_ontology_id'

✅ added 1 record from public with ExperimentalFactor.ontology_id for assay_ontology_id: ['EFO:0008913']

💡 saving labels for 'donor'

annotate.validate()

✅ var_index is validated against Gene.symbol

✅ cell_type is validated against CellType.name

✅ assay_ontology_id is validated against ExperimentalFactor.ontology_id

✅ donor is validated against ULabel.name

True

Save an artifact#

The validated object can subsequently be saved as an Artifact:

artifact = annotate.save_artifact(description="test AnnData")

❗ no run & transform get linked, consider calling ln.track()

💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/rNu5qCvinTsIoUU1eVo7.h5ad')

✅ storing artifact 'rNu5qCvinTsIoUU1eVo7' at '/home/runner/work/lamindb/lamindb/docs/test-annotate/.lamindb/rNu5qCvinTsIoUU1eVo7.h5ad'

💡 you can auto-track these data as a run input by calling `ln.track()`

💡 parsing feature names of X stored in slot 'var'

✅    5 terms (100.00%) are validated for symbol

✅    linked: FeatureSet(uid='R4BCgvcjgjrYeX5vQwIH', n=6, type='number', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', created_by_id=1)

💡 parsing feature names of slot 'obs'

✅    3 terms (75.00%) are validated for name

❗    1 term (25.00%) is not validated for name: temperature

✅    linked: FeatureSet(uid='O3hQtVmppSIFxQvNcaZS', n=3, registry='core.Feature', hash='WRqU6xp18zVAq9NZFJQ0', created_by_id=1)

✅ saved 2 feature sets for slots: 'var','obs'

✅ linked feature 'cell_type' to registry 'bionty.CellType'

✅ linked feature 'assay_ontology_id' to registry 'bionty.ExperimentalFactor'

✅ linked feature 'donor' to registry 'core.ULabel'
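
The percentages reported for the 'obs' slot follow from the column count: three of the four `obs` columns were registered as features, while `temperature` was never added. Here is a small sketch of that arithmetic, with `registered_features` as a hypothetical stand-in for the names in the `Feature` registry:

```python
obs_columns = ["temperature", "cell_type", "assay_ontology_id", "donor"]
registered_features = {"cell_type", "assay_ontology_id", "donor"}  # illustrative

validated = [c for c in obs_columns if c in registered_features]
missing = [c for c in obs_columns if c not in registered_features]

# format the shares as percentages with two decimals
validated_share = f"{len(validated) / len(obs_columns):.2%}"
missing_share = f"{len(missing) / len(obs_columns):.2%}"
print(validated_share, missing_share, missing)
```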

artifact.describe()

Artifact(uid='rNu5qCvinTsIoUU1eVo7', suffix='.h5ad', accessor='AnnData', description='test AnnData', size=20336, hash='wozXf_B6VsK6QXH81skJ8A', hash_type='md5', n_observations=3, visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:49:47 UTC)

Provenance:
  📎 storage: Storage(uid='Pca4eGu9pKMS', root='/home/runner/work/lamindb/lamindb/docs/test-annotate', type='local', instance_uid='3kW5y8h7c8wG')
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
  var: FeatureSet(uid='R4BCgvcjgjrYeX5vQwIH', n=6, type='number', registry='bionty.Gene')
    'TCF7', 'CD4', 'PDCD1', 'CD8A', 'CD3E'
  obs: FeatureSet(uid='O3hQtVmppSIFxQvNcaZS', n=3, registry='core.Feature')
    🔗 cell_type (3, bionty.CellType): 'cerebral cortex pyramidal neuron', 'oligodendrocyte', 'astrocyte'
    🔗 assay_ontology_id (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
    🔗 donor (3, core.ULabel): 'DOOO3', 'D0002', 'D0001'
Labels:
  📎 cell_types (3, bionty.CellType): 'cerebral cortex pyramidal neuron', 'oligodendrocyte', 'astrocyte'
  📎 experimental_factors (1, bionty.ExperimentalFactor): 'single-cell RNA sequencing'
  📎 ulabels (3, core.ULabel): 'DOOO3', 'D0002', 'D0001'

Save a collection#

# register a new collection
collection = annotate.save_collection(
    artifact,  # registered artifact above, can also pass a list of artifacts
    name="Experiment X in brain",  # title of the publication
    description="10.1126/science.xxxxx",  # DOI of the publication
    reference="E-MTAB-xxxxx", # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress" # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)

❗ no run & transform get linked, consider calling ln.track()

✅ loaded: FeatureSet(uid='R4BCgvcjgjrYeX5vQwIH', n=6, type='number', registry='bionty.Gene', hash='12Mh3I-mUBuOvj1q6wNn', updated_at=2024-05-01 18:49:47 UTC, created_by_id=1)

✅ loaded: FeatureSet(uid='O3hQtVmppSIFxQvNcaZS', n=3, registry='core.Feature', hash='WRqU6xp18zVAq9NZFJQ0', updated_at=2024-05-01 18:49:47 UTC, created_by_id=1)

💡 you can auto-track these data as a run input by calling `ln.track()`

collection.artifacts.df()

	uid	storage_id	key	suffix	accessor	description	version	size	hash	hash_type	n_objects	n_observations	transform_id	run_id	visibility	key_is_virtual	created_at	updated_at	created_by_id
id
1	rNu5qCvinTsIoUU1eVo7	1	None	.h5ad	AnnData	test AnnData	None	20336	wozXf_B6VsK6QXH81skJ8A	md5	None	3	None	None	1	True	2024-05-01 18:49:47.488515+00:00	2024-05-01 18:49:47.570795+00:00	1