How do I create custom validators?

This guide extends the basic validation guide to enforce additional constraints using dictionary-like validators.

The approach is similar to the validation enforced by the CZ CELLxGENE data portal.

Setup

!lamin init --storage test-validator --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:50 UTC)
✅ saved: Storage(uid='CjBJy4yv', root='/home/runner/work/lamindb/lamindb/docs/faq/test-validator', type='local', updated_at=2023-12-08 11:33:50 UTC, created_by_id=1)
💡 loaded instance: testuser1/test-validator
💡 did not register local instance on hub

import lamindb as ln
import lnschema_bionty as lb
from lamin_utils import logger

ln.settings.verbosity = "success"
lb.settings.organism = "human"
💡 lamindb instance: testuser1/test-validator

Access data

Let's use an AnnData object as the dataset to validate:

adata = ln.dev.datasets.anndata_human_immune_cells(populate_registries=True)
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

Define validation criteria

Map each AnnData slot to the registry fields that its index or columns should be validated against:

validators = {
    "var": {"index": lb.Gene.ensembl_gene_id},
    "obs": {
        "donor": ln.ULabel.name,
        "tissue": lb.Tissue.name,
        "cell_type": lb.CellType.name,
        "assay": lb.ExperimentalFactor.name,
    },
}
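To make the layout of this dictionary explicit: the outer keys name AnnData slots (`var`, `obs`), the inner keys name either the special entry `index` or a column, and each value is the registry field to validate against. The sketch below lists the checks such a dictionary implies. It uses plain strings as stand-ins for the registry fields (an assumption, so that it runs without a LaminDB instance), and the helper `iter_checks` is hypothetical, not part of lamindb.

```python
# Stand-in for the validators dictionary: strings replace registry fields
# such as lb.Gene.ensembl_gene_id so the sketch runs without lamindb.
validators_sketch = {
    "var": {"index": "Gene.ensembl_gene_id"},
    "obs": {
        "donor": "ULabel.name",
        "tissue": "Tissue.name",
    },
}


def iter_checks(validators):
    """Yield one (slot, name, field) triple per entry of the nested dictionary."""
    for slot, slot_validators in validators.items():
        for name, field in slot_validators.items():
            yield slot, name, field


checks = list(iter_checks(validators_sketch))
for slot, name, field in checks:
    print(f"{slot}.{name} -> {field}")
```

Flattening the dictionary this way is a quick way to review which slots and columns will be checked before running the actual validation.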

Run validation

Loop over the criteria and validate each index or column in bulk:

features = ln.Feature.lookup()

for slot, slot_validators in validators.items():
    for name, validator in slot_validators.items():
        # access registry (a Django model)
        registry = validator.field.model

        # validate index
        if name == "index":
            logger.print(f"validating {slot}.{name}:")
            index = getattr(adata, slot).index
            validated = registry.validate(index, validator)
            if validated.sum() == len(index):
                logger.success("index matches")

        # validate columns
        else:
            logger.print(f"\nvalidating {slot}.{name}:")
            # check if the column name exists
            if name not in getattr(adata, slot).columns:
                logger.warning(f"{slot}.{name} field is missing")
            else:
                # check if a feature is registered for the column
                if not hasattr(features, name):
                    logger.warning(f"feature '{name}' is not registered")
                # validate categorical labels in a column
                else:
                    labels = getattr(adata, slot)[name]
                    validated = registry.validate(labels, validator)
                    if validated.sum() == len(labels):
                        logger.success("labels match")
validating var.index:
✅ 36390 terms (99.70%) are validated for ensembl_gene_id
❗ 113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...
validating obs.donor:
✅ 12 terms (100.00%) are validated for name
✅ labels match
validating obs.tissue:
✅ 17 terms (100.00%) are validated for name
✅ labels match
validating obs.cell_type:
✅ 32 terms (100.00%) are validated for name
✅ labels match
validating obs.assay:
✅ 3 terms (100.00%) are validated for name
✅ labels match

Delete test instance:

!lamin delete --force test-validator
💡 deleting instance testuser1/test-validator
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-validator.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/faq/test-validator