How do I create custom validators?#
Here, we extend the basic validation guide to enforce additional constraints with dictionary-like validators.
This is similar to the validation that the CZ CELLxGENE data portal enforces.
Setup#
!lamin init --storage test-validator --schema bionty
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:50 UTC)
✅ saved: Storage(uid='CjBJy4yv', root='/home/runner/work/lamindb/lamindb/docs/faq/test-validator', type='local', updated_at=2023-12-08 11:33:50 UTC, created_by_id=1)
💡 loaded instance: testuser1/test-validator
💡 did not register local instance on hub
import lamindb as ln
import lnschema_bionty as lb
from lamin_utils import logger
ln.settings.verbosity = "success"
lb.settings.organism = "human"
💡 lamindb instance: testuser1/test-validator
Access data#
Let's use an AnnData as the dataset to validate:
adata = ln.dev.datasets.anndata_human_immune_cells(populate_registries=True)
adata
AnnData object with n_obs × n_vars = 1648 × 36503
obs: 'donor', 'tissue', 'cell_type', 'assay'
var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
uns: 'default_embedding'
obsm: 'X_umap'
Define validation criteria#
Define validation criteria for an AnnData:
validators = {
    "var": {"index": lb.Gene.ensembl_gene_id},
    "obs": {
        "donor": ln.ULabel.name,
        "tissue": lb.Tissue.name,
        "cell_type": lb.CellType.name,
        "assay": lb.ExperimentalFactor.name,
    },
}
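The mapping above pairs each AnnData slot (`var`, `obs`) with the keys to check and the registry field to check them against, where the special key `index` refers to the slot's index and not to a column. The following is a minimal sketch of that shape only: plain strings stand in for the registry fields, and `flatten_spec` is a hypothetical helper, not part of lamindb.

```python
# Plain strings stand in for registry fields such as lb.Gene.ensembl_gene_id.
# This illustrates the shape of the mapping; it does not call lamindb.
spec = {
    "var": {"index": "Gene.ensembl_gene_id"},
    "obs": {
        "donor": "ULabel.name",
        "tissue": "Tissue.name",
        "cell_type": "CellType.name",
        "assay": "ExperimentalFactor.name",
    },
}


def flatten_spec(spec):
    """Yield (slot, name, field) triples in declaration order (hypothetical helper)."""
    for slot, slot_spec in spec.items():
        for name, field in slot_spec.items():
            yield slot, name, field


checks = list(flatten_spec(spec))
for slot, name, field in checks:
    # "index" means the slot's index; any other key is a column name
    target = "index" if name == "index" else f"column '{name}'"
    print(f"{slot}: validate {target} against {field}")
```

Keeping the criteria in a plain dictionary means a new column can be covered by adding one entry, without touching the validation loop.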
Run validation#
Run bulk validation:
features = ln.Feature.lookup()
for slot, slot_validators in validators.items():
    for name, validator in slot_validators.items():
        # access registry (a Django model)
        registry = validator.field.model
        # validate index
        if name == "index":
            logger.print(f"validating {slot}.{name}:")
            index = getattr(adata, slot).index
            validated = registry.validate(index, validator)
            if validated.sum() == len(index):
                logger.success("index matches")
        # validate columns
        else:
            logger.print(f"\nvalidating {slot}.{name}:")
            # check if the column name exists
            if name not in getattr(adata, slot).columns:
                logger.warning(f"{slot}.{name} field is missing")
            else:
                # check if a feature is registered for the column
                if not hasattr(features, name):
                    logger.warning(f"feature '{name}' is not registered")
                # validate categorical labels in a column
                else:
                    labels = getattr(adata, slot)[name]
                    validated = registry.validate(labels, validator)
                    if validated.sum() == len(labels):
                        logger.success("labels match")
validating var.index:
✅ 36390 terms (99.70%) are validated for ensembl_gene_id
❗ 113 terms (0.30%) are not validated for ensembl_gene_id: ENSG00000269933, ENSG00000261737, ENSG00000259834, ENSG00000256374, ENSG00000263464, ENSG00000203812, ENSG00000272196, ENSG00000272880, ENSG00000270188, ENSG00000287116, ENSG00000237133, ENSG00000224739, ENSG00000227902, ENSG00000239467, ENSG00000272551, ENSG00000280374, ENSG00000236886, ENSG00000229352, ENSG00000286601, ENSG00000227021, ...

validating obs.donor:
✅ 12 terms (100.00%) are validated for name
✅ labels match

validating obs.tissue:
✅ 17 terms (100.00%) are validated for name
✅ labels match

validating obs.cell_type:
✅ 32 terms (100.00%) are validated for name
✅ labels match

validating obs.assay:
✅ 3 terms (100.00%) are validated for name
✅ labels match
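The bulk validation treats a slot as valid only when `validated.sum()` equals the number of values, which suggests that `validate` returns one boolean per value. The sketch below imitates that check with the standard library to show how such a mask can also be used to list the terms that failed, as happens for `var.index` in the output. The functions `validate_terms` and `summarize` and the reference set `known` are hypothetical stand-ins, not lamindb API.

```python
def validate_terms(values, known_terms):
    """Return one boolean per value: True if the value is a known term."""
    known = set(known_terms)
    return [value in known for value in values]


def summarize(values, mask):
    """Count unique validated terms and list the unique terms that failed."""
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    ok = dict(zip(values, mask))  # a given value always maps to the same flag
    failed = [value for value in unique if not ok[value]]
    return {
        "n_unique": len(unique),
        "n_validated": len(unique) - len(failed),
        "failed": failed,
    }


# Hypothetical reference set standing in for a gene registry
known = ["ENSG00000000003", "ENSG00000000005"]
index = ["ENSG00000000003", "ENSG00000269933", "ENSG00000000005", "ENSG00000269933"]

mask = validate_terms(index, known)
report = summarize(index, mask)

# Same all-or-nothing check as `validated.sum() == len(index)` in the loop
all_valid = sum(mask) == len(index)
print(report, all_valid)
```

If the mask is a NumPy boolean array, as the `.sum()` call suggests, indexing the original values with its negation should give the same list of failed terms.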
Delete test instance:
!lamin delete --force test-validator
💡 deleting instance testuser1/test-validator
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-validator.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/faq/test-validator