Validate & standardize for developers#

LaminDB makes it easy to validate categorical variables based on registries (CanValidate).

How do I validate based on a public ontology?

CanValidate methods validate against the registries in your LaminDB instance. In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Registry.public(). By default, from_values() considers a match in a public reference a validated value for any bionty entity.

What to do for non-validated values?

Be aware that in a freshly initialized instance nothing is validated, because no records have been registered yet. Run inspect() to get instructions on how to register non-validated values. You may need to standardize your values, fix typos, or simply register them.

Setup#

!lamin init --storage ./test-validate --schema bionty
💡 connected lamindb: testuser1/test-validate
import lamindb as ln
import bionty as bt
import pandas as pd
💡 connected lamindb: testuser1/test-validate
ln.settings.verbosity = "info"

Pre-populate registries:

df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.Artifact.from_df(df, description="test data").save()
ln.ULabel(name="Project A").save()
ln.ULabel(name="Project B").save()
bt.Disease.from_public(ontology_id="MONDO:0004975").save()
❗ no run & transform get linked, consider calling ln.track()
✅ storing artifact 'VP96fw11heUrjd8L2sE3' at '/home/runner/work/lamindb/lamindb/docs/test-validate/.lamindb/VP96fw11heUrjd8L2sE3.parquet'
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0004975'
💡 also saving parents of Disease(uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimer dementia|Alzheimer disease|Alzheimer's disease|presenile and senile dementia|Alzheimers disease|Alzheimer's dementia|Alzheimers dementia|AD', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', updated_at=2024-04-19 17:41:05 UTC, public_source_id=29, created_by_id=1)
✅ created 2 Disease records from Bionty matching ontology_id: 'MONDO:0005574', 'MONDO:0001627'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 also saving parents of Disease(uid='6AMrlbw8', name='dementia', ontology_id='MONDO:0001627', synonyms='dementia (disease)|dementia', description='Loss Of Intellectual Abilities Interfering With An Individual'S Social And Occupational Functions. Causes Include Alzheimer'S Disease, Brain Injuries, Brain Tumors, And Vascular Disorders.', updated_at=2024-04-19 17:41:06 UTC, public_source_id=29, created_by_id=1)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002039'
💡 also saving parents of Disease(uid='6yfRDD23', name='cognitive disorder', ontology_id='MONDO:0002039', synonyms='cognitive disease|cognitive disorder', description='A Disease Affects Cognitive Processes.', updated_at=2024-04-19 17:41:07 UTC, public_source_id=29, created_by_id=1)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002025'
💡 also saving parents of Disease(uid='6HNgrMK9', name='psychiatric disorder', ontology_id='MONDO:0002025', synonyms='Psychiatric disorder|Psychiatric disease', description='A Disorder Characterized By Behavioral And/Or Psychological Abnormalities, Often Accompanied By Physical Symptoms. The Symptoms May Cause Clinically Significant Distress Or Impairment In Social And Occupational Areas Of Functioning. Representative Examples Include Anxiety Disorders, Cognitive Disorders, Mood Disorders And Schizophrenia.', updated_at=2024-04-19 17:41:08 UTC, public_source_id=29, created_by_id=1)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0700096'
💡 also saving parents of Disease(uid='3Pcu72hb', name='human disease', ontology_id='MONDO:0700096', synonyms='human disease or disorder', updated_at=2024-04-19 17:41:08 UTC, public_source_id=29, created_by_id=1)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0000001'
💡 also saving parents of Disease(uid='6PeduEmE', name='tauopathy', ontology_id='MONDO:0005574', description='Neurodegenerative Disorders Involving Deposition Of Abnormal Tau Protein Isoforms (Tau Proteins) In Neurons And Glial Cells In The Brain. Pathological Aggregations Of Tau Proteins Are Associated With Mutation Of The Tau Gene On Chromosome 17 In Patients With Alzheimer Disease; Dementia; Parkinsonian Disorders; Progressive Supranuclear Palsy (Supranuclear Palsy, Progressive); And Corticobasal Degeneration.', updated_at=2024-04-19 17:41:06 UTC, public_source_id=29, created_by_id=1)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005559'
💡 also saving parents of Disease(uid='6sgNFaDE', name='neurodegenerative disease', ontology_id='MONDO:0005559', synonyms='central nervous system neurodegenerative disorder|brain degeneration|neurodegenerative disease|central nervous system degenerative disorder|degenerative disorder of central nervous system', description='A Disorder Of The Central Nervous System Characterized By Gradual And Progressive Loss Of Neural Tissue And Neurologic Function.', updated_at=2024-04-19 17:41:10 UTC, public_source_id=29, created_by_id=1)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002602'
💡 also saving parents of Disease(uid='5dTDVEfc', name='central nervous system disorder', ontology_id='MONDO:0002602', synonyms='central nervous system disease|central nervous system disorder|disease of the central nervous system|CNS disorder|central nervous system disease or disorder|disease of central nervous system|central nervous disease|disease or disorder of central nervous system|disorder of central nervous system', description='A Disease Involving The Central Nervous System.', updated_at=2024-04-19 17:41:11 UTC, public_source_id=29, created_by_id=1)
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005071'
💡 also saving parents of Disease(uid='3NKHns2m', name='nervous system disorder', ontology_id='MONDO:0005071', synonyms='neurological disease|disorder of nervous system|nervous system disease|neurologic disease|nervous system disorder|nervous system disease or disorder|neurologic disorder|disease or disorder of nervous system|disease of nervous system|neurological disorder', description='A Non-Neoplastic Or Neoplastic Disorder That Affects The Brain, Spinal Cord, Or Peripheral Nerves.', updated_at=2024-04-19 17:41:12 UTC, public_source_id=29, created_by_id=1)

Standard validation#

Name duplication#

Creating a record with the same name field automatically returns the existing record:

ln.ULabel(name="Project A")
❗ loaded ULabel record with same name: 'Project A' (disable via ln.settings.upon_create_search_names)
ULabel(uid='T52J0uiY', name='Project A', updated_at=2024-04-19 17:41:04 UTC, created_by_id=1)

Bulk creating records using from_values() only returns validated records:

Note: Terms validated against a public reference are also created by from_values(); see Manage biological registries for details.

projects = ["Project A", "Project B", "Project D", "Project E"]
ln.ULabel.from_values(projects)
✅ loaded 2 ULabel records matching name: 'Project A', 'Project B'
❗ did not create ULabel records for 2 non-validated names: 'Project D', 'Project E'
[ULabel(uid='T52J0uiY', name='Project A', updated_at=2024-04-19 17:41:04 UTC, created_by_id=1),
 ULabel(uid='2lFHr0fh', name='Project B', updated_at=2024-04-19 17:41:04 UTC, created_by_id=1)]

(Versioned records also account for version in addition to name. Also see: idempotency.)
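
The name-based behavior above can be pictured with a small in-memory model. This is only a conceptual sketch in plain Python, not LaminDB's implementation: `ToyRegistry`, `Label`, and `get_or_create` are hypothetical names introduced for illustration, and only `from_values` borrows its name from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    """Hypothetical stand-in for a registry record identified by name."""

    name: str


@dataclass
class ToyRegistry:
    """Hypothetical in-memory registry keyed by the name field."""

    records: dict = field(default_factory=dict)

    def get_or_create(self, name: str) -> Label:
        # Return the existing record rather than creating a duplicate.
        if name not in self.records:
            self.records[name] = Label(name)
        return self.records[name]

    def from_values(self, names):
        # Only names that are already registered yield records;
        # unknown names are skipped instead of being created.
        return [self.records[n] for n in names if n in self.records]


registry = ToyRegistry()
first = registry.get_or_create("Project A")
registry.get_or_create("Project B")
again = registry.get_or_create("Project A")  # same object as `first`

projects = ["Project A", "Project B", "Project D", "Project E"]
loaded = registry.from_values(projects)
print(first is again)
print([label.name for label in loaded])
```

In this model, "Project D" and "Project E" are dropped because nothing with those names has been registered, which mirrors the warning in the log output above.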

Data duplication#

Creating an artifact or collection with the same content automatically returns the existing record:

ln.Artifact.from_df(df, description="same data")
❗ no run & transform get linked, consider calling ln.track()
❗ returning existing artifact with same hash: Artifact(uid='VP96fw11heUrjd8L2sE3', suffix='.parquet', accessor='DataFrame', description='test data', size=2722, hash='GF6nXryX_xarbHjhMIa94g', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-04-19 17:41:04 UTC, storage_id=1, created_by_id=1)
Artifact(uid='VP96fw11heUrjd8L2sE3', suffix='.parquet', accessor='DataFrame', description='test data', size=2722, hash='GF6nXryX_xarbHjhMIa94g', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-04-19 17:41:04 UTC, storage_id=1, created_by_id=1)
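
One way to picture content-based deduplication is a lookup keyed by a hash of the stored bytes. The sketch below is an assumption-laden illustration, not LaminDB's code: `content_hash` and `ToyArtifactStore` are hypothetical, and the URL-safe base64 encoding of an MD5 digest is only a guess consistent with `hash_type='md5'` and the 22-character hash in the output above.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    # MD5 digest encoded as URL-safe base64 without padding.
    # The encoding is an assumption made for this sketch.
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


class ToyArtifactStore:
    """Hypothetical store that returns the existing record for known content."""

    def __init__(self):
        self._by_hash = {}

    def create(self, data: bytes, description: str) -> dict:
        key = content_hash(data)
        if key in self._by_hash:
            # Same content: hand back the record that is already stored.
            return self._by_hash[key]
        record = {"hash": key, "description": description}
        self._by_hash[key] = record
        return record


store = ToyArtifactStore()
original = store.create(b"A,B\n1,2\n", description="test data")
duplicate = store.create(b"A,B\n1,2\n", description="same data")
print(duplicate is original)
print(duplicate["description"])
```

As in the output above, the second call keeps the original description ('test data') because the existing record is returned unchanged.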

Schema-based validation#

Type checks, constraint checks, and Django validators can be configured in the schema.

Registry-based validation#

validate() validates passed values against reference values in a registry. It returns a boolean vector indicating whether a value has an exact match in the reference values.

Using dedicated registries#

For instance, bionty types basic biological entities: every entity has its own registry, a Python class. By default, the first string field is used for validation. For Disease, it’s name:

diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = bt.Disease.validate(diseases)
validated
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
array([ True, False, False])

Validate against a non-default field:

bt.Disease.validate(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], bt.Disease.ontology_id
)
✅ 1 term (33.30%) is validated for ontology_id
❗ 2 terms (66.70%) are not validated for ontology_id: MONDO:0004976, MONDO:0004977
array([ True, False, False])
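
Conceptually, both calls above reduce to an exact-match membership test against the values of one field. The following is a minimal plain-Python sketch of that idea, not the library's implementation; the `validate` function and the dictionary-based records are hypothetical, while the names and ontology IDs are taken from the log output in the Setup section.

```python
def validate(values, records, field="name"):
    """Return one boolean per value: True only for an exact match on `field`."""
    reference = {record[field] for record in records}
    return [value in reference for value in values]


# Two records copied from the pre-populated Disease registry above.
records = [
    {"name": "Alzheimer disease", "ontology_id": "MONDO:0004975"},
    {"name": "dementia", "ontology_id": "MONDO:0001627"},
]

by_name = validate(["Alzheimer disease", "Alzheimer's disease", "AD"], records)
by_id = validate(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"],
    records,
    field="ontology_id",
)
print(by_name)
print(by_id)
```

Because the comparison is exact, synonyms such as "AD" do not validate against name even though they refer to the same disease.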

Using the ULabel registry#

Any entity that doesn’t have its dedicated registry (“is not typed”) can be validated & registered using ULabel:

ln.ULabel.validate(["Project A", "Project B", "Project C"])
✅ 2 terms (66.70%) are validated for name
❗ 1 term (33.30%) is not validated for name: Project C
array([ True,  True, False])

Inspect & standardize#

When validation fails, you can call inspect() to figure out what to do.

inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs the recommended curation steps that would render the data validated.

result = bt.Disease.inspect(diseases)
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
   detected 2 terms with synonyms: Alzheimer's disease, AD
→  standardize terms via .standardize()

In this case, it suggests calling standardize() to map synonyms to their standard names:

bt.Disease.standardize(result.non_validated)
💡 standardized 2/2 terms
['Alzheimer disease', 'Alzheimer disease']
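
Synonym standardization can be thought of as a lookup from each synonym to the record's standard name. The sketch below assumes the pipe-delimited synonyms format visible in the Setup log output; `build_synonym_map` and this `standardize` function are hypothetical helpers for illustration, and the synonyms string is a shortened copy of the one shown for 'Alzheimer disease'.

```python
def build_synonym_map(records):
    """Map every pipe-delimited synonym to its record's standard name."""
    mapping = {}
    for record in records:
        for synonym in record["synonyms"].split("|"):
            mapping[synonym] = record["name"]
    return mapping


def standardize(values, records):
    # Values without a known synonym are passed through unchanged.
    mapping = build_synonym_map(records)
    return [mapping.get(value, value) for value in values]


records = [
    {
        "name": "Alzheimer disease",
        "synonyms": "Alzheimer dementia|Alzheimer disease|Alzheimer's disease|AD",
    }
]

standardized = standardize(["Alzheimer's disease", "AD", "unknown term"], records)
print(standardized)
```

Passing unknown values through unchanged is a design choice of this sketch; it keeps the output aligned with the input so that remaining non-validated values can be inspected afterwards.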

For more, see Manage biological registries.

Extend registries#

Sometimes, we simply want to register new records to extend the content of registries:

result = ln.ULabel.inspect(projects)
✅ 2 terms (50.00%) are validated for name
❗ 2 terms (50.00%) are not validated for name: Project D, Project E
   couldn't validate 2 terms: 'Project D', 'Project E'
→  if you are sure, create new records via ln.ULabel() and save to your registry
new_labels = [ln.ULabel(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels
[ULabel(uid='PSfFvulB', name='Project D', updated_at=2024-04-19 17:41:12 UTC, created_by_id=1),
 ULabel(uid='1inDdqEl', name='Project E', updated_at=2024-04-19 17:41:12 UTC, created_by_id=1)]
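
The inspect-then-extend loop above can be summarized in a toy model: split the values into validated and non-validated, register the latter, then check again. This is a plain-Python sketch under stated assumptions, not LaminDB code; `ToyInspectResult` and this `inspect` function are hypothetical, and only the `non_validated` attribute name is taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class ToyInspectResult:
    """Hypothetical result object; only `non_validated` appears in the text."""

    validated: list
    non_validated: list


def inspect(values, known_names):
    # Partition values by exact membership in the set of registered names.
    validated = [v for v in values if v in known_names]
    non_validated = [v for v in values if v not in known_names]
    return ToyInspectResult(validated, non_validated)


known_names = {"Project A", "Project B"}
projects = ["Project A", "Project B", "Project D", "Project E"]

before = inspect(projects, known_names)
# Extend the registry with the names that could not be validated.
known_names.update(before.non_validated)
after = inspect(projects, known_names)

print(before.non_validated)
print(after.non_validated)
```

After the registry is extended, a second inspection reports no non-validated values.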

Validate features#

When calling Artifact.from_... and Collection.from_..., features are automatically validated. Validated features are grouped in “feature sets” indexed by “slots”. For a basic example, see Tutorial: Features & labels.

For an overview of data formats used to model different data types, see Data types.

Bulk validation#

# clean up test instance
!lamin delete --force test-validate
!rm -r test-validate
💡 deleting instance testuser1/test-validate
❗ manually delete your stored data: /home/runner/work/lamindb/lamindb/docs/test-validate