Validate data#
To make data more reusable, LaminDB validates categorical variables against registries (CanValidate) in addition to standard data validation. Data validation also includes curating non-validated data and amending registries.
What does “validated” mean?
Validated values refer to registered records of your working LaminDB instance. They are categorical values that exist in a specified field (defaulting to “name”) of a registry.
For instance, if “Experiment 1” has been registered as the “name” of a ULabel record, it is a validated value of ULabel.name.
The CanValidate methods validate(), inspect(), standardize(), and from_values() primarily take two arguments: “values” and “field”. The argument “values” takes an iterable of input categorical values, and the argument “field” takes a typed field of a registry.
What does “Bionty-validated” mean?
CanValidate methods validate against the in-house reference, that is, the records in your instance.
For biological registries (see Manage biological registries), you can extend validation to public references using the bionty() object: bionty_object = Registry.bionty()
Note that from_values() is aware of Bionty.
What to do for non-validated values?
Be aware that if you are working with a newly initialized instance, nothing is validated because no records have been registered yet.
Run inspect() to get instructions on how to register non-validated values. You may need to standardize your values, fix typos, or simply register them.
Setup#
!lamin init --storage ./test-validate --schema bionty
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:44)
✅ saved: Storage(id='VxOzNoO1', root='/home/runner/work/lamindb/lamindb/docs/test-validate', type='local', updated_at=2023-09-26 15:21:44, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/test-validate
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
💡 loaded instance: testuser1/test-validate (lamindb 0.54.2)
ln.settings.verbosity = "info"
Pre-populate registries:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.File(df, description="test data").save()
ln.ULabel(name="Project A").save()
ln.ULabel(name="Project B").save()
lb.Disease.from_bionty(ontology_id="MONDO:0004975").save()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
✅ storing file 'I73iSjC6ZpklqmIms30A' at '.lamindb/I73iSjC6ZpklqmIms30A.parquet'
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0004975'
💡 also saving parents of Disease(id='nUmxpVTE', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimer disease|Alzheimer's dementia|AD|Alzheimer's disease|presenile and senile dementia|Alzheimers disease|Alzheimer dementia|Alzheimers dementia', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', updated_at=2023-09-26 15:21:48, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 2 Disease records from Bionty matching ontology_id: 'MONDO:0005574', 'MONDO:0001627'
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
💡 also saving parents of Disease(id='eFC0pPw5', name='tauopathy', ontology_id='MONDO:0005574', description='Neurodegenerative Disorders Involving Deposition Of Abnormal Tau Protein Isoforms (Tau Proteins) In Neurons And Glial Cells In The Brain. Pathological Aggregations Of Tau Proteins Are Associated With Mutation Of The Tau Gene On Chromosome 17 In Patients With Alzheimer Disease; Dementia; Parkinsonian Disorders; Progressive Supranuclear Palsy (Supranuclear Palsy, Progressive); And Corticobasal Degeneration.', updated_at=2023-09-26 15:21:50, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005559'
💡 also saving parents of Disease(id='Wjh61VHa', name='neurodegenerative disease', ontology_id='MONDO:0005559', synonyms='neurodegenerative disease|brain degeneration|central nervous system neurodegenerative disorder|degenerative disorder of central nervous system|central nervous system degenerative disorder', description='A Disorder Of The Central Nervous System Characterized By Gradual And Progressive Loss Of Neural Tissue And Neurologic Function.', updated_at=2023-09-26 15:21:51, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002602'
💡 also saving parents of Disease(id='0a0qu0VB', name='central nervous system disorder', ontology_id='MONDO:0002602', synonyms='central nervous disease|central nervous system disease|central nervous system disease or disorder|disorder of central nervous system|disease or disorder of central nervous system|disease of the central nervous system|central nervous system disorder|disease of central nervous system|CNS disorder', description='A Disease Involving The Central Nervous System.', updated_at=2023-09-26 15:21:57, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0005071'
💡 also saving parents of Disease(id='SqY0Lhs3', name='nervous system disorder', ontology_id='MONDO:0005071', synonyms='nervous system disease or disorder|disease or disorder of nervous system|neurologic disorder|neurological disease|neurologic disease|neurological disorder|nervous system disorder|disease of nervous system|nervous system disease|disorder of nervous system', description='A Non-Neoplastic Or Neoplastic Disorder That Affects The Brain, Spinal Cord, Or Peripheral Nerves.', updated_at=2023-09-26 15:21:58, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0700096'
💡 also saving parents of Disease(id='M1oV973s', name='human disease', ontology_id='MONDO:0700096', synonyms='human disease or disorder', updated_at=2023-09-26 15:21:59, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0000001'
💡 also saving parents of Disease(id='V5J5981p', name='dementia', ontology_id='MONDO:0001627', synonyms='dementia|dementia (disease)', description='Loss Of Intellectual Abilities Interfering With An Individual'S Social And Occupational Functions. Causes Include Alzheimer'S Disease, Brain Injuries, Brain Tumors, And Vascular Disorders.', updated_at=2023-09-26 15:21:50, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002039'
💡 also saving parents of Disease(id='i3Tt7xD4', name='cognitive disorder', ontology_id='MONDO:0002039', synonyms='cognitive disorder|cognitive disease', description='A Disease Affects Cognitive Processes.', updated_at=2023-09-26 15:22:02, bionty_source_id='SEOn', created_by_id='DzTjkKse')
✅ created 1 Disease record from Bionty matching ontology_id: 'MONDO:0002025'
💡 also saving parents of Disease(id='WSZtc3B1', name='psychiatric disorder', ontology_id='MONDO:0002025', synonyms='Psychiatric disease|Psychiatric disorder', description='A Disorder Characterized By Behavioral And/Or Psychological Abnormalities, Often Accompanied By Physical Symptoms. The Symptoms May Cause Clinically Significant Distress Or Impairment In Social And Occupational Areas Of Functioning. Representative Examples Include Anxiety Disorders, Cognitive Disorders, Mood Disorders And Schizophrenia.', updated_at=2023-09-26 15:22:03, bionty_source_id='SEOn', created_by_id='DzTjkKse')
Standard validation#
Name duplication#
Creating a record with the same name field automatically returns the existing record:
ln.ULabel(name="Project A")
✅ loaded ULabel record with exact same name: 'Project A'
ULabel(id='xNtmMuXu', name='Project A', updated_at=2023-09-26 15:21:47, created_by_id='DzTjkKse')
Bulk creating records using from_values() only returns validated records:
Note: Bionty-validated terms are also created with .from_values(), see Manage biological registries for details.
projects = ["Project A", "Project B", "Project D", "Project E"]
ln.ULabel.from_values(projects)
✅ loaded 2 ULabel records matching name: 'Project A', 'Project B'
❗ did not create ULabel records for 2 non-validated names: 'Project D', 'Project E'
[ULabel(id='xNtmMuXu', name='Project A', updated_at=2023-09-26 15:21:47, created_by_id='DzTjkKse'),
ULabel(id='blQQ1wgQ', name='Project B', updated_at=2023-09-26 15:21:47, created_by_id='DzTjkKse')]
(Versioned records also account for version in addition to name. Also see: idempotency.)
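The name-based behavior described above can be illustrated without a database. The following is a minimal sketch in plain Python, not LaminDB's implementation: `LabelRegistrySketch`, `get_or_create`, and the dict-based records are hypothetical names introduced for illustration, and the sketch stores new records immediately, whereas in LaminDB a new record is only persisted when you call `.save()`.

```python
from typing import Dict, Iterable, List


class LabelRegistrySketch:
    """Hypothetical in-memory stand-in for a registry keyed by name."""

    def __init__(self) -> None:
        self._by_name: Dict[str, dict] = {}

    def get_or_create(self, name: str) -> dict:
        # Return the existing record if the name is already registered,
        # otherwise create and store a new one (simplification: no save step).
        record = self._by_name.get(name)
        if record is None:
            record = {"name": name}
            self._by_name[name] = record
        return record

    def from_values(self, names: Iterable[str]) -> List[dict]:
        # Only names that are already registered yield records;
        # unknown names are skipped rather than created.
        return [self._by_name[name] for name in names if name in self._by_name]


labels = LabelRegistrySketch()
project_a = labels.get_or_create("Project A")
labels.get_or_create("Project B")

# Asking for the same name again returns the very same record.
same_project_a = labels.get_or_create("Project A")

found = labels.from_values(["Project A", "Project B", "Project D", "Project E"])
print([record["name"] for record in found])
```

Keying records by name is what makes repeated construction idempotent in this sketch: running the same cell twice leaves the registry unchanged.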
Data duplication#
Creating a file or dataset with the same content automatically returns the existing record:
ln.File(df, description="same data")
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
❗ returning existing file with same hash: File(id='I73iSjC6ZpklqmIms30A', suffix='.parquet', accessor='DataFrame', description='test data', size=2722, hash='c4-_GGzusVRzpnj5sWc6IQ', hash_type='md5', updated_at=2023-09-26 15:21:47, storage_id='VxOzNoO1', created_by_id='DzTjkKse')
File(id='I73iSjC6ZpklqmIms30A', suffix='.parquet', accessor='DataFrame', description='test data', size=2722, hash='c4-_GGzusVRzpnj5sWc6IQ', hash_type='md5', updated_at=2023-09-26 15:21:47, storage_id='VxOzNoO1', created_by_id='DzTjkKse')
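Content-based deduplication of this kind can be sketched with a hash lookup. This is an illustrative sketch only: the log above reports hash_type='md5', but the exact bytes that are hashed and the URL-safe base64 encoding used here are assumptions, and `FileRegistrySketch` and `content_hash` are hypothetical names.

```python
import base64
import hashlib
from typing import Dict


def content_hash(data: bytes) -> str:
    # Assumption: MD5 digest, URL-safe base64 encoded, padding stripped.
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


class FileRegistrySketch:
    """Hypothetical registry that identifies files by their content hash."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, dict] = {}

    def register(self, data: bytes, description: str) -> dict:
        key = content_hash(data)
        existing = self._by_hash.get(key)
        if existing is not None:
            # Same content: return the existing record, ignore the new description.
            return existing
        record = {"hash": key, "description": description}
        self._by_hash[key] = record
        return record


files = FileRegistrySketch()
first = files.register(b"A,B\n1,2\n", description="test data")
second = files.register(b"A,B\n1,2\n", description="same data")
print(second["description"])  # test data
```

As in the output above, the second call returns the record that carries the original description, because identity is decided by content rather than by metadata.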
Schema-based validation#
Type checks, constraint checks, and Django validators can be configured in the schema.
Registry-based validation#
validate() validates passed values against reference values in a registry.
It returns a boolean vector indicating whether a value has an exact match in the reference values.
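Conceptually, exact-match validation amounts to a membership test per input value. The following plain-Python sketch illustrates the idea under the assumption of case-sensitive string comparison; `validate_sketch` and `registered_names` are hypothetical and do not reflect how LaminDB queries the registry.

```python
from typing import Iterable, List


def validate_sketch(values: Iterable[str], registered: Iterable[str]) -> List[bool]:
    """Return one boolean per input value: True only for an exact match."""
    reference = set(registered)  # exact, case-sensitive comparison
    return [value in reference for value in values]


# Names assumed to be registered in a hypothetical registry
registered_names = ["Alzheimer disease", "dementia"]

flags = validate_sketch(
    ["Alzheimer disease", "Alzheimer's disease", "AD"], registered_names
)
print(flags)  # [True, False, False]
```

Synonyms and spelling variants do not count as matches here, which is why only the first value is flagged as validated.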
Using dedicated registries#
For instance, lnschema_bionty types basic biological entities: every entity has its own registry, a Python class.
By default, the first string field is used for validation. For Disease, it’s name:
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = lb.Disease.validate(diseases)
validated
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
array([ True, False, False])
Validate against a non-default field:
lb.Disease.validate(
["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], lb.Disease.ontology_id
)
✅ 1 term (33.30%) is validated for ontology_id
❗ 2 terms (66.70%) are not validated for ontology_id: MONDO:0004976, MONDO:0004977
array([ True, False, False])
Using the ULabel registry#
Any entity that doesn’t have its dedicated registry (“is not typed”) can be validated & registered using ULabel:
ln.ULabel.validate(["Project A", "Project B", "Project C"])
✅ 2 terms (66.70%) are validated for name
❗ 1 term (33.30%) is not validated for name: Project C
array([ True, True, False])
Inspect & standardize#
When validation fails, you can call inspect() to figure out what to do.
inspect() applies the same definition of validation as validate(), but returns a richer return value, InspectResult. Most importantly, it logs recommended curation steps that would render the data validated.
result = lb.Disease.inspect(diseases)
✅ 1 term (33.30%) is validated for name
❗ 2 terms (66.70%) are not validated for name: Alzheimer's disease, AD
detected 2 terms with synonyms: Alzheimer's disease, AD
→ standardize terms via .standardize()
In this case, it suggests calling standardize() to standardize synonyms:
lb.Disease.standardize(result.non_validated)
💡 standardized 2/2 terms
['Alzheimer disease', 'Alzheimer disease']
For more, see Manage biological registries.
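The idea behind synonym standardization can be sketched as a lookup from synonym to registered name. This sketch assumes synonyms are stored as a pipe-separated string, as in the Disease records logged during setup, and that values without a known synonym pass through unchanged; `build_synonym_map` and `standardize_sketch` are hypothetical helpers, not LaminDB functions.

```python
from typing import Dict, Iterable, List


def build_synonym_map(records: Dict[str, str]) -> Dict[str, str]:
    """Map each synonym to its record name; synonyms are pipe-separated."""
    mapping: Dict[str, str] = {}
    for name, synonyms in records.items():
        for synonym in synonyms.split("|"):
            if synonym:
                mapping[synonym] = name
    return mapping


def standardize_sketch(values: Iterable[str], records: Dict[str, str]) -> List[str]:
    mapping = build_synonym_map(records)
    # Assumption: values without a known synonym are returned unchanged.
    return [mapping.get(value, value) for value in values]


# Record name -> pipe-separated synonyms (subset of the record shown in setup)
records = {
    "Alzheimer disease": "Alzheimer disease|Alzheimer's dementia|AD|Alzheimer's disease"
}

standardized = standardize_sketch(
    ["Alzheimer's disease", "AD", "unknown term"], records
)
print(standardized)
# ['Alzheimer disease', 'Alzheimer disease', 'unknown term']
```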
Extend registries#
Sometimes, we simply want to register new records to extend the content of registries:
result = ln.ULabel.inspect(projects)
✅ 2 terms (50.00%) are validated for name
❗ 2 terms (50.00%) are not validated for name: Project D, Project E
couldn't validate 2 terms: 'Project D', 'Project E'
→ if you are sure, create new records via ln.ULabel() and save to your registry
new_labels = [ln.ULabel(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels
[ULabel(id='amvR7dMf', name='Project D', updated_at=2023-09-26 15:22:03, created_by_id='DzTjkKse'),
ULabel(id='VvpaMtMM', name='Project E', updated_at=2023-09-26 15:22:03, created_by_id='DzTjkKse')]
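The inspect-then-register loop above can be summarized in a database-free sketch. `InspectSketch` and `inspect_sketch` are hypothetical stand-ins that only mirror the validated and non_validated split; they are not the real InspectResult and they skip the logging and synonym detection that inspect() performs.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class InspectSketch:
    """Hypothetical, reduced stand-in for an inspection result."""

    validated: List[str] = field(default_factory=list)
    non_validated: List[str] = field(default_factory=list)


def inspect_sketch(values: Iterable[str], registered: Set[str]) -> InspectSketch:
    result = InspectSketch()
    for value in values:
        if value in registered:
            result.validated.append(value)
        else:
            result.non_validated.append(value)
    return result


registered = {"Project A", "Project B"}
projects = ["Project A", "Project B", "Project D", "Project E"]

result = inspect_sketch(projects, registered)
print(result.non_validated)  # ['Project D', 'Project E']

# Extending the registry with the non-validated names makes every value validate.
registered.update(result.non_validated)
after = inspect_sketch(projects, registered)
print(after.non_validated)  # []
```

Registering values this way is only appropriate once you are sure they are not typos or synonyms of existing records, which is the check that the logged recommendations ask for.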
Validate features#
When calling File.from_... and Dataset.from_..., features are automatically validated.
Validated features are grouped in “feature sets” indexed by “slots”.
For a basic example, see Tutorial: Features & labels.
For an overview of data formats used to model different data types, see Data types.
!lamin delete --force test-validate
!rm -r test-validate
💡 deleting instance testuser1/test-validate
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-validate.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/test-validate