
Tutorial: Features & labels#

In the previous tutorial (Tutorial: Files & datasets), we learned how to leverage basic metadata for files & datasets to access data (query, search, stage & load).

Here, we walk through annotating & validating data with features & labels to improve:

  1. Finding data: Which datasets measured expression of cell marker CD14? Which characterized cell line K562? Which datasets have a test & train split? Etc.

  2. Using data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

What was LaminDB’s most basic inspiration?

The pydata family of objects is at the heart of most data science, ML & comp bio workflows: DataFrame, AnnData, pytorch.DataLoader, zarr.Array, pyarrow.Table, xarray.Dataset, …

And still, we couldn’t find a tool to link these objects to context so that they could be analyzed in context!

Context relevant for analyses includes anything that’s needed to interpret & model data.

So, lamindb.File and lamindb.Dataset track:

  • data sources, data transformations, models, users & pipelines that performed transformations (provenance)

  • any entity of the domain in which data is generated and modeled (features & labels)

import lamindb as ln
import pandas as pd
💡 loaded instance: testuser1/lamin-tutorial (lamindb 0.54.2)
ln.settings.verbosity = "hint"

Register metadata#

Register labels#

We study 3 species of the Iris plant: setosa, versicolor & virginica.

Let’s populate the universal (untyped) label registry (ULabel) for them:

labels = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(labels)

labels
[ULabel(id='Y17iL2s4', name='setosa', updated_at=2023-09-26 15:21:37, created_by_id='DzTjkKse'),
 ULabel(id='LwtaMeN8', name='versicolor', updated_at=2023-09-26 15:21:37, created_by_id='DzTjkKse'),
 ULabel(id='MwQdHusM', name='virginica', updated_at=2023-09-26 15:21:37, created_by_id='DzTjkKse')]

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are species labels:

parent = ln.ULabel(name="is_species")
parent.save()

for label in labels:
    label.parents.add(parent)

parent.view_parents(with_children=True)
[graph: label hierarchy for is_species]

ULabel enables you to manage an in-house ontology covering all kinds of untyped labels.

If you’d like to leverage pre-built typed ontologies for basic biological entities in the same way, see: Manage biological registries.

In addition to species, we’d like to track the studies that produced the data:

study_label = ln.ULabel(name="study0")
study_label.save()
Why label a data batch by study?

We can then

  1. query all files linked to this study

  2. model it as a confounder when we analyze similar data from a follow-up study, concatenating data using the label as a feature in a data matrix

Register features#

For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.

When we integrate data batches, feature names will label columns that store data.

Let’s create and save two Feature records to identify measurements of the iris species label and the study:

ln.Feature(name="iris_species_name", type="category").save()
ln.Feature(name="study_name", type="category").save()
# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()

Run an ML model#

Let’s now run an ML model that transforms the images into 4 high-level features.

def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform)
    input_dataset = ln.Dataset.filter(name="Iris study 1").one()
    input_paths = [file.stage() for file in input_dataset.files.all()]
    # transform the data...
    output_dataset = ln.dev.datasets.df_iris_in_meter_study1()
    return output_dataset


df = run_ml_model()
✅ saved: Transform(id='lVezwMHBMQBDVN', name='Petal & sepal regressor', type='pipeline', updated_at=2023-09-26 15:21:38, created_by_id='DzTjkKse')
✅ saved: Run(id='sOvO9n3IEgdKHEmIflcz', run_at=2023-09-26 15:21:38, transform_id='lVezwMHBMQBDVN', created_by_id='DzTjkKse')
💡 adding file BRfjqMzHzYHUZsILu4ns as input for run sOvO9n3IEgdKHEmIflcz, adding parent transform NJvdsWWbJlZSz8
💡 adding file RLMC2dAZ9JnEqLOIMOGz as input for run sOvO9n3IEgdKHEmIflcz, adding parent transform NJvdsWWbJlZSz8
💡 adding file ihJ13ctAlqiqZqQDzE8Q as input for run sOvO9n3IEgdKHEmIflcz, adding parent transform NJvdsWWbJlZSz8
💡 adding file mTqifcqWCYKp4M0VLrm7 as input for run sOvO9n3IEgdKHEmIflcz, adding parent transform NJvdsWWbJlZSz8
💡 adding file svCQu04U5Z30DIxFdDk3 as input for run sOvO9n3IEgdKHEmIflcz, adding parent transform NJvdsWWbJlZSz8

The output is a dataframe:

df.head()
sepal_length sepal_width petal_length petal_width iris_species_name
0 0.051 0.035 0.014 0.002 setosa
1 0.049 0.030 0.014 0.002 setosa
2 0.047 0.032 0.013 0.002 setosa
3 0.046 0.031 0.015 0.002 setosa
4 0.050 0.036 0.014 0.002 setosa

And this is the ML pipeline that produced the dataframe:

ln.run_context.transform.view_parents()
[graph: parent transforms of the ML pipeline]

Register the output data#

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?

Use the unit field of Feature. In the above example, you’d do:

for feature in new_features:
    if feature.type == "number":
        feature.unit = "m"  # SI unit for meters
        feature.save()

We can now validate & register the dataframe in one line by creating a Dataset record:

dataset = ln.Dataset.from_df(
    df,
    name="Iris study 1 - transformed",
    description="Iris dataset after measuring sepal & petal metrics",
)

dataset.save()
5 terms (100.00%) are validated for name
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/zRURBRjsfLwoa7EusdDR.parquet')
❗ record with similar name exist! did you mean to load it?
id __ratio__
name
Iris study 1 muderX8fqRejxX0zyfwQ 90.0
✅ saved 1 feature set for slot: 'columns'
✅ storing file 'zRURBRjsfLwoa7EusdDR' at '.lamindb/zRURBRjsfLwoa7EusdDR.parquet'

Feature sets#

Get an overview of linked features:

dataset.features
Features:
  columns: FeatureSet(id='O0b6eK1pPgiu9B1tPNbf', n=5, registry='core.Feature', hash='J147ES6s3GeHtK-Tg-1P', updated_at=2023-09-26 15:21:39, created_by_id='DzTjkKse')
    sepal_width (number)
    petal_length (number)
    sepal_length (number)
    🔗 iris_species_name (0, core.ULabel): 
    petal_width (number)

You’ll see that they’re always grouped in sets that correspond to records of FeatureSet.

Why does LaminDB model feature sets, not just features?
  1. Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you’ll only need to store 1M instead of 1M x 20k = 20B links.

  2. Interpretation: Model protein panels, gene panels, etc.

  3. Data integration: Feature sets provide the currency that determines whether two datasets can be easily concatenated.

These reasons do not hold for label sets. Hence, LaminDB does not model label sets.
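
To make the performance point concrete, here is a back-of-the-envelope sketch in plain Python, using the numbers from the example above (no LaminDB required):

```python
# Link rows needed to annotate 1M samples with a 20k-transcript panel.
n_samples = 1_000_000
n_features = 20_000

# Linking every sample to every feature individually:
per_feature_links = n_samples * n_features  # 20 billion link rows

# Linking every sample to one shared feature set instead:
feature_set_links = n_samples  # 1 million link rows

print(per_feature_links // feature_set_links)  # 20,000x fewer links
```

The shared feature set acts as a single record that all samples point to, which is what keeps the link table small.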

A slot provides a string key to access feature sets. It’s typically the accessor within the registered data object, here pd.DataFrame.columns.

Let’s use it to access all linked features:

dataset.features["columns"].df()
name type modality_id unit description registries synonyms updated_at created_by_id
id
tfrtIcvtdvLi sepal_width number None None None None None 2023-09-26 15:21:39 DzTjkKse
P5RLdIIzOyYQ petal_length number None None None None None 2023-09-26 15:21:39 DzTjkKse
iXgSUYD1x7dQ sepal_length number None None None None None 2023-09-26 15:21:39 DzTjkKse
767l0u4ILnR1 iris_species_name category None None None core.ULabel None 2023-09-26 15:21:39 DzTjkKse
vBc92uVGAgrz petal_width number None None None None None 2023-09-26 15:21:39 DzTjkKse

There is one categorical feature; let’s add the species labels:

species_labels = ln.ULabel.filter(parents__name="is_species").all()
dataset.labels.add(species_labels, feature=features.iris_species_name)

Let’s now add study labels:

dataset.labels.add(study_label, feature=features.study_name)
✅ linked new feature 'study_name' together with new feature set FeatureSet(id='I1OhfOD7s31keofRNevs', n=1, registry='core.Feature', hash='kyauvLuPMvJfa2PEAOOz', updated_at=2023-09-26 15:21:39, modality_id='eyaG5esI', created_by_id='DzTjkKse')

In addition to the columns feature set, we now have an external feature set:

dataset.features
Features:
  columns: FeatureSet(id='O0b6eK1pPgiu9B1tPNbf', n=5, registry='core.Feature', hash='J147ES6s3GeHtK-Tg-1P', updated_at=2023-09-26 15:21:39, created_by_id='DzTjkKse')
    sepal_width (number)
    petal_length (number)
    sepal_length (number)
    🔗 iris_species_name (3, core.ULabel): 'versicolor', 'virginica', 'setosa'
    petal_width (number)
  external: FeatureSet(id='I1OhfOD7s31keofRNevs', n=1, registry='core.Feature', hash='kyauvLuPMvJfa2PEAOOz', updated_at=2023-09-26 15:21:39, modality_id='eyaG5esI', created_by_id='DzTjkKse')
    🔗 study_name (1, core.ULabel): 'study0'

This is the context for our file:

dataset.describe()
Dataset(id='zRURBRjsfLwoa7EusdDR', name='Iris study 1 - transformed', description='Iris dataset after measuring sepal & petal metrics', hash='S5_Yac5-etSiUSboJ3XETA', updated_at=2023-09-26 15:21:39)

Provenance:
  🧩 transform: Transform(id='lVezwMHBMQBDVN', name='Petal & sepal regressor', type='pipeline', updated_at=2023-09-26 15:21:39, created_by_id='DzTjkKse')
  👣 run: Run(id='sOvO9n3IEgdKHEmIflcz', run_at=2023-09-26 15:21:38, transform_id='lVezwMHBMQBDVN', created_by_id='DzTjkKse')
  📄 file: File(id='zRURBRjsfLwoa7EusdDR', suffix='.parquet', accessor='DataFrame', description='See dataset zRURBRjsfLwoa7EusdDR', size=5334, hash='S5_Yac5-etSiUSboJ3XETA', hash_type='md5', updated_at=2023-09-26 15:21:39, storage_id='at1jQOFk', transform_id='lVezwMHBMQBDVN', run_id='sOvO9n3IEgdKHEmIflcz', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:30)
Features:
  columns: FeatureSet(id='O0b6eK1pPgiu9B1tPNbf', n=5, registry='core.Feature', hash='J147ES6s3GeHtK-Tg-1P', updated_at=2023-09-26 15:21:39, created_by_id='DzTjkKse')
    sepal_width (number)
    petal_length (number)
    sepal_length (number)
    🔗 iris_species_name (3, core.ULabel): 'versicolor', 'virginica', 'setosa'
    petal_width (number)
  external: FeatureSet(id='I1OhfOD7s31keofRNevs', n=1, registry='core.Feature', hash='kyauvLuPMvJfa2PEAOOz', updated_at=2023-09-26 15:21:39, modality_id='eyaG5esI', created_by_id='DzTjkKse')
    🔗 study_name (1, core.ULabel): 'study0'
Labels:
  🏷️ ulabels (4, core.ULabel): 'versicolor', 'virginica', 'setosa', 'study0'
dataset.file.view_flow()
[graph: data flow for the file]

See the database content:

ln.view(registries=["Feature", "FeatureSet", "ULabel"])
Feature
name type modality_id unit description registries synonyms updated_at created_by_id
id
8xmQYKO8Rv8z study_name category None None None core.ULabel None 2023-09-26 15:21:39 DzTjkKse
767l0u4ILnR1 iris_species_name category None None None core.ULabel None 2023-09-26 15:21:39 DzTjkKse
vBc92uVGAgrz petal_width number None None None None None 2023-09-26 15:21:39 DzTjkKse
P5RLdIIzOyYQ petal_length number None None None None None 2023-09-26 15:21:39 DzTjkKse
tfrtIcvtdvLi sepal_width number None None None None None 2023-09-26 15:21:39 DzTjkKse
iXgSUYD1x7dQ sepal_length number None None None None None 2023-09-26 15:21:39 DzTjkKse
FeatureSet
name n type modality_id registry hash updated_at created_by_id
id
I1OhfOD7s31keofRNevs None 1 None eyaG5esI core.Feature kyauvLuPMvJfa2PEAOOz 2023-09-26 15:21:39 DzTjkKse
O0b6eK1pPgiu9B1tPNbf None 5 None None core.Feature J147ES6s3GeHtK-Tg-1P 2023-09-26 15:21:39 DzTjkKse
fmr4e45Ioc0LK2wCgpWT None 2 None eyaG5esI core.Feature Xb6HtBAgcBLdL6Er3kaU 2023-09-26 15:21:38 DzTjkKse
ULabel
name description reference reference_type updated_at created_by_id
id
fiiP4wAB study0 None None None 2023-09-26 15:21:37 DzTjkKse
b5LYyjWj is_species None None None 2023-09-26 15:21:37 DzTjkKse
LwtaMeN8 versicolor None None None 2023-09-26 15:21:37 DzTjkKse
MwQdHusM virginica None None None 2023-09-26 15:21:37 DzTjkKse
Y17iL2s4 setosa None None None 2023-09-26 15:21:37 DzTjkKse

Manage follow-up data#

Assume that a couple of weeks later, we receive a new batch of data from a follow-up study (study 2).

Let’s track a new analysis:

ln.track()
💡 notebook imports: lamindb==0.54.2 pandas==2.1.1
✅ saved: Transform(id='dMtrt8YMSdl6z8', name='Tutorial: Features & labels', short_name='tutorial1', version='0', type=notebook, updated_at=2023-09-26 15:21:40, created_by_id='DzTjkKse')
✅ saved: Run(id='cEEjVhqdtzrZUPkSG1FP', run_at=2023-09-26 15:21:40, transform_id='dMtrt8YMSdl6z8', created_by_id='DzTjkKse')

Register a joint dataset#

Assume we already ran all preprocessing, including the ML model.

We get a DataFrame and store it as a file:

df = ln.dev.datasets.df_iris_in_meter_study2()
ln.File.from_df(df, description="Iris study 2 - transformed").save()
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/q2s7E5ECDwsik0Tidr0w.parquet')
5 terms (100.00%) are validated for name
✅ loaded: FeatureSet(id='O0b6eK1pPgiu9B1tPNbf', n=5, registry='core.Feature', hash='J147ES6s3GeHtK-Tg-1P', updated_at=2023-09-26 15:21:39, created_by_id='DzTjkKse')
✅ storing file 'q2s7E5ECDwsik0Tidr0w' at '.lamindb/q2s7E5ECDwsik0Tidr0w.parquet'

Let’s load both data batches as files:

dataset1 = ln.Dataset.filter(name="Iris study 1 - transformed").one()

file1 = dataset1.file
file2 = ln.File.filter(description="Iris study 2 - transformed").one()

We can now store the joint dataset:

dataset = ln.Dataset([file1, file2], name="Iris flower study 1 & 2 - transformed")

dataset.save()
❗ record with similar name exist! did you mean to load it?
id __ratio__
name
Iris study 1 - transformed zRURBRjsfLwoa7EusdDR 95.0
💡 adding file zRURBRjsfLwoa7EusdDR as input for run cEEjVhqdtzrZUPkSG1FP, adding parent transform lVezwMHBMQBDVN

Auto-concatenate data batches#

Because both data batches measured the same validated feature set, we can auto-concatenate the sharded dataset.

This means we can load it as if it were stored in a single file:

dataset.load().tail()
sepal_length sepal_width petal_length petal_width iris_species_name
70 0.059 0.032 0.048 0.018 versicolor
71 0.061 0.028 0.040 0.013 versicolor
72 0.063 0.025 0.049 0.015 versicolor
73 0.061 0.028 0.047 0.012 versicolor
74 0.064 0.029 0.043 0.013 versicolor
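
Conceptually, this auto-concatenation is analogous to concatenating two pandas DataFrames that share the same (validated) column set — a minimal sketch, independent of LaminDB, with made-up batch values:

```python
import pandas as pd

# Two hypothetical data batches with an identical, validated column set
batch1 = pd.DataFrame(
    {"sepal_length": [0.051, 0.049], "iris_species_name": ["setosa", "setosa"]}
)
batch2 = pd.DataFrame(
    {"sepal_length": [0.059, 0.061], "iris_species_name": ["versicolor", "versicolor"]}
)

# Because the columns match, the batches concatenate into one frame
joint = pd.concat([batch1, batch2], ignore_index=True)
print(joint.shape)  # (4, 2)
```

Matching feature sets are what guarantees that this concatenation is well-defined across batches.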

We can also access & query the underlying two file objects:

dataset.files.list()
[File(id='q2s7E5ECDwsik0Tidr0w', suffix='.parquet', accessor='DataFrame', description='Iris study 2 - transformed', size=5392, hash='7vb00xEGyKBwoAHXGPyjBw', hash_type='md5', updated_at=2023-09-26 15:21:41, storage_id='at1jQOFk', transform_id='dMtrt8YMSdl6z8', run_id='cEEjVhqdtzrZUPkSG1FP', created_by_id='DzTjkKse'),
 File(id='zRURBRjsfLwoa7EusdDR', suffix='.parquet', accessor='DataFrame', description='See dataset zRURBRjsfLwoa7EusdDR', size=5334, hash='S5_Yac5-etSiUSboJ3XETA', hash_type='md5', updated_at=2023-09-26 15:21:39, storage_id='at1jQOFk', transform_id='lVezwMHBMQBDVN', run_id='sOvO9n3IEgdKHEmIflcz', created_by_id='DzTjkKse')]

Or look at their data flow:

dataset.view_flow()
[graph: data flow for the joint dataset]

Or look at the database:

ln.view()
Dataset
name description version hash reference reference_type transform_id run_id file_id initial_version_id updated_at created_by_id
id
o7qgUD16XfmrM38aGCdN Iris flower study 1 & 2 - transformed None None ZHn7Zoltu4Y7JQiMOc9y None None dMtrt8YMSdl6z8 cEEjVhqdtzrZUPkSG1FP None None 2023-09-26 15:21:41 DzTjkKse
zRURBRjsfLwoa7EusdDR Iris study 1 - transformed Iris dataset after measuring sepal & petal met... None S5_Yac5-etSiUSboJ3XETA None None lVezwMHBMQBDVN sOvO9n3IEgdKHEmIflcz zRURBRjsfLwoa7EusdDR None 2023-09-26 15:21:39 DzTjkKse
muderX8fqRejxX0zyfwQ Iris study 1 50 image files and metadata None qW6WbNWDV_xiHYqAhku7 None None NJvdsWWbJlZSz8 wgHeuqipA5Ujff8alH6Y None None 2023-09-26 15:21:34 DzTjkKse
Feature
name type modality_id unit description registries synonyms updated_at created_by_id
id
8xmQYKO8Rv8z study_name category None None None core.ULabel None 2023-09-26 15:21:39 DzTjkKse
767l0u4ILnR1 iris_species_name category None None None core.ULabel None 2023-09-26 15:21:39 DzTjkKse
vBc92uVGAgrz petal_width number None None None None None 2023-09-26 15:21:39 DzTjkKse
P5RLdIIzOyYQ petal_length number None None None None None 2023-09-26 15:21:39 DzTjkKse
tfrtIcvtdvLi sepal_width number None None None None None 2023-09-26 15:21:39 DzTjkKse
iXgSUYD1x7dQ sepal_length number None None None None None 2023-09-26 15:21:39 DzTjkKse
FeatureSet
name n type modality_id registry hash updated_at created_by_id
id
I1OhfOD7s31keofRNevs None 1 None eyaG5esI core.Feature kyauvLuPMvJfa2PEAOOz 2023-09-26 15:21:39 DzTjkKse
O0b6eK1pPgiu9B1tPNbf None 5 None None core.Feature J147ES6s3GeHtK-Tg-1P 2023-09-26 15:21:39 DzTjkKse
fmr4e45Ioc0LK2wCgpWT None 2 None eyaG5esI core.Feature Xb6HtBAgcBLdL6Er3kaU 2023-09-26 15:21:38 DzTjkKse
File
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
q2s7E5ECDwsik0Tidr0w at1jQOFk None .parquet DataFrame Iris study 2 - transformed None 5392 7vb00xEGyKBwoAHXGPyjBw md5 dMtrt8YMSdl6z8 cEEjVhqdtzrZUPkSG1FP None 2023-09-26 15:21:41 DzTjkKse
zRURBRjsfLwoa7EusdDR at1jQOFk None .parquet DataFrame See dataset zRURBRjsfLwoa7EusdDR None 5334 S5_Yac5-etSiUSboJ3XETA md5 lVezwMHBMQBDVN sOvO9n3IEgdKHEmIflcz None 2023-09-26 15:21:39 DzTjkKse
ihJ13ctAlqiqZqQDzE8Q qBDFItXr iris_studies/study0_raw_images/iris-125b6645e0... .jpg None None None 21418 Bsko3tdvYxWq_JB5fdoIbw md5 NJvdsWWbJlZSz8 wgHeuqipA5Ujff8alH6Y None 2023-09-26 15:21:34 DzTjkKse
BRfjqMzHzYHUZsILu4ns qBDFItXr iris_studies/study0_raw_images/iris-0797945218... .jpg None None None 19842 v3G73F-8oISKexASY3RvUw md5 NJvdsWWbJlZSz8 wgHeuqipA5Ujff8alH6Y None 2023-09-26 15:21:34 DzTjkKse
mTqifcqWCYKp4M0VLrm7 qBDFItXr iris_studies/study0_raw_images/iris-0fec175448... .jpg None None None 10773 d3I43842Sd5PUMgFBrgjKA md5 NJvdsWWbJlZSz8 wgHeuqipA5Ujff8alH6Y None 2023-09-26 15:21:34 DzTjkKse
svCQu04U5Z30DIxFdDk3 qBDFItXr iris_studies/study0_raw_images/iris-0337d20a3b... .jpg None None None 14529 e0Gct8LodEyQzNwy1glOPA md5 NJvdsWWbJlZSz8 wgHeuqipA5Ujff8alH6Y None 2023-09-26 15:21:34 DzTjkKse
RLMC2dAZ9JnEqLOIMOGz qBDFItXr iris_studies/study0_raw_images/iris-0f133861ea... .jpg None None None 12201 1uP_ORc_dQpcuk3oKkIOLw md5 NJvdsWWbJlZSz8 wgHeuqipA5Ujff8alH6Y None 2023-09-26 15:21:34 DzTjkKse
Modality
name ontology_id abbr synonyms description molecule instrument measurement updated_at created_by_id
id
eyaG5esI meta None None None None None None None 2023-09-26 15:21:38 DzTjkKse
Run
transform_id run_at created_by_id reference reference_type
id
wgHeuqipA5Ujff8alH6Y NJvdsWWbJlZSz8 2023-09-26 15:21:32 DzTjkKse None None
sOvO9n3IEgdKHEmIflcz lVezwMHBMQBDVN 2023-09-26 15:21:38 DzTjkKse None None
cEEjVhqdtzrZUPkSG1FP dMtrt8YMSdl6z8 2023-09-26 15:21:40 DzTjkKse None None
Storage
root type region updated_at created_by_id
id
qBDFItXr s3://lamindb-dev-datasets s3 us-east-1 2023-09-26 15:21:33 DzTjkKse
at1jQOFk /home/runner/work/lamindb/lamindb/docs/lamin-t... local None 2023-09-26 15:21:30 DzTjkKse
Transform
name short_name version type reference reference_type initial_version_id updated_at created_by_id
id
dMtrt8YMSdl6z8 Tutorial: Features & labels tutorial1 0 notebook None None None 2023-09-26 15:21:41 DzTjkKse
lVezwMHBMQBDVN Petal & sepal regressor None None pipeline None None None 2023-09-26 15:21:39 DzTjkKse
NJvdsWWbJlZSz8 Tutorial: Files & datasets tutorial 0 notebook None None None 2023-09-26 15:21:34 DzTjkKse
ULabel
name description reference reference_type updated_at created_by_id
id
fiiP4wAB study0 None None None 2023-09-26 15:21:37 DzTjkKse
b5LYyjWj is_species None None None 2023-09-26 15:21:37 DzTjkKse
LwtaMeN8 versicolor None None None 2023-09-26 15:21:37 DzTjkKse
MwQdHusM virginica None None None 2023-09-26 15:21:37 DzTjkKse
Y17iL2s4 setosa None None None 2023-09-26 15:21:37 DzTjkKse
User
handle email name updated_at
id
DzTjkKse testuser1 testuser1@lamin.ai Test User1 2023-09-26 15:21:30

This is it! 😅

If you’re interested, please check out guides & use cases or make an issue on GitHub to discuss.

Appendix#

Manage metadata#

Hierarchical ontologies#

Say we want to express that study0 belongs to project1 and is a study; we can use .parents:

project1 = ln.ULabel(name="project1")
project1.save()
is_study = ln.ULabel(name="is_study")
is_study.save()
study_label.parents.set([project1, is_study])
study_label.view_parents()
[graph: parent labels of study0]

For more info, see view_parents().

Avoid duplicates#

We already created a project1 label before, let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")

label.save()
✅ loaded ULabel record with exact same name: 'project1'

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB will warn you upon creating a record about potential duplicates.

Say, we spell “project 1” with a white space:

ln.ULabel(name="project 1")
❗ record with similar name exist! did you mean to load it?
id __ratio__
name
project1 ss5hWPw8 94.117647
ULabel(id='MiP4f59T', name='project 1', created_by_id='DzTjkKse')

To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.

You can switch it off for performance gains via upon_create_search_names.
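
For example, a settings fragment for switching it off (assuming the setting name referenced above; requires a loaded LaminDB instance):

```python
import lamindb as ln

# skip the similarity search on record creation (faster bulk inserts)
ln.settings.upon_create_search_names = False
```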

Update & delete records#

label = ln.ULabel.filter(name="project1").first()

label
ULabel(id='ss5hWPw8', name='project1', updated_at=2023-09-26 15:21:41, created_by_id='DzTjkKse')
label.name = "project1a"

label.save()

label
ULabel(id='ss5hWPw8', name='project1a', updated_at=2023-09-26 15:21:41, created_by_id='DzTjkKse')
label.delete()
(2, {'lnschema_core.ULabel_parents': 1, 'lnschema_core.ULabel': 1})

Manage storage#

Change default storage#

The default storage location is:

ln.settings.storage  # your "working data directory"
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations#

ln.Storage.filter().df()
root type region updated_at created_by_id
id
at1jQOFk /home/runner/work/lamindb/lamindb/docs/lamin-t... local None 2023-09-26 15:21:30 DzTjkKse
qBDFItXr s3://lamindb-dev-datasets s3 us-east-1 2023-09-26 15:21:33 DzTjkKse

Set verbosity#

To reduce the number of logging messages, set verbosity:

ln.settings.verbosity = 3  # only show info, no hints
# clean up what we wrote in this notebook
!lamin delete --force lamin-tutorial
!rm -r lamin-tutorial
💡 deleting instance testuser1/lamin-tutorial
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--lamin-tutorial.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/lamin-tutorial