Tutorial: Features & labels#

In the previous tutorial (Tutorial: Artifacts), we learned how to leverage basic metadata for artifacts to access data (query, search, cache & load).

Here, we walk through annotating & validating data with features & labels to improve:

  1. Finding data: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.

  2. Using data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

What was LaminDB’s most basic inspiration?

The pydata family of objects is at the heart of most data science, ML & comp bio workflows: DataFrame, AnnData, pytorch.DataLoader, zarr.Array, pyarrow.Table, xarray.Dataset, …

And still, we couldn’t find a tool to link these objects to context so that they could be analyzed in context!

Context relevant for analyses includes anything that’s needed to interpret & model data.

So, lamindb.Artifact and lamindb.Collection track:

  • data sources, data transformations, models, users & pipelines that performed transformations (provenance)

  • any entity of the domain in which data is generated and modeled (features & labels)

import lamindb as ln
import pandas as pd
💡 connected lamindb: anonymous/lamin-tutorial
ln.settings.verbosity = "hint"

Register metadata#

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. organism) and labels represent measurement values (e.g. iris setosa, iris versicolor, iris virginica).
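
To make the distinction concrete, here is a minimal plain-Python sketch (it does not use LaminDB's API): a feature is treated as a column name and labels as the values allowed in that column. The names `registered_labels` and `find_unregistered` are hypothetical and only serve the illustration.

```python
# Plain-Python illustration of features (dimensions) vs. labels (values).
# Hypothetical registry: feature name -> set of registered labels.
registered_labels = {
    "iris_organism_name": {"setosa", "versicolor", "virginica"},
}


def find_unregistered(feature: str, values: list[str]) -> list[str]:
    """Return the values that are not registered labels for the feature."""
    allowed = registered_labels.get(feature, set())
    return sorted(set(values) - allowed)


# "versicolour" is a typo and is therefore flagged
typos = find_unregistered(
    "iris_organism_name", ["setosa", "versicolour", "virginica"]
)
print(typos)
```

A check of this kind is what makes typos in sampled labels visible before data is registered.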

Register labels#

We study 3 organisms of the Iris plant: setosa, versicolor & virginica.

Let’s populate the universal label registry (ULabel) for them:

organisms = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(organisms)
organisms
[ULabel(uid='v9CL1k1r', name='setosa', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1),
 ULabel(uid='uY6M63yK', name='versicolor', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1),
 ULabel(uid='vqItCsbY', name='virginica', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1)]

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are organism labels:

is_organism = ln.ULabel(name="is_organism")
is_organism.save()
is_organism.children.set(organisms)
is_organism.view_parents(with_children=True)

ULabel enables you to manage an in-house ontology for all kinds of untyped labels.

If you’d like to leverage pre-built typed ontologies for basic biological entities in the same way, see: Manage biological registries.
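
As a rough mental model of such a parent-child hierarchy, the following plain-Python sketch groups labels under a parent label and retrieves them by the parent's name. It is not how LaminDB stores labels; `children_of`, `set_children` and `labels_with_parent` are hypothetical names used for illustration only.

```python
# Hypothetical in-memory stand-in for parent/child relations between labels.
children_of: dict[str, set[str]] = {}


def set_children(parent: str, children: list[str]) -> None:
    """Record that every label in `children` is a child of `parent`."""
    children_of[parent] = set(children)


def labels_with_parent(parent: str) -> list[str]:
    """Return all labels grouped under `parent`, sorted by name."""
    return sorted(children_of.get(parent, set()))


set_children("is_organism", ["setosa", "versicolor", "virginica"])
organism_names = labels_with_parent("is_organism")
print(organism_names)
```

Grouping labels under a parent keeps them easy to retrieve once the registry contains many unrelated labels.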

In addition to organism, we’d like to track the studies that produced the data:

studies = [ln.ULabel(name=name) for name in ["study0", "study1", "study2"]]
ln.save(studies)
is_study = ln.ULabel(name="is_study")
is_study.save()
is_study.children.set(studies)
is_study.view_parents(with_children=True)
Why label a dataset by study?

We can then

  1. query all artifacts linked to this study

  2. model it as a confounder when we analyze similar data from a follow-up study, and concatenate data using the label as a feature in a data matrix
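
The second point can be sketched in plain Python, assuming two tiny, made-up tables of measurements (the rows and the helper `concat_with_label` are hypothetical). The study label of each table becomes a value in a new `study_name` column of the joint table:

```python
# Two small hypothetical measurement tables, one per study.
study0_rows = [{"sepal_length": 0.051, "iris_organism_name": "setosa"}]
study1_rows = [{"sepal_length": 0.049, "iris_organism_name": "setosa"}]


def concat_with_label(tables: dict[str, list[dict]], feature: str) -> list[dict]:
    """Concatenate tables, storing each table's label in the column `feature`."""
    joint = []
    for label, rows in tables.items():
        for row in rows:
            # copy the row so that the input tables stay unchanged
            joint.append({**row, feature: label})
    return joint


joint_rows = concat_with_label(
    {"study0": study0_rows, "study1": study1_rows}, feature="study_name"
)
for row in joint_rows:
    print(row)
```

With the study stored as a column, a downstream model can include it as a covariate.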

Register features#

For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.

When we integrate datasets, feature names will label columns that store data.

Let’s create and save two Feature records to identify measurements of the iris organism label and the study:

ln.Feature(name="iris_organism_name", type="category").save()
ln.Feature(name="study_name", type="category").save()
# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()

Run an ML model#

Let’s now run an ML model that transforms the images into 4 high-level features.

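def run_ml_model() -> pd.DataFrame:
    # register the pipeline as a transform and track this run
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform=transform)
    # query version 1 of the image collection
    input_data = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
    # download the input files into the current working directory
    input_paths = [
        path.download_to(path.name) for path in input_data.artifacts[0].path.glob("*")
    ]
    # apply ML model (here, an example dataset stands in for the model output)
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data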


df = run_ml_model()
💡 saved: Transform(uid='fDh8lNBRLF8PQtTP', name='Petal & sepal regressor', type='pipeline', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
💡 saved: Run(uid='4MhEkENdu3viywVPDhqf', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_4MhEkENdu3viywVPDhqf.txt

The output is a dataframe:

df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_name
0         0.051        0.035         0.014        0.002              setosa
1         0.049        0.030         0.014        0.002              setosa
2         0.047        0.032         0.013        0.002              setosa
3         0.046        0.031         0.015        0.002              setosa
4         0.050        0.036         0.014        0.002              setosa

And this is the ML pipeline that produced the dataframe:

ln.core.run_context.transform.view_parents()

Register the output data#

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?

Use the unit field of Feature. In the above example, you’d do:

for feature in features:
    if feature.type == "number":
        feature.unit = "m"  # SI unit for meters
        feature.save()
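
To illustrate why a recorded unit matters, here is a small plain-Python sketch of a conversion to meters. The table `TO_METER` and the function `to_meter` are hypothetical helpers, not part of LaminDB:

```python
# Hypothetical conversion factors to meters for a few length units.
TO_METER = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def to_meter(value: float, unit: str) -> float:
    """Convert a length given in `unit` to meters."""
    if unit not in TO_METER:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_METER[unit]


# 5.1 cm and 0.051 m describe the same sepal length
in_meter = to_meter(5.1, "cm")
print(round(in_meter, 3))
```

Without a unit on the feature, a value of 5.1 in one dataset and 0.051 in another could not be reconciled reliably.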

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet')
✅ storing artifact 'Qtz8SzzWcsj7dLiZvYxJ' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet'
Artifact(uid='Qtz8SzzWcsj7dLiZvYxJ', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='gWxE_pTwNrrYutXJaGbHqA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:37 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)
artifact.features.add(new_features)

Feature sets#

Get an overview of linked features:

artifact.features
Features:
  columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
    🔗 iris_organism_name (0, core.ULabel): 
    sepal_length (number)
    sepal_width (number)
    petal_length (number)
    petal_width (number)

You’ll see that they’re always grouped in sets that correspond to records of FeatureSet.

Why does LaminDB model feature sets, not just features?
  1. Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you’ll only need to store 1M instead of 1M x 20k = 20B links.

  2. Interpretation: Model protein panels, gene panels, etc.

  3. Data integration: Feature sets provide the currency that determines whether two collections can be easily concatenated.

These reasons do not hold for label sets. Hence, LaminDB does not model label sets.
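
The arithmetic behind the performance argument, and the idea of a feature set as a comparable unit, can be sketched as follows. The hashing scheme is an assumption made for illustration (LaminDB's actual hash may be computed differently), and `feature_set_hash` is a hypothetical helper:

```python
import hashlib

n_samples = 1_000_000
n_features = 20_000

# Linking every sample to every feature individually
links_per_feature = n_samples * n_features
# Linking every sample to one shared feature set
links_per_feature_set = n_samples

print(f"{links_per_feature:,} vs. {links_per_feature_set:,} links")


def feature_set_hash(feature_names: list[str]) -> str:
    """Order-independent hash of a set of feature names (illustrative scheme)."""
    joined = ",".join(sorted(set(feature_names)))
    return hashlib.sha256(joined.encode()).hexdigest()


# The same features in a different order yield the same hash
same = feature_set_hash(["sepal_length", "sepal_width"]) == feature_set_hash(
    ["sepal_width", "sepal_length"]
)
print(same)
```

Comparing one hash per dataset is enough to decide whether two datasets measured the same features.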

A slot provides a string key to access feature sets. It’s typically the accessor within the registered data object, here pd.DataFrame.columns.

Let’s use it to access all linked features:

artifact.features["columns"].df()
uid name type unit description registries synonyms created_at updated_at created_by_id
id
1 7Ohz9SVp18RZ iris_organism_name category None None core.ULabel None 2024-05-01 18:50:30.969360+00:00 2024-05-01 18:50:31.950162+00:00 1
3 Om8Mw6pFkzIo sepal_length number None None None None 2024-05-01 18:50:37.714235+00:00 2024-05-01 18:50:37.714263+00:00 1
4 iwR15EkIJUKV sepal_width number None None None None 2024-05-01 18:50:37.714387+00:00 2024-05-01 18:50:37.714403+00:00 1
5 UL8Sh0vgFmp7 petal_length number None None None None 2024-05-01 18:50:37.714513+00:00 2024-05-01 18:50:37.714528+00:00 1
6 e1QfngsThY1R petal_width number None None None None 2024-05-01 18:50:37.714638+00:00 2024-05-01 18:50:37.714652+00:00 1

There is one categorical feature. Let’s add the organism labels to it:

organism_labels = ln.ULabel.filter(parents__name="is_organism").all()
artifact.labels.add(organism_labels, feature=features.iris_organism_name)

Let’s now add study labels:

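# create a lookup object so that we can access labels with auto-complete
ulabels = ln.ULabel.lookup()
artifact.labels.add(ulabels.study0, feature=features.study_name)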
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)

In addition to the columns feature set, we now have an external feature set:

artifact.features
Features:
  columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
    🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
    sepal_length (number)
    sepal_width (number)
    petal_length (number)
    petal_width (number)
  external: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature')
    🔗 study_name (1, core.ULabel): 'study0'

This is the context for our artifact:

artifact.describe()
Artifact(uid='Qtz8SzzWcsj7dLiZvYxJ', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='gWxE_pTwNrrYutXJaGbHqA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:37 UTC)

Provenance:
  📎 storage: Storage(uid='sXnIoPYEDMeP', root='/home/runner/work/lamindb/lamindb/docs/lamin-tutorial', type='local', instance_uid='5WuFt3cW4zRx')
  📎 transform: Transform(uid='fDh8lNBRLF8PQtTP', name='Petal & sepal regressor', type='pipeline')
  📎 run: Run(uid='4MhEkENdu3viywVPDhqf', started_at=2024-05-01 18:50:32 UTC, is_consecutive=True)
  📎 created_by: User(uid='00000000', handle='anonymous')
Features:
  columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
    🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
    sepal_length (number)
    sepal_width (number)
    petal_length (number)
    petal_width (number)
  external: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature')
    🔗 study_name (1, core.ULabel): 'study0'
Labels:
  📎 ulabels (4, core.ULabel): 'virginica', 'study0', 'setosa', 'versicolor'
artifact.view_lineage()

See the database content:

ln.view(registries=["Feature", "FeatureSet", "ULabel"])
Feature
uid name type unit description registries synonyms created_at updated_at created_by_id
id
6 e1QfngsThY1R petal_width number None None None None 2024-05-01 18:50:37.714638+00:00 2024-05-01 18:50:37.714652+00:00 1
5 UL8Sh0vgFmp7 petal_length number None None None None 2024-05-01 18:50:37.714513+00:00 2024-05-01 18:50:37.714528+00:00 1
4 iwR15EkIJUKV sepal_width number None None None None 2024-05-01 18:50:37.714387+00:00 2024-05-01 18:50:37.714403+00:00 1
3 Om8Mw6pFkzIo sepal_length number None None None None 2024-05-01 18:50:37.714235+00:00 2024-05-01 18:50:37.714263+00:00 1
1 7Ohz9SVp18RZ iris_organism_name category None None core.ULabel None 2024-05-01 18:50:30.969360+00:00 2024-05-01 18:50:31.950162+00:00 1
2 Nrcl3zMwkL9q study_name category None None core.ULabel None 2024-05-01 18:50:30.985072+00:00 2024-05-01 18:50:31.591442+00:00 1
FeatureSet
uid name n type registry hash created_at updated_at created_by_id
id
5 qpVYAIhNMtboyMZ1urZR None 1 None core.Feature wrVx9j2goWy1Emt2Eq9y 2024-05-01 18:50:37.816761+00:00 2024-05-01 18:50:37.821525+00:00 1
4 eAHIaYzbO8KemSbYPD7O None 5 None core.Feature sXUHyxw8y_MHRRkdVPak 2024-05-01 18:50:37.743324+00:00 2024-05-01 18:50:37.743350+00:00 1
2 p0rX4YMjebjGiZaQItKm None 2 None core.Feature AeIcn9GpMb8154_Qhc4Z 2024-05-01 18:50:31.956398+00:00 2024-05-01 18:50:32.415491+00:00 1
ULabel
uid name description reference reference_type created_at updated_at created_by_id
id
8 qIie3ota is_study None None None 2024-05-01 18:50:30.910140+00:00 2024-05-01 18:50:30.910164+00:00 1
7 pf47KQzy study2 None None None 2024-05-01 18:50:30.900888+00:00 2024-05-01 18:50:30.900903+00:00 1
6 iUgtZ2xK study1 None None None 2024-05-01 18:50:30.900785+00:00 2024-05-01 18:50:30.900801+00:00 1
5 mPISxrYD study0 None None None 2024-05-01 18:50:30.900654+00:00 2024-05-01 18:50:30.900686+00:00 1
4 wU4im1Pn is_organism None None None 2024-05-01 18:50:30.802939+00:00 2024-05-01 18:50:30.802967+00:00 1
3 vqItCsbY virginica None None None 2024-05-01 18:50:30.764203+00:00 2024-05-01 18:50:30.764217+00:00 1
2 uY6M63yK versicolor None None None 2024-05-01 18:50:30.764098+00:00 2024-05-01 18:50:30.764114+00:00 1

Manage follow-up data#

Assume that a couple of weeks later, we receive a new dataset in a follow-up study 2.

Let’s track a new analysis:

ln.settings.transform.stem_uid = "dMtrt8YMSdl6"
ln.settings.transform.version = "1"

ln.track()
💡 Assuming editor is Jupyter Lab.
💡 Attaching notebook metadata
💡 notebook imports: lamindb==0.71.0 pandas==1.5.3
💡 saved: Transform(uid='dMtrt8YMSdl65zKv', name='Tutorial: Features & labels', key='tutorial2', version='1', type='notebook', updated_at=2024-05-01 18:50:39 UTC, created_by_id=1)
💡 saved: Run(uid='4kwedLR19YL3udJmPMoF', transform_id=3, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_4kwedLR19YL3udJmPMoF.txt

Register a joint collection#

Assume we already ran all preprocessing including the ML model.

We get a DataFrame and store it as an artifact:

df = ln.core.datasets.df_iris_in_meter_study2()
ln.Artifact.from_df(df, description="Iris study 2 - transformed").save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/F5bhv575rauQVoDjgtOM.parquet')
✅ storing artifact 'F5bhv575rauQVoDjgtOM' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/F5bhv575rauQVoDjgtOM.parquet'
Artifact(uid='F5bhv575rauQVoDjgtOM', suffix='.parquet', accessor='DataFrame', description='Iris study 2 - transformed', size=5397, hash='Sz8pY3NO5XxBNEis7uq_ZQ', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:40 UTC, storage_id=1, transform_id=3, run_id=3, created_by_id=1)

Let’s load it:

artifact2 = ln.Artifact.filter(description="Iris study 2 - transformed").one()

We can now store the joint collection:

collection = ln.Collection(
    [artifact, artifact2], name="Iris flower study 1 & 2 - transformed"
)
collection.save()
✅ loaded: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature', hash='sXUHyxw8y_MHRRkdVPak', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)
💡 adding artifact [5] as input for run 3, adding parent transform 2

Auto-concatenate datasets#

Because both datasets measured the same validated feature set, we can auto-concatenate the collection:

collection.load().tail()
     sepal_length  sepal_width  petal_length  petal_width  iris_organism_name
145         0.067        0.030         0.052        0.023           virginica
146         0.063        0.025         0.050        0.019           virginica
147         0.065        0.030         0.052        0.020           virginica
148         0.062        0.034         0.054        0.023           virginica
149         0.059        0.030         0.051        0.018           virginica
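
The condition for auto-concatenation can be sketched in plain Python: tables are only joined if they share the same set of columns. This is a simplified illustration, not LaminDB's implementation; `concat_if_same_features` and the one-row tables are hypothetical.

```python
def concat_if_same_features(tables: list[list[dict]]) -> list[dict]:
    """Concatenate tables row-wise if they all have the same set of columns."""
    column_sets = {frozenset(row) for rows in tables for row in rows}
    if len(column_sets) > 1:
        raise ValueError("tables measure different feature sets")
    return [row for rows in tables for row in rows]


# Hypothetical one-row excerpts of two studies with identical columns
study1 = [{"sepal_length": 0.051, "iris_organism_name": "setosa"}]
study2 = [{"sepal_length": 0.059, "iris_organism_name": "virginica"}]
joint = concat_if_same_features([study1, study2])
print(len(joint))
```

If one table lacked a column, the function would raise instead of silently producing missing values.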

We can also access & query the underlying two artifact objects:

collection.artifacts.df()
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
5 Qtz8SzzWcsj7dLiZvYxJ 1 None .parquet DataFrame Iris study 1 - after measuring sepal & petal m... None 5347 gWxE_pTwNrrYutXJaGbHqA md5 None None 2 2 1 True 2024-05-01 18:50:37.729786+00:00 2024-05-01 18:50:37.729812+00:00 1
6 F5bhv575rauQVoDjgtOM 1 None .parquet DataFrame Iris study 2 - transformed None 5397 Sz8pY3NO5XxBNEis7uq_ZQ md5 None None 3 3 1 True 2024-05-01 18:50:40.394478+00:00 2024-05-01 18:50:40.394507+00:00 1

Or look at their data lineage:

collection.view_lineage()

Or look at the database:

ln.view()
Artifact
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
6 F5bhv575rauQVoDjgtOM 1 None .parquet DataFrame Iris study 2 - transformed None 5397 Sz8pY3NO5XxBNEis7uq_ZQ md5 NaN None 3 3 1 True 2024-05-01 18:50:40.394478+00:00 2024-05-01 18:50:40.394507+00:00 1
5 Qtz8SzzWcsj7dLiZvYxJ 1 None .parquet DataFrame Iris study 1 - after measuring sepal & petal m... None 5347 gWxE_pTwNrrYutXJaGbHqA md5 NaN None 2 2 1 True 2024-05-01 18:50:37.729786+00:00 2024-05-01 18:50:37.729812+00:00 1
4 KN5tE6KT9QdtP29X8w5C 2 iris_studies/study2_raw_images None None None 665518 PX8Vt9T28y-uCEJO1tKm7A md5-d 51.0 None 1 1 1 False 2024-05-01 18:50:28.003536+00:00 2024-05-01 18:50:28.003564+00:00 1
3 3ksHD7VWCbUJld5fiYzd 2 iris_studies/study1_raw_images None None None 640617 j61W__GgImA18CKrIf7FVg md5-d 49.0 None 1 1 1 False 2024-05-01 18:50:27.385608+00:00 2024-05-01 18:50:27.385637+00:00 1
2 7zPAagBwZTJzQqVLFyqB 2 iris_studies/study0_raw_images None None None 656692 wVYKPpEsmmrqSpAZIRXCFg md5-d 51.0 None 1 1 1 False 2024-05-01 18:50:26.600312+00:00 2024-05-01 18:50:26.600341+00:00 1
1 o5tTv4TU4OZGSWnULdR9 2 iris_studies/study0_raw_images/meta.csv .csv None None None 4355 ZpAEpN0iFYH6vjZNigic7g md5 NaN None 1 1 1 False 2024-05-01 18:50:25.683074+00:00 2024-05-01 18:50:25.683102+00:00 1
Collection
uid name description version hash reference reference_type transform_id run_id artifact_id visibility created_at updated_at created_by_id
id
4 RAz3v1ppt4kR6lFG6Wck Iris flower study 1 & 2 - transformed None None r405fYDOPfwROfWo23b1 None None 3 3 None 1 2024-05-01 18:50:40.438900+00:00 2024-05-01 18:50:40.438924+00:00 1
3 32cBtnkazA7BtJB4pisU Iris collection Another 50 images 3 T-U8z2Zi5rFYdAD9pzmS None None 1 1 None 1 2024-05-01 18:50:28.016434+00:00 2024-05-01 18:50:28.016457+00:00 1
2 32cBtnkazA7BtJB4Db1u Iris collection Another 50 images 2 5cCK6ZLOPB0cV3tyeZup None None 1 1 None 1 2024-05-01 18:50:27.398726+00:00 2024-05-01 18:50:27.398750+00:00 1
1 32cBtnkazA7BtJB4YV3x Iris collection 50 image files and metadata 1 WwFLpNFmK8GMC2dSGj1W None None 1 1 None 1 2024-05-01 18:50:26.800334+00:00 2024-05-01 18:50:26.800363+00:00 1
Feature
uid name type unit description registries synonyms created_at updated_at created_by_id
id
6 e1QfngsThY1R petal_width number None None None None 2024-05-01 18:50:37.714638+00:00 2024-05-01 18:50:37.714652+00:00 1
5 UL8Sh0vgFmp7 petal_length number None None None None 2024-05-01 18:50:37.714513+00:00 2024-05-01 18:50:37.714528+00:00 1
4 iwR15EkIJUKV sepal_width number None None None None 2024-05-01 18:50:37.714387+00:00 2024-05-01 18:50:37.714403+00:00 1
3 Om8Mw6pFkzIo sepal_length number None None None None 2024-05-01 18:50:37.714235+00:00 2024-05-01 18:50:37.714263+00:00 1
1 7Ohz9SVp18RZ iris_organism_name category None None core.ULabel None 2024-05-01 18:50:30.969360+00:00 2024-05-01 18:50:31.950162+00:00 1
2 Nrcl3zMwkL9q study_name category None None core.ULabel None 2024-05-01 18:50:30.985072+00:00 2024-05-01 18:50:31.591442+00:00 1
FeatureSet
uid name n type registry hash created_at updated_at created_by_id
id
5 qpVYAIhNMtboyMZ1urZR None 1 None core.Feature wrVx9j2goWy1Emt2Eq9y 2024-05-01 18:50:37.816761+00:00 2024-05-01 18:50:37.821525+00:00 1
4 eAHIaYzbO8KemSbYPD7O None 5 None core.Feature sXUHyxw8y_MHRRkdVPak 2024-05-01 18:50:37.743324+00:00 2024-05-01 18:50:37.743350+00:00 1
2 p0rX4YMjebjGiZaQItKm None 2 None core.Feature AeIcn9GpMb8154_Qhc4Z 2024-05-01 18:50:31.956398+00:00 2024-05-01 18:50:32.415491+00:00 1
Run
uid transform_id started_at finished_at created_by_id json report_id environment_id is_consecutive reference reference_type created_at
id
1 gYXIXnIw7o7ys3cRQgEl 1 2024-05-01 18:50:22.948429+00:00 None 1 None None None True None None 2024-05-01 18:50:22.948563+00:00
2 4MhEkENdu3viywVPDhqf 2 2024-05-01 18:50:32.462950+00:00 None 1 None None None True None None 2024-05-01 18:50:32.463085+00:00
3 4kwedLR19YL3udJmPMoF 3 2024-05-01 18:50:39.337405+00:00 None 1 None None None True None None 2024-05-01 18:50:39.337657+00:00
Storage
uid root description type region instance_uid created_at updated_at created_by_id
id
2 EzvEnPnH s3://lamindb-dev-datasets None s3 us-east-1 pZ1VQkyD3haH 2024-05-01 18:50:25.627040+00:00 2024-05-01 18:50:25.627073+00:00 1
1 sXnIoPYEDMeP /home/runner/work/lamindb/lamindb/docs/lamin-t... None local None 5WuFt3cW4zRx 2024-05-01 18:50:20.106918+00:00 2024-05-01 18:50:20.106942+00:00 1
Transform
uid name key version description type latest_report_id source_code_id reference reference_type created_at updated_at created_by_id
id
3 dMtrt8YMSdl65zKv Tutorial: Features & labels tutorial2 1 None notebook None None None None 2024-05-01 18:50:39.330633+00:00 2024-05-01 18:50:39.330677+00:00 1
2 fDh8lNBRLF8PQtTP Petal & sepal regressor None None None pipeline None None None None 2024-05-01 18:50:32.458950+00:00 2024-05-01 18:50:32.458976+00:00 1
1 NJvdsWWbJlZS6K79 Tutorial: Artifacts tutorial 0 None notebook None None None None 2024-05-01 18:50:22.941759+00:00 2024-05-01 18:50:22.941802+00:00 1
ULabel
uid name description reference reference_type created_at updated_at created_by_id
id
8 qIie3ota is_study None None None 2024-05-01 18:50:30.910140+00:00 2024-05-01 18:50:30.910164+00:00 1
7 pf47KQzy study2 None None None 2024-05-01 18:50:30.900888+00:00 2024-05-01 18:50:30.900903+00:00 1
6 iUgtZ2xK study1 None None None 2024-05-01 18:50:30.900785+00:00 2024-05-01 18:50:30.900801+00:00 1
5 mPISxrYD study0 None None None 2024-05-01 18:50:30.900654+00:00 2024-05-01 18:50:30.900686+00:00 1
4 wU4im1Pn is_organism None None None 2024-05-01 18:50:30.802939+00:00 2024-05-01 18:50:30.802967+00:00 1
3 vqItCsbY virginica None None None 2024-05-01 18:50:30.764203+00:00 2024-05-01 18:50:30.764217+00:00 1
2 uY6M63yK versicolor None None None 2024-05-01 18:50:30.764098+00:00 2024-05-01 18:50:30.764114+00:00 1
User
uid handle name created_at updated_at
id
1 00000000 anonymous None 2024-05-01 18:50:20.102450+00:00 2024-05-01 18:50:20.102474+00:00

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix#

Manage metadata#

Avoid duplicates#

Let’s create a label “project1”:

ln.ULabel(name="project1").save()
ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)

We already created a project1 label above. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()
❗ loaded ULabel record with same name: 'project1' (disable via `ln.settings.upon_create_search_names`)
ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say, we spell “project 1” with a white space:

ln.ULabel(name="project 1")
❗ record with similar name exist! did you mean to load it?
uid score
name
project1 hYWK36Qu 94.1
ULabel(uid='4bma7rl4', name='project 1', created_by_id=1)

To avoid inserting duplicates when creating new records, a search checks whether a similar record already exists.

You can switch it off for performance gains via upon_create_search_names.
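
The idea of such a similarity search can be sketched with Python's standard library. This is only an illustration and not LaminDB's actual search: the function `similar_names` and the threshold of 80 are assumptions.

```python
from difflib import SequenceMatcher


def similar_names(
    new_name: str, existing: list[str], threshold: float = 80.0
) -> list[tuple[str, float]]:
    """Return existing names whose similarity score (0-100) reaches the threshold."""
    hits = []
    for name in existing:
        score = 100 * SequenceMatcher(None, name, new_name).ratio()
        if score >= threshold:
            hits.append((name, round(score, 1)))
    # most similar first
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


matches = similar_names("project 1", ["project1", "study0"])
print(matches)
```

Because every new name is compared against existing names, such a search costs time on large registries, which is why switching it off can speed up record creation.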

Update & delete records#

label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)
label.name = "project1a"
label.save()
label
ULabel(uid='hYWK36Qu', name='project1a', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)
label.delete()
(1, {'lnschema_core.ULabel': 1})

Manage storage#

Change default storage#

The default storage location is:

ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations#

ln.Storage.df()
uid root description type region instance_uid created_at updated_at created_by_id
id
2 EzvEnPnH s3://lamindb-dev-datasets None s3 us-east-1 pZ1VQkyD3haH 2024-05-01 18:50:25.627040+00:00 2024-05-01 18:50:25.627073+00:00 1
1 sXnIoPYEDMeP /home/runner/work/lamindb/lamindb/docs/lamin-t... None local None 5WuFt3cW4zRx 2024-05-01 18:50:20.106918+00:00 2024-05-01 18:50:20.106942+00:00 1

Set verbosity#

To reduce the number of logging messages, set verbosity:

ln.settings.verbosity = 3  # only show info, no hints
# clean up what we wrote in this notebook
!lamin delete --force lamin-tutorial
!rm -r lamin-tutorial