Tutorial: Features & labels#

In the previous tutorial (Tutorial: Artifacts), we learned how to leverage basic metadata for artifacts to access data (query, search, cache & load).

Here, we walk through annotating & validating data with features & labels to improve:

Finding data: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.
Using data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

import lamindb as ln
import pandas as pd

💡 connected lamindb: anonymous/lamin-tutorial

ln.settings.verbosity = "hint"

Register metadata#

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. organism) and labels represent measurement values (e.g. iris setosa, iris versicolor, iris virginica).

Register labels#

We study 3 organisms of the Iris plant: setosa, versicolor & virginica.

Let’s populate the universal label registry (ULabel) for them:

organisms = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(organisms)
organisms

[ULabel(uid='v9CL1k1r', name='setosa', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1),
 ULabel(uid='uY6M63yK', name='versicolor', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1),
 ULabel(uid='vqItCsbY', name='virginica', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1)]

Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are organism labels:

is_organism = ln.ULabel(name="is_organism")
is_organism.save()
is_organism.children.set(organisms)
is_organism.view_parents(with_children=True)

_images/298a173da8139df0fabdc83a20d09a0255e8415557bf277128dddcdf9c568cac.svg

ULabel enables you to manage an in-house ontology for all kinds of untyped labels.

If you’d like to leverage pre-built typed ontologies for basic biological entities in the same way, see: Manage biological registries.
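To make the parent-child idea concrete, here is a minimal plain-Python sketch of a label hierarchy. It does not use LaminDB: the `children` dict and the `descendants` helper are hypothetical names that only illustrate how grouping labels under `is_organism` lets you retrieve all organism labels at once.

```python
# Conceptual sketch only: a tiny label hierarchy as a dict of parent -> children.
# `children` and `descendants` are illustrative names, not LaminDB API.
children = {
    "is_organism": ["setosa", "versicolor", "virginica"],
    "is_study": ["study0", "study1", "study2"],
}


def descendants(label: str) -> list[str]:
    """Return all labels below `label`, depth-first."""
    result = []
    for child in children.get(label, []):
        result.append(child)
        result.extend(descendants(child))
    return result


organism_labels = descendants("is_organism")
print(organism_labels)
```

A label without children, such as `setosa`, simply yields an empty list.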

In addition to organism, we’d like to track the studies that produced the data:

studies = [ln.ULabel(name=name) for name in ["study0", "study1", "study2"]]
ln.save(studies)
is_study = ln.ULabel(name="is_study")
is_study.save()
is_study.children.set(studies)
is_study.view_parents(with_children=True)

_images/8fcb815386be83ace82d16f71658885acda660a65c1717d2cddf6cd7525776b6.svg

Register features#

For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.

When we integrate datasets, feature names will label columns that store data.

Let’s create and save two Feature records to identify measurements of the iris organism label and the study:

ln.Feature(name="iris_organism_name", type="category").save()
ln.Feature(name="study_name", type="category").save()
# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()

Validate & link labels#

We already looked at the metadata for study0 before:

meta_artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images/meta.csv").one()
meta = meta_artifact.load(index_col=0)  # load a dataframe
meta.head()

💡 you can auto-track these data as a run input by calling `ln.track()`

	0	1
0	iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce...	setosa
1	iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710...	versicolor
2	iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf...	versicolor
3	iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109...	setosa
4	iris-bdae8314e4385d8e2322abd8e63a82758a9063c77...	virginica

Validate metadata#

Depending on the data generation process, such metadata might or might not match the labels we defined in our registries.

Let’s validate the labels by mapping the values stored in the artifact against the ULabel registry:

ln.ULabel.validate(meta["1"], field="name")

✅ 3 terms (100.00%) are validated for name

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

Everything passed and no fixes are needed!

If validation doesn’t pass, standardize() and inspect() will help standardize data.
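Conceptually, validation checks each value against the names in a registry, and standardization maps known variants onto registered names. The plain-Python sketch below illustrates this idea. The `registered` set, the `synonyms` mapping, and both helper functions are hypothetical; they do not reproduce how LaminDB implements `validate()` or `standardize()`.

```python
# Conceptual sketch only, not LaminDB's implementation.
registered = {"setosa", "versicolor", "virginica"}
# Hypothetical synonym table mapping variants to registered names.
synonyms = {"Setosa": "setosa", "iris versicolor": "versicolor"}


def validate(values: list[str]) -> list[bool]:
    """Flag each value as registered (True) or not (False)."""
    return [value in registered for value in values]


def standardize(values: list[str]) -> list[str]:
    """Replace known variants by their registered name; leave others as-is."""
    return [synonyms.get(value, value) for value in values]


raw = ["setosa", "Setosa", "iris versicolor", "virginca"]
print(validate(raw))    # the two variants and the typo are not registered
fixed = standardize(raw)
print(validate(fixed))  # only the typo remains unvalidated
```

Standardization resolves the known variants, while a genuine typo still fails validation and needs a manual fix.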

Label artifacts#

Labeling a set of artifacts is useful if we want to make the set queryable among a large number of artifacts.

You can label an artifact by calling artifact.labels.add() and passing one or more label records.

Let’s do this based on the labels in meta.csv:

ln.Artifact.df()

	uid	storage_id	key	suffix	accessor	description	version	size	hash	hash_type	n_objects	n_observations	transform_id	run_id	visibility	key_is_virtual	created_at	updated_at	created_by_id
id
4	KN5tE6KT9QdtP29X8w5C	2	iris_studies/study2_raw_images		None	None	None	665518	PX8Vt9T28y-uCEJO1tKm7A	md5-d	51.0	None	1	1	1	False	2024-05-01 18:50:28.003536+00:00	2024-05-01 18:50:28.003564+00:00	1
3	3ksHD7VWCbUJld5fiYzd	2	iris_studies/study1_raw_images		None	None	None	640617	j61W__GgImA18CKrIf7FVg	md5-d	49.0	None	1	1	1	False	2024-05-01 18:50:27.385608+00:00	2024-05-01 18:50:27.385637+00:00	1
2	7zPAagBwZTJzQqVLFyqB	2	iris_studies/study0_raw_images		None	None	None	656692	wVYKPpEsmmrqSpAZIRXCFg	md5-d	51.0	None	1	1	1	False	2024-05-01 18:50:26.600312+00:00	2024-05-01 18:50:26.600341+00:00	1
1	o5tTv4TU4OZGSWnULdR9	2	iris_studies/study0_raw_images/meta.csv	.csv	None	None	None	4355	ZpAEpN0iFYH6vjZNigic7g	md5	NaN	None	1	1	1	False	2024-05-01 18:50:25.683074+00:00	2024-05-01 18:50:25.683102+00:00	1

study_artifacts = ln.Artifact.filter(key__startswith="iris_studies/", suffix="").all()
study_labels = ln.ULabel.filter(name="is_study").one().children.all()
for artifact, study in zip(study_artifacts, study_labels):
    artifact.labels.add(study, feature=features.study_name)
    df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
    organism_labels = ln.ULabel.from_values(df["1"].unique())
    artifact.labels.add(organism_labels, feature=features.iris_organism_name)


✅ linked feature 'study_name' to registry 'core.ULabel'

✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='NDTU8K9Agb5jeiccYVqY', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)

✅ linked feature 'iris_organism_name' to registry 'core.ULabel'

💡 nothing links to it anymore, deleting feature set FeatureSet(uid='NDTU8K9Agb5jeiccYVqY', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)

✅ linked new feature 'iris_organism_name' together with new feature set FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)

✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)

✅ loaded: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)

✅ linked new feature 'iris_organism_name' together with new feature set FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)

✅ loaded: FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)

✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)

✅ loaded: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)

✅ linked new feature 'iris_organism_name' together with new feature set FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)

Query artifacts by labels#

Using the new annotations, you can now query image artifacts by organism & study labels:

ulabels = ln.ULabel.lookup()
artifact = ln.Artifact.filter(ulabels=ulabels.study0).first()

We also see them when calling describe():

artifact.describe()

Artifact(uid='7zPAagBwZTJzQqVLFyqB', key='iris_studies/study0_raw_images', suffix='', size=656692, hash='wVYKPpEsmmrqSpAZIRXCFg', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at=2024-05-01 18:50:26 UTC)

Provenance:
  📎 storage: Storage(uid='EzvEnPnH', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1', instance_uid='pZ1VQkyD3haH')
  📎 transform: Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook')
  📎 run: Run(uid='gYXIXnIw7o7ys3cRQgEl', started_at=2024-05-01 18:50:22 UTC, is_consecutive=True)
  📎 created_by: User(uid='00000000', handle='anonymous')
Features:
  external: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature')
    🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
    🔗 study_name (1, core.ULabel): 'study0'
Labels:
  📎 ulabels (4, core.ULabel): 'virginica', 'study0', 'setosa', 'versicolor'
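Why labels linked under a feature make artifacts findable can be sketched in plain Python. The `annotations` mapping, its artifact keys, and the `find_artifacts` helper are hypothetical and chosen for illustration only; LaminDB keeps these links in database tables, not in a dict.

```python
# Conceptual sketch only: hypothetical artifacts annotated by feature -> labels.
annotations = {
    "artifact_a": {"study_name": {"study0"}, "iris_organism_name": {"setosa", "virginica"}},
    "artifact_b": {"study_name": {"study1"}, "iris_organism_name": {"versicolor"}},
    "artifact_c": {"study_name": {"study1"}, "iris_organism_name": {"setosa"}},
}


def find_artifacts(feature: str, label: str) -> list[str]:
    """Return keys of artifacts that carry `label` under `feature`."""
    return sorted(
        key
        for key, features in annotations.items()
        if label in features.get(feature, set())
    )


print(find_artifacts("study_name", "study1"))
print(find_artifacts("iris_organism_name", "setosa"))
```

Keeping the feature next to each label is what distinguishes "measured in study1" from any other use of the same label.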

Label collections#

Labeling collections works in the same way as labeling artifacts:

collection = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
collection.labels.add(ulabels.study0, feature=features.study_name)
all_organism_labels = ln.ULabel.filter(parents__name="is_organism").all()
collection.labels.add(all_organism_labels, feature=features.iris_organism_name)

collection.describe()

Collection(uid='32cBtnkazA7BtJB4YV3x', name='Iris collection', description='50 image files and metadata', version='1', hash='WwFLpNFmK8GMC2dSGj1W', visibility=1, updated_at=2024-05-01 18:50:26 UTC)

Provenance:
  📎 transform: Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook')
  📎 run: Run(uid='gYXIXnIw7o7ys3cRQgEl', started_at=2024-05-01 18:50:22 UTC, is_consecutive=True)
  📎 created_by: User(uid='00000000', handle='anonymous')
Features:
  external: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature')
    🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
    🔗 study_name (1, core.ULabel): 'study0'
Labels:
  📎 ulabels (4, core.ULabel): 'virginica', 'study0', 'setosa', 'versicolor'

Run an ML model#

Let’s now run an ML model that transforms the images into 4 high-level features.

def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform=transform)
    input_data = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
    input_paths = [
        path.download_to(path.name) for path in input_data.artifacts[0].path.glob("*")
    ]
    # apply ML model
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data


df = run_ml_model()

The output is a dataframe:

df.head()

	sepal_length	sepal_width	petal_length	petal_width	iris_organism_name
0	0.051	0.035	0.014	0.002	setosa
1	0.049	0.030	0.014	0.002	setosa
2	0.047	0.032	0.013	0.002	setosa
3	0.046	0.031	0.015	0.002	setosa
4	0.050	0.036	0.014	0.002	setosa

And this is the ML pipeline that produced the dataframe:

ln.core.run_context.transform.view_parents()

_images/74a948affb14e814becfd892bfd44fda44d582e29642bcee1ef91cd130c592f0.svg

Register the output data#

Let’s first register the features of the transformed data:

new_features = ln.Feature.from_df(df)
ln.save(new_features)

We can now validate & register the dataframe in one line:

artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()

💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet')

✅ storing artifact 'Qtz8SzzWcsj7dLiZvYxJ' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet'

Artifact(uid='Qtz8SzzWcsj7dLiZvYxJ', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='gWxE_pTwNrrYutXJaGbHqA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:37 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)

artifact.features.add(new_features)

Feature sets#

Get an overview of linked features:

artifact.features

Features:
  columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
    🔗 iris_organism_name (0, core.ULabel): 
    sepal_length (number)
    sepal_width (number)
    petal_length (number)
    petal_width (number)

You’ll see that they’re always grouped in sets that correspond to records of FeatureSet.

A slot provides a string key to access feature sets. It’s typically the accessor within the registered data object, here pd.DataFrame.columns.

Let’s use it to access all linked features:

artifact.features["columns"].df()

	uid	name	type	unit	description	registries	synonyms	created_at	updated_at	created_by_id
id
1	7Ohz9SVp18RZ	iris_organism_name	category	None	None	core.ULabel	None	2024-05-01 18:50:30.969360+00:00	2024-05-01 18:50:31.950162+00:00	1
3	Om8Mw6pFkzIo	sepal_length	number	None	None	None	None	2024-05-01 18:50:37.714235+00:00	2024-05-01 18:50:37.714263+00:00	1
4	iwR15EkIJUKV	sepal_width	number	None	None	None	None	2024-05-01 18:50:37.714387+00:00	2024-05-01 18:50:37.714403+00:00	1
5	UL8Sh0vgFmp7	petal_length	number	None	None	None	None	2024-05-01 18:50:37.714513+00:00	2024-05-01 18:50:37.714528+00:00	1
6	e1QfngsThY1R	petal_width	number	None	None	None	None	2024-05-01 18:50:37.714638+00:00	2024-05-01 18:50:37.714652+00:00	1

There is one categorical feature. Let’s add the organism labels:

organism_labels = ln.ULabel.filter(parents__name="is_organism").all()
artifact.labels.add(organism_labels, feature=features.iris_organism_name)

Let’s now add study labels:

artifact.labels.add(ulabels.study0, feature=features.study_name)

✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)

In addition to the columns feature set, we now have an external feature set:

artifact.features

Features:
  columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
    🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
    sepal_length (number)
    sepal_width (number)
    petal_length (number)
    petal_width (number)
  external: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature')
    🔗 study_name (1, core.ULabel): 'study0'
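In the log output of this tutorial, feature sets that contain the same features carry the same `hash` value even when their `uid` differs. One way such content-based identity can work is sketched below in plain Python. The `feature_set_hash` helper and its use of SHA-256 over sorted feature names are assumptions for illustration; the tutorial does not show how LaminDB computes this hash.

```python
import hashlib


def feature_set_hash(feature_names: list[str]) -> str:
    """Hash a set of feature names independent of their order (illustrative)."""
    canonical = ",".join(sorted(set(feature_names)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


columns_a = ["sepal_length", "sepal_width", "petal_length", "petal_width", "iris_organism_name"]
columns_b = ["iris_organism_name", "petal_width", "petal_length", "sepal_width", "sepal_length"]

# Same features in a different order give the same hash, so a lookup by hash
# can find an existing feature set instead of creating a duplicate.
print(feature_set_hash(columns_a) == feature_set_hash(columns_b))
# A different set of features gives a different hash.
print(feature_set_hash(columns_a) == feature_set_hash(["study_name"]))
```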

This is the context for our artifact:

artifact.describe()

Artifact(uid='Qtz8SzzWcsj7dLiZvYxJ', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='gWxE_pTwNrrYutXJaGbHqA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:37 UTC)

Provenance:
  📎 storage: Storage(uid='sXnIoPYEDMeP', root='/home/runner/work/lamindb/lamindb/docs/lamin-tutorial', type='local', instance_uid='5WuFt3cW4zRx')
  📎 transform: Transform(uid='fDh8lNBRLF8PQtTP', name='Petal & sepal regressor', type='pipeline')
  📎 run: Run(uid='4MhEkENdu3viywVPDhqf', started_at=2024-05-01 18:50:32 UTC, is_consecutive=True)
  📎 created_by: User(uid='00000000', handle='anonymous')
Features:
  columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
    🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
    sepal_length (number)
    sepal_width (number)
    petal_length (number)
    petal_width (number)
  external: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature')
    🔗 study_name (1, core.ULabel): 'study0'
Labels:
  📎 ulabels (4, core.ULabel): 'virginica', 'study0', 'setosa', 'versicolor'

artifact.view_lineage()

_images/31187fea2e60576112c64443c7f76804367711df05f63df2313850b6b7aefaae.svg

See the database content:

ln.view(registries=["Feature", "FeatureSet", "ULabel"])


Feature

	uid	name	type	unit	description	registries	synonyms	created_at	updated_at	created_by_id
id
6	e1QfngsThY1R	petal_width	number	None	None	None	None	2024-05-01 18:50:37.714638+00:00	2024-05-01 18:50:37.714652+00:00	1
5	UL8Sh0vgFmp7	petal_length	number	None	None	None	None	2024-05-01 18:50:37.714513+00:00	2024-05-01 18:50:37.714528+00:00	1
4	iwR15EkIJUKV	sepal_width	number	None	None	None	None	2024-05-01 18:50:37.714387+00:00	2024-05-01 18:50:37.714403+00:00	1
3	Om8Mw6pFkzIo	sepal_length	number	None	None	None	None	2024-05-01 18:50:37.714235+00:00	2024-05-01 18:50:37.714263+00:00	1
1	7Ohz9SVp18RZ	iris_organism_name	category	None	None	core.ULabel	None	2024-05-01 18:50:30.969360+00:00	2024-05-01 18:50:31.950162+00:00	1
2	Nrcl3zMwkL9q	study_name	category	None	None	core.ULabel	None	2024-05-01 18:50:30.985072+00:00	2024-05-01 18:50:31.591442+00:00	1

FeatureSet

	uid	name	n	type	registry	hash	created_at	updated_at	created_by_id
id
5	qpVYAIhNMtboyMZ1urZR	None	1	None	core.Feature	wrVx9j2goWy1Emt2Eq9y	2024-05-01 18:50:37.816761+00:00	2024-05-01 18:50:37.821525+00:00	1
4	eAHIaYzbO8KemSbYPD7O	None	5	None	core.Feature	sXUHyxw8y_MHRRkdVPak	2024-05-01 18:50:37.743324+00:00	2024-05-01 18:50:37.743350+00:00	1
2	p0rX4YMjebjGiZaQItKm	None	2	None	core.Feature	AeIcn9GpMb8154_Qhc4Z	2024-05-01 18:50:31.956398+00:00	2024-05-01 18:50:32.415491+00:00	1

ULabel

	uid	name	description	reference	reference_type	created_at	updated_at	created_by_id
id
8	qIie3ota	is_study	None	None	None	2024-05-01 18:50:30.910140+00:00	2024-05-01 18:50:30.910164+00:00	1
7	pf47KQzy	study2	None	None	None	2024-05-01 18:50:30.900888+00:00	2024-05-01 18:50:30.900903+00:00	1
6	iUgtZ2xK	study1	None	None	None	2024-05-01 18:50:30.900785+00:00	2024-05-01 18:50:30.900801+00:00	1
5	mPISxrYD	study0	None	None	None	2024-05-01 18:50:30.900654+00:00	2024-05-01 18:50:30.900686+00:00	1
4	wU4im1Pn	is_organism	None	None	None	2024-05-01 18:50:30.802939+00:00	2024-05-01 18:50:30.802967+00:00	1
3	vqItCsbY	virginica	None	None	None	2024-05-01 18:50:30.764203+00:00	2024-05-01 18:50:30.764217+00:00	1
2	uY6M63yK	versicolor	None	None	None	2024-05-01 18:50:30.764098+00:00	2024-05-01 18:50:30.764114+00:00	1

Manage follow-up data#

Assume that a couple of weeks later, we receive a new dataset in a follow-up study 2.

Let’s track a new analysis:

ln.settings.transform.stem_uid = "dMtrt8YMSdl6"
ln.settings.transform.version = "1"

ln.track()

💡 Assuming editor is Jupyter Lab.

💡 Attaching notebook metadata

💡 notebook imports: lamindb==0.71.0 pandas==1.5.3

💡 saved: Transform(uid='dMtrt8YMSdl65zKv', name='Tutorial: Features & labels', key='tutorial2', version='1', type='notebook', updated_at=2024-05-01 18:50:39 UTC, created_by_id=1)

💡 saved: Run(uid='4kwedLR19YL3udJmPMoF', transform_id=3, created_by_id=1)

💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_4kwedLR19YL3udJmPMoF.txt

Register a joint collection#

Assume we already ran all preprocessing including the ML model.

We get a DataFrame and store it as an artifact:

df = ln.core.datasets.df_iris_in_meter_study2()
ln.Artifact.from_df(df, description="Iris study 2 - transformed").save()

Let’s load it:

artifact2 = ln.Artifact.filter(description="Iris study 2 - transformed").one()

We can now store the joint collection:

collection = ln.Collection(
    [artifact, artifact2], name="Iris flower study 1 & 2 - transformed"
)
collection.save()

✅ loaded: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature', hash='sXUHyxw8y_MHRRkdVPak', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)

✅ loaded: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)

💡 adding artifact [5] as input for run 3, adding parent transform 2

Auto-concatenate datasets#

Because both datasets measured the same validated feature set, we can auto-concatenate the collection:

collection.load().tail()

	sepal_length	sepal_width	petal_length	petal_width	iris_organism_name
145	0.067	0.030	0.052	0.023	virginica
146	0.063	0.025	0.050	0.019	virginica
147	0.065	0.030	0.052	0.020	virginica
148	0.062	0.034	0.054	0.023	virginica
149	0.059	0.030	0.051	0.018	virginica
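Concatenation is only safe when the datasets share the same columns. The plain-Python sketch below illustrates that precondition with lists of row dicts. The `concatenate` helper and the sample rows are hypothetical and stand in for the dataframes that `collection.load()` handles.

```python
# Conceptual sketch only: concatenate row-oriented datasets after checking
# that every row has exactly the same columns.
def concatenate(datasets: list[list[dict]]) -> list[dict]:
    """Stack the rows of all datasets; raise if their columns differ."""
    expected = None
    rows = []
    for dataset in datasets:
        for row in dataset:
            columns = set(row)
            if expected is None:
                expected = columns
            elif columns != expected:
                raise ValueError(f"column mismatch: {sorted(columns ^ expected)}")
            rows.append(row)
    return rows


# Hypothetical sample rows in the shape of the dataframes above.
study1 = [{"sepal_length": 0.051, "iris_organism_name": "setosa"}]
study2 = [{"sepal_length": 0.067, "iris_organism_name": "virginica"}]
joint = concatenate([study1, study2])
print(len(joint))
```

If one dataset used a different column name, the helper would raise a `ValueError` instead of silently producing a table with gaps.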

We can also access & query the underlying two artifact objects:

collection.artifacts.df()

	uid	storage_id	key	suffix	accessor	description	version	size	hash	hash_type	n_objects	n_observations	transform_id	run_id	visibility	key_is_virtual	created_at	updated_at	created_by_id
id
5	Qtz8SzzWcsj7dLiZvYxJ	1	None	.parquet	DataFrame	Iris study 1 - after measuring sepal & petal m...	None	5347	gWxE_pTwNrrYutXJaGbHqA	md5	None	None	2	2	1	True	2024-05-01 18:50:37.729786+00:00	2024-05-01 18:50:37.729812+00:00	1
6	F5bhv575rauQVoDjgtOM	1	None	.parquet	DataFrame	Iris study 2 - transformed	None	5397	Sz8pY3NO5XxBNEis7uq_ZQ	md5	None	None	3	3	1	True	2024-05-01 18:50:40.394478+00:00	2024-05-01 18:50:40.394507+00:00	1

Or look at their data lineage:

collection.view_lineage()

_images/9bed30fa402cb2f7d37f6f0613e50648d6f527ec1ac786800628a827f3c634af.svg

Or look at the database:

ln.view()


Artifact

	uid	storage_id	key	suffix	accessor	description	version	size	hash	hash_type	n_objects	n_observations	transform_id	run_id	visibility	key_is_virtual	created_at	updated_at	created_by_id
id
6	F5bhv575rauQVoDjgtOM	1	None	.parquet	DataFrame	Iris study 2 - transformed	None	5397	Sz8pY3NO5XxBNEis7uq_ZQ	md5	NaN	None	3	3	1	True	2024-05-01 18:50:40.394478+00:00	2024-05-01 18:50:40.394507+00:00	1
5	Qtz8SzzWcsj7dLiZvYxJ	1	None	.parquet	DataFrame	Iris study 1 - after measuring sepal & petal m...	None	5347	gWxE_pTwNrrYutXJaGbHqA	md5	NaN	None	2	2	1	True	2024-05-01 18:50:37.729786+00:00	2024-05-01 18:50:37.729812+00:00	1
4	KN5tE6KT9QdtP29X8w5C	2	iris_studies/study2_raw_images		None	None	None	665518	PX8Vt9T28y-uCEJO1tKm7A	md5-d	51.0	None	1	1	1	False	2024-05-01 18:50:28.003536+00:00	2024-05-01 18:50:28.003564+00:00	1
3	3ksHD7VWCbUJld5fiYzd	2	iris_studies/study1_raw_images		None	None	None	640617	j61W__GgImA18CKrIf7FVg	md5-d	49.0	None	1	1	1	False	2024-05-01 18:50:27.385608+00:00	2024-05-01 18:50:27.385637+00:00	1
2	7zPAagBwZTJzQqVLFyqB	2	iris_studies/study0_raw_images		None	None	None	656692	wVYKPpEsmmrqSpAZIRXCFg	md5-d	51.0	None	1	1	1	False	2024-05-01 18:50:26.600312+00:00	2024-05-01 18:50:26.600341+00:00	1
1	o5tTv4TU4OZGSWnULdR9	2	iris_studies/study0_raw_images/meta.csv	.csv	None	None	None	4355	ZpAEpN0iFYH6vjZNigic7g	md5	NaN	None	1	1	1	False	2024-05-01 18:50:25.683074+00:00	2024-05-01 18:50:25.683102+00:00	1

Collection

	uid	name	description	version	hash	reference	reference_type	transform_id	run_id	artifact_id	visibility	created_at	updated_at	created_by_id
id
4	RAz3v1ppt4kR6lFG6Wck	Iris flower study 1 & 2 - transformed	None	None	r405fYDOPfwROfWo23b1	None	None	3	3	None	1	2024-05-01 18:50:40.438900+00:00	2024-05-01 18:50:40.438924+00:00	1
3	32cBtnkazA7BtJB4pisU	Iris collection	Another 50 images	3	T-U8z2Zi5rFYdAD9pzmS	None	None	1	1	None	1	2024-05-01 18:50:28.016434+00:00	2024-05-01 18:50:28.016457+00:00	1
2	32cBtnkazA7BtJB4Db1u	Iris collection	Another 50 images	2	5cCK6ZLOPB0cV3tyeZup	None	None	1	1	None	1	2024-05-01 18:50:27.398726+00:00	2024-05-01 18:50:27.398750+00:00	1
1	32cBtnkazA7BtJB4YV3x	Iris collection	50 image files and metadata	1	WwFLpNFmK8GMC2dSGj1W	None	None	1	1	None	1	2024-05-01 18:50:26.800334+00:00	2024-05-01 18:50:26.800363+00:00	1

Feature

	uid	name	type	unit	description	registries	synonyms	created_at	updated_at	created_by_id
id
6	e1QfngsThY1R	petal_width	number	None	None	None	None	2024-05-01 18:50:37.714638+00:00	2024-05-01 18:50:37.714652+00:00	1
5	UL8Sh0vgFmp7	petal_length	number	None	None	None	None	2024-05-01 18:50:37.714513+00:00	2024-05-01 18:50:37.714528+00:00	1
4	iwR15EkIJUKV	sepal_width	number	None	None	None	None	2024-05-01 18:50:37.714387+00:00	2024-05-01 18:50:37.714403+00:00	1
3	Om8Mw6pFkzIo	sepal_length	number	None	None	None	None	2024-05-01 18:50:37.714235+00:00	2024-05-01 18:50:37.714263+00:00	1
1	7Ohz9SVp18RZ	iris_organism_name	category	None	None	core.ULabel	None	2024-05-01 18:50:30.969360+00:00	2024-05-01 18:50:31.950162+00:00	1
2	Nrcl3zMwkL9q	study_name	category	None	None	core.ULabel	None	2024-05-01 18:50:30.985072+00:00	2024-05-01 18:50:31.591442+00:00	1

FeatureSet

	uid	name	n	type	registry	hash	created_at	updated_at	created_by_id
id
5	qpVYAIhNMtboyMZ1urZR	None	1	None	core.Feature	wrVx9j2goWy1Emt2Eq9y	2024-05-01 18:50:37.816761+00:00	2024-05-01 18:50:37.821525+00:00	1
4	eAHIaYzbO8KemSbYPD7O	None	5	None	core.Feature	sXUHyxw8y_MHRRkdVPak	2024-05-01 18:50:37.743324+00:00	2024-05-01 18:50:37.743350+00:00	1
2	p0rX4YMjebjGiZaQItKm	None	2	None	core.Feature	AeIcn9GpMb8154_Qhc4Z	2024-05-01 18:50:31.956398+00:00	2024-05-01 18:50:32.415491+00:00	1

Run

	uid	transform_id	started_at	finished_at	created_by_id	json	report_id	environment_id	is_consecutive	reference	reference_type	created_at
id
1	gYXIXnIw7o7ys3cRQgEl	1	2024-05-01 18:50:22.948429+00:00	None	1	None	None	None	True	None	None	2024-05-01 18:50:22.948563+00:00
2	4MhEkENdu3viywVPDhqf	2	2024-05-01 18:50:32.462950+00:00	None	1	None	None	None	True	None	None	2024-05-01 18:50:32.463085+00:00
3	4kwedLR19YL3udJmPMoF	3	2024-05-01 18:50:39.337405+00:00	None	1	None	None	None	True	None	None	2024-05-01 18:50:39.337657+00:00

Storage

	uid	root	description	type	region	instance_uid	created_at	updated_at	created_by_id
id
2	EzvEnPnH	s3://lamindb-dev-datasets	None	s3	us-east-1	pZ1VQkyD3haH	2024-05-01 18:50:25.627040+00:00	2024-05-01 18:50:25.627073+00:00	1
1	sXnIoPYEDMeP	/home/runner/work/lamindb/lamindb/docs/lamin-t...	None	local	None	5WuFt3cW4zRx	2024-05-01 18:50:20.106918+00:00	2024-05-01 18:50:20.106942+00:00	1

Transform

	uid	name	key	version	description	type	latest_report_id	source_code_id	reference	reference_type	created_at	updated_at	created_by_id
id
3	dMtrt8YMSdl65zKv	Tutorial: Features & labels	tutorial2	1	None	notebook	None	None	None	None	2024-05-01 18:50:39.330633+00:00	2024-05-01 18:50:39.330677+00:00	1
2	fDh8lNBRLF8PQtTP	Petal & sepal regressor	None	None	None	pipeline	None	None	None	None	2024-05-01 18:50:32.458950+00:00	2024-05-01 18:50:32.458976+00:00	1
1	NJvdsWWbJlZS6K79	Tutorial: Artifacts	tutorial	0	None	notebook	None	None	None	None	2024-05-01 18:50:22.941759+00:00	2024-05-01 18:50:22.941802+00:00	1

ULabel

	uid	name	description	reference	reference_type	created_at	updated_at	created_by_id
id
8	qIie3ota	is_study	None	None	None	2024-05-01 18:50:30.910140+00:00	2024-05-01 18:50:30.910164+00:00	1
7	pf47KQzy	study2	None	None	None	2024-05-01 18:50:30.900888+00:00	2024-05-01 18:50:30.900903+00:00	1
6	iUgtZ2xK	study1	None	None	None	2024-05-01 18:50:30.900785+00:00	2024-05-01 18:50:30.900801+00:00	1
5	mPISxrYD	study0	None	None	None	2024-05-01 18:50:30.900654+00:00	2024-05-01 18:50:30.900686+00:00	1
4	wU4im1Pn	is_organism	None	None	None	2024-05-01 18:50:30.802939+00:00	2024-05-01 18:50:30.802967+00:00	1
3	vqItCsbY	virginica	None	None	None	2024-05-01 18:50:30.764203+00:00	2024-05-01 18:50:30.764217+00:00	1
2	uY6M63yK	versicolor	None	None	None	2024-05-01 18:50:30.764098+00:00	2024-05-01 18:50:30.764114+00:00	1

User

	uid	handle	name	created_at	updated_at
id
1	00000000	anonymous	None	2024-05-01 18:50:20.102450+00:00	2024-05-01 18:50:20.102474+00:00

This is it! 😅

If you’re interested, please check out guides & use cases or open an issue on GitHub to discuss.

Appendix#

Manage metadata#

Avoid duplicates#

Let’s create a label “project1”:

ln.ULabel(name="project1").save()

ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)

We already created a project1 label before. Let’s see what happens if we try to create it again:

label = ln.ULabel(name="project1")
label.save()

❗ loaded ULabel record with same name: 'project1' (disable via `ln.settings.upon_create_search_names`)

ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell “project 1” with a space:

ln.ULabel(name="project 1")

❗ record with similar name exist! did you mean to load it?

	uid	score
name
project1	hYWK36Qu	94.1

ULabel(uid='4bma7rl4', name='project 1', created_by_id=1)

To avoid inserting duplicates, LaminDB searches for records with similar names whenever you create a new record.

You can switch this search off for performance gains via ln.settings.upon_create_search_names.
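A similarity search of this kind can be sketched with Python's standard library. The snippet below uses `difflib.SequenceMatcher` as a stand-in; the tutorial does not say which matching algorithm LaminDB uses, and the `find_similar` helper and its `threshold` parameter are hypothetical.

```python
from difflib import SequenceMatcher


def find_similar(
    name: str, existing: list[str], threshold: float = 80.0
) -> list[tuple[str, float]]:
    """Return (existing_name, score) pairs with a similarity of at least `threshold` (0-100)."""
    matches = []
    for candidate in existing:
        score = 100 * SequenceMatcher(None, name, candidate).ratio()
        if score >= threshold:
            matches.append((candidate, round(score, 1)))
    # Best match first.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


existing_names = ["project1", "study0", "setosa"]
print(find_similar("project 1", existing_names))
```

Running such a comparison against every existing name on each insert is what costs time in large registries, which is why a switch to disable it can be useful.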

Update & delete records#

label = ln.ULabel.filter(name="project1").first()
label

ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)

label.name = "project1a"
label.save()
label

ULabel(uid='hYWK36Qu', name='project1a', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)

label.delete()

(1, {'lnschema_core.ULabel': 1})

Manage storage#

Change default storage#

The default storage location is:

ln.settings.storage

PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')

You can change it by setting ln.settings.storage = "s3://my-bucket".

See all storage locations#

ln.Storage.df()

	uid	root	description	type	region	instance_uid	created_at	updated_at	created_by_id
id
2	EzvEnPnH	s3://lamindb-dev-datasets	None	s3	us-east-1	pZ1VQkyD3haH	2024-05-01 18:50:25.627040+00:00	2024-05-01 18:50:25.627073+00:00	1
1	sXnIoPYEDMeP	/home/runner/work/lamindb/lamindb/docs/lamin-t...	None	local	None	5WuFt3cW4zRx	2024-05-01 18:50:20.106918+00:00	2024-05-01 18:50:20.106942+00:00	1

Set verbosity#

To reduce the number of logging messages, set verbosity:

ln.settings.verbosity = 3  # only show info, no hints

# clean up what we wrote in this notebook
!lamin delete --force lamin-tutorial
!rm -r lamin-tutorial

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.9/x64/bin/lamin", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/rich_click/rich_command.py", line 360, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamin_cli/__main__.py", line 103, in delete
    return delete(instance, force=force)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamindb_setup/_delete.py", line 140, in delete
    n_objects = check_storage_is_empty(
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamindb_setup/core/upath.py", line 814, in check_storage_is_empty
    raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb contains 3 objects ('./lamindb/_is_initialized'  ignored) - delete them prior to deleting the instance
['/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/F5bhv575rauQVoDjgtOM.parquet', '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet', '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/_is_initialized']