Tutorial: Features & labels#
In the previous tutorial (Tutorial: Artifacts), we learned how to leverage basic metadata for artifacts to access data (query, search, cache & load).
Here, we walk through annotating & validating data with features & labels to improve:
Finding data: Which collections measured expression of cell marker CD14? Which characterized cell line K562? Which collections have a test & train split? Etc.
Using data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.
What was LaminDB’s most basic inspiration?
The pydata family of objects is at the heart of most data science, ML & comp bio workflows: DataFrame, AnnData, pytorch.DataLoader, zarr.Array, pyarrow.Table, xarray.Dataset, …
And still, we couldn’t find a tool to link these objects to context so that they could be analyzed in context!
Context relevant for analyses includes anything that’s needed to interpret & model data.
So, lamindb.Artifact and lamindb.Collection track:
data sources, data transformations, models, users & pipelines that performed transformations (provenance)
any entity of the domain in which data is generated and modeled (features & labels)
import lamindb as ln
import pandas as pd
💡 connected lamindb: anonymous/lamin-tutorial
ln.settings.verbosity = "hint"
Register metadata#
Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.
Features represent measurement dimensions (e.g. organism) and labels represent measurement values (e.g. iris setosa, iris versicolor, iris virginica).
Register labels#
We study 3 organisms of the Iris plant: setosa, versicolor & virginica.
Let’s populate the universal label registry (ULabel) for them:
organisms = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(organisms)
organisms
[ULabel(uid='v9CL1k1r', name='setosa', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1),
ULabel(uid='uY6M63yK', name='versicolor', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1),
ULabel(uid='vqItCsbY', name='virginica', updated_at=2024-05-01 18:50:30 UTC, created_by_id=1)]
Anticipating that we’ll have many different labels when working with more data, we’d like to express that all 3 labels are organism labels:
is_organism = ln.ULabel(name="is_organism")
is_organism.save()
is_organism.children.set(organisms)
is_organism.view_parents(with_children=True)
ULabel enables you to manage an in-house ontology for all kinds of untyped labels.
If you’d like to leverage pre-built typed ontologies for basic biological entities in the same way, see: Manage biological registries.
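To make the parent-child idea concrete, here is a minimal sketch in plain Python. It is not how LaminDB stores labels; `children_of`, `set_children` and `labels_of_type` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a label registry with parent/child links.
children_of: dict[str, set[str]] = defaultdict(set)


def set_children(parent: str, children: list[str]) -> None:
    """Record that every label in `children` is a child of `parent`."""
    children_of[parent] = set(children)


def labels_of_type(parent: str) -> list[str]:
    """Return the child labels of `parent` in sorted order."""
    return sorted(children_of[parent])


set_children("is_organism", ["setosa", "versicolor", "virginica"])
organism_names = labels_of_type("is_organism")
print(organism_names)
```

Grouping labels under a parent like this is what later lets us ask for "all organism labels" without listing them by hand.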
In addition to organism, we’d like to track the studies that produced the data:
studies = [ln.ULabel(name=name) for name in ["study0", "study1", "study2"]]
ln.save(studies)
is_study = ln.ULabel(name="is_study")
is_study.save()
is_study.children.set(studies)
is_study.view_parents(with_children=True)
Why label a dataset by study?
We can then
query all artifacts linked to this experiment
model it as a confounder when we analyze similar data from a follow-up experiment, and concatenate data using the label as a feature in a data matrix
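As a rough illustration of the second point, the sketch below adds a study label as a column before concatenating rows from two studies. The row values and the helper `with_study_label` are made up for this example and are not part of LaminDB.

```python
# Hypothetical measurement rows from two studies (values are illustrative only).
study0_rows = [{"sepal_length": 0.051}, {"sepal_length": 0.049}]
study1_rows = [{"sepal_length": 0.063}]


def with_study_label(rows, study_name):
    """Return copies of `rows` with the study label added as a column."""
    return [{**row, "study_name": study_name} for row in rows]


# After concatenation, every row still records which study it came from.
combined = with_study_label(study0_rows, "study0") + with_study_label(
    study1_rows, "study1"
)
print(combined)
```

With the label stored as a column, a downstream model can treat `study_name` as a covariate.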
Register features#
For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.
When we integrate datasets, feature names will label columns that store data.
Let’s create and save two Feature records to identify measurements of the iris organism label and the study:
ln.Feature(name="iris_organism_name", type="category").save()
ln.Feature(name="study_name", type="category").save()
# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()
Validate & link labels#
We already looked at the metadata for study0 before:
meta_artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images/meta.csv").one()
meta = meta_artifact.load(index_col=0) # load a dataframe
meta.head()
💡 you can auto-track these data as a run input by calling `ln.track()`
|   | 0 | 1 |
|---|---|---|
| 0 | iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce... | setosa |
| 1 | iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710... | versicolor |
| 2 | iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf... | versicolor |
| 3 | iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109... | setosa |
| 4 | iris-bdae8314e4385d8e2322abd8e63a82758a9063c77... | virginica |
Validate metadata#
Depending on the data generation process, such metadata might or might not match the labels we defined in our registries.
Let’s validate the labels by mapping the values stored in the artifact against the ULabel registry:
ln.ULabel.validate(meta["1"], field="name")
✅ 3 terms (100.00%) are validated for name
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True])
Everything passed and no fixes are needed!
If validation doesn’t pass, standardize() and inspect() will help standardize the data.
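For intuition, here is a minimal plain-Python sketch of what validating and standardizing values against a registry can look like. It is an assumption-laden illustration, not LaminDB's implementation: the two functions only borrow the method names, and the typo and synonym table are hypothetical.

```python
def validate(values, registry_names):
    """Return one boolean per value: True if the value is a registered name."""
    registered = set(registry_names)
    return [value in registered for value in values]


def standardize(values, synonyms):
    """Map known synonyms onto registered names; leave other values unchanged."""
    return [synonyms.get(value, value) for value in values]


registry_names = ["setosa", "versicolor", "virginica"]
raw_values = ["setosa", "Versicolour", "virginica"]  # hypothetical input with one typo

# Hypothetical synonym table; in practice this would be curated by hand.
synonyms = {"Versicolour": "versicolor"}

before = validate(raw_values, registry_names)
after = validate(standardize(raw_values, synonyms), registry_names)
print(before, after)
```

The boolean list has the same shape as the array returned above: one entry per input value.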
Label artifacts#
Labeling a set of artifacts is useful if we want to make the set queryable among a large number of artifacts.
You can label an artifact by calling artifact.labels.add() and passing one or more label records.
Let’s do this based on the labels in meta.csv:
ln.Artifact.df()
uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
4 | KN5tE6KT9QdtP29X8w5C | 2 | iris_studies/study2_raw_images | None | None | None | 665518 | PX8Vt9T28y-uCEJO1tKm7A | md5-d | 51.0 | None | 1 | 1 | 1 | False | 2024-05-01 18:50:28.003536+00:00 | 2024-05-01 18:50:28.003564+00:00 | 1 | |
3 | 3ksHD7VWCbUJld5fiYzd | 2 | iris_studies/study1_raw_images | None | None | None | 640617 | j61W__GgImA18CKrIf7FVg | md5-d | 49.0 | None | 1 | 1 | 1 | False | 2024-05-01 18:50:27.385608+00:00 | 2024-05-01 18:50:27.385637+00:00 | 1 | |
2 | 7zPAagBwZTJzQqVLFyqB | 2 | iris_studies/study0_raw_images | None | None | None | 656692 | wVYKPpEsmmrqSpAZIRXCFg | md5-d | 51.0 | None | 1 | 1 | 1 | False | 2024-05-01 18:50:26.600312+00:00 | 2024-05-01 18:50:26.600341+00:00 | 1 | |
1 | o5tTv4TU4OZGSWnULdR9 | 2 | iris_studies/study0_raw_images/meta.csv | .csv | None | None | None | 4355 | ZpAEpN0iFYH6vjZNigic7g | md5 | NaN | None | 1 | 1 | 1 | False | 2024-05-01 18:50:25.683074+00:00 | 2024-05-01 18:50:25.683102+00:00 | 1 |
study_artifacts = ln.Artifact.filter(key__startswith="iris_studies/", suffix="").all()
study_labels = ln.ULabel.filter(name="is_study").one().children.all()
for artifact, study in zip(study_artifacts, study_labels):
    artifact.labels.add(study, feature=features.study_name)
    df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
    organism_labels = ln.ULabel.from_values(df["1"].unique())
    artifact.labels.add(organism_labels, feature=features.iris_organism_name)
✅ linked feature 'study_name' to registry 'core.ULabel'
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='NDTU8K9Agb5jeiccYVqY', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)
✅ linked feature 'iris_organism_name' to registry 'core.ULabel'
💡 nothing links to it anymore, deleting feature set FeatureSet(uid='NDTU8K9Agb5jeiccYVqY', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)
✅ linked new feature 'iris_organism_name' together with new feature set FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)
✅ linked new feature 'iris_organism_name' together with new feature set FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:31 UTC, created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
✅ linked new feature 'iris_organism_name' together with new feature set FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
Query artifacts by labels#
Using the new annotations, you can now query image artifacts by organism & study labels:
ulabels = ln.ULabel.lookup()
artifact = ln.Artifact.filter(ulabels=ulabels.study0).first()
We also see them when calling describe():
artifact.describe()
Artifact(uid='7zPAagBwZTJzQqVLFyqB', key='iris_studies/study0_raw_images', suffix='', size=656692, hash='wVYKPpEsmmrqSpAZIRXCFg', hash_type='md5-d', n_objects=51, visibility=1, key_is_virtual=False, updated_at=2024-05-01 18:50:26 UTC)
Provenance:
📎 storage: Storage(uid='EzvEnPnH', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1', instance_uid='pZ1VQkyD3haH')
📎 transform: Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook')
📎 run: Run(uid='gYXIXnIw7o7ys3cRQgEl', started_at=2024-05-01 18:50:22 UTC, is_consecutive=True)
📎 created_by: User(uid='00000000', handle='anonymous')
Features:
external: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature')
🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
🔗 study_name (1, core.ULabel): 'study0'
Labels:
📎 ulabels (4, core.ULabel): 'virginica', 'study0', 'setosa', 'versicolor'
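Conceptually, a label query walks the links between artifacts and labels. The following plain-Python sketch shows the idea under stated assumptions: the `links` mapping and `filter_by_label` are hypothetical, and the organism labels listed for study1 are assumed for illustration rather than taken from its metadata.

```python
# Hypothetical links: artifact key -> {feature name -> set of label names}.
links = {
    "iris_studies/study0_raw_images": {
        "study_name": {"study0"},
        "iris_organism_name": {"setosa", "versicolor", "virginica"},
    },
    "iris_studies/study1_raw_images": {
        "study_name": {"study1"},
        # Assumed for illustration; the real labels depend on that study's meta.csv.
        "iris_organism_name": {"setosa", "versicolor", "virginica"},
    },
}


def filter_by_label(links, label):
    """Return the keys of all artifacts that carry `label` under any feature."""
    return sorted(
        key
        for key, by_feature in links.items()
        if any(label in labels for labels in by_feature.values())
    )


study0_hits = filter_by_label(links, "study0")
setosa_hits = filter_by_label(links, "setosa")
print(study0_hits, setosa_hits)
```

Keeping the feature name alongside each label is what lets describe() group labels by feature, as in the output above.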
Label collections#
Labeling collections works in the same way as labeling artifacts:
collection = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
collection.labels.add(ulabels.study0, feature=features.study_name)
all_organism_labels = ln.ULabel.filter(parents__name="is_organism").all()
collection.labels.add(all_organism_labels, feature=features.iris_organism_name)
✅ loaded: FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
💡 nothing links to it anymore, deleting feature set FeatureSet(uid='3C62rFhV9fwwApAYGufF', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
✅ linked new feature 'iris_organism_name' together with new feature set FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature', hash='AeIcn9GpMb8154_Qhc4Z', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
collection.describe()
Collection(uid='32cBtnkazA7BtJB4YV3x', name='Iris collection', description='50 image files and metadata', version='1', hash='WwFLpNFmK8GMC2dSGj1W', visibility=1, updated_at=2024-05-01 18:50:26 UTC)
Provenance:
📎 transform: Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook')
📎 run: Run(uid='gYXIXnIw7o7ys3cRQgEl', started_at=2024-05-01 18:50:22 UTC, is_consecutive=True)
📎 created_by: User(uid='00000000', handle='anonymous')
Features:
external: FeatureSet(uid='p0rX4YMjebjGiZaQItKm', n=2, registry='core.Feature')
🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
🔗 study_name (1, core.ULabel): 'study0'
Labels:
📎 ulabels (4, core.ULabel): 'virginica', 'study0', 'setosa', 'versicolor'
Run an ML model#
Let’s now run an ML model that transforms the images into 4 high-level features.
def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform=transform)
    input_data = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
    input_paths = [
        path.download_to(path.name) for path in input_data.artifacts[0].path.glob("*")
    ]
    # apply ML model
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data
df = run_ml_model()
💡 saved: Transform(uid='fDh8lNBRLF8PQtTP', name='Petal & sepal regressor', type='pipeline', updated_at=2024-05-01 18:50:32 UTC, created_by_id=1)
💡 saved: Run(uid='4MhEkENdu3viywVPDhqf', transform_id=2, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_4MhEkENdu3viywVPDhqf.txt
The output is a dataframe:
df.head()
|   | sepal_length | sepal_width | petal_length | petal_width | iris_organism_name |
|---|---|---|---|---|---|
| 0 | 0.051 | 0.035 | 0.014 | 0.002 | setosa |
| 1 | 0.049 | 0.030 | 0.014 | 0.002 | setosa |
| 2 | 0.047 | 0.032 | 0.013 | 0.002 | setosa |
| 3 | 0.046 | 0.031 | 0.015 | 0.002 | setosa |
| 4 | 0.050 | 0.036 | 0.014 | 0.002 | setosa |
And this is the ML pipeline that produced the dataframe:
ln.core.run_context.transform.view_parents()
Register the output data#
Let’s first register the features of the transformed data:
new_features = ln.Feature.from_df(df)
ln.save(new_features)
How to track units of features?
Use the unit field of Feature. In the above example, you’d do:
for feature in features:
    if feature.type == "number":
        feature.unit = "m"  # SI unit for meters
        feature.save()
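Recording a unit pays off when datasets arrive in different units. As a small sketch of that idea (the conversion table and the helper `to_meters` are assumptions for illustration, not part of LaminDB), values can be brought to the registered unit before integration:

```python
# Conversion factors to meters; the set of supported units is an assumption.
TO_METERS = {"m": 1.0, "cm": 0.01, "mm": 0.001}


def to_meters(value: float, unit: str) -> float:
    """Convert a length in `unit` to meters, failing loudly on unknown units."""
    if unit not in TO_METERS:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_METERS[unit]


# A hypothetical sepal length reported in centimeters by another study.
sepal_length_m = to_meters(5.1, "cm")
print(round(sepal_length_m, 3))
```

Raising on an unknown unit is a deliberate choice here: a silent pass-through would hide exactly the inconsistency we want to catch.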
We can now validate & register the dataframe in one line:
artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet')
✅ storing artifact 'Qtz8SzzWcsj7dLiZvYxJ' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet'
Artifact(uid='Qtz8SzzWcsj7dLiZvYxJ', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='gWxE_pTwNrrYutXJaGbHqA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:37 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)
artifact.features.add(new_features)
Feature sets#
Get an overview of linked features:
artifact.features
Features:
columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
🔗 iris_organism_name (0, core.ULabel):
sepal_length (number)
sepal_width (number)
petal_length (number)
petal_width (number)
You’ll see that they’re always grouped in sets that correspond to records of FeatureSet.
Why does LaminDB model feature sets, not just features?
Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you’ll only need to store 1M instead of 1M x 20k = 20B links.
Interpretation: Model protein panels, gene panels, etc.
Data integration: Feature sets provide the currency that determines whether two collections can be easily concatenated.
These reasons do not hold for label sets. Hence, LaminDB does not model label sets.
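The performance argument is plain arithmetic, and the data-integration argument rests on being able to tell whether two feature sets are identical. The sketch below works through both. The hashing scheme is an illustrative assumption (SHA-256 over sorted names), not the way LaminDB computes the `hash` field of a FeatureSet.

```python
import hashlib

n_samples = 1_000_000
n_features = 20_000

# Linking every sample to every feature individually.
links_per_feature = n_samples * n_features
# Linking every sample to one shared feature set instead.
links_per_feature_set = n_samples


def feature_set_hash(feature_names):
    """Hash a set of feature names independently of their order (illustrative scheme)."""
    joined = "\n".join(sorted(set(feature_names)))
    return hashlib.sha256(joined.encode()).hexdigest()


# Two datasets with the same features in a different order get the same hash.
same = feature_set_hash(["sepal_length", "petal_width"]) == feature_set_hash(
    ["petal_width", "sepal_length"]
)
print(links_per_feature, links_per_feature_set, same)
```

An order-independent hash of this kind is one way to check cheaply whether two datasets measured the same panel.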
A slot provides a string key to access feature sets. It’s typically the accessor within the registered data object, here pd.DataFrame.columns.
Let’s use it to access all linked features:
artifact.features["columns"].df()
uid | name | type | unit | description | registries | synonyms | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
1 | 7Ohz9SVp18RZ | iris_organism_name | category | None | None | core.ULabel | None | 2024-05-01 18:50:30.969360+00:00 | 2024-05-01 18:50:31.950162+00:00 | 1 |
3 | Om8Mw6pFkzIo | sepal_length | number | None | None | None | None | 2024-05-01 18:50:37.714235+00:00 | 2024-05-01 18:50:37.714263+00:00 | 1 |
4 | iwR15EkIJUKV | sepal_width | number | None | None | None | None | 2024-05-01 18:50:37.714387+00:00 | 2024-05-01 18:50:37.714403+00:00 | 1 |
5 | UL8Sh0vgFmp7 | petal_length | number | None | None | None | None | 2024-05-01 18:50:37.714513+00:00 | 2024-05-01 18:50:37.714528+00:00 | 1 |
6 | e1QfngsThY1R | petal_width | number | None | None | None | None | 2024-05-01 18:50:37.714638+00:00 | 2024-05-01 18:50:37.714652+00:00 | 1 |
There is one categorical feature. Let’s add the organism labels:
organism_labels = ln.ULabel.filter(parents__name="is_organism").all()
artifact.labels.add(organism_labels, feature=features.iris_organism_name)
Let’s now add study labels:
artifact.labels.add(ulabels.study0, feature=features.study_name)
✅ linked new feature 'study_name' together with new feature set FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)
In addition to the columns feature set, we now have an external feature set:
artifact.features
Features:
columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
sepal_length (number)
sepal_width (number)
petal_length (number)
petal_width (number)
external: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature')
🔗 study_name (1, core.ULabel): 'study0'
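One way to picture slots is as the keys of a dictionary whose values are feature sets. The sketch below mirrors the output above in plain Python; `feature_sets_by_slot` and `slots_containing` are hypothetical names, not LaminDB objects.

```python
# Slot name -> feature names, mirroring the two feature sets shown above.
feature_sets_by_slot = {
    "columns": [
        "iris_organism_name",
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ],
    "external": ["study_name"],
}


def slots_containing(feature_name):
    """Return the slots whose feature set includes `feature_name`."""
    return [
        slot for slot, names in feature_sets_by_slot.items() if feature_name in names
    ]


print(slots_containing("study_name"), slots_containing("sepal_width"))
```

Features in the columns slot are measured inside the dataframe, whereas study_name annotates the artifact as a whole.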
This is the context for our artifact:
artifact.describe()
Artifact(uid='Qtz8SzzWcsj7dLiZvYxJ', suffix='.parquet', accessor='DataFrame', description='Iris study 1 - after measuring sepal & petal metrics', size=5347, hash='gWxE_pTwNrrYutXJaGbHqA', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:37 UTC)
Provenance:
📎 storage: Storage(uid='sXnIoPYEDMeP', root='/home/runner/work/lamindb/lamindb/docs/lamin-tutorial', type='local', instance_uid='5WuFt3cW4zRx')
📎 transform: Transform(uid='fDh8lNBRLF8PQtTP', name='Petal & sepal regressor', type='pipeline')
📎 run: Run(uid='4MhEkENdu3viywVPDhqf', started_at=2024-05-01 18:50:32 UTC, is_consecutive=True)
📎 created_by: User(uid='00000000', handle='anonymous')
Features:
columns: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature')
🔗 iris_organism_name (3, core.ULabel): 'virginica', 'setosa', 'versicolor'
sepal_length (number)
sepal_width (number)
petal_length (number)
petal_width (number)
external: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature')
🔗 study_name (1, core.ULabel): 'study0'
Labels:
📎 ulabels (4, core.ULabel): 'virginica', 'study0', 'setosa', 'versicolor'
artifact.view_lineage()
See the database content:
ln.view(registries=["Feature", "FeatureSet", "ULabel"])
Feature
uid | name | type | unit | description | registries | synonyms | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
6 | e1QfngsThY1R | petal_width | number | None | None | None | None | 2024-05-01 18:50:37.714638+00:00 | 2024-05-01 18:50:37.714652+00:00 | 1 |
5 | UL8Sh0vgFmp7 | petal_length | number | None | None | None | None | 2024-05-01 18:50:37.714513+00:00 | 2024-05-01 18:50:37.714528+00:00 | 1 |
4 | iwR15EkIJUKV | sepal_width | number | None | None | None | None | 2024-05-01 18:50:37.714387+00:00 | 2024-05-01 18:50:37.714403+00:00 | 1 |
3 | Om8Mw6pFkzIo | sepal_length | number | None | None | None | None | 2024-05-01 18:50:37.714235+00:00 | 2024-05-01 18:50:37.714263+00:00 | 1 |
1 | 7Ohz9SVp18RZ | iris_organism_name | category | None | None | core.ULabel | None | 2024-05-01 18:50:30.969360+00:00 | 2024-05-01 18:50:31.950162+00:00 | 1 |
2 | Nrcl3zMwkL9q | study_name | category | None | None | core.ULabel | None | 2024-05-01 18:50:30.985072+00:00 | 2024-05-01 18:50:31.591442+00:00 | 1 |
FeatureSet
uid | name | n | type | registry | hash | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
5 | qpVYAIhNMtboyMZ1urZR | None | 1 | None | core.Feature | wrVx9j2goWy1Emt2Eq9y | 2024-05-01 18:50:37.816761+00:00 | 2024-05-01 18:50:37.821525+00:00 | 1 |
4 | eAHIaYzbO8KemSbYPD7O | None | 5 | None | core.Feature | sXUHyxw8y_MHRRkdVPak | 2024-05-01 18:50:37.743324+00:00 | 2024-05-01 18:50:37.743350+00:00 | 1 |
2 | p0rX4YMjebjGiZaQItKm | None | 2 | None | core.Feature | AeIcn9GpMb8154_Qhc4Z | 2024-05-01 18:50:31.956398+00:00 | 2024-05-01 18:50:32.415491+00:00 | 1 |
ULabel
uid | name | description | reference | reference_type | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|
id | ||||||||
8 | qIie3ota | is_study | None | None | None | 2024-05-01 18:50:30.910140+00:00 | 2024-05-01 18:50:30.910164+00:00 | 1 |
7 | pf47KQzy | study2 | None | None | None | 2024-05-01 18:50:30.900888+00:00 | 2024-05-01 18:50:30.900903+00:00 | 1 |
6 | iUgtZ2xK | study1 | None | None | None | 2024-05-01 18:50:30.900785+00:00 | 2024-05-01 18:50:30.900801+00:00 | 1 |
5 | mPISxrYD | study0 | None | None | None | 2024-05-01 18:50:30.900654+00:00 | 2024-05-01 18:50:30.900686+00:00 | 1 |
4 | wU4im1Pn | is_organism | None | None | None | 2024-05-01 18:50:30.802939+00:00 | 2024-05-01 18:50:30.802967+00:00 | 1 |
3 | vqItCsbY | virginica | None | None | None | 2024-05-01 18:50:30.764203+00:00 | 2024-05-01 18:50:30.764217+00:00 | 1 |
2 | uY6M63yK | versicolor | None | None | None | 2024-05-01 18:50:30.764098+00:00 | 2024-05-01 18:50:30.764114+00:00 | 1 |
Manage follow-up data#
Assume that a couple of weeks later, we receive a new dataset in a follow-up study 2.
Let’s track a new analysis:
ln.settings.transform.stem_uid = "dMtrt8YMSdl6"
ln.settings.transform.version = "1"
ln.track()
💡 Assuming editor is Jupyter Lab.
💡 Attaching notebook metadata
💡 notebook imports: lamindb==0.71.0 pandas==1.5.3
💡 saved: Transform(uid='dMtrt8YMSdl65zKv', name='Tutorial: Features & labels', key='tutorial2', version='1', type='notebook', updated_at=2024-05-01 18:50:39 UTC, created_by_id=1)
💡 saved: Run(uid='4kwedLR19YL3udJmPMoF', transform_id=3, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_4kwedLR19YL3udJmPMoF.txt
Register a joint collection#
Assume we already ran all preprocessing including the ML model.
We get a DataFrame and store it as an artifact:
df = ln.core.datasets.df_iris_in_meter_study2()
ln.Artifact.from_df(df, description="Iris study 2 - transformed").save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/F5bhv575rauQVoDjgtOM.parquet')
✅ storing artifact 'F5bhv575rauQVoDjgtOM' at '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/F5bhv575rauQVoDjgtOM.parquet'
Artifact(uid='F5bhv575rauQVoDjgtOM', suffix='.parquet', accessor='DataFrame', description='Iris study 2 - transformed', size=5397, hash='Sz8pY3NO5XxBNEis7uq_ZQ', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:40 UTC, storage_id=1, transform_id=3, run_id=3, created_by_id=1)
Let’s load it:
artifact2 = ln.Artifact.filter(description="Iris study 2 - transformed").one()
We can now store the joint collection:
collection = ln.Collection(
    [artifact, artifact2], name="Iris flower study 1 & 2 - transformed"
)
collection.save()
✅ loaded: FeatureSet(uid='eAHIaYzbO8KemSbYPD7O', n=5, registry='core.Feature', hash='sXUHyxw8y_MHRRkdVPak', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)
✅ loaded: FeatureSet(uid='qpVYAIhNMtboyMZ1urZR', n=1, registry='core.Feature', hash='wrVx9j2goWy1Emt2Eq9y', updated_at=2024-05-01 18:50:37 UTC, created_by_id=1)
💡 adding artifact [5] as input for run 3, adding parent transform 2
Auto-concatenate datasets#
Because both datasets measured the same validated feature set, we can auto-concatenate the collection:
collection.load().tail()
|   | sepal_length | sepal_width | petal_length | petal_width | iris_organism_name |
|---|---|---|---|---|---|
| 145 | 0.067 | 0.030 | 0.052 | 0.023 | virginica |
| 146 | 0.063 | 0.025 | 0.050 | 0.019 | virginica |
| 147 | 0.065 | 0.030 | 0.052 | 0.020 | virginica |
| 148 | 0.062 | 0.034 | 0.054 | 0.023 | virginica |
| 149 | 0.059 | 0.030 | 0.051 | 0.018 | virginica |
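The precondition for concatenation, that every dataset shares the same columns, can be sketched in a few lines of plain Python. This is an illustration of the check under simple assumptions (non-empty tables stored as lists of dicts), not what `collection.load()` does internally; `concatenate` is a hypothetical helper.

```python
def concatenate(tables):
    """Concatenate non-empty row lists after checking that all rows share the same columns."""
    expected = set(tables[0][0])
    for table in tables:
        for row in table:
            if set(row) != expected:
                raise ValueError(f"column mismatch: {sorted(set(row) ^ expected)}")
    return [row for table in tables for row in table]


# One row each, copied from the dataframe outputs shown in this tutorial.
table_a = [
    {"sepal_length": 0.051, "sepal_width": 0.035, "petal_length": 0.014,
     "petal_width": 0.002, "iris_organism_name": "setosa"}
]
table_b = [
    {"sepal_length": 0.059, "sepal_width": 0.030, "petal_length": 0.051,
     "petal_width": 0.018, "iris_organism_name": "virginica"}
]

joint = concatenate([table_a, table_b])
print(len(joint))
```

If one table used a differently named column, the check would raise instead of silently producing a table with missing values.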
We can also access & query the underlying two artifact objects:
collection.artifacts.df()
uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
5 | Qtz8SzzWcsj7dLiZvYxJ | 1 | None | .parquet | DataFrame | Iris study 1 - after measuring sepal & petal m... | None | 5347 | gWxE_pTwNrrYutXJaGbHqA | md5 | None | None | 2 | 2 | 1 | True | 2024-05-01 18:50:37.729786+00:00 | 2024-05-01 18:50:37.729812+00:00 | 1 |
6 | F5bhv575rauQVoDjgtOM | 1 | None | .parquet | DataFrame | Iris study 2 - transformed | None | 5397 | Sz8pY3NO5XxBNEis7uq_ZQ | md5 | None | None | 3 | 3 | 1 | True | 2024-05-01 18:50:40.394478+00:00 | 2024-05-01 18:50:40.394507+00:00 | 1 |
Or look at their data lineage:
collection.view_lineage()
Or look at the database:
ln.view()
Artifact
uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||
6 | F5bhv575rauQVoDjgtOM | 1 | None | .parquet | DataFrame | Iris study 2 - transformed | None | 5397 | Sz8pY3NO5XxBNEis7uq_ZQ | md5 | NaN | None | 3 | 3 | 1 | True | 2024-05-01 18:50:40.394478+00:00 | 2024-05-01 18:50:40.394507+00:00 | 1 |
5 | Qtz8SzzWcsj7dLiZvYxJ | 1 | None | .parquet | DataFrame | Iris study 1 - after measuring sepal & petal m... | None | 5347 | gWxE_pTwNrrYutXJaGbHqA | md5 | NaN | None | 2 | 2 | 1 | True | 2024-05-01 18:50:37.729786+00:00 | 2024-05-01 18:50:37.729812+00:00 | 1 |
4 | KN5tE6KT9QdtP29X8w5C | 2 | iris_studies/study2_raw_images | None | None | None | 665518 | PX8Vt9T28y-uCEJO1tKm7A | md5-d | 51.0 | None | 1 | 1 | 1 | False | 2024-05-01 18:50:28.003536+00:00 | 2024-05-01 18:50:28.003564+00:00 | 1 | |
3 | 3ksHD7VWCbUJld5fiYzd | 2 | iris_studies/study1_raw_images | None | None | None | 640617 | j61W__GgImA18CKrIf7FVg | md5-d | 49.0 | None | 1 | 1 | 1 | False | 2024-05-01 18:50:27.385608+00:00 | 2024-05-01 18:50:27.385637+00:00 | 1 | |
2 | 7zPAagBwZTJzQqVLFyqB | 2 | iris_studies/study0_raw_images | None | None | None | 656692 | wVYKPpEsmmrqSpAZIRXCFg | md5-d | 51.0 | None | 1 | 1 | 1 | False | 2024-05-01 18:50:26.600312+00:00 | 2024-05-01 18:50:26.600341+00:00 | 1 | |
1 | o5tTv4TU4OZGSWnULdR9 | 2 | iris_studies/study0_raw_images/meta.csv | .csv | None | None | None | 4355 | ZpAEpN0iFYH6vjZNigic7g | md5 | NaN | None | 1 | 1 | 1 | False | 2024-05-01 18:50:25.683074+00:00 | 2024-05-01 18:50:25.683102+00:00 | 1 |
Collection
uid | name | description | version | hash | reference | reference_type | transform_id | run_id | artifact_id | visibility | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||
4 | RAz3v1ppt4kR6lFG6Wck | Iris flower study 1 & 2 - transformed | None | None | r405fYDOPfwROfWo23b1 | None | None | 3 | 3 | None | 1 | 2024-05-01 18:50:40.438900+00:00 | 2024-05-01 18:50:40.438924+00:00 | 1 |
3 | 32cBtnkazA7BtJB4pisU | Iris collection | Another 50 images | 3 | T-U8z2Zi5rFYdAD9pzmS | None | None | 1 | 1 | None | 1 | 2024-05-01 18:50:28.016434+00:00 | 2024-05-01 18:50:28.016457+00:00 | 1 |
2 | 32cBtnkazA7BtJB4Db1u | Iris collection | Another 50 images | 2 | 5cCK6ZLOPB0cV3tyeZup | None | None | 1 | 1 | None | 1 | 2024-05-01 18:50:27.398726+00:00 | 2024-05-01 18:50:27.398750+00:00 | 1 |
1 | 32cBtnkazA7BtJB4YV3x | Iris collection | 50 image files and metadata | 1 | WwFLpNFmK8GMC2dSGj1W | None | None | 1 | 1 | None | 1 | 2024-05-01 18:50:26.800334+00:00 | 2024-05-01 18:50:26.800363+00:00 | 1 |
Feature
uid | name | type | unit | description | registries | synonyms | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
6 | e1QfngsThY1R | petal_width | number | None | None | None | None | 2024-05-01 18:50:37.714638+00:00 | 2024-05-01 18:50:37.714652+00:00 | 1 |
5 | UL8Sh0vgFmp7 | petal_length | number | None | None | None | None | 2024-05-01 18:50:37.714513+00:00 | 2024-05-01 18:50:37.714528+00:00 | 1 |
4 | iwR15EkIJUKV | sepal_width | number | None | None | None | None | 2024-05-01 18:50:37.714387+00:00 | 2024-05-01 18:50:37.714403+00:00 | 1 |
3 | Om8Mw6pFkzIo | sepal_length | number | None | None | None | None | 2024-05-01 18:50:37.714235+00:00 | 2024-05-01 18:50:37.714263+00:00 | 1 |
1 | 7Ohz9SVp18RZ | iris_organism_name | category | None | None | core.ULabel | None | 2024-05-01 18:50:30.969360+00:00 | 2024-05-01 18:50:31.950162+00:00 | 1 |
2 | Nrcl3zMwkL9q | study_name | category | None | None | core.ULabel | None | 2024-05-01 18:50:30.985072+00:00 | 2024-05-01 18:50:31.591442+00:00 | 1 |
FeatureSet
uid | name | n | type | registry | hash | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
5 | qpVYAIhNMtboyMZ1urZR | None | 1 | None | core.Feature | wrVx9j2goWy1Emt2Eq9y | 2024-05-01 18:50:37.816761+00:00 | 2024-05-01 18:50:37.821525+00:00 | 1 |
4 | eAHIaYzbO8KemSbYPD7O | None | 5 | None | core.Feature | sXUHyxw8y_MHRRkdVPak | 2024-05-01 18:50:37.743324+00:00 | 2024-05-01 18:50:37.743350+00:00 | 1 |
2 | p0rX4YMjebjGiZaQItKm | None | 2 | None | core.Feature | AeIcn9GpMb8154_Qhc4Z | 2024-05-01 18:50:31.956398+00:00 | 2024-05-01 18:50:32.415491+00:00 | 1 |
Run
uid | transform_id | started_at | finished_at | created_by_id | json | report_id | environment_id | is_consecutive | reference | reference_type | created_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
1 | gYXIXnIw7o7ys3cRQgEl | 1 | 2024-05-01 18:50:22.948429+00:00 | None | 1 | None | None | None | True | None | None | 2024-05-01 18:50:22.948563+00:00 |
2 | 4MhEkENdu3viywVPDhqf | 2 | 2024-05-01 18:50:32.462950+00:00 | None | 1 | None | None | None | True | None | None | 2024-05-01 18:50:32.463085+00:00 |
3 | 4kwedLR19YL3udJmPMoF | 3 | 2024-05-01 18:50:39.337405+00:00 | None | 1 | None | None | None | True | None | None | 2024-05-01 18:50:39.337657+00:00 |
Storage
uid | root | description | type | region | instance_uid | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
2 | EzvEnPnH | s3://lamindb-dev-datasets | None | s3 | us-east-1 | pZ1VQkyD3haH | 2024-05-01 18:50:25.627040+00:00 | 2024-05-01 18:50:25.627073+00:00 | 1 |
1 | sXnIoPYEDMeP | /home/runner/work/lamindb/lamindb/docs/lamin-t... | None | local | None | 5WuFt3cW4zRx | 2024-05-01 18:50:20.106918+00:00 | 2024-05-01 18:50:20.106942+00:00 | 1 |
Transform
uid | name | key | version | description | type | latest_report_id | source_code_id | reference | reference_type | created_at | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||
3 | dMtrt8YMSdl65zKv | Tutorial: Features & labels | tutorial2 | 1 | None | notebook | None | None | None | None | 2024-05-01 18:50:39.330633+00:00 | 2024-05-01 18:50:39.330677+00:00 | 1 |
2 | fDh8lNBRLF8PQtTP | Petal & sepal regressor | None | None | None | pipeline | None | None | None | None | 2024-05-01 18:50:32.458950+00:00 | 2024-05-01 18:50:32.458976+00:00 | 1 |
1 | NJvdsWWbJlZS6K79 | Tutorial: Artifacts | tutorial | 0 | None | notebook | None | None | None | None | 2024-05-01 18:50:22.941759+00:00 | 2024-05-01 18:50:22.941802+00:00 | 1 |
ULabel
id | uid | name | description | reference | reference_type | created_at | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|
8 | qIie3ota | is_study | None | None | None | 2024-05-01 18:50:30.910140+00:00 | 2024-05-01 18:50:30.910164+00:00 | 1 |
7 | pf47KQzy | study2 | None | None | None | 2024-05-01 18:50:30.900888+00:00 | 2024-05-01 18:50:30.900903+00:00 | 1 |
6 | iUgtZ2xK | study1 | None | None | None | 2024-05-01 18:50:30.900785+00:00 | 2024-05-01 18:50:30.900801+00:00 | 1 |
5 | mPISxrYD | study0 | None | None | None | 2024-05-01 18:50:30.900654+00:00 | 2024-05-01 18:50:30.900686+00:00 | 1 |
4 | wU4im1Pn | is_organism | None | None | None | 2024-05-01 18:50:30.802939+00:00 | 2024-05-01 18:50:30.802967+00:00 | 1 |
3 | vqItCsbY | virginica | None | None | None | 2024-05-01 18:50:30.764203+00:00 | 2024-05-01 18:50:30.764217+00:00 | 1 |
2 | uY6M63yK | versicolor | None | None | None | 2024-05-01 18:50:30.764098+00:00 | 2024-05-01 18:50:30.764114+00:00 | 1 |
User
id | uid | handle | name | created_at | updated_at |
---|---|---|---|---|---|
1 | 00000000 | anonymous | None | 2024-05-01 18:50:20.102450+00:00 | 2024-05-01 18:50:20.102474+00:00 |
This is it! 😅
If you’re interested, please check out the guides & use cases, or open an issue on GitHub to discuss.
Appendix#
Manage metadata#
Avoid duplicates#
Let’s create a label “project1”:
ln.ULabel(name="project1").save()
ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)
We just created a project1 label. Let’s see what happens if we try to create it again:
label = ln.ULabel(name="project1")
label.save()
❗ loaded ULabel record with same name: 'project1' (disable via `ln.settings.upon_create_search_names`)
ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)
Instead of creating a new record, LaminDB loads and returns the existing record from the database.
If there is no exact match, LaminDB warns you about potential duplicates when you create a record.
Say we spell “project 1” with a space:
ln.ULabel(name="project 1")
❗ record with similar name exist! did you mean to load it?
name | uid | score |
---|---|---|
project1 | hYWK36Qu | 94.1 |
ULabel(uid='4bma7rl4', name='project 1', created_by_id=1)
To avoid inserting duplicates, LaminDB searches for records with similar names whenever you create a new one.
You can switch this search off for performance gains via ln.settings.upon_create_search_names.
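The idea behind this check can be sketched with the standard library alone. The following is a minimal illustration, not LaminDB's implementation: the helpers `find_similar` and `get_or_warn`, the threshold of 80, and the use of `difflib` as the scorer are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def find_similar(name, existing_names, threshold=80.0):
    """Return (existing_name, score) pairs at or above threshold, best first.

    Hypothetical helper: the scorer and threshold are assumptions,
    not what LaminDB uses internally.
    """
    scored = []
    for existing in existing_names:
        # ratio() is in [0, 1]; scale to a 0-100 score with one decimal
        score = round(100 * SequenceMatcher(None, name, existing).ratio(), 1)
        if score >= threshold:
            scored.append((existing, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def get_or_warn(name, existing_names):
    """Mimic the two cases above: exact match loads, near match warns."""
    if name in existing_names:
        return "loaded", []
    return "created", find_similar(name, existing_names)


registry = ["setosa", "versicolor", "virginica", "project1"]
exact = get_or_warn("project1", registry)
near = get_or_warn("project 1", registry)
print(exact)
print(near)
```

With this scorer, “project 1” scores 94.1 against “project1”, which happens to agree with the score in the warning above. That agreement does not establish which scorer LaminDB actually uses.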
Update & delete records#
label = ln.ULabel.filter(name="project1").first()
label
ULabel(uid='hYWK36Qu', name='project1', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)
label.name = "project1a"
label.save()
label
ULabel(uid='hYWK36Qu', name='project1a', updated_at=2024-05-01 18:50:40 UTC, created_by_id=1)
label.delete()
(1, {'lnschema_core.ULabel': 1})
Manage storage#
Change default storage#
The default storage location is:
ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/lamin-tutorial')
You can change it by setting ln.settings.storage = "s3://my-bucket".
See all storage locations#
ln.Storage.df()
id | uid | root | description | type | region | instance_uid | created_at | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|
2 | EzvEnPnH | s3://lamindb-dev-datasets | None | s3 | us-east-1 | pZ1VQkyD3haH | 2024-05-01 18:50:25.627040+00:00 | 2024-05-01 18:50:25.627073+00:00 | 1 |
1 | sXnIoPYEDMeP | /home/runner/work/lamindb/lamindb/docs/lamin-t... | None | local | None | 5WuFt3cW4zRx | 2024-05-01 18:50:20.106918+00:00 | 2024-05-01 18:50:20.106942+00:00 | 1 |
Set verbosity#
To reduce the number of logging messages, set verbosity:
ln.settings.verbosity = 3 # only show info, no hints
# clean up what we wrote in this notebook
!lamin delete --force lamin-tutorial
!rm -r lamin-tutorial
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.11.9/x64/bin/lamin", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/rich_click/rich_command.py", line 360, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/rich_click/rich_command.py", line 152, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamin_cli/__main__.py", line 103, in delete
return delete(instance, force=force)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamindb_setup/_delete.py", line 140, in delete
n_objects = check_storage_is_empty(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/site-packages/lamindb_setup/core/upath.py", line 814, in check_storage_is_empty
raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb contains 3 objects ('./lamindb/_is_initialized' ignored) - delete them prior to deleting the instance
['/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/F5bhv575rauQVoDjgtOM.parquet', '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/Qtz8SzzWcsj7dLiZvYxJ.parquet', '/home/runner/work/lamindb/lamindb/docs/lamin-tutorial/.lamindb/_is_initialized']
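The cleanup fails because the instance's storage still contains artifact files, and the error message asks for them to be deleted before the instance itself. To see which files block deletion, a check of this kind can be sketched with the standard library. The helper `list_blocking_objects` and its `ignored` default are hypothetical and only mirror the wording of the error message; this is not the lamindb_setup implementation.

```python
from pathlib import Path
import tempfile


def list_blocking_objects(storage_root, ignored=("_is_initialized",)):
    """List files under <storage_root>/.lamindb other than the ignored marker.

    Hypothetical helper for illustration only.
    """
    lamindb_dir = Path(storage_root) / ".lamindb"
    if not lamindb_dir.is_dir():
        return []
    return sorted(
        p for p in lamindb_dir.rglob("*") if p.is_file() and p.name not in ignored
    )


# Demonstrate on a throwaway directory that imitates the layout in the error message
with tempfile.TemporaryDirectory() as root:
    storage_dir = Path(root) / ".lamindb"
    storage_dir.mkdir()
    (storage_dir / "_is_initialized").touch()
    (storage_dir / "example.parquet").touch()
    blocking_names = [p.name for p in list_blocking_objects(root)]

print(blocking_names)
```

Any file this reports would need to be removed, ideally by deleting the corresponding artifacts through LaminDB, before the instance can be deleted.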