Tutorial: Artifacts#

Biology is measured in samples that generate batched datasets.

LaminDB provides a framework to transform these datasets into more useful representations: validated, queryable collections, machine learning models, and analytical insights.

The tutorial has two parts, each a Jupyter notebook:

  1. Tutorial: Artifacts - register & access

  2. Tutorial: Features & labels - validate & annotate

Setup#

Install the lamindb Python package:

pip install 'lamindb[jupyter,aws]'

You can now init a LaminDB instance with a directory ./lamin-tutorial for storing data:

import lamindb as ln

ln.setup.init(storage="./lamin-tutorial")  # or "s3://my-bucket" or "gs://my-bucket"

# if new to LaminDB, set verbosity to hint level

ln.settings.verbosity = "hint"
❗ To use lamindb, you need to connect to an instance.

Connect to an instance: `ln.connect()`. Init an instance: `ln.setup.init()`.

If you used the CLI to set up lamindb in a notebook, restart the Python session.
πŸ’‘ connected lamindb: anonymous/lamin-tutorial
What else can I configure during setup?
  1. Instead of the default SQLite database, use PostgreSQL:

    db=postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>
    
  2. Instead of a default instance name derived from storage, provide a custom name:

    name=myinstance
    
  3. Beyond the core schema, use bionty and other schemas:

    schema=bionty,custom1,template1
    

For more, see Install & setup.
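Taken together, the options above can be passed to ln.setup.init in a single call. A sketch, assuming a reachable Postgres server and the bionty schema module are available (the DSN, instance name, and bucket below are placeholders, not real resources):

```python
import lamindb as ln

# all values below are hypothetical; adjust to your environment
ln.setup.init(
    storage="s3://my-bucket",  # cloud storage root instead of a local directory
    db="postgresql://user:pwd@hostname:5432/dbname",  # Postgres instead of SQLite
    name="myinstance",  # custom instance name instead of one derived from storage
    schema="bionty",  # extra schema module(s) beyond the core schema
)
```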

Track a data source#

The code that generates a dataset is a transform (Transform). It could be a script, a notebook, a pipeline or a UI action.

Let’s track the notebook that’s being run:

ln.settings.transform.stem_uid = "NJvdsWWbJlZS"
ln.settings.transform.version = "0"
ln.track()
πŸ’‘ Assuming editor is Jupyter Lab.
πŸ’‘ Attaching notebook metadata
πŸ’‘ notebook imports: lamindb==0.70.2
πŸ’‘ saved: Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook', updated_at=2024-04-19 17:40:09 UTC, created_by_id=1)
πŸ’‘ saved: Run(uid='Un6kaGLyWaVTuoRq275x', transform_id=1, created_by_id=1)
πŸ’‘ tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_Un6kaGLyWaVTuoRq275x.txt

By calling track(), the notebook is automatically linked as the source of all data that’s about to be saved!

What happened under the hood?
  1. Imported package versions of current notebook were detected

  2. Notebook metadata was detected and stored in a Transform record

  3. Run metadata was detected and stored in a Run record

The Transform class registers data transformations: a notebook, a pipeline or a UI operation.

The Run class registers executions of transforms. Several runs can be linked to the same transform if executed with different context (time, user, input data, etc.).

How do I track a pipeline instead of a notebook?
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
Why should I care about tracking notebooks?

If you can, avoid interactive notebooks: anything that can be a deterministic pipeline should be a pipeline.

Still: much insight generated from biological data is driven by computational biologists interacting with it.

A notebook that’s run a single time on specific data is not a pipeline: it’s a (versioned) document that produced insight or some other form of data representation (with parallels to an ELN in the wetlab).

Because humans are in the loop, most mistakes happen when using notebooks: track() helps avoid some of them.

(An early blog post on this is here.)

Manage artifacts#

We’ll work with a toy collection of image files and transform it into higher-level features for downstream analysis.

(For other data types: see Data types.)

Consider 3 directories storing images & metadata of Iris flowers, generated in 3 successive studies:

ln.UPath("s3://lamindb-dev-datasets/iris_studies").view_tree()
iris_studies (3 sub-directories & 151 files with suffixes '.csv', '.jpg'): 
β”œβ”€β”€ study0_raw_images
β”‚   β”œβ”€β”€ iris-0337d20a3b7273aa0ddaa7d6afb57a37a759b060e4401871db3cefaa6adc068d.jpg
β”‚   β”œβ”€β”€ iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce4ef46a3239e4b939bd9807b.jpg
β”‚   β”œβ”€β”€ iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee7104a0c4200218a33903f82444.jpg
β”‚   β”œβ”€β”€ iris-0fec175448a23db03c1987527f7e9bb74c18cffa76ef003f962c62603b1cbb87.jpg
β”‚   β”œβ”€β”€ iris-125b6645e086cd60131764a6bed12650e0f7f2091c8bbb72555c103196c01881.jpg
β”‚   β”œβ”€β”€ iris-13dfaff08727abea3da8cfd8d097fe1404e76417fefe27ff71900a89954e145a.jpg
β”‚   ...
β”‚   └── meta.csv
β”œβ”€β”€ study1_raw_images
β”‚   β”œβ”€β”€ iris-0879d3f5b337fe512da1c7bf1d2bfd7616d744d3eef7fa532455a879d5cc4ba0.jpg
β”‚   β”œβ”€β”€ iris-0b486eebacd93e114a6ec24264e035684cebe7d2074eb71eb1a71dd70bf61e8f.jpg
β”‚   β”œβ”€β”€ iris-0ff5ba898a0ec179a25ca217af45374fdd06d606bb85fc29294291facad1776a.jpg
β”‚   β”œβ”€β”€ iris-1175239c07a943d89a6335fb4b99a9fb5aabb2137c4d96102f10b25260ae523f.jpg
β”‚   β”œβ”€β”€ iris-1289c57b571e8e98e4feb3e18a890130adc145b971b7e208a6ce5bad945b4a5a.jpg
β”‚   β”œβ”€β”€ iris-12adb3a8516399e27ff1a9d20d28dca4674836ed00c7c0ae268afce2c30c4451.jpg
β”‚   ...
β”‚   └── meta.csv
└── study2_raw_images
    β”œβ”€β”€ iris-01cdd55ca6402713465841abddcce79a2e906e12edf95afb77c16bde4b4907dc.jpg
    β”œβ”€β”€ iris-02868b71ddd9b33ab795ac41609ea7b20a6e94f2543fad5d7fa11241d61feacf.jpg
    β”œβ”€β”€ iris-0415d2f3295db04bebc93249b685f7d7af7873faa911cd270ecd8363bd322ed5.jpg
    β”œβ”€β”€ iris-0c826b6f4648edf507e0cafdab53712bb6fd1f04dab453cee8db774a728dd640.jpg
    β”œβ”€β”€ iris-10fb9f154ead3c56ba0ab2c1ab609521c963f2326a648f82c9d7cabd178fc425.jpg
    β”œβ”€β”€ iris-14cbed88b0d2a929477bdf1299724f22d782e90f29ce55531f4a3d8608f7d926.jpg
    ...
    └── meta.csv

Our goal is to turn these directories into a validated & queryable collection that can be used alongside many other collections.

Register an artifact#

LaminDB uses the Artifact class to model files, folders & arrays in storage with their metadata. It’s a registry to manage search, queries, validation & access of storage locations.

Let’s create an Artifact record from one of the files:

artifact = ln.Artifact(
    "s3://lamindb-dev-datasets/iris_studies/study0_raw_images/meta.csv"
)
artifact
❗ generating a new storage location at s3://lamindb-dev-datasets
πŸ’‘ path in storage 's3://lamindb-dev-datasets' with key 'iris_studies/study0_raw_images/meta.csv'
Artifact(uid='Jcs76l0nQW6Oi1go0aeP', key='iris_studies/study0_raw_images/meta.csv', suffix='.csv', size=4355, hash='ZpAEpN0iFYH6vjZNigic7g', hash_type='md5', visibility=1, key_is_virtual=False, storage_id=2, transform_id=1, run_id=1, created_by_id=1)
Which fields are populated when creating an artifact record?

Basic fields:

  • uid: universal ID

  • key: storage key, a relative path of the artifact in storage

  • description: an optional string description

  • storage: the storage location (the root, say, an S3 bucket or a local directory)

  • suffix: an optional file/path suffix

  • size: the artifact size in bytes

  • hash: a hash useful to check for integrity and collisions (is this artifact already stored?)

  • hash_type: the type of the hash (usually, an MD5 or SHA1 checksum)

  • created_at: time of creation

  • updated_at: time of last update

Provenance-related fields:

  • created_by: the User who created the artifact

  • transform: the Transform (pipeline, notebook, instrument, app) that was run

  • run: the Run of the transform that created the artifact

For a full reference, see Artifact.
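The 22-character hash shown above is consistent with a base64url-encoded MD5 digest with the padding stripped. A minimal sketch of that encoding (this is an assumption about the format, not LaminDB's actual implementation):

```python
import base64
import hashlib

def content_hash(data: bytes) -> str:
    # MD5 digest (16 bytes) -> base64url (24 chars) -> strip '=' padding (22 chars)
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")

print(content_hash(b"hello world"))  # a 22-character string
```

A hash like this lets the registry cheaply detect whether an identical artifact has already been stored.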

Upon .save(), artifact metadata is written to the database:

artifact.save()
What happens during save?

In the database: An artifact record is inserted into the artifact registry. If the artifact record exists already, it’s updated.

In storage:

  • If the default storage is in the cloud, .save() triggers an upload for a local artifact.

  • If the artifact is already in a registered storage location, only the metadata of the record is saved to the artifact registry.

We can get an overview of all artifacts in the database by calling df():

ln.Artifact.df()
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
1 Jcs76l0nQW6Oi1go0aeP 2 iris_studies/study0_raw_images/meta.csv .csv None None None 4355 ZpAEpN0iFYH6vjZNigic7g md5 None None 1 1 1 False 2024-04-19 17:40:11.795128+00:00 2024-04-19 17:40:11.795156+00:00 1

View data lineage#

Because we called track(), we know that the artifact was saved in the current notebook (view_lineage()):

artifact.view_lineage()
(figure: data lineage graph linking the artifact to its transform and run)

We can also directly access its linked Transform & Run records:

artifact.transform
Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook', updated_at=2024-04-19 17:40:09 UTC, created_by_id=1)
artifact.run
Run(uid='Un6kaGLyWaVTuoRq275x', started_at=2024-04-19 17:40:09 UTC, is_consecutive=True, transform_id=1, created_by_id=1)

(For a comprehensive example with data lineage through app uploads, pipelines & notebooks of multiple data types, see Project flow.)

Access an artifact#

path gives you the file path (UPath):

artifact.path
S3Path('s3://lamindb-dev-datasets/iris_studies/study0_raw_images/meta.csv')

To download the artifact to a local cache, call cache():

artifact.cache()
PosixUPath('/home/runner/.cache/lamindb/lamindb-dev-datasets/iris_studies/study0_raw_images/meta.csv')

To load data into memory with a default loader, call load():

df = artifact.load(index_col=0)
df.head()
0 1
0 iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce... setosa
1 iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee710... versicolor
2 iris-9ffe51c2abd973d25a299647fa9ccaf6aa9c8eecf... versicolor
3 iris-83f433381b755101b9fc9fbc9743e35fbb8a1a109... setosa
4 iris-bdae8314e4385d8e2322abd8e63a82758a9063c77... virginica
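For a .csv artifact, load() plausibly delegates to pandas, forwarding keyword arguments like index_col=0. A sketch of the equivalent call on stand-in data (the rows below are hypothetical, not the real meta.csv):

```python
import io

import pandas as pd

# stand-in for a meta.csv with an unnamed index column and two data columns
csv_text = """,0,1
0,iris-aaaa.jpg,setosa
1,iris-bbbb.jpg,versicolor
"""

# roughly what `artifact.load(index_col=0)` does for a '.csv' suffix
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.head())
```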

If the data is large, you’ll likely want to query it via backed(). For more on this, see: Query arrays.

How do I update an artifact?

If you’d like to replace the underlying stored object, use replace().

If you’d like to update metadata:

artifact.description = "My new description"
artifact.save()  # save the change to the database

Register directories as artifacts#

We now register the entire directory for study 0 as an artifact:

study0_data = ln.Artifact("s3://lamindb-dev-datasets/iris_studies/study0_raw_images")
study0_data.save()
ln.Artifact.df()  # see the registry content
πŸ’‘ path in storage 's3://lamindb-dev-datasets' with key 'iris_studies/study0_raw_images'
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
2 CDanhHQaYdJVTlQw2cAj 2 iris_studies/study0_raw_images None None None 656692 wVYKPpEsmmrqSpAZIRXCFg md5-d 51.0 None 1 1 1 False 2024-04-19 17:40:12.699383+00:00 2024-04-19 17:40:12.699413+00:00 1
1 Jcs76l0nQW6Oi1go0aeP 2 iris_studies/study0_raw_images/meta.csv .csv None None None 4355 ZpAEpN0iFYH6vjZNigic7g md5 NaN None 1 1 1 False 2024-04-19 17:40:11.795128+00:00 2024-04-19 17:40:11.795156+00:00 1

Filter & search artifacts#

You can search artifacts directly based on the Artifact registry:

ln.Artifact.search("meta").head()
key description score
uid
Jcs76l0nQW6Oi1go0aeP iris_studies/study0_raw_images/meta.csv 60.0
CDanhHQaYdJVTlQw2cAj iris_studies/study0_raw_images 48.9

You can also query & search the artifact by any metadata combination.

For instance, look up a user with auto-complete from the User registry:

users = ln.User.lookup()
users.anonymous
User(uid='00000000', handle='anonymous', updated_at=2024-04-19 17:40:07 UTC)
How do I act non-anonymously?
  1. Sign up for a free account (see more info) and copy the API key.

  2. Log in on the command line:

    lamin login <email> --key <API-key>
    

Filter the Transform registry for a name:

transform = ln.Transform.filter(
    name__icontains="Artifacts"
).one()  # get exactly one result
transform
Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook', updated_at=2024-04-19 17:40:09 UTC, created_by_id=1)
What does a double underscore mean?

For any field, the double underscore defines a comparator, e.g.,

  • name__icontains="Martha": name contains "Martha" when ignoring case

  • name__startswith="Martha": name starts with "Martha"

  • name__in=["Martha", "John"]: name is "John" or "Martha"

For more info, see: Query & search registries.
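These lookups follow Django-style field comparators. Their semantics can be illustrated in plain Python (an illustration only, not how LaminDB evaluates them — real filters are translated into database queries):

```python
names = ["Martha Smith", "john", "Martha", "Ada"]

# name__icontains="martha": contains, ignoring case
assert [n for n in names if "martha" in n.lower()] == ["Martha Smith", "Martha"]

# name__startswith="Martha": case-sensitive prefix match
assert [n for n in names if n.startswith("Martha")] == ["Martha Smith", "Martha"]

# name__in=["Martha", "john"]: membership in a list
assert [n for n in names if n in ["Martha", "john"]] == ["john", "Martha"]
```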

Use these results to filter the Artifact registry:

ln.Artifact.filter(
    created_by=users.anonymous,
    transform=transform,
    suffix=".jpg",
).df().head()
uid key suffix accessor description version size hash hash_type n_objects n_observations visibility key_is_virtual created_at updated_at storage_id transform_id run_id created_by_id
id

You can also query for directories using key__startswith (like AWS S3, LaminDB treats a directory as a prefix of the storage key):

ln.Artifact.filter(key__startswith="iris_studies/study0_raw_images/").df().head()
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
1 Jcs76l0nQW6Oi1go0aeP 2 iris_studies/study0_raw_images/meta.csv .csv None None None 4355 ZpAEpN0iFYH6vjZNigic7g md5 None None 1 1 1 False 2024-04-19 17:40:11.795128+00:00 2024-04-19 17:40:11.795156+00:00 1

Note

You can look up, filter & search any registry (Registry).

You can chain filter() statements and search(): ln.Artifact.filter(suffix=".jpg").search("my image")

An empty filter returns the entire registry: ln.Artifact.filter()

For more info, see: Query & search registries.


Describe artifacts#

Get an overview of what happened:

artifact.describe()
Artifact(uid='Jcs76l0nQW6Oi1go0aeP', key='iris_studies/study0_raw_images/meta.csv', suffix='.csv', size=4355, hash='ZpAEpN0iFYH6vjZNigic7g', hash_type='md5', visibility=1, key_is_virtual=False, updated_at=2024-04-19 17:40:11 UTC)

Provenance:
  πŸ“Ž storage: Storage(uid='kobXuZ1eiiLn', root='s3://lamindb-dev-datasets', type='s3', region='us-east-1')
  πŸ“Ž transform: Transform(uid='NJvdsWWbJlZS6K79', name='Tutorial: Artifacts', key='tutorial', version='0', type='notebook')
  πŸ“Ž run: Run(uid='Un6kaGLyWaVTuoRq275x', started_at=2024-04-19 17:40:09 UTC, is_consecutive=True)
  πŸ“Ž created_by: User(uid='00000000', handle='anonymous')
artifact.view_lineage()
(figure: data lineage graph of the artifact)

Version artifacts#

If you’d like to version an artifact or transform, either provide the version parameter when creating it or create new versions through is_new_version_of.

For instance:

new_artifact = ln.Artifact(data, is_new_version_of=old_artifact)

If you’d like to add a registered artifact to a version family, use add_to_version_family.

For instance:

new_artifact.add_to_version_family(old_artifact)

Are there remaining questions about storing artifacts? If so, see: Storage FAQ.

Collections#

An artifact can model anything that’s in storage: a file, a collection, an array, a machine learning model.

Oftentimes, several artifacts together represent a collection.

Let’s store the artifact for study0_data as a Collection:

collection = ln.Collection(
    study0_data,
    name="Iris collection",
    version="1",
    description="50 image files and metadata",
)
collection
Collection(uid='KeVx4xgsVQWQHhlVssV0', name='Iris collection', description='50 image files and metadata', version='1', hash='WwFLpNFmK8GMC2dSGj1W', visibility=1, transform_id=1, run_id=1, created_by_id=1)

And save it:

collection.save()

Now, we perform subsequent studies by collecting more data.

We’d like to keep track of their data as part of a growing versioned collection:

artifacts = [study0_data]
for folder_name in ["study1_raw_images", "study2_raw_images"]:
    # create an artifact for the folder
    artifact = ln.Artifact(f"s3://lamindb-dev-datasets/iris_studies/{folder_name}")
    artifact.save()
    artifacts.append(artifact)
    # create a new version of the collection
    collection = ln.Collection(
        artifacts, is_new_version_of=collection, description="Another 50 images"
    )
    collection.save()
πŸ’‘ path in storage 's3://lamindb-dev-datasets' with key 'iris_studies/study1_raw_images'
πŸ’‘ path in storage 's3://lamindb-dev-datasets' with key 'iris_studies/study2_raw_images'

See all artifacts:

ln.Artifact.df()
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
4 fmI4CzVG0FgA9xPcrA5a 2 iris_studies/study2_raw_images None None None 665518 PX8Vt9T28y-uCEJO1tKm7A md5-d 51.0 None 1 1 1 False 2024-04-19 17:40:14.096765+00:00 2024-04-19 17:40:14.096795+00:00 1
3 gPpSlQysec2GLvbVQjUy 2 iris_studies/study1_raw_images None None None 640617 j61W__GgImA18CKrIf7FVg md5-d 49.0 None 1 1 1 False 2024-04-19 17:40:13.520866+00:00 2024-04-19 17:40:13.520900+00:00 1
2 CDanhHQaYdJVTlQw2cAj 2 iris_studies/study0_raw_images None None None 656692 wVYKPpEsmmrqSpAZIRXCFg md5-d 51.0 None 1 1 1 False 2024-04-19 17:40:12.699383+00:00 2024-04-19 17:40:12.699413+00:00 1
1 Jcs76l0nQW6Oi1go0aeP 2 iris_studies/study0_raw_images/meta.csv .csv None None None 4355 ZpAEpN0iFYH6vjZNigic7g md5 NaN None 1 1 1 False 2024-04-19 17:40:11.795128+00:00 2024-04-19 17:40:11.795156+00:00 1

See all collections:

ln.Collection.df()
uid name description version hash reference reference_type transform_id run_id artifact_id visibility created_at updated_at created_by_id
id
3 KeVx4xgsVQWQHhlVQCAZ Iris collection Another 50 images 3 T-U8z2Zi5rFYdAD9pzmS None None 1 1 None 1 2024-04-19 17:40:14.110651+00:00 2024-04-19 17:40:14.110697+00:00 1
2 KeVx4xgsVQWQHhlVt74c Iris collection Another 50 images 2 5cCK6ZLOPB0cV3tyeZup None None 1 1 None 1 2024-04-19 17:40:13.534888+00:00 2024-04-19 17:40:13.534914+00:00 1
1 KeVx4xgsVQWQHhlVssV0 Iris collection 50 image files and metadata 1 WwFLpNFmK8GMC2dSGj1W None None 1 1 None 1 2024-04-19 17:40:12.894475+00:00 2024-04-19 17:40:12.894503+00:00 1

Most functionality that you just learned about artifacts - e.g., queries & provenance - also applies to Collection.

But Collection is an abstraction over storing data in one or several artifacts and does not have a key field.

We’ll learn more about collections in the next part of the tutorial.

View changes#

With view(), you can see the latest changes to the database:

ln.view()  # link tables in the database are not shown
Artifact
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
4 fmI4CzVG0FgA9xPcrA5a 2 iris_studies/study2_raw_images None None None 665518 PX8Vt9T28y-uCEJO1tKm7A md5-d 51.0 None 1 1 1 False 2024-04-19 17:40:14.096765+00:00 2024-04-19 17:40:14.096795+00:00 1
3 gPpSlQysec2GLvbVQjUy 2 iris_studies/study1_raw_images None None None 640617 j61W__GgImA18CKrIf7FVg md5-d 49.0 None 1 1 1 False 2024-04-19 17:40:13.520866+00:00 2024-04-19 17:40:13.520900+00:00 1
2 CDanhHQaYdJVTlQw2cAj 2 iris_studies/study0_raw_images None None None 656692 wVYKPpEsmmrqSpAZIRXCFg md5-d 51.0 None 1 1 1 False 2024-04-19 17:40:12.699383+00:00 2024-04-19 17:40:12.699413+00:00 1
1 Jcs76l0nQW6Oi1go0aeP 2 iris_studies/study0_raw_images/meta.csv .csv None None None 4355 ZpAEpN0iFYH6vjZNigic7g md5 NaN None 1 1 1 False 2024-04-19 17:40:11.795128+00:00 2024-04-19 17:40:11.795156+00:00 1
Collection
uid name description version hash reference reference_type transform_id run_id artifact_id visibility created_at updated_at created_by_id
id
3 KeVx4xgsVQWQHhlVQCAZ Iris collection Another 50 images 3 T-U8z2Zi5rFYdAD9pzmS None None 1 1 None 1 2024-04-19 17:40:14.110651+00:00 2024-04-19 17:40:14.110697+00:00 1
2 KeVx4xgsVQWQHhlVt74c Iris collection Another 50 images 2 5cCK6ZLOPB0cV3tyeZup None None 1 1 None 1 2024-04-19 17:40:13.534888+00:00 2024-04-19 17:40:13.534914+00:00 1
1 KeVx4xgsVQWQHhlVssV0 Iris collection 50 image files and metadata 1 WwFLpNFmK8GMC2dSGj1W None None 1 1 None 1 2024-04-19 17:40:12.894475+00:00 2024-04-19 17:40:12.894503+00:00 1
Run
uid transform_id started_at finished_at created_by_id json report_id environment_id is_consecutive reference reference_type created_at
id
1 Un6kaGLyWaVTuoRq275x 1 2024-04-19 17:40:09.302707+00:00 None 1 None None None True None None 2024-04-19 17:40:09.302838+00:00
Storage
uid root description type region created_at updated_at created_by_id
id
2 kobXuZ1eiiLn s3://lamindb-dev-datasets None s3 us-east-1 2024-04-19 17:40:11.732916+00:00 2024-04-19 17:40:11.732955+00:00 1
1 NIXrqfee /home/runner/work/lamindb/lamindb/docs/lamin-t... None local None 2024-04-19 17:40:07.762074+00:00 2024-04-19 17:40:07.762097+00:00 1
Transform
uid name key version description type latest_report_id source_code_id reference reference_type created_at updated_at created_by_id
id
1 NJvdsWWbJlZS6K79 Tutorial: Artifacts tutorial 0 None notebook None None None None 2024-04-19 17:40:09.295516+00:00 2024-04-19 17:40:09.295547+00:00 1
User
uid handle name created_at updated_at
id
1 00000000 anonymous None 2024-04-19 17:40:07.757709+00:00 2024-04-19 17:40:07.757734+00:00

Save notebook & scripts#

When you’ve completed the work on a notebook or script, you can save the source code and, for notebooks, an execution report to your storage location like so:

ln.finish()

This enables you to query execution report & source code via transform.latest_report and transform.source_code.

If you registered the instance on LaminHub, you can share it like here.

If you want to cache a notebook or script, call:

lamin get https://lamin.ai/laminlabs/lamindata/transform/NJvdsWWbJlZSz8

Read on#

Now, you already know about 6 out of 9 LaminDB core classes! The two most central are:

  • Artifact: files, folders & arrays in storage

  • Collection: collections of artifacts

And the four registries related to provenance:

  • Transform: transforms of artifacts

  • Run: runs of transforms

  • User: users

  • Storage: storage locations like S3/GCP buckets or local directories

If you want to validate data, label artifacts, and manage features, read on: Tutorial: Features & labels.