Are LaminDB’s operations idempotent? (What happens if I save the same files & records twice?)#
LaminDB’s operations are idempotent. This matters for two reasons:
Re-running a workflow should not raise errors.
Re-running a workflow should not insert duplicate data.
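To make the two requirements concrete, here is a minimal sketch using a toy in-memory registry. `ToyRegistry` and its `save` method are hypothetical names for illustration and are not part of LaminDB. Saving is written as an upsert keyed by record id, so repeating the same call neither raises nor adds a row.

```python
class ToyRegistry:
    """Hypothetical in-memory stand-in for a database table (not LaminDB)."""

    def __init__(self):
        self.rows = {}  # maps record id -> record fields

    def save(self, record_id, **fields):
        # Upsert: writing the same id again overwrites the row with
        # identical content instead of raising or appending a duplicate.
        self.rows[record_id] = dict(fields)
        return record_id


registry = ToyRegistry()
registry.save("label-1", name="My project 1")
state_after_first_run = dict(registry.rows)

# Re-running the same step leaves the registry unchanged.
registry.save("label-1", name="My project 1")
state_after_second_run = dict(registry.rows)
```

An operation is idempotent when applying it twice leaves the same state as applying it once, which is what the two snapshots compare.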
Summary#
Metadata records#
If you try to create any metadata record (Registry) and upon_create_search_names is True (the default):
LaminDB will warn you if a record with a similar name exists and display a table of similar existing records. You can then decide whether you’d like to save a new record to the database or rather query an existing one from the table.
If a name already has an exact match in a registry, LaminDB will return it instead of creating a new record.
If you set upon_create_search_names to False, no search is performed and you’ll directly populate the DB.
Files#
If you try to create a File object from the same content, depending on upon_file_create_if_hash_exists,
you’ll get an existing object, if upon_file_create_if_hash_exists = "warn_return_existing" (the default)
you’ll get an error, if upon_file_create_if_hash_exists = "error"
you’ll get a warning and a new object, if upon_file_create_if_hash_exists = "warn_create_new"
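The three policies can be sketched with a toy content-addressed store. This is an illustration under stated assumptions, not LaminDB’s implementation: `ToyFileStore`, its `create` method, and the integer ids are hypothetical, and only the three policy strings come from the text above.

```python
import hashlib

POLICIES = {"warn_return_existing", "error", "warn_create_new"}


class ToyFileStore:
    """Hypothetical content-addressed store (not LaminDB's implementation)."""

    def __init__(self, if_hash_exists="warn_return_existing"):
        if if_hash_exists not in POLICIES:
            raise ValueError(f"unknown policy: {if_hash_exists}")
        self.if_hash_exists = if_hash_exists
        self.files = []  # list of (file_id, content_hash) tuples
        self.warnings = []  # collected instead of logged, for inspection

    def create(self, content: bytes) -> int:
        content_hash = hashlib.md5(content).hexdigest()
        existing = [fid for fid, h in self.files if h == content_hash]
        if existing:
            if self.if_hash_exists == "error":
                raise RuntimeError("file with same hash exists")
            if self.if_hash_exists == "warn_return_existing":
                self.warnings.append("returning existing file with same hash")
                return existing[0]
            # Remaining policy: "warn_create_new".
            self.warnings.append("creating new file despite existing hash")
        file_id = len(self.files)
        self.files.append((file_id, content_hash))
        return file_id


store = ToyFileStore()
first = store.create(b"same bytes")
second = store.create(b"same bytes")  # default policy: existing id returned

store.if_hash_exists = "warn_create_new"
third = store.create(b"same bytes")  # a new id despite the identical hash

store.if_hash_exists = "error"
try:
    store.create(b"same bytes")
    raised = False
except RuntimeError:
    raised = True
```

Only the default policy makes creation itself idempotent; "warn_create_new" deliberately trades that property for the ability to register the same content twice.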
Examples#
!lamin init --storage ./test-idempotency
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:20:46)
✅ saved: Storage(id='0ZcAs8fO', root='/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency', type='local', updated_at=2023-09-26 15:20:46, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/test-idempotency
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import pytest
ln.settings.verbosity = "hint"
💡 loaded instance: testuser1/test-idempotency (lamindb 0.54.2)
Metadata records#
assert ln.settings.upon_create_search_names
Let us add a first record to the ULabel registry:
label = ln.ULabel(name="My project 1")
label.save()
If we create a new record, we’ll automatically get search results that indicate whether we’re about to duplicate an entry:
label = ln.ULabel(name="My project 2")
❗ record with similar name exist! did you mean to load it?
name | id | __ratio__
---|---|---
My project 1 | Hv8YEhhK | 91.666667
label.save()
In case we match an existing name directly, we’ll get the existing object:
label = ln.ULabel(name="My project 1")
✅ loaded ULabel record with exact same name: 'My project 1'
If we save it again, it will not create a new entry in the registry:
label.save()
Now, if we create a third record, we’ll get two alternatives:
label = ln.ULabel(name="My project 3")
❗ records with similar names exist! did you mean to load one of them?
name | id | __ratio__
---|---|---
My project 1 | Hv8YEhhK | 91.666667
My project 2 | pvhTIXjQ | 91.666667
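The behavior walked through above (exact match returns the existing record, near matches trigger a warning table) can be sketched with the standard library. This is a hedged illustration: `search_names`, `similarity`, and the threshold of 80 are hypothetical, and the scorer LaminDB actually uses may differ. The `difflib` ratio, 2 × matched characters / total characters, does happen to give 22/24 ≈ 91.666667 for these names, which is consistent with the `__ratio__` column.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # 2 * matched characters / total characters, scaled to 0..100.
    return 100 * SequenceMatcher(None, a, b).ratio()


def search_names(name, existing_names, threshold=80.0):
    """Hypothetical helper: classify `name` against names already stored."""
    if name in existing_names:
        return "exact", [(name, 100.0)]
    scored = [(n, round(similarity(name, n), 6)) for n in existing_names]
    hits = [(n, score) for n, score in scored if score >= threshold]
    return ("similar" if hits else "new"), hits


existing = ["My project 1", "My project 2"]
status_exact, _ = search_names("My project 1", existing)
status_similar, hits = search_names("My project 3", existing)
status_new, _ = search_names("Unrelated", existing)
```

Here `status_exact` corresponds to loading the existing record, `status_similar` to the warning with a table of candidates, and `status_new` to a silent insert.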
If we prefer not to perform a search, e.g. for performance reasons or because the logging is too noisy, we can switch it off.
ln.settings.upon_create_search_names = False
label = ln.ULabel(name="My project 3")
In this walkthrough, switch it back on:
ln.settings.upon_create_search_names = True
Data artifacts#
upon_file_create_if_hash_exists = "warn_return_existing"#
assert ln.settings.upon_file_create_if_hash_exists == "warn_return_existing"
filepath = ln.dev.datasets.file_fcs()
Create a File object.
file = ln.File(filepath, description="My fcs file")
file.save()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/IWCgW1wljGH3TiNiaQIa.fcs')
✅ storing file 'IWCgW1wljGH3TiNiaQIa' at '.lamindb/IWCgW1wljGH3TiNiaQIa.fcs'
assert file.hash == "KCEXRahJ-Ui9Y6nksQ8z1A"
Create a File object from the same path:
file2 = ln.File(filepath)
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
❗ returning existing file with same hash: File(id='IWCgW1wljGH3TiNiaQIa', suffix='.fcs', description='My fcs file', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', updated_at=2023-09-26 15:20:48, storage_id='0ZcAs8fO', created_by_id='DzTjkKse')
It gives us the existing object:
assert file.id == file2.id
If you save it again, nothing will happen (the operation is idempotent):
file2.save()
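The deduplication above relies on a content hash. The logged value has 22 characters, includes a "-", and is reported with hash_type='md5', which is consistent with a URL-safe base64 encoding of the 16-byte MD5 digest with padding removed. That encoding is an assumption inferred from the output, not something this page confirms, and `md5_b64` is a hypothetical helper name.

```python
import base64
import hashlib


def md5_b64(content: bytes) -> str:
    # 16-byte MD5 digest -> URL-safe base64 -> drop the "==" padding.
    digest = hashlib.md5(content).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


content_hash = md5_b64(b"example content")
same_content_hash = md5_b64(b"example content")
other_content_hash = md5_b64(b"different content")
```

Identical bytes always map to the same string, so a lookup by hash is enough to detect that a file with the same content was registered before.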
upon_file_create_if_hash_exists = "error"#
ln.settings.upon_file_create_if_hash_exists = "error"
In this case, you’ll not be able to create an object from the same content:
with pytest.raises(RuntimeError):
    file3 = ln.File(filepath, description="My new fcs file")
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
upon_file_create_if_hash_exists = "warn_create_new"#
ln.settings.upon_file_create_if_hash_exists = "warn_create_new"
In this case, you’ll create a new object:
file4 = ln.File(filepath, description="My new fcs file")
file4.save()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
❗ creating new File object despite existing file with same hash: File(id='IWCgW1wljGH3TiNiaQIa', suffix='.fcs', description='My fcs file', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', updated_at=2023-09-26 15:20:48, storage_id='0ZcAs8fO', created_by_id='DzTjkKse')
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/dF9OI3drWNxJWiaqsNXg.fcs')
✅ storing file 'dF9OI3drWNxJWiaqsNXg' at '.lamindb/dF9OI3drWNxJWiaqsNXg.fcs'
You can verify that it’s a new entry by comparing the ids:
assert file4.id != file.id
file4.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
IWCgW1wljGH3TiNiaQIa | 0ZcAs8fO | None | .fcs | None | My fcs file | None | 6785467 | KCEXRahJ-Ui9Y6nksQ8z1A | md5 | None | None | None | 2023-09-26 15:20:48 | DzTjkKse
dF9OI3drWNxJWiaqsNXg | 0ZcAs8fO | None | .fcs | None | My new fcs file | None | 6785467 | KCEXRahJ-Ui9Y6nksQ8z1A | md5 | None | None | None | 2023-09-26 15:20:48 | DzTjkKse
assert len(file.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").list()) == 2
!lamin delete --force test-idempotency
!rm -r test-idempotency
💡 deleting instance testuser1/test-idempotency
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-idempotency.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/faq/test-idempotency