Query & search registries#
Find & access data using registries.
Setup#
!lamin init --storage ./mydata
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:17)
✅ saved: Storage(id='SU6Z058y', root='/home/runner/work/lamindb/lamindb/docs/mydata', type='local', updated_at=2023-09-26 15:21:17, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/mydata
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
💡 loaded instance: testuser1/mydata (lamindb 0.54.2)
ln.settings.verbosity = "info"
ln.track()
💡 notebook imports: Django==4.2.5 lamindb==0.54.2
✅ saved: Transform(id='vldHzF3aTAiWz8', name='Query & search registries', short_name='meta', version='0', type=notebook, updated_at=2023-09-26 15:21:20, created_by_id='DzTjkKse')
✅ saved: Run(id='ngmendOMpRV0D6sS9bl4', run_at=2023-09-26 15:21:20, transform_id='vldHzF3aTAiWz8', created_by_id='DzTjkKse')
We’ll need some toy data:
ln.File(ln.dev.datasets.file_jpg_paradisi05(), description="My image").save()
ln.File(ln.dev.datasets.df_iris(), description="The iris dataset").save()
ln.File(ln.dev.datasets.file_fastq(), description="My fastq").save()
✅ storing file 've72bnAb7WsPz6K9iYbb' at '.lamindb/ve72bnAb7WsPz6K9iYbb.jpg'
✅ storing file '25FaCvRzqYkXeF1cpuI9' at '.lamindb/25FaCvRzqYkXeF1cpuI9.parquet'
❗ file has more than one suffix (path.suffixes), inferring: '.fastq.gz'
✅ storing file '0cX4xW9DqinirLT2m7Qw' at '.lamindb/0cX4xW9DqinirLT2m7Qw.fastq.gz'
Look up metadata#
For registries that store no more than 100k records, a lookup object is a convenient way to select a record.
Consider the User registry:
users = ln.User.lookup(field="handle")
With auto-complete, we find a user:
user = users.testuser1
user
User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:17)
Note
You can also auto-complete in a dictionary:
users_dict = ln.User.lookup().dict()
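To make the idea of a lookup object concrete, here is a minimal sketch in plain Python. It is not LaminDB's implementation: `build_lookup` is a hypothetical helper, plain dictionaries stand in for registry records, and the second user is made up for illustration.

```python
from collections import namedtuple


def build_lookup(records, field):
    """Return a namedtuple whose attribute names are the values of `field`.

    Hypothetical helper for illustration only; not LaminDB's implementation.
    """
    keys = [record[field] for record in records]
    Lookup = namedtuple("Lookup", keys)
    return Lookup(*records)


# Plain dicts stand in for registry records; the second user is made up.
records = [
    {"id": "DzTjkKse", "handle": "testuser1"},
    {"id": "xxxxxxxx", "handle": "testuser2"},
]

lookup = build_lookup(records, field="handle")
print(lookup.testuser1["id"])  # attribute access, which editors can auto-complete

lookup_dict = lookup._asdict()  # the same records, accessed by dictionary key
print(lookup_dict["testuser2"]["id"])
```

Because every record becomes an attribute, this approach only stays practical for registries of moderate size, which matches the 100k guideline above.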
Filter by metadata#
Filter for all files created by a user:
ln.File.filter(created_by=user).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ve72bnAb7WsPz6K9iYbb | SU6Z058y | None | .jpg | None | My image | None | 29358 | r4tnqmKI_SjrkdLzpuWp4g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
25FaCvRzqYkXeF1cpuI9 | SU6Z058y | None | .parquet | DataFrame | The iris dataset | None | 5629 | 1OIjU7wD_Wdeiqjtlyf42g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
0cX4xW9DqinirLT2m7Qw | SU6Z058y | None | .fastq.gz | None | My fastq | None | 16 | QDkCIyDtWe8tlrS9zG8gnw | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
To access the query results encoded in a select statement (an extended Django QuerySet object), execute it with one of:
.df(): A pandas DataFrame with each record stored as a row.
.all(): An indexable Django QuerySet.
.list(): A list of records.
.one(): Exactly one record. Will raise an error if there is none.
.one_or_none(): Either one record or None if there is no query result.
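The difference between `.one()` and `.one_or_none()` can be sketched with two plain Python functions operating on a list. This is an illustration of the semantics only, not LaminDB code. Raising when several records match is an assumption carried over from SQLAlchemy, where both methods reject multiple results.

```python
def one(records):
    """Return the single record; raise if there are zero or several."""
    if len(records) != 1:
        raise ValueError(f"expected exactly one record, got {len(records)}")
    return records[0]


def one_or_none(records):
    """Return the single record, or None if there is none; raise if several."""
    if not records:
        return None
    return one(records)


matches = ["25FaCvRzqYkXeF1cpuI9"]  # stand-in for a query with a single hit
print(one(matches))
print(one_or_none([]))  # an empty result yields None instead of an error
```

Use `.one()` when a missing record indicates a bug, and `.one_or_none()` when absence is an expected outcome that the calling code handles.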
Note
The ORMs in LaminDB are Django Models and any Django query works. LaminDB extends Django’s API for data scientists.
Under the hood, any filter() call translates into a SQL select statement.
In SQLAlchemy's and SQLModel's queries, this is more evident because they revolve around select statements, which are analogous to the QuerySet returned by filter().
.one() and .one_or_none() are two parts of LaminDB's API that are borrowed from SQLAlchemy.
Search for metadata#
ln.File.search("iris")
key | description | __ratio__ | |
---|---|---|---|
id | |||
25FaCvRzqYkXeF1cpuI9 | The iris dataset | 90.000000 | |
ve72bnAb7WsPz6K9iYbb | My image | 34.200000 | |
0cX4xW9DqinirLT2m7Qw | My fastq | 25.714286 |
ln.File.search("iris", return_queryset=True).first()
File(id='25FaCvRzqYkXeF1cpuI9', suffix='.parquet', accessor='DataFrame', description='The iris dataset', size=5629, hash='1OIjU7wD_Wdeiqjtlyf42g', hash_type='md5', updated_at=2023-09-26 15:21:20, storage_id='SU6Z058y', transform_id='vldHzF3aTAiWz8', run_id='ngmendOMpRV0D6sS9bl4', created_by_id='DzTjkKse')
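The `__ratio__` column is a similarity score between the search string and each record. How LaminDB computes it is not shown here, so the following sketch uses the standard library's `difflib.SequenceMatcher` purely as a stand-in. The scores it prints differ from the ones above, and `rank` is a hypothetical helper, but the principle of scoring every record and sorting by score is the same.

```python
from difflib import SequenceMatcher


def rank(query, descriptions):
    """Score each description against the query and sort best-first.

    Stand-in scoring via difflib; not the scorer LaminDB uses.
    """
    scored = [
        (description, SequenceMatcher(None, query.lower(), description.lower()).ratio())
        for description in descriptions
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank("iris", ["My image", "The iris dataset", "My fastq"])
for description, score in ranked:
    print(f"{score:.2f}  {description}")
```

Because every record receives a score, a fuzzy search returns all records ranked rather than a filtered subset, which is why unrelated files also appear in the table above with low ratios.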
Let us create 500 notebook objects with fake titles and save them:
ln.save(
[
ln.Transform(name=title, type="notebook")
for title in ln.dev.datasets.fake_bio_notebook_titles(n=500)
]
)
We can now search for any combination of terms:
ln.Transform.search("intestine").head()
name | id | __ratio__ |
---|---|---|
Bulbourethral Glands IgG4 IgG4 IgA IgG4 Smooth muscle cell intestine. | 6qTIYEecmDmjh2 | 90.0 |
Bulbourethral Glands IgY efficiency Earlobe intestine. | huaYbcWC7nOHSD | 90.0 |
Cajal–Retzius Cells Earlobe intestine IgG2 Cajal–Retzius cells IgG4. | eC2b4eEenlq6Uv | 90.0 |
Cajal–Retzius Cells visualize intestine visualize IgG1 IgG3. | AFp1S2o9yJN056 | 90.0 |
Classify IgG4 intestine IgA. | Mn2Dxay7dIjK4L | 90.0 |
Leverage relations#
Django has a double-underscore syntax to filter based on related tables.
This syntax enables you to traverse several layers of relations:
ln.File.filter(run__created_by__handle__startswith="testuse").df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ve72bnAb7WsPz6K9iYbb | SU6Z058y | None | .jpg | None | My image | None | 29358 | r4tnqmKI_SjrkdLzpuWp4g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
25FaCvRzqYkXeF1cpuI9 | SU6Z058y | None | .parquet | DataFrame | The iris dataset | None | 5629 | 1OIjU7wD_Wdeiqjtlyf42g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
0cX4xW9DqinirLT2m7Qw | SU6Z058y | None | .fastq.gz | None | My fastq | None | 16 | QDkCIyDtWe8tlrS9zG8gnw | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
The filter selects all files based on the users who ran the generating notebook.
(Under the hood, in the SQL database, it’s joining the file table with the run and the user table.)
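To illustrate the join described above, here is a sketch with the standard library's `sqlite3` module. The table names (`toy_file`, `toy_run`, `toy_user`), their columns, and the second user are assumptions made for this example and do not reflect LaminDB's actual schema. The SQL is handwritten, not the statement Django generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE toy_user (id TEXT PRIMARY KEY, handle TEXT);
    CREATE TABLE toy_run (id TEXT PRIMARY KEY, created_by_id TEXT REFERENCES toy_user(id));
    CREATE TABLE toy_file (id TEXT PRIMARY KEY, description TEXT, run_id TEXT REFERENCES toy_run(id));
    INSERT INTO toy_user VALUES ('DzTjkKse', 'testuser1'), ('otheruser', 'someone');
    INSERT INTO toy_run VALUES ('run1', 'DzTjkKse'), ('run2', 'otheruser');
    INSERT INTO toy_file VALUES ('f1', 'My image', 'run1'), ('f2', 'Other file', 'run2');
    """
)

# Roughly what run__created_by__handle__startswith="testuse" expresses:
# follow file -> run -> user, then compare the handle against a prefix.
query = """
    SELECT f.id, f.description
    FROM toy_file AS f
    JOIN toy_run AS r ON f.run_id = r.id
    JOIN toy_user AS u ON r.created_by_id = u.id
    WHERE u.handle LIKE ?
    ORDER BY f.id
"""
rows = con.execute(query, ("testuse%",)).fetchall()
print(rows)
```

Each double underscore in the filter expression corresponds to one hop across a foreign key, and the final segment names the comparator.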
Beyond __startswith, Django supports about two dozen field comparators of the form field__comparator=value.
Here are some of them.
and#
ln.File.filter(suffix=".jpg", created_by=user).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ve72bnAb7WsPz6K9iYbb | SU6Z058y | None | .jpg | None | My image | None | 29358 | r4tnqmKI_SjrkdLzpuWp4g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
less than/ greater than#
Or subset to files smaller than 10 kB by passing the __lt (less than) comparator as a keyword argument. Use __gt for greater than.
ln.File.filter(created_by=user, size__lt=1e4).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
25FaCvRzqYkXeF1cpuI9 | SU6Z058y | None | .parquet | DataFrame | The iris dataset | None | 5629 | 1OIjU7wD_Wdeiqjtlyf42g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
0cX4xW9DqinirLT2m7Qw | SU6Z058y | None | .fastq.gz | None | My fastq | None | 16 | QDkCIyDtWe8tlrS9zG8gnw | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
or#
from django.db.models import Q
ln.File.filter().filter(Q(suffix=".jpg") | Q(suffix=".fastq.gz")).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0cX4xW9DqinirLT2m7Qw | SU6Z058y | None | .fastq.gz | None | My fastq | None | 16 | QDkCIyDtWe8tlrS9zG8gnw | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
ve72bnAb7WsPz6K9iYbb | SU6Z058y | None | .jpg | None | My image | None | 29358 | r4tnqmKI_SjrkdLzpuWp4g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
in#
ln.File.filter(suffix__in=[".jpg", ".fastq.gz"]).df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0cX4xW9DqinirLT2m7Qw | SU6Z058y | None | .fastq.gz | None | My fastq | None | 16 | QDkCIyDtWe8tlrS9zG8gnw | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
ve72bnAb7WsPz6K9iYbb | SU6Z058y | None | .jpg | None | My image | None | 29358 | r4tnqmKI_SjrkdLzpuWp4g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
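The `or` and `in` queries above return the same two files. In SQL terms, an `OR` over equality tests on a single column is equivalent to an `IN` clause. The sketch below checks this with `sqlite3` on a hypothetical `toy_file` table. The table, its rows, and the handwritten SQL are assumptions for illustration, not what Django emits.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE toy_file (id TEXT PRIMARY KEY, suffix TEXT)")
con.executemany(
    "INSERT INTO toy_file VALUES (?, ?)",
    [("f1", ".jpg"), ("f2", ".parquet"), ("f3", ".fastq.gz")],
)

wanted = (".jpg", ".fastq.gz")

# Analogous to Q(suffix=".jpg") | Q(suffix=".fastq.gz")
rows_or = con.execute(
    "SELECT id FROM toy_file WHERE suffix = ? OR suffix = ? ORDER BY id", wanted
).fetchall()

# Analogous to suffix__in=[".jpg", ".fastq.gz"]
rows_in = con.execute(
    "SELECT id FROM toy_file WHERE suffix IN (?, ?) ORDER BY id", wanted
).fetchall()

print(rows_or == rows_in)
```

`__in` is the more compact choice when all alternatives concern the same field, while `Q` objects are needed once the alternatives span different fields.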
order by#
ln.File.filter().order_by("-updated_at").df()
id | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0cX4xW9DqinirLT2m7Qw | SU6Z058y | None | .fastq.gz | None | My fastq | None | 16 | QDkCIyDtWe8tlrS9zG8gnw | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
25FaCvRzqYkXeF1cpuI9 | SU6Z058y | None | .parquet | DataFrame | The iris dataset | None | 5629 | 1OIjU7wD_Wdeiqjtlyf42g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
ve72bnAb7WsPz6K9iYbb | SU6Z058y | None | .jpg | None | My image | None | 29358 | r4tnqmKI_SjrkdLzpuWp4g | md5 | vldHzF3aTAiWz8 | ngmendOMpRV0D6sS9bl4 | None | 2023-09-26 15:21:20 | DzTjkKse |
contains#
ln.Transform.filter(name__contains="search").df().head(10)
id | name | short_name | version | type | reference | reference_type | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|
vldHzF3aTAiWz8 | Query & search registries | meta | 0 | notebook | None | None | None | 2023-09-26 15:21:20 | DzTjkKse |
PoUFkSiCAL0zR9 | Hyalocyte IgA Neurogliaform cells research Ear... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
Lww5PkHba2whA6 | Igg3 research Earlobe IgG4. | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
q8GYobFITMxHN9 | Research Nuclear bag cell IgA research candida... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
jSNXY7eZLZKAdv | Retina efficiency visualize IgG4 IgG4 IgG4 res... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
dFxjnQMgbOoSJn | Research research Medium spiny neurons researc... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
QzzRbgOPGyOkx2 | Intestinal research Uterus IgA Retina. | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
zQ8AiZjfdmWU58 | Research IgG4 neurotensin-secreting N cell Smo... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
odfO9tIGW5qa2E | Hyalocyte Nuclear bag cell Medium spiny neuron... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
78P5g72Q4gqXRd | Hyalocyte Cajal–Retzius cells IgA research IgA. | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
And case-insensitive:
ln.Transform.filter(name__icontains="Search").df().head(10)
id | name | short_name | version | type | reference | reference_type | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|
vldHzF3aTAiWz8 | Query & search registries | meta | 0 | notebook | None | None | None | 2023-09-26 15:21:20 | DzTjkKse |
PoUFkSiCAL0zR9 | Hyalocyte IgA Neurogliaform cells research Ear... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
Lww5PkHba2whA6 | Igg3 research Earlobe IgG4. | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
q8GYobFITMxHN9 | Research Nuclear bag cell IgA research candida... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
jSNXY7eZLZKAdv | Retina efficiency visualize IgG4 IgG4 IgG4 res... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
dFxjnQMgbOoSJn | Research research Medium spiny neurons researc... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
QzzRbgOPGyOkx2 | Intestinal research Uterus IgA Retina. | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
zQ8AiZjfdmWU58 | Research IgG4 neurotensin-secreting N cell Smo... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
odfO9tIGW5qa2E | Hyalocyte Nuclear bag cell Medium spiny neuron... | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
78P5g72Q4gqXRd | Hyalocyte Cajal–Retzius cells IgA research IgA. | None | None | notebook | None | None | None | 2023-09-26 15:21:24 | DzTjkKse |
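Both queries return the same ten rows in this run. One caveat worth knowing: this instance appears to be backed by SQLite, and Django's documentation notes that on SQLite `contains` behaves like `icontains`, because SQLite's `LIKE` ignores ASCII case by default. The sketch below demonstrates that default with `sqlite3`. The sample title is taken from the table above, and the handwritten SQL is an illustration, not the statement Django generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
title = "Igg3 research Earlobe IgG4."

# SQLite's LIKE is case-insensitive for ASCII characters by default,
# so both patterns match and each comparison evaluates to 1.
lower_match, upper_match = con.execute(
    "SELECT ? LIKE '%search%', ? LIKE '%SEARCH%'", (title, title)
).fetchone()
print(lower_match, upper_match)

# For contrast, Python's substring test is case-sensitive.
print("SEARCH" in title)
```

On a database with case-sensitive pattern matching, the two filters can return different rows, so choose `__icontains` explicitly whenever case should be ignored.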
startswith#
ln.Transform.filter(name__startswith="Query").df()
id | name | short_name | version | type | reference | reference_type | initial_version_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|
vldHzF3aTAiWz8 | Query & search registries | meta | 0 | notebook | None | None | None | 2023-09-26 15:21:20 | DzTjkKse |
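As a recap of the `field__comparator=value` pattern, here is a small sketch that evaluates such expressions against plain dictionaries. It is not how Django works internally, since Django compiles the expressions to SQL. `matches` and `COMPARATORS` are hypothetical names, only a handful of comparators are covered, and relation traversal across several double underscores is not handled.

```python
import operator

# Illustrative subset of comparators, mapped to plain Python predicates.
COMPARATORS = {
    "exact": operator.eq,
    "lt": operator.lt,
    "gt": operator.gt,
    "in": lambda value, options: value in options,
    "contains": lambda value, part: part in value,
    "startswith": lambda value, prefix: value.startswith(prefix),
}


def matches(record, **expressions):
    """Return True if the record satisfies every field__comparator=value pair."""
    for key, expected in expressions.items():
        field, _, comparator = key.partition("__")
        predicate = COMPARATORS[comparator or "exact"]  # bare field means equality
        if not predicate(record[field], expected):
            return False
    return True


# The three toy files from this guide, reduced to a few fields.
files = [
    {"description": "My image", "suffix": ".jpg", "size": 29358},
    {"description": "The iris dataset", "suffix": ".parquet", "size": 5629},
    {"description": "My fastq", "suffix": ".fastq.gz", "size": 16},
]

small = [f["description"] for f in files if matches(f, size__lt=1e4)]
picked = [f["description"] for f in files if matches(f, suffix__in=[".jpg", ".fastq.gz"])]
print(small)
print(picked)
```

Passing several keyword arguments combines them with a logical and, mirroring the `and` example earlier in this guide.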
!lamin delete --force mydata
💡 deleting instance testuser1/mydata
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--mydata.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/mydata