Query & search registries#

Find & access data using registries.

Setup#

!lamin init --storage ./mydata
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:17)
✅ saved: Storage(id='SU6Z058y', root='/home/runner/work/lamindb/lamindb/docs/mydata', type='local', updated_at=2023-09-26 15:21:17, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/mydata
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
💡 loaded instance: testuser1/mydata (lamindb 0.54.2)
ln.settings.verbosity = "info"
ln.track()
💡 notebook imports: Django==4.2.5 lamindb==0.54.2
✅ saved: Transform(id='vldHzF3aTAiWz8', name='Query & search registries', short_name='meta', version='0', type=notebook, updated_at=2023-09-26 15:21:20, created_by_id='DzTjkKse')
✅ saved: Run(id='ngmendOMpRV0D6sS9bl4', run_at=2023-09-26 15:21:20, transform_id='vldHzF3aTAiWz8', created_by_id='DzTjkKse')

We’ll need some toy data:

ln.File(ln.dev.datasets.file_jpg_paradisi05(), description="My image").save()
ln.File(ln.dev.datasets.df_iris(), description="The iris dataset").save()
ln.File(ln.dev.datasets.file_fastq(), description="My fastq").save()
✅ storing file 've72bnAb7WsPz6K9iYbb' at '.lamindb/ve72bnAb7WsPz6K9iYbb.jpg'
✅ storing file '25FaCvRzqYkXeF1cpuI9' at '.lamindb/25FaCvRzqYkXeF1cpuI9.parquet'
❗ file has more than one suffix (path.suffixes), inferring: '.fastq.gz'
✅ storing file '0cX4xW9DqinirLT2m7Qw' at '.lamindb/0cX4xW9DqinirLT2m7Qw.fastq.gz'

Look up metadata#

For registries that hold fewer than about 100k records, a lookup object is a convenient way to select a record.

Consider the User registry:

users = ln.User.lookup(field="handle")

With auto-complete, we find a user:

user = users.testuser1
user
User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:17)

Note

You can also auto-complete in a dictionary:

users_dict = ln.User.lookup().dict()
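
To make the idea of a lookup object concrete, here is a minimal sketch in plain Python. It is not LaminDB's implementation: the `make_lookup` helper and the record dictionaries are hypothetical stand-ins that only illustrate how field values can become auto-completable attributes or dictionary keys.

```python
from types import SimpleNamespace

# Hypothetical records standing in for rows of a registry.
records = [
    {"handle": "testuser1", "name": "Test User1"},
    {"handle": "testuser2", "name": "Test User2"},
]


def make_lookup(rows, field):
    """Build an object whose attribute names are the values of `field`."""
    return SimpleNamespace(**{row[field]: row for row in rows})


users = make_lookup(records, field="handle")
# Attribute access is what makes tab completion possible in IPython.
print(users.testuser1["name"])

# The dictionary variant: keys are the field values themselves.
users_dict = vars(users)
print(users_dict["testuser2"]["name"])
```

Attribute access only works for values that are valid Python identifiers, which is one reason a dictionary variant is handy.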

Filter by metadata#

Filter for all files created by a user:

ln.File.filter(created_by=user).df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
ve72bnAb7WsPz6K9iYbb SU6Z058y None .jpg None My image None 29358 r4tnqmKI_SjrkdLzpuWp4g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
25FaCvRzqYkXeF1cpuI9 SU6Z058y None .parquet DataFrame The iris dataset None 5629 1OIjU7wD_Wdeiqjtlyf42g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
0cX4xW9DqinirLT2m7Qw SU6Z058y None .fastq.gz None My fastq None 16 QDkCIyDtWe8tlrS9zG8gnw md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse

A filter() call returns a query (an extended Django QuerySet object). To access its results, execute it with one of:

  • .df(): A pandas DataFrame with each record stored as a row.

  • .all(): An indexable django QuerySet.

  • .list(): A list of records.

  • .one(): Exactly one record. Will raise an error if there is none.

  • .one_or_none(): Either one record or None if there is no query result.
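
The difference between .one() and .one_or_none() can be sketched in plain Python. This is an illustration of the semantics, not LaminDB's code: the functions below operate on ordinary lists, `ValueError` is a placeholder for whatever exception the library raises, and treating more than one match as an error is an assumption taken from SQLAlchemy's methods of the same name.

```python
def one_or_none(results):
    """Return the only item, or None if there is no item."""
    items = list(results)
    if len(items) > 1:
        # Assumption: multiple matches are an error, as in SQLAlchemy.
        raise ValueError(f"expected at most one record, got {len(items)}")
    return items[0] if items else None


def one(results):
    """Return the only item; raise if there is none."""
    item = one_or_none(results)
    if item is None:
        raise ValueError("expected exactly one record, got none")
    return item


print(one(["The iris dataset"]))  # the single match
print(one_or_none([]))            # no match, so None
```

Use .one() when a missing record should stop the analysis, and .one_or_none() when absence is an expected outcome that the code handles.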

Note

The ORMs in LaminDB are Django Models and any Django query works. LaminDB extends Django’s API for data scientists.

Under the hood, any filter() call translates into a SQL select statement.

In SQLAlchemy and SQLModel, this is more evident because queries revolve around explicit select statements, which are analogous to the QuerySet returned by filter(). LaminDB borrows .one() and .one_or_none() from SQLAlchemy.
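
To illustrate what "translates into a SQL select statement" means, the following sketch runs a comparable query directly with Python's built-in sqlite3 module. The table name `file`, its columns, and the ids are simplified assumptions for illustration; the SQL that Django actually generates for LaminDB's schema is more verbose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified stand-in for the file registry.
conn.execute(
    "CREATE TABLE file (id TEXT PRIMARY KEY, suffix TEXT, created_by_id TEXT)"
)
conn.executemany(
    "INSERT INTO file VALUES (?, ?, ?)",
    [("f1", ".jpg", "u1"), ("f2", ".parquet", "u1"), ("f3", ".fastq.gz", "u2")],
)

# Roughly what a filter on created_by asks the database to do:
rows = conn.execute(
    "SELECT id, suffix FROM file WHERE created_by_id = ? ORDER BY id",
    ("u1",),
).fetchall()
print(rows)
```

Each keyword argument passed to filter() becomes one condition in the WHERE clause.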

Search for metadata#

ln.File.search("iris")
key description __ratio__
id
25FaCvRzqYkXeF1cpuI9 The iris dataset 90.000000
ve72bnAb7WsPz6K9iYbb My image 34.200000
0cX4xW9DqinirLT2m7Qw My fastq 25.714286
ln.File.search("iris", return_queryset=True).first()
File(id='25FaCvRzqYkXeF1cpuI9', suffix='.parquet', accessor='DataFrame', description='The iris dataset', size=5629, hash='1OIjU7wD_Wdeiqjtlyf42g', hash_type='md5', updated_at=2023-09-26 15:21:20, storage_id='SU6Z058y', transform_id='vldHzF3aTAiWz8', run_id='ngmendOMpRV0D6sS9bl4', created_by_id='DzTjkKse')

Let us create 500 notebook objects with fake titles and save them:

ln.save(
    [
        ln.Transform(name=title, type="notebook")
        for title in ln.dev.datasets.fake_bio_notebook_titles(n=500)
    ]
)

We can now search for any combination of terms:

ln.Transform.search("intestine").head()
id __ratio__
name
Bulbourethral Glands IgG4 IgG4 IgA IgG4 Smooth muscle cell intestine. 6qTIYEecmDmjh2 90.0
Bulbourethral Glands IgY efficiency Earlobe intestine. huaYbcWC7nOHSD 90.0
Cajal–Retzius Cells Earlobe intestine IgG2 Cajal–Retzius cells IgG4. eC2b4eEenlq6Uv 90.0
Cajal–Retzius Cells visualize intestine visualize IgG1 IgG3. AFp1S2o9yJN056 90.0
Classify IgG4 intestine IgA. Mn2Dxay7dIjK4L 90.0
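
The __ratio__ column is a similarity score between the search string and each record. As a rough sketch of how such a ranking can be computed, the example below uses difflib from the standard library. This is a stand-in, not the scorer LaminDB uses, so the numbers will not match the output above; the `search` helper is hypothetical.

```python
from difflib import SequenceMatcher


def search(query, descriptions):
    """Rank descriptions by a 0-100 similarity score against the query."""
    scored = [
        (SequenceMatcher(None, query.lower(), text.lower()).ratio() * 100, text)
        for text in descriptions
    ]
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


results = search("iris", ["My image", "The iris dataset", "My fastq"])
for score, text in results:
    print(f"{score:5.1f}  {text}")
```

Because every record gets a score, a fuzzy search returns a ranked list rather than a strict match or no match, which is why unrelated descriptions still appear with low ratios.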

Leverage relations#

Django has a double-underscore syntax to filter based on related tables.

This syntax enables you to traverse several layers of relations:

ln.File.filter(run__created_by__handle__startswith="testuse").df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
ve72bnAb7WsPz6K9iYbb SU6Z058y None .jpg None My image None 29358 r4tnqmKI_SjrkdLzpuWp4g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
25FaCvRzqYkXeF1cpuI9 SU6Z058y None .parquet DataFrame The iris dataset None 5629 1OIjU7wD_Wdeiqjtlyf42g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
0cX4xW9DqinirLT2m7Qw SU6Z058y None .fastq.gz None My fastq None 16 QDkCIyDtWe8tlrS9zG8gnw md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse

This filter selects all files whose generating run was created by a user whose handle starts with "testuse".

(Under the hood, in the SQL database, it’s joining the file table with the run and the user table.)
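
The join can be sketched with sqlite3 from the standard library. The table names `app_file`, `app_run`, and `app_user`, their columns, and the rows are hypothetical and much simpler than LaminDB's real schema; the point is only to show how each double-underscore step corresponds to one JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE app_user (id TEXT PRIMARY KEY, handle TEXT);
    CREATE TABLE app_run (id TEXT PRIMARY KEY, created_by_id TEXT);
    CREATE TABLE app_file (id TEXT PRIMARY KEY, description TEXT, run_id TEXT);
    INSERT INTO app_user VALUES ('u1', 'testuser1'), ('u2', 'someone_else');
    INSERT INTO app_run VALUES ('r1', 'u1'), ('r2', 'u2');
    INSERT INTO app_file VALUES
        ('f1', 'My image', 'r1'),
        ('f2', 'The iris dataset', 'r1'),
        ('f3', 'Other file', 'r2');
    """
)

# run__created_by__handle__startswith="testuse", written out by hand:
rows = conn.execute(
    """
    SELECT f.description
    FROM app_file AS f
    JOIN app_run AS r ON f.run_id = r.id           -- run__
    JOIN app_user AS u ON r.created_by_id = u.id   -- created_by__
    WHERE u.handle LIKE 'testuse%'                 -- handle__startswith
    ORDER BY f.id
    """
).fetchall()
print(rows)
```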

Beyond __startswith, Django supports about two dozen field comparators of the form field__comparator=value.

Here are some of them.

and#

ln.File.filter(suffix=".jpg", created_by=user).df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
ve72bnAb7WsPz6K9iYbb SU6Z058y None .jpg None My image None 29358 r4tnqmKI_SjrkdLzpuWp4g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse

less than / greater than#

Or subset to files smaller than 10 kB, using the __lt ("less than") comparator as a keyword argument; __gt works the same way for "greater than".

ln.File.filter(created_by=user, size__lt=1e4).df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
25FaCvRzqYkXeF1cpuI9 SU6Z058y None .parquet DataFrame The iris dataset None 5629 1OIjU7wD_Wdeiqjtlyf42g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
0cX4xW9DqinirLT2m7Qw SU6Z058y None .fastq.gz None My fastq None 16 QDkCIyDtWe8tlrS9zG8gnw md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse

or#

from django.db.models import Q

ln.File.filter().filter(Q(suffix=".jpg") | Q(suffix=".fastq.gz")).df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
0cX4xW9DqinirLT2m7Qw SU6Z058y None .fastq.gz None My fastq None 16 QDkCIyDtWe8tlrS9zG8gnw md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
ve72bnAb7WsPz6K9iYbb SU6Z058y None .jpg None My image None 29358 r4tnqmKI_SjrkdLzpuWp4g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse

in#

ln.File.filter(suffix__in=[".jpg", ".fastq.gz"]).df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
0cX4xW9DqinirLT2m7Qw SU6Z058y None .fastq.gz None My fastq None 16 QDkCIyDtWe8tlrS9zG8gnw md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
ve72bnAb7WsPz6K9iYbb SU6Z058y None .jpg None My image None 29358 r4tnqmKI_SjrkdLzpuWp4g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
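
The or and in queries return the same records because they express the same condition. A small sqlite3 sketch makes the equivalence explicit; the `file` table and its rows are hypothetical stand-ins for the registry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (id TEXT PRIMARY KEY, suffix TEXT)")
conn.executemany(
    "INSERT INTO file VALUES (?, ?)",
    [("f1", ".jpg"), ("f2", ".parquet"), ("f3", ".fastq.gz")],
)
wanted = (".jpg", ".fastq.gz")

# Q(suffix=".jpg") | Q(suffix=".fastq.gz") corresponds to an OR condition.
with_or = conn.execute(
    "SELECT id FROM file WHERE suffix = ? OR suffix = ? ORDER BY id", wanted
).fetchall()

# suffix__in=[...] corresponds to an IN condition.
with_in = conn.execute(
    "SELECT id FROM file WHERE suffix IN (?, ?) ORDER BY id", wanted
).fetchall()

print(with_or == with_in)
```

Prefer the in form when all alternatives test the same field; Q objects are needed once the alternatives involve different fields.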

order by#

ln.File.filter().order_by("-updated_at").df()
storage_id key suffix accessor description version size hash hash_type transform_id run_id initial_version_id updated_at created_by_id
id
0cX4xW9DqinirLT2m7Qw SU6Z058y None .fastq.gz None My fastq None 16 QDkCIyDtWe8tlrS9zG8gnw md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
25FaCvRzqYkXeF1cpuI9 SU6Z058y None .parquet DataFrame The iris dataset None 5629 1OIjU7wD_Wdeiqjtlyf42g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse
ve72bnAb7WsPz6K9iYbb SU6Z058y None .jpg None My image None 29358 r4tnqmKI_SjrkdLzpuWp4g md5 vldHzF3aTAiWz8 ngmendOMpRV0D6sS9bl4 None 2023-09-26 15:21:20 DzTjkKse

contains#

ln.Transform.filter(name__contains="search").df().head(10)
name short_name version type reference reference_type initial_version_id updated_at created_by_id
id
vldHzF3aTAiWz8 Query & search registries meta 0 notebook None None None 2023-09-26 15:21:20 DzTjkKse
PoUFkSiCAL0zR9 Hyalocyte IgA Neurogliaform cells research Ear... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
Lww5PkHba2whA6 Igg3 research Earlobe IgG4. None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
q8GYobFITMxHN9 Research Nuclear bag cell IgA research candida... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
jSNXY7eZLZKAdv Retina efficiency visualize IgG4 IgG4 IgG4 res... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
dFxjnQMgbOoSJn Research research Medium spiny neurons researc... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
QzzRbgOPGyOkx2 Intestinal research Uterus IgA Retina. None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
zQ8AiZjfdmWU58 Research IgG4 neurotensin-secreting N cell Smo... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
odfO9tIGW5qa2E Hyalocyte Nuclear bag cell Medium spiny neuron... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
78P5g72Q4gqXRd Hyalocyte Cajal–Retzius cells IgA research IgA. None None notebook None None None 2023-09-26 15:21:24 DzTjkKse

And case-insensitive:

ln.Transform.filter(name__icontains="Search").df().head(10)
name short_name version type reference reference_type initial_version_id updated_at created_by_id
id
vldHzF3aTAiWz8 Query & search registries meta 0 notebook None None None 2023-09-26 15:21:20 DzTjkKse
PoUFkSiCAL0zR9 Hyalocyte IgA Neurogliaform cells research Ear... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
Lww5PkHba2whA6 Igg3 research Earlobe IgG4. None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
q8GYobFITMxHN9 Research Nuclear bag cell IgA research candida... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
jSNXY7eZLZKAdv Retina efficiency visualize IgG4 IgG4 IgG4 res... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
dFxjnQMgbOoSJn Research research Medium spiny neurons researc... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
QzzRbgOPGyOkx2 Intestinal research Uterus IgA Retina. None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
zQ8AiZjfdmWU58 Research IgG4 neurotensin-secreting N cell Smo... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
odfO9tIGW5qa2E Hyalocyte Nuclear bag cell Medium spiny neuron... None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
78P5g72Q4gqXRd Hyalocyte Cajal–Retzius cells IgA research IgA. None None notebook None None None 2023-09-26 15:21:24 DzTjkKse
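
One caveat, taken from Django's documentation rather than from this guide: on a SQLite backend such as the one used in this notebook, contains behaves like icontains, because SQLite's LIKE operator is case-insensitive for ASCII characters by default. The snippet below demonstrates the underlying SQLite behavior with the standard library; GLOB is shown only as a case-sensitive contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# LIKE ignores ASCII case by default in SQLite.
like_match = conn.execute("SELECT 'Research' LIKE '%SEARCH%'").fetchone()[0]

# GLOB is case-sensitive, so the same pattern does not match.
glob_match = conn.execute("SELECT 'Research' GLOB '*SEARCH*'").fetchone()[0]

print(like_match, glob_match)
```

On backends with case-sensitive pattern matching, such as PostgreSQL, the two comparators can return different results.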

startswith#

ln.Transform.filter(name__startswith="Query").df()
name short_name version type reference reference_type initial_version_id updated_at created_by_id
id
vldHzF3aTAiWz8 Query & search registries meta 0 notebook None None None 2023-09-26 15:21:20 DzTjkKse
!lamin delete --force mydata
💡 deleting instance testuser1/mydata
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--mydata.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/mydata