
Query files & datasets#

We saw how LaminDB allows you to query & search across files & datasets using registries: Query & search registries.

This guide addresses “queries within datasets”, for instance:

ulabels = ln.ULabel.lookup()
df = ln.File.filter(ulabels=ulabels.setosa).first().load()  # access a batch of the iris dataset ingested in the tutorial
df_setosa = df.loc[df.iris_species_name == ulabels.setosa.name]  # subset the iris dataset to observations of species "setosa"

Because the file was validated, subsetting the DataFrame is guaranteed to succeed, sparing you the headache of re-curating features & labels.

Such within-dataset queries are also possible for cloud-backed datasets using DuckDB, TileDB, zarr, HDF5, parquet and other storage backends.
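
For example, here is a minimal sketch of such a query against a parquet file with DuckDB; the bucket path, file name, and column below are hypothetical placeholders, not data registered in this instance:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension for reading s3:// paths
con.execute("LOAD httpfs")
# the WHERE predicate is pushed down, so only matching row groups are fetched
df = con.execute(
    "SELECT * FROM read_parquet('s3://my-bucket/iris.parquet') WHERE species = 'setosa'"
).df()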

In this notebook, we show how to subset an AnnData object as well as generic HDF5 and zarr datasets accessed in the cloud.

Setup#

!lamin init --storage s3://lamindb-ci --name test-data
2023-09-26 15:20:51,880:INFO - Found credentials in environment variables.
❗ Instance metadata exists, but DB might have been corrupted or deleted. Re-initializing the DB.
❗ storage exists already
✅ registered instance on hub: https://lamin.ai/testuser1/test-data
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:20:59)
✅ saved: Storage(id='5VXiHCyD', root='s3://lamindb-ci', type='s3', region='us-west-1', updated_at=2023-09-26 15:20:59, created_by_id='DzTjkKse')
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-data.lndb' of instance 'testuser1/test-data'
💡 loaded instance: testuser1/test-data
❗ locked instance (to unlock and push changes to the cloud SQLite file, call: lamin close)

import lamindb as ln
2023-09-26 15:21:02,580:INFO - Found credentials in environment variables.
💡 loaded instance: testuser1/test-data (lamindb 0.54.2)
ln.settings.verbosity = "info"

We’ll need some test data:

ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.File("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()

AnnData#

An h5ad file stored on s3:

file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
file.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
adata = file.backed()

This object is an AnnDataAccessor object, an AnnData object backed in the cloud:

adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
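
Because this is a regular lazy h5py dataset, standard slicing applies and fetches only the requested region from the cloud:

adata.X[:5]  # reads just the first 5 rows, not the full matrix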

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']

Subsets load arrays into memory upon direct access:

adata_subset.X
array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       ...,
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      dtype=float32)

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
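
If you want to persist such a subset, a minimal sketch is to save the in-memory object as a new file; this assumes ln.File accepts an in-memory AnnData (as it does a DataFrame in the tutorial), and the key below is a hypothetical example:

adata_new = adata_subset.to_memory()
ln.File(adata_new, key="lndb-storage/pbmc68k_subset.h5ad").save()  # hypothetical key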

Generic HDF5#

Let us query a generic HDF5 file:

file = ln.File.filter(key="lndb-storage/testfile.hdf5").one()

And get a backed accessor:

backed = file.backed()

The returned object contains the connection object in .connection and the h5py.File or zarr.Group in .storage:

backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5" (mode r)>
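
From here on, the regular h5py API applies, for example to list contents or lazily read a slice of a dataset; the dataset name below is a placeholder for whatever testfile.hdf5 actually contains:

list(backed.storage.keys())  # top-level groups & datasets
# backed.storage["my_dataset"][:10]  # would lazily read the first 10 elements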
!lamin delete --force test-data
💡 deleting instance testuser1/test-data
2023-09-26 15:21:06,873:INFO - Found credentials in environment variables.
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-data.env
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-data.lndb' of instance 'testuser1/test-data'
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your remote instance on lamin.ai
❗     consider manually deleting your stored data: s3://lamindb-ci/