Query files & datasets#
We saw how LaminDB lets you query & search across files & datasets using registries: Query & search registries.
This guide addresses “queries within datasets”, for instance:
ulabels = ln.ULabel.lookup()
df = ln.File.filter(ulabels=ulabels.setosa).first().load()  # access a batch of the iris dataset ingested in the tutorial
df_setosa = df.loc[df.iris_species_name == ulabels.setosa.name]  # subset the iris dataset to observations of species "setosa"
Because the file was validated, subsetting the DataFrame
is guaranteed to succeed, sparing you the headache of re-curating features & labels.
Such within-dataset queries are also possible for cloud-backed datasets, using query engines & storage formats such as DuckDB, TileDB, zarr, HDF5, and parquet.
For a use case with TileDB, see: cellxgene-census
For a use case with DuckDB, see: RxRx
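For instance, if a dataset is stored as a parquet file, such a within-dataset query could look like the following minimal sketch using DuckDB's httpfs extension (the bucket, file & column names are hypothetical):
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enable reading s3:// paths
# S3 credentials need to be configured, e.g. via DuckDB's s3_* settings
df = con.execute(
    """
    SELECT iris_species_name, avg(sepal_length) AS mean_sepal_length
    FROM read_parquet('s3://my-bucket/iris.parquet')
    GROUP BY iris_species_name
    """
).df()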
In this notebook, we show how to subset an AnnData dataset as well as generic HDF5 and zarr datasets accessed in the cloud.
Setup#
!lamin init --storage s3://lamindb-ci --name test-data
2023-09-26 15:20:51,880:INFO - Found credentials in environment variables.
❗ Instance metadata exists, but DB might have been corrupted or deleted. Re-initializing the DB.
❗ storage exists already
✅ registered instance on hub: https://lamin.ai/testuser1/test-data
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:20:59)
✅ saved: Storage(id='5VXiHCyD', root='s3://lamindb-ci', type='s3', region='us-west-1', updated_at=2023-09-26 15:20:59, created_by_id='DzTjkKse')
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-data.lndb' of instance 'testuser1/test-data'
💡 loaded instance: testuser1/test-data
❗ locked instance (to unlock and push changes to the cloud SQLite file, call: lamin close)
import lamindb as ln
2023-09-26 15:21:02,580:INFO - Found credentials in environment variables.
💡 loaded instance: testuser1/test-data (lamindb 0.54.2)
ln.settings.verbosity = "info"
We’ll need some test data:
ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.File("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
AnnData#
An h5ad file stored on S3:
file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
file.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
adata = file.backed()
This object is an AnnDataAccessor object, an AnnData object backed in the cloud:
adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:
adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
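Because this is a lazy h5py dataset, slicing follows the usual h5py semantics and reads only the accessed part of the array from the cloud, a minimal sketch:
adata.X[:5, :3]  # fetches just this slice into memory as a numpy array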
You can subset it like a normal AnnData object:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Subsets load arrays into memory upon direct access:
adata_subset.X
array([[-0.326, -0.191, 0.499, ..., -0.21 , -0.636, -0.49 ],
[ 0.811, -0.191, -0.728, ..., -0.21 , 0.604, -0.49 ],
[-0.326, -0.191, 0.643, ..., -0.21 , 2.303, -0.49 ],
...,
[-0.326, -0.191, -0.728, ..., -0.21 , 0.626, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
[-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
dtype=float32)
To load the entire subset into memory as an actual AnnData object, use to_memory():
adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
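The result is a regular in-memory AnnData object, so the standard AnnData API applies, for instance to persist the subset locally (a sketch; the filename is arbitrary):
adata_mem = adata_subset.to_memory()
adata_mem.write("pbmc68k_subset.h5ad")  # standard AnnData I/O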
Generic HDF5#
Let us query a generic HDF5 file:
file = ln.File.filter(key="lndb-storage/testfile.hdf5").one()
And get a backed accessor:
backed = file.backed()
The returned object contains the .connection and an h5py.File or zarr.Group in .storage:
backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5" (mode r)>
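Because backed.storage is a plain h5py.File, the usual h5py API applies for inspecting & lazily slicing its contents, a minimal sketch (the dataset name "mydata" is hypothetical):
list(backed.storage.keys())  # group & dataset names in the file
dset = backed.storage["mydata"]  # a lazy h5py.Dataset; nothing is read yet
dset[:10]  # reads only the first 10 rows from the cloud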
!lamin delete --force test-data
💡 deleting instance testuser1/test-data
2023-09-26 15:21:06,873:INFO - Found credentials in environment variables.
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-data.env
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-data.lndb' of instance 'testuser1/test-data'
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your remote instance on lamin.ai
❗ consider manually deleting your stored data: s3://lamindb-ci/