
Query files & datasets#

We saw how LaminDB allows you to query & search across files & datasets using registries: Query & search registries.

This guide addresses “queries within datasets”, for instance:

ulabels = ln.ULabel.lookup()
df = ln.File.filter(ulabels=ulabels.setosa).first().load()  # access a batch of the iris dataset ingested in the tutorial
df_setosa = df.loc[df.iris_species_name == ulabels.setosa.name]  # subset the iris dataset to observations of species "setosa"

Because the file was validated, subsetting the DataFrame is guaranteed to succeed, sparing you the headache of re-curating features & labels.

Such within-dataset queries are also possible for cloud-backed datasets using DuckDB, TileDB, zarr, HDF5, parquet and other storage backends.
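
For example, here is a minimal sketch of such a query against a parquet file with DuckDB; the bucket path, file name, and column below are hypothetical placeholders, not data registered in this instance:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # extension for reading s3:// paths
con.execute("LOAD httpfs")
# the WHERE predicate is pushed down, so only matching row groups are fetched
df = con.execute(
    "SELECT * FROM read_parquet('s3://my-bucket/iris.parquet') WHERE species = 'setosa'"
).df()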

In this notebook, we show how to subset an AnnData object as well as generic HDF5 and zarr datasets accessed in the cloud.

Setup#

!lamin init --storage s3://lamindb-ci --name test-data
2023-09-26 15:20:51,880:INFO - Found credentials in environment variables.
❗ Instance metadata exists, but DB might have been corrupted or deleted. Re-initializing the DB.
❗ storage exists already
✅ registered instance on hub: https://lamin.ai/testuser1/test-data
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:20:59)
✅ saved: Storage(id='5VXiHCyD', root='s3://lamindb-ci', type='s3', region='us-west-1', updated_at=2023-09-26 15:20:59, created_by_id='DzTjkKse')
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-data.lndb' of instance 'testuser1/test-data'
💡 loaded instance: testuser1/test-data
❗ locked instance (to unlock and push changes to the cloud SQLite file, call: lamin close)

import lamindb as ln
2023-09-26 15:21:02,580:INFO - Found credentials in environment variables.
💡 loaded instance: testuser1/test-data (lamindb 0.54.2)
ln.settings.verbosity = "info"

We’ll need some test data:

ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
ln.File("s3://lamindb-ci/lndb-storage/testfile.hdf5").save()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
❗ no run & transform get linked, consider passing a `run` or calling ln.track()

AnnData#

An h5ad file stored on s3:

file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
file.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
adata = file.backed()

This object is an AnnDataAccessor object, an AnnData object backed in the cloud:

adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
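
Because this is a regular lazy h5py dataset, standard slicing applies and fetches only the requested region from the cloud:

adata.X[:5]  # reads just the first 5 rows, not the full matrix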

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']

Subsets load arrays into memory upon direct access:

adata_subset.X
array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       ...,
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],
      dtype=float32)

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
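
If you want to persist such a subset, a minimal sketch is to save the in-memory object as a new file; this assumes ln.File accepts an in-memory AnnData (as it does a DataFrame in the tutorial), and the key below is a hypothetical example:

adata_new = adata_subset.to_memory()
ln.File(adata_new, key="lndb-storage/pbmc68k_subset.h5ad").save()  # hypothetical key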

Generic HDF5#

Let us query a generic HDF5 file:

file = ln.File.filter(key="lndb-storage/testfile.hdf5").one()

And get a backed accessor:

backed = file.backed()

The returned object contains the connection object in .connection and the h5py.File or zarr.Group in .storage:

backed
BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5" (mode r)>)
backed.storage
<HDF5 file "testfile.hdf5" (mode r)>
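
From here on, the regular h5py API applies, for example to list contents or lazily read a slice of a dataset; the dataset name below is a placeholder for whatever testfile.hdf5 actually contains:

list(backed.storage.keys())  # top-level groups & datasets
# backed.storage["my_dataset"][:10]  # would lazily read the first 10 elements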
!lamin delete --force test-data
💡 deleting instance testuser1/test-data
2023-09-26 15:21:06,873:INFO - Found credentials in environment variables.
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-data.env
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-data.lndb' of instance 'testuser1/test-data'
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your remote instance on lamin.ai
❗     consider manually deleting your stored data: s3://lamindb-ci/