
Query arrays

We saw how LaminDB lets you query and search across artifacts and collections using registries: Query & search registries.

Let us now look at the following case:

# get a lookup for labels
ulabels = ln.ULabel.lookup()
# query a parquet file matching "setosa"
df = ln.Artifact.filter(ulabels=ulabels.setosa, suffix=".parquet").first().load()
# query all observations in the DataFrame matching "setosa"
df_setosa = df.loc[df.iris_organism_name == ulabels.setosa.name]

Because the artifact was validated, querying the DataFrame is guaranteed to succeed!
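The row selection above can be tried without a LaminDB instance. The following is a minimal sketch, assuming a small hand-made DataFrame as a stand-in for the loaded artifact; only the column name `iris_organism_name` comes from the snippet above, the values and the `sepal_length` column are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the validated parquet artifact loaded above.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 4.9, 6.3],
        "iris_organism_name": ["setosa", "versicolor", "setosa", "virginica"],
    }
)

# A boolean Series aligned with the index selects rows (observations).
is_setosa = df["iris_organism_name"] == "setosa"
df_setosa = df.loc[is_setosa]

print(df_setosa)
```

Passing the mask as the first (row) indexer of `.loc` is what selects observations; passing it as the second indexer would address columns instead.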

Such within-collection queries are also possible for cloud-backed collections using DuckDB, TileDB, zarr, HDF5, parquet, and other storage backends.

In this notebook, we show how to subset an AnnData object and generic HDF5 and zarr collections accessed in the cloud.


!lamin init --storage s3://lamindb-ci/test-data --name test-data
❗ your database (0.67.2) is behind your installed lamindb package (0.70.2)
❗ please migrate your database: lamin migrate deploy
❗ instance exists with id 7e6fe8b3f8e25f88b77c56ab026d4fdf, but database is not loadable: re-initializing
💡 registered storage: s3://lamindb-ci/test-data
❗ updating & unlocking cloud SQLite 's3://lamindb-ci/test-data/7e6fe8b3f8e25f88b77c56ab026d4fdf.lndb' of instance 'testuser1/test-data'
💡 connected lamindb: testuser1/test-data
❗ locked instance (to unlock and push changes to the cloud SQLite file, call: lamin close)
!lamin info
Current user: testuser1
- handle: testuser1
- email: [email protected]
- uid: DzTjkKse
Auto-connect in Python: True
Current instance: testuser1/test-data
- owner: testuser1
- name: test-data
- storage root: s3://lamindb-ci/test-data
- storage region: us-west-1
- db: sqlite:////home/runner/.cache/lamindb/lamindb-ci/test-data/7e6fe8b3f8e25f88b77c56ab026d4fdf.lndb
- schema: {}
- git_repo: None
!lamin set --auto-connect True
 Usage: lamin set [OPTIONS] {auto-connect} VALUE                                
 Try 'lamin set --help' for help                                                
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ No such option: --auto-connect                                               │
import lamindb as ln
💡 connected lamindb: testuser1/test-data
ln.settings.verbosity = "info"

We'll need some test data:

❗ no run & transform get linked, consider calling ln.track()
❗ generating a new storage location at s3://lamindb-ci
❗ no run & transform get linked, consider calling ln.track()


An h5ad artifact stored on s3:

artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
adata = artifact.backed()

This object is an AnnDataAccessor object, an AnnData object backed in the cloud:

AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Without subsetting, the AnnDataAccessor object references underlying lazy h5 or zarr arrays:

adata.X

<HDF5 dataset "X": shape (70, 765), type "<f4">
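To illustrate what "lazy" means for an HDF5 array, here is a minimal sketch using h5py directly. It assumes a small in-memory file as a stand-in for the cloud-backed one; the file name `lazy_demo.h5` and the array contents are hypothetical and not part of the notebook's data.

```python
import h5py
import numpy as np

# In-memory HDF5 file (core driver, nothing is written to disk);
# a hypothetical stand-in for the cloud-backed file.
f = h5py.File("lazy_demo.h5", "w", driver="core", backing_store=False)
f.create_dataset("X", data=np.arange(12, dtype="float32").reshape(4, 3))

# f["X"] is an h5py.Dataset: a handle exposing shape and dtype;
# values are only read when the dataset is indexed.
X = f["X"]
print(type(X).__name__, X.shape, X.dtype)

# Turn a boolean row mask into increasing integer indices, then read only those rows.
rows = np.flatnonzero(np.array([True, False, True, False]))
X_subset = X[rows, :]
print(type(X_subset).__name__, X_subset.shape)

f.close()
```

Indexing the dataset returns a NumPy array that stays valid after the file is closed, whereas the `Dataset` handle itself does not.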

You can subset it like a normal AnnData object:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']
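The observation mask used for subsetting is an ordinary pandas boolean Series. The sketch below rebuilds such a mask on a hypothetical `obs` table; the column names follow the accessor output above, while the rows and the extra cell type are invented for illustration.

```python
import pandas as pd

# Hypothetical obs table; column names follow the accessor output above.
obs = pd.DataFrame(
    {
        "cell_type": [
            "Dendritic cells",
            "CD14+ Monocytes",
            "CD4+ T cells",
            "Dendritic cells",
        ],
        "percent_mito": [0.01, 0.08, 0.02, 0.05],
    }
)

# `&` binds tighter than `<=`, so the comparison needs its own parentheses.
obs_idx = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
print(obs_idx.tolist())
```

Each condition is evaluated per row, and `&` combines them element-wise, so a row is kept only if both conditions hold.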

Subsets load arrays into memory upon direct access:

adata_subset.X

array([[-0.326, -0.191,  0.499, ..., -0.21 , -0.636, -0.49 ],
       [ 0.811, -0.191, -0.728, ..., -0.21 ,  0.604, -0.49 ],
       [-0.326, -0.191,  0.643, ..., -0.21 ,  2.303, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 ,  0.626, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ],
       [-0.326, -0.191, -0.728, ..., -0.21 , -0.636, -0.49 ]],

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()

AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Generic HDF5

Let us query a generic HDF5 artifact:

artifact = ln.Artifact.filter(key="lndb-storage/testfile.hdf5").one()

And get a backed accessor:

backed = artifact.backed()

The returned object holds the connection in .connection and the h5py.File or zarr.Group in .storage:

BackedAccessor(connection=<File-like object S3FileSystem, lamindb-ci/lndb-storage/testfile.hdf5>, storage=<HDF5 file "testfile.hdf5>" (mode r)>)
<HDF5 file "testfile.hdf5>" (mode r)>
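Once an `h5py.File` is in hand, it can be explored with the regular h5py API. The following is a sketch under stated assumptions: the contents of testfile.hdf5 are not shown in this notebook, so it builds a hypothetical in-memory file (`generic_demo.h5`, with made-up dataset names) and lists its datasets; with a real backed accessor, `backed.storage` would presumably take the place of `f`.

```python
import h5py
import numpy as np

# Hypothetical layout; the real contents of testfile.hdf5 are not shown here.
f = h5py.File("generic_demo.h5", "w", driver="core", backing_store=False)
f.create_dataset("group/data", data=np.arange(10))
f.create_dataset("scalars", data=np.ones(3))

found = {}


def record(name, obj):
    # visititems calls this for every group and dataset below the root.
    if isinstance(obj, h5py.Dataset):
        found[name] = obj.shape


f.visititems(record)
print(found)

# Slicing a dataset reads only the requested values into memory.
first_three = f["group/data"][:3]
print(first_three)

f.close()
```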
# clean up test instance
!lamin delete --force test-data
!rm -r test-data
💡 deleting instance testuser1/test-data
❗ manually delete your remote instance on lamin.ai
❗ manually delete your stored data: s3://lamindb-ci/test-data
rm: cannot remove 'test-data': No such file or directory