Stream data

When working with large serialized objects, it is often inefficient to load entire files into memory.

Here, we show how to subset an AnnData object stored in the cloud.

import lamindb as ln
✅ Loaded instance: testuser1/lndb-storage
ln.track()
ℹ️ Instance: testuser1/lndb-storage
ℹ️ User: testuser1
✅ Added: Transform(id='YVUCtH4GfQOy', version='0', name='stream', type=notebook, title='Stream data', created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 5, 28, 14, 48))
✅ Added: Run(id='1vfNuZRUw4gFpeBbmt2l', transform_id='YVUCtH4GfQOy', transform_version='0', created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 5, 28, 14, 48, 2))

Check the configured storage:

ln.setup.settings.storage.root
S3Path('s3://lamindb-ci/')

Register a file:

file = ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad")
file = ln.add(file)
💡 file in storage ✓ using storage key = lndb-storage/pbmc68k.h5ad

Get its backed cloud representation:

adata = file.backed()

Inspect its metadata:

adata.obs.head()
                            cell_type  n_genes  percent_mito louvain
index
GCAGGGCTGGATTC-1      Dendritic cells     1168      0.014345       2
CTTTAGTGGTTACG-6              CD19+ B     1121      0.019679       8
TGACTGGAACCATG-7      Dendritic cells     1277      0.012961       1
TCAATCACCCTTCG-8              CD19+ B     1139      0.018467       4
CGTTATACAGTACC-8  CD4+/CD45RO+ Memory     1034      0.010163       0
adata.obs.cell_type.value_counts()
Dendritic cells                 28
CD19+ B                         11
CD4+/CD45RO+ Memory              7
CD14+ Monocytes                  7
CD4+/CD25 T Reg                  5
CD8+ Cytotoxic T                 4
CD8+/CD45RA+ Naive Cytotoxic     3
CD56+ NK                         3
CD34+                            2
Name: cell_type, dtype: int64

Construct a subsetter based on the metadata:

obs = file.subsetter()
subset_obs = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
adata_subset = file.stream(subset_obs=subset_obs)
adata_subset
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
adata_subset.obs.cell_type.value_counts()
Dendritic cells                 28
CD14+ Monocytes                  7
CD4+/CD25 T Reg                  0
CD4+/CD45RO+ Memory              0
CD8+ Cytotoxic T                 0
CD8+/CD45RA+ Naive Cytotoxic     0
CD19+ B                          0
CD34+                            0
CD56+ NK                         0
Name: cell_type, dtype: int64
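
The zero counts in the output above are what pandas reports for a categorical column: categories that no longer occur in the subset are still listed. The following is a minimal sketch of the same kind of boolean mask on a toy DataFrame. The data values are invented for illustration and the DataFrame only stands in for the real obs table; it does not use lamindb.

```python
import pandas as pd

# Toy stand-in for the obs table; the values are made up for illustration.
obs = pd.DataFrame(
    {
        "cell_type": pd.Categorical(
            ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "Dendritic cells"]
        ),
        "percent_mito": [0.014, 0.019, 0.012, 0.080],
    }
)

# Same shape of condition as above: a membership test AND a numeric threshold.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
subset = obs[mask]

# A categorical column keeps unused categories, so they appear with count 0.
counts = subset.cell_type.value_counts()

# Dropping unused categories leaves only the cell types present in the subset.
trimmed = subset.cell_type.cat.remove_unused_categories().value_counts()
```

If the zero rows are not wanted in a report, calling `cat.remove_unused_categories()` on the column before counting is one way to drop them.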

It is also possible to access the attributes of an AnnData object and subset them directly through file.backed() without loading the full object into memory:

adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

Note that the object above is an AnnDataAccessor object, not an AnnData object.

Check the reference to .X:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
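
The output above is an HDF5 dataset handle rather than an in-memory array. As a rough illustration of what that implies, the sketch below uses h5py directly on a small in-memory file. The file name `toy.h5`, the generated values, and the selected row indices are assumptions made for this example; it says nothing about how lamindb reads from cloud storage internally.

```python
import h5py
import numpy as np

# In-memory HDF5 file (nothing is written to disk); the shape mirrors the example above.
with h5py.File("toy.h5", "w", driver="core", backing_store=False) as f:
    data = np.arange(70 * 765, dtype="f4").reshape(70, 765)
    dset = f.create_dataset("X", data=data)

    # The dataset object is a handle: shape and dtype are available without reading values.
    shape, dtype = dset.shape, dset.dtype

    # Boolean mask over observations, converted to increasing row indices.
    mask = np.zeros(70, dtype=bool)
    mask[[3, 10, 42]] = True
    rows = np.flatnonzero(mask)

    # Indexing the handle reads only the selected rows into a NumPy array.
    block = dset[rows, :]
```

h5py expects index lists in increasing order, which `np.flatnonzero` guarantees for a boolean mask.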

Get a subset of the object; its attributes are loaded only on explicit access:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']
adata_subset.obs.cell_type.value_counts()
Dendritic cells                 28
CD14+ Monocytes                  7
CD4+/CD25 T Reg                  0
CD4+/CD45RO+ Memory              0
CD8+ Cytotoxic T                 0
CD8+/CD45RA+ Naive Cytotoxic     0
CD19+ B                          0
CD34+                            0
CD56+ NK                         0
Name: cell_type, dtype: int64
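
To make "loaded only on explicit access" concrete, here is a small pure-Python sketch of the general pattern. The class `LazySubset`, its `readers` mapping, and the toy storage are hypothetical names invented for this illustration; this is not the AnnDataAccessorSubset implementation.

```python
class LazySubset:
    """Hypothetical sketch: remembers a row selection and reads fields on demand."""

    def __init__(self, readers, rows):
        self._readers = readers  # field name -> function(rows) that fetches data
        self._rows = rows
        self.loaded = []  # records which fields were actually read

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, i.e. for data fields.
        readers = self.__dict__.get("_readers", {})
        if name not in readers:
            raise AttributeError(name)
        self.loaded.append(name)
        return readers[name](self._rows)


# Toy "storage": two fields backed by plain lists.
storage = {"obs": ["a", "b", "c", "d"], "X": [10, 20, 30, 40]}
readers = {k: (lambda rows, k=k: [storage[k][i] for i in rows]) for k in storage}

subset = LazySubset(readers, rows=[0, 2])
before = list(subset.loaded)  # nothing has been read at this point
obs = subset.obs              # triggers a read of "obs" only; "X" stays untouched
```

Creating the subset only stores the selection; the cost of reading is paid per field, at the moment that field is accessed.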

You can do the same with a zarr object:

file = ln.add(ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.zarr"))
print(file.backed().obs.head())
adata_subset = file.stream(subset_obs=subset_obs)
adata_subset.obs.cell_type.value_counts()
💡 file in storage ✓ using storage key = lndb-storage/pbmc68k.zarr
                            cell_type  n_genes  percent_mito louvain
index                                                               
GCAGGGCTGGATTC-1      Dendritic cells     1168      0.014345       2
CTTTAGTGGTTACG-6              CD19+ B     1121      0.019679       8
TGACTGGAACCATG-7      Dendritic cells     1277      0.012961       1
TCAATCACCCTTCG-8              CD19+ B     1139      0.018467       4
CGTTATACAGTACC-8  CD4+/CD45RO+ Memory     1034      0.010163       0
Dendritic cells                 28
CD14+ Monocytes                  7
CD4+/CD25 T Reg                  0
CD4+/CD45RO+ Memory              0
CD8+ Cytotoxic T                 0
CD8+/CD45RA+ Naive Cytotoxic     0
CD19+ B                          0
CD34+                            0
CD56+ NK                         0
Name: cell_type, dtype: int64