Stream data
When working with large serialized objects, it is often inefficient to load entire files into memory.
Here, we show how to subset an AnnData object stored in the cloud.
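To make the motivation concrete, here is a minimal sketch of a partial read using only the Python standard library. It is independent of lamindb and is not how `file.backed()` or `file.stream()` are implemented; the fixed-width binary layout, the file name `matrix.bin`, and the helper names `write_matrix` and `read_rows` are all assumptions made for illustration.

```python
import os
import struct
import tempfile

N_COLS = 4                   # hypothetical row width
ROW_FMT = f"<{N_COLS}f"      # one row = N_COLS little-endian float32 values
ROW_SIZE = struct.calcsize(ROW_FMT)


def write_matrix(path, rows):
    """Serialize rows back to back, one fixed-size record per row."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(struct.pack(ROW_FMT, *row))


def read_rows(path, row_indices):
    """Read only the requested rows by seeking to their byte offsets."""
    selected = []
    with open(path, "rb") as fh:
        for i in row_indices:
            fh.seek(i * ROW_SIZE)
            selected.append(struct.unpack(ROW_FMT, fh.read(ROW_SIZE)))
    return selected


# Toy data: 1000 rows of 4 values each (made up for this sketch).
rows = [[float(r * N_COLS + c) for c in range(N_COLS)] for r in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.bin")
    write_matrix(path, rows)
    subset = read_rows(path, [2, 500, 999])
    file_size = os.path.getsize(path)

bytes_read = len(subset) * ROW_SIZE
print(f"read {bytes_read} of {file_size} bytes")
```

Only the bytes of the three requested rows are read; the rest of the file is never loaded. Formats such as HDF5 and zarr support the same kind of selective access on chunked arrays, which is what makes subsetting a stored AnnData practical.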
import lamindb as ln
✅ Loaded instance: testuser1/lndb-storage
ln.track()
ℹ️ Instance: testuser1/lndb-storage
ℹ️ User: testuser1
✅ Added: Transform(id='YVUCtH4GfQOy', version='0', name='stream', type=notebook, title='Stream data', created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 5, 28, 14, 48))
✅ Added: Run(id='1vfNuZRUw4gFpeBbmt2l', transform_id='YVUCtH4GfQOy', transform_version='0', created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 5, 28, 14, 48, 2))
Check the configured storage:
ln.setup.settings.storage.root
S3Path('s3://lamindb-ci/')
Register a file:
file = ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad")
file = ln.add(file)
💡 file in storage ✓ using storage key = lndb-storage/pbmc68k.h5ad
Get its backed cloud representation:
adata = file.backed()
Inspect its metadata:
adata.obs.head()
index | cell_type | n_genes | percent_mito | louvain
---|---|---|---|---
GCAGGGCTGGATTC-1 | Dendritic cells | 1168 | 0.014345 | 2
CTTTAGTGGTTACG-6 | CD19+ B | 1121 | 0.019679 | 8
TGACTGGAACCATG-7 | Dendritic cells | 1277 | 0.012961 | 1
TCAATCACCCTTCG-8 | CD19+ B | 1139 | 0.018467 | 4
CGTTATACAGTACC-8 | CD4+/CD45RO+ Memory | 1034 | 0.010163 | 0
adata.obs.cell_type.value_counts()
Dendritic cells 28
CD19+ B 11
CD4+/CD45RO+ Memory 7
CD14+ Monocytes 7
CD4+/CD25 T Reg 5
CD8+ Cytotoxic T 4
CD8+/CD45RA+ Naive Cytotoxic 3
CD56+ NK 3
CD34+ 2
Name: cell_type, dtype: int64
Construct a subsetter based on the metadata:
obs = file.subsetter()
subset_obs = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
obs.percent_mito <= 0.05
)
adata_subset = file.stream(subset_obs=subset_obs)
adata_subset
AnnData object with n_obs × n_vars = 35 × 765
obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
var: 'n_counts', 'highly_variable'
uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
adata_subset.obs.cell_type.value_counts()
Dendritic cells 28
CD14+ Monocytes 7
CD4+/CD25 T Reg 0
CD4+/CD45RO+ Memory 0
CD8+ Cytotoxic T 0
CD8+/CD45RA+ Naive Cytotoxic 0
CD19+ B 0
CD34+ 0
CD56+ NK 0
Name: cell_type, dtype: int64
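The zero counts in the output above are presumably there because `cell_type` is a pandas categorical column: subsetting removes rows but keeps the full list of categories. The sketch below reproduces this with plain pandas on a small made-up table. The DataFrame is a hypothetical stand-in for `adata.obs`, and the barcodes and values are invented.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs; values are made up for illustration.
obs = pd.DataFrame(
    {
        "cell_type": pd.Categorical(
            ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "Dendritic cells"]
        ),
        "percent_mito": [0.01, 0.02, 0.03, 0.08],
    },
    index=["cell-1", "cell-2", "cell-3", "cell-4"],
)

# Element-wise AND of two boolean Series. The parentheses around the
# comparison are required because & binds more tightly than <=.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)
subset = obs[mask]

# Categories that no longer occur are still reported, with a count of 0.
counts = subset.cell_type.value_counts()

# Dropping unused categories removes the zero rows.
trimmed = subset.cell_type.cat.remove_unused_categories().value_counts()

print(counts)
print(trimmed)
```

If the zero rows are unwanted in a report, `cat.remove_unused_categories()` is one way to drop them after subsetting.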
It is also possible to access an AnnData object's attributes and subset them directly through file.backed(), without loading the full object into memory:
adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
Note that the object above is an AnnDataAccessor object, not an AnnData object.
Check the reference to .X:
adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
Get a subset of the object; its attributes are loaded only on explicit access:
obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
obsm: ['X_pca', 'X_umap']
obsp: ['connectivities', 'distances']
uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
var: ['highly_variable', 'index', 'n_counts']
varm: ['PCs']
adata_subset.obs.cell_type.value_counts()
Dendritic cells 28
CD14+ Monocytes 7
CD4+/CD25 T Reg 0
CD4+/CD45RO+ Memory 0
CD8+ Cytotoxic T 0
CD8+/CD45RA+ Naive Cytotoxic 0
CD19+ B 0
CD34+ 0
CD56+ NK 0
Name: cell_type, dtype: int64
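The "loaded only on explicit access" behavior can be pictured with a small standard-library sketch. This is not lamindb's `AnnDataAccessor`; the class `LazyAccessor`, its `load_log` attribute, and the loader functions are hypothetical and only illustrate the general pattern of deferring a fetch until an attribute is first read.

```python
from functools import cached_property


class LazyAccessor:
    """Hypothetical accessor that defers loading until first access."""

    def __init__(self, loaders):
        # loaders maps an attribute name to a zero-argument function that
        # fetches the data, e.g. from a local file or a remote store.
        self._loaders = loaders
        self.load_log = []  # records which attributes were actually fetched

    def _load(self, name):
        self.load_log.append(name)
        return self._loaders[name]()

    @cached_property
    def obs(self):
        return self._load("obs")

    @cached_property
    def X(self):
        return self._load("X")


accessor = LazyAccessor(
    {
        "obs": lambda: {"cell_type": ["Dendritic cells", "CD19+ B"]},
        "X": lambda: [[0.0, 1.0], [2.0, 3.0]],
    }
)

fetched_at_start = list(accessor.load_log)  # nothing fetched at construction
cell_types = accessor.obs["cell_type"]      # first access triggers the load
_ = accessor.obs                            # second access hits the cache

print(accessor.load_log)
```

In this sketch, `X` is never fetched because it is never read, which mirrors the idea that inspecting `obs` on a backed object does not require reading the expression matrix.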
You can do the same with a zarr object:
file = ln.add(ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.zarr"))
print(file.backed().obs.head())
adata_subset = file.stream(subset_obs=subset_obs)
adata_subset.obs.cell_type.value_counts()
💡 file in storage ✓ using storage key = lndb-storage/pbmc68k.zarr
cell_type n_genes percent_mito louvain
index
GCAGGGCTGGATTC-1 Dendritic cells 1168 0.014345 2
CTTTAGTGGTTACG-6 CD19+ B 1121 0.019679 8
TGACTGGAACCATG-7 Dendritic cells 1277 0.012961 1
TCAATCACCCTTCG-8 CD19+ B 1139 0.018467 4
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163 0
Dendritic cells 28
CD14+ Monocytes 7
CD4+/CD25 T Reg 0
CD4+/CD45RO+ Memory 0
CD8+ Cytotoxic T 0
CD8+/CD45RA+ Naive Cytotoxic 0
CD19+ B 0
CD34+ 0
CD56+ NK 0
Name: cell_type, dtype: int64