lamindb.core.MappedCollection

class lamindb.core.MappedCollection(path_list, layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None)

Bases: object

Map-style collection for use in data loaders.

This class virtually concatenates AnnData arrays as a PyTorch map-style dataset.

If your AnnData collection is in the cloud, move it into a local cache first for faster access.

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation sample at that index, drawn from the AnnData objects in path_list. The dictionary has one key per entry of layers_keys (.X is stored under "X"), obs_keys, and obsm_keys (under f"obsm_{key}"), plus "_store_idx", the index of the AnnData object that contains this observation.
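The index lookup described above can be illustrated with a toy class. This is a minimal sketch of virtual concatenation in general, not lamindb's implementation: ToyMappedStores and its in-memory lists are hypothetical stand-ins for the AnnData files, and only the "X" and "_store_idx" keys are mimicked.

```python
from bisect import bisect_right
from itertools import accumulate


class ToyMappedStores:
    """Toy stand-in that virtually concatenates several in-memory 'stores'."""

    def __init__(self, stores):
        # Each store is a list of rows; a row plays the role of one observation.
        self.stores = stores
        # Cumulative end offsets, e.g. store sizes [2, 3] give [2, 5].
        self.offsets = list(accumulate(len(s) for s in stores))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Locate the store whose index range contains the global index.
        store_idx = bisect_right(self.offsets, idx)
        start = self.offsets[store_idx - 1] if store_idx > 0 else 0
        row = self.stores[store_idx][idx - start]
        return {"X": row, "_store_idx": store_idx}


dataset = ToyMappedStores([[[1, 0], [0, 2]], [[3, 3], [4, 0], [0, 5]]])
sample = dataset[3]  # second row of the second store
```

No data is copied when the toy object is built; each lookup is resolved against the store that owns the requested index, which is the property that makes a map-style dataset over many files practical.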

Note

For a guide, see Train a machine learning model on a collection.

For a more convenient way to create a MappedCollection, see mapped().

This currently only works for collections of AnnData objects.

The implementation was influenced by the SCimilarity data loader.

Parameters:
  • path_list (list[Union[str, Path]]) – A list of paths to AnnData objects stored in .h5ad or .zarr formats.

  • layers_keys (Union[str, list[str], None], default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obsm_keys (Union[str, list[str], None], default: None) – Keys from the .obsm slots.

  • obs_keys (Union[str, list[str], None], default: None) – Keys from the .obs slots.

  • join (Optional[Literal['inner', 'outer']], default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (Union[str, dict[str, str], None], default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (Optional[str], default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.
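To make the interplay of encode_labels and unknown_label concrete, here is a small sketch of integer label encoding in which one designated label maps to -1. The helper build_label_encoder is hypothetical, and the order in which codes are assigned is an assumption; the library may order categories differently.

```python
def build_label_encoder(categories, unknown_label=None):
    """Map each category to an integer code; the unknown label maps to -1."""
    # Hypothetical helper: codes follow the order of `categories` (an assumption).
    known = [cat for cat in categories if cat != unknown_label]
    encoder = {cat: code for code, cat in enumerate(known)}
    if unknown_label is not None:
        encoder[unknown_label] = -1
    return encoder


encoder = build_label_encoder(["B cell", "T cell", "unknown"], unknown_label="unknown")
codes = [encoder[label] for label in ["T cell", "unknown", "B cell"]]
```

Reserving -1 for the unknown label lets downstream code mask those samples, for example by excluding them from a classification loss.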

Attributes

closed property

Check if connections to array streaming backend are closed.

Not meaningful if parallel=True.

original_shapes property

Shapes of the underlying AnnData objects.

shape property

Shape of the (virtually aligned) dataset.
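As a rough illustration of how the aligned shape could relate to the original shapes, the sketch below assumes that observations are stacked across objects and that variables are intersected for join="inner" and united for join="outer". The function joined_shape and its inputs are hypothetical, and join=None is deliberately not covered.

```python
def joined_shape(original_shapes, var_names_per_store, join="inner"):
    """Return (total observations, number of joined variables)."""
    # Observations are stacked, so their counts add up.
    n_obs = sum(n for n, _ in original_shapes)
    var_sets = [set(names) for names in var_names_per_store]
    if join == "inner":
        # Keep only variables present in every store.
        n_vars = len(set.intersection(*var_sets))
    elif join == "outer":
        # Keep every variable present in at least one store.
        n_vars = len(set.union(*var_sets))
    else:
        raise ValueError("this sketch only covers 'inner' and 'outer'")
    return (n_obs, n_vars)


shapes = [(100, 3), (50, 3)]
var_names = [["a", "b", "c"], ["b", "c", "d"]]
inner = joined_shape(shapes, var_names, join="inner")
outer = joined_shape(shapes, var_names, join="outer")
```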

Methods

close()

Close connections to array streaming backend.

No effect if parallel=True.

get_label_weights(obs_keys)

Get all weights for the given label keys.
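Per-sample label weights are commonly used for balanced sampling, for instance with torch.utils.data.WeightedRandomSampler. The sketch below shows one common scheme, inverse label frequency; whether get_label_weights uses exactly this formula is an assumption, and inverse_frequency_weights is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample the weight 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


weights = inverse_frequency_weights(["a", "a", "b", "a"])
```

With these weights every label carries the same total weight, so rare labels are drawn about as often as frequent ones.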

get_merged_categories(label_key)

Get merged categories for label_key from all .obs.

get_merged_labels(label_key)

Get merged labels for label_key from all .obs.

static torch_worker_init_fn(worker_id)

worker_init_fn for torch.utils.data.DataLoader.

Improves performance for num_workers > 1.
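A sketch of how this hook might be wired into a DataLoader follows. It assumes that lamindb and torch are installed, that path_list points to local .h5ad or .zarr files, and that the files have an obs column named "cell_type" (a hypothetical name). Setting parallel=True alongside multiple workers follows the parameter description above. The helper make_loader is illustrative only.

```python
def make_loader(path_list, batch_size=128, num_workers=2):
    """Build a DataLoader over local AnnData files (sketch; needs lamindb and torch)."""
    # Imported lazily so that defining this function does not require the packages.
    from lamindb.core import MappedCollection
    from torch.utils.data import DataLoader

    # "cell_type" is a placeholder obs key; replace it with a column in your data.
    dataset = MappedCollection(path_list, obs_keys=["cell_type"], parallel=True)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # Static method, so it can be passed straight from the class.
        worker_init_fn=MappedCollection.torch_worker_init_fn,
    )
```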