lamindb.Collection#

class lamindb.Collection(artifacts: list[lnschema_core.models.Artifact], name: str, version: str, description: str | None = None, meta: Optional[Any] = None, reference: str | None = None, reference_type: str | None = None, run: lnschema_core.models.Run | None = None, is_new_version_of: lnschema_core.models.Collection | None = None)#

Bases: Registry, Data, IsVersioned

Collections of artifacts.

Parameters:
  • artifacts (list[Artifact]) – A list of artifacts.

  • name (str) – A name.

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Collection | None, default: None) – An old version of the collection.

  • run (Run | None, default: None) – The run that creates the collection.

  • meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.

  • reference (str | None, default: None) – For instance, an external ID or a URL.

  • reference_type (str | None, default: None) – For instance, "url".
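
For instance, a hedged construction sketch combining several of these parameters; artifact1 and artifact2 are placeholder Artifact objects assumed to already exist:

>>> import lamindb as ln
>>> collection = ln.Collection(
...     [artifact1, artifact2],
...     name="My collection",
...     version="1",
...     reference="https://example.com/source-dataset",  # hypothetical URL
...     reference_type="url",
... )
>>> collection.save()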

See also

Artifact

Notes

See tutorial: Tutorial: Artifacts.

Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
>>> collection.save()

If you have more than 100k artifacts, consider creating a collection directly from the directory without creating individual artifact records (e.g., RxRx: cell imaging):

>>> collection = ln.Collection("s3://my-bucket/my-images/", name="My collection", meta=df)
>>> collection.save()

Make a new version of a collection:

>>> # a non-versioned collection
>>> collection = ln.Collection(df1, description="My dataframe")
>>> collection.save()
>>> # create new collection from old collection and version both
>>> new_collection = ln.Collection(df2, is_new_version_of=collection)
>>> assert new_collection.stem_uid == collection.stem_uid
>>> assert collection.version == "1"
>>> assert new_collection.version == "2"

Fields

id AutoField

Internal id, valid only in one DB instance.

uid CharField

Universal id, valid across DB instances.

name CharField

Name or title of collection (required).

description TextField

A description.

version CharField

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

hash CharField

Hash of collection content. 86 base64 chars allow storing 64 bytes (512 bits).

reference CharField

A reference like URL or external ID.

reference_type CharField

Type of reference, e.g., cellxgene Census collection_id.

transform ForeignKey

Transform whose run created the collection.

run ForeignKey

Run that created the collection.

artifact OneToOneField

Storage of collection as one artifact.

visibility SmallIntegerField

Visibility of record: 1 default, 0 hidden, -1 trash.

created_at DateTimeField

Time of creation of record.

updated_at DateTimeField

Time of last update to record.

created_by ForeignKey

Creator of record, a User.

feature_sets ManyToManyField

The feature sets measured in this collection (see FeatureSet).

ulabels ManyToManyField

ULabels sampled in the collection (see ULabel).

input_of ManyToManyField

Runs that use this collection as an input.

unordered_artifacts ManyToManyField

Storage of collection as multiple artifacts.
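
These fields support Django-style queries through the registry API; a brief sketch (the filter values shown are hypothetical):

>>> ln.Collection.filter(name="My collection").df()  # query by name, return a DataFrame of matches
>>> ln.Collection.filter(version="2").all()  # all collections at version "2"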

Methods

cache(is_run_input=None)#

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (Optional[bool], default: None) – Whether to track this collection as run input.

Return type:

list[UPath]
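
Examples

A minimal usage sketch, assuming collection is a saved Collection whose artifacts live in cloud storage:

>>> paths = collection.cache()  # downloads only outdated artifacts
>>> paths  # local paths, one per artifact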

delete(permanent=None)#

Delete collection.

Parameters:

permanent (Optional[bool], default: None) – Whether to permanently delete the collection record (skips trash).

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.delete()
load(join='outer', is_run_input=None, **kwargs)#

Stage and load to memory.

Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.

Return type:

Any
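
Examples

A minimal sketch, assuming the collection's artifacts hold tabular data that can be concatenated:

>>> df = collection.load()  # concatenates with an outer join by default
>>> df = collection.load(join="inner")  # keep only shared columns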

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)#

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

If the AnnData artifacts in your collection are in the cloud, move them into a local cache first via cache().

The __getitem__ method of the MappedCollection object takes a single integer index and returns a dictionary with the observation data for this index, drawn from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is under "X"), obs_keys, obsm_keys (under f"obsm_{key}"), and "_store_idx", the index of the AnnData object containing this observation.

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections of AnnData artifacts.

Parameters:
  • layers_keys (Union[str, list[str], None], default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obsm_keys (Union[str, list[str], None], default: None) – Keys from the .obsm slots.

  • obs_keys (Union[str, list[str], None], default: None) – Keys from the .obs slots.

  • join (Optional[Literal['inner', 'outer']], default: 'inner') – “inner” or “outer” virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (Union[str, dict[str, str], None], default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (Optional[str], default: None) – Convert numpy arrays from .X, .layers, and .obsm to this dtype.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (Optional[bool], default: None) – Whether to track this collection as run input.

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.filter(description="my collection").one()
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
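
Each item of the mapped collection is a dictionary; a sketch of inspecting one sample (keys depend on the obs_keys you pass):

>>> sample = mapped[0]
>>> sample["X"]          # observation vector from .X
>>> sample["cell_type"]  # encoded label for the "cell_type" obs key
>>> sample["_store_idx"] # which AnnData object the sample came from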
restore()#

Restore collection record from trash.

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.restore()
save(transfer_labels=False, using=None)#

Save the collection and underlying artifacts to database & storage.

Parameters:
  • transfer_labels (bool, default: False) – Transfer labels from artifacts to the collection.

  • using (Optional[str], default: None) – The database to which you want to save.

Return type:

None

Examples

>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
>>> collection.save()
stage(is_run_input=None)#

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (Optional[bool], default: None) – Whether to track this collection as run input.

Return type:

list[UPath]