lamindb.Collection#

class lamindb.Collection(data: Any, name: str, version: Optional[str] = None, description: Optional[str] = None, meta: Optional[Any] = None, reference: Optional[str] = None, reference_type: Optional[str] = None, run: Optional[Run] = None, is_new_version_of: Optional[Collection] = None)#

Bases: Registry, Data, IsVersioned

Collections of artifacts.

Parameters:
  • data (DataLike) – An artifact, a list of artifacts, or an array (DataFrame, AnnData).

  • name (str) – A name.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Collection], default: None) – An old version of the collection.

  • run (Optional[Run], default: None) – The run that creates the collection.

  • meta (Optional[DataLike], default: None) – An array (DataFrame, AnnData) or an Artifact object that defines metadata for a directory of objects.

  • reference (Optional[str], default: None) – For instance, an external ID or a URL.

  • reference_type (Optional[str], default: None) – For instance, "url".

See also

Artifact

Notes

See Tutorial: Artifacts.

Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], name="My collection")
>>> collection.save()

If you have more than 100k artifacts, consider creating a collection directly from the directory without creating individual Artifact records (e.g., here, RxRx cell imaging):

>>> collection = ln.Collection("s3://my-bucket/my-images/", name="My collection", meta=df)
>>> collection.save()

Make a new version of a collection:

>>> # a non-versioned collection
>>> collection = ln.Collection(df1, description="My dataframe")
>>> collection.save()
>>> # create new collection from old collection and version both
>>> new_collection = ln.Collection(df2, is_new_version_of=collection)
>>> assert new_collection.stem_uid == collection.stem_uid
>>> assert collection.version == "1"
>>> assert new_collection.version == "2"

Fields

id AutoField

Internal id, valid only in one DB instance.

uid CharField

Universal id, valid across DB instances.

name CharField

Name or title of collection (required).

description TextField

A description.

version CharField

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

hash CharField

Hash of collection content. 86 base64 chars allow storing 64 bytes (512 bits).

reference CharField

A reference like a URL or an external ID.

reference_type CharField

Type of reference, e.g., cellxgene Census collection_id.

transform ForeignKey

Transform whose run created the collection.

run ForeignKey

Run that created the collection.

artifact OneToOneField

Storage of collection as one artifact.

visibility SmallIntegerField

Visibility of record: 1 default, 0 hidden, -1 trash.

created_at DateTimeField

Time of creation of record.

updated_at DateTimeField

Time of last update to record.

created_by ForeignKey

Creator of record, a User.

feature_sets ManyToManyField

The feature sets measured in this collection (see FeatureSet).

ulabels ManyToManyField

ULabels sampled in the collection (see ULabel).

input_of ManyToManyField

Runs that use this collection as an input.

unordered_artifacts ManyToManyField

Storage of collection as multiple artifacts.
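
A minimal sketch of listing a collection's underlying artifacts via the unordered_artifacts relationship (assumes a saved collection named "My collection" and standard Django-style related-manager access):

>>> collection = ln.Collection.filter(name="My collection").one()
>>> for artifact in collection.unordered_artifacts.all():  # underlying Artifact records
...     print(artifact.uid)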

Methods

backed(is_run_input=None)#

Return a cloud-backed data object.

Return type:

Union[AnnDataAccessor, BackedAccessor]

Notes

For more info, see tutorial: Query arrays.
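
Examples

A minimal sketch of lazy access, assuming the collection is stored as a single cloud-backed AnnData artifact so that an AnnDataAccessor is returned (names are hypothetical):

>>> collection = ln.Collection.filter(name="My collection").one()
>>> access = collection.backed()  # no full download to memory
>>> access.obs  # read lazily from the array backend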

delete(permanent=None)#

Delete collection.

Parameters:

permanent (Optional[bool], default: None) – Whether to permanently delete the collection record (skips trash).

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.delete()

classmethod from_anndata(adata, name=None, description=None, run=None, reference=None, reference_type=None, version=None, is_new_version_of=None, **kwargs)#

Create from AnnDataLike, validate & link features.

Parameters:
  • adata (AnnData) – An AnnData object.

  • field – The registry field to validate & annotate features.

  • name (Optional[str], default: None) – A name.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Collection], default: None) – An old version of the collection.

  • run (Optional[Run], default: None) – The run that creates the collection.

Return type:

Collection

See also

Artifact

Track artifacts.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> adata.var_names[:2]
Index(['ENSG00000000003', 'ENSG00000000005'], dtype='object')
>>> collection = ln.Collection.from_anndata(adata, name="My collection", field=bt.Gene.ensembl_gene_id)
>>> collection.save()

classmethod from_df(df, name=None, description=None, run=None, reference=None, reference_type=None, version=None, is_new_version_of=None, **kwargs)#

Create from DataFrame, validate & link features.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • field – The registry field to validate & annotate features.

  • name (Optional[str], default: None) – A name.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Collection], default: None) – An old version of the collection.

  • run (Optional[Run], default: None) – The run that creates the collection.

Return type:

Collection

See also

Artifact

Track artifacts.

Feature

Track features.

Notes

For more info, see Tutorial: Artifacts.

Examples

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> collection = ln.Collection.from_df(df, description="Iris flower collection batch1")

load(join='outer', is_run_input=None, **kwargs)#

Stage and load to memory.

Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.

Return type:

Any
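
Examples

A minimal sketch of loading a collection into memory, assuming its artifacts share a compatible schema so that they can be concatenated (names are hypothetical):

>>> collection = ln.Collection.filter(name="My collection").one()
>>> adata = collection.load(join="outer")  # e.g., a concatenated AnnData or DataFrame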

mapped(label_keys=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)#

Convert to map-style collection for data loaders.

Note: This currently only works for AnnData objects. The objects should have the same label keys and variables.

Parameters:
  • label_keys (Union[str, List[str], None], default: None) – Columns of the .obs slot - the names of the metadata features storing labels.

  • join (Optional[Literal['inner', 'outer']], default: 'inner') – “inner” or “outer” virtual joins. If None is passed, does not join.

  • encode_labels (Union[bool, List[str]], default: True) – Encode labels into integers. Can be a list with elements from label_keys.

  • unknown_label (Union[str, Dict[str, str], None], default: None) – Encode this label to -1. Can be a dictionary with keys from label_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of label_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (Optional[str], default: None) – Convert numpy arrays from .X to this dtype on selection.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (Optional[bool], default: None) – Whether to track this collection as run input.

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.filter(description="my collection").one()
>>> mapped = collection.mapped(label_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)

restore()#

Restore collection record from trash.

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.restore()

save(*args, **kwargs)#

Save the collection and underlying artifacts to database & storage.

Return type:

None

Examples

>>> collection = ln.Collection("./myfile.csv", name="myfile")
>>> collection.save()