lamindb.Dataset#

class lamindb.Dataset(data: Any, name: str, description: Optional[str] = None, version: Optional[str] = None, is_new_version_of: Optional[Dataset] = None, run: Optional[Run] = None, reference: Optional[str] = None, reference_type: Optional[str] = None)#
class lamindb.Dataset(*db_args)

Bases: Registry, Data

Datasets: mutable collections of data batches.

Parameters:
  • data (DataLike) – A data object (DataFrame, AnnData) to store.

  • name (str) – A name.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Dataset], default: None) – An old version of the dataset.

  • run (Optional[Run], default: None) – The run that creates the dataset.

  • reference (Optional[str], default: None) – A reference like a URL or external ID.

  • reference_type (Optional[str], default: None) – Type of reference, e.g., a cellxgene Census dataset_id.

See also

File

Notes

See Tutorial: Files & datasets.

The File & Dataset registries both

  • track data batches of arbitrary format & size

  • can validate & link features (the measured dimensions in a data batch)

Typically,

  • a file stores a single immutable batch of data

  • a dataset stores a mutable collection of data batches

Examples:

  • Blob-like immutable files (pdf, txt, csv, jpg, …) or arrays (h5, h5ad, …) → File

  • Mutable streamable backends (DuckDB, zarr, TileDB, …) → Dataset wrapping File

  • Collections of files → Dataset wrapping File

  • Datasets in BigQuery, Snowflake, Postgres, … → Dataset (not yet implemented)

Hence, while

  • files always have a one-to-one correspondence with a storage accessor

  • datasets can reference a single file, multiple files or a dataset in a warehouse like BigQuery or Snowflake

Examples

Create a dataset from a DataFrame:

>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_species_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> dataset = ln.Dataset(df, name="Iris flower dataset batch1")
>>> dataset.save()

Create a dataset from a collection of File objects:

>>> dataset = ln.Dataset([file1, file2], name="My dataset")
>>> dataset.save()

Make a new version of a dataset:

>>> # a non-versioned dataset
>>> dataset = ln.Dataset(df1, description="My dataframe")
>>> dataset.save()
>>> # create new dataset from old dataset and version both
>>> new_dataset = ln.Dataset(df2, is_new_version_of=dataset)
>>> assert new_dataset.initial_version == dataset.initial_version
>>> assert dataset.version == "1"
>>> assert new_dataset.version == "2"

Fields

id CharField

Universal id, valid across DB instances.

name CharField

Name or title of dataset (required).

description TextField

A description.

version CharField

Version (default None).

Use this together with initial_version to label different versions of a dataset.

Consider using semantic versioning, as formalized in Python versioning (PEP 440).
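
A minimal sketch of labeling a version explicitly at creation (the name and version string here are hypothetical):

>>> dataset = ln.Dataset(df, name="My dataset", version="1.0.0")
>>> dataset.save()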

hash CharField

Hash of dataset content. 86 base64 characters allow storing 64 bytes (512 bits).

reference CharField

A reference like a URL or an external ID.

reference_type CharField

Type of reference, e.g., cellxgene Census dataset_id.

transform ForeignKey

Transform whose run created the dataset.

run ForeignKey

Run that created the dataset.

file ForeignKey

Storage of dataset as one file.

initial_version ForeignKey

Initial version of the dataset, a Dataset object.

created_at DateTimeField

Time of creation of record.

updated_at DateTimeField

Time of last update to record.

created_by ForeignKey

Creator of record, a User.

feature_sets ManyToManyField

The feature sets measured in this dataset (see FeatureSet).

ulabels ManyToManyField

ULabels sampled in the dataset (see Feature).

input_of ManyToManyField

Runs that use this dataset as an input.

files ManyToManyField

Storage of dataset as multiple files.
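
A minimal sketch of checking how a saved dataset is stored (assumes a dataset object in scope; no data is loaded):

>>> dataset.file         # a single File if the dataset is stored as one file, else None
>>> dataset.files.all()  # the linked File records if the dataset wraps a collection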

Methods

backed(is_run_input=None)#

Return a cloud-backed data object.

Return type:

Union[AnnDataAccessor, BackedAccessor]

Notes

For more info, see tutorial: Query files & datasets.
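
Examples

A minimal sketch, assuming a previously saved dataset that wraps a single AnnData file (the dataset name is hypothetical):

>>> dataset = ln.Dataset.filter(name="My dataset").one()
>>> backed = dataset.backed()  # streams from storage instead of loading into memory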

delete(storage=None)#

Delete dataset, optionally from storage.

Parameters:

storage (Optional[bool], default: None) – Indicate whether you want to delete the underlying data in storage.

Return type:

None

Examples

For any Dataset object dataset, call:

>>> dataset.delete()

classmethod from_anndata(adata, field, name=None, description=None, run=None, modality=None, reference=None, reference_type=None)#

Create from AnnDataLike, validate & link features.

Return type:

Dataset

See also

File

Track files.

Feature

Track features.

Examples

>>> import lnschema_bionty as lb
>>> lb.settings.species = "human"
>>> adata = ln.dev.datasets.anndata_with_obs()
>>> adata.var_names[:2]
Index(['ENSG00000000003', 'ENSG00000000005'], dtype='object')
>>> dataset = ln.Dataset.from_anndata(adata, name="My dataset", field=lb.Gene.ensembl_gene_id)
>>> dataset.save()
classmethod from_df(df, field=FieldAttr(Feature.name), name=None, description=None, run=None, modality=None, reference=None, reference_type=None)#

Create from DataFrame, validate & link features.

Return type:

Dataset

See also

File

Track files.

Feature

Track features.

Notes

For more info, see Tutorial: Files & datasets.

Examples

>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_species_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> dataset = ln.Dataset.from_df(df, description="Iris flower dataset batch1")

load(is_run_input=None, **kwargs)#

Stage and load to memory.

Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.

Return type:

Any
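
Examples

A minimal sketch, assuming a previously saved dataset (e.g., the iris dataset created above):

>>> df = dataset.load()  # one concatenated DataFrame if the dataset wraps multiple batches
>>> df.head()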

save(*args, **kwargs)#

Save the dataset to database & storage.

Return type:

None

Examples

>>> dataset = ln.Dataset(df, name="My dataset")
>>> dataset.save()