lamindb.Dataset#

class lamindb.Dataset(data: Any, name: str, version: str, description: Optional[str] = None, meta: Optional[Any] = None, reference: Optional[str] = None, reference_type: Optional[str] = None, run: Optional[Run] = None, is_new_version_of: Optional[Dataset] = None)#

Bases: Registry, Data

Datasets: collections of data batches.

Parameters:
  • data (DataLike) – An array (DataFrame, AnnData), a directory, or a list of File objects.

  • name (str) – A name.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Dataset], default: None) – An old version of the dataset.

  • run (Optional[Run], default: None) – The run that creates the dataset.

  • meta (Optional[DataLike], default: None) – An array (DataFrame, AnnData) or a File object that defines metadata for a directory of objects.

  • reference (Optional[str], default: None) – For instance, an external ID or a URL.

  • reference_type (Optional[str], default: None) – For instance, "url".

See also

File

Notes

See tutorial: Tutorial: Files & datasets.

The File & Dataset registries both

  • track data batches of arbitrary format & size

  • can validate & link features (the measured dimensions in a data batch)

Often, a file stores a single batch of data and a dataset stores a collection of data batches, while

  • files always have a one-to-one correspondence with a storage accessor

  • datasets can reference multiple objects or an array store

Examples:

  • store a blob-like file (pdf, txt, csv, jpg, …) as a File

  • store a queryable array (parquet, HDF5, h5ad, DuckDB, zarr, TileDB, …) as a Dataset or File, depending on your use case

  • store collections of files and arrays with Dataset

  • once implemented, datasets in BigQuery, Snowflake, Postgres, … would be stored in Dataset

Examples

Create a dataset from a DataFrame:

>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> dataset = ln.Dataset(df, name="Iris flower dataset batch1")
>>> dataset.save()

Create a dataset from a collection of File objects:

>>> dataset = ln.Dataset([file1, file2], name="My dataset")
>>> dataset.save()

If your directory contains fewer than ~100k files, create File records via from_dir (see Tutorial: Files & datasets) and build the dataset from them:

>>> files = ln.File.from_dir("./my_dir")
>>> dataset = ln.Dataset(files, name="My dataset")
>>> dataset.save()

If you have more than ~100k files, consider creating a dataset directly from the directory without creating File records (see RxRx: cell imaging):

>>> dataset = ln.Dataset("s3://my-bucket/my-images/", name="My dataset", meta=df)
>>> dataset.save()

Make a new version of a dataset:

>>> # a non-versioned dataset
>>> dataset = ln.Dataset(df1, description="My dataframe")
>>> dataset.save()
>>> # create new dataset from old dataset and version both
>>> new_dataset = ln.Dataset(df2, is_new_version_of=dataset)
>>> assert new_dataset.initial_version == dataset.initial_version
>>> assert dataset.version == "1"
>>> assert new_dataset.version == "2"

Fields

id AutoField

Internal id, valid only in one DB instance.

uid CharField

Universal id, valid across DB instances.

name CharField

Name or title of dataset (required).

description TextField

A description.

version CharField

Version (default None).

Use this together with initial_version to label different versions of a dataset.

Consider using semantic versioning with Python versioning.

hash CharField

Hash of dataset content. 86 base64 chars allow storing 64 bytes (512 bits).

reference CharField

A reference like URL or external ID.

reference_type CharField

Type of reference, e.g., cellxgene Census dataset_id.

transform ForeignKey

Transform whose run created the dataset.

run ForeignKey

Run that created the dataset.

file OneToOneField

Storage of dataset as a single file.

storage ForeignKey

Storage of dataset as mere paths handled by a key value store or file system.

initial_version ForeignKey

Initial version of the dataset, a Dataset object.

visibility SmallIntegerField

Visibility of record, 0-default, 1-hidden, 2-trash.

created_at DateTimeField

Time of creation of record.

updated_at DateTimeField

Time of last update to record.

created_by ForeignKey

Creator of record, a User.

feature_sets ManyToManyField

The feature sets measured in this dataset (see FeatureSet).

ulabels ManyToManyField

ULabels sampled in the dataset (see Feature).

input_of ManyToManyField

Runs that use this dataset as an input.

files ManyToManyField

Storage of dataset as multiple files.

Methods

backed(is_run_input=None)#

Return a cloud-backed data object.

Return type:

Union[AnnDataAccessor, BackedAccessor]

Notes

For more info, see tutorial: Query arrays.
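
Examples

A minimal sketch, assuming the dataset wraps a single backed AnnData store (the name used in the query is hypothetical):

>>> dataset = ln.Dataset.filter(name="My dataset").one()
>>> backed = dataset.backed()  # AnnDataAccessor for an AnnData-backed dataset
>>> backed.obs  # access slots lazily, without loading the full array into memory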

delete(permanent=None, storage=None)#

Delete dataset.

Parameters:
  • permanent (Optional[bool], default: None) – Whether to permanently delete the dataset record (skips trash).

  • storage (Optional[bool], default: None) – Indicate whether you want to delete the linked file in storage.

Return type:

None

Examples

For any Dataset object dataset, call:

>>> dataset.delete()
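
To also skip the trash and remove the data in storage, pass the flags documented above (a sketch; this irreversibly deletes the underlying data):

>>> dataset.delete(permanent=True, storage=True)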
classmethod from_anndata(adata, field, name=None, description=None, run=None, reference=None, reference_type=None, version=None, is_new_version_of=None, **kwargs)#

Create from AnnDataLike, validate & link features.

Parameters:
  • adata (Any) – An AnnData object.

  • field (Optional[DeferredAttribute]) – The registry field to validate & annotate features.

  • name (Optional[str], default: None) – A name.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Dataset], default: None) – An old version of the dataset.

  • run (Optional[Run], default: None) – The run that creates the dataset.

Return type:

Dataset

See also

File

Track files.

Feature

Track features.

Examples

>>> import lnschema_bionty as lb
>>> lb.settings.organism = "human"
>>> adata = ln.dev.datasets.anndata_with_obs()
>>> adata.var_names[:2]
Index(['ENSG00000000003', 'ENSG00000000005'], dtype='object')
>>> dataset = ln.Dataset.from_anndata(adata, name="My dataset", field=lb.Gene.ensembl_gene_id)
>>> dataset.save()
classmethod from_df(df, field=FieldAttr(Feature.name), name=None, description=None, run=None, reference=None, reference_type=None, version=None, is_new_version_of=None, **kwargs)#

Create from DataFrame, validate & link features.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • field (DeferredAttribute, default: FieldAttr(Feature.name)) – The registry field to validate & annotate features.

  • name (Optional[str], default: None) – A name.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Dataset], default: None) – An old version of the dataset.

  • run (Optional[Run], default: None) – The run that creates the dataset.

Return type:

Dataset

See also

File

Track files.

Feature

Track features.

Notes

For more info, see tutorial: Tutorial: Files & datasets.

Examples

>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> dataset = ln.Dataset.from_df(df, description="Iris flower dataset batch1")
load(join='outer', is_run_input=None, **kwargs)#

Stage and load to memory.

Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.

Return type:

Any
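
Examples

A minimal sketch, assuming a dataset created from several DataFrame-backed files (the queried name is hypothetical); load() returns their concatenation:

>>> dataset = ln.Dataset.filter(name="My dataset").one()
>>> df = dataset.load(join="outer")  # concatenated DataFrame across the underlying files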

mapped(label_keys=None, join_vars='auto', encode_labels=True, parallel=False, stream=False, is_run_input=None)#

Convert to map-style dataset for data loaders.

Note: This currently only works for AnnData objects. The objects should have the same label keys and variables.

Parameters:
  • label_keys (Union[str, List[str], None], default: None) – Columns of the .obs slot - the names of the metadata features storing labels.

  • join_vars (Optional[Literal['auto', 'inner']], default: 'auto') – Do a virtual inner join of variables if set to "auto" and the variables in the underlying AnnData objects differ. Always does the join if set to "inner". If None, does not do the join.

  • encode_labels (bool, default: True) – Indicate whether to encode the labels.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (Optional[bool], default: None) – Whether to track this dataset as run input.

Return type:

MappedDataset

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> dataset = ln.Dataset.filter(description="my dataset").one()
>>> mapped = dataset.mapped(label_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
restore()#

Restore dataset record from trash.

Return type:

None

Examples

For any Dataset object dataset, call:

>>> dataset.restore()
save(*args, **kwargs)#

Save the dataset and underlying file to database & storage.

Return type:

None

Examples

>>> dataset = ln.Dataset("./myfile.csv", name="myfile")
>>> dataset.save()