lamindb.Dataset#
- class lamindb.Dataset(data: Any, name: str, description: Optional[str] = None, reference: Optional[str] = None, reference_type: Optional[str] = None)#
- class lamindb.Dataset(*db_args)
Datasets: mutable collections of data batches.
- Parameters:
data (DataLike) – A data object (DataFrame, AnnData) to store.
name (str) – A name.
description (Optional[str], default: None) – A description.
version (Optional[str], default: None) – A version string.
is_new_version_of (Optional[Dataset], default: None) – An old version of the dataset.
run (Optional[Run], default: None) – The run that creates the dataset.
Notes
See tutorial: Tutorial: Files & datasets.
The File & Dataset registries both
- track data batches of arbitrary format & size
- can validate & link features (the measured dimensions in a data batch)
Typically,
- a file stores a single immutable batch of data
- a dataset stores a mutable collection of data batches
Examples:
- Blob-like immutable files (pdf, txt, csv, jpg, …) or arrays (h5, h5ad, …) → File
- Mutable streamable backends (DuckDB, zarr, TileDB, …) → Dataset wrapping File
- Datasets in BigQuery, Snowflake, Postgres, … → Dataset (not yet implemented)
Hence, while
- files always have a one-to-one correspondence with a storage accessor
- datasets can reference a single file, multiple files, or a dataset in a warehouse like BigQuery or Snowflake
Examples
Create a dataset from a DataFrame:
>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_species_code
0         0.051        0.035         0.014        0.002                  0
1         0.049        0.030         0.014        0.002                  0
2         0.047        0.032         0.013        0.002                  0
3         0.046        0.031         0.015        0.002                  0
4         0.050        0.036         0.014        0.002                  0
>>> dataset = ln.Dataset(df, name="Iris flower dataset batch1")
>>> dataset.save()
Create a dataset from a collection of File objects:
>>> dataset = ln.Dataset([file1, file2], name="My dataset")
>>> dataset.save()
Make a new version of a dataset:
>>> # a non-versioned dataset
>>> dataset = ln.Dataset(df1, description="My dataframe")
>>> dataset.save()
>>> # create new dataset from old dataset and version both
>>> new_dataset = ln.Dataset(df2, is_new_version_of=dataset)
>>> assert new_dataset.initial_version == dataset.initial_version
>>> assert dataset.version == "1"
>>> assert new_dataset.version == "2"
Fields
- id CharField
Universal id, valid across DB instances.
- name CharField
Name or title of dataset (required).
- description TextField
A description.
- version CharField
Version (default None). Use this together with initial_version to label different versions of a dataset. Consider using semantic versioning with Python versioning.
- hash CharField
Hash of dataset content. 86 base64 chars allow storing 64 bytes (512 bits).
- reference CharField
A reference like URL or external ID.
- reference_type CharField
Type of reference, e.g., cellxgene Census dataset_id.
- transform ForeignKey
Transform whose run created the dataset.
- run ForeignKey
Run that created the dataset.
- file ForeignKey
Storage of dataset as one file.
- initial_version ForeignKey
Initial version of the dataset, a Dataset object.
- created_at DateTimeField
Time of creation of record.
- updated_at DateTimeField
Time of last update to record.
- created_by ForeignKey
Creator of record, a User.
- feature_sets ManyToManyField
The feature sets measured in this dataset (see FeatureSet).
- ulabels ManyToManyField
ULabels sampled in the dataset (see Feature).
- input_of ManyToManyField
Runs that use this dataset as an input.
- files ManyToManyField
Storage of dataset as multiple files.
Methods
- backed(is_run_input=None)#
Return a cloud-backed data object.
- Return type:
Union[AnnDataAccessor, BackedAccessor]
Notes
For more info, see tutorial: Query files & datasets.
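A minimal sketch of backed access, assuming a previously saved dataset whose underlying file is AnnData-like and lives in registered storage (the lookup and names below are illustrative, not part of this reference):
>>> dataset = ln.Dataset.filter(name="My dataset").one()  # illustrative lookup
>>> backed = dataset.backed()  # AnnDataAccessor for AnnData-like storage
>>> backed.obs  # metadata is read lazily; array data stays in the backend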
- delete(storage=None)#
Delete file, optionally from storage.
- Parameters:
storage (Optional[bool], default: None) – Indicate whether you want to delete the file in storage.
- Return type:
None
Examples
For any File object file, call:
>>> file.delete()
- classmethod from_anndata(adata, field, name=None, description=None, run=None, modality=None, reference=None, reference_type=None)#
Create from AnnDataLike, validate & link features.
- Return type:
Dataset
Examples
>>> import lnschema_bionty as lb
>>> lb.settings.species = "human"
>>> adata = ln.dev.datasets.anndata_with_obs()
>>> adata.var_names[:2]
Index(['ENSG00000000003', 'ENSG00000000005'], dtype='object')
>>> dataset = ln.Dataset.from_anndata(adata, name="My dataset", field=lb.Gene.ensembl_gene_id)
>>> dataset.save()
- classmethod from_df(df, field=FieldAttr(Feature.name), name=None, description=None, run=None, modality=None, reference=None, reference_type=None)#
Create from DataFrame, validate & link features.
- Return type:
Dataset
Notes
For more info, see tutorial: Tutorial: Files & datasets.
Examples
>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_species_code
0         0.051        0.035         0.014        0.002                  0
1         0.049        0.030         0.014        0.002                  0
2         0.047        0.032         0.013        0.002                  0
3         0.046        0.031         0.015        0.002                  0
4         0.050        0.036         0.014        0.002                  0
>>> dataset = ln.Dataset.from_df(df, description="Iris flower dataset batch1")
- load(is_run_input=None, **kwargs)#
Stage and load to memory.
Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.
- Return type:
Any
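As a sketch, assuming the iris dataset saved in the examples above (the lookup below is illustrative): for a single-file dataset, load returns the stored object; for a multi-file dataset, the batches are concatenated where possible.
>>> dataset = ln.Dataset.filter(name="Iris flower dataset batch1").one()  # illustrative lookup
>>> df = dataset.load()  # in-memory DataFrame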
- save(*args, **kwargs)#
Save the file to database & storage.
- Return type:
None
Examples
>>> file = ln.File("./myfile.csv", key="myfile.csv")
>>> file.save()