lamindb.Dataset#
- class lamindb.Dataset(data: Any, name: str, version: str, description: Optional[str] = None, meta: Optional[Any] = None, reference: Optional[str] = None, reference_type: Optional[str] = None, run: Optional[Run] = None, is_new_version_of: Optional[Dataset] = None)#
Datasets: collections of data batches.
- Parameters:
data (DataLike) – An array (DataFrame, AnnData), a directory, or a list of File objects.
name (str) – A name.
description (Optional[str], default: None) – A description.
version (Optional[str], default: None) – A version string.
is_new_version_of (Optional[Dataset], default: None) – An old version of the dataset.
run (Optional[Run], default: None) – The run that creates the dataset.
meta (Optional[DataLike]) – An array (DataFrame, AnnData) or a File object that defines metadata for a directory of objects.
reference (Optional[str], default: None) – For instance, an external ID or a URL.
reference_type (Optional[str], default: None) – For instance, "url".
See also
Notes
See tutorial: Tutorial: Files & datasets.
The File & Dataset registries both
- track data batches of arbitrary format & size
- can validate & link features (the measured dimensions in a data batch)
Often, a file stores a single batch of data and a dataset stores a collection of data batches, while
- files always have a one-to-one correspondence with a storage accessor
- datasets can reference multiple objects or an array store
Examples:
- store a blob-like file (pdf, txt, csv, jpg, …) as a File
- store a queryable array (parquet, HDF5, h5ad, DuckDB, zarr, TileDB, …) as a Dataset or File, depending on your use case
- store collections of files and arrays with Dataset
- once implemented, datasets in BigQuery, Snowflake, Postgres, … would be stored in Dataset
Examples
Create a dataset from a DataFrame:
>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
>>> dataset = ln.Dataset(df, name="Iris flower dataset batch1")
>>> dataset.save()
Create a dataset from a collection of File objects:
>>> dataset = ln.Dataset([file1, file2], name="My dataset")
>>> dataset.save()
If you don’t have 100k files or more in a directory, create File records via from_dir (e.g., here Tutorial: Files & datasets):
>>> files = ln.File.from_dir("./my_dir")
>>> dataset = ln.Dataset(files, name="My dataset")
>>> dataset.save()
If you have more than 100k files, consider creating a dataset directly from the directory without creating File records (e.g., here RxRx: cell imaging):
>>> dataset = ln.Dataset("s3://my-bucket/my-images/", name="My dataset", meta=df) >>> dataset.save()
Make a new version of a dataset:
>>> # a non-versioned dataset
>>> dataset = ln.Dataset(df1, description="My dataframe")
>>> dataset.save()
>>> # create new dataset from old dataset and version both
>>> new_dataset = ln.Dataset(df2, is_new_version_of=dataset)
>>> assert new_dataset.initial_version == dataset.initial_version
>>> assert dataset.version == "1"
>>> assert new_dataset.version == "2"
Fields
- id AutoField
Internal id, valid only in one DB instance.
- uid CharField
Universal id, valid across DB instances.
- name CharField
Name or title of dataset (required).
- description TextField
A description.
- version CharField
Version (default None).
Use this together with initial_version to label different versions of a dataset.
Consider using semantic versioning with Python versioning.
- hash CharField
Hash of dataset content. 86 base64 characters are enough to store 64 bytes (512 bits).
- reference CharField
A reference like URL or external ID.
- reference_type CharField
Type of reference, e.g., cellxgene Census dataset_id.
- transform ForeignKey
Transform whose run created the dataset.
- run ForeignKey
Run that created the dataset.
- file OneToOneField
Storage of dataset as one file.
- storage ForeignKey
Storage of dataset as mere paths handled by a key value store or file system.
- initial_version ForeignKey
Initial version of the dataset, a Dataset object.
- visibility SmallIntegerField
Visibility of record, 0-default, 1-hidden, 2-trash.
- created_at DateTimeField
Time of creation of record.
- updated_at DateTimeField
Time of last update to record.
- created_by ForeignKey
Creator of record, a User.
- feature_sets ManyToManyField
The feature sets measured in this dataset (see FeatureSet).
- ulabels ManyToManyField
ULabels sampled in the dataset (see Feature).
- input_of ManyToManyField
Runs that use this dataset as an input.
- files ManyToManyField
Storage of dataset as multiple files.
Methods
- backed(is_run_input=None)#
Return a cloud-backed data object.
- Return type:
Union[AnnDataAccessor, BackedAccessor]
Notes
For more info, see tutorial: Query arrays.
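Examples
A minimal sketch (the dataset name is a placeholder and the attribute access is illustrative; see the Query arrays tutorial for the accessor API):
>>> dataset = ln.Dataset.filter(name="My dataset").one()
>>> access = dataset.backed()  # AnnDataAccessor or BackedAccessor; the array stays in the object store
>>> access.obs  # lazily read a metadata slot without loading the full array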
- delete(permanent=None, storage=None)#
Delete dataset.
- Parameters:
permanent (Optional[bool], default: None) – Whether to permanently delete the dataset record (skips trash).
storage (Optional[bool], default: None) – Indicate whether you want to delete the linked file in storage.
- Return type:
None
Examples
For any Dataset object dataset, call:
>>> dataset.delete()
- classmethod from_anndata(adata, field, name=None, description=None, run=None, reference=None, reference_type=None, version=None, is_new_version_of=None, **kwargs)#
Create from AnnDataLike, validate & link features.
- Parameters:
adata (Any) – An AnnData object.
field (Optional[DeferredAttribute]) – The registry field to validate & annotate features.
name (Optional[str], default: None) – A name.
description (Optional[str], default: None) – A description.
version (Optional[str], default: None) – A version string.
is_new_version_of (Optional[File], default: None) – An old version of the dataset.
run (Optional[Run], default: None) – The run that creates the dataset.
- Return type:
Dataset
Examples
>>> import lnschema_bionty as lb
>>> lb.settings.organism = "human"
>>> adata = ln.dev.datasets.anndata_with_obs()
>>> adata.var_names[:2]
Index(['ENSG00000000003', 'ENSG00000000005'], dtype='object')
>>> dataset = ln.Dataset.from_anndata(adata, name="My dataset", field=lb.Gene.ensembl_gene_id)
>>> dataset.save()
- classmethod from_df(df, field=FieldAttr(Feature.name), name=None, description=None, run=None, reference=None, reference_type=None, version=None, is_new_version_of=None, **kwargs)#
Create from DataFrame, validate & link features.
- Parameters:
df (DataFrame) – A DataFrame object.
field (DeferredAttribute, default: FieldAttr(Feature.name)) – The registry field to validate & annotate features.
name (Optional[str], default: None) – A name.
description (Optional[str], default: None) – A description.
version (Optional[str], default: None) – A version string.
is_new_version_of (Optional[File], default: None) – An old version of the dataset.
run (Optional[Run], default: None) – The run that creates the dataset.
- Return type:
Dataset
Notes
For more info, see tutorial: Tutorial: Files & datasets.
Examples
>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
>>> dataset = ln.Dataset.from_df(df, description="Iris flower dataset batch1")
- load(join='outer', is_run_input=None, **kwargs)#
Stage and load to memory.
Returns in-memory representation if possible, e.g., a concatenated DataFrame or AnnData object.
- Return type:
Any
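Examples
A minimal sketch (the dataset name is a placeholder; what load() returns depends on the underlying objects, e.g., a concatenated DataFrame for tabular batches):
>>> dataset = ln.Dataset.filter(name="Iris flower dataset batch1").one()
>>> df = dataset.load()  # concatenates the underlying data batches in memory
>>> df.head()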
- mapped(label_keys=None, join_vars='auto', encode_labels=True, parallel=False, stream=False, is_run_input=None)#
Convert to map-style dataset for data loaders.
Note: This currently only works for AnnData objects. The objects should have the same label keys and variables.
- Parameters:
label_keys (Union[str, List[str], None], default: None) – Columns of the .obs slot, i.e., the names of the metadata features storing labels.
join_vars (Optional[Literal['auto', 'inner']], default: 'auto') – Do a virtual inner join of variables if set to "auto" and the variables in the underlying AnnData objects are different. Always does the join if set to "inner". If None, does not do the join.
encode_labels (bool, default: True) – Whether to encode the labels as integers.
parallel (bool, default: False) – Enable sampling with multiple processes.
stream (bool, default: False) – Whether to stream data from the array backend.
is_run_input (Optional[bool], default: None) – Whether to track this dataset as run input.
- Return type:
Examples
>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> dataset = ln.Dataset.filter(description="my dataset").one()
>>> mapped = dataset.mapped(label_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)
- restore()#
Restore dataset record from trash.
- Return type:
None
Examples
For any Dataset object dataset, call:
>>> dataset.restore()
- save(*args, **kwargs)#
Save the dataset and underlying file to database & storage.
- Return type:
None
Examples
>>> dataset = ln.Dataset("./myfile.csv", name="myfile") >>> dataset.save()