lamindb.File#

class lamindb.File(data: Union[PathLike, DataLike] = None, *, key: Optional[str] = None, name: Optional[str] = None, run: Optional[Run] = None, format: Optional[str] = None, features: List[Features] = None, id: Optional[str] = None, input_of: List[Run] = None)#

Bases: BaseORM

Files: serialized data objects.

Parameters:
  • data: Union[PathLike, DataLike] = None - A file path or an in-memory data object to serialize. Can be a cloud path.

  • key: Optional[str] = None - A storage key: a relative file path within the storage location, e.g., an S3 or GCP bucket.

  • name: Optional[str] = None - A name. Defaults to the file name if data is a file path.

  • run: Optional[Run] = None - The generating run.

  • features: List[Features] = None - A feature set record.

  • id: Optional[str] = None - The id of the file. Auto-generated if not passed.

  • input_of: List[Run] = None - Runs for which the file is an input.

Often, files represent atomic datasets in object storage: jointly measured observations of features (Features). They are generated by running code (Transform); each execution is an instance of Run.

Data objects often have canonical on-disk and in-memory representations. LaminDB makes some configurable default choices (e.g., serialize a DataFrame as a .parquet file).

Some datasets do not have a canonical in-memory representation, for instance, .fastq, .vcf, or files describing QC of datasets.

Note

Examples for storage ⟷ memory correspondence:

  • Table: .csv, .tsv, .parquet, .ipc, .feather ⟷ pd.DataFrame, polars.DataFrame

  • Annotated matrix: .h5ad, .h5mu, .zarr ⟷ anndata.AnnData, mudata.MuData

  • Image: .jpg, .png ⟷ np.ndarray, …

  • Tensor: zarr directory, TileDB store ⟷ zarr loader, TileDB loader

  • Fastq: .fastq ⟷ /

  • VCF: .vcf ⟷ /

  • QC: .html ⟷ /
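The correspondence above can be read as a lookup from file suffix to candidate in-memory types. The following is a minimal sketch of that idea only, not LaminDB's implementation: the names CATEGORY_BY_SUFFIX, IN_MEMORY_BY_CATEGORY, and in_memory_candidates are hypothetical, and the tensor row is left out because it maps stores to loaders rather than suffixes to types.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Optional

# Hypothetical lookup tables that mirror the list above.
CATEGORY_BY_SUFFIX: Dict[str, str] = {
    ".csv": "Table",
    ".tsv": "Table",
    ".parquet": "Table",
    ".ipc": "Table",
    ".feather": "Table",
    ".h5ad": "Annotated matrix",
    ".h5mu": "Annotated matrix",
    ".zarr": "Annotated matrix",
    ".jpg": "Image",
    ".png": "Image",
}

IN_MEMORY_BY_CATEGORY: Dict[str, List[str]] = {
    "Table": ["pd.DataFrame", "polars.DataFrame"],
    "Annotated matrix": ["anndata.AnnData", "mudata.MuData"],
    "Image": ["np.ndarray"],
}


def in_memory_candidates(filename: str) -> Optional[List[str]]:
    """Return candidate in-memory types, or None if there is no canonical one."""
    suffix = PurePosixPath(filename).suffix.lower()
    category = CATEGORY_BY_SUFFIX.get(suffix)
    if category is None:
        # Formats such as .fastq, .vcf, or .html have no in-memory counterpart.
        return None
    return IN_MEMORY_BY_CATEGORY[category]


table_types = in_memory_candidates("counts.parquet")
fastq_types = in_memory_candidates("reads.fastq")
```

Here table_types holds the two DataFrame candidates, while fastq_types is None because .fastq appears in the list with "/" on the memory side.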

Attributes

id: str#
name: Optional[str]#
suffix: Optional[str]#

Suffix to construct the storage key. Defaults to None.

This is a file extension if the file is stored in a file format. It’s None if the storage format doesn’t have a canonical extension.

size: Optional[int]#

Size in bytes.

Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.
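
Because size uses decimal units (powers of 1000, not 1024), a byte count can be made readable with a small helper. This is an illustrative sketch; format_size is a hypothetical name and not part of the File API.

```python
def format_size(n_bytes: int) -> str:
    """Render a byte count with decimal units (1 KB = 1e3 bytes)."""
    value = float(n_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if value < 1000:
            return f"{value:g} {unit}"
        value /= 1000
    # Anything that reaches this point is reported in terabytes.
    return f"{value:g} TB"


readable = format_size(1_500_000)
```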

hash: Optional[str]#

Hash (md5).
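
For reference, an MD5 digest of a file's content can be computed with the standard library as sketched below. The helper name md5_hexdigest is hypothetical, and the hexadecimal encoding is an assumption: this page does not say how LaminDB encodes the value it stores in hash.

```python
import hashlib
import os
import tempfile


def md5_hexdigest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: hash a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name
try:
    digest = md5_hexdigest(tmp_path)
finally:
    os.remove(tmp_path)
```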

key: Optional[str]#

Storage key, the relative path within the storage location.
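
The relationship between a storage root, a key, and a suffix can be illustrated with plain string and path handling. This is a sketch under assumptions: the bucket name and the helpers suffix_of and full_path are hypothetical, and the actual path resolution is done by .path().

```python
from pathlib import PurePosixPath
from typing import Optional


def suffix_of(key: str) -> Optional[str]:
    """Return the file extension of a key, or None if it has none."""
    suffix = PurePosixPath(key).suffix
    return suffix or None


def full_path(storage_root: str, key: str) -> str:
    """Join a storage root and a relative key with exactly one slash."""
    return f"{storage_root.rstrip('/')}/{key.lstrip('/')}"


# Hypothetical bucket and keys, for illustration only.
location = full_path("s3://my-bucket/", "experiments/run1/counts.h5ad")
extension = suffix_of("experiments/run1/counts.h5ad")
no_extension = suffix_of("tensors/store")
```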

run_id: Optional[str]#

Source run id.

transform_id: Optional[str]#

Source transform id.

transform_version: Optional[str]#

Source transform version.

storage_id: str#

Storage root id.

created_at: datetime#
updated_at: Optional[datetime]#
created_by_id: Optional[str]#
created_by: User#

User who created the file [pre-joined].

features: List[Features]#

Feature sets indexing this file.

folders: List[Folder]#

Folders that contain this file.

input_of: List[Run]#

Runs that use this file as input.

run: Optional[Run]#

Run that created the file.

storage: Storage#

Storage location of the file [pre-joined]; see .path() for the full path.

transform: Transform#

Transform whose run created the file [pre-joined].

Methods

backed()#

Return a cloud-backed AnnData object for streaming.

Return type:

AnnDataAccessor

load(is_run_input=None)#

Stage and load to memory.

Returns in-memory representation if possible, e.g., an AnnData object for an h5ad file.

Return type:

TypeVar(DataLike, AnnData, DataFrame)

path()#

Path on storage.

Return type:

Union[Path, UPath]

replace(data, run=None, format=None)#

Replace data object.

Return type:

None

stage(is_run_input=None)#

Update cache from cloud storage if outdated.

Returns a path to a locally cached on-disk object (say, a .jpg file).

Return type:

Path
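
The "update cache if outdated" behavior of stage() can be pictured with local files standing in for cloud storage. This is only a sketch of the general technique: stage_sketch is a hypothetical function, and comparing modification times is an assumption, since this page does not describe how LaminDB detects an outdated cache.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_sketch(source: Path, cache_dir: Path) -> Path:
    """Copy `source` into `cache_dir` unless an up-to-date copy exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / source.name
    # The cache is outdated if it is missing or older than the source.
    outdated = (
        not cached.exists()
        or cached.stat().st_mtime < source.stat().st_mtime
    )
    if outdated:
        shutil.copy2(source, cached)  # copy2 preserves the modification time
    return cached


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "storage" / "image.jpg"
    source.parent.mkdir()
    source.write_bytes(b"version 1")
    cache_dir = Path(tmp) / "cache"

    # First call: nothing is cached yet, so the file is copied.
    first = stage_sketch(source, cache_dir).read_bytes()

    # Simulate a newer upload by rewriting the source and bumping its mtime.
    source.write_bytes(b"version 2")
    stat = source.stat()
    os.utime(source, (stat.st_atime, stat.st_mtime + 10))
    second = stage_sketch(source, cache_dir).read_bytes()
```

In both calls the function returns the path of the cached copy, which matches the documented return type of a local Path.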

stream(subset_obs=None, subset_var=None, is_run_input=None)#

Stream the file into memory. Allows subsetting an AnnData object.

Parameters:
  • subset_obsOptional[LazyDataFrame] = None - A DataFrame query to evaluate on .obs of an underlying AnnData object.

  • subset_varOptional[LazyDataFrame] = None - A DataFrame query to evaluate on .var of an underlying AnnData object.

Return type:

AnnData

Returns:

The streamed AnnData object.

Example:

>>> import lamindb as ln
>>> file = ln.select(ln.File).where(...).one()
>>> obs = file.subsetter()
>>> obs = (
...     obs.cell_type.isin(["dendritic cell", "T cell"])
...     & obs.disease.isin(["Alzheimer's"])
... )
>>> file.stream(subset_obs=obs, is_run_input=True)

subsetter()#

A subsetter to pass to .stream().

Currently, this returns an instance of an unconstrained LazyDataFrame to be evaluated in .stream().

In the future, this will be constrained by the file's metadata: its feature- and sample-level descriptors, such as .obs, .var, .columns, and .rows.

Return type:

Lazy
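
To illustrate what "an unconstrained lazy object evaluated later" means, the sketch below builds a deferred row filter in plain Python. It is a toy model and not LaminDB's LazyDataFrame: the classes LazyTable, Column, and Predicate are hypothetical, and rows are plain dictionaries rather than an AnnData .obs table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class Predicate:
    """A deferred row filter; nothing runs until evaluate() is called."""

    test: Callable[[Row], bool]

    def __and__(self, other: "Predicate") -> "Predicate":
        # Combine two filters without evaluating either of them.
        return Predicate(lambda row: self.test(row) and other.test(row))

    def evaluate(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if self.test(row)]


class Column:
    """A column reference that only knows its name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def isin(self, values: List[str]) -> Predicate:
        allowed = set(values)
        name = self.name
        return Predicate(lambda row: row.get(name) in allowed)


class LazyTable:
    """Unconstrained: any attribute name is accepted as a column."""

    def __getattr__(self, name: str) -> Column:
        if name.startswith("_"):
            raise AttributeError(name)
        return Column(name)


obs = LazyTable()
query = (
    obs.cell_type.isin(["dendritic cell", "T cell"])
    & obs.disease.isin(["Alzheimer's"])
)

rows = [
    {"cell_type": "T cell", "disease": "Alzheimer's"},
    {"cell_type": "B cell", "disease": "Alzheimer's"},
    {"cell_type": "T cell", "disease": "healthy"},
]
selected = query.evaluate(rows)
```

Because LazyTable accepts any column name, a typo in a column would only surface at evaluation time; constraining the subsetter by file metadata, as planned above, would catch such errors earlier.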