lamindb.File#
- class lamindb.File(data: Union[PathLike, DataLike] = None, *, key: Optional[str] = None, name: Optional[str] = None, run: Optional[Run] = None, format: Optional[str] = None, features: List[Features] = None, id: Optional[str] = None, input_of: List[Run] = None)#
Bases:
BaseORM
Files: serialized data objects.
- Parameters:
  - data: Union[PathLike, DataLike] = None – A file path or an in-memory data object to serialize. Can be a cloud path.
  - key: Optional[str] = None – A storage key, i.e., a relative filepath within the storage location, e.g., in an S3 or GCP bucket.
  - name: Optional[str] = None – A name. Defaults to the file name for a file.
  - run: Optional[Run] = None – The generating run.
  - features: List[Features] = None – A feature set record.
  - id: Optional[str] = None – The id of the file. Auto-generated if not passed.
  - input_of: List[Run] = None – Runs for which the file is an input.
Often, files represent atomic datasets in object storage: jointly measured observations of features (Features). They are generated by running code (a Transform) via instances of Run.
Data objects often have canonical on-disk and in-memory representations. LaminDB makes some configurable default choices (e.g., serialize a DataFrame as a .parquet file).
Some datasets do not have a canonical in-memory representation, for instance .fastq, .vcf, or files describing the QC of datasets.
Note
Examples for storage ⟷ memory correspondence:
- Table: .csv, .tsv, .parquet, .ipc, .feather ⟷ pd.DataFrame, polars.DataFrame
- Annotated matrix: .h5ad, .h5mu, .zarr ⟷ anndata.AnnData, mudata.MuData
- Image: .jpg, .png ⟷ np.ndarray, …
- Tensor: zarr directory, TileDB store ⟷ zarr loader, TileDB loader
- Fastq: .fastq ⟷ /
- VCF: .vcf ⟷ /
- QC: .html ⟷ /
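To make the on-disk ⟷ in-memory correspondence concrete, here is a minimal round trip between a serialized table and its in-memory representation, using only the standard library (.csv and a list of records as stand-ins; LaminDB's default for a DataFrame is .parquet):

```python
import csv
import io

# In-memory representation: a list of records (stand-in for a DataFrame).
records = [
    {"cell_type": "T cell", "count": 12},
    {"cell_type": "B cell", "count": 7},
]

# Serialize to the on-disk table format (.csv here).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["cell_type", "count"])
writer.writeheader()
writer.writerows(records)
serialized = buf.getvalue()

# Deserialize back into the in-memory representation.
roundtrip = list(csv.DictReader(io.StringIO(serialized)))
print(roundtrip[0]["cell_type"])  # T cell
```

Note that a plain .csv round trip loses dtypes (counts come back as strings); formats like .parquet preserve them, which is one reason for LaminDB's default choice.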
Attributes
- id: str#
- name: Optional[str]#
- suffix: Optional[str]# Suffix to construct the storage key. Defaults to None.
  This is a file extension if the file is stored in a file format. It's None if the storage format doesn't have a canonical extension.
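As an illustration (a sketch, not LaminDB's implementation), a canonical extension can be derived with pathlib, with extension-less storage (e.g., a zarr directory) mapping to None:

```python
from pathlib import Path
from typing import Optional


def infer_suffix(path: str) -> Optional[str]:
    """Return the file extension, or None if there is none (illustrative helper)."""
    suffix = Path(path).suffix
    return suffix or None


print(infer_suffix("data/matrix.h5ad"))  # .h5ad
print(infer_suffix("data/zarr_store"))   # None
```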
- size: Optional[int]# Size in bytes.
  Examples: 1 KB is 1e3 bytes, 1 MB is 1e6, 1 GB is 1e9, 1 TB is 1e12, etc.
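For example, a small helper (hypothetical, not part of the File API) that formats a byte count in these decimal units:

```python
def human_size(num_bytes: int) -> str:
    """Format a byte count using decimal units (1 KB = 1e3 bytes, as above)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000:
            return f"{num_bytes:g} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:g} PB"


print(human_size(1_500_000))  # 1.5 MB
```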
- hash: Optional[str]# Hash (md5).
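An md5 content hash of this kind can be computed with the standard library (shown for illustration; LaminDB computes the hash for you on ingestion):

```python
import hashlib


def md5_hash(content: bytes) -> str:
    """Return the hex md5 digest of raw file content."""
    return hashlib.md5(content).hexdigest()


print(md5_hash(b"hello world"))  # 5eb63bbbe01eeed093cb22bb8f5acdc3
```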
- key: Optional[str]# Storage key, the relative path within the storage location.
- run_id: Optional[str]# Source run id.
- transform_id: Optional[str]# Source transform id.
- transform_version: Optional[str]# Source transform version.
- storage_id: str# Storage root id.
- created_at: datetime#
- updated_at: Optional[datetime]#
- created_by_id: Optional[str]#
Methods
- backed()#
Return a cloud-backed AnnData object for streaming.
- Return type:
AnnDataAccessor
- load(is_run_input=None)#
Stage and load to memory.
Returns in-memory representation if possible, e.g., an AnnData object for an h5ad file.
- Return type:
  TypeVar(DataLike, AnnData, DataFrame)
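The suffix-based dispatch that load() implies can be sketched as follows (a hypothetical loader registry, not LaminDB's internals; the real library delegates to readers such as anndata.read_h5ad or pd.read_parquet):

```python
from pathlib import Path

# Hypothetical registry mapping storage suffixes to in-memory loaders.
LOADERS = {
    ".h5ad": lambda p: f"AnnData loaded from {p}",      # stand-in for anndata.read_h5ad
    ".parquet": lambda p: f"DataFrame loaded from {p}", # stand-in for pd.read_parquet
}


def load_sketch(path: str):
    """Return an in-memory object if the suffix has a loader, else the staged path itself."""
    loader = LOADERS.get(Path(path).suffix)
    return loader(path) if loader else Path(path)


print(load_sketch("obs.h5ad"))  # AnnData loaded from obs.h5ad
```

Falling back to the path mirrors the docstring: formats without a canonical in-memory representation (e.g., .fastq) cannot be loaded, only staged.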
- replace(data, run=None, format=None)#
Replace data object.
- Return type:
None
- stage(is_run_input=None)#
Update cache from cloud storage if outdated.
Returns a path to a locally cached on-disk object (say, a .jpg file).
- Return type:
Path
- stream(subset_obs=None, subset_var=None, is_run_input=None)#
Stream the file into memory. Allows subsetting an AnnData object.
- Parameters:
  - subset_obs: Optional[LazyDataFrame] = None – A DataFrame query to evaluate on .obs of an underlying AnnData object.
  - subset_var: Optional[LazyDataFrame] = None – A DataFrame query to evaluate on .var of an underlying AnnData object.
- Return type:
AnnData
- Returns:
The streamed AnnData object.
Example:
>>> file = ln.select(ln.File).where(...).one()
>>> obs = file.subsetter()
>>> obs = (
...     obs.cell_type.isin(["dendritic cell", "T cell"])
...     & obs.disease.isin(["Alzheimer's"])
... )
>>> file.stream(subset_obs=obs, is_run_input=True)
- subsetter()#
A subsetter to pass to .stream().
Currently, this returns an instance of an unconstrained LazyDataFrame to be evaluated in .stream().
In the future, this will be constrained by metadata of the file: its feature- and sample-level descriptors, like .obs, .var, .columns, .rows.
- Return type:
  LazyDataFrame
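The idea of a lazy query can be sketched with a tiny stand-in for LazyDataFrame (illustrative only; the class names and mechanics here are hypothetical, not lamindb's implementation): column accesses build predicates, & combines them, and evaluation is deferred until rows are available in .stream().

```python
# Minimal lazy-expression sketch: nothing is evaluated until a row is supplied.
class LazyColumn:
    def __init__(self, name):
        self.name = name

    def isin(self, values):
        # Record the membership test instead of running it.
        return LazyPredicate(lambda row: row[self.name] in set(values))


class LazyPredicate:
    def __init__(self, fn):
        self.fn = fn

    def __and__(self, other):
        # Combine two deferred predicates into one.
        return LazyPredicate(lambda row: self.fn(row) and other.fn(row))

    def __call__(self, row):
        return self.fn(row)


class LazyFrame:
    def __getattr__(self, name):
        return LazyColumn(name)


obs = LazyFrame()
query = obs.cell_type.isin(["T cell"]) & obs.disease.isin(["Alzheimer's"])
print(query({"cell_type": "T cell", "disease": "Alzheimer's"}))  # True
print(query({"cell_type": "B cell", "disease": "Alzheimer's"}))  # False
```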