lamindb.File#
- class lamindb.File(data: Union[PathLike, DataLike], key: Optional[str] = None, description: Optional[str] = None, version: Optional[str] = None, is_new_version_of: Optional[File] = None, run: Optional[Run] = None)#
Files: data batches (blobs & array shards).
- Parameters:
data (Union[PathLike, DataLike]) – A path or data object (DataFrame, AnnData).
key (Optional[str], default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".
description (Optional[str], default: None) – A description.
version (Optional[str], default: None) – A version string.
is_new_version_of (Optional[File], default: None) – An old version of the file.
run (Optional[Run], default: None) – The run that creates the file.
Typical storage formats & their API accessors:
Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table
Annotated matrix: .h5ad, .h5mu, .zrad ⟷ AnnData, MuData
Image: .jpg, .png ⟷ np.ndarray, …
Arrays: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders
Fastq: .fastq ⟷ /
VCF: .vcf ⟷ /
QC: .html ⟷ /
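A minimal sketch of how these defaults play out in practice (the exact field values shown are assumptions):
>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> file = ln.File.from_df(df, description="iris batch1")
>>> file.suffix  # DataFrames serialize to .parquet by default
'.parquet'
>>> file.accessor  # in-memory accessor; exact string is an assumption
'DataFrame'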
You’ll find these values in the suffix & accessor fields. LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
See also
Dataset
Mutable collections of data batches.
from_df()
Create a file object from DataFrame and track features.
from_anndata()
Create a file object from AnnData and track features.
from_dir()
Bulk create file objects from a directory.
Notes
For more info, see the tutorial: Files & datasets.
Examples
Create a file from cloud storage (supports s3:// and gs://):
>>> file = ln.File("s3://lamindb-ci/test-data/test.csv")
>>> file.save()  # only metadata is saved
Create a file from a local temporary filepath using key:
>>> temporary_filepath = ln.dev.datasets.file_jpg_paradisi05()
>>> file = ln.File(temporary_filepath, key="images/paradisi05_image.jpg")
>>> file.save()
Why does the API look this way?
It’s inspired by APIs building on AWS S3.
Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.
In boto3:
# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')
In quilt3:
# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')
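The analogous LaminDB call then reads (a sketch, assuming a default storage location is configured):
>>> # target path within default storage is set via key
>>> ln.File('/tmp/hello.txt', key='hello.txt').save()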
Make a new version of a file:
>>> # a non-versioned file
>>> file = ln.File(df1, description="My dataframe")
>>> file.save()
>>> # create new file from old file and version both
>>> new_file = ln.File(df2, is_new_version_of=file)
>>> assert new_file.initial_version == file.initial_version
>>> assert file.version == "1"
>>> assert new_file.version == "2"
Properties
- path#
File path (Path, UPath).
Examples
File in cloud storage:
>>> ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save() >>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one() >>> file.path S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
File in local storage:
>>> ln.File("./myfile.csv", description="myfile").save() >>> file = ln.File.filter(description="myfile").one() >>> file.path PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')
Fields
- id AutoField
Internal id, valid only in one DB instance.
- uid CharField
A universal random id (20-char base62 ~ UUID), valid across DB instances.
- storage ForeignKey
Storage location (Storage), e.g., an S3 or GCP bucket or a local directory.
- key CharField
Storage key, the relative path within the storage location.
- suffix CharField
Path suffix or empty string if no canonical suffix exists.
This is either a file suffix (".csv", ".h5ad", etc.) or the empty string "".
- accessor CharField
Default backed or memory accessor, e.g., DataFrame, AnnData.
Soon, also: SOMA, MuData, zarr.Group, tiledb.Array, etc.
- description CharField
A description.
- version CharField
Version (default None).
Use this together with initial_version to label different versions of a file.
Consider using semantic versioning with Python versioning.
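For instance, a sketch of labeling versions explicitly (df and df_updated are hypothetical; the version strings are assumptions):
>>> file_v1 = ln.File(df, description="Iris", version="1.0.0")
>>> file_v1.save()
>>> file_v2 = ln.File(df_updated, is_new_version_of=file_v1, version="1.1.0")
>>> file_v2.save()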
- size BigIntegerField
Size in bytes.
Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12, etc.
- hash CharField
Hash or pseudo-hash of file content.
Useful to ascertain integrity and avoid duplication.
- hash_type CharField
Type of hash.
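To inspect both fields on an existing record (a sketch; the example hash type is an assumption):
>>> file = ln.File.filter(description="myfile").one()
>>> file.hash       # content digest, used for integrity checks & deduplication
>>> file.hash_type  # e.g., "md5"; the exact type is an assumption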
- transform ForeignKey
Transform whose run created the file.
- run ForeignKey
Run that created the file.
- initial_version ForeignKey
Initial version of the file, a File object.
- visibility SmallIntegerField
Visibility of file record in queries & searches (0 default, 1 hidden, 2 trash).
- key_is_virtual BooleanField
Indicates whether key is virtual or part of an actual file path.
- created_at DateTimeField
Time of creation of record.
- updated_at DateTimeField
Time of last update to record.
- created_by ForeignKey
Creator of record, a User.
- feature_sets ManyToManyField
The feature sets measured in the file (FeatureSet).
- ulabels ManyToManyField
The ulabels measured in the file (ULabel).
- input_of ManyToManyField
Runs that use this file as an input.
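These relational fields are Django related managers, so the usual query methods apply (a sketch; assumes runs and feature sets have been tracked):
>>> file.input_of.all()      # runs that consumed this file
>>> file.feature_sets.all()  # feature sets measured in this file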
Methods
- backed(is_run_input=None)#
Return a cloud-backed data object.
- Return type:
Union[AnnDataAccessor, BackedAccessor]
Notes
For more info, see the tutorial: Query arrays.
Examples
Read AnnData in backed mode from cloud:
>>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> file.backed()
AnnData object with n_obs × n_vars = 70 × 765 backed at 's3://lamindb-ci/lndb-storage/pbmc68k.h5ad'
- delete(permanent=None, storage=None)#
Put file in trash.
Putting a file into the trash means setting its visibility field to 2.
FAQ: Storage FAQ
- Parameters:
permanent (Optional[bool], default: None) – Permanently delete the file (skips trash).
storage (Optional[bool], default: None) – Indicate whether you want to delete the file in storage.
- Return type:
None
Examples
For any File object file, call:
>>> file.delete()
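To skip the trash and also remove the object in storage, combine the parameters documented above:
>>> file.delete(permanent=True, storage=True)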
- classmethod from_anndata(adata, field, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)#
Create from AnnDataLike, validate & link features.
- Parameters:
adata (Any) – An AnnData object or path to it.
field (Optional[DeferredAttribute]) – The registry field to validate & annotate features.
key (Optional[str], default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".
description (Optional[str], default: None) – A description.
version (Optional[str], default: None) – A version string.
is_new_version_of (Optional[File], default: None) – An old version of the file.
run (Optional[Run], default: None) – The run that creates the file.
- Return type:
File
See also
lamindb.Dataset()
Track datasets.
lamindb.Feature
Track features.
Notes
For more info, see the tutorial: Files & datasets.
Examples
>>> import lnschema_bionty as lb
>>> lb.settings.organism = "human"
>>> adata = ln.dev.datasets.anndata_with_obs()
>>> adata.var_names[:2]
Index(['ENSG00000000003', 'ENSG00000000005'], dtype='object')
>>> file = ln.File.from_anndata(adata,
...     field=lb.Gene.ensembl_gene_id,
...     description="mini anndata with obs")
>>> file.save()
- classmethod from_df(df, field=FieldAttr(Feature.name), key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)#
Create from DataFrame, validate & link features.
- Parameters:
df (DataFrame) – A DataFrame object.
field (DeferredAttribute, default: FieldAttr(Feature.name)) – The registry field to validate & annotate features.
key (Optional[str], default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".
description (Optional[str], default: None) – A description.
version (Optional[str], default: None) – A version string.
is_new_version_of (Optional[File], default: None) – An old version of the file.
run (Optional[Run], default: None) – The run that creates the file.
- Return type:
File
See also
lamindb.Dataset()
Track datasets.
lamindb.Feature
Track features.
Notes
For more info, see the tutorial: Files & datasets.
Examples
>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
>>> file = ln.File.from_df(df, description="Iris flower dataset batch1")
>>> file.save()
- classmethod from_dir(path, key=None, *, run=None)#
Create a list of file objects from a directory.
Note
If you have a high number of files (several 100k) and don’t want to track them individually, consider creating a Dataset via Dataset(path, meta=metadata) for them. See, e.g., RxRx: cell imaging.
- Parameters:
path (TypeVar(PathLike, str, Path, UPath)) – Source path of folder.
key (Optional[str], default: None) – Key for storage destination. If None and directory is in a registered location, an inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.
run (Optional[Run], default: None) – A Run object.
- Return type:
List[File]
Examples
>>> dir_path = ln.dev.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> files = ln.File.from_dir(dir_path)
>>> ln.save(files)
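To control the storage destination explicitly, pass a key prefix (a sketch; the prefix value is an assumption):
>>> files = ln.File.from_dir(dir_path, key="datasets/sample_001")
>>> ln.save(files)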
- load(is_run_input=None, stream=False, **kwargs)#
Stage and load to memory.
Returns in-memory representation if possible, e.g., an AnnData object for an h5ad file.
- Return type:
Any
Examples
Load as a DataFrame:
>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> ln.File.from_df(df, description="iris").save()
>>> file = ln.File.filter(description="iris").one()
>>> file.load().head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
Load as an AnnData:
>>> ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save() >>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one() >>> file.load() AnnData object with n_obs × n_vars = 70 × 765
Fall back to stage() if no in-memory representation is configured:
>>> ln.File(ln.dev.datasets.file_jpg_paradisi05(), description="paradisi05").save()
>>> file = ln.File.filter(description="paradisi05").one()
>>> file.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
- replace(data, run=None, format=None)#
Replace file content.
- Parameters:
- Return type:
None
Examples
Say we made a change to the content of a file (e.g., edited the image paradisi05_laminopathic_nuclei.jpg).
This is how we replace the old file in storage with the new file:
>>> file.replace("paradisi05_laminopathic_nuclei.jpg")
>>> file.save()
Note that this neither changes the storage key nor the filename.
However, it will update the suffix if the file type changes.
- restore()#
Restore file from trash.
- Return type:
None
Examples
For any File object file, call:
>>> file.restore()
- save(*args, **kwargs)#
Save the file to database & storage.
- Return type:
None
Examples
>>> file = ln.File("./myfile.csv", description="myfile") >>> file.save()
- stage(is_run_input=None)#
Update cache from cloud storage if outdated.
Returns a path to a locally cached on-disk object (say, a .jpg file).
- Return type:
Path
Examples
Sync file from cloud and return the local path of the cache:
>>> ln.settings.storage = "s3://lamindb-ci"
>>> ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
>>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> file.stage()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
- classmethod view_tree(level=-1, limit_to_directories=False, length_limit=1000, max_files_per_dir_per_type=7)#
View the tree structure of the keys.
- Parameters:
level (int, default: -1) – Depth of the tree to be displayed. Default is -1, which means all levels.
limit_to_directories (bool, default: False) – If True, only directories will be displayed.
length_limit (int, default: 1000) – Maximum number of nodes to be displayed.
max_files_per_dir_per_type (int, default: 7) – Maximum number of files per directory per type.
- Return type:
None
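Examples
A minimal invocation (output shape depends on the keys in your instance):
>>> ln.File.view_tree(level=2, limit_to_directories=True)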