lamindb.File#

class lamindb.File(data: Union[PathLike, Any], key: Optional[str] = None, description: Optional[str] = None, version: Optional[str] = None, is_new_version_of: Optional[File] = None, run: Optional[Run] = None)#

Bases: Registry, Data, IsTree

Files: data batches (blobs & array shards).

Parameters:
  • data (Union[PathLike, DataLike]) – A path or data object (DataFrame, AnnData).

  • key (Optional[str], default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[File], default: None) – An old version of the file.

  • run (Optional[Run], default: None) – The run that creates the file.

Typical storage formats & their API accessors
  • Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table

  • Annotated matrix: .h5ad, .h5mu, .zrad ⟷ AnnData, MuData

  • Image: .jpg, .png ⟷ np.ndarray, …

  • Arrays: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders

  • Fastq: .fastq ⟷ /

  • VCF: .vcf ⟷ /

  • QC: .html ⟷ /

You’ll find these values in the suffix & accessor fields.

LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
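
For example, a DataFrame passed without a key relies on this default. A minimal sketch, assuming a configured instance; the commented values follow the table above and the fields documented below:

>>> import lamindb as ln
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> file = ln.File(df, description="toy table")  # no key given, default serialization applies
>>> file.save()
>>> file.suffix    # ".parquet" (default for a DataFrame)
>>> file.accessor  # "DataFrame"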

See also

Dataset

Mutable collections of data batches.

from_df()

Create a file object from DataFrame and track features.

from_anndata()

Create a file object from AnnData and track features.

from_dir()

Bulk create file objects from a directory.

Notes

For more info, see Tutorial: Files & datasets.

Examples

Create a file from cloud storage (supports s3:// and gs://):

>>> file = ln.File("s3://lamindb-ci/test-data/test.csv")
>>> file.save()  # only metadata is saved

Create a file from a local temporary filepath using key:

>>> temporary_filepath = ln.dev.datasets.file_jpg_paradisi05()
>>> file = ln.File(temporary_filepath, key="images/paradisi05_image.jpg")
>>> file.save()

Why does the API look this way?

It’s inspired by APIs building on AWS S3.

Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.

In boto3:

# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')

In quilt3:

# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')
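
In LaminDB, default storage plays the role of the bucket, and key is likewise the target path within it. A sketch of the analogous call (the local path is hypothetical):

# analogous pattern: File(filepath, key) with default storage as the "bucket"
import lamindb as ln

file = ln.File('/tmp/hello.txt', key='hello.txt')
file.save()  # registers metadata and places the file under the key in default storage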

Make a new version of a file:

>>> # a non-versioned file
>>> file = ln.File(df1, description="My dataframe")
>>> file.save()
>>> # create new file from old file and version both
>>> new_file = ln.File(df2, is_new_version_of=file)
>>> assert new_file.initial_version == file.initial_version
>>> assert file.version == "1"
>>> assert new_file.version == "2"

Properties

path#

File path (Path, UPath).

Examples

File in cloud storage:

>>> ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
>>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> file.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')

File in local storage:

>>> ln.File("./myfile.csv", description="myfile").save()
>>> file = ln.File.filter(description="myfile").one()
>>> file.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')

Fields

id AutoField

Internal id, valid only in one DB instance.

uid CharField

A universal random id (20-char base62 ~ UUID), valid across DB instances.

storage ForeignKey

Storage location (Storage), e.g., an S3 or GCP bucket or a local directory.

key CharField

Storage key, the relative path within the storage location.

suffix CharField

Path suffix or empty string if no canonical suffix exists.

This is either a file suffix (".csv", ".h5ad", etc.) or the empty string “”.

accessor CharField

Default backed or memory accessor, e.g., DataFrame, AnnData.

Soon, also: SOMA, MuData, zarr.Group, tiledb.Array, etc.

description CharField

A description.

version CharField

Version (default None).

Use this together with initial_version to label different versions of a file.

Consider using semantic versioning with Python versioning.

size BigIntegerField

Size in bytes.

Examples: 1 KB is 1e3 bytes, 1 MB is 1e6, 1 GB is 1e9, 1 TB is 1e12, etc.

hash CharField

Hash or pseudo-hash of file content.

Useful to ascertain integrity and avoid duplication.

hash_type CharField

Type of hash.

transform ForeignKey

Transform whose run created the file.

run ForeignKey

Run that created the file.

initial_version ForeignKey

Initial version of the file, a File object.

visibility SmallIntegerField

Visibility of file record in queries & searches (0 default, 1 hidden, 2 trash).

key_is_virtual BooleanField

Indicates whether key is virtual or part of an actual file path.

created_at DateTimeField

Time of creation of record.

updated_at DateTimeField

Time of last update to record.

created_by ForeignKey

Creator of record, a User.

feature_sets ManyToManyField

The feature sets measured in the file (FeatureSet).

ulabels ManyToManyField

The ulabels measured in the file (ULabel).

input_of ManyToManyField

Runs that use this file as an input.
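
A brief sketch of reading a few of these fields back from a saved file (the values shown in comments are illustrative):

>>> file = ln.File.filter(description="myfile").one()
>>> file.uid                   # 20-char base62 id
>>> file.suffix                # e.g. ".csv"
>>> file.size                  # size in bytes
>>> file.hash, file.hash_type  # content hash and its type
>>> file.created_by            # the User who created the record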

Methods

backed(is_run_input=None)#

Return a cloud-backed data object.

Return type:

Union[AnnDataAccessor, BackedAccessor]

Notes

For more info, see tutorial: Query arrays.

Examples

Read AnnData in backed mode from cloud:

>>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> file.backed()
AnnData object with n_obs × n_vars = 70 × 765 backed at 's3://lamindb-ci/lndb-storage/pbmc68k.h5ad'

delete(permanent=None, storage=None)#

Put file in trash.

Putting a file into the trash means setting its visibility field to 2.

FAQ: Storage FAQ

Parameters:
  • permanent (Optional[bool], default: None) – Permanently delete the file (skips trash).

  • storage (Optional[bool], default: None) – Indicate whether you want to delete the file in storage.

Return type:

None

Examples

For any File object file, call:

>>> file.delete()
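
A sketch of a permanent deletion that also removes the object in storage, combining the parameters documented above:

>>> file.delete(permanent=True, storage=True)  # skips the trash and deletes from storage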

classmethod from_anndata(adata, field, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)#

Create from AnnDataLike, validate & link features.

Parameters:
  • adata (Any) – An AnnData object or path to it.

  • field (Optional[DeferredAttribute]) – The registry field to validate & annotate features.

  • key (Optional[str], default: None) – A relative path within default storage, e.g., “myfolder/myfile.fcs”.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[File], default: None) – An old version of the file.

  • run (Optional[Run], default: None) – The run that creates the file.

Return type:

File

See also

lamindb.Dataset()

Track datasets.

lamindb.Feature

Track features.

Notes

For more info, see Tutorial: Files & datasets.

Examples

>>> import lnschema_bionty as lb
>>> lb.settings.organism = "human"
>>> adata = ln.dev.datasets.anndata_with_obs()
>>> adata.var_names[:2]
Index(['ENSG00000000003', 'ENSG00000000005'], dtype='object')
>>> file = ln.File.from_anndata(adata,
...                             field=lb.Gene.ensembl_gene_id,
...                             description="mini anndata with obs")
>>> file.save()

classmethod from_df(df, field=FieldAttr(Feature.name), key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)#

Create from DataFrame, validate & link features.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • field (DeferredAttribute, default: FieldAttr(Feature.name)) – The registry field to validate & annotate features.

  • key (Optional[str], default: None) – A relative path within default storage, e.g., “myfolder/myfile.fcs”.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[File], default: None) – An old version of the file.

  • run (Optional[Run], default: None) – The run that creates the file.

Return type:

File

See also

lamindb.Dataset()

Track datasets.

lamindb.Feature

Track features.

Notes

For more info, see Tutorial: Files & datasets.

Examples

>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> file = ln.File.from_df(df, description="Iris flower dataset batch1")
>>> file.save()

classmethod from_dir(path, key=None, *, run=None)#

Create a list of file objects from a directory.

Note

If you have a large number of files (several 100k) and don’t want to track them individually, consider creating a Dataset for them via Dataset(path, meta=metadata). See, e.g., RxRx: cell imaging.

Parameters:
  • path (TypeVar(PathLike, str, Path, UPath)) – Source path of folder.

  • key (Optional[str], default: None) – Key for storage destination. If None and directory is in a registered location, an inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.

  • run (Optional[Run], default: None) – A Run object.

Return type:

List[File]

Examples

>>> dir_path = ln.dev.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> files = ln.File.from_dir(dir_path)
>>> ln.save(files)
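
If the directory sits outside of a registered storage location, a destination key can also be given explicitly; a sketch with a hypothetical local folder and key prefix:

>>> files = ln.File.from_dir("/tmp/sample_002", key="raw/sample_002")
>>> ln.save(files)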

load(is_run_input=None, stream=False, **kwargs)#

Stage and load to memory.

Returns an in-memory representation if possible, e.g., an AnnData object for an h5ad file.

Return type:

Any

Examples

Load as a DataFrame:

>>> df = ln.dev.datasets.df_iris_in_meter_batch1()
>>> ln.File.from_df(df, description="iris").save()
>>> file = ln.File.filter(description="iris").one()
>>> file.load().head()
sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0

Load as an AnnData:

>>> ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
>>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> file.load()
AnnData object with n_obs × n_vars = 70 × 765

Fall back to stage() if no in-memory representation is configured:

>>> ln.File(ln.dev.datasets.file_jpg_paradisi05(), description="paradisi05").save()
>>> file = ln.File.filter(description="paradisi05").one()
>>> file.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')

replace(data, run=None, format=None)#

Replace file content.

Parameters:
  • data (Union[TypeVar(PathLike, str, Path, UPath), Any]) – A file path or an in-memory data object (DataFrame, AnnData).

  • run (Optional[Run], default: None) – The run that creates the file; it gets auto-linked if ln.track() was called.

Return type:

None

Examples

Say we made a change to the content of a file (e.g., edited the image paradisi05_laminopathic_nuclei.jpg).

This is how we replace the old file in storage with the new file:

>>> file.replace("paradisi05_laminopathic_nuclei.jpg")
>>> file.save()

Note that this changes neither the storage key nor the filename.

However, it will update the suffix if the file type changes.
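
Content can also be replaced with an in-memory object; a sketch assuming df2 is an updated DataFrame for some tabular file:

>>> file.replace(df2)  # storage key and filename stay the same
>>> file.save()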

restore()#

Restore file from trash.

Return type:

None

Examples

For any File object file, call:

>>> file.restore()

save(*args, **kwargs)#

Save the file to database & storage.

Return type:

None

Examples

>>> file = ln.File("./myfile.csv", description="myfile")
>>> file.save()

stage(is_run_input=None)#

Update cache from cloud storage if outdated.

Returns a path to a locally cached on-disk object (say, a .jpg file).

Return type:

Path

Examples

Sync file from cloud and return the local path of the cache:

>>> ln.settings.storage = "s3://lamindb-ci"
>>> ln.File("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
>>> file = ln.File.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> file.stage()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')

classmethod view_tree(level=-1, limit_to_directories=False, length_limit=1000, max_files_per_dir_per_type=7)#

View the tree structure of the keys.

Parameters:
  • level (int, default: -1) – Depth of the tree to be displayed; -1 means all levels.

  • limit_to_directories (bool, default: False) – If True, only directories are displayed.

  • length_limit (int, default: 1000) – Maximum number of nodes to be displayed.

  • max_files_per_dir_per_type (int, default: 7) – Maximum number of files per directory per type.

Return type:

None
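
Examples

A minimal usage sketch; the printed tree depends on the keys present in your instance:

>>> ln.File.view_tree(level=2)  # display keys up to two directory levels deep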