lamindb.Artifact#

class lamindb.Artifact(data: PathLike, key: Optional[str] = None, description: Optional[str] = None, version: Optional[str] = None, is_new_version_of: Optional[Artifact] = None, run: Optional[Run] = None)#

Bases: Registry, Data, IsTree, IsVersioned

Artifacts: data batches stored as files, folders, or arrays.

Parameters:
  • data (Union[PathLike, DataLike]) – A path or data object (DataFrame, AnnData).

  • key (Optional[str], default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Artifact], default: None) – A previous version of the artifact.

  • run (Optional[Run], default: None) – The run that creates the artifact.

Typical storage formats & their API accessors
  • Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table

  • Annotated matrix: .h5ad, .h5mu, .zarr ⟷ AnnData, MuData

  • Image: .jpg, .png ⟷ np.ndarray, …

  • Arrays: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders

  • Fastq: .fastq ⟷ /

  • VCF: .vcf ⟷ /

  • QC: .html ⟷ /

You’ll find these values in the suffix & accessor fields.

LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
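As an illustration of how a suffix might be inferred from a path, here is a minimal sketch using `pathlib` — a simplified stand-in, not lamindb's internal code (it ignores compound suffixes such as `.anndata.zarr`):

```python
from pathlib import PurePosixPath

def infer_suffix(path: str) -> str:
    """Return the path suffix, or "" if no canonical suffix exists (sketch)."""
    return PurePosixPath(path).suffix

print(infer_suffix("myfolder/myfile.fcs"))  # .fcs
print(infer_suffix("data/matrix.h5ad"))     # .h5ad
print(infer_suffix("README"))               # prints an empty string
```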

See also

Collection

Mutable collections of data batches.

from_df()

Create an artifact object from a DataFrame and track features.

from_anndata()

Create an artifact object from an AnnData and track features.

from_dir()

Bulk create artifact objects from a directory.

Notes

For more info, see tutorial: Tutorial: Artifacts.

Examples

Create an artifact from a cloud storage (supports s3:// and gs://):

>>> artifact = ln.Artifact("s3://lamindb-ci/test-data/test.csv")
>>> artifact.save()  # only metadata is saved

Create an artifact from a local temporary filepath using key:

>>> filepath = ln.core.datasets.file_jpg_paradisi05()
>>> artifact = ln.Artifact(filepath, key="images/paradisi05_image.jpg")
>>> artifact.save()
Why does the API look this way?

It’s inspired by APIs building on AWS S3.

Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.

In boto3:

# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')

In quilt3:

# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')

Make a new version of an artifact:

>>> # a non-versioned artifact
>>> artifact = ln.Artifact(df1, description="My dataframe")
>>> artifact.save()
>>> # version an artifact
>>> new_artifact = ln.Artifact(df2, is_new_version_of=artifact)
>>> assert new_artifact.stem_uid == artifact.stem_uid
>>> assert artifact.version == "1"
>>> assert new_artifact.version == "2"

Properties

path#

Path (Path, UPath).

Examples

File in cloud storage:

>>> ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
>>> artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> artifact.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')

File in local storage:

>>> ln.Artifact("./myfile.csv", description="myfile").save()
>>> artifact = ln.Artifact.filter(description="myfile").one()
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')


Fields

id AutoField

Internal id, valid only in one DB instance.

uid CharField

A universal random id (20-char base62 ~ UUID), valid across DB instances.
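A 20-char base62 string carries roughly as much entropy as a UUID (62²⁰ > 2¹²²). A hypothetical sketch of generating such an id — not the exact lamindb implementation:

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 62 characters

def gen_uid(n: int = 20) -> str:
    """Generate a random base62 identifier of length n (illustrative)."""
    return "".join(secrets.choice(BASE62) for _ in range(n))

uid = gen_uid()
print(uid, len(uid))  # e.g. 'jb7BY5UJoQVGMUOKiLcn' 20
```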

storage ForeignKey

Storage location (Storage), e.g., an S3 or GCP bucket or a local directory.

key CharField

Storage key, the relative path within the storage location.

suffix CharField

Path suffix or empty string if no canonical suffix exists.

This is either a file suffix (".csv", ".h5ad", etc.) or the empty string “”.

accessor CharField

Default backed or memory accessor, e.g., DataFrame, AnnData.

Soon, also: SOMA, MuData, zarr.Group, tiledb.Array, etc.

description CharField

A description.

version CharField

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

size BigIntegerField

Size in bytes.

Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.

hash CharField

Hash or pseudo-hash of artifact content.

Useful to ascertain integrity and avoid duplication.
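To illustrate why a content hash enables integrity checks and deduplication, here is a minimal sketch using an MD5 digest encoded as padding-free base64 — an assumption for illustration, not necessarily lamindb's exact hashing scheme:

```python
import base64
import hashlib

def content_hash(data: bytes) -> str:
    """Hash raw bytes with MD5 and base64-encode the digest (illustrative)."""
    digest = hashlib.md5(data).digest()
    # URL-safe base64 without padding keeps the hash short and path-friendly
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")

h1 = content_hash(b"hello world")
h2 = content_hash(b"hello world")
assert h1 == h2  # identical content -> identical hash, enabling deduplication
```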

hash_type CharField

Type of hash.

n_objects BigIntegerField

Number of objects.

Typically, this denotes the number of files in an artifact.

n_observations BigIntegerField

Number of observations.

Typically, this denotes the first array dimension.

transform ForeignKey

Transform whose run created the artifact.

run ForeignKey

Run that created the artifact.

visibility SmallIntegerField

Visibility of artifact record in queries & searches (1 default, 0 hidden, -1 trash).

key_is_virtual BooleanField

Indicates whether key is virtual or part of an actual file path.

created_at DateTimeField

Time of creation of record.

updated_at DateTimeField

Time of last update to record.

created_by ForeignKey

Creator of record, a User.

feature_sets ManyToManyField

The feature sets measured in the artifact (FeatureSet).

ulabels ManyToManyField

The ulabels measured in the artifact (ULabel).

input_of ManyToManyField

Runs that use this artifact as an input.

Methods

backed(is_run_input=None)#

Return a cloud-backed data object.

Return type:

Union[AnnDataAccessor, BackedAccessor]

Notes

For more info, see tutorial: Query arrays.

Examples

Read AnnData in backed mode from cloud:

>>> artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> artifact.backed()
AnnData object with n_obs × n_vars = 70 × 765 backed at 's3://lamindb-ci/lndb-storage/pbmc68k.h5ad'
delete(permanent=None, storage=None, using_key=None)#

Delete.

A first call to .delete() puts an artifact into the trash (sets visibility to -1).

A second call permanently deletes the artifact.

FAQ: Storage FAQ

Parameters:
  • permanent (Optional[bool], default: None) – Permanently delete the artifact (skip trash).

  • storage (Optional[bool], default: None) – Whether to also delete the artifact from storage.

Return type:

None

Examples

For an Artifact object artifact, call:

>>> artifact.delete()
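The two-step delete behavior above can be sketched as a small state machine — an illustrative model only, not lamindb's implementation (assumes the visibility values 1 default, -1 trash):

```python
class Record:
    """Toy model of trash-then-delete semantics (illustrative)."""

    def __init__(self) -> None:
        self.visibility = 1  # default: visible in queries & searches
        self.exists = True

    def delete(self, permanent: bool = False) -> None:
        if permanent or self.visibility == -1:
            self.exists = False  # second call (or permanent=True): gone
        else:
            self.visibility = -1  # first call: move to trash

    def restore(self) -> None:
        self.visibility = 1

r = Record()
r.delete()  # first call: trashed, still recoverable
assert r.visibility == -1 and r.exists
r.delete()  # second call: permanently deleted
assert not r.exists
```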
classmethod from_anndata(adata, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)#

Create from AnnDataLike, validate & link features.

Parameters:
  • adata (AnnData) – An AnnData object or path to it.

  • key (Optional[str], default: None) – A relative path within default storage, e.g., “myfolder/myfile.fcs”.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Artifact], default: None) – An old version of the artifact.

  • run (Optional[Run], default: None) – The run that creates the artifact.

Return type:

Artifact

See also

lamindb.Collection()

Track collections.

lamindb.Feature

Track features.

Notes

For more info, see tutorial: Tutorial: Artifacts.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata,
...                             description="mini anndata with obs")
>>> artifact.save()


classmethod from_df(df, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)#

Create from DataFrame, validate & link features.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • key (Optional[str], default: None) – A relative path within default storage, e.g., “myfolder/myfile.fcs”.

  • description (Optional[str], default: None) – A description.

  • version (Optional[str], default: None) – A version string.

  • is_new_version_of (Optional[Artifact], default: None) – An old version of the artifact.

  • run (Optional[Run], default: None) – The run that creates the artifact.

Return type:

Artifact

See also

lamindb.Collection()

Track collections.

lamindb.Feature

Track features.

Notes

For more info, see tutorial: Tutorial: Artifacts.

Examples

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()


classmethod from_dir(path, key=None, *, run=None)#

Create a list of artifact objects from a directory.

Note

If you have a high number of files (several 100k) and don’t want to track them individually, consider creating a Collection via Collection(path, meta=metadata) for them. See, e.g., RxRx: cell imaging.

Parameters:
  • path (TypeVar(PathLike, str, Path, UPath)) – Source path of folder.

  • key (Optional[str], default: None) – Key for storage destination. If None and directory is in a registered location, an inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.

  • run (Optional[Run], default: None) – A Run object.

Return type:

List[Artifact]

Examples

>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)


load(is_run_input=None, stream=False, **kwargs)#

Stage and load to memory.

Returns in-memory representation if possible, e.g., an AnnData object for an h5ad file.

Return type:

Any

Examples

Load as a DataFrame:

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> ln.Artifact.from_df(df, description="iris").save()
>>> artifact = ln.Artifact.filter(description="iris").one()
>>> artifact.load().head()
sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0

Load as an AnnData:

>>> ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
>>> artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765

Fall back to stage() if no in-memory representation is configured:

>>> ln.Artifact(ln.core.datasets.file_jpg_paradisi05(), description="paradisi05").save()
>>> artifact = ln.Artifact.filter(description="paradisi05").one()
>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
replace(data, run=None, format=None)#

Replace artifact content.

Parameters:
  • data (Union[TypeVar(PathLike, str, Path, UPath), Any]) – A file path or an in-memory data object (DataFrame, AnnData).

  • run (Optional[Run], default: None) – The run that creates the artifact; auto-linked if ln.track() was called.

Return type:

None

Examples

Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.

This is how we replace the old file in storage with the new file:

>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()

Note that this neither changes the storage key nor the filename.

However, it will update the suffix if it changes.
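In other words, the stored path keeps its key stem while the extension follows the new content. Roughly, as a hypothetical sketch (the function name is illustrative):

```python
from pathlib import PurePosixPath

def replaced_path(old_key: str, new_file: str) -> str:
    """Keep the old storage key's stem, adopt the new file's suffix (sketch)."""
    new_suffix = PurePosixPath(new_file).suffix
    return str(PurePosixPath(old_key).with_suffix(new_suffix))

# The key and filename are preserved; only the suffix changes if needed.
print(replaced_path("images/nuclei.jpg", "edited.png"))  # images/nuclei.png
```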

restore()#

Restore from trash.

Return type:

None

Examples

For any Artifact object artifact, call:

>>> artifact.restore()
save(*args, **kwargs)#

Save to database & storage.

Return type:

None

Examples

>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()
stage(is_run_input=None)#

Update cache from cloud storage if outdated.

Returns a path to a locally cached on-disk object (say, a .jpg file).

Return type:

Path

Examples

Sync file from cloud and return the local path of the cache:

>>> ln.settings.storage = "s3://lamindb-ci"
>>> ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
>>> artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> artifact.stage()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
classmethod view_tree(level=-1, limit_to_directories=False, length_limit=1000, max_files_per_dir_per_type=7)#

View the tree structure of the keys.

Parameters:
  • level (int, default: -1) – Depth of the tree to be displayed; -1 means all levels.

  • limit_to_directories (bool, default: False) – If True, only directories are displayed.

  • length_limit (int, default: 1000) – Maximum number of nodes to be displayed.

  • max_files_per_dir_per_type (int, default: 7) – Maximum number of files per directory per type.

Return type:

None
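As a sketch of what such a tree view does — group keys by their path segments and print nested entries — here is a minimal, self-contained renderer (illustrative, not lamindb's implementation):

```python
def view_tree(keys: list[str]) -> str:
    """Render slash-separated keys as an indented tree (sketch)."""
    tree: dict = {}
    for key in keys:
        node = tree
        for part in key.split("/"):
            node = node.setdefault(part, {})

    lines: list[str] = []

    def render(node: dict, prefix: str = "") -> None:
        entries = sorted(node)
        for i, name in enumerate(entries):
            last = i == len(entries) - 1
            lines.append(prefix + ("└── " if last else "├── ") + name)
            render(node[name], prefix + ("    " if last else "│   "))

    render(tree)
    return "\n".join(lines)

print(view_tree(["sample_001/raw.fastq", "sample_001/qc.html", "readme.md"]))
# ├── readme.md
# └── sample_001
#     ├── qc.html
#     └── raw.fastq
```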
