Introduction#
LaminDB is an open-source Python framework to manage biological data & analyses in generic backends:
Access data & metadata across storage (files, arrays) & database (SQL) backends.
Track data flow across notebooks, pipelines & UI.
Manage registries for experimental metadata & in-house ontologies, import public ontologies.
Validate, standardize & annotate data using registries.
Organize and share data across a mesh of LaminDB instances.
Manage data access with an auditable system of record.
LaminDB features
Access data & metadata across storage (files, arrays) & database (SQL) backends.
Model data using Feature, FeatureSet & ULabel
Plug-in custom schemas & manage schema migrations
Use array formats in memory & storage: DataFrame, AnnData, MuData, SOMA, … backed by parquet, zarr, TileDB, HDF5, h5ad, DuckDB, …
Version files, datasets & transforms
Track data flow across notebooks, pipelines & UI: track(), Transform & Run.
Execution reports & source code for notebooks
Integrate with workflow managers: redun, nextflow, snakemake
Manage registries for experimental metadata & in-house ontologies, import public ontologies.
Use >20 public ontologies with plug-in lnschema_bionty
Safeguards against typos & duplications
Validate, standardize & annotate data using registries: validate & standardize.
Inspect validation failures: inspect
Annotate with untyped or typed labels: add
Save data & metadata with ACID guarantees: save
Organize and share data across a mesh of LaminDB instances.
Create & load instances like git repos: lamin init & lamin load
Zero-copy transfer data across instances
Zero lock-in, scalability, auditability, access management, and more.
Zero lock-in: LaminDB runs on generic backends server-side and is not a client for “Lamin Cloud”
Flexible storage backends (local, S3, GCP, anything fsspec supports)
Currently two SQL backends for managing metadata: SQLite & Postgres
Scalable: metadata tables support 100s of millions of entries
Auditable: data & metadata records are hashed, timestamped, and attributed to users (soon to come: LaminDB Log)
Access management:
High-level access management through Lamin’s collaborator roles
Fine-grained access management via storage & SQL roles (and soon to come: Lamin Vault)
Secure: embedded in your infrastructure (Lamin has no access to your data & metadata)
Tested & typed (up to Django Model fields)
Idempotent & ACID operations
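The "auditable" bullet above can be made concrete with a small sketch. The File records shown later in the quickstart display 22-character hashes with hash_type='md5'; that length is consistent with a padding-stripped, URL-safe base64 encoding of the MD5 digest. Treat the exact encoding below as an illustrative assumption, not LaminDB's documented internals:

```python
# Sketch of content hashing for audit trails. The urlsafe-base64-
# without-padding encoding is an assumption chosen because it yields
# the 22-character hashes seen in File records; it is not a spec.
import base64
import hashlib


def content_hash(data: bytes) -> str:
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    # base64 of 16 bytes -> 24 chars incl. '==' padding; strip the padding
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


print(len(content_hash(b"my RNA-seq batch")))  # -> 22
```

Hashing content rather than trusting file names is what lets LaminDB detect duplicates ("returning existing file with same hash" in the quickstart output below).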
LaminHub is a data collaboration hub built on LaminDB, similar to how GitHub is built on git.
LaminHub features
Public demo instances to explore in the UI or load using the CLI via lamin load owner/instance:
lamin.ai/laminlabs/lamindata - Lamin’s main demo instance
lamin.ai/laminlabs/cellxgene - cellxgene (guide)
lamin.ai/laminlabs/lamin-site-assets - Lamin’s website assets
LaminHub hosts neither data nor metadata; instead, it connects to distributed storage locations & databases through LaminDB.
See validated data artifacts in context of ontologies & experimental metadata.

Query & search.

See scripts, notebooks & pipelines with their inputs & outputs.

Track pipelines, notebooks & UI transforms in one registry.

See parents and children of transforms.

Basic features of LaminHub are free. Enterprise features hosted in your or our infrastructure are available on a paid plan!
Quickstart#
Warning
Public beta: the API is close to stable, but some breaking changes might still occur.
Set up LaminDB#
Install the lamindb Python package:
pip install 'lamindb[jupyter,bionty]'
Sign up for a free account (see more info) and copy the API key.
Log in on the command line (data remains in your infrastructure, with Lamin having no access to it):
lamin login <email> --key <API-key>
You can now initialize LaminDB instances just as you initialize git repositories:
!lamin init --schema bionty --storage ./lamin-intro # or s3://my-bucket, gs://my-bucket as default storage
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
✅ saved: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
💡 loaded instance: testuser1/lamin-intro
💡 did not register local instance on hub
Because we passed --schema bionty, this instance mounted plug-in lnschema_bionty.
Register a file#
Track files using the File registry:
import lamindb as ln
import pandas as pd
# track run context
ln.track()
# access a batch of data
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
    index=["observation1", "observation2", "observation3"],
)
# create a file (versioning is optional)
file = ln.File(df, description="my RNA-seq", version="1")
# register file
file.save()
💡 lamindb instance: testuser1/lamin-intro
💡 notebook imports: lamindb==0.63.4 lnschema_bionty==0.35.3 pandas==1.5.3
💡 saved: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
💡 saved: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
Access a file#
# search a file
ln.File.search("RNAseq")
# filter a file
file = ln.File.filter(description__contains="RNA-seq").first()
# view data flow
file.view_flow()
# describe metadata
file.describe()
# load the file
df = file.load()
File(uid='6yMEbv1eZs1t6YaijPcZ', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', version='1', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:05 UTC)
Provenance:
🗃️ storage: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
📔 transform: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type='notebook', updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
👣 run: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
Filter & search in the UI

Data flow in the UI

Define features & labels#
Define features and labels using Feature and ULabel:
# define features
features = ln.Feature.from_df(df)
ln.save(features)
# define tissue label
tissue = ln.ULabel(name="umbilical blood")
tissue.save()
# define a parent label
is_tissue = ln.ULabel(name="is_tissue")
is_tissue.save()
is_tissue.children.add(tissue)
# view hierarchy
tissue.view_parents()
Validate & annotate data#
# create file & validate features
file = ln.File.from_df(df, description="my RNA-seq")
# register file & link validated features
file.save()
# annotate with a label
file.labels.add(tissue)
# show metadata
file.describe()
❗ returning existing file with same hash: File(uid='6yMEbv1eZs1t6YaijPcZ', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', version='1', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:05 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
File(uid='6yMEbv1eZs1t6YaijPcZ', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', version='1', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:05 UTC)
Provenance:
🗃️ storage: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
📔 transform: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type='notebook', updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
👣 run: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
Features:
columns: FeatureSet(uid='WTgR8NcuFs0q5wJj9iXK', n=3, registry='core.Feature', hash='UnW_pPBmoCTHePSnwWUB', updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
CD8A (number)
CD4 (number)
CD14 (number)
Labels:
🏷️ ulabels (1, core.ULabel): 'umbilical blood'
Data artifacts with context in the UI

Query for annotations#
# a look-up object for all the children of "is_tissue" in ULabel registry
tissues = is_tissue.children.lookup()
# query for exactly one result annotated with umbilical blood
dataset = ln.File.filter(ulabels=tissues.umbilical_blood).one()
# permanently delete the file (without the permanent flag, moves to trash)
file.delete(permanent=True)
Use biological types#
The generic Feature and ULabel will get you pretty far.
But if you use an entity many times, you typically want a dedicated registry, which you can use to type your code & as an interface for public ontologies.
Let’s do this with Gene and Tissue from plug-in lnschema_bionty:
import lnschema_bionty as lb
# create gene records from the public ontology as features
genes = lb.Gene.from_values(df.columns, organism="human")
ln.save(genes)
# query the entire Gene registry content as a DataFrame
lb.Gene.filter().df()
# create file & validate features using the symbol field of Gene
file = ln.File.from_df(
    df, description="my RNA-seq", field=lb.Gene.symbol, organism="human"
)
file.save()
# search the public tissue ontology from the bionty store
lb.Tissue.bionty().search("umbilical blood").head(2)
# define tissue label
tissue = lb.Tissue.from_bionty(name="umbilical cord blood")
tissue.save()
# ontological hierarchy comes by default
tissue.view_parents(distance=2)
# annotate with tissue label
file.labels.add(tissue)
# show metadata
file.describe()
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
File(uid='lDyYMTRQXd2N5vJ3gzZ8', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:10 UTC)
Provenance:
🗃️ storage: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
💫 transform: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
👣 run: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
Features:
columns: FeatureSet(uid='lOYmKviSYNrmjDRT2sIs', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', updated_at=2023-12-08 11:33:10 UTC, created_by_id=1)
'CD8A', 'CD4', 'CD14'
Labels:
🏷️ tissues (1, bionty.Tissue): 'umbilical cord blood'
Query for gene sets & the linked files:
# an object to auto-complete human genes
genes = lb.Gene.filter(organism__name="human").lookup()
# all gene sets measuring CD8A
genesets_with_cd8a = ln.FeatureSet.filter(genes=genes.cd8a).all()
# all files measuring CD8A
ln.File.filter(feature_sets__in=genesets_with_cd8a).df()
| id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | lDyYMTRQXd2N5vJ3gzZ8 | 1 | None | .parquet | DataFrame | my RNA-seq | None | 3506 | I3bJ9fOfamH1pAcJ4kydIg | md5 | 1 | 1 | None | 1 | True | 2023-12-08 11:33:10.423872+00:00 | 1 |
Append a new batch of data#
# assume we now run a pipeline in which we access a new batch of data
transform = ln.Transform(name="RNA-seq file ingestion", type="pipeline", version="1")
ln.track(transform)
# access a new batch of data with a different schema
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
    },
    index=["observation4", "observation5", "observation6"],
)
# because gene `"CD38"` is not yet registered, it doesn't yet validate
file2 = ln.File.from_df(
    df, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)
# let's add it to the `Gene` registry and re-create the file - now everything passes
lb.Gene.from_bionty(symbol="CD38", organism="human").save()
# now we can validate all features
file2 = ln.File.from_df(
    df, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)
file2.save()
💡 saved: Transform(uid='VKrFIeI5Edud27', name='RNA-seq file ingestion', version='1', type='pipeline', updated_at=2023-12-08 11:33:21 UTC, created_by_id=1)
💡 saved: Run(uid='EswO863J8aKbv7pbQSbG', run_at=2023-12-08 11:33:21 UTC, transform_id=2, created_by_id=1)
❗ 1 term (33.30%) is not validated for symbol: CD38
Create a dataset using Dataset by linking both batches in a “sharded dataset”:
dataset = ln.Dataset([file, file2], name="my RNA-seq dataset")
dataset.save()
dataset.describe()
dataset.view_flow()
Dataset(uid='gzlsEnucixYv1r2cEWMR', name='my RNA-seq dataset', hash='Kv5biKHSYT0rrCcw3TKG', visibility=1, updated_at=2023-12-08 11:33:24 UTC)
Provenance:
🧩 transform: Transform(uid='VKrFIeI5Edud27', name='RNA-seq file ingestion', version='1', type='pipeline', updated_at=2023-12-08 11:33:21 UTC, created_by_id=1)
👣 run: Run(uid='EswO863J8aKbv7pbQSbG', run_at=2023-12-08 11:33:21 UTC, transform_id=2, created_by_id=1)
👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
Features:
columns: FeatureSet(uid='HaSQni4Sx75mM1c8wkPC', n=4, type='number', registry='bionty.Gene', hash='OaTsWN-lR7zDUC1bjzk8', updated_at=2023-12-08 11:33:24 UTC, created_by_id=1)
'CD8A', 'CD4', 'CD14', 'CD38'
You can load the entire dataset into memory as if it were one:
dataset.load()
| | CD8A | CD4 | CD14 | CD38 |
|---|---|---|---|---|
| observation1 | 1 | 3 | 5.0 | NaN |
| observation2 | 2 | 4 | 6.0 | NaN |
| observation3 | 3 | 5 | 7.0 | NaN |
| observation4 | 2 | 3 | NaN | 4.0 |
| observation5 | 3 | 4 | NaN | 2.0 |
| observation6 | 3 | 5 | NaN | 3.0 |
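The NaN pattern in the table above is exactly what an outer-join concatenation of the two batches produces in plain pandas. This is an illustrative sketch of how the batches align, not a claim about Dataset.load’s internal implementation:

```python
# Sketch: combining two batches with partially overlapping columns.
# The result mirrors the NaN pattern shown above (illustrative only,
# not Dataset.load's actual implementation).
import pandas as pd

batch1 = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
    index=["observation1", "observation2", "observation3"],
)
batch2 = pd.DataFrame(
    {"CD8A": [2, 3, 3], "CD4": [3, 4, 5], "CD38": [4, 2, 3]},
    index=["observation4", "observation5", "observation6"],
)

# an outer join keeps the union of columns; missing cells become NaN
combined = pd.concat([batch1, batch2], join="outer")
print(combined)
```

Note that CD14 and CD38 become float columns because NaN forces an upcast from int, which is why the table shows 5.0 and 4.0 rather than integers.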
Or iterate over its files:
dataset.files.df()
| id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | lDyYMTRQXd2N5vJ3gzZ8 | 1 | None | .parquet | DataFrame | my RNA-seq | None | 3506 | I3bJ9fOfamH1pAcJ4kydIg | md5 | 1 | 1 | None | 1 | True | 2023-12-08 11:33:10.423872+00:00 | 1 |
| 3 | RmDjsiSmdfzUeIkLvCIi | 1 | None | .parquet | DataFrame | my RNA-seq batch 2 | None | 3499 | xFu1vFetW040mVQppcx6zw | md5 | 2 | 2 | None | 1 | True | 2023-12-08 11:33:24.103953+00:00 | 1 |
More examples#
Understand data flow#
View the sequence of data transformations (Transform) in a project (from here, based on Schmidt et al., 2022):
transform.view_parents()
Or, the generating flow of a file or dataset:
file.view_flow()
Both figures are based on mere calls to ln.track() in notebooks, pipelines & app.
Manage biological registries#
Create a cell type registry from public knowledge and add a new cell state (from here):
import lnschema_bionty as lb
# create an ontology-coupled cell type record and save it
lb.CellType.from_bionty(name="neuron").save()
# create a record to track a new cell state
new_cell_state = lb.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()
# express that it's a neuron state
cell_types = lb.CellType.lookup()
new_cell_state.parents.add(cell_types.neuron)
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
# view ontological hierarchy
new_cell_state.view_parents(distance=2)
Leverage a mesh of instances#
LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can load your instance on the command-line using:
lamin load myhandle/myinstance
If you run lamin save <notebook_path>, you will save the notebook to your default storage location.
You can explore the notebook report corresponding to the quickstart here in LaminHub.
Manage custom schemas#
LaminDB can be customized & extended with schema & app plug-ins building on the Django ecosystem. Examples are:
lnschema_bionty: Registries for basic biological entities, coupled to public ontologies.
lnschema_lamin1: Exemplary custom schema to manage samples, treatments, etc.
If you’d like to create your own schema or app:
Create a git repository with registries similar to lnschema_lamin1
Create & deploy migrations via lamin migrate create and lamin migrate deploy
It’s fastest if we do this for you based on our templates within an enterprise plan.
Design#
Why?#
When we started working on Lamin, we wrote a blog post about the key problems it tries to solve.
Schema & API#

LaminDB provides a SQL schema for common entities: File, Dataset, Transform, Feature, ULabel, etc. - see the API reference or the source code.
The core schema is extendable through plug-ins (see blue vs. red entities in the graphic), e.g., with basic biological (Gene, Protein, CellLine, etc.) & operational entities (Biosample, Techsample, Treatment, etc.).
What is the schema language?
Data models are defined in Python using the Django ORM. Django translates them to SQL tables.
Django is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.
In the first year, LaminDB used SQLModel/SQLAlchemy – we might bring back compatibility.
On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.
The code for this is open-source & accessible through the dependencies & repositories listed below.
Dependencies#
Data is stored in a platform-independent way:
location → local, on AWS S3 or GCP Storage, accessed through
fsspec
format → blob-like files or queryable formats like parquet, zarr, HDF5, TileDB, …
Metadata is stored in SQL: current backends are SQLite (small teams) and Postgres (any team size).
Django ORM for schema management & metadata queries.
Biological knowledge sources & ontologies: see Bionty.
For more details, see the pyproject.toml file in lamindb & the linked repositories below.
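The fsspec dependency is what makes the storage locations above interchangeable: one open call works for local paths and, with the matching driver installed, for cloud URLs. A small sketch (the file path is hypothetical):

```python
# Sketch: fsspec provides one file API across local & cloud backends.
# An "s3://bucket/key" or "gs://bucket/key" URL would work the same
# way, provided the matching driver (s3fs, gcsfs) is installed.
import os
import tempfile

import fsspec

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical local file

with fsspec.open(path, "w") as f:
    f.write("platform-independent access")

with fsspec.open(path, "r") as f:
    print(f.read())
```

Because LaminDB talks to storage through this layer, the same File registration code runs against a local folder, an S3 bucket, or GCP Storage.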
Repositories#
LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets:
lamindb: Core API, which builds on the core schema.
lnschema-bionty: Registries for basic biological entities, coupled to public ontologies.
lnschema-lamin1: Exemplary custom schema to manage samples, treatments, etc.
lamindb-setup: Setup & configure LaminDB, client for LaminHub.
lamin-cli: CLI for lamindb and lamindb-setup.
bionty: Accessor for public biological ontologies.
nbproject: Metadata parser for Jupyter notebooks.
lamin-utils: Generic utilities, e.g., a logger.
readfcs: FCS file reader.
LaminHub is not open-sourced, and neither are plug-ins that model lab operations.
Assumptions & principles#
Data is generated by instruments that process physical samples: it comes in batches stored as immutable files.
Files are transformed into more useful data representations, e.g.:
Summary statistics, e.g., count matrices for fastq files
Arrays of non-array-like input data (e.g., images)
Higher-level embeddings for lower-level array, text or graph representations
Concatenated arrays for large-scale atlas-like datasets
Semantics of high-level embeddings (“inflammatory”, “lipophile”) are anchored in experimental metadata and knowledge (ontologies)
Experimental metadata is another ontology type
Experiments measure features (Feature, CellMarker, …)
Learning and data warehousing both iterate transformations (see graphic, Transform)
Basic biological entities should have the same meaning to anyone and across any data platform
Schema migrations should be easy
Influences#
LaminDB was influenced by many other projects, see Influences.
Notebooks#
Find all tutorial & guide notebooks here and use cases here.
You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, Google Colab, and others.
# clean up test instance
!lamin delete --force lamin-intro
!rm -r lamin-intro
💡 deleting instance testuser1/lamin-intro
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--lamin-intro.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/lamin-intro