Introduction#
LaminDB is an open-source Python framework to manage biological data & analyses in generic backends:

- Unify access to data & metadata across storage (files, arrays) & database (SQL) backends.
- Track data flow across notebooks, pipelines & UI.
- Manage registries for experimental metadata & ontologies.
- Validate & annotate data batches using in-house & public knowledge.
- Organize and share data across a mesh of LaminDB instances.
LaminDB features
Often, siloed object stores, SQL databases & ELN/LIMS systems accumulate data that is inaccessible & hard to integrate, degrading the analytical insights derived from it.
LaminDB's features aim to address the key problems underlying this tendency, taking inspiration from a number of data tools.
For data users
- Unify access to data & metadata across storage (arrays, files) & SQL database backends:
  - Model data schema-less or schema-full, mount custom schema plug-ins & manage schema migrations (schemas)
  - Organize data around learning: `Feature`, `FeatureSet`, `ULabel`, `Modality`
  - Leverage support for common array formats in memory & storage: `DataFrame`, `AnnData`, `MuData`, `pyarrow.Table` backed by parquet, zarr, TileDB, HDF5, h5ad, DuckDB
  - Bridge immutable data artifacts (`File`) and data warehousing (`Dataset`)
- Track data flow across notebooks, pipelines & UI: `track()`, `Transform` & `Run`
- Manage registries for experimental metadata & ontologies in a simple database:
  - Use >20 public ontologies with plug-in `lnschema_bionty`: for instance, `Gene`, `Protein`, `CellMarker`, `ExperimentalFactor`, `CellType`, …
- Validate, standardize & annotate data batches
- Create DB instances within seconds and share data across a mesh of instances (`setup`)
For platform builders
- Zero lock-in: LaminDB runs on generic backends server-side and is not a client for "Lamin Cloud"
- Flexible storage backends (local, S3, GCP, anything fsspec supports)
- Currently two SQL backends for managing metadata: SQLite & Postgres
- Scalable: metadata tables support 100s of millions of entries
- Access management:
  - High-level access management through Lamin's collaborator roles
  - Fine-grained access management via embedded storage & SQL roles
- Secure: embedded in your infrastructure (Lamin has no access to your data & metadata)
- Idempotent & ACID operations
- File, dataset & transform versioning
- Safeguards against typos & duplications when populating registries
- Tested & typed (up to Django Model fields, to come)
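The idempotency guarantee above surfaces later in this guide as "returning existing file with same hash". The mechanism can be illustrated with content hashing: registering the same bytes twice returns the existing record instead of creating a duplicate. A minimal stdlib sketch (the registry and helper names are hypothetical, not the LaminDB API):

```python
import hashlib

# hypothetical in-memory registry keyed by content hash
registry = {}

def register_file(content: bytes, name: str) -> dict:
    """Return the existing record if this content was seen before (idempotent)."""
    digest = hashlib.md5(content).hexdigest()
    if digest in registry:
        return registry[digest]  # no duplicate is created
    record = {"name": name, "hash": digest}
    registry[digest] = record
    return record

first = register_file(b"CD8,CD45\n1,3\n", name="batch1")
second = register_file(b"CD8,CD45\n1,3\n", name="batch1-copy")
assert first is second  # same content -> same record
```

Because the hash is derived from content alone, re-running a notebook that registers the same data leaves the registry unchanged.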
LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.
LaminHub features
Unlike GitHub & most SaaS platforms, LaminHub by default hosts neither data nor metadata, but connects to distributed storage locations & databases through LaminDB.
Public demo instances to explore in the UI or load using the CLI via `lamin load owner/instance` (you need an account):

- lamin.ai/laminlabs/lamindata - a larger instance using Postgres & multiple private storage locations
- lamin.ai/laminlabs/cellxgene-census - query cellxgene-census using LaminDB (guide)
- lamin.ai/laminlabs/lamin-site-assets - explore Lamin's static website assets
See validated files & arrays in context of ontologies & experimental metadata.

Track data flow through pipelines, notebooks & UI.

Upload & register files.

Browse files.
LaminHub is free for public data. Enterprise features, support, integration tests & wetlab plug-ins hosted in your or our infrastructure are available on a paid plan: please reach out!
Quickstart#
Warning
Public beta: Close to having converged on a stable API, but some breaking changes might still occur.
Setup#
Install the `lamindb` Python package:

```shell
pip install 'lamindb[jupyter,bionty]'
```

Log in on the command line:

```shell
lamin login <email> --password <password>
```
You can now init LaminDB instances like you init git repositories, e.g.:

```shell
lamin init --schema bionty --storage ./lamin-intro  # or s3://my-bucket, gs://my-bucket as default storage
```
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:04)
✅ saved: Storage(id='FGbhE5GU', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-09-26 15:21:05, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/lamin-intro
💡 did not register local instance on hub (if you want, call `lamin register`)
Because we passed `--schema bionty`, this instance mounted the plug-in `lnschema_bionty`.
Register a dataset#
```python
import lamindb as ln
import pandas as pd

# track data flow through current notebook
ln.track()

# access a new batch of data
df = pd.DataFrame(
    {"CD8": [1, 2, 3], "CD45": [3, 4, 5], "perturbation": ["DMSO", "IFNG", "DMSO"]}
)

# create a dataset
dataset = ln.Dataset(df, name="Immune phenotyping 1")

# register dataset
dataset.save()
```
💡 loaded instance: testuser1/lamin-intro (lamindb 0.54.2)
💡 notebook imports: lamindb==0.54.2 lnschema_bionty==0.31.2 pandas==2.1.1
💡 Transform(id='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-09-26 15:21:08, created_by_id='DzTjkKse')
💡 Run(id='gjdJnVZYc7ipewoya1rF', run_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', created_by_id='DzTjkKse')
Access a dataset#
```python
# search a dataset
ln.Dataset.search("immune")

# query a dataset
dataset = ln.Dataset.filter(name__contains="phenotyping 1").one()

# view data flow
dataset.view_flow()

# describe metadata
dataset.describe()

# load the dataset
df = dataset.load()
```
Dataset(id='bW2qYmpqAgNW4AbpVY7C', name='Immune phenotyping 1', hash='otKGnMVEtNo6amaaVewBug', updated_at=2023-09-26 15:21:08)
Provenance:
📔 transform: Transform(id='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type='notebook', updated_at=2023-09-26 15:21:08, created_by_id='DzTjkKse')
👣 run: Run(id='gjdJnVZYc7ipewoya1rF', run_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', created_by_id='DzTjkKse')
📄 file: File(id='bW2qYmpqAgNW4AbpVY7C', suffix='.parquet', accessor='DataFrame', description='See dataset bW2qYmpqAgNW4AbpVY7C', size=2913, hash='otKGnMVEtNo6amaaVewBug', hash_type='md5', updated_at=2023-09-26 15:21:08, storage_id='FGbhE5GU', transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:04)
Validate & annotate a dataset#
Validate the column names in a DataFrame schema-less:
```python
# define validation criteria
names_types = [("CD8", "number"), ("CD45", "number"), ("perturbation", "category")]

# save validation criteria as features
features = [ln.Feature(name=name, type=type) for (name, type) in names_types]
ln.save(features)

# create dataset & validate features
dataset = ln.Dataset.from_df(df, name="Immune phenotyping 1")

# register dataset & link validated features
dataset.save()

# access linked features
dataset.features
```
❗ returning existing file with same hash: File(id='bW2qYmpqAgNW4AbpVY7C', suffix='.parquet', accessor='DataFrame', description='See dataset bW2qYmpqAgNW4AbpVY7C', size=2913, hash='otKGnMVEtNo6amaaVewBug', hash_type='md5', updated_at=2023-09-26 15:21:08, storage_id='FGbhE5GU', transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', created_by_id='DzTjkKse')
❗ returning existing dataset with same hash: Dataset(id='bW2qYmpqAgNW4AbpVY7C', name='Immune phenotyping 1', hash='otKGnMVEtNo6amaaVewBug', updated_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', file_id='bW2qYmpqAgNW4AbpVY7C', created_by_id='DzTjkKse')
Features:
columns: FeatureSet(id='HWLCaLwMcvH9m2c1F91o', n=3, registry='core.Feature', hash='2CWCsmFoutK4ueokkVNU', updated_at=2023-09-26 15:21:08, created_by_id='DzTjkKse')
perturbation (category)
CD45 (number)
CD8 (number)
Use the `lnschema_bionty` plug-in to type biological entities and validate column names schema-full:
```python
# requires the 'bionty' schema
import lnschema_bionty as lb

# set a global species for multi-species registries
lb.settings.species = "human"

# create cell marker records from the public ontology
cell_markers = [lb.CellMarker.from_bionty(name=name) for name in ["CD8", "CD45"]]
ln.save(cell_markers)

# create dataset & validate features
dataset = ln.Dataset.from_df(
    df.iloc[:, :2], name="Immune phenotyping 2", field=lb.CellMarker.name
)

# register dataset & link validated features
dataset.save()
dataset.features
```
❗ record with similar name exist! did you mean to load it?

name | id | __ratio__
---|---|---
Immune phenotyping 1 | bW2qYmpqAgNW4AbpVY7C | 95.0
Features:
columns: FeatureSet(id='b9YgiPXnH9EE7Db9n3Ye', n=2, type='number', registry='bionty.CellMarker', hash='VwiLioP9FaLRLPqK7k2c', updated_at=2023-09-26 15:21:11, created_by_id='DzTjkKse')
'CD45', 'CD8'
Query for annotations#
Query for a panel of cell markers & the linked datasets:
```python
# an object to auto-complete cell markers
cell_markers = lb.CellMarker.lookup()

# all cell marker panels containing CD45
panels_with_cd45 = ln.FeatureSet.filter(cell_markers=cell_markers.cd45).all()

# all datasets measuring CD45
ln.Dataset.filter(feature_sets__in=panels_with_cd45).df()
```
id | name | description | version | hash | reference | reference_type | transform_id | run_id | file_id | initial_version_id | updated_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---
46soXLHvTyl2rsGnEHLQ | Immune phenotyping 2 | None | None | D4h0vbNPI6kBVa2Io8766Q | None | None | FPnfDtJz8qbEz8 | gjdJnVZYc7ipewoya1rF | 46soXLHvTyl2rsGnEHLQ | None | 2023-09-26 15:21:11 | DzTjkKse
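The `feature_sets__in` and `name__contains` syntax above follows Django-style double-underscore field lookups. To make the semantics concrete, here is a hypothetical stdlib sketch that resolves such lookup strings against plain dictionaries (not how LaminDB implements it, which delegates to Django):

```python
def filter_records(records, **lookups):
    """Minimal Django-style lookup resolution over a list of dicts."""
    out = []
    for rec in records:
        keep = True
        for lookup, value in lookups.items():
            field, _, op = lookup.partition("__")
            if op == "in":          # field__in -> membership test
                keep = keep and rec[field] in value
            elif op == "contains":  # field__contains -> substring test
                keep = keep and value in rec[field]
            else:                   # no operator -> exact match
                keep = keep and rec[field] == value
        if keep:
            out.append(rec)
    return out

datasets = [
    {"name": "Immune phenotyping 1", "feature_set": "core"},
    {"name": "Immune phenotyping 2", "feature_set": "cell_markers"},
]
assert filter_records(datasets, feature_set__in={"cell_markers"})[0]["name"] == "Immune phenotyping 2"
```

In the real API, these lookups are compiled into SQL WHERE clauses rather than evaluated in Python.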
Annotate with biological labels#
Use the Experimental Factor Ontology to link a validated label for the readout:
```python
# search the public ontology from the bionty store
lb.ExperimentalFactor.bionty().search("facs").head(2)

# create a record for facs
facs = lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0009108")
facs.save()

# label with an in-house assay
immune_assay1 = lb.ExperimentalFactor(name="Immune phenotyping assay 1")
immune_assay1.save()
dataset.experimental_factors.add(facs, immune_assay1)

# create a tissue record from a public ontology
bone_marrow = lb.Tissue.from_bionty(name="bone marrow")
bone_marrow.save()
dataset.tissues.add(bone_marrow)

dataset.describe()
```
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
Dataset(id='46soXLHvTyl2rsGnEHLQ', name='Immune phenotyping 2', hash='D4h0vbNPI6kBVa2Io8766Q', updated_at=2023-09-26 15:21:11)
Provenance:
💫 transform: Transform(id='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-09-26 15:21:11, created_by_id='DzTjkKse')
👣 run: Run(id='gjdJnVZYc7ipewoya1rF', run_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', created_by_id='DzTjkKse')
📄 file: File(id='46soXLHvTyl2rsGnEHLQ', suffix='.parquet', accessor='DataFrame', description='See dataset 46soXLHvTyl2rsGnEHLQ', size=2297, hash='D4h0vbNPI6kBVa2Io8766Q', hash_type='md5', updated_at=2023-09-26 15:21:11, storage_id='FGbhE5GU', transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:04)
Features:
columns: FeatureSet(id='b9YgiPXnH9EE7Db9n3Ye', n=2, type='number', registry='bionty.CellMarker', hash='VwiLioP9FaLRLPqK7k2c', updated_at=2023-09-26 15:21:11, created_by_id='DzTjkKse')
'CD45', 'CD8'
Labels:
🏷️ tissues (1, bionty.Tissue): 'bone marrow'
🏷️ experimental_factors (2, bionty.ExperimentalFactor): 'Immune phenotyping assay 1', 'fluorescence-activated cell sorting'
More examples#
Understand data flow#
View the sequence of data transformations (`Transform`) in a project (from here, based on Schmidt et al., 2022):

```python
transform.view_parents()
```
Or, the generating flow of a file or dataset:

```python
file.view_flow()
```
Both figures are based on mere calls to `ln.track()` in notebooks, pipelines & the app.
Manage biological registries#
Create a cell type registry from public knowledge and add a new cell state (from here):
```python
import lnschema_bionty as lb

# create an ontology-coupled cell type record and save it
lb.CellType.from_bionty(name="neuron").save()

# create a record to track a new cell state
new_cell_state = lb.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
cell_types = lb.CellType.lookup()
new_cell_state.parents.add(cell_types.neuron)
```
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ records with similar names exist! did you mean to load one of them?

name | id | synonyms | __ratio__
---|---|---|---
cell | Ry0JGwSD |  | 90.0
neuron | GCBGsuZM | nerve cell | 90.0
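The similar-name safeguard above scores how close a new name is to existing registry entries. The actual scoring backend is not shown in this guide; a simplified sketch with stdlib `difflib` illustrates the idea (cutoff and scoring differ from whatever LaminDB uses internally):

```python
import difflib

existing = ["cell", "neuron", "astrocyte"]

def similar_names(query: str, names: list, cutoff: float = 0.6) -> list:
    """Return registry names that closely match a proposed new name."""
    return difflib.get_close_matches(query, names, n=3, cutoff=cutoff)

# a near-duplicate triggers a suggestion before a new record is created
assert similar_names("neurons", existing) == ["neuron"]
```

Warning before insert, rather than silent duplication, keeps registries clean as multiple users populate them.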
```python
# view ontological hierarchy
new_cell_state.view_parents(distance=2)
```
Leverage a mesh of instances#
LaminDB is a distributed system like git.
For instance, collaborators can load your instance using:

```shell
lamin load myhandle/myinstance
```
Manage custom schemas#
LaminDB can be customized & extended with schema & app plug-ins building on the Django ecosystem. Examples are:

- lnschema_bionty: Registries for basic biological entities, coupled to public ontologies.
- lnschema_lamin1: Exemplary custom schema to manage samples, treatments, etc.

If you'd like to create your own schema or app:

1. Create a git repository with registries similar to lnschema_lamin1
2. Create & deploy migrations via `lamin migrate create` and `lamin migrate deploy`

It's fastest if we do this for you based on our templates within an enterprise plan.
Design#
LaminDB builds semantics of R&D and biology into well-established infrastructure.
It provides a SQL-schema specification for common entities: `File`, `Dataset`, `Transform`, `Feature`, `ULabel`, etc. - see the API reference or the source code.
What is the schema language?
Data models are defined in Python using the Django ORM; Django translates them to SQL.
Django is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.
In the first year, LaminDB used SQLModel/SQLAlchemy – we might bring back compatibility.
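Django itself is not needed to see what "translates them to SQL" means. The following hypothetical sketch maps a plain Python class to a CREATE TABLE statement, a drastically simplified version of what an ORM's migration layer emits (the `Feature` fields here are illustrative, not the real schema):

```python
from dataclasses import dataclass, fields

# map Python annotations to SQL column types (simplified)
SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL"}

@dataclass
class Feature:
    id: str
    name: str
    type: str

def create_table_sql(model) -> str:
    """Emit a CREATE TABLE statement for a dataclass-style model."""
    cols = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(model))
    return f"CREATE TABLE {model.__name__.lower()} ({cols});"

print(create_table_sql(Feature))
# -> CREATE TABLE feature (id TEXT, name TEXT, type TEXT);
```

A real ORM additionally handles primary keys, foreign keys, indexes, and incremental migrations, which is exactly what `lamin migrate create` and `lamin migrate deploy` wrap.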
On top of the schema, LaminDB provides a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.
The code for this is open-source & accessible through the dependencies & repositories listed below.
Dependencies#
Data is stored in a platform-independent way:
- location → local, on AWS S3 or GCP Storage, accessed through fsspec
- format → blob-like files or queryable formats like Parquet, zarr, HDF5, TileDB & DuckDB
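The location line above means a storage root can be a local path or a cloud URL; which backend handles it is decided by the URL scheme. A simplified, hypothetical sketch of that dispatch (fsspec's real protocol registry is more general):

```python
from urllib.parse import urlparse

def storage_type(root: str) -> str:
    """Classify a storage root by its URL scheme (simplified fsspec-style dispatch)."""
    scheme = urlparse(root).scheme
    if scheme == "s3":
        return "s3"
    if scheme in ("gs", "gcs"):
        return "gcp"
    return "local"  # no scheme -> treat as a local filesystem path

assert storage_type("./lamin-intro") == "local"
assert storage_type("s3://my-bucket") == "s3"
assert storage_type("gs://my-bucket") == "gcp"
```

This is why `lamin init --storage` accepts `./lamin-intro`, `s3://my-bucket`, or `gs://my-bucket` interchangeably.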
Metadata is stored in SQL: current backends are SQLite (small teams) and Postgres (any team size).
Django ORM for schema management & metadata queries (until v0.41: SQLModel & SQLAlchemy).
Biological knowledge sources & ontologies: see Bionty.
For more details, see the pyproject.toml file in lamindb & the linked repositories below.
Repositories#
LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets:
- lamindb: Core API, which builds on the core schema.
- lnschema-bionty: Registries for basic biological entities, coupled to public ontologies.
- lnschema-lamin1: Exemplary custom schema to manage samples, treatments, etc.
- lamindb-setup: Setup & configure LaminDB, client for Lamin Hub.
- bionty: Accessor for public biological ontologies.
- nbproject: Metadata parser for Jupyter notebooks.
- lamin-utils: Generic utilities, e.g., a logger.
- readfcs: FCS file reader.
LaminHub is not open-sourced, and neither are plug-ins that model lab operations.
Assumptions & principles#
- Data is generated by instruments that process physical samples: it comes in batches stored as immutable files.
- Files are transformed into more useful data representations, e.g.:
  - Summary statistics like count matrices for fastq files
  - Array stores of non-array-like input data (e.g., images)
  - Higher-level embeddings for lower-level array, text or graph representations
  - Concatenated array stores for large-scale atlas-like datasets
- Semantics of high-level embeddings ("inflammatory", "lipophile") are anchored in experimental metadata and knowledge (ontologies)
- Experimental metadata is another ontology type
- Experiments measure features (`Feature`, `CellMarker`, …)
- Learning and data warehousing both iterate data transformations (`Transform`)
- Basic biological entities should have the same meaning to anyone and across any data platform
- Schema migrations have to be easy
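The principle that learning and warehousing iterate data transformations can be sketched as a chain of Transform/Run-like records whose inputs and outputs form a lineage. These minimal classes are hypothetical illustrations of the concept, not the LaminDB API:

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    name: str

@dataclass
class Run:
    transform: Transform
    inputs: list = field(default_factory=list)   # upstream datasets
    outputs: list = field(default_factory=list)  # produced datasets

def lineage(dataset, producers):
    """Walk back through the runs that produced a dataset."""
    chain = []
    while dataset in producers:
        run = producers[dataset]
        chain.append(run.transform.name)
        dataset = run.inputs[0] if run.inputs else None
    return chain

fastq_run = Run(Transform("align & count"), inputs=["fastq"], outputs=["counts"])
embed_run = Run(Transform("embed"), inputs=["counts"], outputs=["embedding"])
producers = {"counts": fastq_run, "embedding": embed_run}
assert lineage("embedding", producers) == ["embed", "align & count"]
```

Methods like `view_flow()` and `view_parents()` render exactly this kind of graph, recorded automatically via `ln.track()`.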
Influences#
LaminDB was influenced by many other projects, see Influences.
Notebooks#
Find all guide notebooks here.
You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, Google Colab, and others.
JupyterLab & Jupyter Notebook offer a fully interactive experience; VS Code & other editors require using the CLI to track notebooks:

```shell
lamin track my-notebook.ipynb
```
```shell
lamin delete --force lamin-intro
```
💡 deleting instance testuser1/lamin-intro
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--lamin-intro.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/lamin-intro