Introduction#

LaminDB is an open-source Python framework to manage biological data & analyses in generic backends:

  • Unify access to data & metadata across storage (files, arrays) & database (SQL) backends.

  • Track data flow across notebooks, pipelines & UI.

  • Manage registries for experimental metadata & ontologies.

  • Validate & annotate data batches using in-house & public knowledge.

  • Organize and share data across a mesh of LaminDB instances.

LaminDB features

Siloed object stores, SQL databases & ELN/LIMS systems often accumulate data that is inaccessible & hard to integrate, degrading the analytical insights derived from it.

LaminDB’s features aim to address key problems underlying this tendency, taking inspiration from a number of data tools.

For data users

  • Unify access to data & metadata across storage (arrays, files) & SQL database backends:

    • Query by & search for anything: filter, search (see the sketch after this list)

    • Stage, load or stream files & datasets: stage, load, backed

    • Model data schema-less or schema-full, mount custom schema plug-ins & manage schema migrations (schemas)

    • Organize data around learning: Feature, FeatureSet, ULabel, Modality

    • Leverage support for common array formats in memory & storage: DataFrame, AnnData, MuData, pyarrow.Table backed by parquet, zarr, TileDB, HDF5, h5ad, DuckDB

    • Bridge immutable data artifacts (File) and data warehousing (Dataset)

  • Track data flow across notebooks, pipelines & UI: track(), Transform & Run

  • Manage registries for experimental metadata & ontologies in a simple database

  • Validate, standardize & annotate data batches

  • Create DB instances within seconds and share data across a mesh of instances (setup)
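
For illustration, a minimal sketch of these access patterns (the method names filter, search, stage, load & backed are the ones referenced above; the registry fields & values are made up):

import lamindb as ln

# query by registry fields (Django-style lookups) & search fuzzily
files = ln.File.filter(suffix=".h5ad").df()
hits = ln.File.search("immune")

# stage (download to the local cache), load into memory, or stream
file = ln.File.filter(description__contains="phenotyping").one()
path = file.stage()     # path in the local cache
obj = file.load()       # in-memory object, e.g., a DataFrame or AnnData
backed = file.backed()  # backed/streaming access for large arrays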

For platform builders

  • Zero lock-in: LaminDB runs on generic backends server-side and is not a client for “Lamin Cloud”

    • Flexible storage backends (local, S3, GCP, anything fsspec supports)

    • Currently two SQL backends for managing metadata: SQLite & Postgres

  • Scalable: metadata tables support 100s of millions of entries

  • Access management:

    • High-level access management through Lamin’s collaborator roles

    • Fine-grained access management via embedded storage & SQL roles

  • Secure: embedded in your infrastructure (Lamin has no access to your data & metadata)

  • Idempotent & ACID operations (see the sketch after this list)

  • File, dataset & transform versioning

  • Safeguards against typos & duplications when populating registries

  • Tested & typed (typing of Django Model fields to come)
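
For instance, registering the same content twice is idempotent: LaminDB hashes the content and returns the existing record rather than creating a duplicate, as the warnings in the Quickstart below show. A minimal sketch (the file path is made up):

import lamindb as ln

file = ln.File("./raw/plate1.csv", description="raw measurements")
file.save()

# re-registering identical content returns the existing record
# (LaminDB matches on the content hash)
file2 = ln.File("./raw/plate1.csv", description="raw measurements")
file2.save()
assert file2.id == file.id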

LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.

LaminHub features

Unlike GitHub & most SaaS platforms, LaminHub by default neither hosts data nor metadata, but connects to distributed storage locations & databases through LaminDB.

Public demo instances to explore in the UI, or load via the CLI with lamin load owner/instance (an account is required). In the UI, you can:

  • See validated files & arrays in the context of ontologies & experimental metadata

  • Track data flow through pipelines, notebooks & UI

  • Upload & register files

  • Browse files

LaminHub is free for public data. Enterprise features, support, integration tests & wetlab plug-ins hosted in your or our infrastructure are available on a paid plan: please reach out!

Quickstart#

Warning

Public beta: the API is close to stable, but some breaking changes might still occur.

Setup#

  1. Sign up for a free account (see more info).

  2. Install the lamindb Python package:

    pip install 'lamindb[jupyter,bionty]'
    
  3. Log in on the command line:

    lamin login <email> --password <password>
    

You can now init LaminDB instances like you init git repositories, e.g.:

!lamin init --schema bionty --storage ./lamin-intro  # or s3://my-bucket, gs://my-bucket as default storage
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:04)
✅ saved: Storage(id='FGbhE5GU', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-09-26 15:21:05, created_by_id='DzTjkKse')
💡 loaded instance: testuser1/lamin-intro
💡 did not register local instance on hub (if you want, call `lamin register`)

Because we passed --schema bionty, this instance mounts the plug-in lnschema_bionty.

Register a dataset#

import lamindb as ln
import pandas as pd

# track data flow through current notebook
ln.track()

# access a new batch of data
df = pd.DataFrame(
    {"CD8": [1, 2, 3], "CD45": [3, 4, 5], "perturbation": ["DMSO", "IFNG", "DMSO"]}
)

# create a dataset
dataset = ln.Dataset(df, name="Immune phenotyping 1")
# register dataset
dataset.save()
💡 loaded instance: testuser1/lamin-intro (lamindb 0.54.2)
💡 notebook imports: lamindb==0.54.2 lnschema_bionty==0.31.2 pandas==2.1.1
💡 Transform(id='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-09-26 15:21:08, created_by_id='DzTjkKse')
💡 Run(id='gjdJnVZYc7ipewoya1rF', run_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', created_by_id='DzTjkKse')

Access a dataset#

# search a dataset
ln.Dataset.search("immune")

# query a dataset
dataset = ln.Dataset.filter(name__contains="phenotyping 1").one()

# view data flow
dataset.view_flow()

# describe metadata
dataset.describe()

# load the dataset
df = dataset.load()
[figure: data flow graph for the dataset]
Dataset(id='bW2qYmpqAgNW4AbpVY7C', name='Immune phenotyping 1', hash='otKGnMVEtNo6amaaVewBug', updated_at=2023-09-26 15:21:08)

Provenance:
  📔 transform: Transform(id='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type='notebook', updated_at=2023-09-26 15:21:08, created_by_id='DzTjkKse')
  👣 run: Run(id='gjdJnVZYc7ipewoya1rF', run_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', created_by_id='DzTjkKse')
  📄 file: File(id='bW2qYmpqAgNW4AbpVY7C', suffix='.parquet', accessor='DataFrame', description='See dataset bW2qYmpqAgNW4AbpVY7C', size=2913, hash='otKGnMVEtNo6amaaVewBug', hash_type='md5', updated_at=2023-09-26 15:21:08, storage_id='FGbhE5GU', transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:04)

Validate & annotate a dataset#

Validate the column names in a DataFrame schema-less:

# define validation criteria
names_types = [("CD8", "number"), ("CD45", "number"), ("perturbation", "category")]

# save validation criteria as features
features = [ln.Feature(name=name, type=type) for (name, type) in names_types]
ln.save(features)

# create dataset & validate features
dataset = ln.Dataset.from_df(df, name="Immune phenotyping 1")
# register dataset & link validated features
dataset.save()

# access linked features
dataset.features
❗ returning existing file with same hash: File(id='bW2qYmpqAgNW4AbpVY7C', suffix='.parquet', accessor='DataFrame', description='See dataset bW2qYmpqAgNW4AbpVY7C', size=2913, hash='otKGnMVEtNo6amaaVewBug', hash_type='md5', updated_at=2023-09-26 15:21:08, storage_id='FGbhE5GU', transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', created_by_id='DzTjkKse')
❗ returning existing dataset with same hash: Dataset(id='bW2qYmpqAgNW4AbpVY7C', name='Immune phenotyping 1', hash='otKGnMVEtNo6amaaVewBug', updated_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', file_id='bW2qYmpqAgNW4AbpVY7C', created_by_id='DzTjkKse')
Features:
  columns: FeatureSet(id='HWLCaLwMcvH9m2c1F91o', n=3, registry='core.Feature', hash='2CWCsmFoutK4ueokkVNU', updated_at=2023-09-26 15:21:08, created_by_id='DzTjkKse')
    perturbation (category)
    CD45 (number)
    CD8 (number)

Use the lnschema_bionty plug-in to type biological entities and validate column names schema-full:

# requires the 'bionty' schema
import lnschema_bionty as lb

# set a global species for multi-species registries
lb.settings.species = "human"

# create cell marker records from the public ontology
cell_markers = [lb.CellMarker.from_bionty(name=name) for name in ["CD8", "CD45"]]
ln.save(cell_markers)

# create dataset & validate features
dataset = ln.Dataset.from_df(
    df.iloc[:, :2], name="Immune phenotyping 2", field=lb.CellMarker.name
)
# register dataset & link validated features
dataset.save()

dataset.features
❗ record with similar name exist! did you mean to load it?

  name                  id                    __ratio__
  Immune phenotyping 1  bW2qYmpqAgNW4AbpVY7C  95.0
Features:
  columns: FeatureSet(id='b9YgiPXnH9EE7Db9n3Ye', n=2, type='number', registry='bionty.CellMarker', hash='VwiLioP9FaLRLPqK7k2c', updated_at=2023-09-26 15:21:11, created_by_id='DzTjkKse')
    'CD45', 'CD8'

Query for annotations#

Query for a panel of cell markers & the linked datasets:

# an object to auto-complete cell markers
cell_markers = lb.CellMarker.lookup()

# all cell marker panels containing CD45
panels_with_cd45 = ln.FeatureSet.filter(cell_markers=cell_markers.cd45).all()

# all datasets measuring CD45
ln.Dataset.filter(feature_sets__in=panels_with_cd45).df()
  id                    46soXLHvTyl2rsGnEHLQ
  name                  Immune phenotyping 2
  description           None
  version               None
  hash                  D4h0vbNPI6kBVa2Io8766Q
  reference             None
  reference_type        None
  transform_id          FPnfDtJz8qbEz8
  run_id                gjdJnVZYc7ipewoya1rF
  file_id               46soXLHvTyl2rsGnEHLQ
  initial_version_id    None
  updated_at            2023-09-26 15:21:11
  created_by_id         DzTjkKse

Annotate with biological labels#

Use the Experimental Factor Ontology to link a validated label for the readout:

# search the public ontology from the bionty store
lb.ExperimentalFactor.bionty().search("facs").head(2)

# create a record for facs
facs = lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0009108")
facs.save()

# label with an in-house assay
immune_assay1 = lb.ExperimentalFactor(name="Immune phenotyping assay 1")
immune_assay1.save()

dataset.experimental_factors.add(facs, immune_assay1)

# create a tissue from a public ontology
bone_marrow = lb.Tissue.from_bionty(name="bone marrow")
bone_marrow.save()

dataset.tissues.add(bone_marrow)

dataset.describe()
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
Dataset(id='46soXLHvTyl2rsGnEHLQ', name='Immune phenotyping 2', hash='D4h0vbNPI6kBVa2Io8766Q', updated_at=2023-09-26 15:21:11)

Provenance:
  💫 transform: Transform(id='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-09-26 15:21:11, created_by_id='DzTjkKse')
  👣 run: Run(id='gjdJnVZYc7ipewoya1rF', run_at=2023-09-26 15:21:08, transform_id='FPnfDtJz8qbEz8', created_by_id='DzTjkKse')
  📄 file: File(id='46soXLHvTyl2rsGnEHLQ', suffix='.parquet', accessor='DataFrame', description='See dataset 46soXLHvTyl2rsGnEHLQ', size=2297, hash='D4h0vbNPI6kBVa2Io8766Q', hash_type='md5', updated_at=2023-09-26 15:21:11, storage_id='FGbhE5GU', transform_id='FPnfDtJz8qbEz8', run_id='gjdJnVZYc7ipewoya1rF', created_by_id='DzTjkKse')
  👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-09-26 15:21:04)
Features:
  columns: FeatureSet(id='b9YgiPXnH9EE7Db9n3Ye', n=2, type='number', registry='bionty.CellMarker', hash='VwiLioP9FaLRLPqK7k2c', updated_at=2023-09-26 15:21:11, created_by_id='DzTjkKse')
    'CD45', 'CD8'
Labels:
  🏷️ tissues (1, bionty.Tissue): 'bone marrow'
  🏷️ experimental_factors (2, bionty.ExperimentalFactor): 'Immune phenotyping assay 1', 'fluorescence-activated cell sorting'

More examples#

Understand data flow#

View the sequence of data transformations (Transform) in a project (from here, based on Schmidt et al., 2022):

transform.view_parents()

Or view the data flow that generated a given file or dataset:

file.view_flow()

Both figures are based on mere calls to ln.track() in notebooks, pipelines & the UI.
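
In a notebook, ln.track() infers the Transform automatically; in a pipeline script, you can pass one explicitly. A minimal sketch, assuming track() accepts a Transform record (the pipeline name & version are made up):

import lamindb as ln

# register the pipeline as a Transform & start a tracked run
transform = ln.Transform(name="My pipeline", version="1.0", type="pipeline")
ln.track(transform)

# files & datasets saved from here on are linked to this transform's run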

Manage biological registries#

Create a cell type registry from public knowledge and add a new cell state (from here):

import lnschema_bionty as lb

# create an ontology-coupled cell type record and save it
lb.CellType.from_bionty(name="neuron").save()

# create a record to track a new cell state
new_cell_state = lb.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
cell_types = lb.CellType.lookup()
new_cell_state.parents.add(cell_types.neuron)
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ records with similar names exist! did you mean to load one of them?

  name    id        synonyms    __ratio__
  cell    Ry0JGwSD              90.0
  neuron  GCBGsuZM  nerve cell  90.0
# view ontological hierarchy
new_cell_state.view_parents(distance=2)
[figure: ontological hierarchy around 'my neuron cell state']

Leverage a mesh of instances#

LaminDB is a distributed system like git.

For instance, collaborators can load your instance using:

$ lamin load myhandle/myinstance
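
After loading, the instance is queryable through the same Python API, e.g. (reusing the dataset from the Quickstart above):

import lamindb as ln

# queries now run against myhandle/myinstance
ln.Dataset.filter(name__contains="phenotyping").df()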

Manage custom schemas#

LaminDB can be customized & extended with schema & app plug-ins that build on the Django ecosystem. Examples are:

  • lnschema_bionty: Registries for basic biological entities, coupled to public ontologies.

  • lnschema_lamin1: Exemplary custom schema to manage samples, treatments, etc.

If you’d like to create your own schema or app:

  1. Create a git repository with registries similar to lnschema_lamin1

  2. Create & deploy migrations via lamin migrate create and lamin migrate deploy

It’s fastest if we do this for you based on our templates within an enterprise plan.

Design#

LaminDB builds semantics of R&D and biology into well-established infrastructure.

It provides a SQL-schema specification for common entities: File, Dataset, Transform, Feature, ULabel etc. - see the API reference or the source code.

What is the schema language?

Data models are defined in Python using the Django ORM; Django translates them to SQL.
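
For illustration, a registry in a custom schema plug-in is a Django model. A minimal sketch, assuming a Registry base class as provided by lnschema_core (the entity & fields are made up):

from django.db import models
from lnschema_core.models import Registry  # base class assumption; see lnschema-core

class Treatment(Registry):
    """An in-house registry for treatments."""

    name = models.CharField(max_length=255, unique=True)
    target = models.CharField(max_length=255, null=True, default=None)

lamin migrate create would then generate the corresponding migration, to be rolled out via lamin migrate deploy (see Manage custom schemas above).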

Django is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

In the first year, LaminDB used SQLModel/SQLAlchemy – we might bring back compatibility.

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

The code for this is open-source & accessible through the dependencies & repositories listed below.

Dependencies#

  • Data is stored in a platform-independent way:

    • location → local, on AWS S3 or GCP Storage, accessed through fsspec

    • format → blob-like files or queryable formats like Parquet, zarr, HDF5, TileDB & DuckDB

  • Metadata is stored in SQL: current backends are SQLite (small teams) and Postgres (any team size).

  • Django ORM for schema management & metadata queries (until v0.41: SQLModel & SQLAlchemy).

  • Biological knowledge sources & ontologies: see Bionty.

For more details, see the pyproject.toml file in lamindb & the linked repositories below.

Repositories#

LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets.

LaminHub is not open-sourced, and neither are plug-ins that model lab operations.

Assumptions & principles#

  1. Data is generated by instruments that process physical samples: it comes in batches stored as immutable files.

  2. Files are transformed into more useful data representations, e.g.:

    • Summary statistics like count matrices for fastq files

    • Array stores of non-array-like input data (e.g., images)

    • Higher-level embeddings for lower-level array, text or graph representations

    • Concatenated array stores for large-scale atlas-like datasets

  3. Semantics of high-level embeddings (“inflammatory”, “lipophile”) are anchored in experimental metadata and knowledge (ontologies)

  4. Experimental metadata is another ontology type

  5. Experiments measure features (Feature, CellMarker, …)

  6. Samples are annotated by labels (ULabel, CellLine, …)

  7. Learning and data warehousing both iterate data transformations (Transform)

  8. Basic biological entities should have the same meaning to anyone and across any data platform

  9. Schema migrations have to be easy

Influences#

LaminDB was influenced by many other projects, see Influences.

Notebooks#

  • Find all guide notebooks here.

  • You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, Google Colab, and others.

  • Jupyter Lab & Notebook offer a fully interactive experience; VS Code & others require using the CLI to track notebooks: lamin track my-notebook.ipynb

!lamin delete --force lamin-intro
💡 deleting instance testuser1/lamin-intro
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--lamin-intro.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/lamin-intro