Introduction#

LaminDB is an open-source Python framework to manage biological data & analyses on generic backends:

  • Access data & metadata across storage (files, arrays) & database (SQL) backends.

  • Track data flow across notebooks, pipelines & UI.

  • Manage registries for experimental metadata & in-house ontologies, import public ontologies.

  • Validate, standardize & annotate data using registries.

  • Organize and share data across a mesh of LaminDB instances.

  • Manage data access with an auditable system of record.

LaminDB features

Access data & metadata across storage (files, arrays) & database (SQL) backends.

Track data flow across notebooks, pipelines & UI: track(), Transform & Run.

Manage registries for experimental metadata & in-house ontologies, import public ontologies.

Validate, standardize & annotate data using registries: validate & standardize (see the sketch after this list).

  • Inspect validation failures: inspect

  • Annotate with untyped or typed labels: add

  • Save data & metadata with ACID guarantees: save

Organize and share data across a mesh of LaminDB instances.

  • Create & load instances like git repos: lamin init & lamin load

  • Zero-copy transfer of data across instances

Zero lock-in, scalable, auditable, access management, and more.

  • Zero lock-in: LaminDB runs on generic backends server-side and is not a client for “Lamin Cloud”

    • Flexible storage backends (local, S3, GCP, anything fsspec supports)

    • Currently two SQL backends for managing metadata: SQLite & Postgres

  • Scalable: metadata tables support 100s of millions of entries

  • Auditable: data & metadata records are hashed, timestamped, and attributed to users (soon to come: LaminDB Log)

  • Access management:

    • High-level access management through Lamin’s collaborator roles

    • Fine-grained access management via storage & SQL roles (and soon to come: Lamin Vault)

  • Secure: embedded in your infrastructure (Lamin has no access to your data & metadata)

  • Tested & typed (up to Django Model fields)

  • Idempotent & ACID operations
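As a taste of the validation API, here is a minimal sketch of validate, inspect & standardize, assuming an instance with the bionty schema as in the quickstart below (the symbol values are illustrative):

import lnschema_bionty as lb

# a batch of gene symbols, one of them unknown (illustrative values)
symbols = ["CD8A", "CD4", "FAKE1"]

# returns a boolean mask of values that validate against the Gene registry
lb.Gene.validate(symbols, field=lb.Gene.symbol, organism="human")

# reports which values failed validation, with hints for fixing them
lb.Gene.inspect(symbols, field=lb.Gene.symbol, organism="human")

# maps synonyms to standardized gene symbols
lb.Gene.standardize(symbols, organism="human")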

LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.

LaminHub features

Explore public demo instances in the UI or load them via the CLI: lamin load owner/instance.

LaminHub neither hosts data nor metadata, but connects to distributed storage locations & databases through LaminDB.

  • See validated data artifacts in context of ontologies & experimental metadata.

  • Query & search.

  • See scripts, notebooks & pipelines with their inputs & outputs.

  • Track pipelines, notebooks & UI transforms in one registry.

  • See parents and children of transforms.

Basic features of LaminHub are free. Enterprise features hosted in your or our infrastructure are available on a paid plan!

Quickstart#

Warning

Public beta: the API is close to stable, but some breaking changes might still occur.

Setup LaminDB#

  1. Install the lamindb Python package:

    pip install 'lamindb[jupyter,bionty]'
    
  2. Sign up for a free account (see more info) and copy the API key.

  3. Log in on the command line (data remains in your infrastructure, with Lamin having no access to it):

    lamin login <email> --key <API-key>
    

You can now init LaminDB instances like you init git repositories:

!lamin init --schema bionty --storage ./lamin-intro  # or s3://my-bucket, gs://my-bucket as default storage
✅ saved: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
✅ saved: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
💡 loaded instance: testuser1/lamin-intro
💡 did not register local instance on hub

Because we passed --schema bionty, this instance mounted the lnschema_bionty plug-in.

Register a file#

Track files using the File registry:

import lamindb as ln
import pandas as pd

# track run context
ln.track()

# access a batch of data
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
    index=["observation1", "observation2", "observation3"],
)

# create a file (versioning is optional)
file = ln.File(df, description="my RNA-seq", version="1")

# register file
file.save()
💡 lamindb instance: testuser1/lamin-intro
💡 notebook imports: lamindb==0.63.4 lnschema_bionty==0.35.3 pandas==1.5.3
💡 saved: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
💡 saved: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
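If you later want to register an updated version of this data, you can link it to the existing file. A minimal sketch, assuming the is_new_version_of parameter of this release and a hypothetical updated DataFrame df_v2:

# df_v2 is a hypothetical updated DataFrame
file_v2 = ln.File(df_v2, description="my RNA-seq", is_new_version_of=file)
file_v2.save()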

Access a file#

# search for files by text
ln.File.search("RNAseq")

# filter files by metadata fields
file = ln.File.filter(description__contains="RNA-seq").first()

# view data flow
file.view_flow()

# describe metadata
file.describe()

# load the file
df = file.load()
[image: data flow graph of the file]
File(uid='6yMEbv1eZs1t6YaijPcZ', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', version='1', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:05 UTC)

Provenance:
  🗃️ storage: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
  📔 transform: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type='notebook', updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
  👣 run: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
[screenshot: filter & search in the UI]
[screenshot: data flow in the UI]

Define features & labels#

Define features and labels using Feature and ULabel:

# define features
features = ln.Feature.from_df(df)
ln.save(features)

# define tissue label
tissue = ln.ULabel(name="umbilical blood")
tissue.save()

# define a parent label
is_tissue = ln.ULabel(name="is_tissue")
is_tissue.save()
is_tissue.children.add(tissue)

# view hierarchy
tissue.view_parents()
[image: parent hierarchy of label 'umbilical blood']

Validate & annotate data#

# create file & validate features
file = ln.File.from_df(df, description="my RNA-seq")

# register file & link validated features
file.save()

# annotate with a label
file.labels.add(tissue)

# show metadata
file.describe()
❗ returning existing file with same hash: File(uid='6yMEbv1eZs1t6YaijPcZ', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', version='1', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:05 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
File(uid='6yMEbv1eZs1t6YaijPcZ', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', version='1', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:05 UTC)

Provenance:
  🗃️ storage: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
  📔 transform: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type='notebook', updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
  👣 run: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
Features:
  columns: FeatureSet(uid='WTgR8NcuFs0q5wJj9iXK', n=3, registry='core.Feature', hash='UnW_pPBmoCTHePSnwWUB', updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
    CD8A (number)
    CD4 (number)
    CD14 (number)
Labels:
  🏷️ ulabels (1, core.ULabel): 'umbilical blood'
[screenshot: data artifacts with context in the UI]

Query for annotations#

# a look-up object for all children of "is_tissue" in the ULabel registry
tissues = is_tissue.children.lookup()

# query for exactly one result annotated with umbilical blood
dataset = ln.File.filter(ulabels=tissues.umbilical_blood).one()

# permanently delete the file (without the permanent flag, moves to trash)
file.delete(permanent=True)

Use biological types#

The generic Feature and ULabel registries will get you pretty far.

But if you use an entity many times, you typically want a dedicated registry, which you can use to type your code & as an interface for public ontologies.

Let’s do this with Gene and Tissue from plug-in lnschema_bionty:

import lnschema_bionty as lb

# create gene records from the public ontology as features
genes = lb.Gene.from_values(df.columns, organism="human")
ln.save(genes)

# query the entire Gene registry content as a DataFrame
lb.Gene.filter().df()

# create file & validate features using the symbol field of Gene
file = ln.File.from_df(
    df, description="my RNA-seq", field=lb.Gene.symbol, organism="human"
)
file.save()

# search the public tissue ontology from the bionty store
lb.Tissue.bionty().search("umbilical blood").head(2)

# define tissue label
tissue = lb.Tissue.from_bionty(name="umbilical cord blood")
tissue.save()

# ontological hierarchy comes by default
tissue.view_parents(distance=2)

# annotate with tissue label
file.labels.add(tissue)

# show metadata
file.describe()
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
[image: ontological parents of tissue 'umbilical cord blood']
File(uid='lDyYMTRQXd2N5vJ3gzZ8', suffix='.parquet', accessor='DataFrame', description='my RNA-seq', size=3506, hash='I3bJ9fOfamH1pAcJ4kydIg', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-12-08 11:33:10 UTC)

Provenance:
  🗃️ storage: Storage(uid='aXvyR27m', root='/home/runner/work/lamindb/lamindb/docs/lamin-intro', type='local', updated_at=2023-12-08 11:33:02 UTC, created_by_id=1)
  💫 transform: Transform(uid='FPnfDtJz8qbEz8', name='Introduction', short_name='introduction', version='0', type=notebook, updated_at=2023-12-08 11:33:05 UTC, created_by_id=1)
  👣 run: Run(uid='BKP3qYT8hwel1W7rxXQP', run_at=2023-12-08 11:33:05 UTC, transform_id=1, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
Features:
  columns: FeatureSet(uid='lOYmKviSYNrmjDRT2sIs', n=3, type='number', registry='bionty.Gene', hash='7wsvCyRhtmkNeD2dpTHg', updated_at=2023-12-08 11:33:10 UTC, created_by_id=1)
    'CD8A', 'CD4', 'CD14'
Labels:
  🏷️ tissues (1, bionty.Tissue): 'umbilical cord blood'

Query for gene sets & the linked files:

# an object to auto-complete human genes
genes = lb.Gene.filter(organism__name="human").lookup()

# all gene sets measuring CD8A
genesets_with_cd8a = ln.FeatureSet.filter(genes=genes.cd8a).all()

# all files measuring CD8A
ln.File.filter(feature_sets__in=genesets_with_cd8a).df()
   uid                   storage_id  key   suffix    accessor   description  version  size  hash                    hash_type
id
2  lDyYMTRQXd2N5vJ3gzZ8  1           None  .parquet  DataFrame  my RNA-seq   None     3506  I3bJ9fOfamH1pAcJ4kydIg  md5

   transform_id  run_id  initial_version_id  visibility  key_is_virtual  updated_at                        created_by_id
id
2  1             1       None                1           True            2023-12-08 11:33:10.423872+00:00  1

Append a new batch of data#

# assume we now run a pipeline in which we access a new batch of data
transform = ln.Transform(name="RNA-seq file ingestion", type="pipeline", version="1")
ln.track(transform)

# access a new batch of data with a different schema
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
    },
    index=["observation4", "observation5", "observation6"],
)

# because gene `"CD38"` is not yet registered, it doesn't validate yet
file2 = ln.File.from_df(
    df, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)

# add the missing gene to the `Gene` registry
lb.Gene.from_bionty(symbol="CD38", organism="human").save()

# now we can validate all features
file2 = ln.File.from_df(
    df, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)
file2.save()
💡 saved: Transform(uid='VKrFIeI5Edud27', name='RNA-seq file ingestion', version='1', type='pipeline', updated_at=2023-12-08 11:33:21 UTC, created_by_id=1)
💡 saved: Run(uid='EswO863J8aKbv7pbQSbG', run_at=2023-12-08 11:33:21 UTC, transform_id=2, created_by_id=1)
1 term (33.30%) is not validated for symbol: CD38

Create a Dataset by linking both batches into a “sharded dataset”:

dataset = ln.Dataset([file, file2], name="my RNA-seq dataset")
dataset.save()
dataset.describe()
dataset.view_flow()
Dataset(uid='gzlsEnucixYv1r2cEWMR', name='my RNA-seq dataset', hash='Kv5biKHSYT0rrCcw3TKG', visibility=1, updated_at=2023-12-08 11:33:24 UTC)

Provenance:
  🧩 transform: Transform(uid='VKrFIeI5Edud27', name='RNA-seq file ingestion', version='1', type='pipeline', updated_at=2023-12-08 11:33:21 UTC, created_by_id=1)
  👣 run: Run(uid='EswO863J8aKbv7pbQSbG', run_at=2023-12-08 11:33:21 UTC, transform_id=2, created_by_id=1)
  👤 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2023-12-08 11:33:02 UTC)
Features:
  columns: FeatureSet(uid='HaSQni4Sx75mM1c8wkPC', n=4, type='number', registry='bionty.Gene', hash='OaTsWN-lR7zDUC1bjzk8', updated_at=2023-12-08 11:33:24 UTC, created_by_id=1)
    'CD8A', 'CD4', 'CD14', 'CD38'
[image: data flow graph of the dataset]

You can load the entire dataset into memory as if it were a single file:

dataset.load()
              CD8A  CD4  CD14  CD38
observation1     1    3   5.0   NaN
observation2     2    4   6.0   NaN
observation3     3    5   7.0   NaN
observation4     2    3   NaN   4.0
observation5     3    4   NaN   2.0
observation6     3    5   NaN   3.0

Or iterate over its files:

dataset.files.df()
   uid                   storage_id  key   suffix    accessor   description         version  size  hash                    hash_type
id
2  lDyYMTRQXd2N5vJ3gzZ8  1           None  .parquet  DataFrame  my RNA-seq          None     3506  I3bJ9fOfamH1pAcJ4kydIg  md5
3  RmDjsiSmdfzUeIkLvCIi  1           None  .parquet  DataFrame  my RNA-seq batch 2  None     3499  xFu1vFetW040mVQppcx6zw  md5

   transform_id  run_id  initial_version_id  visibility  key_is_virtual  updated_at                        created_by_id
id
2  1             1       None                1           True            2023-12-08 11:33:10.423872+00:00  1
3  2             2       None                1           True            2023-12-08 11:33:24.103953+00:00  1

More examples#

Understand data flow#

View the sequence of data transformations (Transform) in a project (from here, based on Schmidt et al., 2022):

transform.view_parents()

Or view the flow that generated a given file or dataset:

file.view_flow()

Both figures result from mere calls to ln.track() in notebooks, pipelines & the UI.

Manage biological registries#

Create a cell type registry from public knowledge and add a new cell state (from here):

import lnschema_bionty as lb

# create an ontology-coupled cell type record and save it
lb.CellType.from_bionty(name="neuron").save()

# create a record to track a new cell state
new_cell_state = lb.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
cell_types = lb.CellType.lookup()
new_cell_state.parents.add(cell_types.neuron)
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
# view ontological hierarchy
new_cell_state.view_parents(distance=2)
[image: ontological parents of 'my neuron cell state']

Leverage a mesh of instances#

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can load your instance on the command line using:

lamin load myhandle/myinstance

If you run lamin save <notebook_path>, you will save the notebook to your default storage location.

You can explore the notebook report corresponding to the quickstart here in LaminHub.

Manage custom schemas#

LaminDB can be customized & extended with schema & app plug-ins that build on the Django ecosystem. Examples are:

  • lnschema_bionty: Registries for basic biological entities, coupled to public ontologies.

  • lnschema_lamin1: Exemplary custom schema to manage samples, treatments, etc.

If you’d like to create your own schema or app:

  1. Create a git repository with registries similar to lnschema_lamin1

  2. Create & deploy migrations via lamin migrate create and lamin migrate deploy
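For orientation, registries in such a repository are Django models subclassing Registry from lnschema_core; the class & field names below are a hypothetical sketch, not part of any shipped schema:

from django.db import models
from lnschema_core.models import Registry

class Treatment(Registry):
    """A custom registry for treatments (illustrative)."""

    # short unique identifier, following the uid convention of LaminDB registries
    uid = models.CharField(max_length=12, unique=True, default=None)
    # human-readable name, indexed for fast lookups
    name = models.CharField(max_length=255, db_index=True, default=None)
    # a typed, optional attribute
    dosage_mg = models.FloatField(null=True, default=None)

Migrations for such a model are then created & deployed with the lamin migrate commands from step 2.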

It’s fastest if we do this for you based on our templates within an enterprise plan.

Design#

Why?#

When starting to work on Lamin, we wrote a blog post about the key problems it tries to solve.

Schema & API#

LaminDB provides a SQL schema for common entities: File, Dataset, Transform, Feature, ULabel, etc.; see the API reference or the source code.

The core schema is extendable through plug-ins (see blue vs. red entities in the graphic), e.g., with basic biological entities (Gene, Protein, CellLine, etc.) & operational entities (Biosample, Techsample, Treatment, etc.).

What is the schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables.

Django is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

In its first year, LaminDB used SQLModel/SQLAlchemy; we might bring back compatibility.

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.
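Queries, for example, use Django-style field lookups in which double underscores traverse relations & select comparators; a sketch against the quickstart instance above:

import lamindb as ln

ln.File.filter(
    suffix=".parquet",                # exact match on a field
    description__contains="RNA-seq",  # substring comparator
    created_by__handle="testuser1",   # traverses the created_by relation to User
).df()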

The code for this is open-source & accessible through the dependencies & repositories listed below.

Dependencies#

  • Data is stored in a platform-independent way:

    • location → local, on AWS S3 or GCP Storage, accessed through fsspec

    • format → blob-like files or queryable formats like parquet, zarr, HDF5, TileDB, …

  • Metadata is stored in SQL: current backends are SQLite (small teams) and Postgres (any team size); see the example after this list.

  • Django ORM for schema management & metadata queries.

  • Biological knowledge sources & ontologies: see Bionty.
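For example, a cloud instance backed by Postgres can be initialized by passing a connection string; the bucket name & credentials below are placeholders:

lamin init --storage s3://my-bucket --db postgresql://user:password@hostname:5432/dbname --schema bionty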

For more details, see the pyproject.toml file in lamindb & the linked repositories below.

Repositories#

LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets.

LaminHub is not open-sourced, and neither are plug-ins that model lab operations.

Assumptions & principles#

  1. Data is generated by instruments that process physical samples: it comes in batches stored as immutable files.

  2. Files are transformed into more useful data representations, e.g.:

    • Summary statistics, e.g., count matrices for fastq files

    • Arrays of non-array-like input data (e.g., images)

    • Higher-level embeddings for lower-level array, text or graph representations

    • Concatenated arrays for large-scale atlas-like datasets

  3. Semantics of high-level embeddings (“inflammatory”, “lipophile”) are anchored in experimental metadata and knowledge (ontologies)

  4. Experimental metadata is another ontology type

  5. Experiments measure features (Feature, CellMarker, …)

  6. Samples are annotated by labels (ULabel, CellLine, …)

  7. Learning and data warehousing both iterate transformations (see graphic, Transform)

  8. Basic biological entities should have the same meaning to anyone and across any data platform

  9. Schema migrations should be easy

Influences#

LaminDB was influenced by many other projects, see Influences.

Notebooks#

  • Find all tutorial & guide notebooks here and use cases here.

  • You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, Google Colab, and others.

# clean up test instance
!lamin delete --force lamin-intro
!rm -r lamin-intro
💡 deleting instance testuser1/lamin-intro
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--lamin-intro.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamindb/lamindb/docs/lamin-intro