
Guide#

Welcome to the LaminDB guide! đź‘‹

Curate, store, track, query, integrate, and learn from biological data.

LaminDB is an open-source data lake for R&D in biology.

It gives you components to build on data lineage & biological entities, with an ORM that sits on top of your existing infrastructure: object storage (local directories, S3, GCP) mapped to a SQL query engine (SQLite, Postgres, and soon, BigQuery).

You can readily create distributed LaminDB instances at any scale:

  • Get started on your laptop, deploy in the cloud, or work with a mesh of instances for different teams and purposes (see the cloud-storage example after this list).

  • Share them through a hub akin to HuggingFace & GitHub - see, e.g., lamin.ai/sunnyosun.
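For example, the same lamin init call used in Setup below can point storage at an S3 bucket instead of a local directory (the bucket name here is hypothetical):

lamin init --storage s3://my-bucket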

Warning

Public beta: currently recommended only for collaborators, as we still make breaking changes.

Installation#

LaminDB is a Python package (Python 3.8+).

pip install lamindb

Install biological entities like so:

pip install 'lamindb[bionty]'

Import#

In your Python script, import LaminDB as:

import lamindb as ln

Setup#

Quick setup on the command line:

  • Sign up via lamin signup <email>

  • Log in via lamin login <handle>

  • Set up an instance via lamin init --storage <storage> --schema <schema_modules>

Example
lamin signup testuser1@lamin.ai
lamin login testuser1
lamin init --storage ./mydata --schema bionty,lamin1

Quickstart#

Track files & metadata with sources#

import pandas as pd

# track the current notebook/pipeline as a transform with a run
transform = ln.Transform(name="My pipeline/notebook")
ln.track(transform)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# create a File record from the in-memory object & save it
file = ln.File(df, name="My dataframe")
ln.add(file)

Under the hood, this created three linked records:

Transform(id='OdlFhFWW7qg3', version='0', name='My pipeline/notebook', type=notebook, created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 4, 28, 6, 7, 30))
Run(id='g1xwllJfFZuh24AWKySc', transform_id='OdlFhFWW7qg3', transform_version='0', created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 4, 28, 6, 7, 30))
File(id='DY9JINrVH6sMtqEirMpM', name='My dataframe', suffix='.parquet', size=5629, hash='jUTdERuqlGv_GyqFfIEb2Q', run_id='g1xwllJfFZuh24AWKySc', transform_id='OdlFhFWW7qg3', transform_version='0', storage_id='GLWpJhvg', created_at=datetime.datetime(2023, 4, 28, 6, 7, 32), created_by_id='DzTjkKse')
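Because the records are linked, you can walk from the file back to its run and transform with the same select API — a minimal sketch using the id fields shown above:

# the run that created the file
run = ln.select(ln.Run, id=file.run_id).one()

# the transform (notebook/pipeline) behind that run
transform = ln.select(ln.Transform, id=run.transform_id).one()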

Query & load files#

file = ln.select(ln.File, name="My dataframe").one()
df = file.load()
#>      a	b
#>  0	1	3
#>  1	2	4

Get the files ingested by the latest run:

run = ln.select(ln.Run).order_by(ln.Run.created_at.desc()).first()
files = ln.select(ln.File, run_id=run.id).all()
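A sketch of working with the result, assuming .all() returns a list of File records: each hit can be loaded back into memory just like a single file.

for file in files:
    df = file.load()  # load the stored object, here a DataFrame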

Tip

  • Each page in this guide is a Jupyter Notebook, which you can download here.

  • You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, and others.

  • We recommend using JupyterLab for the best notebook-tracking experience.

📬 Reach out to learn about schema modules that connect your assays & workflows within our data platform enterprise plan.