Guide#
Welcome to the LaminDB guide! 👋
Curate, store, track, query, integrate, and learn from biological data.
LaminDB is an open-source data lake for R&D in biology.
It gives you components to manage data lineage & biological entities through an ORM that builds on your existing infrastructure: object storage (local directories, S3, GCP) mapped to a SQL query engine (SQLite, Postgres, and soon, BigQuery).
You can readily create distributed LaminDB instances at any scale:
Get started on your laptop, deploy in the cloud, or work with a mesh of instances for different teams and purposes.
Share them through a hub akin to HuggingFace & GitHub (see, e.g., lamin.ai/sunnyosun).
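Deploying in the cloud is the same one-line init as on a laptop; a sketch assuming a hypothetical S3 bucket you have credentials for (the CLI is introduced under Setup below):
lamin init --storage s3://my-bucket --schema bionty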
Warning
Public beta: currently recommended only for collaborators, as we still make breaking changes.
Installation#
LaminDB is a Python package (Python 3.8+).
pip install lamindb
Install support for biological entities like so:
pip install 'lamindb[bionty]'
Import#
In your Python script, import LaminDB as:
import lamindb as ln
Setup#
Quick setup on the command line:
Sign up via
lamin signup <email>
Log in via
lamin login <handle>
Set up an instance via
lamin init --storage <storage> --schema <schema_modules>
Example
lamin signup testuser1@lamin.ai
lamin login testuser1
lamin init --storage ./mydata --schema bionty,lamin1
Quickstart#
Track files & metadata with sources#
import pandas as pd

# register the current pipeline/notebook as a transform and start a tracked run
transform = ln.Transform(name="My pipeline/notebook")
ln.track(transform)

# ingest a DataFrame as a file record linked to that run
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
file = ln.File(df, name="My dataframe")
ln.add(file)
Under the hood, this created three linked records:
Transform(id='OdlFhFWW7qg3', version='0', name='My pipeline/notebook', type=notebook, created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 4, 28, 6, 7, 30))
Run(id='g1xwllJfFZuh24AWKySc', transform_id='OdlFhFWW7qg3', transform_version='0', created_by_id='DzTjkKse', created_at=datetime.datetime(2023, 4, 28, 6, 7, 30))
File(id='DY9JINrVH6sMtqEirMpM', name='My dataframe', suffix='.parquet', size=5629, hash='jUTdERuqlGv_GyqFfIEb2Q', run_id='g1xwllJfFZuh24AWKySc', transform_id='OdlFhFWW7qg3', transform_version='0', storage_id='GLWpJhvg', created_at=datetime.datetime(2023, 4, 28, 6, 7, 32), created_by_id='DzTjkKse')
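Because these records are linked by foreign keys, you can walk the lineage back from a file; a minimal sketch using the select API from the next section (field names as in the records above):
# retrieve the run that ingested the file, and the transform behind that run
run = ln.select(ln.Run, id=file.run_id).one()
transform = ln.select(ln.Transform, id=run.transform_id).one()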
Query & load files#
file = ln.select(ln.File, name="My dataframe").one()
df = file.load()
#>    a  b
#> 0  1  3
#> 1  2  4
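Filters work on any registered field; for instance, by suffix (a field visible in the File record above), a sketch:
# all parquet files in the instance
files = ln.select(ln.File, suffix=".parquet").all()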
Get the files ingested by the latest run:
run = ln.select(ln.Run).order_by(ln.Run.created_at.desc()).first()
files = ln.select(ln.File, run_id=run.id).all()
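Any record in that list can be loaded back into memory as before:
# load the first of the returned files into a DataFrame
df = files[0].load()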
Tip
Each page in this guide is a Jupyter Notebook, which you can download here.
You can run these notebooks in hosted versions of JupyterLab, e.g., Saturn Cloud, Google Vertex AI, and others.
We recommend JupyterLab for the best notebook tracking experience.
📬 Reach out to learn about schema modules that connect your assays & workflows as part of our enterprise data platform plan.