Project flow

LaminDB tracks data lineage across an entire project.

Here, we walk through example app uploads, pipelines, and notebooks following Schmidt et al., 2022.

A CRISPR screen reading out a phenotypic endpoint on T cells is paired with scRNA-seq to generate insights into IFN-γ production.

These insights get linked back to the original data through the steps taken in the project to provide context for interpretation & future decision making.

More specifically: Why should I care about data flow?

Data flow tracks data sources & transformations to trace biological insights, verify experimental outcomes, meet regulatory standards, increase the robustness of research and optimize the feedback loop of team-wide learning iterations.

While tracking data flow is easier when it’s governed by deterministic pipelines, it becomes hard when it’s governed by interactive human-driven analyses.

LaminDB interfaces with workflow managers for the former and embraces the latter.

Setup

Init a test instance:

!lamin init --storage ./mydata
💡 connected lamindb: testuser1/mydata

Import lamindb:

import lamindb as ln
from IPython.display import Image, display
💡 connected lamindb: testuser1/mydata

Steps

In the following, we walk through example steps covering different types of transforms (Transform).

Note

The full notebooks are in this repository.

App upload of phenotypic data

Register data through app upload from wetlab by testuser1:

# This function mimics the upload of artifacts via the UI
# In reality, you simply drag and drop files into the UI
def mock_upload_crispra_result_app():
    ln.setup.login("testuser1")
    transform = ln.Transform(name="Upload GWS CRISPRa result", type="upload")
    ln.track(transform=transform)
    output_path = ln.core.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
    output_file = ln.Artifact(
        output_path, description="Raw data of schmidt22 crispra GWS"
    )
    output_file.save()

mock_upload_crispra_result_app()
💡 saved: Transform(uid='UUmq5ofDaStL', name='Upload GWS CRISPRa result', type='upload', updated_at=2024-05-14 16:04:45 UTC, created_by_id=1)
💡 saved: Run(uid='Zx3yRoySOYPnQxQVTU2L', transform_id=1, created_by_id=1)

Hit identification in notebook

Access, transform & register data in drylab by testuser2 in notebook hit-identification.

# The following mimics running the hit-identification notebook
# In reality, you would execute the code inside the notebook
import nbproject_test
from pathlib import Path

cwd = Path.cwd()
nbproject_test.execute_notebooks(cwd / "project-flow-scripts/hit-identification.ipynb", write=True)
Executing notebooks in /home/runner/work/lamin-usecases/lamin-usecases/docs/project-flow-scripts/hit-identification.ipynb
Scheduled: ['hit-identification']
hit-identification 
✓ (4.051s)
Total time: 4.052s

Inspect data flow:

artifact = ln.Artifact.filter(description="hits from schmidt22 crispra GWS").one()
artifact.view_lineage()
_images/57b1fb4cad5f6f4264a3dba3cea6f55ebdff814b07be2754d6754038058b55a3.svg

Sequencer upload

Upload files from sequencer via script chromium_10x_upload.py:

!python project-flow-scripts/chromium_10x_upload.py
💡 connected lamindb: testuser1/mydata
💡 saved: Transform(version='1', uid='qCJPkOuZAi9q5zKv', name='chromium_10x_upload.py', key='chromium_10x_upload.py', type='script', updated_at=2024-05-14 16:04:52 UTC, created_by_id=1)
💡 saved: Run(uid='XWtHbmGjTMLe1VHPyFxC', transform_id=3, created_by_id=1)
✅ saved transform.source_code: Artifact(version='1', uid='7tRxSsEw0MCorTyx4Mg7', suffix='.py', description='Source of transform qCJPkOuZAi9q5zKv', size=474, hash='o-QoKgEZGxbk5oBtcAKoWw', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2024-05-14 16:04:52 UTC, storage_id=1, created_by_id=1)
✅ saved run.environment: Artifact(uid='wjVsGKIA6ChMl7t3I833', suffix='.txt', description='requirements.txt', size=3346, hash='eaALIzW0H4dHF-aRYZQ-_Q', hash_type='md5', visibility=0, key_is_virtual=True, updated_at=2024-05-14 16:04:52 UTC, storage_id=1, created_by_id=1)

scRNA-seq bioinformatics pipeline

Process the uploaded files using a script or workflow manager and obtain 3 output files in a directory filtered_feature_bc_matrix/:

cellranger.py

!python project-flow-scripts/cellranger.py
💡 connected lamindb: testuser1/mydata
💡 saved: Transform(version='7.2.0', uid='XtQCgMKY7lOv', name='Cell Ranger', type='pipeline', reference='https://www.10xgenomics.com/support/software/cell-ranger/7.2', updated_at=2024-05-14 16:04:54 UTC, created_by_id=2)
💡 saved: Run(uid='zTwhl8iMkgCZ57jd5H0I', transform_id=4, created_by_id=2)
❗ this creates one artifact per file in the directory - you might simply call ln.Artifact(dir) to get one artifact for the entire directory
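
The warning above contrasts two ways of registering a pipeline output directory. The sketch below illustrates the difference with plain Python only; it is not LaminDB code. The helpers `per_file_records` and `directory_record` and the dictionary layout are hypothetical, and the three file names are taken from the keys shown in the artifact listing on this page.

```python
import tempfile
from pathlib import Path


def per_file_records(directory: Path) -> list:
    """One record per file: mirrors 'one artifact per file in the directory'."""
    return [
        {"key": f"{directory.name}/{p.name}", "suffix": "".join(p.suffixes)}
        for p in sorted(directory.iterdir())
        if p.is_file()
    ]


def directory_record(directory: Path) -> dict:
    """A single record for the whole directory, remembering how many files it holds."""
    n_objects = sum(1 for p in directory.rglob("*") if p.is_file())
    return {"key": directory.name, "n_objects": n_objects}


with tempfile.TemporaryDirectory() as tmp:
    # Create an empty stand-in for the pipeline output directory
    outdir = Path(tmp) / "filtered_feature_bc_matrix"
    outdir.mkdir()
    for name in ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz"):
        (outdir / name).write_bytes(b"")

    file_records = per_file_records(outdir)
    dir_record = directory_record(outdir)

print(len(file_records), "per-file records")
print(dir_record)
```

Per-file records make each file individually queryable, while a single directory record keeps the output together as one unit.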

postprocess_cellranger.py

!python project-flow-scripts/postprocess_cellranger.py
💡 connected lamindb: testuser1/mydata
💡 saved: Transform(version='2', uid='YqmbO6oMXjRj65cN', name='postprocess_cellranger.py', key='postprocess_cellranger.py', type='script', updated_at=2024-05-14 16:04:57 UTC, created_by_id=2)
💡 saved: Run(uid='HM0tFS1Q5m7RtpBwTgB2', transform_id=5, created_by_id=2)

Inspect data flow:

output_file = ln.Artifact.filter(description="perturbseq counts").one()
output_file.view_lineage()
_images/af07d37b0ef1ef2d9d145ba45fd2c82bd72a4ecbbbc365fb27adebf75d857759.svg

Integrate scRNA-seq & phenotypic data

Integrate data in notebook integrated-analysis.

# the following mimics the integrated analysis notebook
# In reality, you would execute inside the notebook
nbproject_test.execute_notebooks(cwd / "project-flow-scripts/integrated-analysis.ipynb", write=True)
Executing notebooks in /home/runner/work/lamin-usecases/lamin-usecases/docs/project-flow-scripts/integrated-analysis.ipynb
Scheduled: ['integrated-analysis']
integrated-analysis 
✓ (4.414s)
Total time: 4.415s

Review results

Let’s load one of the plots:

# track the current notebook as transform
ln.settings.transform.stem_uid = "1LCd8kco9lZU"
ln.settings.transform.version = "0"
ln.track()
💡 notebook imports: ipython==8.24.0 lamindb==0.71.3 nbproject_test==0.5.1
💡 saved: Transform(version='0', uid='1LCd8kco9lZU6K79', name='Project flow', key='project-flow', type='notebook', updated_at=2024-05-14 16:05:03 UTC, created_by_id=1)
💡 saved: Run(uid='3BJ7pwsj8QHpldexHVLX', transform_id=7, created_by_id=1)
artifact = ln.Artifact.filter(key__contains="figures/matrixplot").one()
artifact.cache()
PosixUPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/.lamindb/EmeRjD4yFFmcVU0iYtl3.png')
display(Image(filename=artifact.path))

We see that the image artifact is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

artifact.view_lineage()
_images/28fed1bd6af961573b1f62c4af30b440b58c794dc88163a93cd61b0243671d49.svg

Alternatively, we can look at the sequence of transforms:

transform = ln.Transform.search("Project flow").first()
transform.parents.df()
id | version | uid | name | key | description | type | latest_report_id | source_code_id | reference | reference_type | created_at | updated_at | created_by_id
6 | 1 | lB3IyPLQSmvt5zKv | Perform single cell analysis, integrate with C... | integrated-analysis | None | notebook | None | None | None | None | 2024-05-14 16:05:01.180977+00:00 | 2024-05-14 16:05:01.181008+00:00 | 2
transform.view_parents()
_images/a65b4d67068a3da6c83ebaf6afd9ae93eef57e68dc8524befa6b82dacbfdd632.svg
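
To make the idea of a sequence of transforms concrete, the following sketch models the project as a small graph in plain Python and walks it upstream from a result. It is a toy illustration, not LaminDB's implementation: the `Step` class, the `upstream` helper, and the dataset names are hypothetical and only loosely mirror the steps on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """A hypothetical transform that consumes and produces named datasets."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def upstream(dataset, steps):
    """Return the names of all steps that contributed to `dataset`, nearest first."""
    producers = {out: step for step in steps for out in step.outputs}
    trail, queue, seen = [], [dataset], set()
    while queue:
        current = queue.pop(0)  # breadth-first: closest producers come first
        step = producers.get(current)
        if step is None or step.name in seen:
            continue
        seen.add(step.name)
        trail.append(step.name)
        queue.extend(step.inputs)
    return trail


steps = [
    Step("upload", outputs=["raw_screen"]),
    Step("hit-identification", inputs=["raw_screen"], outputs=["hits"]),
    Step("sequencer-upload", outputs=["fastq"]),
    Step("pipeline", inputs=["fastq"], outputs=["counts"]),
    Step("integrated-analysis", inputs=["hits", "counts"], outputs=["figure"]),
]

lineage = upstream("figure", steps)
print(lineage)
```

Starting from the figure, the walk visits the integrated analysis first and ends at the two uploads, which is the same reading direction as a lineage graph traced from an output back to its sources.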

Understand runs

We tracked pipeline and notebook runs through run_context, which stores a Transform and a Run record as a global context.

Artifact objects are the inputs and outputs of runs.

What if I don’t want a global context?

Sometimes, we don’t want to create a global run context and instead pass a run manually when creating an artifact:

run = ln.Run(transform=transform)
ln.Artifact(filepath, run=run)
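
The difference between a global context and an explicitly passed run can be pictured with a small stand-alone model. This is a sketch under assumptions, not LaminDB's code: `GlobalContext`, `create_artifact`, and the simplified `Run` and `Artifact` classes are hypothetical, and the fallback rule (use the explicit run if given, otherwise the global one) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    transform_name: str


@dataclass
class Artifact:
    path: str
    run: Optional[Run] = None


class GlobalContext:
    """Hypothetical stand-in for a global run context."""
    run: Optional[Run] = None


def create_artifact(path: str, run: Optional[Run] = None) -> Artifact:
    """Attach the explicitly passed run, else fall back to the global one (assumed rule)."""
    return Artifact(path, run=run if run is not None else GlobalContext.run)


manual_run = Run("one-off analysis")
explicit = create_artifact("results.parquet", run=manual_run)
untracked = create_artifact("scratch.parquet")  # no global context set, no run passed

print(explicit.run, untracked.run)
```

Passing the run explicitly keeps provenance attached to the artifact without any process-wide state.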

When does an artifact appear as a run input?

When accessing an artifact via cache(), load() or backed(), two things happen:

  1. The current run gets added to artifact.input_of

  2. The transform of that artifact gets added as a parent of the current transform
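
The two bookkeeping steps listed above can be sketched as a toy model in plain Python. The classes and the `access` function are hypothetical simplifications, not LaminDB's implementation; only the attribute names `input_of` and `parents` follow the text.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Transform:
    name: str
    parents: list = field(default_factory=list)


@dataclass(eq=False)
class Run:
    transform: Transform


@dataclass(eq=False)
class Artifact:
    description: str
    transform: Transform  # the transform that created the artifact
    input_of: list = field(default_factory=list)


def access(artifact: Artifact, current_run: Run) -> None:
    """Mimic the bookkeeping described for cache(), load() or backed()."""
    # 1. Register the current run as a consumer of the artifact
    if current_run not in artifact.input_of:
        artifact.input_of.append(current_run)
    # 2. Link the artifact's creating transform as a parent of the current transform
    parent, child = artifact.transform, current_run.transform
    if parent is not child and parent not in child.parents:
        child.parents.append(parent)


analysis = Transform("integrated-analysis")
figure = Artifact("matrixplot", transform=analysis)
notebook_run = Run(Transform("project-flow"))

access(figure, notebook_run)
access(figure, notebook_run)  # repeated access does not duplicate the links

print([t.name for t in notebook_run.transform.parents])
```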

You can switch off auto-tracking of run inputs by setting ln.settings.track_run_inputs = False.

You can also track run inputs on a case-by-case basis via is_run_input=True, e.g., here:

artifact.load(is_run_input=True)
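
How a global switch and a per-call flag might combine can be sketched in a few lines. This is an assumption for illustration, not a statement about LaminDB's internals: the `Settings` class and `should_track` helper are hypothetical, and the rule shown (an explicit per-call value wins, `None` defers to the global setting) is assumed.

```python
from typing import Optional


class Settings:
    """Hypothetical stand-in for a settings object with a global switch."""
    track_run_inputs: bool = True


def should_track(is_run_input: Optional[bool] = None) -> bool:
    """Resolve whether an access is recorded as a run input (assumed precedence)."""
    if is_run_input is None:
        # No per-call preference: defer to the global switch
        return Settings.track_run_inputs
    # An explicit per-call value overrides the global switch
    return is_run_input


Settings.track_run_inputs = False
default_decision = should_track()                      # follows the global switch
explicit_decision = should_track(is_run_input=True)    # per-call override

print(default_decision, explicit_decision)
```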

Query by provenance

We can query or search for the notebook that created the artifact:

transform = ln.Transform.search("GWS CRIPSRa analysis").first()

And then find all the artifacts created by that notebook:

ln.Artifact.filter(transform=transform).df()
id | version | uid | storage_id | key | suffix | accessor | description | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | visibility | key_is_virtual | created_at | updated_at | created_by_id
2 | None | k6nZ3FAddp3EJpapGusd | 1 | None | .parquet | DataFrame | hits from schmidt22 crispra GWS | 18368 | 5GtKh_v__shMvLryIdXkKA | md5 | None | None | 2 | 2 | 1 | True | 2024-05-14 16:04:50.009842+00:00 | 2024-05-14 16:04:50.009866+00:00 | 2

Which transform ingested a given artifact?

artifact = ln.Artifact.filter().first()
artifact.transform
Transform(uid='UUmq5ofDaStL', name='Upload GWS CRISPRa result', type='upload', updated_at=2024-05-14 16:04:45 UTC, created_by_id=1)

And which user?

artifact.created_by
User(uid='DzTjkKse', handle='testuser1', name='Test User1', updated_at=2024-05-14 16:04:52 UTC)

Which transforms were created by a given user?

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).df()
id | version | uid | name | key | description | type | latest_report_id | source_code_id | reference | reference_type | created_at | updated_at | created_by_id
1 | None | UUmq5ofDaStL | Upload GWS CRISPRa result | None | None | upload | None | NaN | None | None | 2024-05-14 16:04:45.690005+00:00 | 2024-05-14 16:04:45.690024+00:00 | 1
3 | 1 | qCJPkOuZAi9q5zKv | chromium_10x_upload.py | chromium_10x_upload.py | None | script | None | 5.0 | None | None | 2024-05-14 16:04:52.269109+00:00 | 2024-05-14 16:04:52.749180+00:00 | 1
7 | 0 | 1LCd8kco9lZU6K79 | Project flow | project-flow | None | notebook | None | NaN | None | None | 2024-05-14 16:05:03.100139+00:00 | 2024-05-14 16:05:03.100184+00:00 | 1

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser1, type="notebook").df()
id | version | uid | name | key | description | type | latest_report_id | source_code_id | reference | reference_type | created_at | updated_at | created_by_id
7 | 0 | 1LCd8kco9lZU6K79 | Project flow | project-flow | None | notebook | None | None | None | None | 2024-05-14 16:05:03.100139+00:00 | 2024-05-14 16:05:03.100184+00:00 | 1
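
The keyword-style filters used in the queries above can be mimicked on plain Python records to show what they select. This is only a stand-in for illustration, not how LaminDB executes queries: `filter_records` is a hypothetical helper, and the records reuse the transform ids, types, and creators listed on this page.

```python
def filter_records(records, **conditions):
    """Keep records whose fields equal every given keyword condition."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in conditions.items())
    ]


# Transform ids, types, and creators as they appear in the tables on this page
transforms = [
    {"id": 1, "type": "upload", "created_by": "testuser1"},
    {"id": 2, "type": "notebook", "created_by": "testuser2"},
    {"id": 3, "type": "script", "created_by": "testuser1"},
    {"id": 4, "type": "pipeline", "created_by": "testuser2"},
    {"id": 5, "type": "script", "created_by": "testuser2"},
    {"id": 6, "type": "notebook", "created_by": "testuser2"},
    {"id": 7, "type": "notebook", "created_by": "testuser1"},
]

by_user = filter_records(transforms, created_by="testuser1")
notebooks_by_user = filter_records(transforms, created_by="testuser1", type="notebook")

print([t["id"] for t in by_user])
print([t["id"] for t in notebooks_by_user])
```

Adding a condition narrows the result, which is why the notebook query returns a subset of the per-user query.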

We can also view all recent additions to the entire database:

ln.view()
Artifact
version uid storage_id key suffix accessor description size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
12 None EmeRjD4yFFmcVU0iYtl3 1 figures/matrixplot_fig2_score-wgs-hits-per-clu... .png None None 28814 8zXF_cVwaZnfhmrLbt_0kA md5 None None 6 6 1 True 2024-05-14 16:05:02.274649+00:00 2024-05-14 16:05:02.274675+00:00 2
11 None rVNA7p8ddbxYRsNoryez 1 figures/umap_fig1_score-wgs-hits.png .png None None 118999 DCFDLUMF-UohaBvkThn0mA md5 None None 6 6 1 True 2024-05-14 16:05:02.033713+00:00 2024-05-14 16:05:02.033737+00:00 2
10 None EIt5h35ckPVVywKCs755 1 schmidt22_perturbseq.h5ad .h5ad AnnData perturbseq counts 20659936 la7EvqEUMDlug9-rpw-udA md5 None None 5 5 1 False 2024-05-14 16:04:58.100566+00:00 2024-05-14 16:04:58.100596+00:00 2
9 None tNwiOmMWfsBmFXsmlX8Z 1 perturbseq/filtered_feature_bc_matrix/matrix.m... .mtx.gz None None 6 NlipxKKAKE9efjoQV8dI7g md5 None None 4 4 1 False 2024-05-14 16:04:55.401155+00:00 2024-05-14 16:04:55.401173+00:00 2
8 None FNroa8g3o4UdoxHB9XFc 1 perturbseq/filtered_feature_bc_matrix/features... .tsv.gz None None 6 goPn3yeeHzLGz2auTH2U0A md5 None None 4 4 1 False 2024-05-14 16:04:55.400532+00:00 2024-05-14 16:04:55.400550+00:00 2
7 None mDHl6aNHiNlm0Iiwz6Yg 1 perturbseq/filtered_feature_bc_matrix/barcodes... .tsv.gz None None 6 EFoUnWqbXWb3yieENX8nEg md5 None None 4 4 1 False 2024-05-14 16:04:55.399754+00:00 2024-05-14 16:04:55.399774+00:00 2
4 None T8OxnvY81zQ68pT57pV2 1 fastq/perturbseq_R2_001.fastq.gz .fastq.gz None None 6 LD60tRjeniPxMt6u5OwoTA md5 None None 3 3 1 False 2024-05-14 16:04:52.723657+00:00 2024-05-14 16:04:52.723675+00:00 1
Run
uid transform_id started_at finished_at created_by_id json report_id environment_id is_consecutive reference reference_type created_at
id
1 Zx3yRoySOYPnQxQVTU2L 1 2024-05-14 16:04:45.693598+00:00 NaT 1 None None NaN True None None 2024-05-14 16:04:45.693710+00:00
2 OcPf379yrqoIcUBFG0Ue 2 2024-05-14 16:04:49.510892+00:00 NaT 2 None None NaN True None None 2024-05-14 16:04:49.510988+00:00
3 XWtHbmGjTMLe1VHPyFxC 3 2024-05-14 16:04:52.271882+00:00 2024-05-14 16:04:52.747370+00:00 1 None None 6.0 True None None 2024-05-14 16:04:52.272010+00:00
4 zTwhl8iMkgCZ57jd5H0I 4 2024-05-14 16:04:54.942049+00:00 NaT 2 None None NaN None None None 2024-05-14 16:04:54.942138+00:00
5 HM0tFS1Q5m7RtpBwTgB2 5 2024-05-14 16:04:57.013914+00:00 NaT 2 None None NaN None None None 2024-05-14 16:04:57.014036+00:00
6 PHzMM86neRAz0VAhXwjt 6 2024-05-14 16:05:01.186825+00:00 NaT 2 None None NaN True None None 2024-05-14 16:05:01.186926+00:00
7 3BJ7pwsj8QHpldexHVLX 7 2024-05-14 16:05:03.105829+00:00 NaT 1 None None NaN True None None 2024-05-14 16:05:03.105930+00:00
Storage
uid root description type region instance_uid created_at updated_at created_by_id
id
1 o2SlmBlntqrM /home/runner/work/lamin-usecases/lamin-usecase... None local None 54ZGqgkROOFf 2024-05-14 16:04:43.972938+00:00 2024-05-14 16:04:43.972957+00:00 1
Transform
version uid name key description type latest_report_id source_code_id reference reference_type created_at updated_at created_by_id
id
7 0 1LCd8kco9lZU6K79 Project flow project-flow None notebook None NaN None None 2024-05-14 16:05:03.100139+00:00 2024-05-14 16:05:03.100184+00:00 1
6 1 lB3IyPLQSmvt5zKv Perform single cell analysis, integrate with C... integrated-analysis None notebook None NaN None None 2024-05-14 16:05:01.180977+00:00 2024-05-14 16:05:01.181008+00:00 2
5 2 YqmbO6oMXjRj65cN postprocess_cellranger.py postprocess_cellranger.py None script None NaN None None 2024-05-14 16:04:57.011234+00:00 2024-05-14 16:04:57.011264+00:00 2
4 7.2.0 XtQCgMKY7lOv Cell Ranger None None pipeline None NaN https://www.10xgenomics.com/support/software/c... None 2024-05-14 16:04:54.939843+00:00 2024-05-14 16:04:54.939863+00:00 2
3 1 qCJPkOuZAi9q5zKv chromium_10x_upload.py chromium_10x_upload.py None script None 5.0 None None 2024-05-14 16:04:52.269109+00:00 2024-05-14 16:04:52.749180+00:00 1
2 1 T0T28btuB0PG5zKv GWS CRIPSRa analysis hit-identification None notebook None NaN None None 2024-05-14 16:04:49.505651+00:00 2024-05-14 16:04:49.505670+00:00 2
1 None UUmq5ofDaStL Upload GWS CRISPRa result None None upload None NaN None None 2024-05-14 16:04:45.690005+00:00 2024-05-14 16:04:45.690024+00:00 1
User
uid handle name created_at updated_at
id
2 bKeW4T6E testuser2 Test User2 2024-05-14 16:04:49.503054+00:00 2024-05-14 16:04:54.934189+00:00
1 DzTjkKse testuser1 Test User1 2024-05-14 16:04:43.968751+00:00 2024-05-14 16:04:52.134735+00:00
!lamin login testuser1
# Remove the stored files first: deleting an instance whose storage is not empty fails
!rm -r ./mydata
!lamin delete --force mydata
✅ logged in with email [email protected] (uid: DzTjkKse)