Track redun workflows#

Note

This use case starts out with Rico Meinl’s GitHub repository (and blog post).

Tip

Source notebooks are in the redun-lamin-fasta repository.

While redun focuses on managing workflows for data pipelines, LaminDB offers a provenance-aware data lake.

redun schedules, executes, and tracks pipeline runs with fine-grained control and rich metadata.

LaminDB’s data lake complements redun with

  1. data lineage across computational pipelines, interactive analyses (notebooks), and UI-submitted data

  2. curating, querying & structuring data by biological entities

  3. extensible & modular Python ORM for queries & data access

Track the workflow as a pipeline#

!lamin login testuser1@lamin.ai --password cEvcwMJFX4OwbsYVaMt2Os6GxxGgDUlBGILs2RyS
ℹ️ Your handle is testuser1 and your id is DzTjkKse.
!lamin init --storage ./fasta
ℹ️ Loading schema modules: core==0.29.1 
ℹ️ Created instance testuser1/fasta
import lamindb as ln
import lamindb.schema as lns
from pathlib import Path
import redun_lamin_fasta
ln.nb.header()
author           Test User1 (testuser1)
id               0ymQDuqM5Lwq
version          0
time_init        2022-11-13 21:29
time_run         2023-03-09 17:02
pypackage_store  lamindb==0.10.0
pypackage_live   lamindb==0.30.3 redun_lamin_fasta==0.1.0
ℹ️ Instance: testuser1/fasta
ℹ️ Added notebook: 0ymQDuqM5Lwq v0
ℹ️ Added run: NsKuQJ3EYpITaGUS03jt

Create a pipeline record:

pipeline = lns.Pipeline(
    name="lamin-redun-fasta",
    v=redun_lamin_fasta.__version__,
    reference="https://github.com/laminlabs/redun-lamin-fasta",
)
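For orientation, the record bundles three fields: the pipeline's name, the version of its package, and a reference URL. A plain-Python sketch of that shape (a hypothetical stand-in for illustration, not the actual lamindb ORM model):

```python
from dataclasses import dataclass

@dataclass
class PipelineRecord:
    """Illustrative stand-in for the fields set on lns.Pipeline."""
    name: str       # human-readable pipeline name
    v: str          # version string, here taken from the package's __version__
    reference: str  # URL of the pipeline's source repository

record = PipelineRecord(
    name="lamin-redun-fasta",
    v="0.1.0",
    reference="https://github.com/laminlabs/redun-lamin-fasta",
)
print(record.name, record.v)  # → lamin-redun-fasta 0.1.0
```

Tying the version field to the installed package (as the real cell does via `redun_lamin_fasta.__version__`) keeps the database record in sync with the code that produced the run.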

Add the record to the database:

ln.add(pipeline)
Pipeline(id='R8QwchFP', v='0.1.0', name='lamin-redun-fasta', reference='https://github.com/laminlabs/redun-lamin-fasta', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 9, 17, 2, 38))

Register the input files#

Let’s first register input files for processing with the redun pipeline.

!ls ./fasta
KLF4.fasta  MYC.fasta  PO5F1.fasta  SOX2.fasta	fasta.lndb
for filepath in Path("./fasta/").glob("*.fasta"):
    dobject = ln.DObject(filepath)  # create a DObject record for the file
    ln.add(dobject)  # register it in the database
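The registration loop relies on `pathlib` globbing to discover the FASTA files. A self-contained sketch of that discovery step, using a temporary directory in place of `./fasta`:

```python
import tempfile
from pathlib import Path

# Create a throwaway directory with files mimicking the ./fasta layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["KLF4.fasta", "MYC.fasta", "notes.txt"]:
        (root / name).write_text(">header\nACGT\n")

    # glob("*.fasta") matches only the FASTA files, skipping everything else
    # (such as the fasta.lndb database file living alongside them).
    fasta_files = sorted(p.name for p in root.glob("*.fasta"))
    print(fasta_files)  # → ['KLF4.fasta', 'MYC.fasta']
```

Each matched path would then be wrapped in a `DObject` and registered, as in the loop above.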