2022-07-31

Key problems of data-heavy R&D

The complexity of modern R&D data often blocks the scientific progress that the data promises.

Here, we list key problems we see and how we think about solving them.

Data can’t be accessed

| Problem | Description | Solution |
| --- | --- | --- |
| Object storage. | Data in object storage can’t be queried. | Index observations and variables and link them in a query database. |
| Pile of data. | Data can’t be accessed because it is unstructured and siloed in fragmented infrastructure. | Structure data both by biological entities and by provenance, with one interface across storage and database backends. |
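To make the first row concrete, here is a minimal sketch of indexing observations and variables of stored objects in a query database. It uses SQLite from the Python standard library; the table names, the `index_object` and `find_objects` helpers, and the storage keys are hypothetical illustrations and not LaminDB's API.

```python
import sqlite3

# Hypothetical schema: one row per object in storage, plus index entries
# that record which observations and variables each object contains.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stored_object (
        id INTEGER PRIMARY KEY,
        storage_key TEXT UNIQUE NOT NULL
    );
    CREATE TABLE index_entry (
        object_id INTEGER NOT NULL REFERENCES stored_object(id),
        kind TEXT NOT NULL CHECK (kind IN ('observation', 'variable')),
        name TEXT NOT NULL,
        PRIMARY KEY (object_id, kind, name)
    );
    """
)


def index_object(storage_key, observations, variables):
    """Register an object and link its observations and variables to it."""
    cur = conn.execute(
        "INSERT INTO stored_object (storage_key) VALUES (?)", (storage_key,)
    )
    object_id = cur.lastrowid
    rows = [(object_id, "observation", name) for name in observations]
    rows += [(object_id, "variable", name) for name in variables]
    conn.executemany("INSERT INTO index_entry VALUES (?, ?, ?)", rows)
    conn.commit()
    return object_id


def find_objects(kind, name):
    """Return storage keys of all objects that contain the given entry."""
    rows = conn.execute(
        """
        SELECT o.storage_key
        FROM stored_object AS o
        JOIN index_entry AS e ON e.object_id = o.id
        WHERE e.kind = ? AND e.name = ?
        ORDER BY o.storage_key
        """,
        (kind, name),
    )
    return [key for (key,) in rows]


# Made-up example objects; only the index lives in the database.
index_object("s3://example-bucket/run1.h5ad", ["cell-001", "cell-002"], ["CD8A", "CD4"])
index_object("s3://example-bucket/run2.h5ad", ["cell-003"], ["CD8A"])
print(find_objects("variable", "CD8A"))
```

The objects themselves stay in object storage. The database only holds the index, so a query by variable or observation returns the keys of the objects worth loading.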

Data can’t be accessed at scale

| Problem | Description | Solution |
| --- | --- | --- |
| Anecdotal data. | Data can’t be accessed at scale as no viable programmatic interfaces exist. | API-first platform. |
| Cross-storage integration. | Molecular (high-dimensional) data can’t be efficiently integrated with phenotypic (low-dimensional) data. | Index molecular data with the same biological entities as phenotypic data. Provide connectors for low-dimensional data management systems (ELNs and LIMS). |
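The cross-storage integration idea can be sketched as follows, assuming both kinds of data are keyed by the same sample identifier. The records, the `select_molecular` helper, and all field names are hypothetical and only illustrate the shared-index principle.

```python
# Phenotypic (low-dimensional) records, as they might be exported
# from an ELN or LIMS. All values are made up for illustration.
phenotypes = {
    "sample-1": {"tissue": "liver", "treated": True},
    "sample-2": {"tissue": "liver", "treated": False},
    "sample-3": {"tissue": "kidney", "treated": True},
}

# Molecular (high-dimensional) measurements indexed by the same sample IDs.
expression = {
    "sample-1": [0.1, 2.3, 5.0],
    "sample-2": [0.0, 1.9, 4.2],
    "sample-3": [3.3, 0.2, 0.7],
}


def select_molecular(phenotypes, molecular, **criteria):
    """Return molecular data for samples whose phenotype matches all criteria."""
    return {
        sample_id: molecular[sample_id]
        for sample_id, record in phenotypes.items()
        if sample_id in molecular
        and all(record.get(key) == value for key, value in criteria.items())
    }


treated_liver = select_molecular(phenotypes, expression, tissue="liver", treated=True)
print(sorted(treated_liver))
```

Because both sides share one entity index, a phenotypic filter directly selects the matching high-dimensional measurements without a separate mapping step.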

Scientific results aren’t solid

| Problem | Description | Solution |
| --- | --- | --- |
| Stand on solid ground. | Key analytics results cannot be linked to supporting data as too many processing steps are involved. | Provide full data provenance. |
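As a rough illustration of what full data provenance enables, the sketch below records each processing step with its inputs and output and then walks back from a result to the raw data that supports it. The `Step` class, the `trace_sources` function, and the file names are assumptions made for this example, not LaminDB's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One processing step: a named transformation from inputs to one output."""
    name: str
    inputs: tuple
    output: str


def trace_sources(artifact, produced_by):
    """Return the raw inputs (artifacts no step produced) behind an artifact."""
    step = produced_by.get(artifact)
    if step is None:
        # No recorded step produced this artifact, so it is a source.
        return {artifact}
    sources = set()
    for upstream in step.inputs:
        sources |= trace_sources(upstream, produced_by)
    return sources


# A made-up three-step pipeline ending in a figure.
steps = [
    Step("align", ("reads.fastq",), "aligned.bam"),
    Step("count", ("aligned.bam", "genes.gtf"), "counts.csv"),
    Step("plot", ("counts.csv",), "figure.png"),
]
produced_by = {step.output: step for step in steps}

print(sorted(trace_sources("figure.png", produced_by)))
```

With every step recorded, linking a key result to its supporting data is a graph traversal, however many processing steps lie in between.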

Collaboration across organizations is hard

| Problem | Description | Solution |
| --- | --- | --- |
| Siloed infrastructure. | Data can’t be easily shared across organizations. | Federated collaboration hub on distributed infrastructure. |
| Siloed semantics. | External data can’t be mapped onto in-house data and vice versa. | Provide a curation and ingestion API; operate on open-source data models that any organization can adopt. |

R&D could be more effective

| Problem | Description | Solution |
| --- | --- | --- |
| Optimal decision making. | There is no framework for tracking decision making in complex R&D teams. | A graph of data flow in the R&D team, including scientists, computation, decisions, and predictions. Unlike workflow frameworks, LaminDB creates an emergent graph. |
| Dry lab is not integrated. | Data platforms offer no adequate interface for the dry lab. | API-first, designed with the needs of data scientists in mind. |
| Support learning. | There is no support for the learning-from-data cycle. | Support data models across the full lab cycle, including measured → relevant → derived features. Manage knowledge through rich semantic models that map high-dimensional data. |

No support for basic R&D operations

| Problem | Description | Solution |
| --- | --- | --- |
| Development data. | Data associated with assay development can’t be ingested as data models are too rigid. | Allow partial integrity in LaminDB’s implementation of a data lakehouse: ingest data of any curation level and label it with corresponding QC flags. |
| Corrupted data. | Data is often corrupted. | Full provenance makes it possible to trace corruption back to its origin and write a simple fix, typically in the form of an ingestion constraint. |
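The two rows above can be sketched together: ingestion accepts every record but labels it with QC flags, and a fix for a known corruption is expressed as one more ingestion constraint. The `ingest` function, the constraint names, and the record fields are hypothetical and do not reflect LaminDB's actual implementation.

```python
def ingest(records, constraints):
    """Ingest all records, labeling each with the constraints it fails."""
    ingested = []
    for record in records:
        failed = [name for name, check in constraints.items() if not check(record)]
        # Nothing is rejected: records that fail are kept and flagged.
        ingested.append({**record, "qc_flags": failed, "qc_passed": not failed})
    return ingested


# Each constraint is a named predicate over a record (illustrative only).
constraints = {
    "has_sample_id": lambda r: bool(r.get("sample_id")),
    "non_negative_count": lambda r: isinstance(r.get("count"), (int, float))
    and r["count"] >= 0,
}

# Made-up records of varying curation level.
records = [
    {"sample_id": "sample-1", "count": 12},
    {"sample_id": "", "count": 7},
    {"sample_id": "sample-3", "count": -4},
]

for row in ingest(records, constraints):
    print(row["sample_id"], row["qc_passed"], row["qc_flags"])
```

Downstream queries can then filter on `qc_passed` when they need curated data, while development data stays available for those who want it.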

Building a data platform is hard

| Problem | Description | Solution |
| --- | --- | --- |
| Aligning data models. | Data models are hard to align across interdisciplinary stakeholders. | Lamin’s data model templates cover 90% of cases; the remaining 10% can be configured. |
| Lock-in. | Existing platforms lock organizations into specific cloud infrastructure. | Open-source and multi-cloud stack with zero lock-in danger. |
| Migrations are a pain. | Migrating data models in a fast-paced R&D environment can be prohibitive. | LaminDB’s schema modules migrate automatically. |

Note: This problem statement was originally published as part of the lamindb docs. It remained prominently linked from the about page of lamin.ai while traveling through various repositories with small edits: within lamindb, within lamin-about, within lamin-docs. It got moved to the blog page on 2023-08-11 and will remain there unmodified.