2022-07-31

# Key problems of data-heavy R&D

The complexity of modern R&D data often blocks the scientific progress it promises.

Here, we list key problems we see and how we think about solving them.

## Data can’t be accessed

| Problem | Description | Solution |
|---|---|---|
| Object storage | Data in object storage can’t be queried. | Index observations and variables and link them in a query database. |
| Pile of data | Data can’t be accessed because it is unstructured and siloed in fragmented infrastructure. | Structure data both by biological entities and by provenance, with one interface across storage and database backends. |
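
To make the idea concrete, here is a minimal sketch of indexing variables measured in stored files into a query database, so data can be found without opening every object. All table, column, and file names are illustrative assumptions, not LaminDB's schema:

```python
# Hypothetical sketch: a small query database that indexes which variables
# (here, genes) were measured in which object-store files.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE variable (name TEXT, object_key TEXT)")

# Pretend these (gene, file) pairs were extracted while ingesting files
# into an S3-like object store.
index = [
    ("CD8A", "s3://lab-data/experiment1.h5ad"),
    ("CD8A", "s3://lab-data/experiment2.h5ad"),
    ("FOXP3", "s3://lab-data/experiment2.h5ad"),
]
con.executemany("INSERT INTO variable VALUES (?, ?)", index)

# Query: which stored objects measured gene CD8A?
keys = sorted(row[0] for row in con.execute(
    "SELECT object_key FROM variable WHERE name = ?", ("CD8A",)
))
print(keys)  # both experiment files
```

The point is that the heavy data stays in object storage; only lightweight metadata lands in the database that answers queries.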

## Data can’t be accessed at scale

| Problem | Description | Solution |
|---|---|---|
| Anecdotal data | Data can’t be accessed at scale because no viable programmatic interfaces exist. | An API-first platform. |
| Cross-storage integration | Molecular (high-dimensional) data can’t be efficiently integrated with phenotypic (low-dimensional) data. | Index molecular data with the same biological entities as phenotypic data; provide connectors for low-dimensional data management systems (ELN & LIMS). |
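
A minimal sketch of the indexing idea, assuming both datasets are keyed by the same biological entity (here, a sample id); the sample names, genes, and fields are made up for illustration:

```python
# Hypothetical sketch: join high-dimensional molecular data with
# low-dimensional phenotypic metadata via a shared sample id.

# High-dimensional readout keyed by sample id (e.g., per-gene values).
molecular = {
    "sample1": {"CD8A": 12.0, "FOXP3": 0.5},
    "sample2": {"CD8A": 3.1, "FOXP3": 7.2},
}
# Low-dimensional phenotypes, e.g., exported from an ELN/LIMS system.
phenotypic = {
    "sample1": {"treatment": "drugA", "responder": True},
    "sample2": {"treatment": "placebo", "responder": False},
}

# Because both tables are indexed by the same entity, integration
# reduces to a join on the shared key.
joined = {
    sample: {**molecular[sample], **phenotypic[sample]}
    for sample in molecular.keys() & phenotypic.keys()
}
print(joined["sample1"]["CD8A"], joined["sample1"]["treatment"])
```

Without the shared entity index, this join would require ad-hoc matching of file contents against spreadsheet rows.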

## Scientific results aren’t solid

| Problem | Description | Solution |
|---|---|---|
| Stand on solid ground | Key analytical results cannot be linked to supporting data because too many processing steps are involved. | Provide full data provenance. |
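
As a sketch of what full provenance buys: if every artifact records the inputs it was derived from, a key result can be traced back to its supporting raw data across arbitrarily many processing steps. The artifact names and the dictionary structure are illustrative assumptions:

```python
# Hypothetical sketch: each artifact maps to the inputs it was derived from.
provenance = {
    "figure3.pdf": ["de_results.csv"],
    "de_results.csv": ["normalized.h5ad"],
    "normalized.h5ad": ["raw_counts.h5ad"],
    "raw_counts.h5ad": [],  # a root: raw measurement data
}

def trace_to_roots(artifact: str) -> set:
    """Return the raw artifacts that ultimately support `artifact`."""
    inputs = provenance[artifact]
    if not inputs:
        return {artifact}
    roots = set()
    for parent in inputs:
        roots |= trace_to_roots(parent)
    return roots

print(trace_to_roots("figure3.pdf"))  # the raw data behind the figure
```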

## Collaboration across organizations is hard

| Problem | Description | Solution |
|---|---|---|
| Siloed infrastructure | Data can’t easily be shared across organizations. | A federated collaboration hub on distributed infrastructure. |
| Siloed semantics | External data can’t be mapped onto in-house data, and vice versa. | Provide curation and ingestion APIs; operate on open-source data models that any organization can adopt. |
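
A minimal sketch of why shared open data models unlock semantic mapping: if both parties curate against the same public vocabulary, external records join with in-house records through the common identifier. The `curate` helper and the identifiers shown are hypothetical illustrations:

```python
# Hypothetical sketch: a shared public vocabulary (gene symbol -> stable id)
# acts as the common data model between organizations.
public_ids = {"CD8A": "ENSG00000153563", "FOXP3": "ENSG00000049768"}

# A partner's data arrives with free-form naming conventions.
external_data = {"CD8a": 4.2, "FoxP3": 1.1}

def curate(symbol: str) -> str:
    """Map a free-form symbol to the shared identifier (toy normalization)."""
    matches = [pid for name, pid in public_ids.items()
               if name.lower() == symbol.lower()]
    if not matches:
        raise ValueError(f"cannot curate {symbol!r}")
    return matches[0]

# After curation, external data is keyed by the shared identifiers and can
# be joined directly with in-house data curated the same way.
mapped = {curate(symbol): value for symbol, value in external_data.items()}
print(mapped)
```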

## R&D could be more effective

| Problem | Description | Solution |
|---|---|---|
| Optimal decision making | There is no framework for tracking decision making in complex R&D teams. | A graph of data flow in the R&D team, spanning scientists, computation, decisions, and predictions. Unlike workflow frameworks, LaminDB creates this graph emergently. |
| Dry lab is not integrated | Data platforms offer no adequate interface for the dry lab. | An API-first platform designed with data scientists’ needs in mind. |
| Support learning | There is no support for the learning-from-data cycle. | Support data models across the full lab cycle, including measured → relevant → derived features. Manage knowledge through rich semantic models that map high-dimensional data. |
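
To illustrate the contrast with workflow frameworks: instead of declaring a pipeline up front, each unit of work (a notebook, a script, a decision) is recorded as it happens with its inputs and outputs, and the team's data-flow graph emerges from those records. The `track` helper and all file names are hypothetical, not LaminDB's API:

```python
# Hypothetical sketch of an emergent run graph.
runs = []

def track(actor: str, inputs: list, outputs: list) -> None:
    """Record one unit of work with its input and output artifacts."""
    runs.append({"actor": actor, "inputs": inputs, "outputs": outputs})

# Work is recorded as it happens, in no predefined order.
track("alice/qc.ipynb", ["raw.h5ad"], ["qc.h5ad"])
track("bob/model.py", ["qc.h5ad"], ["predictions.csv"])
track("team decision", ["predictions.csv"], ["candidate_list.csv"])

# The data-flow graph emerges afterwards as edges from artifacts to actors;
# nothing about this DAG was declared ahead of time.
edges = [(i, run["actor"]) for run in runs for i in run["inputs"]]
print(edges)
```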

## No support for basic R&D operations

| Problem | Description | Solution |
|---|---|---|
| Development data | Data associated with assay development can’t be ingested because data models are too rigid. | Allow partial integrity in LaminDB’s implementation of a data lakehouse: ingest data of any curation level and label it with corresponding QC flags. |
| Corrupted data | Data is often corrupted. | Full provenance allows tracing corruption back to its origin and writing a simple fix, typically in the form of an ingestion constraint. |
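
A minimal sketch of partial integrity, under the assumption of a simple required-field check: rather than rejecting records that fail strict validation, everything is ingested and labeled with a QC flag, so downstream queries can filter by curation level. Field names and the flag format are illustrative:

```python
# Hypothetical sketch: ingest data of any curation level, never drop it,
# and record its validation status as a QC flag.
REQUIRED_FIELDS = {"sample_id", "assay", "value"}

def ingest(record: dict) -> dict:
    missing = REQUIRED_FIELDS - record.keys()
    record = dict(record)
    record["qc_flag"] = "validated" if not missing else f"missing:{sorted(missing)}"
    return record  # stored either way

lake = [
    ingest({"sample_id": "s1", "assay": "RNA-seq", "value": 3.2}),
    ingest({"sample_id": "s2", "value": 1.1}),  # assay unknown: dev data
]

# Downstream consumers choose the curation level they need.
validated = [r for r in lake if r["qc_flag"] == "validated"]
print(len(lake), len(validated))  # everything ingested, one fully validated
```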

## Building a data platform is hard

| Problem | Description | Solution |
|---|---|---|
| Aligning data models | Data models are hard to align across interdisciplinary stakeholders. | Lamin’s data model templates cover 90% of cases; the remaining 10% can be configured. |
| Lock-in | Existing platforms lock organizations into specific cloud infrastructure. | An open-source, multi-cloud stack with zero lock-in danger. |
| Migrations are a pain | Migrating data models in a fast-paced R&D environment can be prohibitive. | LaminDB’s schema modules migrate automatically. |

Note: This problem statement was originally published as part of the lamindb docs. It remained prominently linked from the about page of lamin.ai while traveling through various repositories with small edits: within lamindb, within lamin-about, within lamin-docs. It was moved to the blog page on 2023-08-11 and will remain there unmodified.