RxRx: cell imaging

rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.

Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.

  • In this guide, you’ll see how to query some of these data using LaminDB: laminlabs/rxrx.

  • If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

  • If you’d like to understand how the laminlabs/rxrx instance was curated, see this repository.

Setup

!lamin load laminlabs/rxrx
💡 loaded instance: laminlabs/rxrx

import lamindb as ln
import lnschema_bionty as lb
import lnschema_lamin1 as ln1
from lamindb import UPath
💡 lamindb instance: laminlabs/rxrx

Search & look up metadata

We’ll find all treatments in the Treatment registry:

df = ln1.Treatment.filter().df()
df.shape
(1139, 12)

Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:

sirnas = ln1.Treatment.filter(system="siRNA").lookup(return_field="name")

We’re also interested in features, cell lines & wells:

features = ln.Feature.lookup(return_field="name")
cell_lines = lb.CellLine.lookup(return_field="abbr")
wells = ln1.Well.lookup(return_field="name")
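
To illustrate the idea behind such lookup objects, here is a minimal standard-library sketch. It is not LaminDB's implementation: the helper `make_lookup` is hypothetical, and it returns the full name, whereas the `cell_lines` lookup above returns the `abbr` field. It only shows how human-readable names can be turned into attribute names that an editor can auto-complete.

```python
import re
from collections import namedtuple


def make_lookup(names):
    """Map human-readable names to attribute-friendly keys (illustrative only)."""
    keys = {}
    for name in names:
        # Lowercase and replace runs of non-alphanumeric characters with "_".
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        # Attribute names cannot be empty or start with a digit.
        if not key or key[0].isdigit():
            key = f"n{key}"
        keys[key] = name
    Lookup = namedtuple("Lookup", list(keys))
    return Lookup(**keys)


# Cell line names as they appear in the dataset description.
cell_line_names = ["HUV-EC-C cell", "U-2 OS cell", "hTERT RPE-1 cell", "Hep G2 cell"]
lookup = make_lookup(cell_line_names)
print(lookup.hep_g2_cell)
```

With this sketch, `lookup.hep_g2_cell` resolves to the stored value for the name 'Hep G2 cell', mirroring how `cell_lines.hep_g2_cell` is used in the queries of this guide.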

Load the dataset

In this instance, there is only a single dataset:

ln.Dataset.filter().df()
uid name description version hash reference reference_type transform_id run_id file_id storage_id initial_version_id visibility updated_at created_by_id
id
12 flLeukogmLRzleFCpCRD RxRx1 images None 1 4on4AbbmBL0sr0xe9_gxxQ None None 4 3 6 2 None 1 2023-11-15 08:47:53.842900+00:00 1

This is RxRx1: 125k images covering 1138 siRNA perturbations across 4 cell lines, reading out 5 stains. The image dimensions are 512x512x6.

Let us get the corresponding object and some information about it:

dataset = ln.Dataset.filter(uid="flLeukogmLRzleFCpCRD").one()
dataset.view_flow()
dataset.describe()
Dataset(uid='flLeukogmLRzleFCpCRD', name='RxRx1 images', version='1', hash='4on4AbbmBL0sr0xe9_gxxQ', visibility=1, updated_at=2023-11-15 08:47:53 UTC)

Provenance:
  📔 transform: Transform(uid='Zo0qJt4IQPsbxM', name='Register the RxRx1 dataset', short_name='02-rxrx1', version='1', type='notebook', updated_at=2023-11-15 06:16:01 UTC, latest_report_id=11, source_file_id=10, initial_version_id=3, created_by_id=1)
  👣 run: Run(uid='1wQSEWx8oK23GxLlAIHj', run_at=2023-11-15 05:48:27 UTC, is_consecutive=True, transform_id=4, created_by_id=1, report_id=11)
  📄 file: File(uid='flLeukogmLRzleFCpCRD', suffix='.parquet', accessor='DataFrame', description='Metadata with file paths for each RxRx1 image.', size=7806003, hash='4on4AbbmBL0sr0xe9_gxxQ', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-11-15 08:47:53 UTC, storage_id=1, transform_id=3, run_id=2, created_by_id=1)
  🗃️ storage: Storage(uid='aLcXXffe', root='gs://rxrx1-europe-west4/images', type='gs', updated_at=2023-11-15 05:55:24 UTC, created_by_id=1)
  👤 created_by: User(uid='FBa7SHjn', handle='falexwolf', name='Alex Wolf', updated_at=2023-11-05 14:57:57 UTC)
Features:
  columns: FeatureSet(uid='y0uhIW520iTEzrxI14mL', n=8, registry='core.Feature', hash='mnhzsJj-j7VZgNJ88VM0', updated_at=2023-11-05 17:57:58 UTC, created_by_id=1)
    🔗 cell_line (4, bionty.CellLine): 'HUV-EC-C cell', 'U-2 OS cell', 'hTERT RPE-1 cell', 'Hep G2 cell'
    🔗 split (2, core.ULabel): 'test', 'train'
    🔗 experiment (51, lamin1.Experiment): 'HEPG2-08', 'HEPG2-09', 'HEPG2-10', 'HEPG2-11', 'HUVEC-17', 'HUVEC-18', 'HUVEC-19', 'HUVEC-20', 'HUVEC-21', 'HUVEC-22', ...
    plate (number)
    🔗 well (308, lamin1.Well): 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', ...
    🔗 well_type (3, core.ULabel): 'negative_control', 'treatment', 'positive_control'
    🔗 sirna (1139, lamin1.Treatment): 'EMPTY', 's21721', 's20894', 's19827', 's19792', 's19935', 's21398', 's223097', 's348', 's19975', ...
    path (object)
  external: FeatureSet(uid='JWcRhQneUEwjkoYnEf59', n=1, registry='core.Feature', hash='AF42DRsoUacKb5l4WgT-', updated_at=2023-11-15 05:55:14 UTC, created_by_id=1)
    🔗 readout (1, bionty.ExperimentalFactor): 'high content screen'
Labels:
  🏷️ cell_lines (4, bionty.CellLine): 'HUV-EC-C cell', 'U-2 OS cell', 'hTERT RPE-1 cell', 'Hep G2 cell'
  🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'high content screen'
  🏷️ experiments (51, lamin1.Experiment): 'HEPG2-08', 'HEPG2-09', 'HEPG2-10', 'HEPG2-11', 'HUVEC-17', 'HUVEC-18', 'HUVEC-19', 'HUVEC-20', 'HUVEC-21', 'HUVEC-22', ...
  🏷️ wells (308, lamin1.Well): 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', ...
  🏷️ treatments (1139, lamin1.Treatment): 'EMPTY', 's21721', 's20894', 's19827', 's19792', 's19935', 's21398', 's223097', 's348', 's19975', ...
  🏷️ ulabels (9, core.ULabel): 'test', 'train', 'Plate1', 'Plate2', 'Plate3', 'Plate4', 'negative_control', 'treatment', 'positive_control'
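
As a rough sanity check, the label counts in this description can be multiplied out to see whether they are consistent with the "125k images" figure. The sketch below rests on assumptions that the description does not state: two imaging sites per well, every well imaged on every plate of every experiment, and six channels per image (taken from the 512x512x6 dimensions). The result is therefore an upper bound, not an exact count.

```python
# Counts taken from the describe() output.
n_experiments = 51
n_plates = 4   # ULabels 'Plate1' to 'Plate4'
n_wells = 308

# Assumptions (not stated in the description):
n_sites = 2     # two imaging sites per well
n_channels = 6  # last axis of the 512x512x6 image dimensions

# Upper bound on multi-channel images and on single-channel files.
max_site_images = n_experiments * n_plates * n_wells * n_sites
max_channel_files = max_site_images * n_channels

print(max_site_images)
print(max_channel_files)
```

Under these assumptions the upper bound is 125,664 multi-channel images, which agrees with the quoted 125k.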

The dataset consists of a metadata file and a folder path pointing to the image files:

dataset.file.load().head()
site_id well_id cell_line split experiment plate well site well_type sirna sirna_id paths path
0 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png images/test/HEPG2-08/Plate1/B02_s1_w1.png
1 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png images/test/HEPG2-08/Plate1/B02_s1_w2.png
2 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png images/test/HEPG2-08/Plate1/B02_s1_w3.png
3 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png images/test/HEPG2-08/Plate1/B02_s1_w4.png
4 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png images/test/HEPG2-08/Plate1/B02_s1_w5.png
dataset.path
GCSPath('gs://rxrx1-europe-west4/images')
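
The `path` column appears to encode several of the metadata columns in the file name itself. The following sketch parses one key with a regular expression. The layout `images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png` is inferred from the rows shown above, and reading the `_w` suffix as a channel index is an assumption; `parse_image_key` is a hypothetical helper, not part of LaminDB.

```python
import re

# Assumed layout, inferred from the metadata rows:
# images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png
PATH_PATTERN = re.compile(
    r"^images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_image_key(key):
    """Split a relative image key into its metadata fields."""
    match = PATH_PATTERN.match(key)
    if match is None:
        raise ValueError(f"unexpected key layout: {key}")
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for name in ("plate", "site", "channel"):
        fields[name] = int(fields[name])
    return fields


fields = parse_image_key("images/test/HEPG2-08/Plate1/B02_s1_w1.png")
print(fields)
```

A check like this can be useful for validating that the metadata columns and the file names agree before training on the images.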

We can get an idea of the folder structure like so:

dataset.path.view_tree(level=2)
images (53 sub-directories & 0 files):
├── test
│   ├── HEPG2-08
│   ├── HEPG2-09
│   ├── HEPG2-10
│   ├── HEPG2-11
│   ├── HUVEC-17
│   ├── HUVEC-18
│   ├── HUVEC-19
│   ├── HUVEC-20
│   ├── HUVEC-21
│   ├── HUVEC-22
│   ├── HUVEC-23
│   ├── HUVEC-24
│   ├── RPE-08
│   ├── RPE-09
│   ├── RPE-10
│   ├── RPE-11
│   ├── U2OS-04
│   └── U2OS-05
└── train
    ├── HEPG2-01
    ├── HEPG2-02
    ├── HEPG2-03
    ├── HEPG2-04
    ├── HEPG2-05
    ├── HEPG2-06
    ├── HEPG2-07
    ├── HUVEC-01
    ├── HUVEC-02
    ├── HUVEC-03
    ├── HUVEC-04
    ├── HUVEC-05
    ├── HUVEC-06
    ├── HUVEC-07
    ├── HUVEC-08
    ├── HUVEC-09
    ├── HUVEC-10
    ├── HUVEC-11
    ├── HUVEC-12
    ├── HUVEC-13
    ├── HUVEC-14
    ├── HUVEC-15
    ├── HUVEC-16
    ├── RPE-01
    ├── RPE-02
    ├── RPE-03
    ├── RPE-04
    ├── RPE-05
    ├── RPE-06
    ├── RPE-07
    ├── U2OS-01
    ├── U2OS-02
    └── U2OS-03

Get an idea of all image files like so:

# dataset.path.view_tree()

Query image files

Because we chose not to register each image as a record in the File registry, we have to query the images through the metadata file of the dataset:

df = dataset.file.load()

We can query a subset of images using metadata registries & pandas query syntax:

query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]

query
site_id well_id cell_line split experiment plate well site well_type sirna sirna_id paths path
3066 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png images/test/HEPG2-08/Plate1/M15_s2_w1.png
3067 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png images/test/HEPG2-08/Plate1/M15_s2_w2.png
3068 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png images/test/HEPG2-08/Plate1/M15_s2_w3.png
3069 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png images/test/HEPG2-08/Plate1/M15_s2_w4.png
3070 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png images/test/HEPG2-08/Plate1/M15_s2_w5.png
3071 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png images/test/HEPG2-08/Plate1/M15_s2_w6.png

To access the individual images based on this query result:

images = [dataset.path.parent / key for key in query.path]

images
[GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png')]
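
The `.parent` in the list comprehension matters because the keys in the `path` column already start with `images/`, while `dataset.path` also ends in `images`. The sketch below shows the difference using `pathlib.PurePosixPath` as a stand-in for the cloud path object (an assumption made so that it runs without access to the bucket; the `gs://` scheme is left out for the same reason).

```python
from pathlib import PurePosixPath

# Bucket path without the gs:// scheme, as reported by dataset.path.
root = PurePosixPath("rxrx1-europe-west4/images")
# Relative key as stored in the `path` column of the metadata file.
key = "images/test/HEPG2-08/Plate1/M15_s2_w1.png"

naive = root / key            # repeats the 'images' component
correct = root.parent / key   # joins against the bucket root instead

print(naive)
print(correct)
```

Joining against `root` directly would yield a path with `images/images/`, which does not match the paths listed above.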

Download an image to disk:

path = UPath(images[1])
path.download_to(".")
from IPython.display import Image

Image(f"./{path.name}")

Use DuckDB to query metadata

As an alternative to pandas, we could use DuckDB to query image metadata.

import duckdb

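# Build the filter expression from the lookup objects;
# the metadata column is called `cell_line`, not `cell_type`
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)

# `dataset.file` is the metadata parquet file of the dataset
parquet_data = duckdb.from_parquet(dataset.file.path.as_posix())

parquet_data.filter(filter_expr)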