RxRx: cell imaging#
rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.
Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro based on morphology and protein expression (5-10 stains) across a range of conditions.
In this guide, you'll see how to query some of these data using LaminDB: laminlabs/rxrx.
If you'd like to transfer data into your own LaminDB instance, see the transfer guide.
If you'd like to understand how the laminlabs/rxrx instance was curated, see this repository.
Setup#
!lamin load laminlabs/rxrx
💡 loaded instance: laminlabs/rxrx
import lamindb as ln
import lnschema_bionty as lb
import lnschema_lamin1 as ln1
from lamindb import UPath
💡 lamindb instance: laminlabs/rxrx
Search & look up metadata#
We'll find all treatments in the Treatment registry:
df = ln1.Treatment.filter().df()
df.shape
(1139, 12)
Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:
sirnas = ln1.Treatment.filter(system="siRNA").lookup(return_field="name")
We're also interested in features, cell lines & wells:
features = ln.Feature.lookup(return_field="name")
cell_lines = lb.CellLine.lookup(return_field="abbr")
wells = ln1.Well.lookup(return_field="name")
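To make the idea of a lookup object concrete, here is a minimal sketch in plain Python. It is not LaminDB's implementation: the helper `make_lookup` and its key-normalization rule (lowercase, non-alphanumeric characters replaced by underscores) are assumptions for illustration, and the abbreviations in the toy records are inferred from the experiment names in this guide.

```python
import re
from types import SimpleNamespace


def make_lookup(records, key_field, return_field):
    """Hypothetical helper: expose records as attributes for auto-completion."""
    entries = {}
    for record in records:
        # Normalize the key into a valid Python identifier (assumed rule).
        key = re.sub(r"\W+", "_", record[key_field]).strip("_").lower()
        if key and key[0].isdigit():
            key = f"_{key}"
        entries[key] = record[return_field]
    return SimpleNamespace(**entries)


# Toy records; the abbreviations are assumptions, not queried from the instance.
toy_cell_lines = [
    {"name": "Hep G2 cell", "abbr": "HEPG2"},
    {"name": "U-2 OS cell", "abbr": "U2OS"},
]
toy_lookup = make_lookup(toy_cell_lines, key_field="name", return_field="abbr")
print(toy_lookup.hep_g2_cell)
```

Typing `toy_lookup.` in an interactive session would then offer `hep_g2_cell` and `u_2_os_cell` as completions, which is the convenience the registry lookups provide.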
Load the dataset#
In this instance, there is only a single dataset:
ln.Dataset.filter().df()
id | uid | name | description | version | hash | reference | reference_type | transform_id | run_id | file_id | storage_id | initial_version_id | visibility | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | flLeukogmLRzleFCpCRD | RxRx1 images | None | 1 | 4on4AbbmBL0sr0xe9_gxxQ | None | None | 4 | 3 | 6 | 2 | None | 1 | 2023-11-15 08:47:53.842900+00:00 | 1 |
This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains; the image dimension is 512x512x6.
Let us get the corresponding object and some information about it:
dataset = ln.Dataset.filter(uid="flLeukogmLRzleFCpCRD").one()
dataset.view_flow()
dataset.describe()
Dataset(uid='flLeukogmLRzleFCpCRD', name='RxRx1 images', version='1', hash='4on4AbbmBL0sr0xe9_gxxQ', visibility=1, updated_at=2023-11-15 08:47:53 UTC)
Provenance:
📝 transform: Transform(uid='Zo0qJt4IQPsbxM', name='Register the RxRx1 dataset', short_name='02-rxrx1', version='1', type='notebook', updated_at=2023-11-15 06:16:01 UTC, latest_report_id=11, source_file_id=10, initial_version_id=3, created_by_id=1)
👣 run: Run(uid='1wQSEWx8oK23GxLlAIHj', run_at=2023-11-15 05:48:27 UTC, is_consecutive=True, transform_id=4, created_by_id=1, report_id=11)
📄 file: File(uid='flLeukogmLRzleFCpCRD', suffix='.parquet', accessor='DataFrame', description='Metadata with file paths for each RxRx1 image.', size=7806003, hash='4on4AbbmBL0sr0xe9_gxxQ', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2023-11-15 08:47:53 UTC, storage_id=1, transform_id=3, run_id=2, created_by_id=1)
🗃️ storage: Storage(uid='aLcXXffe', root='gs://rxrx1-europe-west4/images', type='gs', updated_at=2023-11-15 05:55:24 UTC, created_by_id=1)
👤 created_by: User(uid='FBa7SHjn', handle='falexwolf', name='Alex Wolf', updated_at=2023-11-05 14:57:57 UTC)
Features:
columns: FeatureSet(uid='y0uhIW520iTEzrxI14mL', n=8, registry='core.Feature', hash='mnhzsJj-j7VZgNJ88VM0', updated_at=2023-11-05 17:57:58 UTC, created_by_id=1)
🔗 cell_line (4, bionty.CellLine): 'HUV-EC-C cell', 'U-2 OS cell', 'hTERT RPE-1 cell', 'Hep G2 cell'
🔗 split (2, core.ULabel): 'test', 'train'
🔗 experiment (51, lamin1.Experiment): 'HEPG2-08', 'HEPG2-09', 'HEPG2-10', 'HEPG2-11', 'HUVEC-17', 'HUVEC-18', 'HUVEC-19', 'HUVEC-20', 'HUVEC-21', 'HUVEC-22', ...
plate (number)
🔗 well (308, lamin1.Well): 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', ...
🔗 well_type (3, core.ULabel): 'negative_control', 'treatment', 'positive_control'
🔗 sirna (1139, lamin1.Treatment): 'EMPTY', 's21721', 's20894', 's19827', 's19792', 's19935', 's21398', 's223097', 's348', 's19975', ...
path (object)
external: FeatureSet(uid='JWcRhQneUEwjkoYnEf59', n=1, registry='core.Feature', hash='AF42DRsoUacKb5l4WgT-', updated_at=2023-11-15 05:55:14 UTC, created_by_id=1)
🔗 readout (1, bionty.ExperimentalFactor): 'high content screen'
Labels:
🏷️ cell_lines (4, bionty.CellLine): 'HUV-EC-C cell', 'U-2 OS cell', 'hTERT RPE-1 cell', 'Hep G2 cell'
🏷️ experimental_factors (1, bionty.ExperimentalFactor): 'high content screen'
🏷️ experiments (51, lamin1.Experiment): 'HEPG2-08', 'HEPG2-09', 'HEPG2-10', 'HEPG2-11', 'HUVEC-17', 'HUVEC-18', 'HUVEC-19', 'HUVEC-20', 'HUVEC-21', 'HUVEC-22', ...
🏷️ wells (308, lamin1.Well): 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', ...
🏷️ treatments (1139, lamin1.Treatment): 'EMPTY', 's21721', 's20894', 's19827', 's19792', 's19935', 's21398', 's223097', 's348', 's19975', ...
🏷️ ulabels (9, core.ULabel): 'test', 'train', 'Plate1', 'Plate2', 'Plate3', 'Plate4', 'negative_control', 'treatment', 'positive_control'
The dataset consists of a metadata file and a folder path pointing to the image files:
dataset.file.load().head()
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | paths | path |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png | images/test/HEPG2-08/Plate1/B02_s1_w1.png |
1 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png | images/test/HEPG2-08/Plate1/B02_s1_w2.png |
2 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png | images/test/HEPG2-08/Plate1/B02_s1_w3.png |
3 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png | images/test/HEPG2-08/Plate1/B02_s1_w4.png |
4 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png | images/test/HEPG2-08/Plate1/B02_s1_w5.png |
dataset.path
GCSPath('gs://rxrx1-europe-west4/images')
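In the metadata table, the `paths` column abbreviates the channel files of a site as a range (`w1-w6`), while the `path` column holds one channel file per row. A minimal sketch of expanding such a range, assuming the `_w<first>-w<last>.png` naming seen in the table; `expand_channels` is a hypothetical helper and not part of LaminDB:

```python
import re


def expand_channels(pattern):
    """Expand '..._w1-w6.png' into one path per channel (hypothetical helper)."""
    match = re.fullmatch(r"(.*_w)(\d+)-w(\d+)(\.\w+)", pattern)
    if match is None:
        # Not a range pattern: return it unchanged as a single path.
        return [pattern]
    prefix, first, last, suffix = match.groups()
    return [f"{prefix}{i}{suffix}" for i in range(int(first), int(last) + 1)]


channel_paths = expand_channels("images/test/HEPG2-08/Plate1/B02_s1_w1-w6.png")
print(channel_paths[0])
print(len(channel_paths))
```

Each expanded entry has the same form as a value in the `path` column, so the two columns carry the same information at different granularity.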
We can get an idea of the folder structure like so:
dataset.path.view_tree(level=2)
images (53 sub-directories & 0 files):
├── test
│   ├── HEPG2-08
│   ├── HEPG2-09
│   ├── HEPG2-10
│   ├── HEPG2-11
│   ├── HUVEC-17
│   ├── HUVEC-18
│   ├── HUVEC-19
│   ├── HUVEC-20
│   ├── HUVEC-21
│   ├── HUVEC-22
│   ├── HUVEC-23
│   ├── HUVEC-24
│   ├── RPE-08
│   ├── RPE-09
│   ├── RPE-10
│   ├── RPE-11
│   ├── U2OS-04
│   └── U2OS-05
└── train
    ├── HEPG2-01
    ├── HEPG2-02
    ├── HEPG2-03
    ├── HEPG2-04
    ├── HEPG2-05
    ├── HEPG2-06
    ├── HEPG2-07
    ├── HUVEC-01
    ├── HUVEC-02
    ├── HUVEC-03
    ├── HUVEC-04
    ├── HUVEC-05
    ├── HUVEC-06
    ├── HUVEC-07
    ├── HUVEC-08
    ├── HUVEC-09
    ├── HUVEC-10
    ├── HUVEC-11
    ├── HUVEC-12
    ├── HUVEC-13
    ├── HUVEC-14
    ├── HUVEC-15
    ├── HUVEC-16
    ├── RPE-01
    ├── RPE-02
    ├── RPE-03
    ├── RPE-04
    ├── RPE-05
    ├── RPE-06
    ├── RPE-07
    ├── U2OS-01
    ├── U2OS-02
    └── U2OS-03
Get an idea of all image files like so:
# dataset.path.view_tree()
Query image files#
Because we chose not to register each image as a record in the File registry, we have to query the images through the metadata file of the dataset:
df = dataset.file.load()
We can query a subset of images using the lookup objects from the metadata registries & pandas boolean indexing:
query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | paths | path |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3066 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png | images/test/HEPG2-08/Plate1/M15_s2_w1.png |
3067 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png | images/test/HEPG2-08/Plate1/M15_s2_w2.png |
3068 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png | images/test/HEPG2-08/Plate1/M15_s2_w3.png |
3069 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png | images/test/HEPG2-08/Plate1/M15_s2_w4.png |
3070 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png | images/test/HEPG2-08/Plate1/M15_s2_w5.png |
3071 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1-w6.png | images/test/HEPG2-08/Plate1/M15_s2_w6.png |
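The query returns six rows because the metadata table has one row per channel image (`w1` to `w6`) of a site. A small sketch, in plain Python rather than pandas, of collecting the channel files per `site_id`; the rows are copied from the result above and limited to two channels for brevity:

```python
from collections import defaultdict

# Two of the six rows from the query result (only the columns needed here).
rows = [
    {"site_id": "HEPG2-08_1_M15_2", "path": "images/test/HEPG2-08/Plate1/M15_s2_w1.png"},
    {"site_id": "HEPG2-08_1_M15_2", "path": "images/test/HEPG2-08/Plate1/M15_s2_w2.png"},
]

# Map each imaging site to the list of its channel files.
channels_by_site = defaultdict(list)
for row in rows:
    channels_by_site[row["site_id"]].append(row["path"])

for site_id, paths in channels_by_site.items():
    print(site_id, len(paths))
```

With the full query result, the single site would map to six channel files.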
To access the individual images based on this query result:
images = [dataset.path.parent / key for key in query.path]
images
[GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png'),
GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png'),
GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png'),
GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png'),
GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png'),
GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png')]
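The join uses `dataset.path.parent` because `dataset.path` already ends in `images` and every key in the `path` column also starts with `images/`. The sketch below uses the standard library's `PurePosixPath` as a stand-in for the cloud path object (an assumption made so that the example runs without cloud access) to show the difference:

```python
from pathlib import PurePosixPath

# Stand-in for dataset.path without the gs:// scheme (assumption for illustration).
root = PurePosixPath("/rxrx1-europe-west4/images")
key = "images/test/HEPG2-08/Plate1/M15_s2_w1.png"

# Joining on the root itself repeats the "images" folder ...
duplicated = root / key
# ... while joining on the parent yields the intended location.
resolved = root.parent / key

print(duplicated)
print(resolved)
```

The second path matches the structure of the `GCSPath` entries listed above, while the first contains `images/images`.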
Download an image to disk:
path = UPath(images[1])
path.download_to(".")
from IPython.display import Image
Image(f"./{path.name}")

Use DuckDB to query metadata
As an alternative to pandas, we could use DuckDB to query image metadata.
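import duckdb

# Build the filter from registry lookups; numeric columns are compared as numbers.
# `site` is referenced by its column name because it is not among the dataset's registered features.
condition = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == 1 and site == 2"
)
# Read the dataset's metadata file and apply the filter.
parquet_data = duckdb.from_parquet(dataset.file.path.as_posix())
parquet_data.filter(condition)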
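String filters like the one above are easy to get wrong when values are quoted by hand. Here is a minimal sketch of a helper that assembles the condition from a dictionary, quoting strings and leaving numbers bare. `build_filter` is a hypothetical helper rather than a DuckDB or LaminDB function, and it assumes trusted input because it only escapes single quotes:

```python
def build_filter(conditions):
    """Join column/value pairs into an SQL-style condition (hypothetical helper)."""
    parts = []
    for column, value in conditions.items():
        if isinstance(value, str):
            # Escape embedded single quotes by doubling them, as in SQL.
            escaped = value.replace("'", "''")
            parts.append(f"{column} = '{escaped}'")
        else:
            parts.append(f"{column} = {value}")
    return " and ".join(parts)


condition = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(condition)
```

The resulting string can be passed to the relation's `filter` method in place of a hand-written condition.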