RxRx: cell imaging#

rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.

Large sets of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.

  • In this guide, you’ll see how to query some of these data using LaminDB: laminlabs/rxrx.

  • If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

  • If you’d like to understand how the laminlabs/rxrx instance was curated, see this repository.

Setup#

import lamindb as ln
import bionty as bt
import wetlab as wl

ln.connect("laminlabs/lamindata")

Search & look up metadata#

We’ll find all treatments in the Treatment registry:

df = wl.Treatment.df()
df.shape
(1139, 13)

Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:

sirnas = wl.Treatment.filter(system="siRNA").lookup(return_field="name")

We’re also interested in features, cell lines & wells:

ln.Feature.df()
id   uid           name        type      unit  description  registries                     synonyms  created_at                        updated_at                        created_by_id
135  UPnuN18Vro7T  sirna       float     None  None         wetlab.Treatment               None      2023-07-12 12:54:25.605932+00:00  2024-03-26 13:23:37.050138+00:00  2
134  RFz9tVF39RXJ  well_type   float     None  None         core.ULabel                    None      2023-07-12 12:54:25.605879+00:00  2024-03-26 13:20:59.284093+00:00  2
132  ghhC57uNYQhD  well        float     None  None         wetlab.Well                    None      2023-07-12 12:54:25.605769+00:00  2024-03-26 13:20:57.352241+00:00  2
131  gUecWT2bNsch  plate       float     None  None         core.ULabel                    None      2023-07-12 12:54:25.605717+00:00  2024-03-26 13:20:13.255028+00:00  2
303  4ycwa8er0EB2  experiment  category  None  None         core.ULabel|wetlab.Experiment  None      2023-07-12 12:54:25.605663+00:00  2024-03-26 13:20:11.349207+00:00  2
...  ...           ...         ...       ...   ...          ...                            ...       ...                               ...                               ...
5    b1oB0I2Nxx7w  feature_4   float     None  None         None                           None      2023-07-12 12:54:24.401456+00:00  2023-10-14 15:42:03.557973+00:00  2
4    qehni2DU75bT  feature_3   float     None  None         None                           None      2023-07-12 12:54:24.401441+00:00  2023-10-14 15:42:03.431243+00:00  2
3    cANjhBnEosz7  feature_2   float     None  None         None                           None      2023-07-12 12:54:24.401425+00:00  2023-10-14 15:42:03.306655+00:00  2
2    RhHNXlP1jpqi  feature_1   float     None  None         None                           None      2023-07-12 12:54:24.401408+00:00  2023-10-14 15:42:03.181750+00:00  2
1    UwWDQLrCTdks  feature_0   float     None  None         None                           None      2023-07-12 12:54:24.401373+00:00  2023-10-14 15:42:03.055457+00:00  2

311 rows × 10 columns

cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
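
Attribute names on a lookup object, such as `cell_lines.hep_g2_cell`, appear to be derived from record names. The following is a minimal sketch of that idea in plain Python, not LaminDB's actual implementation: the helpers `to_attribute` and `make_lookup` are hypothetical, and the normalization rule (lowercase, runs of non-alphanumeric characters become underscores) is an assumption.

```python
import re
from types import SimpleNamespace


def to_attribute(name: str) -> str:
    """Turn a record name into a valid Python attribute name (assumed rule)."""
    # Lowercase and collapse every run of non-alphanumeric characters into "_"
    attr = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # Identifiers cannot start with a digit, so prefix those with "_"
    return f"_{attr}" if attr[:1].isdigit() else attr


def make_lookup(names):
    """Build an object whose attributes map back to the original names."""
    return SimpleNamespace(**{to_attribute(n): n for n in names})


# Cell line names that appear in the RxRx1 collection
cell_line_names = ["U-2 OS cell", "Hep G2 cell", "HUV-EC-C cell", "hTERT RPE-1 cell"]
lookup = make_lookup(cell_line_names)
print(lookup.hep_g2_cell)
```

In an interactive session, typing `lookup.` and pressing tab would then list the available records, which is what makes lookups convenient for building queries.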

Load the collection#

This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains, with image dimensions of 512x512x6.

Let us get the corresponding object and some information about it:

collection = ln.Collection.filter(uid="KMEQhAvRQDXLvNTNWlsT").one()
collection.view_lineage()
collection.describe()
Collection(uid='KMEQhAvRQDXLvNTNWlsT', name='Annotated RxRx1 images', version='1', hash='jKVAYzd5in11dWtr-C0M7g', visibility=1, updated_at=2024-03-26 13:24:48 UTC)

Provenance:
  📎 transform: Transform(uid='Zo0qJt4IQPsb5zKv', name='Ingest the RxRx1 dataset', key='02-rxrx1-ingest', version='1', type='notebook')
  📎 run: Run(uid='o7nwbuGqaY65aZ6jzmrt', started_at=2024-03-26 13:13:16 UTC, is_consecutive=True)
  📎 artifact: Artifact(uid='KMEQhAvRQDXLvNTNWlsT', key='rxrx1/metadata.parquet', suffix='.parquet', accessor='DataFrame', description='Metadata with file paths for each RxRx1 image.', size=5722206, hash='jKVAYzd5in11dWtr-C0M7g', hash_type='md5', visibility=1, key_is_virtual=True)
  📎 created_by: User(uid='FBa7SHjn', handle='falexwolf', name='Alex Wolf')
Features:
  columns: FeatureSet(uid='E58U5AxvUTGmMnE5P4iT', n=11, registry='core.Feature')
    well_id (float)
    site (float)
    sirna_id (float)
    well (float)
    well_type (float)
    sirna (float)
    path (category)
    🔗 cell_line (4, bionty.CellLine): 'U-2 OS cell', 'Hep G2 cell', 'HUV-EC-C cell', 'hTERT RPE-1 cell'
    🔗 split (2, core.ULabel): 'train', 'test'
    🔗 experiment (core.ULabel|wetlab.Experiment)
        🔗 experiment (0, core.ULabel): 
        🔗 experiment (0, wetlab.Experiment): 
    plate (float)
  external: FeatureSet(uid='jyMP6a3aI8kKzVZquyFK', n=1, registry='core.Feature')
    🔗 readout (0, bionty.ExperimentalFactor): 
Labels:
  📎 cell_lines (4, bionty.CellLine): 'U-2 OS cell', 'Hep G2 cell', 'HUV-EC-C cell', 'hTERT RPE-1 cell'
  📎 ulabels (2, core.ULabel): 'train', 'test'

The dataset consists of a metadata file and a folder path pointing to the image files:

collection.artifact.load().head()
   site_id           well_id         cell_line  split  experiment  plate  well  site  well_type         sirna  sirna_id  path
0  HEPG2-08_1_B02_1  HEPG2-08_1_B02  HEPG2      test   HEPG2-08    1      B02   1     negative_control  EMPTY  1138      images/test/HEPG2-08/Plate1/B02_s1_w1.png
1  HEPG2-08_1_B02_1  HEPG2-08_1_B02  HEPG2      test   HEPG2-08    1      B02   1     negative_control  EMPTY  1138      images/test/HEPG2-08/Plate1/B02_s1_w2.png
2  HEPG2-08_1_B02_1  HEPG2-08_1_B02  HEPG2      test   HEPG2-08    1      B02   1     negative_control  EMPTY  1138      images/test/HEPG2-08/Plate1/B02_s1_w3.png
3  HEPG2-08_1_B02_1  HEPG2-08_1_B02  HEPG2      test   HEPG2-08    1      B02   1     negative_control  EMPTY  1138      images/test/HEPG2-08/Plate1/B02_s1_w4.png
4  HEPG2-08_1_B02_1  HEPG2-08_1_B02  HEPG2      test   HEPG2-08    1      B02   1     negative_control  EMPTY  1138      images/test/HEPG2-08/Plate1/B02_s1_w5.png
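
The image paths in the preview seem to repeat several metadata columns (split, experiment, plate, well, site) and add a `w1` to `w5` suffix. The sketch below parses one such path with a regular expression. The pattern is inferred only from the five rows shown, the function `parse_image_path` is hypothetical, and reading the `w` suffix as the imaging channel is an assumption.

```python
import re

# Pattern inferred from the paths shown in the metadata preview
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]+\d+)_s(?P<site>\d+)_w(?P<channel>\d+)\.png"
)


def parse_image_path(path: str) -> dict:
    """Split an image path into its metadata components."""
    match = PATH_PATTERN.fullmatch(path)
    if match is None:
        raise ValueError(f"Unexpected path layout: {path}")
    fields = match.groupdict()
    # Convert the numeric components to integers
    for key in ("plate", "site", "channel"):
        fields[key] = int(fields[key])
    return fields


parsed = parse_image_path("images/test/HEPG2-08/Plate1/B02_s1_w1.png")
print(parsed)
```

For the first row, the parsed split, experiment, plate, well and site agree with the corresponding metadata columns, so the path could serve as a consistency check on the table.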

Query image files#

Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:

# df = collection.artifact.load()

We can query a subset of images using metadata registries & pandas query syntax:

# query = df[
#     (df.cell_line == cell_lines.hep_g2_cell)
#     & (df.sirna == sirnas.s15652)
#     & (df.well == wells.m15)
#     & (df.plate == 1)
#     & (df.site == 2)
# ]
# query

To access the individual images based on this query result:

# images = [collection.artifact.path.parent / key for key in query.path]
# images
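
The idea is that the `path` column holds paths relative to the folder that contains the metadata file. A minimal sketch with `pathlib`, using the artifact key `rxrx1/metadata.parquet` from the collection description and two relative paths from the metadata preview; the actual storage root of the instance is deliberately left out here.

```python
from pathlib import PurePosixPath

# Key of the metadata artifact as shown by collection.describe()
metadata_key = PurePosixPath("rxrx1/metadata.parquet")

# Relative image paths as they appear in the `path` column
relative_paths = [
    "images/test/HEPG2-08/Plate1/B02_s1_w1.png",
    "images/test/HEPG2-08/Plate1/B02_s1_w2.png",
]

# Resolve each relative path against the folder that holds the metadata file
image_keys = [metadata_key.parent / rel for rel in relative_paths]
print(image_keys[0])
```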

Download an image to disk:

# from upath import UPath  # from the universal_pathlib package
# from IPython.display import Image
#
# path = UPath(images[1])
# path.download_to(".")
# Image(f"./{path.name}")

Use DuckDB to query metadata

As an alternative to pandas, we could use DuckDB to query image metadata.

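import duckdb

# Assumption: `features` is a lookup over the Feature registry, so that
# column names can be auto-completed
features = ln.Feature.lookup(return_field="name")

# Build the filter expression from looked-up column names and values
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)

# Read the metadata file of the collection as a DuckDB relation
parquet_data = duckdb.from_parquet(collection.artifact.path.as_posix())

parquet_data.filter(filter_expr)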
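
The filter passed to DuckDB is a plain string of equality conditions joined with `and`. As a sketch of how such a string could be assembled more safely, the hypothetical helper `build_filter` below takes a mapping from column names to values and escapes single quotes in the values. It does not validate column names, and the example values are taken from the first row of the metadata preview.

```python
def build_filter(conditions: dict) -> str:
    """Join column/value pairs into an `and`-separated equality filter."""
    clauses = []
    for column, value in conditions.items():
        # Escape single quotes so that values cannot break out of the literal
        escaped = str(value).replace("'", "''")
        clauses.append(f"{column} == '{escaped}'")
    return " and ".join(clauses)


# Values taken from the first row of the metadata preview
filter_expr = build_filter(
    {"cell_line": "HEPG2", "well": "B02", "plate": 1, "site": 1}
)
print(filter_expr)
```

The resulting string has the same shape as the hand-written filter, so it could be passed to the relation's `filter` method in the same way.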