CELLxGENE: scRNA-seq#
CZ CELLxGENE hosts the world's largest standardized collection of scRNA-seq datasets.
LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, pdfs, notebooks, ML models, …).
You can use the CELLxGENE data in three ways:
In the current guide, you’ll see how to query metadata and data based on AnnData objects.
If you want to use these in your own LaminDB instance, see the transfer guide.
If you’d like to leverage the TileDB-SOMA API for the data subset of CELLxGENE Census, see the Census guide.
If you are interested in building similar data assets in-house:
See the scRNA guide for how to create a growing versioned queryable scRNA-seq dataset.
Reach out if you are interested in a full zero-copy clone of laminlabs/cellxgene to accelerate building your in-house LaminDB instances.
Setup#
Load the public LaminDB instance that mirrors cellxgene on the CLI:
!lamin load laminlabs/cellxgene
💡 loaded instance: laminlabs/cellxgene
import lamindb as ln
import lnschema_bionty as lb
💡 lamindb instance: laminlabs/cellxgene
Query & understand metadata#
Auto-complete metadata#
You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.
Let’s use auto-complete to look up cell types:
cell_types = lb.CellType.lookup()
cell_types.effector_t_cell
CellType(uid='yvHkIrVI', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', updated_at=2023-11-28 22:30:57 UTC, bionty_source_id=48, created_by_id=1)
You can also arbitrarily chain filters and create lookups from them:
organisms = lb.Organism.lookup() # species
genes = lb.Gene.filter(organism=organisms.human).lookup() # ~60k human genes
features = ln.Feature.lookup() # non-gene features, like `cell_type`, `assay`, etc.
experimental_factors = lb.ExperimentalFactor.lookup() # labels for experimental factors
tissues = lb.Tissue.lookup() # tissue labels
ulabels = ln.ULabel.lookup() # universal labels, e.g. dataset collections
suspension_types = ulabels.is_suspension_type.children.all().lookup()
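The attribute names on a lookup object are derived from record names. The sketch below is an illustrative approximation, not LaminDB's actual implementation: `to_lookup_key` and `build_lookup` are hypothetical helpers that assume names are lower-cased, runs of non-alphanumeric characters become underscores, and names starting with a digit get an `ln_` prefix (consistent with `ln_10x_3_v2` used later in this guide).

```python
import re
from types import SimpleNamespace


def to_lookup_key(name: str) -> str:
    """Hypothetical normalization of a record name into a Python attribute name."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name.lower()).strip("_")
    # Python identifiers cannot start with a digit, so prefix those names.
    if key and key[0].isdigit():
        key = f"ln_{key}"
    return key


def build_lookup(names):
    """Return an object whose attributes map normalized keys to original names."""
    return SimpleNamespace(**{to_lookup_key(n): n for n in names})


lookup = build_lookup(["effector T cell", "dendritic cell", "10x 3' v2"])
print(lookup.effector_t_cell)  # effector T cell
print(lookup.ln_10x_3_v2)      # 10x 3' v2
```

In an interactive session, typing `lookup.` and pressing tab would then offer these keys for auto-completion.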
Search & filter metadata#
We can use search & filters for metadata:
lb.CellType.search("effector T cell")
name | uid | synonyms | score |
---|---|---|---|
effector T cell | yvHkIrVI | effector T-cell|effector T-lymphocyte|effector... | 100.0 |
ectodermal cell | e2QmwdvB | ectoderm cell | 71.4 |
helper T cell | TwCkoWgT | helper T-lymphocyte|T-helper cell|helper T lym... | 71.4 |
memory T cell | Re00kg0W | memory T-cell|memory T lymphocyte|memory T-lym... | 71.4 |
sensory receptor cell | j0WdHDdi | receptor cell | 71.4 |
excretory cell | AA00OTcM | | 69.0 |
secretory cell | wVT2qeb9 | | 69.0 |
neurectodermal cell | KjesToYa | neurectoderm cell | 68.8 |
pro-T cell | XaWRfcwg | pro-T lymphocyte|progenitor T cell | 68.8 |
regulatory T cell | Z7uMAWUF | regulatory T lymphocyte|Treg|regulatory T-lymp... | 68.8 |
Kupffer cell | YN0gzDt3 | hepatic macrophage|macrophagocytus stellatus|l... | 66.7 |
chemoreceptor cell | 9lDVTP4o | | 66.7 |
follicular B cell | FMTngXKK | Fo B cell|follicular B lymphocyte|follicular B... | 66.7 |
lb.CellType.search("CD8-positive cytokine effector T cell")
name | uid | synonyms | score |
---|---|---|---|
CD8-positive, alpha-beta cytokine secreting effector T cell | pam4JjkW | CD8-positive, alpha-beta cytokine secreting ef... | 77.1 |
CD4-positive helper T cell | oyjZhi4K | CD4-positive T-helper cell|CD4-positive helper... | 69.8 |
CD8-positive, alpha-beta T cell | VnKkQsME | CD8-positive, alpha-beta T-cell|CD8-positive, ... | 67.6 |
CD8-positive, alpha-beta cytotoxic T cell | baEuJabx | CD8-positive, alpha-beta cytotoxic T-cell|CD8-... | 66.7 |
CD8-positive, alpha-beta memory T cell | 9FR0LnTI | CD8-positive, alpha-beta memory T lymphocyte|C... | 66.7 |
CD1c-positive myeloid dendritic cell | gXOMeVM0 | | 65.8 |
CD141-positive myeloid dendritic cell | dRUgw2Fo | | 64.9 |
CD4-positive, alpha-beta T cell | 05vQoepH | CD4-positive, alpha-beta T lymphocyte|CD4-posi... | 64.7 |
CD4-positive, alpha-beta cytotoxic T cell | 3sKh2cA7 | CD4-positive, alpha-beta cytotoxic T-cell|CD4-... | 64.1 |
CD34-positive, CD38-negative hematopoietic stem cell | Tf2NM0hD | CD133-positive hematopoietic stem cell | 64.0 |
And use a uid to filter exactly one metadata record:
effector_t_cell = lb.CellType.filter(uid="yvHkIrVI").one()
effector_t_cell
CellType(uid='yvHkIrVI', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', updated_at=2023-11-28 22:30:57 UTC, bionty_source_id=48, created_by_id=1)
Understand ontologies#
View the surrounding ontology terms:
effector_t_cell.view_parents(distance=2, with_children=True)
Or access them programmatically:
effector_t_cell.children.df()
id | uid | name | ontology_id | abbr | synonyms | description | bionty_source_id | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|
931 | o9T53Uso | effector CD8-positive, alpha-beta T cell | CL:0001050 | None | effector CD8-positive, alpha-beta T lymphocyte... | A Cd8-Positive, Alpha-Beta T Cell With The Phe... | 48 | 2023-11-28 22:27:55.565981+00:00 | 1 |
1088 | tQZFurra | effector CD4-positive, alpha-beta T cell | CL:0001044 | None | effector CD4-positive, alpha-beta T lymphocyte... | A Cd4-Positive, Alpha-Beta T Cell With The Phe... | 48 | 2023-11-28 22:27:55.569832+00:00 | 1 |
1229 | 7roaTzhI | exhausted T cell | CL:0011025 | None | Tex cell|An effector T cell that displays impa... | None | 48 | 2023-11-28 22:27:55.572884+00:00 | 1 |
1309 | OxsmyL44 | cytotoxic T cell | CL:0000910 | None | cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... | A Mature T Cell That Differentiated And Acquir... | 48 | 2023-11-28 22:27:55.575444+00:00 | 1 |
1331 | TwCkoWgT | helper T cell | CL:0000912 | None | helper T-lymphocyte|T-helper cell|helper T lym... | A Effector T Cell That Provides Help In The Fo... | 48 | 2023-11-28 22:27:55.575955+00:00 | 1 |
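Because child terms are more specific than their parent, a file annotated only with `helper T cell` would presumably not match an exact filter on `effector T cell`. A common pattern is to expand a term to itself plus its descendants before filtering. The sketch below mimics this with a plain dictionary built from the children table above; `CHILDREN` and `expand_term` are illustrative names, not part of LaminDB.

```python
# Toy parent -> children mapping, copied from the children table above.
CHILDREN = {
    "effector T cell": [
        "effector CD8-positive, alpha-beta T cell",
        "effector CD4-positive, alpha-beta T cell",
        "exhausted T cell",
        "cytotoxic T cell",
        "helper T cell",
    ],
}


def expand_term(name, children=CHILDREN):
    """Return the term followed by all of its descendants, without duplicates."""
    expanded, stack, seen = [], [name], set()
    while stack:
        term = stack.pop()
        if term in seen:  # a term can be reachable via several parents
            continue
        seen.add(term)
        expanded.append(term)
        # Push children in reverse so they are visited in their listed order.
        stack.extend(reversed(children.get(term, [])))
    return expanded


terms = expand_term("effector T cell")
print(len(terms))  # 6: the parent plus its five direct children
```

Against the real instance, the expanded records could then be passed to a `cell_types__in=...` filter like the one used in the file query of this guide.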
Query files#
Unlike in the SOMA guide, here, we’ll query sets of h5ad files, which correspond to AnnData objects.
To access them, we query the Dataset record that links the latest LTS set of h5ad files:
dataset = ln.Dataset.filter(name="cellxgene-census", version="2023-07-25").one()
dataset
Dataset(uid='OirHTWDrudY2TYltvIX1', name='cellxgene-census', version='2023-07-25', hash='pEJ9uvIeTLvHkZW2TBT5', visibility=1, updated_at=2023-11-28 21:46:40 UTC, transform_id=11, run_id=16, created_by_id=1)
You can get all linked files as a DataFrame; there are 850 files in cellxgene-census version 2023-07-25.
dataset.files.df().head() # not tracking run & transform because read-only instance
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1029 | 6IYilXiyiTxZYMCJ2TnY | 2 | cell-census/2023-07-25/h5ads/2fb24a91-55b9-4cc... | .h5ad | AnnData | High Resolution Slide-seqV2 Spatial Transcript... | None | 8856712 | BXH-IIW1Et1CyugN0DMroQ-2 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:45:38.554629+00:00 | 1 |
872 | vEw6vGy47Zi0Qj6T6YJr | 2 | cell-census/2023-07-25/h5ads/0041b9c3-6a49-4bf... | .h5ad | AnnData | Tabula Sapiens | None | 198592773 | 0tEolD_cGXenPjobh1M8Gw-24 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:44:26.759174+00:00 | 1 |
873 | dptYEcjH6o3p9Vy1qZAp | 2 | cell-census/2023-07-25/h5ads/00476f9f-ebc1-4b7... | .h5ad | AnnData | Human Brain Cell Atlas v1.0 | None | 131643578 | HCQOV1VHonILymJHLkcNdg-16 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:44:27.440055+00:00 | 1 |
875 | bittNWi0gJTdcJ0pm9Jo | 2 | cell-census/2023-07-25/h5ads/00ff600e-6e2e-4d7... | .h5ad | AnnData | Single-cell analysis of human B cell maturatio... | None | 5919670 | PxGgTrFmiCh6AMwiu1fHWw | md5 | 11 | 16 | None | 1 | False | 2023-11-28 22:44:28.116237+00:00 | 1 |
876 | HQPT59lX80spJyfKXDC5 | 2 | cell-census/2023-07-25/h5ads/01209dce-3575-4be... | .h5ad | AnnData | Single-cell transcriptomics of human T cells r... | None | 312536917 | zAlluOa2WUIWvs2jkXKvkQ-38 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:44:28.784784+00:00 | 1 |
You can query across files by arbitrary metadata combinations, for instance:
query = dataset.files.filter(
    organism=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size").distinct() # order by size, drop duplicates
query.df().head() # convert to DataFrame
id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
983 | WwmBIhBNLTlRcSoBky88 | 2 | cell-census/2023-07-25/h5ads/20d87640-4be8-487... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 44647761 | dAApZI2IZr64F5b1jDMgtA-6 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:45:31.292961+00:00 | 1 |
1019 | gHlQ5Muwu3G9pvFC4GDV | 2 | cell-census/2023-07-25/h5ads/2d31c0ca-0233-41c... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 64056560 | YjLm7iPkIFIEYimgQEfJSA-8 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:45:52.133169+00:00 | 1 |
1382 | P4Oai3OLGAzRwoicQ5HD | 2 | cell-census/2023-07-25/h5ads/9ea768a2-87ab-46b... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 192484358 | odAyLe_6uoRCQV5eJRijqQ-23 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:49:46.348257+00:00 | 1 |
932 | DSpevwaIl5E2jIWHp0uR | 2 | cell-census/2023-07-25/h5ads/105c7dad-0468-462... | .h5ad | AnnData | Single-cell transcriptomes from human kidneys ... | None | 232722706 | 3yOOhI-gP3TlpyLcDNUTBA-28 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:53:48.624548+00:00 | 1 |
1579 | 11HQaMeIUaOwyHoOjEVN | 2 | cell-census/2023-07-25/h5ads/d7dcfd8f-2ee7-438... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 341214674 | R8-G4h5ztVfX29r58T4g_Q-41 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:52:00.821957+00:00 | 1 |
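The keyword arguments of `filter()` are combined with AND, while `cell_types__in` matches files linked to any of the listed cell types. Since LaminDB builds on the Django ORM, a file linked to both cell types can appear once per matching link, which is presumably why the query above ends with `.distinct()`. The pure-Python sketch below imitates this behavior on made-up link rows; all file names and annotations in it are hypothetical.

```python
# Hypothetical link table: one row per (file, cell type) annotation.
links = [
    ("file_a", "dendritic cell"),
    ("file_a", "neutrophil"),
    ("file_b", "neutrophil"),
    ("file_c", "B cell"),
]
# Hypothetical tissue annotation, simplified to one value per file.
tissue_of = {"file_a": "kidney", "file_b": "lung", "file_c": "kidney"}

wanted_cell_types = {"dendritic cell", "neutrophil"}

# The join yields one hit per matching link row, so file_a shows up twice.
hits = [
    file
    for file, cell_type in links
    if cell_type in wanted_cell_types  # OR within `__in`
    and tissue_of[file] == "kidney"    # AND across keyword arguments
]
print(hits)  # ['file_a', 'file_a']

# Equivalent of `.distinct()`: drop duplicates while keeping the order.
distinct_hits = list(dict.fromkeys(hits))
print(distinct_hits)  # ['file_a']
```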
Query arrays#
Each file stores an array in the form of an annotated data matrix, an AnnData object.
Let’s look at the first array in the file query and show metadata using .describe():
file = query.first()
file.describe()
File(uid='WwmBIhBNLTlRcSoBky88', key='cell-census/2023-07-25/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad', suffix='.h5ad', accessor='AnnData', description='Spatiotemporal immune zonation of the human kidney', size=44647761, hash='dAApZI2IZr64F5b1jDMgtA-6', hash_type='md5-n', visibility=1, key_is_virtual=False, updated_at=2023-11-28 22:45:31 UTC)
Provenance:
🗃️ storage: Storage(uid='oIYGbD74', root='s3://cellxgene-data-public', type='s3', region='us-west-2', updated_at=2023-10-16 15:04:08 UTC, created_by_id=1)
📔 transform: Transform(uid='pNa7RdI26sp4z8', name='Register files from Census release 2023-07-25', short_name='census-release-2023-07-25', version='0', type='notebook', updated_at=2023-11-29 13:53:43 UTC, latest_report_id=1724, source_file_id=1723, created_by_id=1)
👣 run: Run(uid='ZYgsnqK5v2hPmFlS0kfG', run_at=2023-11-29 13:52:08 UTC, is_consecutive=False, transform_id=11, created_by_id=1, report_id=1724)
👤 created_by: User(uid='kmvZDIX9', handle='sunnyosun', name='Sunny Sun', updated_at=2023-11-28 21:14:48 UTC)
⬇️ input_of (core.Run): ['2023-11-29 12:51:05 UTC']
Features:
obs: FeatureSet(uid='kwKICViF5O3QjHdg0nov', name='obs features', n=9, type='category', registry='core.Feature', hash='Bx10EzvDxdlAVjqVKdKC', updated_at=2023-11-29 09:28:28 UTC, created_by_id=1)
🔗 assay (1, bionty.ExperimentalFactor): '10x 3' v2'
🔗 cell_type (12, bionty.CellType): 'CD8-positive, alpha-beta T cell', 'mature NK T cell', 'CD4-positive, alpha-beta T cell', 'natural killer cell', 'non-classical monocyte', 'plasmacytoid dendritic cell', 'neutrophil', 'B cell', 'kidney resident macrophage', 'dendritic cell', ...
🔗 development_stage (12, bionty.DevelopmentalStage): '2-year-old human stage', '4-year-old human stage', '12-year-old human stage', '44-year-old human stage', '49-year-old human stage', '53-year-old human stage', '63-year-old human stage', '64-year-old human stage', '67-year-old human stage', '70-year-old human stage', ...
🔗 disease (1, bionty.Disease): 'normal'
🔗 donor_id (13, core.ULabel): 'TxK2', 'Wilms1', 'TxK4', 'TTx', 'RCC3', 'RCC1', 'VHL', 'TxK3', 'TxK1', 'Wilms3', ...
🔗 self_reported_ethnicity (1, bionty.Ethnicity): 'unknown'
🔗 sex (2, bionty.Phenotype): 'male', 'female'
🔗 suspension_type (1, core.ULabel): 'cell'
🔗 tissue (5, bionty.Tissue): 'renal medulla', 'kidney blood vessel', 'renal pelvis', 'cortex of kidney', 'kidney'
external: FeatureSet(uid='zIgncie4AywRKgLmKHUW', name='external features', n=2, type='category', registry='core.Feature', hash='5E4xD6tOhDB5EOnLx3tv', updated_at=2023-11-29 09:28:20 UTC, created_by_id=1)
🔗 organism (1, bionty.Organism): 'human'
🔗 collection (1, core.ULabel): 'Spatiotemporal immune zonation of the human kidney'
var: FeatureSet(uid='8AAiWbuUrP2DI1MpuPD0', n=32922, type='number', registry='bionty.Gene', hash='fHMWMViqV_PilN1PWrgF', updated_at=2023-11-29 13:28:55 UTC, created_by_id=1)
'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'None', 'None', 'None', 'DDX11L17', 'WASH9P', 'None', 'None', 'None', 'None', 'None', 'None', 'LINC01409', 'FAM87B', 'LINC00115', 'FAM41C', 'None', ...
Labels:
🏷️ organism (1, bionty.Organism): 'human'
🏷️ tissues (5, bionty.Tissue): 'renal medulla', 'kidney blood vessel', 'renal pelvis', 'cortex of kidney', 'kidney'
🏷️ cell_types (12, bionty.CellType): 'CD8-positive, alpha-beta T cell', 'mature NK T cell', 'CD4-positive, alpha-beta T cell', 'natural killer cell', 'non-classical monocyte', 'plasmacytoid dendritic cell', 'neutrophil', 'B cell', 'kidney resident macrophage', 'dendritic cell', ...
🏷️ diseases (1, bionty.Disease): 'normal'
🏷️ phenotypes (2, bionty.Phenotype): 'male', 'female'
🏷️ experimental_factors (1, bionty.ExperimentalFactor): '10x 3' v2'
🏷️ developmental_stages (12, bionty.DevelopmentalStage): '2-year-old human stage', '4-year-old human stage', '12-year-old human stage', '44-year-old human stage', '49-year-old human stage', '53-year-old human stage', '63-year-old human stage', '64-year-old human stage', '67-year-old human stage', '70-year-old human stage', ...
🏷️ ethnicities (1, bionty.Ethnicity): 'unknown'
🏷️ ulabels (15, core.ULabel): 'Spatiotemporal immune zonation of the human kidney', 'TxK2', 'Wilms1', 'TxK4', 'TTx', 'RCC3', 'RCC1', 'VHL', 'TxK3', 'TxK1', ...
More ways of accessing metadata
Access just features:
file.features
Or get labels given a feature:
file.labels.get(features.tissue).df()
file.labels.get(features.collection).one()
If you want to query a slice of the array data, you have two options:
Cache & load the entire array into memory via file.load() -> AnnData (caches the h5ad on disk, so that you only download once)
Stream the array from the cloud using a cloud-backed accessor file.backed() -> AnnDataAccessor
Both options will run much faster if you run them close to the data (AWS S3 on the US West Coast, consider logging into hosted compute there).
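One way to pick between the two options is by file size, which is available as the `size` field (presumably in bytes) in the tables above. The helper below is only a heuristic sketch: `choose_access` and the 100 MiB threshold are assumptions made for illustration, not LaminDB recommendations.

```python
def choose_access(size_bytes: int, threshold_bytes: int = 100 * 1024**2) -> str:
    """Return 'load' for files below the threshold and 'backed' otherwise."""
    return "load" if size_bytes < threshold_bytes else "backed"


# Sizes taken from the smallest and largest file in the query result above.
print(choose_access(44_647_761))   # load
print(choose_access(341_214_674))  # backed
```

With a real record, the returned value would decide whether to call `file.load()` or `file.backed()`.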
1. Cache & load#
Let us first consider option 1:
adata = file.load()
adata
AnnData object with n_obs × n_vars = 7803 × 32922
obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
uns: 'default_embedding', 'schema_version', 'title'
obsm: 'X_umap'
Now we have an AnnData object, which stores observation annotations matching our file-level query in the .obs slot, and we can re-use almost the same query on the array-level:
See the file-level query for comparison
query = dataset.files.filter(
    organism=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
AnnData uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
View of AnnData object with n_obs × n_vars = 199 × 32922
obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
uns: 'default_embedding', 'schema_version', 'title'
obsm: 'X_umap'
2. Stream#
Let us now consider option 2:
adata_backed = file.backed()
adata_backed
AnnDataAccessor object with n_obs × n_vars = 7803 × 32922
constructed for the AnnData object 20d87640-4be8-487f-93d4-dce38378d00f.h5ad
obs: ['Experiment', 'Project', '_index', 'assay', 'assay_ontology_term_id', 'author_cell_type', 'cell_type', 'cell_type_ontology_term_id', 'compartment', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_age', 'donor_id', 'is_primary_data', 'library_uuid', 'mapped_reference_annotation', 'organism', 'organism_ontology_term_id', 'reported_diseases', 'sample_uuid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'suspension_uuid', 'tissue', 'tissue_ontology_term_id']
obsm: ['X_umap']
raw: ['X', 'var', 'varm']
uns: ['default_embedding', 'schema_version', 'title']
var: ['_index', 'feature_biotype', 'feature_is_filtered', 'feature_name', 'feature_reference']
We now have an AnnDataAccessor object, which behaves much like an AnnData, and the query looks the same:
adata_backed_slice = adata_backed[
    adata_backed.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata_backed.obs.tissue == tissues.kidney.name)
    & (adata_backed.obs.suspension_type == suspension_types.cell.name)
    & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_backed_slice.to_memory()
AnnData object with n_obs × n_vars = 199 × 32922
obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
uns: 'default_embedding', 'schema_version', 'title'
obsm: 'X_umap'
3. Concatenate slices#
To concatenate these individual file-level slices, loop over all files in query, slice each one, and concatenate the results.
What would this look like?
import anndata as ad

adata_slices = []
for file in query:
    adata_backed = file.backed()
    adata_slice = adata_backed[
        adata_backed.obs.cell_type.isin(
            [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
        )
        & (adata_backed.obs.tissue == tissues.kidney.name)
        & (adata_backed.obs.suspension_type == suspension_types.cell.name)
        & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
    ]
    adata_slices.append(adata_slice.to_memory())
adata_query = ad.concat(adata_slices)
(LaminDB will track data lineage if we store the concatenated result as a new File or Dataset.)
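Note that `anndata.concat` uses `join="inner"` by default, so only variables present in every slice are kept, whereas `join="outer"` keeps the union and fills in missing values. The sketch below illustrates the difference with plain sets of made-up gene identifiers rather than real `AnnData` objects.

```python
from functools import reduce

# Hypothetical variable (gene) names of three slices.
var_names_per_slice = [
    {"gene_a", "gene_b", "gene_c"},
    {"gene_b", "gene_c", "gene_d"},
    {"gene_c", "gene_b"},
]

# join="inner": variables shared by all slices.
inner = reduce(set.intersection, var_names_per_slice)
# join="outer": variables present in at least one slice.
outer = reduce(set.union, var_names_per_slice)

print(sorted(inner))  # ['gene_b', 'gene_c']
print(sorted(outer))  # ['gene_a', 'gene_b', 'gene_c', 'gene_d']
```

If the files in a query were annotated against different gene sets, checking the number of variables before and after concatenation is a quick way to see how much the inner join dropped.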
Train an ML model#
Exploring data by collection#
Alternatively, you can search for a file on the LaminHub UI and fetch it through:
ln.File.filter(uid="...").one()
Or you can query for a collection you found on CZ CELLxGENE Discover.
Let’s search the collections from CELLxGENE:
ulabels.is_collection.search("immune human kidney", limit=10)
name | uid | score |
---|---|---|
Spatiotemporal immune zonation of the human kidney | iBsTRZPg | 55.1 |
mouse_HAKYY | 1tpv6c10 | 53.3 |
mouse_HKIEN | gvzn29mX | 53.3 |
Human-WT-D | kTlCuYMA | 48.3 |
mouse_EUNBK | vva626sq | 46.7 |
mouse_HBTAE | hSuaL0T3 | 46.7 |
mouse_KFMKE | 6k4F0ucU | 46.7 |
mouse_SQUNI | 25VbKfwE | 46.7 |
mouse_UAOAE | 1H5vbbjE | 46.7 |
mouse_WANEU | oetUq9Ie | 46.7 |
Let’s get the full metadata record of the top hit collection:
collection_iBsTRZPg = ln.ULabel.filter(uid="iBsTRZPg").one()
collection_iBsTRZPg
ULabel(uid='iBsTRZPg', name='Spatiotemporal immune zonation of the human kidney', description='10.1126/science.aat5031', reference='120e86b4-1195-48c5-845b-b98054105eec', reference_type='collection_id', updated_at=2023-11-28 21:50:41 UTC, created_by_id=1)
We see it’s a Science paper and we could find more information using the DOI or CELLxGENE collection id.
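As a small sketch, the two identifiers stored on the record can be turned into links. The `https://doi.org/` resolver prefix is standard; the CELLxGENE collection URL pattern used here is an assumption that is not taken from this guide and may change. Both helper names are hypothetical.

```python
def doi_url(doi: str) -> str:
    """Build a resolver link for a DOI."""
    return f"https://doi.org/{doi}"


def collection_url(collection_id: str) -> str:
    """Build a CELLxGENE Discover link (URL pattern is an assumption)."""
    return f"https://cellxgene.cziscience.com/collections/{collection_id}"


# Values copied from the `description` and `reference` fields of the record above.
print(doi_url("10.1126/science.aat5031"))
print(collection_url("120e86b4-1195-48c5-845b-b98054105eec"))
```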
Each collection has at least one File associated with it. Let’s query the files for this collection:
ln.File.filter(ulabels=collection_iBsTRZPg).df()
id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1579 | 11HQaMeIUaOwyHoOjEVN | 2 | cell-census/2023-07-25/h5ads/d7dcfd8f-2ee7-438... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 341214674 | R8-G4h5ztVfX29r58T4g_Q-41 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:52:00.821957+00:00 | 1 |
1513 | 6mnZ3SeQFhffr3wTiMEZ | 2 | cell-census/2023-07-25/h5ads/c52de62a-058d-4d7... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 109942751 | Pqa4Ln0Xt7xmTN5IMiU4OA-14 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:51:15.788096+00:00 | 1 |
1382 | P4Oai3OLGAzRwoicQ5HD | 2 | cell-census/2023-07-25/h5ads/9ea768a2-87ab-46b... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 192484358 | odAyLe_6uoRCQV5eJRijqQ-23 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:49:46.348257+00:00 | 1 |
1030 | USUgRVwrCMquHiImAnnJ | 2 | cell-census/2023-07-25/h5ads/2fc9c59f-3cfd-48d... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 39294782 | rXmzfcuICx72PcUvYHsOiA-5 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:45:39.238212+00:00 | 1 |
1019 | gHlQ5Muwu3G9pvFC4GDV | 2 | cell-census/2023-07-25/h5ads/2d31c0ca-0233-41c... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 64056560 | YjLm7iPkIFIEYimgQEfJSA-8 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:45:52.133169+00:00 | 1 |
983 | WwmBIhBNLTlRcSoBky88 | 2 | cell-census/2023-07-25/h5ads/20d87640-4be8-487... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 44647761 | dAApZI2IZr64F5b1jDMgtA-6 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:45:31.292961+00:00 | 1 |
906 | b2x19Eg28GGSNnXWVa1m | 2 | cell-census/2023-07-25/h5ads/08073b32-d389-41f... | .h5ad | AnnData | Spatiotemporal immune zonation of the human ki... | None | 159545411 | e8gqdcJCy_gsp6sZ_8OI7Q-20 | md5-n | 11 | 16 | None | 1 | False | 2023-11-28 22:44:43.041536+00:00 | 1 |
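Before loading or caching all files of a collection, it can help to estimate the total download volume. The sketch below sums the `size` values from the table above (assumed to be bytes); `format_size` is a hypothetical helper, and on the real query result the same sum could presumably be obtained from the `size` column of the DataFrame.

```python
# `size` values of the seven files listed above, assumed to be in bytes.
sizes = [
    341_214_674,
    109_942_751,
    192_484_358,
    39_294_782,
    64_056_560,
    44_647_761,
    159_545_411,
]


def format_size(n_bytes: int) -> str:
    """Format a byte count as megabytes (1 MB = 10**6 bytes)."""
    return f"{n_bytes / 10**6:.1f} MB"


total = sum(sizes)
print(total)               # 951186297
print(format_size(total))  # 951.2 MB
```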