Query cellxgene-census using TileDB-SOMA#
The first guide queried metadata and h5ad artifacts directly through LaminDB.
This guide uses the TileDB-SOMA API to run similar queries.
Setup#
Initialize a LaminDB instance for storing the queried data:
!lamin init --storage ./test-cellxgene --schema bionty
💡 connected lamindb: testuser1/test-cellxgene
import lamindb as ln
import bionty as bt
import cellxgene_census
census_version = "2023-07-25"
💡 connected lamindb: testuser1/test-cellxgene
Create lookup objects#
We use metadata records in the laminlabs/cellxgene instance to generate lookups:
source = "laminlabs/cellxgene"
human = "homo_sapiens"
features = ln.Feature.using(source).lookup(return_field="name")
assays = bt.ExperimentalFactor.using(source).lookup(return_field="name")
cell_types = bt.CellType.using(source).lookup(return_field="name")
tissues = bt.Tissue.using(source).lookup(return_field="name")
ulabels = ln.ULabel.using(source).lookup()
suspension_types = ulabels.is_suspension_type.children.all().lookup(return_field="name")
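Lookup objects expose record names as attributes, which is why `assays.ln_10x_3_v3` resolves to `10x 3' v3` in the query below. The following is a minimal sketch of that idea, not LaminDB's implementation: `to_attribute_name` and `make_lookup` are hypothetical helpers, and the normalization rule (non-alphanumeric runs become underscores, names starting with a digit get an `ln_` prefix) is an assumption inferred from the attribute names used in this guide.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name, prefix="ln"):
    """Turn a free-text name into a valid, lowercase Python attribute name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Attribute names cannot start with a digit, so add a prefix (assumed rule).
    if not key or key[0].isdigit():
        key = f"{prefix}_{key}"
    return key


def make_lookup(names):
    """Build an object whose attributes map back to the original names."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


assays_demo = make_lookup(["10x 3' v3", "Smart-seq2"])
cell_types_demo = make_lookup(["microglial cell", "neuron"])
print(assays_demo.ln_10x_3_v3)
print(cell_types_demo.microglial_cell)
```

The benefit of this pattern is tab completion: a typo in an attribute name fails immediately instead of silently producing a filter that matches nothing.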
Query data#
value_filter = (
    f'{features.tissue} == "{tissues.brain}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell}", "{cell_types.neuron}"] and'
    f' {features.suspension_type} == "{suspension_types.cell}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter
'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
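The filter is an ordinary string with `==` comparisons and `in [...]` membership tests joined by `and`. As a sketch of how such a string could be assembled from plain values rather than lookups, the hypothetical helper `build_value_filter` below reproduces the expression shown above; it is an illustration only and not part of cellxgene-census or LaminDB.

```python
def build_value_filter(conditions):
    """Join column conditions into a single filter expression.

    A list or tuple becomes an `in [...]` clause, any other value an `==` clause.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            quoted = ", ".join(f'"{item}"' for item in value)
            clauses.append(f"{column} in [{quoted}]")
        else:
            clauses.append(f'{column} == "{value}"')
    return " and ".join(clauses)


value_filter_plain = build_value_filter(
    {
        "tissue": "brain",
        "cell_type": ["microglial cell", "neuron"],
        "suspension_type": "cell",
        "assay": "10x 3' v3",
    }
)
print(value_filter_plain)
```

Values are wrapped in double quotes, so a name containing a single quote such as `10x 3' v3` needs no escaping.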
%%time
with cellxgene_census.open_soma(census_version=census_version) as census:
    # Reads SOMADataFrame as a slice
    cell_metadata = census["census_data"][human].obs.read(value_filter=value_filter)
    # Concatenates results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # Converts to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()
CPU times: user 4.33 s, sys: 1.66 s, total: 6 s
Wall time: 5.41 s
cell_metadata.shape
(66418, 21)
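The query above returns all 21 obs columns. When only a few are needed, the `read` call also accepts `column_names`, which reduces the amount of data transferred. The helper below is a sketch under that assumption; `read_obs_columns` is a hypothetical name, and the import is placed inside the function so that defining it does not require the package to be installed.

```python
def read_obs_columns(census_version, organism, value_filter, column_names):
    """Read selected obs columns for the cells matching value_filter."""
    # Imported lazily: only needed when the helper is actually called.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs = census["census_data"][organism].obs
        # Restrict the read to the requested columns.
        tables = obs.read(value_filter=value_filter, column_names=column_names)
        # Concatenate the Arrow tables and convert them to pandas.
        return tables.concat().to_pandas()


# Example call (requires network access), listing the filtered columns as well:
# read_obs_columns(
#     "2023-07-25",
#     "homo_sapiens",
#     'tissue == "brain" and cell_type == "neuron"',
#     ["tissue", "cell_type", "disease"],
# )
```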
cell_metadata.head()
| | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | is_primary_data | self_reported_ethnicity | self_reported_ethnicity_ontology_term_id | sex | sex_ontology_term_id | suspension_type | tissue | tissue_ontology_term_id | tissue_general | tissue_general_ontology_term_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 29071956 | c888b684-6c51-431f-972a-6c963044cef0 | 10x 3' v3 | EFO:0009922 | microglial cell | CL:0000129 | 68-year-old human stage | HsapDv:0000162 | glioblastoma | MONDO:0018177 | ... | False | unknown | unknown | female | PATO:0000383 | cell | brain | UBERON:0000955 | brain | UBERON:0000955 |
1 | 29071957 | c888b684-6c51-431f-972a-6c963044cef0 | 10x 3' v3 | EFO:0009922 | microglial cell | CL:0000129 | 68-year-old human stage | HsapDv:0000162 | glioblastoma | MONDO:0018177 | ... | False | unknown | unknown | female | PATO:0000383 | cell | brain | UBERON:0000955 | brain | UBERON:0000955 |
2 | 29071964 | c888b684-6c51-431f-972a-6c963044cef0 | 10x 3' v3 | EFO:0009922 | microglial cell | CL:0000129 | 68-year-old human stage | HsapDv:0000162 | glioblastoma | MONDO:0018177 | ... | False | unknown | unknown | female | PATO:0000383 | cell | brain | UBERON:0000955 | brain | UBERON:0000955 |
3 | 29071966 | c888b684-6c51-431f-972a-6c963044cef0 | 10x 3' v3 | EFO:0009922 | microglial cell | CL:0000129 | 68-year-old human stage | HsapDv:0000162 | glioblastoma | MONDO:0018177 | ... | False | unknown | unknown | female | PATO:0000383 | cell | brain | UBERON:0000955 | brain | UBERON:0000955 |
4 | 29071967 | c888b684-6c51-431f-972a-6c963044cef0 | 10x 3' v3 | EFO:0009922 | microglial cell | CL:0000129 | 68-year-old human stage | HsapDv:0000162 | glioblastoma | MONDO:0018177 | ... | False | unknown | unknown | female | PATO:0000383 | cell | brain | UBERON:0000955 | brain | UBERON:0000955 |
5 rows × 21 columns
Create AnnData#
%%time
with cellxgene_census.open_soma(census_version=census_version) as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism=human,
        obs_value_filter=value_filter,
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        },
    )
CPU times: user 36.1 s, sys: 11.3 s, total: 47.4 s
Wall time: 29.8 s
adata.var = adata.var.set_index("feature_id")
adata
AnnData object with n_obs × n_vars = 66418 × 60664
    obs: 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
    var: 'soma_joinid', 'feature_name', 'feature_length'
adata.var.head()
feature_id | soma_joinid | feature_name | feature_length |
---|---|---|---|
ENSG00000121410 | 0 | A1BG | 3999 |
ENSG00000268895 | 1 | A1BG-AS1 | 3374 |
ENSG00000148584 | 2 | A1CF | 9603 |
ENSG00000175899 | 3 | A2M | 6318 |
ENSG00000245105 | 4 | A2M-AS1 | 2948 |
adata.obs.head()
| | assay | cell_type | tissue | disease | suspension_type |
---|---|---|---|---|---|
0 | 10x 3' v3 | microglial cell | brain | glioblastoma | cell |
1 | 10x 3' v3 | microglial cell | brain | glioblastoma | cell |
2 | 10x 3' v3 | microglial cell | brain | glioblastoma | cell |
3 | 10x 3' v3 | microglial cell | brain | glioblastoma | cell |
4 | 10x 3' v3 | microglial cell | brain | glioblastoma | cell |
Register the queried AnnData#
ln.transform.stem_uid = "6oq3VJy5yxIU"
ln.transform.version = "0"
ln.track()
💡 notebook imports: bionty==0.42.9 cellxgene-census==1.13.0 lamindb==0.71.0
💡 saved: Transform(uid='6oq3VJy5yxIU6K79', name='Query cellxgene-census using TileDB-SOMA', key='query-census', version='0', type='notebook', updated_at=2024-05-01 18:50:02 UTC, created_by_id=1)
💡 saved: Run(uid='mhqsIeOhVxpOWa3WLqrG', transform_id=1, created_by_id=1)
Register genes and features:
bt.settings.organism = "human"
genes = bt.Gene.from_values(adata.var_names, field=bt.Gene.ensembl_gene_id)
ln.save(genes)
features = ln.Feature.from_df(adata.obs)
ln.save(features)
❗ did not create Gene records for 147 non-validated ensembl_gene_ids: 'ENSG00000256222', 'ENSG00000273301', 'ENSG00000272370', 'ENSG00000112096', 'ENSG00000256427', 'ENSG00000228906', 'ENSG00000277077', 'ENSG00000272551', 'ENSG00000239446', 'ENSG00000273370', 'ENSG00000261773', 'ENSG00000280095', 'ENSG00000261963', 'ENSG00000261534', 'ENSG00000268955', 'ENSG00000273923', 'ENSG00000279765', 'ENSG00000287388', 'ENSG00000260461', 'ENSG00000227902', ...
Register the AnnData object:
artifact = ln.Artifact.from_anndata(
    adata,
    description=(
        "microglial and neuron cell data from 10x 3' v3 in brain queried from Census"
    ),
)
artifact.save()
Artifact(uid='RbANRPl53UcoG9W8lLna', suffix='.h5ad', accessor='AnnData', description='microglial and neuron cell data from 10x 3' v3 in brain queried from Census', size=674995866, hash='v8QkSfHA4jUocUskUyBSzl', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:23 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)
Link validated metadata:
artifact.features.add_from_anndata(var_field=bt.Gene.ensembl_gene_id)
❗ 147 terms (0.20%) are not validated for ensembl_gene_id: ENSG00000285162, ENSG00000276814, ENSG00000282080, ENSG00000237513, ENSG00000239467, ENSG00000236886, ENSG00000273576, ENSG00000256427, ENSG00000272040, ENSG00000278198, ENSG00000273496, ENSG00000279765, ENSG00000224739, ENSG00000226380, ENSG00000285106, ENSG00000272551, ENSG00000237133, ENSG00000272267, ENSG00000271870, ENSG00000227902, ...
features_remote = ln.Feature.using(source).lookup().dict()
features = ln.Feature.lookup().dict()
for col, orm in {
    "assay": bt.ExperimentalFactor,
    "cell_type": bt.CellType,
    "tissue": bt.Tissue,
    "disease": bt.Disease,
    "suspension_type": ln.ULabel,
}.items():
    labels = orm.from_values(adata.obs[col])
    if len(labels) > 0:
        ln.save(labels)
    else:
        labels = [orm(name=name) for name in adata.obs[col].unique()]
        ln.save(labels)
    artifact.labels.add(labels, features.get(col))
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ did not create ULabel record for 1 non-validated name: 'cell'
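The loop above follows a validate-then-fall-back pattern: labels that can be validated are saved as they are, and if nothing validates (as for the `cell` suspension type), records are created directly from the unique column values. The sketch below mimics that control flow with plain Python sets. `resolve_labels` and the `KNOWN_*` sets are hypothetical stand-ins for the registries, and the real `from_values` does more than a set lookup.

```python
def resolve_labels(values, known_terms):
    """Return (labels, origin) for the unique values of one obs column."""
    unique_values = sorted(set(values))
    # Keep only the values that the (toy) registry recognizes.
    validated = [value for value in unique_values if value in known_terms]
    if validated:
        return validated, "validated"
    # Nothing validated: fall back to creating one label per unique value.
    return unique_values, "created"


KNOWN_CELL_TYPES = {"microglial cell", "neuron"}  # hypothetical ontology terms
KNOWN_ULABELS = set()  # hypothetical: no suspension types registered yet

cell_type_labels = resolve_labels(
    ["neuron", "microglial cell", "neuron"], KNOWN_CELL_TYPES
)
suspension_labels = resolve_labels(["cell", "cell"], KNOWN_ULABELS)
print(cell_type_labels)
print(suspension_labels)
```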
artifact.describe()
Artifact(uid='RbANRPl53UcoG9W8lLna', suffix='.h5ad', accessor='AnnData', description='microglial and neuron cell data from 10x 3' v3 in brain queried from Census', size=674995866, hash='v8QkSfHA4jUocUskUyBSzl', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:50:29 UTC)
Provenance:
  storage: Storage(uid='SNvsFGNNBINT', root='/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene', type='local', instance_uid='5lZgSHvkhwQL')
  transform: Transform(uid='6oq3VJy5yxIU6K79', name='Query cellxgene-census using TileDB-SOMA', key='query-census', version='0', type='notebook')
  run: Run(uid='mhqsIeOhVxpOWa3WLqrG', started_at=2024-05-01 18:50:02 UTC, is_consecutive=True)
  created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
  var: FeatureSet(uid='LAldMPaIO8RuFcmadgnr', n=60517, type='number', registry='bionty.Gene')
    'SMCR8', 'RXFP4', 'RLIMP2', 'MIR4477A', 'GIMAP7', 'FKBP1AP1', 'NR1H4', 'CDR2L', 'LINC01746', 'CXCL5', 'COX7BP6', 'TAAR5', 'RNU4ATAC16P', 'RPL18AP2', 'ZNHIT3', 'RN7SL606P', 'RN7SKP235', 'LAP3P1', 'PCDH9-AS2', 'STK38L', ...
  obs: FeatureSet(uid='xVUCP6IE7EIV36vdp7Rz', n=5, registry='core.Feature')
    assay (1, bionty.ExperimentalFactor): '10x 3' v3'
    cell_type (2, bionty.CellType): 'microglial cell', 'neuron'
    tissue (1, bionty.Tissue): 'brain'
    disease (1, bionty.Disease): 'glioblastoma'
    suspension_type (1, core.ULabel): 'cell'
Labels:
  tissues (1, bionty.Tissue): 'brain'
  cell_types (2, bionty.CellType): 'microglial cell', 'neuron'
  diseases (1, bionty.Disease): 'glioblastoma'
  experimental_factors (1, bionty.ExperimentalFactor): '10x 3' v3'
  ulabels (1, core.ULabel): 'cell'
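The numbers in this output are consistent with the earlier steps: the AnnData object has 60664 variables, 147 Ensembl gene IDs were not validated, and the var feature set reports n=60517. A short check of that bookkeeping, using only figures printed in this guide:

```python
n_vars = 60664         # n_vars from the AnnData summary
n_not_validated = 147  # from the gene validation warning
n_feature_set = 60517  # n reported for the var FeatureSet

# Validated genes are the variables that remain after dropping the rest.
n_validated = n_vars - n_not_validated
print(n_validated == n_feature_set)
```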
artifact.view_lineage()
# clean up test instance
!lamin delete --force test-cellxgene
!rm -r ./test-cellxgene
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.10.14/x64/bin/lamin", line 8, in <module>
sys.exit(main())
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 360, in __call__
return super().__call__(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 152, in main
rv = self.invoke(ctx)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamin_cli/__main__.py", line 103, in delete
return delete(instance, force=force)
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/_delete.py", line 140, in delete
n_objects = check_storage_is_empty(
File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/core/upath.py", line 814, in check_storage_is_empty
raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene/.lamindb contains 2 objects ('./lamindb/_is_initialized' ignored) - delete them prior to deleting the instance
['/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene/.lamindb/RbANRPl53UcoG9W8lLna.h5ad', '/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene/.lamindb/_is_initialized']
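The error message states that the instance storage still holds the registered `.h5ad` file, and that `_is_initialized` is ignored by the check. The sketch below reproduces that kind of check on a temporary directory so the blocking file is easy to spot. `remaining_storage_objects` is a hypothetical helper written for illustration and is not the function used by `lamindb_setup`.

```python
import tempfile
from pathlib import Path


def remaining_storage_objects(storage_root, ignored=("_is_initialized",)):
    """List files under <storage_root>/.lamindb, skipping ignored markers."""
    lamindb_dir = Path(storage_root) / ".lamindb"
    if not lamindb_dir.exists():
        return []
    return sorted(
        path.name
        for path in lamindb_dir.iterdir()
        if path.is_file() and path.name not in ignored
    )


# Recreate the storage layout from the traceback in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    lamindb_dir = Path(tmp) / ".lamindb"
    lamindb_dir.mkdir()
    (lamindb_dir / "_is_initialized").touch()
    (lamindb_dir / "RbANRPl53UcoG9W8lLna.h5ad").touch()
    leftover = remaining_storage_objects(tmp)
    print(leftover)  # the .h5ad file is what blocks instance deletion
```

In the notebook itself, removing the artifact before running `lamin delete` (for instance with `artifact.delete(permanent=True)`, assuming the installed lamindb version supports that argument) should leave this listing empty.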