Jupyter Notebook Binder

Bulk RNA-seq#

Note

More comprehensive examples are provided for these data types:

Setup#

!lamin init --storage test-bulkrna --schema bionty
Hide code cell output
💡 connected lamindb: testuser1/test-bulkrna
import lamindb as ln
from pathlib import Path
import bionty as bt
import pandas as pd
import anndata as ad
💡 connected lamindb: testuser1/test-bulkrna

Ingest data#

Access #

We start by simulating a nf-core RNA-seq run which yields us a count matrix artifact.

(See Nextflow for running this with Nextflow.)

# pretend we're running a bulk RNA-seq pipeline
ln.track(transform=ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq"))
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.core.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.Artifact(path, description="Merged Bulk RNA counts").save()
Hide code cell output
💡 saved: Transform(uid='aANI03cRVzKqJmuV', name='nf-core RNA-seq', type='pipeline', reference='https://nf-co.re/rnaseq', updated_at=2024-05-01 18:52:50 UTC, created_by_id=1)
💡 saved: Run(uid='dqYpMm5pdGH1FrnYk9H1', transform_id=1, created_by_id=1)
Artifact(uid='vb6cI4jNM58AieqzmTYl', key='output_dir/salmon.merged.gene_counts.tsv', suffix='.tsv', description='Merged Bulk RNA counts', size=3787, hash='xxw0k3au3KtxFcgtbEr4eQ', hash_type='md5', visibility=1, key_is_virtual=False, updated_at=2024-05-01 18:52:53 UTC, storage_id=1, transform_id=1, run_id=1, created_by_id=1)

Transform #

ln.settings.transform.stem_uid = "s5V0dNMVwL9i"
ln.settings.transform.version = "0"
ln.track()
💡 notebook imports: anndata==0.9.2 bionty==0.42.9 lamindb==0.71.0 pandas==1.5.3
💡 saved: Transform(uid='s5V0dNMVwL9i6K79', name='Bulk RNA-seq', key='bulkrna', version='0', type='notebook', updated_at=2024-05-01 18:52:53 UTC, created_by_id=1)
💡 saved: Run(uid='VAtnM2oRC3r1MLf4Bcgy', transform_id=2, created_by_id=1)

Let’s query the artifact:

artifact = ln.Artifact.filter(description="Merged Bulk RNA counts").one()
df = artifact.load()

If we look at it, we realize it deviates far from the tidy data standard Wickham14, conventions of statistics & machine learning Hastie09, Murphy12 and the major Python & R data packages.

Variables are not in columns and observations are not in rows:

df
gene_id gene_name RAP1_IAA_30M_REP1 RAP1_UNINDUCED_REP1 RAP1_UNINDUCED_REP2 WT_REP1 WT_REP2
0 Gfp_transgene_gene Gfp_transgene_gene 0.0 0.000 0.0 0.0 0.0
1 HRA1 HRA1 0.0 8.572 0.0 0.0 0.0
2 snR18 snR18 3.0 8.000 4.0 8.0 8.0
3 tA(UGC)A TGA1 0.0 0.000 0.0 0.0 0.0
4 tL(CAA)A SUP56 0.0 0.000 0.0 0.0 0.0
... ... ... ... ... ... ... ...
120 YAR064W YAR064W 0.0 2.000 0.0 0.0 0.0
121 YAR066W YAR066W 3.0 13.000 8.0 5.0 11.0
122 YAR068W YAR068W 9.0 28.000 24.0 5.0 7.0
123 YAR069C YAR069C 0.0 0.000 0.0 0.0 1.0
124 YAR070C YAR070C 0.0 0.000 0.0 0.0 0.0

125 rows × 7 columns

Let’s change that and move observations into rows:

df = df.T

df
0 1 2 3 4 5 6 7 8 9 ... 115 116 117 118 119 120 121 122 123 124
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W ... YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
gene_name Gfp_transgene_gene HRA1 snR18 TGA1 SUP56 TRN1 tS(AGA)A TFC3 VPS8 EFB1 ... FLO1 YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.0 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 ... 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 ... 15.72 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.0 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 ... 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.0 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 ... 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.0 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 ... 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0

7 rows × 125 columns

Now, it’s clear that the first two rows are in fact no observations, but descriptions of the variables (or features) themselves.

Let’s create an AnnData object to model this. First, create a dataframe for the variables:

var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_name
gene_id
Gfp_transgene_gene Gfp_transgene_gene
HRA1 HRA1
snR18 snR18
tA(UGC)A TGA1
tL(CAA)A SUP56

Now, let’s create an AnnData:

# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)

adata
AnnData object with n_obs × n_vars = 5 × 125
    var: 'gene_name'

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

adata.to_df()
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W ... YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.000 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 ... 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 ... 15.720 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.000 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 ... 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.000 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 ... 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.000 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 ... 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0

5 rows × 125 columns

Validate #

Let’s create a Artifact object from this AnnData.

Almost all gene IDs are validated:

genes = bt.Gene.from_values(
    adata.var.index,
    bt.Gene.stable_id,
    organism="saccharomyces cerevisiae",  # or set globally with bt.settings.organism
)
Hide code cell output
did not create Gene records for 2 non-validated stable_ids: 'Gfp_transgene_gene', 'YAR062W'
# also register the 2 non-validated genes obtained from Bionty
ln.save(genes)

Register #

efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()
curated_file = ln.Artifact.from_anndata(
    adata,
    description="Curated bulk RNA counts"
)

Hence, let’s save this artifact:

curated_file.save()
Hide code cell output
Artifact(uid='DretKqnrDBQlPedlojKN', suffix='.h5ad', accessor='AnnData', description='Curated bulk RNA counts', size=28180, hash='6bieh8XjOCCz6bJToN4u1g', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:52:55 UTC, storage_id=1, transform_id=2, run_id=2, created_by_id=1)

Link to validated metadata records:

curated_file.features.add_from_anndata(var_field=bt.Gene.stable_id, organism="saccharomyces cerevisiae")
2 terms (1.60%) are not validated for stable_id: Gfp_transgene_gene, YAR062W
curated_file.labels.add(efs.rna_seq, features.assay)
curated_file.labels.add(organism.saccharomyces_cerevisiae, features.organism)
curated_file.describe()
Artifact(uid='DretKqnrDBQlPedlojKN', suffix='.h5ad', accessor='AnnData', description='Curated bulk RNA counts', size=28180, hash='6bieh8XjOCCz6bJToN4u1g', hash_type='md5', visibility=1, key_is_virtual=True, updated_at=2024-05-01 18:52:55 UTC)

Provenance:
  📎 storage: Storage(uid='YVS4MVWsyeKw', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna', type='local', instance_uid='5LuhnQ7frNiw')
  📎 transform: Transform(uid='s5V0dNMVwL9i6K79', name='Bulk RNA-seq', key='bulkrna', version='0', type='notebook')
  📎 run: Run(uid='VAtnM2oRC3r1MLf4Bcgy', started_at=2024-05-01 18:52:53 UTC, is_consecutive=True)
  📎 created_by: User(uid='DzTjkKse', handle='testuser1', name='Test User1')
Features:
  var: FeatureSet(uid='vXGuNKSaqsb4fbNLgWU6', n=123, type='number', registry='bionty.Gene')
    'SNC1', 'DRS2', 'GCV3', 'MST28', 'UIP3', 'LDS1', 'SPO7', 'SPC72', 'NUP60', 'CYC3', 'GPB2', 'GIP4', 'ADE1', 'EFB1', 'GDH3', 'CNE1', 'DEP1', 'FUN14', 'CCR4', 'YAT1', ...
  external: FeatureSet(uid='0oE3waOjNonjOoWkElZ2', n=2, registry='core.Feature')
    🔗 assay (1, bionty.ExperimentalFactor): 'RNA-Seq'
    🔗 organism (1, bionty.Organism): 'saccharomyces cerevisiae'
Labels:
  📎 organism (1, bionty.Organism): 'saccharomyces cerevisiae'
  📎 experimental_factors (1, bionty.ExperimentalFactor): 'RNA-Seq'

Query data#

We have two files in the artifact registry:

ln.Artifact.df()
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
2 DretKqnrDBQlPedlojKN 1 None .h5ad AnnData Curated bulk RNA counts None 28180 6bieh8XjOCCz6bJToN4u1g md5 None None 2 2 1 True 2024-05-01 18:52:55.407547+00:00 2024-05-01 18:52:55.444065+00:00 1
1 vb6cI4jNM58AieqzmTYl 1 output_dir/salmon.merged.gene_counts.tsv .tsv None Merged Bulk RNA counts None 3787 xxw0k3au3KtxFcgtbEr4eQ md5 None None 1 1 1 False 2024-05-01 18:52:53.587236+00:00 2024-05-01 18:52:53.587268+00:00 1
curated_file.view_lineage()
_images/1d5c52adca0a3ddbe10f8b15de47e98f1efab05409d0a65edb1c84811006104c.svg

Let’s by query by gene:

genes = bt.Gene.lookup()
genes.spo7
Gene(uid='2pkcLeMEB6aS', symbol='SPO7', stable_id='YAL009W', ncbi_gene_ids='851224', biotype='protein_coding', description='Putative regulatory subunit of Nem1p-Spo7p phosphatase holoenzyme; regulates nuclear growth by controlling phospholipid biosynthesis, required for normal nuclear envelope morphology, premeiotic replication, and sporulation ', synonyms='', updated_at=2024-05-01 18:52:55 UTC, organism_id=1, public_source_id=13, created_by_id=1)
# a gene set containing SPO7
feature_set = ln.FeatureSet.filter(genes=genes.spo7).first()
# artifacts that link to this feature set
ln.Artifact.filter(feature_sets=feature_set).df()
uid storage_id key suffix accessor description version size hash hash_type n_objects n_observations transform_id run_id visibility key_is_virtual created_at updated_at created_by_id
id
2 DretKqnrDBQlPedlojKN 1 None .h5ad AnnData Curated bulk RNA counts None 28180 6bieh8XjOCCz6bJToN4u1g md5 None None 2 2 1 True 2024-05-01 18:52:55.407547+00:00 2024-05-01 18:52:55.444065+00:00 1
# clean up test instance
!lamin delete --force test-bulkrna
!rm -r test-bulkrna
Hide code cell output
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.14/x64/bin/lamin", line 8, in <module>
    sys.exit(main())
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 360, in __call__
    return super().__call__(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/rich_click/rich_command.py", line 152, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamin_cli/__main__.py", line 103, in delete
    return delete(instance, force=force)
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/_delete.py", line 140, in delete
    n_objects = check_storage_is_empty(
  File "/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/lamindb_setup/core/upath.py", line 814, in check_storage_is_empty
    raise InstanceNotEmpty(message)
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna/.lamindb contains 2 objects ('./lamindb/_is_initialized'  ignored) - delete them prior to deleting the instance
['/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna/.lamindb/DretKqnrDBQlPedlojKN.h5ad', '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna/.lamindb/_is_initialized']