Gene Ontology (GO)#

Pathways are interconnected molecular networks of signaling cascades that govern critical cellular processes. They help explain the mechanisms of cellular behavior and offer insight into disease progression and treatment response. In an R&D organization, managing pathways across different datasets is crucial for identifying potential therapeutic targets and intervention strategies.

In this notebook, we manage a pathway registry based on the “2023 GO Biological Process” ontology. We’ll walk through the steps of registering pathways and linking them to genes.

In the following notebook, Standardize metadata on-the-fly, we’ll demonstrate how to perform a pathway enrichment analysis and track the dataset with LaminDB.

Setup#

!lamin init --storage ./use-cases-registries --schema bionty
💡 connected lamindb: testuser1/use-cases-registries
import lamindb as ln
import bionty as bt
import gseapy as gp

bt.settings.organism = "human"  # globally set organism
💡 connected lamindb: testuser1/use-cases-registries

Fetch GO pathways annotated with human genes using Enrichr#

First, we fetch the “GO_Biological_Process_2023” pathways for human using GSEApy, which wraps GSEA and Enrichr.

go_bp = gp.get_library(name="GO_Biological_Process_2023", organism="Human")
print(f"Number of pathways {len(go_bp)}")
Number of pathways 5406
go_bp["ATF6-mediated Unfolded Protein Response (GO:0036500)"]
['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF']

Parse the ontology ID out of each key and convert the library into the format {ontology_id: (name, genes)}:

def parse_ontology_id_from_keys(key):
    """Parse out the ontology id.

    "ATF6-mediated Unfolded Protein Response (GO:0036500)" -> ("GO:0036500", "ATF6-mediated Unfolded Protein Response")
    """
    # split on the last " (" only, so names containing parentheses stay intact
    name, ontology_id = key.rsplit(" (", 1)
    ontology_id = ontology_id.rstrip(")")
    return ontology_id, name


go_bp_parsed = {}

for key, genes in go_bp.items():
    ontology_id, name = parse_ontology_id_from_keys(key)
    go_bp_parsed[ontology_id] = (name, genes)
go_bp_parsed["GO:0036500"]
('ATF6-mediated Unfolded Protein Response',
 ['MBTPS1', 'MBTPS2', 'XBP1', 'ATF6B', 'DDIT3', 'CREBZF'])
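Because the key is split on the last " (" only, a pathway name that itself contains parentheses should survive parsing. The following is a minimal, self-contained sketch of that behavior on toy data. The second key, its ID, and its gene are made up purely to exercise the edge case and do not come from the Enrichr library.

```python
def parse_ontology_id_from_keys(key):
    """Split "<name> (<ontology id>)" into (ontology_id, name)."""
    name, ontology_id = key.rsplit(" (", 1)  # split on the last " (" only
    return ontology_id.rstrip(")"), name


# Toy stand-in for the Enrichr library; the second entry is hypothetical
toy_library = {
    "ATF6-mediated Unfolded Protein Response (GO:0036500)": ["MBTPS1", "XBP1"],
    "Made-up Process (With Qualifier) (GO:9999999)": ["GENE1"],
}

toy_parsed = {}
for key, genes in toy_library.items():
    ontology_id, name = parse_ontology_id_from_keys(key)
    toy_parsed[ontology_id] = (name, genes)

print(toy_parsed["GO:9999999"])
```

Using `split` instead of `rsplit` would break the name at its first parenthesis, which is why the right-hand split is the safer choice here.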

Register pathway ontology in LaminDB#

bionty = bt.Pathway.public()
bionty
PublicOntology
Entity: Pathway
Organism: all
Source: go, 2023-05-10
#terms: 47514

📖 .df(): ontology reference table
🔎 .lookup(): autocompletion of terms
🎯 .search(): free text search of terms
✅ .validate(): strictly validate values
🧐 .inspect(): full inspection of values
👽 .standardize(): convert to standardized names
🪜 .diff(): difference between two versions
🔗 .to_pronto(): Pronto.Ontology object

Next, we register all the pathways and genes in LaminDB to finally link pathways to genes.
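Linking pathways to genes is a many-to-many relationship: one pathway contains many genes, and one gene can belong to many pathways. As a plain-Python illustration of that structure (not the LaminDB API), the sketch below inverts a toy mapping in the {ontology_id: (name, genes)} shape used above. The IDs and pathway names are invented for the example.

```python
from collections import defaultdict

# Toy data in the {ontology_id: (name, genes)} shape; IDs and names are made up
toy_parsed = {
    "GO:0000001": ("Pathway A", ["XBP1", "DDIT3"]),
    "GO:0000002": ("Pathway B", ["DDIT3", "ATF6B"]),
}

# Invert pathway -> genes into gene -> pathways
gene_to_pathways = defaultdict(list)
for ontology_id, (name, genes) in toy_parsed.items():
    for gene in genes:
        gene_to_pathways[gene].append(ontology_id)

print(dict(gene_to_pathways))
```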

Register pathway terms#

To register the pathways, we use .from_values to parse the annotated GO pathway ontology IDs directly into LaminDB.

pathway_records = bt.Pathway.from_values(go_bp_parsed.keys(), bt.Pathway.ontology_id)
ln.save(pathway_records, parents=False)  # not recursing through parents

Register gene symbols#

Similarly, we use .from_values to register all pathway-associated genes in LaminDB.

all_genes = bt.Gene.standardize(list({g for genes in go_bp.values() for g in genes}))
gene_records = bt.Gene.from_values(all_genes, bt.Gene.symbol)
ln.save(gene_records);
❗ found 40 synonyms in Bionty: ['C6ORF89', 'C1ORF109', 'C17ORF99', 'C12ORF29', 'C20ORF173', 'C10ORF71', 'C12ORF57', 'C3ORF33', 'C8ORF17', 'C1ORF68', 'C9ORF72', 'C19ORF12', 'C8ORF88', 'C3ORF70', 'C3ORF38', 'C21ORF91', 'SLC9A3R1', 'C17ORF75', 'HSPB11', 'C1ORF43', 'C12ORF50', 'C11ORF80', 'C1ORF146', 'C12ORF4', 'TRB', 'C15ORF62', 'C11ORF65', 'C1ORF56', 'C2ORF69', 'C1ORF112', 'C17ORF97', 'C18ORF32', 'C18ORF25', 'SLC9A3R2', 'C2ORF49', 'C6ORF15', 'C10ORF90', 'C9ORF78', 'C1ORF131', 'PDZD3']
   please add corresponding Gene records via `.from_values(['C1orf131', 'C3orf70', 'C20orf173', 'NHERF1', 'C12orf50', 'C1orf146', 'C1orf109', 'C15orf62', 'C8orf88', 'C17orf97', 'C3orf38', 'C18orf25', 'C6orf15', 'C11orf65', 'C1orf68', 'C12orf29', 'C9orf72', 'C1orf112', 'C19orf12', 'C21orf91', 'C8orf17', 'C18orf32', 'C12orf4', 'NHERF2', 'C1orf43', 'C1orf56', 'IFT25', 'C10orf90', 'C11orf80', 'C9orf78', 'C6orf89', 'C3orf33', 'C17orf75', 'NHERF4', 'THRB', 'C17orf99', 'C2orf49', 'C2orf69', 'C12orf57', 'C10orf71'])`
❗ ambiguous validation in Bionty for 1082 records: 'CALB2', 'HERC2P3', 'ARHGAP27', 'LEUTX', 'RPL7A', 'RRP7BP', 'NCF1B', 'TRARG1', 'CYP11A1', 'DDX11L8', 'ZBTB12', 'DUSP8', 'CNTNAP2', 'NIPA1', 'TJP1', 'APOB', 'LRP6', 'FTCD', 'SMDT1', 'DUSP29', ...
did not create Gene records for 37 non-validated symbols: 'TRL-AAG2-3', 'MTRNR2L1', 'LOC102723475', 'DGS2', 'TRA', 'MTRNR2L6', 'MTRNR2L12', 'LOC112268384', 'MTRNR2L3', 'MTRNR2L13', 'TAS2R33', 'LOC122539214', 'DUX5', 'MTRNR2L2', 'LOC107984156', 'MTRNR2L8', 'MTRNR2L11', 'DUX3', 'LOC100653049', 'AFD1', ...
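The set comprehension in the cell above collects every gene symbol across all pathways exactly once before standardization. A small sketch of just that flattening step, on a hypothetical two-pathway mapping (the pathway keys are invented), might look like this:

```python
# Toy pathway -> genes mapping (hypothetical subset, for illustration only)
toy_go_bp = {
    "Pathway A (GO:0000001)": ["XBP1", "DDIT3"],
    "Pathway B (GO:0000002)": ["DDIT3", "ATF6B"],
}

# Set comprehension: visit every gene of every pathway, keep each symbol once
unique_genes = {g for genes in toy_go_bp.values() for g in genes}

# Sets are unordered, so sort for a reproducible listing
print(sorted(unique_genes))
```

Here DDIT3 appears in both pathways but is kept only once, so the registry receives one record per symbol.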

Manually register the 37 non-validated symbols:

inspect_result = bt.Gene.inspect(all_genes, bt.Gene.symbol)

# create a Gene record for each symbol that could not be validated
nonval_genes = [bt.Gene(symbol=g) for g in inspect_result.non_validated]

ln.save(nonval_genes)
❗ received 14696 unique terms, 1 empty/duplicated term is ignored
37 terms (0.30%) are not validated for symbol: MTRNR2L2, SEPTIN14P20, LOC122513141, MTRNR2L6, MTRNR2L8, TRL-AAG2-3, AZF1, CCL4L1, MTRNR2L5, MDRV, AFD1, MTRNR2L11, DUX3, FOXL3-OT1, LOC107984156, TRA, MTRNR2L10, LOC122319436, MTRNR2L12, LOC344967, ...
   couldn't validate 37 terms: 'TRL-AAG2-3', 'MTRNR2L1', 'LOC102723475', 'TRA', 'DGS2', 'MTRNR2L6', 'MTRNR2L12', 'MTRNR2L3', 'LOC112268384', 'MTRNR2L13', 'TAS2R33', 'LOC122539214', 'DUX5', 'MTRNR2L2', 'LOC107984156', 'MTRNR2L8', 'MTRNR2L11', 'DUX3', 'LOC100653049', 'AFD1', ...
→  if you are sure, create new records via ln.Gene() and save to your registry