Ensembl species -> `bionty.Species().df`

Downloaded from: https://www.ensembl.org/info/about/species.html

import lamindb as ln
import pandas as pd
from lnschema_bionty import id

ln.nb.header()
2022-10-24 16:54:15,372:INFO - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-10-24 16:54:15,373:INFO - NumExpr defaulting to 8 threads.
author: Sunny Sun (sunnyosun)
id: SH5O08MYHNXe
version: 1
time_init: 2022-10-24 10:21
time_run: 2022-10-24 14:55
consecutive_cells: True
pypackage: lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0

Curate the species table

df = pd.read_csv("https://bionty-assets.s3.amazonaws.com/Species.csv", dtype=str)
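
The call above passes dtype=str, so every column, including Taxon ID, is read as text rather than having its type inferred. The sketch below illustrates the difference on two rows copied from this notebook's species table. The inline CSV and the variable names `csv_text`, `inferred`, and `as_text` are assumptions made for illustration and are not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the species table in this notebook, as an inline CSV
# (hypothetical stand-in for the real Species.csv).
csv_text = (
    "Common name,Scientific name,Taxon ID\n"
    "Algerian mouse,Mus spretus,10096\n"
    "Alpaca,Vicugna pacos,30538\n"
)

# Default behaviour: pandas infers a dtype per column.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype=str, every column is kept as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Inference turns Taxon ID into an integer column.
print(inferred["Taxon ID"].dtype)

# With dtype=str, each Taxon ID stays a Python string.
print(type(as_text["Taxon ID"].iloc[0]))
```

Keeping identifiers as strings avoids accidental numeric handling of values that are labels rather than quantities.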

df.head()
   Common name                     Scientific name             Taxon ID  Ensembl Assembly  Accession        Genebuild Method            Variation database  Regulation database
0  Abingdon island giant tortoise  Chelonoidis abingdonii      106734    ASM359739v1       GCA_003597395.1  Full genebuild              -                   -
1  African ostrich                 Struthio camelus australis  441894    ASM69896v1        GCA_000698965.1  Full genebuild              -                   -
2  Agassiz's desert tortoise       Gopherus agassizii          38772     ASM289641v1       GCA_002896415.1  Full genebuild              -                   -
3  Algerian mouse                  Mus spretus                 10096     SPRET_EiJ_v1      GCA_001624865.1  External annotation import  -                   Y
4  Alpaca                          Vicugna pacos               30538     vicPac1           -                Projection build            -                   -

df = df[
    ["Common name", "Taxon ID", "Scientific name", "Ensembl Assembly", "Accession"]
].copy()

df.head()
   Common name                     Taxon ID  Scientific name             Ensembl Assembly  Accession
0  Abingdon island giant tortoise  106734    Chelonoidis abingdonii      ASM359739v1       GCA_003597395.1
1  African ostrich                 441894    Struthio camelus australis  ASM69896v1        GCA_000698965.1
2  Agassiz's desert tortoise       38772     Gopherus agassizii          ASM289641v1       GCA_002896415.1
3  Algerian mouse                  10096     Mus spretus                 SPRET_EiJ_v1      GCA_001624865.1
4  Alpaca                          30538     Vicugna pacos               vicPac1           -

Generate bionty species ids

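# generate one random bionty species id per row
ids = [id.species() for _ in range(len(df))]

# make sure the generated ids are unique before using them as the index
assert len(set(ids)) == len(ids), "duplicate species ids generated; rerun this cell"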
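
Because the ids are drawn at random, two rows can in principle receive the same id. The sketch below shows one way to collect distinct ids by skipping collisions instead of failing on them. It is only an illustration under stated assumptions: the three-character base62 format is inferred from the ids displayed in this notebook, `random_id` and `unique_ids` are hypothetical helpers that are not part of lnschema_bionty, and the row count of 300 is arbitrary.

```python
import secrets
import string

# Assumption: ids are 3 characters from a base62 alphabet, as the ids shown
# in this notebook suggest; the real id.species() may work differently.
ALPHABET = string.ascii_letters + string.digits


def random_id(n_char: int = 3) -> str:
    """Return one random id of n_char characters (hypothetical helper)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(n_char))


def unique_ids(n: int, n_char: int = 3) -> list[str]:
    """Draw random ids until n distinct ones have been collected."""
    if n > len(ALPHABET) ** n_char:
        raise ValueError("more ids requested than the id space can hold")
    seen: set[str] = set()
    ids: list[str] = []
    while len(ids) < n:
        candidate = random_id(n_char)
        if candidate not in seen:  # skip collisions instead of failing
            seen.add(candidate)
            ids.append(candidate)
    return ids


# Arbitrary row count, chosen only for illustration.
ids_sketch = unique_ids(300)
assert len(set(ids_sketch)) == len(ids_sketch)
```

Skipping duplicates during generation means the cell never needs to be rerun, at the cost of a small amount of extra bookkeeping.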

df.index = ids
df.index.name = "id"
df.head()
     Common name                     Taxon ID  Scientific name             Ensembl Assembly  Accession
id
MfC  Abingdon island giant tortoise  106734    Chelonoidis abingdonii      ASM359739v1       GCA_003597395.1
oQH  African ostrich                 441894    Struthio camelus australis  ASM69896v1        GCA_000698965.1
G2P  Agassiz's desert tortoise       38772     Gopherus agassizii          ASM289641v1       GCA_002896415.1
OC9  Algerian mouse                  10096     Mus spretus                 SPRET_EiJ_v1      GCA_001624865.1
Tns  Alpaca                          30538     Vicugna pacos               vicPac1           -

df.to_parquet("ensembl_species.parquet")

Push to bionty-assets.lndb

!lndb load bionty-assets
migrate-unnecessary
!lndb login sunnyosun
ingest = ln.db.Ingest()
ingest.add("ensembl_species.parquet", dobject_id="VpdUdouFahpvStwddqTwk");
ingest.commit()
✅ Cell numbers increase consecutively: Awesome!
2022-10-24 16:54:51,715:INFO - Found credentials in shared credentials file: ~/.aws/credentials
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/ensembl_species.parquet: 1.00
ℹ️ Added notebook 'Ensembl species -> `bionty.Species().df`' (SH5O08MYHNXe, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+-------------------------------------------------+--------------------------------------------------------------+----------------------+
|   | dobject                                         | jupynb                                                       | user                 |
+---+-------------------------------------------------+--------------------------------------------------------------+----------------------+
| 0 | ensembl_species.parquet (VpdUdouFahpvStwddqTwk) | 'Ensembl species -> `bionty.Species().df`' (SH5O08MYHNXe, 1) | sunnyosun (kmvZDIX9) |
+---+-------------------------------------------------+--------------------------------------------------------------+----------------------+
ℹ️ Set notebook version to 1 & wrote pypackages.

Now on S3: https://bionty-assets.s3.amazonaws.com/VpdUdouFahpvStwddqTwk.parquet