Track data in existing storage locations
You can track data without copying it if the data is already located in the directory or bucket that is configured as the lamindb storage.
import lamindb as ln
ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser1
ℹ️ Added notebook: Transform(id='OEbRXnepeCqE', v='0', name='05-existing', type=notebook, title='Track data in existing storage locations', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
ℹ️ Added run: Run(id='obsRwVGjOgOautAvjeiS', transform_id='OEbRXnepeCqE', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
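Whether a file can be tracked in place depends on the configured storage root. As a quick orientation (a minimal sketch using the same settings attribute as in the next section), you can inspect it:

# the configured storage root; files located under it are referenced in place
ln.setup.settings.instance.storage.root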
Track existing data from local storage
Now let’s track a file that is already in the configured local directory: ./mydata
# move an example file into the configured storage location
configured_storage = ln.setup.settings.instance.storage.root
filepath = ln.dev.datasets.file_mini_csv()
filepath = filepath.rename(configured_storage / filepath.name)
assert configured_storage in filepath.parents
!ls ./mydata/mini.csv
./mydata/mini.csv
file = ln.File("./mydata/mini.csv")
ln.add(file)
File(id='oocXPPbvFX1fYuFt5i6D', name='mini', suffix='.csv', size=11, hash='z1LdF2qN4cN0M2sXrcW8aw', source_id='obsRwVGjOgOautAvjeiS', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
What happens under the hood?
In the SQL database
- A File entry
- A Notebook entry
- A Run entry
All three entries are linked so that you can find the file using any of the metadata fields.
ln.select(ln.File, name=file.name, suffix=file.suffix).one()
File(id='oocXPPbvFX1fYuFt5i6D', name='mini', suffix='.csv', size=11, hash='z1LdF2qN4cN0M2sXrcW8aw', source_id='obsRwVGjOgOautAvjeiS', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
ln.select(ln.schema.Notebook, id=ln.context.transform.id).one()
Transform(id='OEbRXnepeCqE', v='0', name='05-existing', type=notebook, title='Track data in existing storage locations', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
ln.select(ln.schema.Run, id=ln.context.run.id).one()
Run(id='obsRwVGjOgOautAvjeiS', transform_id='OEbRXnepeCqE', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
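Because the entries are linked (the File repr above shows source_id pointing to the Run id), you can also retrieve the file through its provenance. A minimal sketch, assuming the same select API as in the cells above:

# query the file by the run that created it (source_id links File to Run)
ln.select(ln.File, source_id=ln.context.run.id).one()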
In storage
The file ./mydata/mini.csv is untouched, and no additional file is created:
!ls ./mydata
VWybnO7aZxo8NMu80eHp.jpg mini.csv mydata.lndb sfqjeqshOu4n2OCrxGj4.parquet
Find and retrieve data
file = ln.select(ln.File, name="mini", suffix=".csv").one()
file
File(id='oocXPPbvFX1fYuFt5i6D', name='mini', suffix='.csv', size=11, hash='z1LdF2qN4cN0M2sXrcW8aw', source_id='obsRwVGjOgOautAvjeiS', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
file.path()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/mini.csv')
file.load()
 | test
---|---
0 | 1
1 | 2
2 | 3
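Because path() resolves to a regular local file here, you could read it with standard tooling as well; a minimal sketch with pandas (not part of the original notebook, shown for illustration):

import pandas as pd

pd.read_csv(file.path())  # yields the same DataFrame as file.load()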
assert str(filepath.resolve()) == str(file.path())
Track existing data from cloud storage
Configure cloud storage
Tip
If you already have an existing instance configured with a different storage location, you can switch storage via:
lamin.set.storage({storage_path})
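For instance, switching to an S3 bucket could look like lamin.set.storage("s3://my-bucket"); the bucket name is only a placeholder, and the exact entry point may differ across lamindb versions.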
Let’s configure an instance with cloud storage (s3):
ln.setup.login("testuser1")
ln.setup.init(storage="s3://lamindb-ci")
✅ Logged in with email testuser1@lamin.ai and id DzTjkKse
ℹ️ Loading instance: testuser1/lamindb-ci
Now we’d like to ingest a CSV file that is located in this cloud bucket:
cloudpath = "s3://lamindb-ci/test-data/Species.csv"
file = ln.File(data=cloudpath)
id = file.id
ln.add(file)
File(id='Cgfd532Ofh9VBVwi8IYV', name='Species', suffix='.csv', size=32772, source_id='obsRwVGjOgOautAvjeiS', storage_id='osY5NFDT', created_at=datetime.datetime(2023, 3, 30, 23, 16, 15))
Find and retrieve data
file = ln.select(ln.File, id=id).one()
file
File(id='Cgfd532Ofh9VBVwi8IYV', name='Species', suffix='.csv', size=32772, source_id='obsRwVGjOgOautAvjeiS', storage_id='osY5NFDT', created_at=datetime.datetime(2023, 3, 30, 23, 16, 15))
df = file.load()
df.head()
 | Common name | Scientific name | Taxon ID | Ensembl Assembly | Accession | Genebuild Method | Variation database | Regulation database
---|---|---|---|---|---|---|---|---
0 | Abingdon island giant tortoise | Chelonoidis abingdonii | 106734 | ASM359739v1 | GCA_003597395.1 | Full genebuild | - | -
1 | African ostrich | Struthio camelus australis | 441894 | ASM69896v1 | GCA_000698965.1 | Full genebuild | - | -
2 | Agassiz's desert tortoise | Gopherus agassizii | 38772 | ASM289641v1 | GCA_002896415.1 | Full genebuild | - | -
3 | Algerian mouse | Mus spretus | 10096 | SPRET_EiJ_v1 | GCA_001624865.1 | External annotation import | - | Y
4 | Alpaca | Vicugna pacos | 30538 | vicPac1 | - | Projection build | - | -
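As with the local file above, the object is referenced in its existing cloud location rather than copied; a minimal sketch (not an output of this run, and the exact cloud-path type depends on the lamindb version):

file.path()  # resolves back to the original object under s3://lamindb-ci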
# [Not for users] Clean the test instance
ln.delete(file, delete_data_from_storage=False)
✅ Deleted row [session open] File(id='Cgfd532Ofh9VBVwi8IYV', name='Species', suffix='.csv', size=32772, source_id='obsRwVGjOgOautAvjeiS', storage_id='osY5NFDT', created_at=datetime.datetime(2023, 3, 30, 23, 16, 15)) in table File.