Track data in existing storage locations#

You can track data without copying it, provided the data already resides in the storage location configured for your lamindb instance.
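The condition above boils down to a path-containment check. The following is a minimal sketch using only the standard library; `is_in_storage` is a hypothetical helper written for illustration, not part of lamindb, and the temporary directory stands in for the configured storage root.

```python
from pathlib import Path
import tempfile


def is_in_storage(path: Path, storage_root: Path) -> bool:
    """Return True if `path` lies inside `storage_root` (hypothetical helper)."""
    resolved_root = storage_root.resolve()
    # Same idea as `storage_root in path.parents`, applied to resolved paths
    return resolved_root in path.resolve().parents


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "mydata"  # stand-in for the configured storage root
    root.mkdir()
    inside = root / "mini.csv"
    inside.write_bytes(b"test\n1\n2\n3\n")
    outside = Path(tmp) / "elsewhere.csv"
    outside.write_bytes(b"test\n1\n")

    inside_ok = is_in_storage(inside, root)
    outside_ok = is_in_storage(outside, root)

print(inside_ok, outside_ok)
```

Only a file for which such a check succeeds can be tracked in place; anything outside the storage root would have to be copied or moved first.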

import lamindb as ln

ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser1
ℹ️ Added notebook: Transform(id='OEbRXnepeCqE', v='0', name='05-existing', type=notebook, title='Track data in existing storage locations', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
ℹ️ Added run: Run(id='obsRwVGjOgOautAvjeiS', transform_id='OEbRXnepeCqE', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))

Track existing data from local storage#

Now let’s track a file that is already in the configured local directory: ./mydata

configured_storage = ln.setup.settings.instance.storage.root
filepath = ln.dev.datasets.file_mini_csv()
filepath = filepath.rename(configured_storage / filepath.name)
assert configured_storage in filepath.parents
!ls ./mydata/mini.csv
./mydata/mini.csv
file = ln.File("./mydata/mini.csv")
ln.add(file)
File(id='oocXPPbvFX1fYuFt5i6D', name='mini', suffix='.csv', size=11, hash='z1LdF2qN4cN0M2sXrcW8aw', source_id='obsRwVGjOgOautAvjeiS', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))

What happens under the hood?#

In the SQL database#

  1. A File entry

  2. A Notebook entry

  3. A Run entry

All three entries are linked so that you can find the file using any of the metadata fields.
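To make the linkage concrete, here is a rough sketch with an in-memory SQLite database. The table and column layout is an assumption reduced to the fields visible in the outputs of this guide (`source_id` on the file pointing to the run, `transform_id` on the run pointing to the notebook); it is not lamindb's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: file -> run -> transform (notebook)
con.executescript(
    """
    CREATE TABLE transform (id TEXT PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE run (
        id TEXT PRIMARY KEY,
        transform_id TEXT REFERENCES transform(id)
    );
    CREATE TABLE file (
        id TEXT PRIMARY KEY,
        name TEXT,
        suffix TEXT,
        source_id TEXT REFERENCES run(id)
    );
    """
)
# IDs copied from the outputs shown in this guide
con.execute("INSERT INTO transform VALUES ('OEbRXnepeCqE', '05-existing', 'notebook')")
con.execute("INSERT INTO run VALUES ('obsRwVGjOgOautAvjeiS', 'OEbRXnepeCqE')")
con.execute(
    "INSERT INTO file VALUES ('oocXPPbvFX1fYuFt5i6D', 'mini', '.csv', 'obsRwVGjOgOautAvjeiS')"
)

# Find files via the name of the notebook that produced them
rows = con.execute(
    """
    SELECT file.name, file.suffix
    FROM file
    JOIN run ON file.source_id = run.id
    JOIN transform ON run.transform_id = transform.id
    WHERE transform.name = ?
    """,
    ("05-existing",),
).fetchall()
print(rows)
```

Because each entry references the next, a query can start from the file, the run, or the notebook and still reach the other two.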

ln.select(ln.File, name=file.name, suffix=file.suffix).one()
File(id='oocXPPbvFX1fYuFt5i6D', name='mini', suffix='.csv', size=11, hash='z1LdF2qN4cN0M2sXrcW8aw', source_id='obsRwVGjOgOautAvjeiS', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
ln.select(ln.schema.Notebook, id=ln.context.transform.id).one()
Transform(id='OEbRXnepeCqE', v='0', name='05-existing', type=notebook, title='Track data in existing storage locations', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
ln.select(ln.schema.Run, id=ln.context.run.id).one()
Run(id='obsRwVGjOgOautAvjeiS', transform_id='OEbRXnepeCqE', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))

In storage#

The file ./mydata/mini.csv is untouched, and no additional file is created:

!ls ./mydata
VWybnO7aZxo8NMu80eHp.jpg  mini.csv  mydata.lndb  sfqjeqshOu4n2OCrxGj4.parquet
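The idea of recording metadata about a file while leaving storage unchanged can be sketched as follows. `describe_in_place` is a hypothetical helper, the file content is assumed from the size and loaded values shown in this guide, and the hashing scheme (MD5, unpadded URL-safe base64) is an assumption chosen for illustration; how lamindb computes its `hash` field is not stated here.

```python
import base64
import hashlib
import os
import tempfile
from pathlib import Path


def describe_in_place(path: Path) -> dict:
    """Collect metadata about a file without moving or copying it (hypothetical helper)."""
    data = path.read_bytes()
    # Assumption: MD5 digest encoded as unpadded URL-safe base64
    digest = base64.urlsafe_b64encode(hashlib.md5(data).digest()).decode().rstrip("=")
    return {"name": path.stem, "suffix": path.suffix, "size": len(data), "hash": digest}


with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)  # stand-in for ./mydata
    csv_path = storage / "mini.csv"
    csv_path.write_bytes(b"test\n1\n2\n3\n")  # assumed content of mini.csv

    before = sorted(os.listdir(storage))
    record = describe_in_place(csv_path)
    after = sorted(os.listdir(storage))

print(record["name"], record["suffix"], record["size"])
print(before == after)
```

The directory listing is identical before and after, which mirrors the behavior described above: only a metadata record is produced.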

Find and retrieve data#

file = ln.select(ln.File, name="mini", suffix=".csv").one()

file
File(id='oocXPPbvFX1fYuFt5i6D', name='mini', suffix='.csv', size=11, hash='z1LdF2qN4cN0M2sXrcW8aw', source_id='obsRwVGjOgOautAvjeiS', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 16, 7))
file.path()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/mini.csv')
file.load()
   test
0     1
1     2
2     3
assert str(filepath.resolve()) == str(file.path())
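Conceptually, retrieval resolves a stored record to a path under the storage root and then picks a loader based on the file suffix. The sketch below illustrates that pattern with the standard library only; `load`, `load_csv`, and `LOADERS` are hypothetical names, and lamindb's own `load()` returns a DataFrame rather than a list of dictionaries.

```python
import csv
import tempfile
from pathlib import Path


def load_csv(path: Path) -> list:
    """Read a CSV file into a list of row dictionaries."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical registry mapping file suffixes to loader functions
LOADERS = {".csv": load_csv}


def load(storage_root: Path, key: str):
    """Resolve `key` under `storage_root` and load it with a suffix-specific loader."""
    path = (storage_root / key).resolve()
    loader = LOADERS.get(path.suffix)
    if loader is None:
        raise ValueError(f"No loader registered for suffix {path.suffix!r}")
    return loader(path)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # stand-in for the configured storage root
    (root / "mini.csv").write_bytes(b"test\n1\n2\n3\n")  # assumed content of mini.csv
    rows = load(root, "mini.csv")

print(rows)
```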

Track existing data from cloud storage#

Configure cloud storage#

Tip

If you already have an instance with a different storage location, you can switch storage with:

lamin.set.storage({storage_path})

Let’s configure an instance with cloud storage (S3):

ln.setup.login("testuser1")
ln.setup.init(storage="s3://lamindb-ci")
✅ Logged in with email testuser1@lamin.ai and id DzTjkKse
ℹ️ Loading instance: testuser1/lamindb-ci
🔶 Schema core v0.30rc5 is not up to date with 0.30.0
🔶 Migrate instance lamindb-ci outside a test (CI) run. Unexpected errors might happen.
'migrate-failed'

Now we’d like to ingest a CSV file that is located in this cloud bucket:

cloudpath = "s3://lamindb-ci/test-data/Species.csv"
file = ln.File(data=cloudpath)
id = file.id
ln.add(file)
File(id='Cgfd532Ofh9VBVwi8IYV', name='Species', suffix='.csv', size=32772, source_id='obsRwVGjOgOautAvjeiS', storage_id='osY5NFDT', created_at=datetime.datetime(2023, 3, 30, 23, 16, 15))
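The `name` and `suffix` in the record above correspond to the last component of the cloud path. A small sketch of how such a path can be checked against the storage root and split into its parts, using only the standard library; `split_cloud_path` is a hypothetical helper, and it only compares scheme and bucket, ignoring any prefix inside the storage root.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def split_cloud_path(cloudpath: str, storage_root: str) -> dict:
    """Split an object URL into bucket, key, name, and suffix (hypothetical helper)."""
    url = urlparse(cloudpath)
    root = urlparse(storage_root)
    # Only scheme and bucket are compared; a root with a key prefix is not handled
    if (url.scheme, url.netloc) != (root.scheme, root.netloc):
        raise ValueError(f"{cloudpath} is not inside {storage_root}")
    key = PurePosixPath(url.path.lstrip("/"))
    return {"bucket": url.netloc, "key": str(key), "name": key.stem, "suffix": key.suffix}


parts = split_cloud_path("s3://lamindb-ci/test-data/Species.csv", "s3://lamindb-ci")
print(parts)
```

A path in a different bucket raises `ValueError` in this sketch, reflecting the requirement that the data must already live in the configured storage.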

Find and retrieve data#

file = ln.select(ln.File, id=id).one()

file
File(id='Cgfd532Ofh9VBVwi8IYV', name='Species', suffix='.csv', size=32772, source_id='obsRwVGjOgOautAvjeiS', storage_id='osY5NFDT', created_at=datetime.datetime(2023, 3, 30, 23, 16, 15))
df = file.load()
df.head()
   Common name                     Scientific name             Taxon ID  Ensembl Assembly  Accession        Genebuild Method            Variation database  Regulation database
0  Abingdon island giant tortoise  Chelonoidis abingdonii      106734    ASM359739v1       GCA_003597395.1  Full genebuild              -                   -
1  African ostrich                 Struthio camelus australis  441894    ASM69896v1        GCA_000698965.1  Full genebuild              -                   -
2  Agassiz's desert tortoise       Gopherus agassizii          38772     ASM289641v1       GCA_002896415.1  Full genebuild              -                   -
3  Algerian mouse                  Mus spretus                 10096     SPRET_EiJ_v1      GCA_001624865.1  External annotation import  -                   Y
4  Alpaca                          Vicugna pacos               30538     vicPac1           -                Projection build            -                   -
# [Not for users] Clean the test instance
ln.delete(file, delete_data_from_storage=False)
✅ Deleted row [session open] File(id='Cgfd532Ofh9VBVwi8IYV', name='Species', suffix='.csv', size=32772, source_id='obsRwVGjOgOautAvjeiS', storage_id='osY5NFDT', created_at=datetime.datetime(2023, 3, 30, 23, 16, 15)) in table File.