Track files#
import lamindb as ln
ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser1
ℹ️ Added notebook: Transform(id='NJvdsWWbJlZS', v='0', name='03-files', type=notebook, title='Track files', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 49))
ℹ️ Added run: Run(id='bpttr5hRLo73B4m2co3v', transform_id='NJvdsWWbJlZS', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 49))
Note
Within a Jupyter notebook, the call to ln.context.track_notebook(); ln.Run(load_latest=True)
tracks the notebook run as a data source.
Learn more: Track data lineage: Transform & Run.
Usage#
A local file:
filepath = ln.dev.datasets.file_jpg_paradisi05().resolve().as_posix()
filepath
'/home/runner/work/lamindb/lamindb/docs/guide/paradisi05_laminopathic_nuclei.jpg'
To start tracking this file, we create a file record:
Note
We’ll work with a single class for data objects in memory and on disk: File. On disk, these are often (but not always, e.g., for zarr) files.
file = ln.File(filepath)
The file record captures metadata about the file and will be our way to query and load data.
file
File(id='VWybnO7aZxo8NMu80eHp', name='paradisi05_laminopathic_nuclei', suffix='.jpg', size=29358, hash='r4tnqmKI_SjrkdLzpuWp4g', source_id='bpttr5hRLo73B4m2co3v', storage_id='8Pj12JLb')
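To make the fields in this record more concrete, here is a minimal sketch of how a name, suffix, size, and content hash could be derived from a path using only the standard library. The helper `describe_file` is hypothetical and not part of lamindb. The hashing scheme (URL-safe base64 of an MD5 digest with padding stripped) is an assumption chosen because it yields a 22-character string like the one shown above; the library may compute its hash differently.

```python
import base64
import hashlib
from pathlib import Path


def describe_file(path):
    """Collect basic metadata for a file on disk (hypothetical helper)."""
    p = Path(path)
    data = p.read_bytes()
    # Assumption: URL-safe base64 of the MD5 digest, with "=" padding removed.
    digest = hashlib.md5(data).digest()
    content_hash = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return {
        "name": p.stem,        # file name without its suffix
        "suffix": p.suffix,    # e.g. ".jpg"
        "size": len(data),     # size in bytes
        "hash": content_hash,  # fingerprint of the file content
    }
```

Calling `describe_file(filepath)` on the image above would return a dictionary with the same kinds of fields as the record, apart from the IDs that link it to a run and a storage location.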
We can also access linked metadata records, for instance, the record that stores metadata about this run.
file.source
Run(id='bpttr5hRLo73B4m2co3v', transform_id='NJvdsWWbJlZS', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 49))
Because we’re ingesting from a notebook, the source defaults to the notebook run created upon calling ln.track():
assert ln.context.run == file.source
Next, we add the metadata to the database and the data to storage. We can do both in a single transaction:
ln.add(file)
File(id='VWybnO7aZxo8NMu80eHp', name='paradisi05_laminopathic_nuclei', suffix='.jpg', size=29358, hash='r4tnqmKI_SjrkdLzpuWp4g', source_id='bpttr5hRLo73B4m2co3v', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 15, 50))
What happens under the hood?#
In the SQL database#
A File entry
A Notebook entry
A Run entry
All three entries are linked so that you can find the file using any of the metadata fields.
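The linkage between the three entries can be illustrated with a small standalone sketch using the standard-library `sqlite3` module. The table and column names below are hypothetical and only mirror the field names visible in the records on this page (`source_id`, `transform_id`); the actual lamindb schema may differ. The IDs and names are the ones shown in the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: a file points to a run, a run to a transform.
conn.executescript(
    """
    CREATE TABLE transform (id TEXT PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE run (
        id TEXT PRIMARY KEY,
        transform_id TEXT REFERENCES transform(id)
    );
    CREATE TABLE file (
        id TEXT PRIMARY KEY,
        name TEXT,
        suffix TEXT,
        source_id TEXT REFERENCES run(id)
    );
    """
)

# The context manager commits all three inserts together, or rolls back on error.
with conn:
    conn.execute(
        "INSERT INTO transform VALUES (?, ?, ?)",
        ("NJvdsWWbJlZS", "03-files", "notebook"),
    )
    conn.execute(
        "INSERT INTO run VALUES (?, ?)",
        ("bpttr5hRLo73B4m2co3v", "NJvdsWWbJlZS"),
    )
    conn.execute(
        "INSERT INTO file VALUES (?, ?, ?, ?)",
        ("VWybnO7aZxo8NMu80eHp", "paradisi05_laminopathic_nuclei", ".jpg",
         "bpttr5hRLo73B4m2co3v"),
    )


def files_from_notebook(conn, notebook_name):
    """Find file names by following file -> run -> transform links."""
    rows = conn.execute(
        """
        SELECT file.name
        FROM file
        JOIN run ON file.source_id = run.id
        JOIN transform ON run.transform_id = transform.id
        WHERE transform.name = ?
        """,
        (notebook_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout, `files_from_notebook(conn, "03-files")` finds the file through the notebook's name alone, which is the kind of lookup the linked entries make possible.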
ln.select(ln.File, name=file.name).one()
File(id='VWybnO7aZxo8NMu80eHp', name='paradisi05_laminopathic_nuclei', suffix='.jpg', size=29358, hash='r4tnqmKI_SjrkdLzpuWp4g', source_id='bpttr5hRLo73B4m2co3v', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 15, 50))
ln.select(ln.schema.Notebook, id=ln.context.transform.id).one()
Transform(id='NJvdsWWbJlZS', v='0', name='03-files', type=notebook, title='Track files', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 49))
ln.select(ln.schema.Run, id=ln.context.run.id).one()
Run(id='bpttr5hRLo73B4m2co3v', transform_id='NJvdsWWbJlZS', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 49))
In storage#
Note
This is your configured storage location (in this instance ./mydata), which you pass to ln.setup.init(storage=...) when initializing the instance.
If a cloud storage location is configured, the file will be uploaded.
A jpg file with a cryptic name:
!ls ./mydata
VWybnO7aZxo8NMu80eHp.jpg mydata.lndb
Tip
If you prefer semantic names, you can keep them by tracking existing data in place rather than ingesting it into a storage location: Track data in existing storage locations.
Naming data objects in storage by the primary key ID of the File is typically preferred when facing potential name clashes at large scale or when working with in-memory views.
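As a rough illustration of why ID-based names avoid clashes, the sketch below copies a file into a storage directory under `<id><suffix>`. The helpers `storage_key` and `ingest` are hypothetical and are not lamindb functions; they only mimic the naming pattern visible in the directory listing above.

```python
import shutil
from pathlib import Path


def storage_key(file_id, suffix):
    """Build the object name in storage from the record ID and file suffix."""
    return f"{file_id}{suffix}"


def ingest(src, storage_root, file_id):
    """Copy `src` into `storage_root` under an ID-based name (hypothetical)."""
    src = Path(src)
    dest = Path(storage_root) / storage_key(file_id, src.suffix)
    shutil.copy2(src, dest)
    return dest
```

Two source files that are both called `image.jpg` but live in different folders end up as two distinct objects as long as their IDs differ, whereas copying them under their original names would overwrite one with the other.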
Retrieve a file#
Getting the data back works through .load(). Here, we get back the filepath with the cryptic filename.
file.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/VWybnO7aZxo8NMu80eHp.jpg')
Find a file#
You can also query the File record by its metadata. One of the simplest ways is by name:
file = ln.select(ln.File, name="paradisi05_laminopathic_nuclei").one()
file
File(id='VWybnO7aZxo8NMu80eHp', name='paradisi05_laminopathic_nuclei', suffix='.jpg', size=29358, hash='r4tnqmKI_SjrkdLzpuWp4g', source_id='bpttr5hRLo73B4m2co3v', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 15, 50))
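The query above ends in `.one()`. Assuming it follows the common ORM convention of requiring exactly one match (this page does not state how lamindb handles zero or several matches), its behavior can be sketched in plain Python. The function `one` and the in-memory `records` list below are illustrative stand-ins, not library code.

```python
def one(results):
    """Return the single element of `results`, or raise if there is not exactly one."""
    items = list(results)
    if len(items) != 1:
        raise LookupError(f"expected exactly one match, found {len(items)}")
    return items[0]


# Illustrative stand-in for the File table, using the values shown above.
records = [
    {"id": "VWybnO7aZxo8NMu80eHp", "name": "paradisi05_laminopathic_nuclei"},
]

match = one(r for r in records if r["name"] == "paradisi05_laminopathic_nuclei")
```

Under this assumption, querying by a name that is shared by several files would raise rather than silently pick one, so a unique field such as the ID is the safer filter when names can repeat.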
Learn more: Query data.