Track in-memory data objects#
We’ve learned how to track files, but what about in-memory objects?
Yes! A File can also be created from in-memory data, which is stored in its canonical storage format (e.g. DataFrame → .parquet, AnnData → .h5ad/.zarr, …).
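To make the mapping concrete, the sketch below pairs an in-memory type with the suffix of its storage format. The function `canonical_suffix`, the lookup table, and the stand-in `DataFrame` class are hypothetical and exist only for illustration; this is not lamindb's implementation, and picking `.h5ad` over `.zarr` for `AnnData` is an assumption.

```python
# Hypothetical illustration, not lamindb's implementation: look up the
# storage suffix from the class name of an in-memory object.
CANONICAL_SUFFIX = {
    "DataFrame": ".parquet",  # pandas.DataFrame
    "AnnData": ".h5ad",       # assumption: .h5ad chosen over .zarr
}


def canonical_suffix(obj: object) -> str:
    """Return the storage suffix for obj, keyed on its class name."""
    type_name = type(obj).__name__
    try:
        return CANONICAL_SUFFIX[type_name]
    except KeyError:
        raise TypeError(f"no canonical storage format known for {type_name}") from None


class DataFrame:
    """Stand-in class so the sketch runs without pandas installed."""


print(canonical_suffix(DataFrame()))
```

Keying on the class name keeps the sketch free of heavy imports; a real implementation would more likely use `isinstance` checks against the actual classes.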
import lamindb as ln
ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser1
ℹ️ Added notebook: Transform(id='OdlFhFWW7qg3', v='0', name='04-memory', type=notebook, title='Track in-memory data objects', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
ℹ️ Added run: Run(id='aUvAkAwVxam9tDQcKfro', transform_id='OdlFhFWW7qg3', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
Usage#
Let’s now ingest an in-memory DataFrame storing the iris dataset:
import sklearn.datasets
df = sklearn.datasets.load_iris(as_frame=True).frame
df.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
df.shape
(150, 5)
When ingesting an in-memory object, a name argument must be passed:
file = ln.File(df, name="iris")
Next, add the metadata to the database and the data to storage. Both happen in a single transaction:
ln.add(file)
File(id='sfqjeqshOu4n2OCrxGj4', name='iris', suffix='.parquet', size=5629, hash='jUTdERuqlGv_GyqFfIEb2Q', source_id='aUvAkAwVxam9tDQcKfro', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
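The `size` and `hash` fields in the record above describe the stored bytes. As a rough sketch of how such fields could be computed, the snippet below hashes a byte string with MD5 and encodes the digest as URL-safe base64 without padding, which yields 22 characters like the value shown. That this matches lamindb's actual hashing scheme is an assumption, and `content_hash` is a hypothetical helper.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """Hypothetical helper: unpadded URL-safe base64 of the MD5 digest."""
    digest = hashlib.md5(data).digest()  # always 16 bytes
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# Stand-in bytes; in practice this would be the serialized parquet file.
payload = b"stand-in for the serialized parquet bytes"
print(len(payload), content_hash(payload))
```

A content hash of this kind lets identical data be recognized regardless of the name it was ingested under.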
What happens under the hood?#
In the SQL database#
- A File entry
- A Notebook entry
- A Run entry
All three entries are linked so that you can find the file using any of the metadata fields.
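The linkage can be pictured with a small in-memory SQLite database. The schema below is a hypothetical simplification (table and column names are assumptions modeled on the fields visible in the printed records, where `source_id` on the file matches the run id and `transform_id` on the run matches the notebook id); the real tables carry more columns.

```python
import sqlite3

# Hypothetical, simplified schema: the real tables have more columns.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE transform (id TEXT PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE run (id TEXT PRIMARY KEY,
                      transform_id TEXT REFERENCES transform(id));
    CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT, suffix TEXT,
                       source_id TEXT REFERENCES run(id));
    INSERT INTO transform VALUES ('OdlFhFWW7qg3', '04-memory', 'notebook');
    INSERT INTO run VALUES ('aUvAkAwVxam9tDQcKfro', 'OdlFhFWW7qg3');
    INSERT INTO file VALUES ('sfqjeqshOu4n2OCrxGj4', 'iris', '.parquet',
                             'aUvAkAwVxam9tDQcKfro');
    """
)

# Find every file produced by a run of the notebook named '04-memory'.
rows = con.execute(
    """
    SELECT file.name, file.suffix
    FROM file
    JOIN run ON file.source_id = run.id
    JOIN transform ON run.transform_id = transform.id
    WHERE transform.name = ?
    """,
    ("04-memory",),
).fetchall()
print(rows)
```

Because each record stores the id of the record it came from, a query can start from the notebook, the run, or the file and reach the other two with joins.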
ln.select(ln.File, name=file.name).one()
File(id='sfqjeqshOu4n2OCrxGj4', name='iris', suffix='.parquet', size=5629, hash='jUTdERuqlGv_GyqFfIEb2Q', source_id='aUvAkAwVxam9tDQcKfro', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
ln.select(ln.schema.Notebook, id=ln.context.transform.id).one()
Transform(id='OdlFhFWW7qg3', v='0', name='04-memory', type=notebook, title='Track in-memory data objects', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
ln.select(ln.schema.Run, id=ln.context.run.id).one()
Run(id='aUvAkAwVxam9tDQcKfro', transform_id='OdlFhFWW7qg3', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
In storage#
A parquet file named after the File id rather than the name we passed:
!ls ./mydata
VWybnO7aZxo8NMu80eHp.jpg mydata.lndb sfqjeqshOu4n2OCrxGj4.parquet
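The listing shows that the stored object is named with the `File` id (`sfqjeqshOu4n2OCrxGj4`) plus its suffix, not with the human-readable name `iris`. A minimal sketch of that naming rule follows, with `storage_key` as a hypothetical helper and the storage root taken from the `ls` call above:

```python
from pathlib import PurePosixPath


def storage_key(root: str, file_id: str, suffix: str) -> PurePosixPath:
    """Hypothetical helper: the object path is <root>/<id><suffix>."""
    return PurePosixPath(root) / f"{file_id}{suffix}"


key = storage_key("./mydata", "sfqjeqshOu4n2OCrxGj4", ".parquet")
print(key.name)
```

Naming objects by id keeps storage paths unique even when two files share the same name; the readable name lives only in the database record.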
Retrieve data#
Get the dataframe back:
file.load()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |
150 rows × 5 columns