Track in-memory data objects#

We’ve learned how to track files. What about in-memory objects?

A File can also be created from in-memory data, which is then stored in its canonical storage format (e.g., DataFrame → .parquet, AnnData → .h5ad/.zarr, …).
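
To make the idea of a canonical storage format concrete, here is a minimal sketch that maps an object's type to the suffix it would be written with. The CANONICAL_SUFFIX table and the canonical_suffix helper are hypothetical names used only for illustration; they are not part of lamindb, and only the formats named above are listed.

```python
# Hypothetical lookup: in-memory type name -> storage suffix.
# Only the formats mentioned in the text are listed; this is not lamindb's code.
CANONICAL_SUFFIX = {
    "DataFrame": ".parquet",
    "AnnData": ".h5ad",  # .zarr is the other option named in the text
}


def canonical_suffix(obj) -> str:
    """Return the storage suffix for an in-memory object, keyed by its type name."""
    type_name = type(obj).__name__
    try:
        return CANONICAL_SUFFIX[type_name]
    except KeyError:
        raise TypeError(f"no canonical storage format known for {type_name}") from None


class DataFrame:
    """Stand-in for pandas.DataFrame so the sketch runs without pandas installed."""


print(canonical_suffix(DataFrame()))
```

An object of an unlisted type raises a TypeError here, which is one possible design choice for a sketch like this: failing loudly is safer than silently picking a format.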

import lamindb as ln

ln.track()
ℹ️ Instance: testuser1/mydata
ℹ️ User: testuser1
ℹ️ Added notebook: Transform(id='OdlFhFWW7qg3', v='0', name='04-memory', type=notebook, title='Track in-memory data objects', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
ℹ️ Added run: Run(id='aUvAkAwVxam9tDQcKfro', transform_id='OdlFhFWW7qg3', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))

Usage#

Let’s now ingest an in-memory DataFrame storing the iris dataset:

import sklearn.datasets

df = sklearn.datasets.load_iris(as_frame=True).frame
df.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
df.shape
(150, 5)

An in-memory object has no file name to fall back on, so a name argument needs to be passed when ingesting it:

file = ln.File(df, name="iris")

Next, add the metadata to the database and the data to storage. Both happen in a single transaction:

ln.add(file)
File(id='sfqjeqshOu4n2OCrxGj4', name='iris', suffix='.parquet', size=5629, hash='jUTdERuqlGv_GyqFfIEb2Q', source_id='aUvAkAwVxam9tDQcKfro', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
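
The returned record carries a size in bytes and a content hash, which together make it possible to check whether stored data has changed. The sketch below shows one way to produce a hash of the same shape as the one above (22 URL-safe characters). That the scheme is MD5 followed by URL-safe base64 is an assumption based only on the length and alphabet of the string; the page does not state how lamindb computes it, and content_hash is a hypothetical helper.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """URL-safe base64 of the MD5 digest with '=' padding removed (assumed scheme)."""
    digest = hashlib.md5(data).digest()  # MD5 digests are always 16 bytes
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# Placeholder bytes, not the real parquet file from this guide.
h = content_hash(b"example bytes")
print(len(h))  # 16 bytes encode to 22 base64 characters once padding is stripped
```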

What happens under the hood?#

In the SQL database#

  1. A File entry

  2. A Notebook entry

  3. A Run entry

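All three entries are linked: the File's source_id points to the Run, and the Run's transform_id points to the notebook. You can therefore find the file starting from any of these metadata records.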
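
The linkage between the three entries can be sketched with a small in-memory SQLite database. The table layout here is a simplified assumption (only the id columns and names visible in the outputs on this page), not lamindb's actual schema.

```python
import sqlite3

# Simplified, hypothetical tables holding only the fields shown on this page.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE transform (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE run (id TEXT PRIMARY KEY, transform_id TEXT REFERENCES transform(id));
    CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT, source_id TEXT REFERENCES run(id));
    INSERT INTO transform VALUES ('OdlFhFWW7qg3', '04-memory');
    INSERT INTO run VALUES ('aUvAkAwVxam9tDQcKfro', 'OdlFhFWW7qg3');
    INSERT INTO file VALUES ('sfqjeqshOu4n2OCrxGj4', 'iris', 'aUvAkAwVxam9tDQcKfro');
    """
)

# Find files by the name of the notebook that produced them:
# file.source_id -> run.id, then run.transform_id -> transform.id.
rows = con.execute(
    """
    SELECT file.id, file.name
    FROM file
    JOIN run ON file.source_id = run.id
    JOIN transform ON run.transform_id = transform.id
    WHERE transform.name = ?
    """,
    ("04-memory",),
).fetchall()
print(rows)
```

Following the two foreign keys in the other direction answers the reverse question: which run and notebook produced a given file.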

ln.select(ln.File, name=file.name).one()
File(id='sfqjeqshOu4n2OCrxGj4', name='iris', suffix='.parquet', size=5629, hash='jUTdERuqlGv_GyqFfIEb2Q', source_id='aUvAkAwVxam9tDQcKfro', storage_id='8Pj12JLb', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
ln.select(ln.schema.Notebook, id=ln.context.transform.id).one()
Transform(id='OdlFhFWW7qg3', v='0', name='04-memory', type=notebook, title='Track in-memory data objects', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))
ln.select(ln.schema.Run, id=ln.context.run.id).one()
Run(id='aUvAkAwVxam9tDQcKfro', transform_id='OdlFhFWW7qg3', transform_v='0', created_by='DzTjkKse', created_at=datetime.datetime(2023, 3, 30, 23, 15, 58))

In storage#

A parquet file named by the File id plus its suffix:

!ls ./mydata
VWybnO7aZxo8NMu80eHp.jpg  mydata.lndb  sfqjeqshOu4n2OCrxGj4.parquet
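
Comparing this listing with the File record suggests that the object in storage is named by the record's id followed by its suffix. A minimal sketch of that naming rule, where storage_key is a hypothetical helper and the rule itself is inferred from the listing rather than taken from lamindb's code:

```python
from pathlib import PurePosixPath


def storage_key(file_id: str, suffix: str) -> str:
    """Build the object name in storage from a record's id and suffix (inferred rule)."""
    return f"{file_id}{suffix}"


# Values copied from the File record shown earlier on this page.
key = storage_key("sfqjeqshOu4n2OCrxGj4", ".parquet")
path = PurePosixPath("mydata") / key
print(path)
```

Naming objects by id rather than by the user-supplied name means that two files both called iris would not collide in storage.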

Retrieve data#

Get the dataframe back:

file.load()
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                  5.1               3.5                1.4               0.2       0
1                  4.9               3.0                1.4               0.2       0
2                  4.7               3.2                1.3               0.2       0
3                  4.6               3.1                1.5               0.2       0
4                  5.0               3.6                1.4               0.2       0
...                ...               ...                ...               ...     ...
145                6.7               3.0                5.2               2.3       2
146                6.3               2.5                5.0               1.9       2
147                6.5               3.0                5.2               2.0       2
148                6.2               3.4                5.4               2.3       2
149                5.9               3.0                5.1               1.8       2

150 rows × 5 columns