Unity Catalog Daft Integration
This page shows you how to use Unity Catalog with Daft.
Daft is a library for parallel and distributed processing of multimodal data.
Set up
To start, install Daft with the extra Unity Catalog dependencies using:
pip install -U "getdaft[unity,deltalake]"
Then import Daft and the UnityCatalog abstraction:
import daft
from daft.unity_catalog import UnityCatalog
You need a running Unity Catalog server to connect to.
For testing purposes, you can spin up a local server by running the command below in a terminal, from the root of the Unity Catalog repository:
bin/start-uc-server
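Before connecting, it can help to confirm that the server is actually listening. The following is a minimal sketch using only the Python standard library. It assumes the local server exposes its REST API under /api/2.1/unity-catalog on port 8080; the helper name uc_server_is_up is hypothetical and is not part of Daft or Unity Catalog.

```python
import urllib.error
import urllib.request


def uc_server_is_up(endpoint: str = "http://127.0.0.1:8080", timeout: float = 2.0) -> bool:
    """Return True if a Unity Catalog server answers at `endpoint`.

    Assumption: the server lists catalogs at /api/2.1/unity-catalog/catalogs.
    """
    url = endpoint.rstrip("/") + "/api/2.1/unity-catalog/catalogs"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling uc_server_is_up() with no arguments checks the default local address. A False result usually means the server has not finished starting or is bound to a different port.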
Connect Daft to Unity Catalog
Use the UnityCatalog abstraction to point Daft to your UC server.
This object requires an endpoint and a token. If you launched the UC server locally using the command above, you can use the values below. Otherwise, substitute the endpoint and token values with the corresponding values for your UC server.
# point Daft to your UC server
unity = UnityCatalog(
    endpoint="http://127.0.0.1:8080",
    token="not-used",
)
You can also connect to a Unity Catalog in your Databricks workspace by setting endpoint = "https://<databricks_workspace_id>.cloud.databricks.com" and setting token to an access token for that workspace.
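When the endpoint and token differ between a local server and a hosted workspace, one option is to read them from the environment instead of hardcoding them. This is a sketch under stated assumptions: the variable names UC_ENDPOINT and UC_TOKEN and both helper functions are hypothetical, chosen only for illustration.

```python
import os


def unity_catalog_settings(env=os.environ) -> dict:
    """Build the keyword arguments for UnityCatalog from environment variables.

    UC_ENDPOINT and UC_TOKEN are hypothetical names chosen for this sketch;
    the fallbacks match the local test server used above.
    """
    return {
        "endpoint": env.get("UC_ENDPOINT", "http://127.0.0.1:8080"),
        "token": env.get("UC_TOKEN", "not-used"),
    }


def connect_from_env():
    """Create a UnityCatalog client from the environment (requires Daft)."""
    # Imported lazily so unity_catalog_settings() has no third-party dependency.
    from daft.unity_catalog import UnityCatalog

    return UnityCatalog(**unity_catalog_settings())
```

With this in place, unity = connect_from_env() would replace the explicit constructor call, and the token stays out of source control.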
Once you're connected, you can list all your available catalogs using:
> print(unity.list_catalogs())
['unity']
You can list all available schemas in a given catalog:
> print(unity.list_schemas("unity"))
['unity.default']
And you can list all the available tables in a given schema:
> print(unity.list_tables("unity.default"))
['unity.default.numbers', 'unity.default.marksheet_uniform', 'unity.default.marksheet']
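The three listing calls above can be combined to walk the whole hierarchy in one pass. The sketch below assumes, as the outputs above suggest, that list_schemas and list_tables return fully qualified names. walk_catalog and FakeUnity are hypothetical names; FakeUnity only mimics the listings shown above so that the example runs without a server.

```python
def walk_catalog(unity) -> dict:
    """Map each catalog to its schemas, and each schema to its tables.

    `unity` is any object exposing list_catalogs(), list_schemas(catalog)
    and list_tables(schema), such as the UnityCatalog client created above.
    """
    tree = {}
    for catalog in unity.list_catalogs():
        tree[catalog] = {
            schema: unity.list_tables(schema)
            for schema in unity.list_schemas(catalog)
        }
    return tree


class FakeUnity:
    """Stand-in returning the listings shown above (no server needed)."""

    def list_catalogs(self):
        return ["unity"]

    def list_schemas(self, catalog):
        return [f"{catalog}.default"]

    def list_tables(self, schema):
        return [f"{schema}.numbers", f"{schema}.marksheet_uniform", f"{schema}.marksheet"]


print(walk_catalog(FakeUnity()))
```

Against a real server, pass the unity object instead of FakeUnity().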
Load Unity Tables into Daft DataFrame
You can use Daft to read Delta Lake tables in a Unity Catalog.
First, point Daft to your Delta table stored in your Unity Catalog:
unity_table = unity.load_table("unity.default.numbers")
The tables used in this example are stored in the Delta Lake format, so you can read them with the Daft read_deltalake function:
> df = daft.read_deltalake(unity_table)
> df.show()
as_int as_double
564 188.755356
755 883.610563
644 203.439559
75 277.880219
42 403.857969
680 797.691220
821 767.799854
484 344.003740
477 380.678561
131 35.443732
294 209.322436
150 329.197303
539 425.661029
247 477.742227
958 509.371273
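If you need several tables, the two steps above (load_table, then read_deltalake) can be wrapped in a small helper. This is a sketch, not part of the Daft API: read_unity_tables is a hypothetical name, and the reader parameter exists only so the helper can be exercised without a running server.

```python
def read_unity_tables(unity, table_names, reader=None) -> dict:
    """Load each named table from Unity Catalog and read it into a DataFrame.

    `reader` defaults to daft.read_deltalake, so every table is assumed to be
    stored in the Delta Lake format.
    """
    if reader is None:
        import daft  # lazy import: only needed for real reads

        reader = daft.read_deltalake
    # One load_table call and one read per table, keyed by the table name.
    return {name: reader(unity.load_table(name)) for name in table_names}
```

For example, read_unity_tables(unity, ["unity.default.numbers"]) would return a dictionary with one Daft DataFrame, keyed by the table's full name.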
Any subsequent filter operations on the Daft df DataFrame object will be optimized to take advantage of Delta Lake features, such as skipping files that cannot contain matching rows.
> df = df.where(df["as_int"] > 500)
> df.show()
as_int as_double
564 188.755356
755 883.610563
644 203.439559
680 797.691220
821 767.799854
539 425.661029
958 509.371273
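One Delta Lake feature that a filter like this can exploit is the per-file minimum and maximum statistics kept in the table's transaction log. The following is a conceptual sketch of that idea in plain Python, not Daft's actual implementation: the FileStats type, the file names, and the statistics are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics, similar in spirit to a Delta log entry."""

    path: str
    min_as_int: int
    max_as_int: int


def files_to_scan(files, lower_bound):
    """Keep only files that could contain rows with as_int > lower_bound."""
    return [f.path for f in files if f.max_as_int > lower_bound]


# Illustrative values only; they do not describe the real `numbers` table.
files = [
    FileStats("part-0.parquet", min_as_int=42, max_as_int=484),
    FileStats("part-1.parquet", min_as_int=539, max_as_int=958),
    FileStats("part-2.parquet", min_as_int=131, max_as_int=564),
]
print(files_to_scan(files, 500))
```

Here part-0.parquet is never opened, because its largest as_int value is below the filter threshold.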
Daft support for Unity Catalog is under rapid development. Refer to the Daft documentation for more information.