Unity Catalog PuppyGraph Integration¶
This document walks through how to use PuppyGraph to query data from Delta tables registered in Unity Catalog as a graph.
Prerequisites¶
- JDK 17 to build and run Unity Catalog and Spark
- Docker
- This repository
unitycatalog
cloned - Spark downloaded
Build the Unity Server and Spark support¶
Run the command From the cloned repository root directory
build/sbt clean package publishLocal spark/publishLocal
Run the Unity Catalog Server¶
Run the command to start a Unity Server.
./bin/start-uc-server
For the remaining steps, continue in a different terminal.
Create Tables under the Unity Catalog¶
Create a catalog puppygraph
and several Delta tables under the schema modern
in that catalog.
./bin/uc catalog create --name puppygraph
./bin/uc schema create --name modern --catalog puppygraph
./bin/uc table create --full_name puppygraph.modern.person --columns "id STRING, name STRING, age INT" --storage_location /tmp/puppygraph/person/ --format DELTA
./bin/uc table create --full_name puppygraph.modern.knows --columns "id STRING, from_id STRING, to_id STRING, weight DOUBLE" --storage_location /tmp/puppygraph/knowns/ --format DELTA
./bin/uc table create --full_name puppygraph.modern.software --columns "id STRING, name STRING, lang STRING" --storage_location /tmp/puppygraph/software/ --format DELTA
./bin/uc table create --full_name puppygraph.modern.created --columns "id STRING, from_id STRING, to_id STRING, weight DOUBLE" --storage_location /tmp/puppygraph/created/ --format DELTA
Load Data into the Tables¶
Run the command from the Spark folder to start a Spark SQL shell .
./bin/spark-sql \
--packages \
io.delta:delta-spark_2.12:3.2.0,io.unitycatalog:unitycatalog-spark:0.2.0-SNAPSHOT \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.puppygraph=io.unitycatalog.connectors.spark.UCSingleCatalog \
--conf spark.sql.catalog.puppygraph.uri=http://localhost:8080
Run the following SQL to insert data into the Delta tables.
insert into puppygraph.modern.person VALUES
('v1', 'marko', 29),
('v2', 'vadas', 27),
('v4', 'josh', 32),
('v6', 'peter', 35);
INSERT INTO puppygraph.modern.software VALUES
('v3', 'lop', 'java'),
('v5', 'ripple', 'java');
INSERT INTO puppygraph.modern.created VALUES
('e9', 'v1', 'v3', 0.4),
('e10', 'v4', 'v5', 1.0),
('e11', 'v4', 'v3', 0.4),
('e12', 'v6', 'v3', 0.2);
INSERT INTO puppygraph.modern.knows VALUES
('e7', 'v1', 'v2', 0.5),
('e8', 'v1', 'v4', 1.0);
Exit the Spark SQL shell after data insertion is done.
Querying the Tables as a Graph¶
Start PuppyGraph using Docker. Here we map the PuppyGraph port 8081
to 9081
on the host.
docker run -p 9081:8081 -p 8182:8182 -p 7687:7687 \
-v /tmp/puppygraph:/tmp/puppygraph \
--name puppy --rm -itd puppygraph/puppygraph:stable
Create the schema.json and replace <host-name>
with your host ip address.
{
"catalogs": [
{
"name": "puppygraph",
"type": "deltalake",
"metastore": {
"type": "unity",
"host": "http://<host-name>:8081",
"token": "no-use",
"databricksCatalogName": "puppygraph"
}
}
],
"vertices": [
{
"label": "person",
"attributes": [
{ "type": "String", "name": "name" },
{ "type": "Int" , "name": "age" }
],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern",
"table": "person",
"metaFields": {"id": "id"}
}
},
{
"label": "software",
"attributes": [
{ "type": "String", "name": "name" },
{ "type": "String", "name": "lang" }
],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern",
"table": "software",
"metaFields": {"id": "id"}
}
}
],
"edges": [
{
"label": "knows",
"from": "person",
"to": "person",
"attributes": [ {"type": "Double", "name": "weight"} ],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern",
"table": "knows",
"metaFields": {"from": "from_id", "id": "id", "to": "to_id"}
}
},
{
"label": "created",
"from": "person",
"to": "software",
"attributes": [ {"type": "Double", "name": "weight"} ],
"mappedTableSource": {
"catalog": "puppygraph",
"schema": "modern",
"table": "created",
"metaFields": {"from": "from_id", "id": "id", "to": "to_id"}
}
}
]
}
Upload the schema to PuppyGraph. Note here port is 9081 as 8081 is used by Unity Catalog.
curl -XPOST -H "content-type: application/json" --data-binary @./schema.json --user "puppygraph:puppygraph123" localhost:9081/schema
Start a PuppyGraph Gremlin Console to query the graph.
docker exec -it puppygraph ./bin/console
Input the following query string to get all the software created by people that marko knows.
g.V().has("name", "marko").out("knows").out("created").valueMap()
The output should be like this:
puppy-gremlin> g.V().has("name", "marko").out("knows").out("created").valueMap()
Done! Elapsed time: 0.080s, rows: 2
==>map[lang:java name:lop]
==>map[lang:java name:ripple]
You can also use Web UI for more features like graph visualization and a notebook-style query interface.