Skip to content

Table Formats

This page explains how you can work with various table storage format tables in your Unity Catalog. It will also explain the advantages and drawbacks of working with these storage formats including Parquet, ORC, JSON, CSV, Avro, and TEXT.

Set up Unity Catalog

To follow along, make sure you have a local instance of Unity Catalog running by launching the following command from a terminal window:

bin/start-uc-server

This local UC server will come with some sample data pre-loaded. To list all of the tables in your local Unity Catalog, use:

bin/uc table list --catalog unity --schema default
┌─────────────────┬──────────────┬─────┬────────────────────────────────────┐
│      NAME       │ CATALOG_NAME │ ... │              TABLE_ID              │
├─────────────────┼──────────────┼─────┼────────────────────────────────────┤
│marksheet        │unity         │ ... │c389adfa-5c8f-497b-8f70-26c2cca4976d│
├─────────────────┼──────────────┼─────┼────────────────────────────────────┤
│marksheet_uniform│unity         │ ... │9a73eb46-adf0-4457-9bd8-9ab491865e0d│
├─────────────────┼──────────────┼─────┼────────────────────────────────────┤
│numbers          │unity         │ ... │32025924-be53-4d67-ac39-501a86046c01│
├─────────────────┼──────────────┼─────┼────────────────────────────────────┤
│user_countries   │unity         │ ... │26ed93b5-9a18-4726-8ae8-c89dfcfea069│
└─────────────────┴──────────────┴─────┴────────────────────────────────────┘

As you can see, there are currently four (4) Delta tables pre-loaded in this catalog.

Create a table using a different storage format

To create a table storage format table such as Parquet, ORC, Avro, CSV, JSON, or TEXT, use the bin/uc table create ... command with the --format flag.

The following creates a new table in the path/to/storage LOCATION two colummns: some_numbers and some_letters

bin/uc table create --full_name unity.default.test \
   --columns "some_numbers INT, some_letters STRING"  \
   --storage_location /path/to/storage \ 
   --format PARQUET
bin/uc table create --full_name unity.default.test \
   --columns "some_numbers INT, some_letters STRING"  \
   --storage_location /path/to/storage \ 
   --format JSON
bin/uc table create --full_name unity.default.test \
   --columns "some_numbers INT, some_letters STRING"  \
   --storage_location /path/to/storage \ 
   --format CSV
bin/uc table create --full_name unity.default.test \
   --columns "some_numbers INT, some_letters STRING"  \
   --storage_location /path/to/storage \ 
   --format ORC
bin/uc table create --full_name unity.default.test \
   --columns "some_numbers INT, some_letters STRING"  \
   --storage_location /path/to/storage \ 
   --format AVRO
bin/uc table create --full_name unity.default.test \
   --columns "some_numbers INT, some_letters STRING"  \
   --storage_location /path/to/storage \ 
   --format TEXT

Setting your /path/to/storage

You will need to manually set the /path/to/storage to the correct storage location. If you don't know where Unity Catalog is storing your files, then take a look at the metadata of an existing table using bin/uc table get --full_name <catalog.schema.table> to see its storage location.

After you run the table create command, your output should look similar to the following abridged output of a JSON table:

┌───────────────────┬───────────────────────────────────────────────┐
│        KEY        │                  VALUE                        │
├───────────────────┼───────────────────────────────────────────────┤
│NAME               │test_json                                      │
├───────────────────┼───────────────────────────────────────────────┤
│CATALOG_NAME       │unity                                          │
├───────────────────┼───────────────────────────────────────────────┤
│SCHEMA_NAME        │default                                        │
├───────────────────┼───────────────────────────────────────────────┤
│TABLE_TYPE         │EXTERNAL                                       │
├───────────────────┼───────────────────────────────────────────────┤
│DATA_SOURCE_FORMAT │JSON                                           │
├───────────────────┼───────────────────────────────────────────────┤
│COLUMNS            │{"name":"some_numbers","type_text":"int","type\│
│                   │"nullable\":true,\"metadata\":{}}","type_name"t│
│                   │ype":null,"position":0,"comment":null,"nullabl │
│                   │{"name":"some_letters","type_text":"string","t"│
│                   │,\"nullable\":true,\"metadata\":{}}","type_namr│
│                   │val_type":null,"position":1,"comment":null,"nu │
├───────────────────┼───────────────────────────────────────────────┤
│STORAGE_LOCATION   │file:///tmp/tables/test_json/                  │
├───────────────────┼───────────────────────────────────────────────┤
│...                │...                                            │
├───────────────────┼───────────────────────────────────────────────┤
│TABLE_ID           │9d73eb9c-8d40-46f2-a2c0-4e5d2a3e0611           │
└───────────────────┴───────────────────────────────────────────────┘

This command has multiple parameters:

Parameter Description
full_name The full name of the table, which is a concatenation of the catalog name, schema name, and table name separated by dots (e.g., catalog_name.schema_name.table_name).
columns The columns of the table in SQL-like format "column_name column_data_type". Supported data types include BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, DATE, TIMESTAMP, TIMESTAMP_NTZ, STRING, BINARY, DECIMAL. Separate multiple columns with a comma (e.g., "id INT, name STRING")
format [Optional] The format of the data source. Supported values are DELTA, PARQUET, ORC, JSON, CSV, AVRO, and TEXT. If not specified the default format is DELTA.
storage_location The storage location associated with the table. It is a mandatory field for EXTERNAL tables.
properties [Optional] The properties of the entity in JSON format (e.g., '{"key1": "value1", "key2": "value2"}'). Make sure to either escape the double quotes(\") inside the properties string or just use single quotes('') around the same.

Challenges using table storage formats

While popular, each table storage format (e.g., Parquet, ORC, JSON, CSV, Avro, TEXT, etc.) has their own set of distinct advantages.But the challenges when working with these formats include:

  • No ACID transactions for these data lakes meaning it's easier to accidentally corrupt your data
  • It is not easy to delete rows from these tables
  • These table storage foramts do not offer DML transactions
  • They lack advanced features from schema evolution and enforcement to deletion vectors to change data feed
  • Slow file listing overhead when working with cloud object stores such as AWS S3, Azure ADLSgen2, and Google Cloud Storage
  • Potentialy expensive footer reads to gather statistics for file skipping

Open table formats like Apache Iceberg and Delta Lake are specifically designed to overcome these challenges. Storing your data in a lakehouse format is almost always more advantageous than storing it in traditional table storage formats.