30 mins with JSON in MySQL

November 17, 2015

Read this article in French

Note: This article is inspired by Morgan Tocker’s talk MySQL 5.7 + JSON.

Note 2: You may also be interested in 30 mins with MySQL JSON functions

Note 3: Handling JSON documents can also be done with MySQL Document Store.

As you may know, MySQL 5.7 is GA and has more than 150 new features. One of them is a Native JSON Data Type and JSON Functions: “Allows for efficient and flexible storage, search and manipulation of schema-less data. Enhancements include a new internal binary format, support for easy integration within SQL, and index management on the JSON Documents using generated columns”.

Sounds interesting! Let’s take half an hour to have a foretaste…

Get JSON documents

First let’s get data in JSON format. Mirco Zeiss provides a really big JSON file representing San Francisco’s subdivision parcels (from SF Open Data) at https://github.com/zemirco/sf-city-lots-json

To use this data, some tweaks are necessary:

$ grep "^{ \"type" citylots.json > properties.json

$ head -n1 properties.json 
{  
   "type":"Feature",
   "properties":{  
      "MAPBLKLOT":"0001001",
      "BLKLOT":"0001001",
      "BLOCK_NUM":"0001",
      "LOT_NUM":"001",
      "FROM_ST":"0",
      "TO_ST":"0",
      "STREET":"UNKNOWN",
      "ST_TYPE":null,
      "ODD_EVEN":"E"
   },
   "geometry":{  
      "type":"Polygon",
      "coordinates":[  
         [  
            [  
               -122.422003528252475,
               37.808480096967251,
               0.0
            ],
            [  
               -122.422076013325281,
               37.808835019815085,
               0.0
            ],
            [  
               -122.421102174348633,
               37.808803534992904,
               0.0
            ],
            [  
               -122.421062569067274,
               37.808601056818148,
               0.0
            ],
            [  
               -122.422003528252475,
               37.808480096967251,
               0.0
            ]
         ]
      ]
   }
}

Looks good!

Note: The size of JSON documents stored in JSON columns is limited to the value of the max_allowed_packet system variable. (While the server manipulates a JSON value internally in memory, it can be larger; the limit applies when the server stores it.).
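If your documents are bigger than the current limit, you can check and raise it. A quick sketch (the 256 MB value below is just an arbitrary example; the session value is read only, so change the global value, or my.cnf, and reconnect):

SHOW VARIABLES LIKE 'max_allowed_packet';
SET GLOBAL max_allowed_packet = 256*1024*1024;  # new connections pick up the new value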

Our JSON documents will be stored in an InnoDB table: features

CREATE TABLE features (
 id int(11) NOT NULL AUTO_INCREMENT,
 feature json NOT NULL,
 PRIMARY KEY (id)
) ENGINE=InnoDB;

Another way to store JSON documents is to put them in a string (VARCHAR or TEXT).

Let’s see if there are some differences between JSON documents stored in a string or in a JSON column.

CREATE TABLE features_TEXT (
 id int(11) NOT NULL AUTO_INCREMENT,
 feature longtext NOT NULL,
 PRIMARY KEY (id)
) ENGINE=InnoDB;

Note: The TEXT type is not large enough to handle our JSON data (ERROR 1406 (22001): Data too long for column 'feature' at row 17360). LONGTEXT will do the job.
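To pick the right string type up front, it helps to know how big the documents actually are. A small sketch to run once the data is loaded (LENGTH() counts bytes):

SELECT MAX(LENGTH(feature)) AS max_bytes, AVG(LENGTH(feature)) AS avg_bytes FROM features_TEXT;
# TEXT tops out at 65,535 bytes, MEDIUMTEXT at about 16 MB and LONGTEXT at about 4 GB.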

Populate tables

Note: In order to have a better idea of query execution time on my old (and not so stable) laptop with a small MySQL config (e.g. Buffer pool = 128MB), I ran the queries many times using mysqlslap: mysqlslap -c1 -i <N> { Concurrency = 1 / Iteration > 20 (depending on the query duration) }

So most of the time I’ll show the mysqlslap output, e.g. “Minimum number of seconds to run all queries: 59.392 seconds”, instead of the regular query output, e.g. “Query OK, 206560 rows affected (59.39 sec)”.
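For reference, the invocations looked roughly like this (the user, schema name and exact query below are illustrative, not the literal commands from my benchmark):

$ mysqlslap -u root -p --concurrency=1 --iterations=20 \
    --create-schema=test \
    --query="SELECT DISTINCT feature->'\$.type' FROM features"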

Copy JSON data in features

LOAD DATA INFILE 'properties.json' INTO TABLE features (feature);

Minimum number of seconds to run all queries: 59.392 seconds

Copy JSON data in features_TEXT

LOAD DATA INFILE 'properties.json' INTO TABLE features_TEXT (feature);

Minimum number of seconds to run all queries: 39.784 seconds

Loading the 206560 records shows a performance difference: the load into the JSON column is roughly 50% slower than into the TEXT column (59.4 vs 39.8 seconds).

However, be aware that the MySQL JSON data type provides:

  • Automatic validation of JSON documents stored in JSON columns. Meaning that invalid documents produce an error.
  • Optimized storage format. JSON documents stored in JSON columns are converted to an internal format that permits quick read access to document elements. When the server later must read a JSON value stored in this binary format, the value need not be parsed from a text representation. The binary format is structured to enable the server to look up subobjects or nested values directly by key or array index without reading all values before or after them in the document.

There is nothing comparable with the TEXT data type; in other words, these features have a cost, which is only logical and fair!
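To see the automatic validation in action, here is a minimal sketch (the exact error text may vary slightly between 5.7 releases):

INSERT INTO features (feature) VALUES ('{"type": "Feature"');
# ERROR 3140 (22032): Invalid JSON text: ... (the malformed document is rejected)
INSERT INTO features_TEXT (feature) VALUES ('{"type": "Feature"');
# Query OK: the same malformed string is silently accepted by the LONGTEXT column.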

Let’s have a look at the table metadata:

SHOW TABLE STATUS LIKE 'features'\G
*************************** 1. row ***************************
          Name: features
        Engine: InnoDB
       Version: 10
    Row_format: Dynamic
          Rows: 184218
Avg_row_length: 1250
   Data_length: 230326272  #220 MB
  Index_length: 0
     Data_free: 3145728
SHOW TABLE STATUS LIKE 'features_TEXT'\G
*************************** 1. row ***************************
          Name: features_TEXT
        Engine: InnoDB
       Version: 10
    Row_format: Dynamic
          Rows: 188784
Avg_row_length: 1370
   Data_length: 258654208  #247 MB
  Index_length: 0
     Data_free: 4194304

Interesting point here: the LONGTEXT data type consumes more space than the JSON data type (with its optimized storage format), roughly 12% more (247 MB vs 220 MB).

Return data from a JSON document

MySQL 5.7 provides a bunch of JSON functions.

JSON_EXTRACT returns data from a JSON document. Furthermore, since MySQL 5.7.9 you can use inlined JSON path expressions, which simplify queries that deal with JSON data and make them more human-readable:

e.g.

JSON_EXTRACT(col, "$.json_field") is similar to col->"$.json_field"
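For instance, on the features table loaded above (a quick sketch; note that both forms return the value as a quoted JSON string, and JSON_UNQUOTE() or the ->> operator added later in MySQL 5.7.13 strips the quotes):

SELECT JSON_EXTRACT(feature, "$.properties.STREET") FROM features LIMIT 1;
SELECT feature->"$.properties.STREET" FROM features LIMIT 1;
# Both return the same value, e.g. "UNKNOWN" (including the JSON double quotes).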

So how about retrieving our JSON documents?

Table with JSON data type

EXPLAIN SELECT DISTINCT feature->"$.type" AS json_extract FROM features\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: features
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 182309
    filtered: 100.00
       Extra: Using temporary

Minimum number of seconds to run all queries: 4.470 seconds

Table with TEXT data type

EXPLAIN SELECT DISTINCT feature->"$.type" AS json_extract FROM features_TEXT\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: features_TEXT
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 177803
    filtered: 100.00
       Extra: Using temporary

Minimum number of seconds to run all queries: 29.365 seconds

Getting these documents implies a full table scan (no surprise).

However, we can see the power of the MySQL JSON internal format, which permits quick read access to document elements.

In this example, query execution is about 7 times faster with the JSON data type compared to TEXT.

Generated column

JSON columns cannot be indexed. BUT you can work around this restriction by creating an index on a generated column that extracts a scalar value from the JSON column. Generated columns can either be materialized (stored) or non-materialized (virtual).

Creating a generated column is quite easy, and a VIRTUAL one is basically free because column values are not stored; they are evaluated when rows are read, immediately after any BEFORE triggers.

A virtual column takes no storage. It’s usually what you’ll need in real life.

Last but not least, it is only a metadata change, so adding a virtual column is fast (no table rebuild); it only requires a quick system table update that registers the new metadata.

ALTER TABLE features ADD feature_type VARCHAR(30) AS (feature->"$.type") VIRTUAL;
Query OK, 0 rows affected (0.08 sec)
Records: 0  Duplicates: 0  Warnings: 0
ALTER TABLE features_TEXT ADD feature_type VARCHAR(30) AS  (feature->"$.type") VIRTUAL;
Query OK, 0 rows affected (0.10 sec)
Records: 0  Duplicates: 0  Warnings: 0

Note: The disadvantage of such an approach is that values are stored twice; once as the value of the generated column and once in the index.
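For comparison only (a sketch, not part of this walk-through; the column name is illustrative), the STORED flavor would materialize the value in the row, and adding it rebuilds the table:

ALTER TABLE features ADD feature_type_stored VARCHAR(30) AS (feature->"$.type") STORED;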

New table descriptions are now:

CREATE TABLE features (
 id int(11) NOT NULL AUTO_INCREMENT,
 feature json NOT NULL,
 feature_type varchar(30) GENERATED ALWAYS AS (feature->"$.type") VIRTUAL,
 PRIMARY KEY (id)
) ENGINE=InnoDB
CREATE TABLE features_TEXT (
 id int(11) NOT NULL AUTO_INCREMENT,
 feature longtext NOT NULL,
 feature_type varchar(30) GENERATED ALWAYS AS (feature->"$.type") VIRTUAL,
 PRIMARY KEY (id)
) ENGINE=InnoDB

Let’s have a look at the table metadata:

ANALYZE TABLE features, features_TEXT;

SHOW TABLE STATUS LIKE 'features'\G
*************************** 1. row ***************************
          Name: features
        Engine: InnoDB
       Version: 10
    Row_format: Dynamic
          Rows: 184218
Avg_row_length: 1250
   Data_length: 230326272  #220 MB
  Index_length: 0
     Data_free: 314572
SHOW TABLE STATUS LIKE 'features_TEXT'\G
*************************** 1. row ***************************
          Name: features_TEXT
        Engine: InnoDB
       Version: 10
    Row_format: Dynamic
          Rows: 188784
Avg_row_length: 1370
   Data_length: 258654208  #247 MB
  Index_length: 0
     Data_free: 4194304

Identical!

As expected, the data length of each table is unchanged.

Is there any cost difference between selecting the value from the virtual column and extracting it with the JSON_EXTRACT function?

EXPLAIN SELECT DISTINCT feature_type FROM features\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: features
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 195195
    filtered: 100.00
       Extra: Using temporary

Minimum number of seconds to run all queries: 2.790 seconds

EXPLAIN SELECT DISTINCT feature_type FROM features_TEXT\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: features_TEXT
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 171004
    filtered: 100.00
       Extra: Using temporary

Minimum number of seconds to run all queries: 25.933 seconds

Obviously the query execution plan (QEP) is the same: a full table scan (FTS).

Anyway, two comments:

  • The MySQL JSON internal format is still more efficient than the TEXT data type; in this example, query execution is about 9 times faster with JSON (2.790 vs 25.933 seconds).
  • In this example, the FTS using the virtual generated column (feature_type) is faster than using the JSON_EXTRACT function on the JSON document in the SELECT clause (from 4.470 down to 2.790 seconds).

Create indexes on generated columns

As of MySQL 5.7.8, InnoDB supports secondary indexes on virtual columns.

Adding or dropping a secondary index on a virtual column is an in-place operation.

ALTER TABLE features ADD INDEX (feature_type);
Query OK, 0 rows affected (5.04 sec)
Records: 0  Duplicates: 0  Warnings: 0
ALTER TABLE features_TEXT ADD INDEX (feature_type);
Query OK, 0 rows affected (27.89 sec)
Records: 0  Duplicates: 0  Warnings: 0

New table descriptions are:

CREATE TABLE features (
 id int(11) NOT NULL AUTO_INCREMENT,
 feature json NOT NULL,
 feature_type varchar(30) GENERATED ALWAYS AS (feature->"$.type") VIRTUAL,
 PRIMARY KEY (id),
 KEY feature_type (feature_type)
) ENGINE=InnoDB
CREATE TABLE features_TEXT (
 id int(11) NOT NULL AUTO_INCREMENT,
 feature longtext NOT NULL,
 feature_type varchar(30) GENERATED ALWAYS AS (feature->"$.type") VIRTUAL,
 PRIMARY KEY (id),
 KEY feature_type (feature_type)
) ENGINE=InnoDB

Let’s have another look at the table metadata:

ANALYZE TABLE features, features_TEXT;

SHOW TABLE STATUS LIKE 'features'\G
*************************** 1. row ***************************
          Name: features
        Engine: InnoDB
       Version: 10
    Row_format: Dynamic
          Rows: 180400
Avg_row_length: 1276
   Data_length: 230326272  #220 MB
  Index_length: 5783552    #6 MB
     Data_free: 5242880
SHOW TABLE STATUS LIKE 'features_TEXT'\G
*************************** 1. row ***************************
          Name: features_TEXT
        Engine: InnoDB
       Version: 10
    Row_format: Dynamic
          Rows: 192445
Avg_row_length: 1344
   Data_length: 258654208  #247 MB
  Index_length: 5783552    #6 MB
     Data_free: 2097152

The index on the feature_type column is materialized. Its size is approximately 6 MB.

Now because of this index, the query should be more efficient:

EXPLAIN SELECT DISTINCT feature_type FROM features\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: features
        type: index
possible_keys: feature_type
         key: feature_type
     key_len: 33
         ref: NULL
        rows: 193763
    filtered: 100.00
       Extra: Using index

Minimum number of seconds to run all queries: 0.178 seconds

EXPLAIN SELECT DISTINCT feature_type FROM features_TEXT\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: features_TEXT
        type: index
possible_keys: feature_type
         key: feature_type
     key_len: 33
         ref: NULL
        rows: 208134
    filtered: 100.00
       Extra: Using index

Minimum number of seconds to run all queries: 0.178 seconds

As expected, the optimizer uses the index (feature_type) and the query execution time is much better in both cases (from 2.790 down to 0.178 seconds for the JSON column).
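One gotcha worth noting when filtering on such a column (a side note and a sketch, not part of the original benchmark): because feature_type was generated with the -> operator rather than ->> or JSON_UNQUOTE(), the stored string keeps its JSON double quotes, so an indexed equality lookup must include them:

SELECT COUNT(*) FROM features WHERE feature_type = '"Feature"';
# With the index in place, EXPLAIN should show a ref access on the feature_type key
# instead of a full table scan (expected plan, not verified here).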

Wrapup

MySQL 5.7 implements native JSON data type support and provides a set of functions to create, search and modify JSON values, as well as to return JSON attribute values. That’s good, and I guess many developers will be happy to use this new feature.

Generated columns are also an interesting feature. They can be used, among other things, to simulate functional indexes, as a materialized cache for frequently used expressions, or, as we did here, to provide index management on JSON documents.
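For instance (a generic sketch with an illustrative, hypothetical users table, unrelated to the JSON tables above), the functional-index trick looks like this:

ALTER TABLE users ADD email_domain VARCHAR(100) AS (SUBSTRING_INDEX(email, '@', -1)) VIRTUAL;
ALTER TABLE users ADD INDEX (email_domain);
# Queries filtering on that expression can now use the index via the email_domain column.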

Give it a try; it is definitely worth more than 30 minutes.

Want to know more about MySQL 5.7?

Going further

MySQL Documentation

The JSON data type

https://dev.mysql.com/doc/refman/5.7/en/json.html

JSON Functions

https://dev.mysql.com/doc/refman/5.7/en/json-functions.html

CREATE TABLE and Generated Columns

http://dev.mysql.com/doc/refman/5.7/en/create-table.html#create-table-generated-columns

MySQL Server Blog

Native JSON Data Type and Binary Format

http://mysqlserverteam.com/json-labs-release-native-json-data-type-and-binary-format/

JSON functions

http://mysqlserverteam.com/json-labs-release-json-functions-part-1-manipulation-json-data/

http://mysqlserverteam.com/mysql-5-7-lab-release-json-functions-part-2-querying-json-data/

https://mysqlserverteam.com/new-json-functions-in-mysql-5-7-22/

Inline JSON Path Expressions in MySQL 5.7

http://mysqlserverteam.com/inline-json-path-expressions-in-mysql-5-7/

Getting Started With MySQL & JSON on Windows

http://mysqlserverteam.com/getting-started-with-mysql-json-on-windows/

Effective Functional Indexes in InnoDB

http://mysqlserverteam.com/json-labs-release-effective-functional-indexes-in-innodb/

MySQL 5.7

What Is New in MySQL 5.7

https://dev.mysql.com/doc/refman/5.7/en/mysql-nutshell.html

Complete list of new features in MySQL 5.7

http://www.thecompletelistoffeatures.com/

Thanks for using MySQL!
