GQL Graph Query Language


In September 2019 a proposal for a project to create a new standard graph query language was approved by a vote of the national standards bodies that are members of ISO/IEC Joint Technical Committee 1 (JTC 1), which is responsible for international information technology standards. GQL is intended to be a declarative database query language, like SQL.

Project for a new International Standard Graph Query Language

The GQL project proposal states:
The GQL project is the culmination of converging initiatives dating back to 2016, particularly a private proposal from Neo4j to other database vendors in July 2016, and a proposal from Oracle technical staff within the ISO/IEC JTC 1 standards process later that year.
The GQL project is led by Stefan Plantikow and Stephen Cannan. They are also the editors of the early working drafts of the GQL specification.
As originally motivated, the GQL project aims to complement the work of creating an implementable, normative natural-language specification with supportive community efforts. These efforts enable contributions from those who are unable to take part, or uninterested in taking part, in the formal process of defining a JTC 1 International Standard. In July 2019 the Linked Data Benchmark Council agreed to become the umbrella organization for the efforts of community technical working groups. The Existing Languages and the Property Graph Schema working groups were formed in late 2018 and early 2019 respectively. A working group to define formal denotational semantics for GQL was proposed at the third GQL Community Update in October 2019.

The GQL property graph data model

GQL is a query language specifically for property graphs. A property graph closely resembles a conceptual data model, as expressed in an entity–relationship model or in a UML class diagram. Entities or concepts are modelled as nodes, and relationships as edges, in a graph. Property graphs are multigraphs: there can be many edges between the same pair of nodes. GQL graphs can be mixed: they can contain directed edges, where one of the endpoint nodes of an edge is the tail and the other node is the head, but they can also contain undirected edges.
Nodes and edges, collectively known as elements, have attributes. Those attributes may be data values, or labels. Values of properties cannot be elements of graphs, nor can they be whole graphs: these restrictions intentionally force a clean separation between the topology of a graph, and the attributes carrying data values in the context of a graph topology. The property graph data model therefore deliberately prevents nesting of graphs, or treating nodes in one graph as edges in another. Each property graph may have a set of labels and a set of properties that are associated with the graph as a whole.
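The data model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definition taken from any GQL draft: the class names (Node, Edge, PropertyGraph), the field names and the sample labels are all hypothetical. It shows parallel edges between the same pair of nodes, a mix of directed and undirected edges, graph-level labels and properties, and the rule that property values must be plain data rather than elements or graphs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set


@dataclass
class Node:
    id: str
    labels: Set[str] = field(default_factory=set)
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    id: str
    tail: str              # for an undirected edge, tail/head are just the two endpoints
    head: str
    directed: bool = True
    labels: Set[str] = field(default_factory=set)
    properties: Dict[str, Any] = field(default_factory=dict)


def _check_values(properties):
    # Property values must be plain data: never graph elements or whole graphs.
    for key, value in properties.items():
        if isinstance(value, (Node, Edge, PropertyGraph)):
            raise TypeError(f"property {key!r} must be a data value")


@dataclass
class PropertyGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Dict[str, Edge] = field(default_factory=dict)   # keyed by edge id, so parallel edges coexist
    labels: Set[str] = field(default_factory=set)          # labels on the graph as a whole
    properties: Dict[str, Any] = field(default_factory=dict)

    def add_node(self, node):
        _check_values(node.properties)
        self.nodes[node.id] = node

    def add_edge(self, edge):
        if edge.tail not in self.nodes or edge.head not in self.nodes:
            raise KeyError("both endpoints must already be in the graph")
        _check_values(edge.properties)
        self.edges[edge.id] = edge


g = PropertyGraph(labels={"Social"})
g.add_node(Node("n1", {"Person"}, {"first_name": "Ada"}))
g.add_node(Node("n2", {"City"}, {"name": "Riga"}))
g.add_edge(Edge("e1", "n1", "n2", labels={"LIVES_IN"}))
g.add_edge(Edge("e2", "n1", "n2", labels={"BORN_IN"}))                    # parallel edge
g.add_edge(Edge("e3", "n1", "n2", directed=False, labels={"LINKED_TO"}))  # undirected edge
```

Keying edges by their own identifier, rather than by endpoint pair, is what lets the sketch behave as a multigraph.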
Current graph database products and projects often support a limited version of the model described here. For example, Apache TinkerPop forces each node and each edge to have a single label; Cypher allows nodes to have zero to many labels, but relationships have only a single label. Neo4j's database supports undocumented graph-wide properties; TinkerPop has graph values, which play the same role, and also supports "metaproperties", or properties on properties. Oracle's PGQL supports zero to many labels on nodes and on edges, whereas SQL/PGQ supports one to many labels for each kind of element.
The GQL project will define a standard data model, which is likely to be a superset of these variants. At least the first version of GQL is likely to permit vendors to decide on the cardinalities of labels in each implementation, as SQL/PGQ does, and to choose whether to support undirected relationships.
Additional aspects of the ERM or UML models may be captured by GQL schemas or types that describe possible instances of the general data model.

WG3: Extending SQL and creating GQL

The GQL project has a four-year timespan. Seven national standards bodies have nominated national subject-matter experts to work on the project, which is conducted by Working Group 3 of ISO/IEC JTC 1's Subcommittee 32, usually abbreviated as ISO/IEC JTC 1/SC 32 WG3, or just WG3 for short. WG3 has been responsible for the SQL standard since 1987.

Extending existing graph query languages

The GQL project draws on multiple sources or inputs, notably existing industrial languages and a new section of the SQL standard. In preparatory discussions within WG3, surveys of the history and comparative content of some of these inputs were presented. GQL will be a declarative language with its own distinct syntax, playing a similar role to SQL in the building of a database application. Other graph query languages have been defined which offer direct procedural features such as branching and looping (Apache TinkerPop's Gremlin, GSQL), making it possible to traverse a graph iteratively to perform a class of graph algorithms, but GQL will not directly incorporate such features. However, GQL is envisaged as a specific case of a more general class of graph languages, which will share a graph type system and a calling interface for procedures that process graphs.

SQL/PGQ Property Graph Query

Prior work by WG3 and SC32 mirror bodies, particularly in INCITS DM32, has helped to define a new planned Part 16 of the SQL Standard. This part allows a read-only graph query to be called inside a SQL SELECT statement, matching a graph pattern using syntax which is very close to Cypher, PGQL and G-CORE, and returning a table of data values as the result. SQL/PGQ also contains DDL to allow SQL tables to be mapped to a graph view schema object, with nodes and edges associated with sets of labels and sets of data properties. The GQL project coordinates closely with the SQL/PGQ "project split" of ISO 9075 SQL, and the technical working groups in the U.S. and at the international level have several expert contributors who work on both projects. The GQL project proposal mandates close alignment of SQL/PGQ and GQL, indicating that GQL will in general be a superset of SQL/PGQ.
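The idea of mapping tables to a graph view, and of a graph query returning a table, can be illustrated with a small Python sketch. It does not use SQL/PGQ syntax, and every name in it (the persons and cities tables, the Person, City and LIVES_IN labels, the graph_view function) is a hypothetical choice made for illustration.

```python
# Two "tables", each a list of rows.
persons = [
    {"id": 1, "name": "Ada", "city_id": 10},
    {"id": 2, "name": "Bo", "city_id": 11},
]
cities = [
    {"id": 10, "name": "Austin"},
    {"id": 11, "name": "Salem"},
]


def graph_view(persons, cities):
    """Map table rows to labelled nodes, and a foreign key to labelled edges."""
    nodes = {}
    for row in persons:
        nodes[("Person", row["id"])] = {"labels": {"Person"}, "props": {"name": row["name"]}}
    for row in cities:
        nodes[("City", row["id"])] = {"labels": {"City"}, "props": {"name": row["name"]}}
    edges = [
        (("Person", row["id"]), "LIVES_IN", ("City", row["city_id"]))
        for row in persons
    ]
    return nodes, edges


nodes, edges = graph_view(persons, cities)

# A read-only pattern match over the view, returning a table of data values.
rows = [
    (nodes[tail]["props"]["name"], nodes[head]["props"]["name"])
    for (tail, label, head) in edges
    if label == "LIVES_IN"
]
```

The view copies nothing back into the tables: the query reads graph elements derived from rows and returns plain values, which mirrors the read-only character of the graph query described above.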

Cypher

Cypher is a language originally designed by Andrés Taylor and colleagues at Neo4j Inc., and first implemented by that company in 2011. Since 2015 it has been made available as an open source language description with grammar tooling, a JVM front-end that parses Cypher queries, and a Technology Compatibility Kit of over 2000 test scenarios, using Cucumber for implementation language portability. The TCK reflects the language description and an enhancement for temporal datatypes and functions documented in a Cypher Improvement Proposal.
Cypher allows creation, reading, updating and deleting of graph elements, and is a language that can therefore be used for analytics engines and transactional databases.

Querying with visual path patterns

Cypher uses compact fixed- and variable-length patterns which combine visual representations of node and relationship topologies, with label existence and property value predicates. By matching such a pattern against graph data elements, a query can extract references to nodes, relationships and paths of interest. Those references are emitted as a "binding table" where column names are bound to a multiset of graph elements. The name of a column becomes the name of a "binding variable", whose value is a specific graph element reference for each row of the table.
For example, a pattern MATCH (p:Person)-[:LIVES_IN]->(c:City) will generate a two-column output table. The first column, named p, will contain references to nodes with a label Person. The second column, named c, will contain references to nodes with a label City, denoting the city where the person lives.
The binding variables p and c can then be dereferenced to obtain access to property values associated with the elements referred to by a variable. The example query might be terminated with a RETURN clause, resulting in a complete query like this:
MATCH (p:Person)-[:LIVES_IN]->(c:City)
RETURN p.first_name, p.last_name, c.name, c.state

This would result in a final four-column table listing the names of the residents of the cities stored in the graph.
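How a pattern match yields a binding table, and how a projection then dereferences the binding variables, can be sketched in Python. This is a toy model, not how any Cypher engine is implemented. The labels Person and City, the edge label LIVES_IN, the sample data and the match function are assumptions chosen for illustration.

```python
# Tiny graph: nodes keyed by id, edges as (tail, label, head) triples.
nodes = {
    1: {"labels": {"Person"}, "props": {"first_name": "Ada", "last_name": "Lee"}},
    2: {"labels": {"Person"}, "props": {"first_name": "Bo", "last_name": "Kim"}},
    3: {"labels": {"City"}, "props": {"name": "Austin", "state": "TX"}},
    4: {"labels": {"City"}, "props": {"name": "Salem", "state": "OR"}},
}
edges = [(1, "LIVES_IN", 3), (2, "LIVES_IN", 4), (1, "VISITED", 4)]


def match(tail_var, tail_label, edge_label, head_var, head_label):
    """Return a binding table: one row per match, columns named by the variables."""
    return [
        {tail_var: tail, head_var: head}
        for (tail, label, head) in edges
        if label == edge_label
        and tail_label in nodes[tail]["labels"]
        and head_label in nodes[head]["labels"]
    ]


# Each row binds p and c to element references (here, node ids).
binding_table = match("p", "Person", "LIVES_IN", "c", "City")

# The projection step dereferences the bindings to reach property values.
result = [
    (
        nodes[row["p"]]["props"]["first_name"],
        nodes[row["p"]]["props"]["last_name"],
        nodes[row["c"]]["props"]["name"],
        nodes[row["c"]]["props"]["state"],
    )
    for row in binding_table
]
```

The binding table holds references rather than values, so the same rows could feed a different projection without re-running the match.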
Pattern-based queries are able to express joins by combining multiple patterns that use the same binding variable, expressing a natural join within a single MATCH clause:
MATCH (p:Person)-[:LIVES_IN]->(c:City), (p:Person)-[:NATIONAL_OF]->(:EUCountry)
RETURN p.first_name, p.last_name, c.name, c.state

This query would return the residential location only of EU nationals.
An outer join can be expressed by MATCH ... OPTIONAL MATCH:
MATCH (p:Person)-[:LIVES_IN]->(c:City)
OPTIONAL MATCH (p:Person)-[:NATIONAL_OF]->(ec:EUCountry)
RETURN p.first_name, p.last_name, c.name, c.state, ec.name

This query would return the city of residence of each person in the graph with residential information, and, if an EU national, which country they come from.
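The join behaviour described above can be sketched over binding tables in Python. The sketch assumes each pattern has already been matched into a table of rows. The function names natural_join and optional_join, the column names and the sample rows are hypothetical. It also ignores details of real Cypher semantics, such as relationship uniqueness within a single clause.

```python
def natural_join(left, right):
    """Join two binding tables on every column name they share."""
    out = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[k] == r_row[k] for k in shared):
                out.append({**l_row, **r_row})
    return out


def optional_join(left, right, right_columns):
    """Left outer join: keep every left row, padding unmatched rows with None."""
    out = []
    for l_row in left:
        matches = natural_join([l_row], right)
        if matches:
            out.extend(matches)
        else:
            out.append({**l_row, **{col: None for col in right_columns}})
    return out


# Binding tables for the residence pattern and the nationality pattern.
lives_in = [{"p": "ada", "c": "austin"}, {"p": "bo", "c": "salem"}]
national_of = [{"p": "ada", "ec": "latvia"}]

joined = natural_join(lives_in, national_of)             # only rows matched by both patterns
optional = optional_join(lives_in, national_of, ["ec"])  # all residents, nationality if known
```

Sharing the variable p between the two tables is what makes the join "natural": no join condition is written, because equal column names imply equal bindings.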
Queries are therefore able to first project a sub-graph of the graph input into the query, and then extract the data values associated with that subgraph. Data values can also be processed by functions, including aggregation functions, leading to the projection of computed values which render the information held in the projected graph in various ways. Following the lead of G-CORE and Morpheus, GQL aims to project the sub-graphs defined by matching patterns as new graphs to be returned by a query.
Patterns of this kind have become pervasive in property graph query languages, and are the basis for the advanced pattern sub-language being defined in SQL/PGQ, which is likely to become a subset of the GQL language. Cypher also uses patterns for insertion and modification clauses, and proposals have been made in the GQL project for collecting node and edge patterns to describe graph types.

Cypher implementations

Cypher is implemented in Neo4j's database, in SAP's HANA Graph, by RedisGraph, by Cambridge Semantics' AnzoGraph, by Bitnine's AgensGraph, by Memgraph, and in the open source projects Cypher for Gremlin, maintained by Neueda Labs in Riga, and Cypher for Apache Spark, as well as in research projects such as Cypher.PL and Ingraph. Cypher as a language is governed as the openCypher project by an informal community which has held five face-to-face openCypher Implementers' Meetings since February 2017.

Cypher 9 and Cypher 10

The current version of Cypher is referred to as Cypher 9. Prior to the GQL project it was planned to create a new version, Cypher 10, that would incorporate features like schema and composable graph queries and views. The first designs for Cypher 10, including graph construction and projection, were implemented in the Cypher for Apache Spark project starting in 2016.

PGQL

PGQL is a language designed and implemented by Oracle Inc., but made available as an open source specification, along with JVM parsing software. PGQL combines familiar SQL SELECT syntax, including SQL expressions, result ordering and aggregation, with a pattern matching language very similar to that of Cypher. It allows the specification of the graph to be queried, and includes a facility for macros to capture "pattern views", or named sub-patterns. It does not support insertion or updating operations, having been designed primarily for an analytics environment, such as Oracle's PGX product. PGQL has also been implemented in Oracle Big Data Spatial and Graph, and in a research project, PGX.D/Async.

G-CORE

G-CORE is a research language, designed by a group of academic and industrial researchers and language designers, which draws on features of Cypher, PGQL and SPARQL. The project was conducted under the auspices of the Linked Data Benchmark Council, starting with the formation of a Graph Query Language task force in late 2015, with the bulk of the work of paper writing occurring in 2017. G-CORE is a composable language which is closed over graphs: graph inputs are processed to create a graph output, using graph projections and graph set operations to construct the new graph. G-CORE queries are pure functions over graphs, having no side effects, which means that the language does not define operations which mutate stored data. G-CORE introduces views. It also incorporates paths as elements in a graph, which can be queried independently of projected paths. G-CORE has been partially implemented in open-source research projects in the LDBC GitHub organization.
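The property of being closed over graphs can be illustrated with a short Python sketch. It is not G-CORE syntax and omits G-CORE's treatment of paths as graph elements. The Graph type, the union and project functions and the sample labels are hypothetical. Each operation takes graphs and returns a new graph without mutating its inputs, so operations compose freely.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Graph(NamedTuple):
    nodes: FrozenSet[str]
    edges: FrozenSet[Tuple[str, str, str]]   # (tail, label, head)


def union(a, b):
    """Graph set operation: the result is itself a graph."""
    return Graph(a.nodes | b.nodes, a.edges | b.edges)


def project(g, edge_label):
    """Graph projection: the sub-graph induced by edges carrying one label."""
    edges = frozenset(e for e in g.edges if e[1] == edge_label)
    nodes = frozenset(n for (tail, _, head) in edges for n in (tail, head))
    return Graph(nodes, edges)


g1 = Graph(frozenset({"ada", "bo"}), frozenset({("ada", "KNOWS", "bo")}))
g2 = Graph(frozenset({"bo", "riga"}), frozenset({("bo", "LIVES_IN", "riga")}))

# Because every operation returns a graph, outputs can be fed straight back in.
social = project(union(g1, g2), "KNOWS")
```

Immutable values (frozenset, NamedTuple) are used here so that the absence of side effects is enforced by the data structures rather than by convention.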

GSQL

GSQL is a language designed for TigerGraph Inc.'s property graph database. Since October 2018 TigerGraph language designers have been promoting and working on the GQL project. GSQL is a Turing-complete language that incorporates procedural flow control and iteration, and a facility, called accumulators, for gathering and modifying computed values associated with a program execution, either for the whole graph or for individual elements of a graph. These features are designed to enable iterative graph computations to be combined with data exploration and retrieval. GSQL graphs must be described by a schema of vertexes and edges, which constrains all insertions and updates. This schema therefore has the closed world property of an SQL schema, and this aspect of GSQL is proposed as an important optional feature of GQL.
Vertexes and edges are named schema objects which contain data but also define an imputed type, much as SQL tables are data containers with an associated implicit row type. GSQL graphs are then composed from these vertex and edge sets, and multiple named graphs can include the same vertex or edge set. GSQL has developed new features since its release in September 2017, most notably introducing variable-length edge pattern matching using a syntax related to that seen in Cypher, PGQL and SQL/PGQ, but also close in style to the fixed-length patterns offered by Microsoft SQL Server Graph.
GSQL also supports the concept of Multigraphs, which allow subsets of a graph to have role-based access control. Multigraphs are important for enterprise-scale graphs that need fine-grained access control for different users.
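The combination of iteration and accumulators can be illustrated, very loosely, in Python. This is not GSQL and makes no claim about how TigerGraph executes queries. The sample graph, the hop_counts function and the two accumulator variables are hypothetical. A per-vertex value that keeps a minimum and a graph-wide value that keeps a running sum stand in for the two scopes of accumulator described above.

```python
# Adjacency list for a small directed graph.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}


def hop_counts(start):
    """Iteratively expand a frontier, accumulating values as the traversal runs."""
    dist = {v: float("inf") for v in edges}   # per-vertex accumulator: keeps the minimum seen
    visited_total = 0                          # graph-wide accumulator: keeps a running sum
    dist[start] = 0
    frontier = {start}
    while frontier:                            # procedural loop until nothing changes
        visited_total += len(frontier)
        next_frontier = set()
        for v in frontier:
            for w in edges[v]:
                if dist[v] + 1 < dist[w]:
                    dist[w] = dist[v] + 1
                    next_frontier.add(w)
        frontier = next_frontier
    return dist, visited_total


dist, visited_total = hop_counts("a")
```

The loop terminates because a vertex re-enters the frontier only when its accumulated distance strictly decreases.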

Morpheus: multiple graphs and composable graph queries in Apache Spark

The openCypher Morpheus project implements Cypher for Apache Spark users. Commencing in 2016, this project originally ran alongside three related efforts, in which Morpheus designers also took part: SQL/PGQ, G-CORE, and the design of Cypher extensions for querying and constructing multiple graphs. The Morpheus project acted as a testbed for extensions to Cypher in two areas: graph DDL and query language extensions.
Graph DDL features include:
  1. definition of property graph views over JDBC-connected SQL tables and Spark DataFrames
  2. definition of graph schemas or types defined by assembling node type and edge type patterns, with subtyping
  3. constraining the content of a graph by a closed or fixed schema
  4. creating catalog entries for multiple named graphs in a hierarchically-organized catalog
  5. graph data sources to form a federated, heterogeneous catalog
  6. creating catalog entries for named queries
Graph query language extensions include:
  1. graph union
  2. projection of graphs computed from the results of pattern matches on multiple input graphs
  3. support for tables as inputs to queries
  4. views which accept named or projected graphs as parameters.
These features have been proposed as inputs to the standardization of property graph query languages in the GQL project.