Newick format


In mathematics, Newick tree format is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. It was adopted by James Archie, William H. E. Day, Joseph Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford, at two meetings in 1986, the second of which was at Newick's restaurant in Dover, New Hampshire, US. The adopted format is a generalization of the format developed by Meacham in 1984 for the first tree-drawing programs in Felsenstein's PHYLIP package.

Examples

The following tree:
[Figure: the example tree]
could be represented in Newick format in several ways:
…);     no nodes are named
…;      leaf nodes are named
…F;     all nodes are named
…;      all but the root node have a distance to parent
…:0.0;  all nodes have a distance to parent
…;      distances and leaf names
…F;     distances and all names
…F;     a tree rooted on a leaf node
Newick format is typically used by tools such as PHYLIP and serves as a minimal description of a phylogenetic tree.
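The naming and distance variants listed above can be illustrated with a small serializer. This is a minimal sketch, not a reference implementation: the `Node` class, the `to_newick` function, and the tree itself (root F, leaves A, B, C, D, internal node E, and all distances) are assumptions made for the example and are not taken from the figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One tree node; `length` is the distance to its parent (hypothetical type)."""
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node, names: str = "all", lengths: bool = True) -> str:
    """Serialize a tree; `names` is "none", "leaves" or "all"."""
    return _subtree(node, names, lengths) + ";"


def _subtree(node: Node, names: str, lengths: bool) -> str:
    text = ""
    if node.children:
        inner = ",".join(_subtree(child, names, lengths) for child in node.children)
        text = "(" + inner + ")"
    is_leaf = not node.children
    if names == "all" or (names == "leaves" and is_leaf):
        text += node.name
    if lengths and node.length is not None:
        text += f":{node.length}"
    return text


# Hypothetical tree: root F with leaves A and B and an internal node E above C and D.
tree = Node("F", None, [
    Node("A", 0.1),
    Node("B", 0.2),
    Node("E", 0.5, [Node("C", 0.3), Node("D", 0.4)]),
])

print(to_newick(tree, names="none", lengths=False))    # (,,(,));
print(to_newick(tree, names="leaves", lengths=False))  # (A,B,(C,D));
print(to_newick(tree, names="all", lengths=False))     # (A,B,(C,D)E)F;
print(to_newick(tree, names="none", lengths=True))     # (:0.1,:0.2,(:0.3,:0.4):0.5);
print(to_newick(tree, names="leaves", lengths=True))   # (A:0.1,B:0.2,(C:0.3,D:0.4):0.5);
print(to_newick(tree, names="all", lengths=True))      # (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;
```

Giving the root a `length` of `0.0` instead of `None` would produce the "all nodes have a distance to parent" variant, ending in `:0.0;`.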

Rooted, unrooted, and binary trees

When an unrooted tree is represented in Newick notation, an arbitrary node is chosen as its root. Whether rooted or unrooted, typically a tree's representation is rooted on an internal node and it is rare to root a tree on a leaf node.
A rooted binary tree that is rooted on an internal node has exactly two immediate descendant nodes for each internal node.
An unrooted binary tree that is rooted on an arbitrary internal node has exactly three immediate descendant nodes for the root node, and each other internal node has exactly two immediate descendant nodes.
A binary tree rooted from a leaf has at most one immediate descendant node for the root node, and each internal node has exactly two immediate descendant nodes.
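The three child-count conditions above can be checked mechanically. The following is a rough sketch under stated assumptions: the nested-tuple representation (a leaf is a string, an internal node is a tuple of subtrees) and the function names are hypothetical and chosen only for this illustration.

```python
from typing import Tuple, Union

# A leaf is its name; an internal node is a tuple of child subtrees (assumed encoding).
Tree = Union[str, Tuple["Tree", ...]]


def internal_nodes_binary(subtree: Tree) -> bool:
    """True if every internal node at or below `subtree` has exactly two children."""
    if isinstance(subtree, str):
        return True
    return len(subtree) == 2 and all(internal_nodes_binary(c) for c in subtree)


def classify_binary(root: Tree) -> str:
    """Classify a tree by the number of children of its root node."""
    if isinstance(root, str):
        return "single leaf"
    if not all(internal_nodes_binary(child) for child in root):
        return "not binary"
    if len(root) == 2:
        return "rooted binary"
    if len(root) == 3:
        return "unrooted binary"
    if len(root) == 1:
        return "binary, rooted on a leaf"
    return "not binary"


print(classify_binary((("A", "B"), ("C", "D"))))   # rooted binary
print(classify_binary(("A", "B", ("C", "D"))))     # unrooted binary
print(classify_binary((("B", ("C", "D")),)))       # binary, rooted on a leaf
print(classify_binary(("A", "B", "C", "D")))       # not binary
```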

Grammar

A grammar for parsing the Newick format:

The grammar nodes

Tree: the full input Newick format for a single tree
Subtree: an internal node or a leaf node
Leaf: a node with no descendants
Internal: a node and its one or more descendants
BranchSet: a set of one or more Branches
Branch: a tree edge and its descendant subtree
Name: the name of a node
Length: the length of a tree edge

The grammar rules

Note, "|" separates alternatives.
Tree → Subtree ";" | Branch ";"
Subtree → Leaf | Internal
Leaf → Name
Internal → "(" BranchSet ")" Name
BranchSet → Branch | Branch "," BranchSet
Branch → Subtree Length
Name → empty | string
Length → empty | ":" number
Whitespace within number is prohibited. Whitespace within string is often prohibited. Whitespace elsewhere is ignored. Sometimes the Name string must be of a specified fixed length; otherwise the punctuation characters from the grammar (semicolon, parentheses, comma, and colon) are prohibited. The Tree → Branch ";" production makes the entire tree descendant from nowhere, which can be nonsensical, and is sometimes prohibited.
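The grammar rules above map naturally onto a recursive-descent parser, with one method per production. The sketch below is illustrative only: the `Node` and `NewickParser` names are hypothetical, whitespace inside names is treated as prohibited, and quoted labels and bracketed comments are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PUNCTUATION = "(),:;"  # characters that cannot appear inside an unquoted Name


@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


class NewickParser:
    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def _peek(self) -> str:
        """Skip ignorable whitespace and return the next character ("" at end)."""
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def _expect(self, ch: str) -> None:
        if self._peek() != ch:
            raise ValueError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def _word(self) -> str:
        """Read characters up to the next punctuation mark or whitespace."""
        self._peek()
        start = self.pos
        while (self.pos < len(self.text)
               and self.text[self.pos] not in PUNCTUATION
               and not self.text[self.pos].isspace()):
            self.pos += 1
        return self.text[start:self.pos]

    def parse_tree(self) -> Node:
        # Tree → Branch ";" (this also covers Subtree ";", since Length may be empty)
        root = self.parse_branch()
        self._expect(";")
        return root

    def parse_branch(self) -> Node:
        # Branch → Subtree Length
        node = self.parse_subtree()
        if self._peek() == ":":
            self.pos += 1
            node.length = float(self._word())
        return node

    def parse_subtree(self) -> Node:
        # Subtree → Leaf | Internal;  Internal → "(" BranchSet ")" Name
        children: List[Node] = []
        if self._peek() == "(":
            self.pos += 1
            children.append(self.parse_branch())
            while self._peek() == ",":
                self.pos += 1
                children.append(self.parse_branch())
            self._expect(")")
        return Node(name=self._word(), children=children)


root = NewickParser("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;").parse_tree()
print(root.name, [child.name for child in root.children])  # F ['A', 'B', 'E']
```

Accepting a length on the root means the Tree → Branch ";" production is allowed here; a stricter parser could reject it.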
Note that when a tree having more than one leaf is rooted from one of its leaves, a representation that is rarely seen in practice, the root leaf is characterized as an Internal node by the above grammar. Generally, a root node labeled as Internal should be construed as a leaf if and only if it has exactly one Branch in its BranchSet. One can make a grammar that formalizes this distinction by replacing the above Tree production rule with
Tree → RootLeaf ";" | RootInternal ";" | Branch ";"
RootLeaf → Name | "(" Branch ")" Name
RootInternal → "(" Branch "," BranchSet ")" Name
The first RootLeaf production is for a tree with exactly one leaf. The second RootLeaf production is for rooting a tree from one of its two or more leaves.

Dialects

New Hampshire X format

The New Hampshire X format is an extension to Newick that adds key-value data to Newick nodes. This is done by putting the additional data in brackets in the node labels. Brackets are used because they delimit comments in the Nexus file format, so any parser that does not understand this additional information will ignore it.
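As a rough illustration of how such bracketed annotations might be read, the sketch below assumes the commonly seen `[&&NHX:key=value:key=value]` layout. The function names and the tag values in the sample string are hypothetical, and real files may carry tags this sketch does not anticipate.

```python
import re
from typing import Dict

# Assumed layout of an NHX annotation: [&&NHX:key=value:key=value...]
NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_tags(label: str) -> Dict[str, str]:
    """Extract key-value pairs from the first NHX comment found in a node label."""
    match = NHX_COMMENT.search(label)
    if match is None:
        return {}
    tags: Dict[str, str] = {}
    for item in match.group(1).split(":"):
        key, sep, value = item.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            tags[key] = value
    return tags


def strip_comments(newick: str) -> str:
    """What a parser unaware of NHX effectively sees: bracketed comments removed."""
    return re.sub(r"\[[^\]]*\]", "", newick)


sample = "(A:0.1[&&NHX:S=human:B=100],B:0.2[&&NHX:S=mouse]);"  # illustrative data
print(nhx_tags("A:0.1[&&NHX:S=human:B=100]"))  # {'S': 'human', 'B': '100'}
print(strip_comments(sample))                  # (A:0.1,B:0.2);
```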

Extended Newick

While the standard Newick notation is limited to phylogenetic trees, Extended Newick can be used to encode explicit phylogenetic networks. In a phylogenetic network, which is a generalization of a phylogenetic tree, a node either represents a divergence event or a reticulation event such as hybridization, introgression, horizontal gene transfer or recombination. Nodes that represent a reticulation event are duplicated, annotated by introducing the # symbol into the Newick format, and numbered consecutively.
For example, if leaf Y is the product of hybridisation between lineages leading to C and D in the tree above,

one can express this situation by defining two trees in standard Newick notation:
…e)f; and …f;     standard Newick, all nodes are named
or in extended Newick notation:
…e)f;     extended Newick, all nodes are named; 1 is the integer identifying the hybrid node x
Node x here is a hybrid node. It will be joined by the program into a single node when drawn. The production rules above are modified as follows for labelling hybrid nodes:
Leaf → Name Hybrid
Hybrid → empty | "#" Type integer -- The #i part is an obligatory identifier for a hybrid node
Type → empty | string -- type of reticulation, e.g., H = hybridisation, LGT = lateral gene transfer, R = recombination.
Extended Newick is backward-compatible: a legacy parser would simply interpret a hybrid node as a few strangely named nodes.
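A program that joins the duplicated hybrid nodes first has to find them. The sketch below groups labels of the form Name "#" Type integer by their integer identifier. The regular expression, the function name, and the sample string are assumptions made for this illustration; the sample is not claimed to be the string from the example above.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Name, then "#", an optional Type string, and the obligatory integer identifier.
HYBRID_LABEL = re.compile(r"([^(),:;#\s]*)#([A-Za-z]*)(\d+)")


def hybrid_nodes(newick: str) -> Dict[int, List[Tuple[str, str]]]:
    """Map each hybrid identifier to the (name, type) pairs of its occurrences."""
    found: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for name, kind, index in HYBRID_LABEL.findall(newick):
        found[int(index)].append((name, kind))
    return dict(found)


# Illustrative string: leaf Y hangs below hybrid node x, which appears twice.
sample = "(A,B,((C,(Y)x#H1)c,(x#H1,D)d)e)f;"
print(hybrid_nodes(sample))  # {1: [('x', 'H'), ('x', 'H')]}
```

Occurrences that share an identifier denote the same node, so a drawing program would merge them into one.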

Rich Newick format

The Rich Newick format, also known as the Rice Newick format, is a further extension of Extended Newick. It adds support for:
Some other programs, like NWX, use comments to encode additional information in an ad hoc manner.
Several tools have been published to visualize Newick tree data, such as the ETE toolkit and T-REX. Phylogenetic software packages such as SplitsTree and the tree-viewer Dendroscope as well as the online tree viewing tool can handle standard and extended Newick notation, while the phylogenetic network software makes use of both the Extended Newick and Rich Newick format.