AWK
AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems.
The AWK language is a data-driven scripting language consisting of a set of actions to be taken against streams of textual data – either run directly on files or used as part of a pipeline – for purposes of extracting or transforming text, such as producing formatted reports. The language extensively uses the string datatype, associative arrays, and regular expressions. While AWK has a limited intended application domain and was especially designed to support one-liner programs, the language is Turing-complete, and even the early Bell Labs users of AWK often wrote well-structured large AWK programs.
AWK was created at Bell Labs in the 1970s, and its name is derived from the surnames of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. The acronym is pronounced the same as the bird auk, which is on the cover of The AWK Programming Language. When written in all lowercase letters, as awk, it refers to the Unix or Plan 9 program that runs scripts written in the AWK programming language.
History
AWK was initially developed in 1977 by Alfred Aho, Peter J. Weinberger, and Brian Kernighan; it takes its name from their respective initials. According to Kernighan, one of the goals of AWK was to have a tool that would easily manipulate both numbers and strings. AWK was also inspired by Marc Rochkind's programming language that was used to search for patterns in input data, and was implemented using yacc.
As one of the early tools to appear in Version 7 Unix, AWK added computational features to a Unix pipeline besides the Bourne shell, the only scripting language available in a standard Unix environment. It is one of the mandatory utilities of the Single UNIX Specification, and is required by the Linux Standard Base specification.
AWK was significantly revised and expanded in 1985–88, resulting in the GNU AWK implementation written by Paul Rubin, Jay Fenlason, and Richard Stallman, released in 1988. GNU AWK may be the most widely deployed version because it is included with GNU-based Linux packages. GNU AWK has been maintained solely by Arnold Robbins since 1994. Brian Kernighan's nawk source was first released in 1993 without publicity, and has been publicly available since the late 1990s; many BSD systems use it to avoid the GPL.
AWK was preceded by sed. Both were designed for text processing. They share the line-oriented, data-driven paradigm, and are particularly suited to writing one-liner programs, due to the implicit main loop and current line variables. The power and terseness of early AWK programs – notably the powerful regular expression handling and conciseness due to implicit variables, which facilitate one-liners – together with the limitations of AWK at the time, were important inspirations for the Perl language. In the 1990s, Perl became very popular, competing with AWK in the niche of Unix text-processing languages.
Structure of AWK programs
An AWK program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
where condition is typically an expression and action is a series of commands. The input is split into records, where by default records are separated by newline characters so that the input is split into lines. The program tests each record against each of the conditions in turn, and executes the action for each expression that is true. Either the condition or the action may be omitted. The condition defaults to matching every record. The default action is to print the record. This is the same pattern-action structure as sed.
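The two defaults described above can be sketched from a shell as follows. This is only an illustration: the sample input and the helper names only_pattern and only_action are made up for the example and are not part of AWK.

```sh
# Hypothetical helpers; the names are illustrative only.

# Condition without an action: the default action prints each matching record.
only_pattern() { awk '/an/'; }

# Action without a condition: the action runs for every record.
only_action() { awk '{ print "saw:", $0 }'; }

printf 'apple\nbanana\ncherry\n' | only_pattern   # prints only "banana"
printf 'apple\nbanana\ncherry\n' | only_action    # prints "saw: ..." for all three lines
```

The first program has no braces at all and the second has no condition, so each one relies on exactly one of the two defaults.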
In addition to a simple AWK expression, such as foo == 1 or /^foo/, the condition can be BEGIN or END, causing the action to be executed before or after all records have been read, or pattern1, pattern2, which matches the range of records starting with a record that matches pattern1 up to and including the record that matches pattern2, before again trying to match against pattern1 on future lines.
In addition to normal arithmetic and logical operators, AWK expressions include the tilde operator, ~, which matches a regular expression against a string. As handy syntactic sugar, /regexp/ without the tilde operator matches against the current record; this syntax derives from sed, which in turn inherited it from the ed editor, where / is used for searching. This syntax of using slashes as delimiters for regular expressions was subsequently adopted by Perl and ECMAScript, and is now common. The tilde operator was also adopted by Perl.
Commands
AWK commands are the statements that are substituted for action in the examples above. AWK commands can include function calls, variable assignments, calculations, or any combination thereof. AWK contains built-in support for many functions; many more are provided by the various flavors of AWK. Also, some flavors support the inclusion of dynamically linked libraries, which can also provide more functions.
The print command
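The print command is used to output text. The output text is always terminated with a predefined string called the output record separator (ORS), whose default value is a newline. The simplest form of this command is:
print
This displays the contents of the current record. Records are broken down into fields, which can be displayed separately:
print $1
This displays the first field of the current record.
print $1, $3
This displays the first and third fields of the current record, separated by the output field separator (OFS), whose default value is a single space.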
Although these fields may bear resemblance to variables, they actually refer to the fields of the current record. A special case, $0, refers to the entire record. In fact, the commands "print" and "print $0" are identical in functionality.
The print command can also display the results of calculations and/or function calls:
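/regex_pattern/ { print 3 + 2, sin(3 - 2) }
Output may be sent to a file:
/regex_pattern/ { print "expression" > "file name" }
or through a pipe:
/regex_pattern/ { print "expression" | "command" }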
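As a hedged illustration of the print forms in this section, the sketch below sets the output field separator and sends output through a pipe. The helper names and sample input are assumptions made for the example; sort is the ordinary Unix utility.

```sh
# Hypothetical helper: print fields 1 and 3, joined by a custom output field separator.
first_and_third() { awk 'BEGIN { OFS = "-" } { print $1, $3 }'; }

# Hypothetical helper: send each first field through a pipe to the sort command.
# close() ends the pipe so that sort finishes before awk exits.
sorted_first_field() { awk '{ print $1 | "sort" } END { close("sort") }'; }

printf 'a b c\nd e f\n' | first_and_third          # prints a-c, then d-f
printf 'pear 1\napple 2\n' | sorted_first_field    # prints apple, then pear
```

The comma in print $1, $3 is what inserts OFS; writing print $1 $3 without the comma would concatenate the two fields instead.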
Built-in variables
AWK's built-in variables include the field variables: $1, $2, $3, and so on. They hold the text or values in the individual text fields of a record.
Other variables include:
- NR: Number of Records. Keeps a current count of the number of input records read so far from all data files. It starts at zero, but is never automatically reset to zero.
- FNR: File Number of Records. Keeps a current count of the number of input records read so far in the current file. This variable is automatically reset to zero each time a new file is started.
- NF: Number of Fields. Contains the number of fields in the current input record. The last field in the input record can be designated by $NF, the 2nd-to-last field by $(NF-1), the 3rd-to-last field by $(NF-2), etc.
- FILENAME: Contains the name of the current input file.
- FS: Field Separator. Contains the "field separator" character used to divide fields in the input record. The default, "white space", includes any space and tab characters. FS can be reassigned to another character to change the field separator.
- RS: Record Separator. Stores the current "record separator" character. Since, by default, an input line is the input record, the default record separator character is a "newline".
- OFS: Output Field Separator. Stores the "output field separator", which separates the fields when AWK prints them. The default is a "space" character.
- ORS: Output Record Separator. Stores the "output record separator", which separates the output records when AWK prints them. The default is a "newline" character.
- OFMT: Output Format. Stores the format for numeric output. The default format is "%.6g".
Variables and syntax
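Several of the built-in variables listed above can be seen together in a short sketch. The helper name describe_records and the colon-separated sample input are assumptions chosen for the example.

```sh
# Hypothetical helper: for each colon-separated record, print the record number,
# the field count, the second-to-last field, and the last field.
describe_records() {
    awk 'BEGIN { FS = ":"; OFS = " | " } { print NR, NF, $(NF-1), $NF }'
}

printf 'a:b:c\nx:y\n' | describe_records
# prints:
# 1 | 3 | b | c
# 2 | 2 | x | y
```

Setting FS and OFS in a BEGIN action ensures they take effect before the first record is split or printed.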
User-defined functions
In a format similar to C, function definitions consist of the keyword function, the function name, argument names and the function body. Here is an example of a function:
function add_three(number) { return number + 3 }
This function can be invoked as follows:
BEGIN { print add_three(36) }   # prints 39
Functions can have variables that are in the local scope. The names of these are added to the end of the argument list, though values for these should be omitted when calling the function. It is convention to add some whitespace in the argument list before the local variables, to indicate where the parameters end and the local variables begin.
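The local-variable convention can be sketched as follows. The function sum_to and the wrapper sum_to_demo are hypothetical names used only for this illustration.

```sh
# Hypothetical demo: n is the real parameter of sum_to; i and total are locals,
# listed after extra whitespace and omitted by the caller.
sum_to_demo() {
    awk '
    function sum_to(n,    i, total) {
        for (i = 1; i <= n; i++)
            total += i
        return total
    }
    BEGIN {
        i = 99                  # global i, untouched by the function
        print sum_to(4), i      # prints "10 99"
    }'
}

sum_to_demo
```

Because i appears in the argument list of sum_to, the loop counter inside the function is a separate variable, and the global i keeps its value of 99.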
Examples
Hello World
Here is the customary "Hello, world" program written in AWK:
BEGIN { print "Hello, world!" }
Note that an explicit exit statement is not needed here; since the only pattern is BEGIN, no command-line arguments are processed.
Print lines longer than 80 characters
Print all lines longer than 80 characters. Note that the default action is to print the current line.
length($0) > 80
Count words
Count words in the input and print the number of lines, words, and characters:
{ words += NF; chars += length($0) + 1 }   # add one for the newline at the end of each record
END { print NR, words, chars }
As there is no pattern for the first line of the program, every line of input matches by default, so the increment actions are executed for every line. Note that words += NF is shorthand for words = words + NF.
Sum last word
{ s += $NF }
END { print s + 0 }
s is incremented by the numeric value of $NF, which is the last word on the line as defined by AWK's field separator. NF is the number of fields in the current line, e.g. 4. Since $4 is the value of the fourth field, $NF is the value of the last field in the line regardless of how many fields this line has, or whether it has more or fewer fields than surrounding lines. $ is actually a unary operator with the highest operator precedence.
At the end of the input the END pattern matches, so s is printed. However, since there may have been no lines of input at all, in which case no value has ever been assigned to s, it will by default be an empty string. Adding zero to a variable is an AWK idiom for coercing it from a string to a numeric value. With the coercion the program prints "0" on an empty input, without it an empty line is printed.
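The effect of the coercion can be checked with a small sketch. It assumes the program has the form { s += $NF } END { print s + 0 }, as the description above implies; the helper names are illustrative.

```sh
# Hypothetical helpers: the same summing program with and without the "+ 0" coercion.
sum_last_coerced() { awk '{ s += $NF } END { print s + 0 }'; }
sum_last_raw()     { awk '{ s += $NF } END { print s }'; }

printf '1 2\n3 4\n' | sum_last_coerced   # prints 6 (2 + 4)
sum_last_coerced < /dev/null             # prints 0
sum_last_raw < /dev/null                 # prints an empty line
```

On non-empty input both variants print the same number; they differ only when no record is ever read.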
Match a range of input lines
NR % 4 == 1, NR % 4 == 3 { printf "%d %s\n", NR, $0 }
The action statement prints each line numbered. The printf function emulates the standard C printf and works similarly to the print command described above. The pattern to match, however, works as follows: NR is the number of records, typically lines of input, AWK has so far read, i.e. the current line number, starting at 1 for the first line of input. % is the modulo operator. NR % 4 == 1 is true for the 1st, 5th, 9th, etc., lines of input. Likewise, NR % 4 == 3 is true for the 3rd, 7th, 11th, etc., lines of input. The range pattern is false until the first part matches, on line 1, and then remains true up to and including when the second part matches, on line 3. It then stays false until the first part matches again on line 5.
Thus, the program prints lines 1, 2, 3, skips line 4, and then prints 5, 6, 7, and so on. For each line, it prints the line number and then the line contents. For example, when executed on this input:
Rome
Florence
Milan
Naples
Turin
Venice
The previous program prints:
1 Rome
2 Florence
3 Milan
5 Turin
6 Venice
Printing the initial or the final part of a file
As a special case, when the first part of a range pattern is constantly true, e.g. 1, the range will start at the beginning of the input. Similarly, if the second part is constantly false, e.g. 0, the range will continue until the end of input. For example,
/^--cut here--$/, 0
prints lines of input from the first line matching the regular expression ^--cut here--$, that is, a line containing only the phrase "--cut here--", to the end.
Calculate word frequencies
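Word frequency using associative arrays:
BEGIN { FS = "[^a-zA-Z]+" }
{ for (i = 1; i <= NF; i++) words[tolower($i)]++ }
END { for (i in words) print i, words[i] }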
The BEGIN block sets the field separator to any sequence of non-alphabetic characters. Note that separators can be regular expressions. After that, we get to a bare action, which performs the action on every input line. In this case, for every field on the line, we add one to the number of times that word, first converted to lowercase, appears. Finally, in the END block, we print the words with their frequencies. The line for (i in words) creates a loop that goes through the array words, setting i to each subscript of the array. This is different from most languages, where such a loop goes through each value in the array. The loop thus prints out each word followed by its frequency count.
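A runnable sketch of the word-frequency idea follows. Two details are additions for the example rather than part of the description above: the output is piped through sort, because the order in which for (... in ...) visits subscripts is unspecified, and empty fields are skipped, because a record that ends in punctuation yields a trailing empty field under this separator. The helper name word_counts is illustrative.

```sh
# Hypothetical helper: count case-insensitive word frequencies on standard input.
word_counts() {
    awk '
    BEGIN { FS = "[^a-zA-Z]+" }                 # any run of non-letters separates fields
    {
        for (i = 1; i <= NF; i++)
            if ($i != "")                       # skip empty leading or trailing fields
                words[tolower($i)]++
    }
    END { for (w in words) print w, words[w] }  # w takes each subscript, not each value
    ' | sort
}

printf 'The cat. The dog!\n' | word_counts
# prints:
# cat 1
# dog 1
# the 2
```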
tolower was an addition to the One True awk made after the book was published.
Match pattern from command line
This program can be represented in several ways. The first one uses the Bourne shell to make a shell script that does everything. It is the shortest of these methods:
#!/bin/sh
pattern="$1"
shift
awk '/'"$pattern"'/ { print FILENAME ":" $0 }' "$@"
The $pattern in the awk command is not protected by single quotes, so the shell expands the variable, but it must be put in double quotes to properly handle patterns containing spaces. A pattern by itself in the usual way checks to see if the whole line ($0) matches. FILENAME contains the current filename. awk has no explicit concatenation operator; two adjacent strings are concatenated. $0 expands to the original unchanged input line.
There are alternate ways of writing this. This shell script accesses the environment directly from within awk:
#!/bin/sh
export pattern="$1"
shift
awk '$0 ~ ENVIRON["pattern"] { print FILENAME ":" $0 }' "$@"
This is a shell script that uses ENVIRON, an array introduced in a newer version of the One True awk after the book was published. The subscript of ENVIRON is the name of an environment variable; its result is the variable's value. This is like the getenv function in various standard libraries and POSIX. The shell script makes an environment variable pattern containing the first argument, then drops that argument and has awk look for the pattern in each file.
~ checks to see if its left operand matches its right operand; !~ is its inverse. Note that a regular expression is just a string and can be stored in variables.
The next way uses command-line variable assignment, in which an argument to awk can be seen as an assignment to a variable:
#!/bin/sh
pattern="$1"
shift
awk '$0 ~ pattern { print FILENAME ":" $0 }' "pattern=$pattern" "$@"
Or you can use the -v var=value command-line option.
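The two assignment styles differ in when the variable becomes visible, which the following sketch illustrates. The helper names are hypothetical; the behavior shown is that of POSIX awk, where -v assignments happen before BEGIN and operand assignments happen only when awk reaches that argument while processing input.

```sh
# Hypothetical helpers comparing the two ways of passing a variable to awk.

# -v assigns before the BEGIN action runs.
with_v() { awk -v pattern=foo 'BEGIN { print "[" pattern "]" }'; }

# An operand assignment is applied only when awk reaches that argument,
# which is after BEGIN, so pattern is still empty here.
with_operand() { awk 'BEGIN { print "[" pattern "]" }' pattern=foo; }

with_v         # prints [foo]
with_operand   # prints []
```

For a program whose only use of the variable is in a main rule such as $0 ~ pattern, both styles behave the same, since the operand assignment is processed before the files that follow it.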
Finally, this is written in pure awk, without help from a shell or without the need to know too much about the implementation of the awk script, but is a bit lengthy:
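BEGIN {
    pattern = ARGV[1]
    for (i = 1; i < ARGC; i++) ARGV[i] = ARGV[i + 1]   # remove the first argument
    ARGC--
    if (ARGC == 1) { ARGC = 2; ARGV[1] = "-" }   # only the pattern was given, so read standard input
}
$0 ~ pattern { print FILENAME ":" $0 }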
The BEGIN is necessary not only to extract the first argument, but also to prevent it from being interpreted as a filename after the BEGIN block ends. ARGC, the number of arguments, is always guaranteed to be ≥1, as ARGV[0] is the name of the command that executed the script, most often the string "awk". Also note that ARGV[ARGC] is the empty string, "". # initiates a comment that extends to the end of the line.
Note the if block. awk only checks to see if it should read from standard input before it runs the command. This means that awk 'prog' only works because the fact that there are no filenames is only checked before prog is run! If you explicitly set ARGC to 1 so that there are no arguments, awk will simply quit because it concludes there are no more input files. Therefore, you need to explicitly say to read from standard input with the special filename -.
Self-contained AWK scripts
On Unix-like operating systems, self-contained AWK scripts can be constructed using the shebang syntax.
For example, a script that prints the content of a given file may be built by creating a file named print.awk with the following content:
#!/usr/bin/awk -f
{ print $0 }
It can be invoked with:
./print.awk <filename>
The -f tells AWK that the argument that follows is the file to read the AWK program from, which is the same flag that is used in sed. Since they are often used for one-liners, both these programs default to executing a program given as a command-line argument, rather than a separate file.
Versions and implementations
AWK was originally written in 1977 and distributed with Version 7 Unix.
In 1985 its authors started expanding the language, most significantly by adding user-defined functions. The language is described in the book The AWK Programming Language, published in 1988, and its implementation was made available in releases of UNIX System V. To avoid confusion with the incompatible older version, this version was sometimes called "new awk" or nawk. This implementation was released under a free software license in 1996 and is still maintained by Brian Kernighan.
Old versions of Unix, such as UNIX/32V, included awkcc, which converted AWK to C. Kernighan wrote a program to turn awk into C++; its state is not known.
- BWK awk, also known as nawk, refers to the version by Brian Kernighan. It has been dubbed the "One True AWK" because of the use of the term in association with the book that originally described the language and the fact that Kernighan was one of the original authors of AWK. FreeBSD refers to this version as one-true-awk. This version also has features not in the book, such as tolower and ENVIRON that are explained above; see the FIXES file in the source archive for details. This version is used by e.g. FreeBSD, NetBSD, OpenBSD, macOS and illumos. Brian Kernighan and Arnold Robbins are the main contributors to a source repository for nawk.
- gawk is another free-software implementation and the only implementation that makes serious progress implementing internationalization and localization and TCP/IP networking. It was written before the original implementation became freely available. It includes its own debugger, and its profiler enables the user to make measured performance enhancements to a script. It also enables the user to extend functionality with shared libraries. Some Linux distributions include gawk as their default AWK implementation.
  - gawk-csv. The CSV extension of gawk provides facilities for handling input and output CSV formatted data.
- mawk is a very fast AWK implementation by Mike Brennan based on a bytecode interpreter.
- libmawk is a fork of mawk, allowing applications to embed multiple parallel instances of awk interpreters.
- awka is another translator of AWK scripts into C code. When compiled, statically including the author's libawka.a, the resulting executables are considerably sped up and, according to the author's tests, compare very well with other versions of AWK, Perl, or Tcl. Small scripts will turn into programs of 160–170 kB.
- tawk is an AWK compiler for Solaris, DOS, OS/2, and Windows, previously sold by Thompson Automation Software.
- Jawk is a project to implement AWK in Java, hosted on SourceForge. Extensions to the language are added to provide access to Java features within AWK scripts.
- xgawk is a fork of gawk that extends gawk with dynamically loadable libraries. The XMLgawk extension was integrated into the official GNU Awk release 4.1.0.
- QSEAWK is an embedded AWK interpreter implementation included in the QSE library that provides embedding application programming interface for C and C++.
- libfawk is a very small, function-only, reentrant, embeddable interpreter written in C.
- BusyBox includes an AWK implementation written by Dmitry Zakharov. This is a very small implementation suitable for embedded systems.
Books