Streaming algorithm
In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes. In most models, these algorithms have access to limited memory. They may also have limited processing time per item.
These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream.
History
Though streaming algorithms had already been studied by Munro and Paterson as early as 1978, as well as by Philippe Flajolet and G. Nigel Martin in 1982/83, the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy. For this paper, the authors later won the 2005 Gödel Prize "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs, in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems that are insoluble in o(n) space.
Models
Data stream model
In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream". If the stream has length n and the domain has size m, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one.

Turnstile and cash register models
Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector a = (a_1, …, a_n) (initialized to the zero vector) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of a using considerably less space than it would take to represent a precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.

In the cash register model each update is of the form ⟨i, c⟩, so that a_i is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted).

In the turnstile model each update is of the form ⟨i, c⟩, so that a_i is incremented by some (possibly negative) integer c. In the "strict turnstile" model, no a_i at any time may be less than zero.
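The difference between the two models can be illustrated with a small sketch (the function name and structure are illustrative, not from the literature; only the constraint on the increments c distinguishes the models):

```python
from collections import defaultdict

def apply_update(a, i, c, model="turnstile"):
    """Apply one stream update <i, c> to the frequency vector a."""
    if model == "cash register" and c <= 0:
        raise ValueError("cash register updates must be positive increments")
    a[i] += c
    if model == "strict turnstile" and a[i] < 0:
        raise ValueError("strict turnstile forbids negative entries")

a = defaultdict(int)                    # the vector a, initially all zeros
for i, c in [(3, 1), (7, 2), (3, -1)]:  # a stream of <index, increment> updates
    apply_update(a, i, c)               # turnstile: negative increments allowed
print(dict(a))                          # {3: 0, 7: 2}
```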
Sliding window model
Several papers also consider the "sliding window" model. In this model,the function of interest is computing over a fixed-size window in the
stream. As the stream progresses, items from the end of the window are
removed from consideration while new items from the stream take their
place.
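As a minimal baseline, a windowed sum can be maintained exactly in O(w) space; the interesting results in this model approximate such quantities in far less space (e.g., with exponential histograms):

```python
from collections import deque

def sliding_window_sums(stream, w):
    """Exact sum over the last w items, emitted after each arrival."""
    window, total, out = deque(), 0, []
    for x in stream:
        window.append(x)
        total += x
        if len(window) > w:
            total -= window.popleft()   # oldest item leaves the window
        out.append(total)
    return out

print(sliding_window_sums([5, 1, 2, 4, 3], w=3))  # [5, 6, 8, 7, 9]
```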
Besides the above frequency-based problems, some other types of problems
have also been studied. Many graph problems are solved in the setting
where the adjacency matrix or the adjacency list of the graph is streamed in
some unknown order. There are also some problems that are very dependent
on the order of the stream, such as counting
the number of inversions in a stream and finding the longest increasing
subsequence.
Evaluation
The performance of an algorithm that operates on data streams is measured by three basic factors:
- The number of passes the algorithm must make over the stream.
- The available memory.
- The running time of the algorithm.
If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an (ε, δ)-approximation, meaning that the algorithm achieves an error of less than ε with probability at least 1 - δ.
Applications
Streaming algorithms have several applications in networking, such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on. They also have applications in databases, such as estimating the size of a join.
Some streaming problems
Frequency moments
The kth frequency moment of a set of frequencies a is defined as F_k(a) = Σ_{i=1}^{n} (a_i)^k.

The first moment F_1 is simply the sum of the frequencies (i.e., the total count). The second moment F_2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation. F_∞ is defined as the frequency of the most frequent item.
The seminal paper of Alon, Matias, and Szegedy dealt with the
problem of estimating the frequency moments.
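For reference, the frequency moments are easy to compute exactly when memory is unconstrained; the streaming algorithms below approximate what this brute-force computation returns:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum over i of (a_i)^k."""
    freq = Counter(stream)
    if k == float("inf"):               # F_inf: frequency of the most frequent item
        return max(freq.values())
    return sum(c ** k for c in freq.values())

s = ["a", "b", "a", "c", "a", "b"]
print(frequency_moment(s, 1))             # 6  (= length of the stream)
print(frequency_moment(s, 2))             # 14 (= 3^2 + 2^2 + 1^2)
print(frequency_moment(s, float("inf")))  # 3
```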
Calculating Frequency Moments
A direct approach to finding the frequency moments requires maintaining a register m_i for every distinct element a_i, which requires memory of order at least Ω(N). But we have space limitations and require an algorithm that computes F_k in much lower memory. This can be achieved by using approximations instead of exact values: an algorithm computes an (ε, δ)-approximation F'_k of F_k, satisfying Pr[|F'_k - F_k| ≤ ε·F_k] ≥ 1 - δ, where ε is the approximation parameter and δ is the confidence parameter.
Calculating F_0 (Distinct Elements in a Data Stream)
FM-Sketch Algorithm
Flajolet and Martin introduced a probabilistic method of counting which was inspired by a paper by Robert Morris. Morris showed that if the requirement of accuracy is dropped, a counter n can be replaced by a counter of log n, which can be stored in O(log log n) bits. Flajolet and Martin improved this method by using a hash function h which is assumed to distribute the elements uniformly in the hash space (a binary string of length L).

Let bit(y, k) represent the kth bit in the binary representation of y, so that y = Σ_{k≥0} bit(y, k)·2^k.

Let ρ(y) represent the position of the least significant 1-bit in the binary representation of y, with a suitable convention for ρ(0) (here, ρ(0) = L).

Let A be the sequence of the data stream, of length M, whose cardinality needs to be determined. Let BITMAP[0…L-1] be the hash space where the values ρ(hash(x)) are recorded. The algorithm below then determines the approximate cardinality of A.
Procedure FM-Sketch:

    for i in 0 to L - 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := position of the leftmost 0 bit of BITMAP[]
    return 2^B

If there are N distinct elements in the data stream, then:
- for i >> log(N), BITMAP[i] is certainly 0;
- for i << log(N), BITMAP[i] is certainly 1;
- for i ≈ log(N), BITMAP[i] is a fringe of 0's and 1's.
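A runnable sketch of the procedure above. The SHA-1 based hash stands in for the idealized uniform hash function, and the raw 2^B estimate is returned as in the pseudocode (in practice it is biased and is divided by a correction factor φ ≈ 0.77351):

```python
import hashlib

L = 32  # number of bits in the hash space

def h(x):
    """Hash an item to an L-bit integer (stands in for a uniform hash)."""
    return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) % (1 << L)

def rho(y):
    """Position of the least significant 1-bit of y, with rho(0) := L."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1

def fm_sketch(stream):
    bitmap = [0] * L
    for x in stream:
        bitmap[rho(h(x))] = 1          # record rho of each hashed value
    b = bitmap.index(0)                # leftmost 0 bit of BITMAP
    return 2 ** b                      # raw power-of-two estimate of cardinality

stream = [i % 1000 for i in range(5000)]   # 1000 distinct elements, repeated
print(fm_sketch(stream))                   # a rough power-of-two estimate
```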
K-Minimum Value Algorithm
Bar-Yossef et al. introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They used a similar hash function h, which can be normalized to [0, 1] as h : [m] → [0, 1]. But they fixed a limit t on the number of values kept from the hash space, where t is assumed to be of the order O(1/ε²). The KMV algorithm keeps only the t smallest hash values seen in the stream. After all m values of the stream have arrived, υ = Max(KMV) is used to calculate F'_0 = t/υ. That is, in a close-to-uniform hash space, they expect at least t elements to have hash values less than O(t/F_0).
Procedure 2 K-Minimum Value

    Initialize KMV with the first t hash values h(a_1), …, h(a_t)
    for a in a_{t+1} to a_m do
        if h(a) < Max(KMV) then
            Remove Max(KMV) from the KMV set
            Insert h(a) into KMV
        end if
    end for
    return t / Max(KMV)
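A compact sketch of the procedure, with SHA-1 again standing in for the idealized uniform hash and a plain set as a simple, unoptimized stand-in for the KMV structure:

```python
import hashlib

def h01(x):
    """Hash x to a value in [0, 1) (stands in for a uniform hash)."""
    v = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
    return (v % (1 << 32)) / (1 << 32)

def kmv_estimate(stream, t=64):
    """Keep the t smallest distinct hash values; estimate F_0 = t / Max(KMV)."""
    kmv = set()
    for a in stream:
        kmv.add(h01(a))                # duplicates hash identically: no effect
        if len(kmv) > t:
            kmv.discard(max(kmv))      # keep only the t smallest values
    if len(kmv) < t:
        return len(kmv)                # fewer than t distinct items: exact count
    return t / max(kmv)

stream = [i % 500 for i in range(10000)]   # 500 distinct elements
print(round(kmv_estimate(stream)))         # roughly 500 (error ~ 1/sqrt(t))
```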
Complexity analysis of KMV
The KMV algorithm can be implemented in O((1/ε²)·log(m)) memory bits: each hash value requires space of order O(log(m)) memory bits, and there are O(1/ε²) hash values. The access time can be reduced if we store the t hash values in a binary tree, so the time complexity will be reduced to O(log(1/ε)·log(m)).

Calculating F_k
Alon et al. estimate F_k by defining random variables that can be computed within given space and time; the expected value of the random variable gives the approximate value of F_k. Assume the length of the sequence m is known in advance.

Construct a random variable X as follows:
- Select a_p to be a random member of the sequence, with the index p chosen uniformly at random from 1 ≤ p ≤ m.
- Let r = |{q : q ≥ p, a_q = a_p}| be the number of occurrences of the value l = a_p among the members of the sequence from position p onward.
- Define the random variable X = m·(r^k - (r - 1)^k).

Assume S_1 is of the order O(n^{1-1/k}/λ²) and S_2 is of the order O(log(1/ε)). The algorithm computes S_2 random variables Y_1, Y_2, …, Y_{S_2} and outputs their median Y, where Y_i is the average of X_{ij} for 1 ≤ j ≤ S_1.

Now calculate the expectation of the random variable X. Writing m_i for the number of occurrences of value i in the stream, the sum telescopes:

    E(X) = Σ_{i=1}^{n} Σ_{r=1}^{m_i} (r^k - (r - 1)^k)
         = Σ_{i=1}^{n} [1^k + (2^k - 1^k) + … + (m_i^k - (m_i - 1)^k)]
         = Σ_{i=1}^{n} m_i^k
         = F_k.
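The construction of X and the average-then-median scheme can be sketched as follows (a didactic implementation; the parameters s1 and s2 are small fixed values here rather than the asymptotic bounds above):

```python
import random
from collections import Counter

def ams_variable(stream, k):
    """One sample of the AMS random variable X with E[X] = F_k."""
    m = len(stream)
    p = random.randrange(m)            # uniform random index p
    l = stream[p]
    r = sum(1 for q in range(p, m) if stream[q] == l)  # occurrences of l from p on
    return m * (r ** k - (r - 1) ** k)

def ams_estimate(stream, k, s1=64, s2=9):
    """Median of s2 averages of s1 samples: averaging boosts accuracy,
    taking the median boosts confidence."""
    means = [sum(ams_variable(stream, k) for _ in range(s1)) / s1
             for _ in range(s2)]
    return sorted(means)[s2 // 2]

random.seed(1)
stream = ["a"] * 30 + ["b"] * 20 + ["c"] * 10
print(sum(v ** 2 for v in Counter(stream).values()))  # exact F_2 = 1400
print(round(ams_estimate(stream, k=2)))               # an estimate near 1400
```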
Complexity of F_k

From the algorithm to calculate F_k discussed above, we can see that each random variable X stores the value of a_p and r. So, to compute X we need to maintain only log(n) bits for storing a_p and log(m) bits for storing r. The total number of random variables X will be S_1·S_2.

Hence the total space complexity the algorithm takes is of the order of O((k·log(1/ε)/λ²)·n^{1-1/k}·(log(n) + log(m))).
Simpler approach to calculate F_2

The previous algorithm calculates F_2 in order of O(√n·(log(m) + log(n))) memory bits. Alon et al. simplified this algorithm using a four-wise independent random variable with values mapped to {-1, 1}. This further reduces the complexity of calculating F_2 to O(log(1/ε)·(log(n) + log(m))).
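This simplification is often called the "tug-of-war" sketch: each copy keeps a single counter Z = Σ_i ε(i)·a_i with random signs ε(i) ∈ {-1, +1}, and E[Z²] = F_2. In the sketch below, a seeded pseudorandom sign is a stand-in for the four-wise independent family of Alon et al.:

```python
import random
from collections import Counter

def sign(seed, j, x):
    """Pseudorandom sign in {-1, +1} for item x in sketch copy j
    (a stand-in for a four-wise independent hash)."""
    return 1 if random.Random(f"{seed}/{j}/{x}").random() < 0.5 else -1

def tug_of_war_f2(stream, s1=400, seed=0):
    z = [0] * s1                       # one counter Z per sketch copy
    for x in stream:
        for j in range(s1):
            z[j] += sign(seed, j, x)   # each item pulls its counter +1 or -1
    return sum(v * v for v in z) / s1  # mean of Z^2 estimates F_2

stream = ["a"] * 30 + ["b"] * 20 + ["c"] * 10
print(sum(v ** 2 for v in Counter(stream).values()))  # exact F_2 = 1400
print(round(tug_of_war_f2(stream)))                   # an estimate near 1400
```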
Frequent elements
In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.

More formally, fix some positive constant c > 1, let the length of the stream be m, and let f_i denote the frequency of value i in the stream. The frequent elements problem is to output the set { i : f_i > m/c }.
Some notable algorithms are:
- Boyer–Moore majority vote algorithm
- Karp–Papadimitriou–Shenker algorithm
- Count-Min sketch
- Sticky sampling
- Lossy counting
- Sample and Hold
- Multi-stage Bloom filters
- Count-sketch
- Sketch-guided sampling
- Misra–Gries summary
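As an illustration, the Misra–Gries summary from the list above can be sketched in a few lines. With k - 1 counters it guarantees that every value occurring more than m/k times survives as a candidate; counters may undercount, so a second pass is needed to verify exact frequencies:

```python
def misra_gries(stream, k):
    """Misra-Gries summary with at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:                          # decrement all counters; drop zeros
            counters = {y: c - 1 for y, c in counters.items() if c > 1}
    return counters

stream = ["a"] * 5 + ["b"] * 4 + ["c", "d", "e"]   # m = 12
print(misra_gries(stream, k=3))   # candidates for frequency > 12/3 = 4
```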
Event detection
Counting distinct elements
Counting the number of distinct elements in a stream (the F_0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem. It uses O(ε^{-2} + log(n)) space, with O(1) worst-case update and reporting times, as well as universal hash functions and an r-wise independent hash family where r = Ω(log(1/ε) / log log(1/ε)).
Entropy
The (empirical) entropy of a set of frequencies a is defined as H(a) = -Σ_{i=1}^{n} (a_i/m)·log(a_i/m), where m = Σ_{i=1}^{n} a_i.
Estimation of this quantity in a stream has been done by:
- McGregor et al.
- Do Ba et al.
- Lall et al.
- Chakrabarti et al.
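The quantity being estimated is straightforward to compute exactly with unbounded memory; the works cited above approximate it in sublinear space:

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Exact empirical entropy H = sum of (a_i/m) * log2(m/a_i), in bits."""
    freq = Counter(stream)
    m = len(stream)
    return sum((c / m) * math.log2(m / c) for c in freq.values())

print(empirical_entropy(["a", "b", "a", "b"]))  # 1.0 (two equally likely values)
print(empirical_entropy(["a"] * 4))             # 0.0 (no uncertainty)
```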
Online learning
- Feature hashing
- Stochastic gradient descent
Lower bounds
Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.