A statistical database is a database used for statistical analysis purposes. It is an OLAP system rather than an OLTP system. Modern decision-support databases and classical statistical databases are often closer to the relational model than to the multidimensional model commonly used in OLAP systems today.
Statistical databases typically contain parameter data and the measured data for these parameters. For example, the parameter data might consist of the different values of the varying conditions in an experiment, while the measured data are the measurements taken in the experiment under those conditions.
Many statistical databases are sparse, with many null or zero values. It is not uncommon for a statistical database to be 40% to 50% sparse. There are two options for dealing with the sparseness: leave the null values in place and use compression techniques to squeeze them out, or remove the entries that contain only null values.
Statistical databases often incorporate support for advanced statistical analysis techniques, such as correlations, which go beyond SQL. They also pose unique security concerns, which were the focus of much research, particularly in the late 1970s and early to mid-1980s.
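The sparseness options and the correlation support mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the experiment table, its column names (`temperature`, `pressure`, `yield_pct`, `purity_pct`) and the helper functions are hypothetical, and a real system would more likely rely on its storage engine or a statistics package.

```python
import math

# Hypothetical experiment table: two parameter columns (temperature, pressure)
# and two measured columns (yield_pct, purity_pct). None marks a null value.
rows = [
    {"temperature": 20, "pressure": 1.0, "yield_pct": 51.0, "purity_pct": None},
    {"temperature": 20, "pressure": 2.0, "yield_pct": None, "purity_pct": None},
    {"temperature": 40, "pressure": 1.0, "yield_pct": 63.0, "purity_pct": 97.5},
    {"temperature": 40, "pressure": 2.0, "yield_pct": None, "purity_pct": None},
    {"temperature": 60, "pressure": 1.0, "yield_pct": 74.0, "purity_pct": None},
    {"temperature": 60, "pressure": 2.0, "yield_pct": 80.0, "purity_pct": 98.0},
]
MEASURED = ("yield_pct", "purity_pct")


def sparsity(table, columns):
    """Fraction of measured cells that are null."""
    cells = [row[c] for row in table for c in columns]
    return sum(v is None for v in cells) / len(cells)


def drop_null_only(table, columns):
    """Remove entries whose measured columns are all null."""
    return [row for row in table if any(row[c] is not None for c in columns)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


before = sparsity(rows, MEASURED)      # 6 of 12 measured cells are null
kept = drop_null_only(rows, MEASURED)  # the two all-null entries are removed
after = sparsity(kept, MEASURED)       # 2 of 8 measured cells are null

# Correlate one parameter with one measurement, skipping remaining nulls.
pairs = [(r["temperature"], r["yield_pct"]) for r in kept if r["yield_pct"] is not None]
r_temp_yield = pearson([p[0] for p in pairs], [p[1] for p in pairs])
```

In this toy table, removing the null-only entries halves the sparsity but does not eliminate it, which is why the two options are often combined in practice.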
Security in statistical databases
In a statistical database, it is often desirable to allow query access only to aggregate data, not to individual records. Securing such a database is a difficult problem, since a determined user can combine several aggregate queries to derive information about a single individual. Some common approaches are:
- allowing only aggregate queries
- returning only the partition (for example, an income bracket) that a sensitive value such as income belongs to, rather than the exact value
- returning imprecise counts
- disallowing overly selective WHERE clauses
- auditing all users' queries, so that users who misuse the system can be investigated
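The inference problem and the approaches listed above can be sketched in a few lines. This is an illustrative toy, not a vetted design: the records, the `GuardedTable` class, the `QueryRefused` exception, and the thresholds (minimum query-set size 3, counts rounded to a multiple of 5, income brackets 20000 wide) are all hypothetical choices made for this example.

```python
# Hypothetical records; all names and values are made up for illustration.
RECORDS = [
    {"name": "Alice", "dept": "R&D", "age": 34, "income": 82000},
    {"name": "Bob", "dept": "R&D", "age": 41, "income": 67000},
    {"name": "Carol", "dept": "R&D", "age": 29, "income": 59000},
    {"name": "Dan", "dept": "R&D", "age": 47, "income": 71000},
    {"name": "Erin", "dept": "Sales", "age": 52, "income": 64000},
    {"name": "Frank", "dept": "Sales", "age": 38, "income": 58000},
]


def naive_sum(predicate):
    """Aggregate-only interface with no further safeguards."""
    return sum(r["income"] for r in RECORDS if predicate(r))


# Differencing: two aggregates whose query sets differ by exactly one person.
# Assumes the attacker knows Alice is the only 34-year-old in R&D.
rd_total = naive_sum(lambda r: r["dept"] == "R&D")
rd_without_alice = naive_sum(lambda r: r["dept"] == "R&D" and r["age"] != 34)
inferred_income = rd_total - rd_without_alice  # Alice's exact income


class QueryRefused(Exception):
    """Raised when a query is too selective to answer."""


class GuardedTable:
    def __init__(self, records, min_set_size=3, count_base=5, bracket_width=20000):
        self.records = records
        self.min_set_size = min_set_size
        self.count_base = count_base
        self.bracket_width = bracket_width
        self.audit_log = []  # (user, query description) for later review

    def _query_set(self, user, description, predicate):
        self.audit_log.append((user, description))  # audit every query
        matched = [r for r in self.records if predicate(r)]
        if len(matched) < self.min_set_size:  # reject overly selective queries
            raise QueryRefused(f"query set smaller than {self.min_set_size}")
        return matched

    def count(self, user, description, predicate):
        """Imprecise count: rounded to the nearest multiple of count_base."""
        n = len(self._query_set(user, description, predicate))
        base = self.count_base
        return (n + base // 2) // base * base

    def mean_income_bracket(self, user, description, predicate):
        """Return only the bracket that the mean income falls into."""
        matched = self._query_set(user, description, predicate)
        mean = sum(r["income"] for r in matched) / len(matched)
        low = int(mean // self.bracket_width) * self.bracket_width
        return (low, low + self.bracket_width)


table = GuardedTable(RECORDS)
is_rd = lambda r: r["dept"] == "R&D"
rd_count = table.count("analyst1", "dept = R&D", is_rd)
rd_bracket = table.mean_income_bracket("analyst1", "dept = R&D", is_rd)
try:
    table.count("analyst1", "name = Alice", lambda r: r["name"] == "Alice")
    refused = False
except QueryRefused:
    refused = True
```

Note that the two differencing queries in the first half each match at least three records, so a size check on its own would not have refused them; in this sketch it is the coarse brackets and rounded counts that blunt the subtraction, and the audit log only helps after the fact.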
For many years, research in this area was stalled, and in 1980 the prospects for securing such databases were thought to be poor. In 2006, however, Cynthia Dwork defined the field of differential privacy, building on work that started appearing in 2003. While showing that some semantic security goals, related to the work of Tore Dalenius, were impossible, this work identified new techniques for limiting the increased privacy risk that results from including private data in a statistical database. This makes it possible in many cases to provide very accurate statistics from the database while still ensuring high levels of privacy.
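As a rough illustration of how accurate statistics and privacy can coexist, the sketch below uses the Laplace mechanism, one widely used differential privacy technique that the text above does not name specifically. The dataset, the function names, and the choice of epsilon = 0.5 are assumptions made for this example only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(records, predicate, epsilon, rng):
    """Count matching records, then add Laplace noise of scale 1/epsilon.

    Adding or removing one record changes a count by at most 1, so a noise
    scale of 1/epsilon is the usual calibration for a counting query.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def over_40(record):
    return record["age"] > 40


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
# Hypothetical data: five of these ten people are over 40.
people = [{"age": a} for a in (23, 35, 41, 29, 52, 47, 38, 61, 33, 45)]

# Each individual answer is perturbed, but the noise has mean zero.
answers = [noisy_count(people, over_40, epsilon=0.5, rng=rng) for _ in range(20000)]
mean_answer = sum(answers) / len(answers)
```

A smaller epsilon means more noise and stronger privacy. The repeated querying here is only to show that the noise is centered on the true count; in a real deployment each answer consumes part of a privacy budget, so unlimited repeats would not be allowed.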
Some further reading
Some key papers in this field:
- Dorothy E. Denning, Jan Schlörer, "A fast procedure for finding a tracker in a statistical database", ACM Transactions on Database Systems, Volume 5, Issue 1, pages 88-102.