LRE Map


The LRE Map is a large, freely accessible database of resources dedicated to natural language processing. Its distinctive feature is that the records are collected during the submission process of several major natural language processing conferences. The records are then cleaned and gathered into a global database called the "LRE Map".
The LRE Map is intended to be an instrument for collecting information about language resources and to become, at the same time, a community for users, a place to share and discover resources, discuss opinions, provide feedback, discover new trends, etc. It is an instrument for discovering, searching and documenting language resources, here intended in a broad sense, as both data and tools.
The large amount of information contained in the Map can be analyzed in many different ways. For instance, the LRE Map can provide information about the most frequent type of resource, the most represented language, the applications for which resources are used or are being developed, the proportion of new resources vs. already existing ones, or the way in which resources are distributed to the community.

Context

Several institutions worldwide maintain catalogues of language resources.
However, it has been estimated that only 10% of existing resources are known, either through distribution catalogues or via direct publicity by providers. The rest remain hidden and surface only briefly, when a resource is presented in a research paper or report at a conference. Even then, a resource may stay in the background simply because the research does not focus on the resource per se.

History

The LRE Map originated under the name "LREC Map" during the preparation of the LREC 2010 conference. More specifically, the idea was discussed within the FLaReNet project, and in collaboration with and the , the Map was put in place at LREC 2010. The LREC organizers asked the authors to provide some basic information about all the resources, either used or created, described in their papers. All these descriptors were then gathered in a global matrix called the LREC Map.
The same methodology and requirements for authors were then applied and extended to other conferences, namely COLING-2010, EMNLP-2010, RANLP-2011, LREC 2012, LREC 2014 and LREC 2016.
After this generalization to other conferences, the LREC Map was renamed the LRE Map.

Size and content

The size of the database increases over time. The data collected amount to 4776 entries.
Each resource is described according to the following attributes:
The LRE Map is an important tool for charting the NLP field. Compared to other studies based on subjective scorings, the LRE Map is built on factual records.
The map has a great potential for many uses, in addition to being an information gathering tool:
The data were then cleaned and sorted by Joseph Mariani and Gil Francopoulo in order to compute the various matrices of the final FLaReNet reports. One of them, the matrix for written data at LREC 2010, is as follows:
Language              Corpus  Lexicon  Ontology  Grammar/Language Model  Terminology
Bulgarian                  7        6         1                       1            1
Czech                     12        7         2                       1            1
Danish                     6        2         0                       2            0
Dutch                     17        8         2                       1            2
English                  206       77        18                      11           10
Estonian                   3        1         0                       0            1
Finnish                    3        2         0                       1            0
French                    44       24         3                       4            5
German                    43       15         4                       2            3
Greek                     10        3         2                       0            0
Hungarian                  8        4         0                       1            1
Irish                      1        0         0                       0            0
Italian                   32       16         4                       2            0
Latvian                    9        0         0                       0            1
Lithuanian                 4        0         2                       0            1
Maltese                    1        0         0                       1            0
Polish                     7        2         1                       2            1
Portuguese                19        6         1                       1            0
Romanian                  12        7         1                       1            0
Slovak                     2        0         0                       1            0
Slovene                    5        1         0                       0            0
Spanish                   29       19         4                       5            2
Swedish                   19        4         0                       1            0
Other Europe              19       11         3                       3            2
Regional Europe           18        8         0                       1            3
Multilingual               5        3         1                       0            1
Language independent       9        3        16                       2            1
Non applicable             2        0         2                       1            0
Total                    552      229        67                      45           36

English is the most studied language, followed by French and German, and then by Italian and Spanish.
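This ranking can be checked directly against the LREC 2010 written-data matrix. The following is a minimal Python sketch, assuming the counts are transcribed by hand from the matrix (only a subset of rows is included). The names `COLUMNS`, `MATRIX` and `rank_languages` are illustrative and are not part of any LRE Map tool.

```python
# Counts transcribed from the LREC 2010 written-data matrix, in column order.
# Only a subset of the rows is included here, for illustration.
COLUMNS = ("Corpus", "Lexicon", "Ontology", "Grammar/Language Model", "Terminology")

MATRIX = {
    "Dutch": (17, 8, 2, 1, 2),
    "English": (206, 77, 18, 11, 10),
    "French": (44, 24, 3, 4, 5),
    "German": (43, 15, 4, 2, 3),
    "Italian": (32, 16, 4, 2, 0),
    "Portuguese": (19, 6, 1, 1, 0),
    "Spanish": (29, 19, 4, 5, 2),
    "Swedish": (19, 4, 0, 1, 0),
}


def rank_languages(matrix):
    """Return (language, total) pairs sorted by total resource count, descending."""
    totals = {language: sum(counts) for language, counts in matrix.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_languages(MATRIX)
for language, total in ranking:
    print(f"{language:<12}{total:>5}")
```

Summed over all five resource types, Spanish (59) comes out slightly ahead of Italian (54), whereas Italian has more corpora (32 against 29), so the relative order of these two languages depends on which measure is used.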

Future

The LRE Map has been extended to the Language Resources and Evaluation Journal and to other conferences.