Data extraction


Data extraction is the act or process of retrieving data from data sources for further data processing or data storage. The import into the intermediate extracting system is usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.
Usually, the term data extraction is applied when data is first imported into a computer from primary sources, such as measuring or recording devices. Today's electronic devices usually provide an electrical connector through which 'raw data' can be streamed into a personal computer.
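
The sequence described above (import, transformation, addition of metadata, export) can be illustrated with a minimal Python sketch. It is not a reference implementation: the raw readings, the field names (`time`, `celsius`), the source label `sensor-1`, and all function names are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical raw data, as a recording device might stream it:
# one "timestamp,value" pair per line, with one malformed line.
RAW = "2024-01-01T00:00:00,21.5\n2024-01-01T00:01:00,21.7\nbad line\n"


def extract(raw_text):
    """Import rows from the raw source, skipping lines without two fields."""
    return [row for row in csv.reader(io.StringIO(raw_text)) if len(row) == 2]


def transform(rows):
    """Convert each row into a typed record; drop rows whose value is not numeric."""
    records = []
    for timestamp, value in rows:
        try:
            records.append({"time": timestamp, "celsius": float(value)})
        except ValueError:
            continue
    return records


def add_metadata(records, source):
    """Wrap the records with descriptive metadata about the extraction."""
    return {"source": source, "record_count": len(records), "records": records}


def export(package):
    """Serialize the package for the next stage of the data workflow."""
    return json.dumps(package, sort_keys=True)


package = add_metadata(transform(extract(RAW)), source="sensor-1")
exported = export(package)
```

Keeping each stage as a separate function mirrors the workflow in the text: any stage can be replaced (for example, a different export format) without touching the others.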

Data sources

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, and classifieds. Data extracted from them is often used further, for example to generate sales or marketing leads. Extracting data from these unstructured sources has grown into a considerable technical challenge. Whereas data extraction historically had to deal with changes in physical hardware formats, the majority of current data extraction deals with unstructured data sources and with differing software formats. Data extraction from the web in particular is referred to as "Web data extraction" or "Web scraping".
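
As a small illustration of web data extraction, the following Python sketch pulls link targets and link text out of an HTML fragment using the standard library's `html.parser` module. The page content, the class name `LinkExtractor`, and the function `extract_links` are assumptions made for this example; real web pages are usually far less regular and often call for more robust tooling.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"href": self._href, "text": text})
            self._href = None


# Hypothetical page fragment used only for this example.
PAGE = """
<html><body>
<h1>Listings</h1>
<p>See <a href="/items/1">First item</a> and
<a href="/items/2">Second <b>item</b></a>.</p>
</body></html>
"""


def extract_links(html_text):
    """Return a list of {"href", "text"} records found in the HTML text."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


links = extract_links(PAGE)
```

The result is a list of uniform records, which is the sense in which extraction imposes structure on an otherwise unstructured source.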

Imposing structure

The act of adding structure to unstructured data takes a number of forms