Text Creation Partnership
The Text Creation Partnership (TCP) is a not-for-profit organization based in the library of the University of Michigan. Its purpose is to produce large-scale full-text electronic resources on behalf of both member institutions and scholarly publishers, under an arrangement calculated to serve the needs of both. In doing so, it aims to demonstrate the value of a business model that treats corporate and non-profit information providers as potentially amicable collaborators rather than as antagonistic vendors and customers.
Projects
TCP has sponsored four text-creation projects to date. The first and largest is "EEBO-TCP", an effort to produce structurally marked-up full-text transcriptions of more than 25,000 of the roughly 125,000 books listed in the Pollard and Redgrave and Wing short-title catalogues of early English printed books, or found among the Thomason Tracts; that is, drawn from nearly all books, pamphlets, and broadsides published in English or in England before 1700. The books were selected and transcribed from the digital scans produced by ProQuest Information and Learning and distributed by that company as a web-based product under the name "Early English Books Online". The scans were themselves made from the microfilm copies produced over the years by ProQuest and its antecedent companies, including the original University Microfilms, Inc. EEBO-TCP Phase I concluded at the end of 2009, having transcribed about 25,300 titles, and immediately moved into EEBO-TCP Phase II, a sequel project dedicated to converting all the remaining unique English-language monographs.
The third TCP project was Evans-TCP, an effort to transcribe 6,000 of the 36,000 pre-1800 titles listed in Charles Evans' American Bibliography and distributed, again as page images scanned from microfilm copies, by Readex, a division of NewsBank, Inc., under the name "Archive of Americana". Evans-TCP has produced e-texts of nearly 5,000 books.
The final TCP project was ECCO-TCP, an effort to transcribe 10,000 eighteenth-century books from among the 136,000 titles available in Thomson-Gale's web-based resource, "Eighteenth-Century Collections Online". ECCO-TCP ran out of funding in 2010 after transcribing about 3,000 titles.
Project commonalities
All four TCP text projects are very similar. In each case:
- The TCP produces text from commercial image files that have in turn been created from microfilm copies of early books.
- The commercial image providers receive what is in effect a full-text index to their image product for much less than it would cost to produce themselves: value added to their product.
- The partner libraries actually own, rather than simply license, the resultant texts, and are free to mount the texts themselves in whatever system they like, or use the texts internally as a tool of scholarship and teaching.
- The texts are created according to library-determined standards, uniform across multiple data-sets and potentially cross-searchable.
- Because they are created collaboratively, the texts are relatively inexpensive and become more so with each library that joins the partnership.
- The texts will eventually be made freely accessible to the public at large.
- The selection of texts to convert, though differing from project to project, in each case follows similar principles: variety, significance, representative quality, avoidance of duplication; specific requests from faculty or scholarly initiatives at member institutions are also generally honored.
- TCP has hitherto been primarily interested in creating texts, not in creating a "product". Though texts from all of the projects are or will be mounted on servers at the University of Michigan library, the Michigan site is not the official TCP site: any partner library with adequate resources and safeguards may do the same. EEBO-TCP texts, for example, are served by Michigan, ProQuest, the Oxford University Digital Library, and the University of Chicago.
Organization
The TCP has informal ties to a number of university-based scholarly text projects, chiefly by helping to provide them with source texts with which to work. Institutions represented include Northwestern University, Oxford University, Washington University, the University of Sydney, the University of Toronto, and the University of Victoria. TCP has also worked with students by sponsoring an Undergraduate Essay Contest every year, convening task forces on the uses of TCP texts in pedagogy, and appealing to scholars and students for ideas on selection and use.
Text production is managed through the University of Michigan's Digital Library Production Service, with its extensive experience in the production of SGML/XML-encoded electronic texts. DLPS is assisted by Oxford University's Bodleian Digital Libraries Systems & Services, including the late Sebastian Rahtz. Small part-time production operations have also been started within two other libraries: the Centre for Reformation and Renaissance Studies in Pratt Library, specializing in Latin books; and the National Library of Wales in Aberystwyth, specializing in Welsh books.
Standards
All four TCP text projects are produced in the same way and to the same standards, which are documented, at least in part, on the TCP web site.
- Accuracy. The TCP strives to produce texts that are as accurately transcribed as possible, with a specified overall accuracy rate of 99.995% or better.
- Keying. Given the nature of the material, the only method found to deliver such accuracy economically has been to have the books keyed by data conversion firms under contract.
- Quality control. Accuracy of transcription and aptness of markup are assessed in all cases by a group of library-based proofers and reviewers managed by the University of Michigan DLPS.
- Encoding. All resultant text files are marked up in valid SGML or XML conforming to a proprietary "Document Type Definition" (DTD) derived from the P3/P4 versions of the Text Encoding Initiative (TEI) standard.
- Purposeful markup. Compared to the full TEI, the TCP DTD is very simple and intended to capture only the features most useful for intelligible display, intelligent navigation, and productive searching. The TCP practice is to capture, so far as feasible, the overall hierarchical structure of each book; the features that tend to mark the beginnings and ends of divisions; the most significant elements of discourse and organization; and only the most essential aspects of physical formatting.
- Fidelity to the original. In each case, the text is intended to represent the book as originally printed, so far as that is possible. Printer's errors are preserved, hand-written changes are ignored, duplicate scans are omitted, out-of-order images are keyed in the intended order, and most of the unusual characters of the original are preserved.
- Ease of reading and searching. At the same time, though the transcriptions are carried out character-by-character, TCP, on the theory that all transcription is a kind of translation from one symbolic system to another, tends to define characters in terms more of their meaning than of their form, and to map eccentric letter-forms to meaningful modern equivalents, generally in keeping with the Unicode definition of "character."
- Languages. Though most of the TCP texts are in English, many are not. Books and divisions of books not in English are tagged with an appropriate language code, but are not otherwise distinguished.
- Omitted material. The TCP produces Latin-alphabet text. Non-textual material such as musical notation, mathematical formulae, and illustrations is omitted, and its location is marked with a special tag. Extended text in non-Latin alphabets is also omitted.
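Two of the standards above lend themselves to a small illustration: the accuracy rate, and the tagging of languages and omitted material. The Python sketch below is illustrative only and is not TCP's own tooling. It assumes that accuracy is measured per character, so that 99.995% allows at most 5 errors per 100,000 characters. The sample fragment uses hypothetical element and attribute names (TEXT, DIV1, GAP, LANG, DESC) chosen to resemble a simple TEI-derived markup; they are not quoted from the TCP DTD.

```python
import xml.etree.ElementTree as ET

# Assumption: accuracy is counted per character, so 99.995% accuracy
# means at most 5 wrong characters in every 100,000 transcribed.
ERRORS_PER_100K = 5


def allowed_errors(char_count: int) -> int:
    """Largest number of character errors that still meets 99.995% accuracy."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return char_count * ERRORS_PER_100K // 100_000


# Hypothetical fragment: element and attribute names are illustrative only.
SAMPLE = """\
<TEXT LANG="eng">
  <DIV1 TYPE="poem" LANG="lat">
    <HEAD>Carmen</HEAD>
    <L>Arma virumque cano</L>
    <GAP DESC="music"/>
  </DIV1>
  <DIV1 TYPE="chapter">
    <P>Some text <GAP DESC="foreign"/> more text.</P>
  </DIV1>
</TEXT>
"""


def summarize(xml_text: str):
    """Return (language-tagged elements, descriptions of omitted material)."""
    root = ET.fromstring(xml_text)
    # Every element carrying a language code, in document order.
    langs = [(el.tag, el.get("LANG")) for el in root.iter() if el.get("LANG")]
    # Every marker that records where material was left out.
    gaps = [el.get("DESC") for el in root.iter("GAP")]
    return langs, gaps


if __name__ == "__main__":
    print(allowed_errors(20_000))  # one error allowed per 20,000 characters
    print(summarize(SAMPLE))
```

Under the per-character assumption, the accuracy target works out to one permitted error per 20,000 characters, which is why keyed texts need systematic proofing rather than spot checks alone. The second function shows how simple, consistent markup makes it easy to locate non-English divisions and omitted material mechanically.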
Accomplishments and prospects
As of January 1, 2015, the full text of EEBO-TCP Phase I has been released under a Creative Commons license and can be freely downloaded and distributed.
In 2014 there were 28,466 titles available via Phase II. As of July 2015, ProQuest had the exclusive right for five years to distribute the EEBO-TCP Phase II collection. After those five years the texts will be made freely available to the public.