Digital Library of the Caribbean
Country/Place of Publication | Title | Born Digital | Years
Bahamas | The Nassau Tribune | n |
Bahamas | The Spectrum (The College of The Bahamas) | y |
Barbados | Barbados Advocate (1895-1982 and 1983-2001) | n | - 2001
Belize | Belize Ag. Report | y |
Belize | University of Belize Bulletin | y |
British Virgin Islands | The BVI Beacon | |
Costa Rica | Lankesteriana - International Journal of Orchidology | y |
Cuba | El Diario de la Marina | n/a | historic
Haiti | Haiti en Marche | y (current) |
Honduras | Honduras This Week | y |
Jamaica | Jamaica Tourist Magazine | y |
Martinique | Justice (Justice Website) | n |
Mexico | Ultimas Noticias de Quintana Roo | y |
Puerto Rico | Sargasso | n | all - 2 years
St. Barthélemy | Journal de Saint-Barth | y |
St. Barthélemy | St. Barth Weekly | y |
St. Eustatius | The Informer | y |
St. Kitts | Labour Spokesman | y |
St. Lucia | National Review | y |
Trinidad | UWI Today (University of the West Indies) | y |
USVI | Dateline UVI (University of the Virgin Islands) | y |
USVI | UVI Voice (University of the Virgin Islands) | y |
USVI | St. Johns Tradewinds | y |
Caribbean | All at Sea Magazine | y |
Caribbean | Caribbean Council Newsletter | y |
Caribbean | Institute of Nautical Archeology Quarterly | y |
Caribbean | Caribbean Compass Magazine | y |
Caribbean | National Gazettes | n | historic - current
Caribbean | Other historic Caribbean newspapers digitized for preservation (based on available funding, from paper and microfilm holdings, as permissions are granted for current and prior issues) | n |
The Caribbean Newspaper Digital Library builds on past microfilming efforts and the CNIP project (and its legacy database) to continue to serve preservation and access needs. Information on past efforts is below. This information is not up to date and is maintained here only for historical purposes.
Title & Place of Publication (with frequency where noted)
AMIGOE; Curacao, Netherlands Antilles
BARBADOS ADVOCATE; Bridgetown, Barbados
CARIBE; Santo Domingo, Dominican Republic (2 X per month)
ESPECTADOR; Bogota, Colombia
FOLHA DE S. PAULO; Sao Paulo, Brazil
JUSTICE; Fort-de-France, Martinique
KESHER; Mexico City, Mexico (every 2 years)
MAGA; Buenos Aires, Argentina (2 X per year)
NASSAU GUARDIAN; Nassau, Bahamas
PAIS, EL; Cali, Colombia
SUNDAY CHRONICLE; Georgetown, Guyana (2 X per year)
TRIBUNE; Nassau, Bahamas (2 X per month)
TRINIDAD GUARDIAN; Port of Spain, Trinidad
WEEKLY GLEANER; Kingston, Jamaica (2 X per year)
Below is a partial list of microfilm held in the Latin American Collection at the University of Florida. The collection consists of over 50,000 assorted microforms. The list is alphabetized by country. Any of these titles that have been digitized are available within the Caribbean Newspaper Digital Library.
Antigua
The Leeward Islands Gazette, 1890-1939
The Worker's Voice, 1965-1971

Bahamas
Bahama Argus, Jul 1831-Dec 1835
Bahama Gazette, 1784-1819
Bahama Herald, 1849-1863
Bahama News, 1899
Nassau Daily Tribune, 1911-1996 (May), 1997 (June)-Present*
Nassau Guardian, 1849-1997
Nassau Times, 1874-1891
Royal Gazette, 1807-1837

Barbados
Barbados Advocate, 1950-Present*
The Beacon (Barbados Labor Party), 1965-1971
The Black Star, 1967-1969
Caribbean Contact, 1973-1994
The Times, 1863-1895

Cuba
Combate Internacional, 1961
Contra Punto, 1965
Cooperador Cubano, 1964
Diario de la Marina, 1844-1882; 1899-1961
La Discusion, 1924-1925 (June)
Heraldo de Cuba, Mar-Dec 1932
Liberacion, May 1960-Jan 1962

Dominican Republic
El Caribe, 1948-1998*
El Eco de Opinion, Mar 1879-Dec 1897
La Informacion, 1946-1950
El Listin Diario, 1909-1988*
La Nacion, 1944-1963

French Guiana
Le Combat, 1897-1899
Conscience Guyanais, 1959-1962
Debout Guyane, 1957-1962
La Semaine en Guyane et Dans le Monde, 1944-45, 1947-48

Guadeloupe
Journal Commercial de la Pointe a Pitre, 1841-1864
Courrier de la Guadeloupe, 1881-1885; 1887-1908
La Democratie, 1900-1906
Echo de la Guadeloupe, Journal des Interets Coloniaux, 1872-1880

Haiti
Haiti Herald, 1956-1963
Le Manifeste, Apr 1841-May 1844
Le Matin, 1898-1899; 1907-1981
Moniteur Haitien, 1865-1947
Le Nouvelliste, Aug 1899-1914; 1924-Dec 1985
Le Nouveau Monde, Jan 1962-Jun 1972
L'Union, Jun 1837-Sept 1839

Martinique
Antilles Presse, 1965-1966
Les Colonies, 1881-1902
L'Information (Fort-de-France), 1941-1962
Justice (Fort-de-France), 1965-Present*
La Paix (La Paix Sociale), 1916-1961; 1962-1964
Le Propagateur, 1855-1893

Trinidad
The Nation, 1962-1970
Port of Spain Gazette, 1825-1956
San Fernando Gazette, 1850-1891
The Statesman, 1962-1964
Tobago Gazette, 1807-1808
Trinidad Chronicle, July 1956-1958
Trinidad Gazette, 1820-1822
Trinidad Guardian, 1917-Present*

U.S. Virgin Islands
The Virgin Islands Daily News, 1950-1961; 1970-1992

Venezuela
Nacional, 1985-Sep 1998*
The Caribbean Newspaper Imaging Project was a series of demonstration projects funded by the Andrew W. Mellon Foundation and the University of Florida Libraries. These projects occurred in two distinct phases:
Imaging and Indexing Model.
Feasibility studies for imaging and indexing. The imaging study examined the efficacy of digitizing microfilm produced in advance of current preservation microfilming standards. It also examined the use of off-the-shelf microfilm-projection scanning, together with the associated costs, benefits and drawbacks. The indexing study examined indexing procedures, the application of controlled terminology, and the costs associated with multi-lingual term assignment by human readers.
OCR Gateway to Indexing.
A feasibility study on the application of Optical Character Recognition (OCR). The Project is currently undergoing technological renovation: migration from CD-ROM to Internet delivery. At the same time, the Project is developing plans for additional content.
The evolution of microform technology provided an impetus for a substantial collection development effort in the 1950s and 1960s. The Libraries went to the Caribbean and other Latin American countries and microfilmed materials, many of which have now disappeared in their original printed formats. The Libraries also established an in-house microfilming program, which systematically converted Latin American newspapers received through subscription. Microfilm technology was a great advance, especially for long-term preservation, and will remain an appropriate technology in most cases. However, it suffers several inherent limitations. It must be used in situ or retrieved, copied, and moved to another location, and finding aids, indexes, and abstracts are usually published separately. Caribbean and other Latin American scholars come to the Collection and wind their way through reel after reel, since few indices exist, or must find funds to obtain copies for themselves or their libraries. As a result, many Caribbean and Latin American social scientists and humanists lack access to their national and regional newspapers, in many cases their most important primary research resources. Many North American Latin Americanist scholars face this problem as well.
Through long standing agreements between the University of Florida Libraries and University Microfilms International, copies of most of the collection's microfilmed newspapers may be purchased. For instance, for approximately $20,000 an institution may purchase the 600 35mm microform reels of the Cuban newspaper Diario de La Marina from 1899 through 1961.
Microfilm is difficult to maintain, especially in uncontrolled environments or with inadequate equipment. Experience demonstrates that microfilm deterioration begins whenever optimal environmental conditions are not maintained. Interruptions in the air conditioning and humidity control systems initiate a slow deterioration process which cannot be stopped. In addition to maintaining environmental control, which is quite often beyond a library's means, substantial investments must be made in training staff to detect deterioration so that suspect film may be replaced. Though most research libraries in the United States which have purchased copies of the microfilm maintain it under reasonable climate controls, many of the Caribbean libraries that have purchased copies do not have adequate climate controls. These libraries have had to replace copies periodically and at great expense.
Preservation of digital collections also requires substantial effort and equipment, and the University of Florida Libraries is committed to the preservation of its digital collections, including Caribbean Newspaper Imaging Project resources. A separately housed tape collection, with specially designed environmental controls meeting current electronic media storage standards, was established over twenty years ago at the Smathers Libraries and now contains more than 20,000 archived computer tape files. Electronic media deteriorate more quickly than microfilm, but they can be "replaced" more readily, with data refreshed or restored. Detection, refreshment, and duplication can be scheduled and automated, substantially reducing maintenance expense. Digitized images can be copied exactly, with no loss of image quality. Caribbean Newspaper Imaging Project resources are maintained in multiple electronic formats in addition to the digital distribution format that the Project's users find online. Each format has established maintenance schedules. In addition, the University of Florida Libraries is committed to preserving the source microfilm negatives.
With funding from the Andrew W. Mellon Foundation, the Caribbean Newspaper Imaging Project catalog included titles from Cuba and Haiti:
Habana; Diario de la Marina (1947 January - 1961 May 6)
Originating as El Noticioso y Lucero de la Habana in 1832, Diario de la Marina began publishing under that name in 1844 and is considered an essential resource for research on Cuba before the 1959 revolution. The years covered by the Caribbean Newspaper Imaging Project are 1899-1961. The first year covers Cuba's independence from Spain after the Spanish-American War, followed by U.S. military occupation until 1901 and again in 1906. Subsequent years track the succession of regimes, U.S. influence, and the preeminence of the sugar industry.

After the revolution, editorials in Diario de la Marina reflect reaction to reforms and the pace of change, as well as concern about freedom of the press, workers' rights, and international relations. The newspaper relocated to Miami in 1960 and became an important voice for the exile community. The name changed again in 1962, when the paper became a weekly, Impressiones: Diario de la Marina.
Port-au-Prince; Nouvelliste (1899 August - 1979 December)
Le Nouvelliste is Haiti's independent voice and throughout its run has directed its appeal to the most literate audiences. It is particularly notable as an opposition paper during the U.S. military occupation years, 1915 through 1934. Reports on the 1937 Haitian-Dominican crisis are complete. While the newspaper's research value encompasses Caribbean geopolitics, its focus on internal Haitian matters makes it particularly important for specialists concentrating on the country or developing comparisons.

Commercial and cultural information is well developed, especially the paper's treatment of blackness, Africanism, and Afro-Caribbeanism, and its espousal of greater appreciation and recognition of Haiti's African heritage. Important authors and scholars, including the enigmatic Stenio Vincent, the noted historian Stephan Alexis, and intellectuals such as J.B. Romain and Rene Victor, contributed articles to the paper. Le Nouvelliste provides local historical context to the country's long and often tortured relationship with the United States. Today, Le Nouvelliste continues as a quality newspaper.
By Erich Kesse, Robert Harrell, Richard Phillips, and Cecilia Botero
This paper describes the University of Florida's Andrew W. Mellon Foundation-funded Caribbean Newspaper Imaging Project: its goals, approaches and achievements. The Project, designed to convert newspaper microfilm holdings to electronic images, is described in the context of previous preservation efforts, together with a discussion of the limitations of microfilm as an access technology. A review of progress toward goals establishes Project strategies while modeling the implementation of electronic imaging guidelines and the adaptation of traditional technical skills from both cataloging and analog imaging. Critique, particularly of pitfalls and failures, suggests areas for future consideration.
Florida, with its influx of immigrants from and volume of trade with the Caribbean, is almost as much a member state of the Caribbean community as of the United States of America. In Florida's research libraries, emphasis on the collection and preservation of Caribbean resources has a long history rivaling that of Floridiana. The University of Florida, in particular, maintains a large and rich collection of Caribbean archives and publications. The collections are important to building an understanding of the region, bridging cultures, and fostering economic ties. More recently, the collections, by virtue of their preservation in microfilm and the loss of source-documents, have come to represent extensions of various national archives. Legislative reports published by the government of Guyane Française and microfilmed by the University of Florida, for example, continue to exist only in microfilm.
The University of Florida began collecting Latin American and, particularly, Caribbean research resources in the late 1920s. U.S. interest in the region, already piqued by the administration of Cuba at the end of the previous century, had been heightened by the occupation of Haiti beginning in 1915. Following World War II and the convergence of the Farmington Plan1 with the application of microfilm technology, a dedicated faculty and staff systematically built a vast collection - today, more than 1.5 million items - of Caribbean government documents, journals, manuscripts and archives, maps, monographs, and newspapers. In its Latin American Collection alone, the University of Florida holds more than 300,000 volumes of printed materials; a growing number of electronic resources; nearly 50,000 reels of positive microfilm; and, in preservation storage, more than 8,500 reels of negative microfilm masters. The latter represent more than 5 million exposures, or 9.5 million pages. That 7,000 of those reels of microfilm masters are newspaper holdings indicates the emphasis of the collection development and preservation effort.
Newspaper microfilming began in earnest in 1953. The Rockefeller Foundation funded a technician, traveling throughout the Caribbean with a portable microfilm camera, to film materials that could not be acquired otherwise. Many of these materials continue to exist only in microfilm today. The University of Florida's microfilm masters are the archive of several newspapers, among them Cuba's Diario de la Marina and Haiti's Le Nouvelliste.
By the 1960s, supported by state funds, fed by standing orders, and empowered by copyright legislation known as the Inter-American Agreement (1939), a program of microfilming Caribbean and Florida newspapers had been established. Today, the program, which operates under national guidelines and standards for production, duplication and archiving of microfilm for preservation, continues albeit more restricted by changes in international and U.S. copyright legislation. Long standing agreements between the University of Florida and University Microfilms International ensure the availability and continued preservation of these materials as originally envisioned by the Rockefeller Foundation and the Farmington Plan.
Microfilm technology advanced the collection and distribution of resources. Today, it remains a reliable and cost effective means of long term preservation. Microfilm continues to be the medium of choice for stability, life expectancy and image quality and especially for large-format, small-font or fine-line source-documents such as maps and newspapers. Microfilm's several limitations, however, afford it the distinction of least respected information delivery format2. Microfilm must be used in situ and, usually, without the benefits of indexing or relatively immediate image retrieval afforded by newer automated information delivery formats.
Perhaps most limiting, microfilm is difficult to maintain and expensive to replace. Microfilm deterioration begins whenever optimal environmental conditions or microfilm readers are not adequately maintained. Attaining optimal conditions, particularly difficult in Florida and the Caribbean basin countries that rely upon the microfilm, incurs its own high cost;3 the heating, ventilating and air conditioning (HVAC) control systems required are neither inexpensive nor easily maintained. Increasingly, as well, the cost of maintaining readers to service the microfilm is becoming difficult to bear. Once-ubiquitous microfilm readers and reader-printers are losing market share to multipurpose and ever more ubiquitous computers. Replacement parts and service personnel for microfilm readers and reader-printers are increasingly few. Taken together, the costs of acquiring, maintaining, servicing and replacing microfilm are becoming prohibitive, particularly throughout the Caribbean, where poor climate and weak economies converge.
The challenge, which the University of Florida and the Andrew W. Mellon Foundation seek to manage through the Caribbean Newspaper Imaging Project, is the development of an electronic global resource sharing model, both feasible and economical, for information in newspapers. Born of ideas defined by Yale University's Project Open Book4 and the University of Michigan's now independent Journal Storage Project (JSTOR)5, the Caribbean Newspaper Imaging Project is at once hybrid and new. Stated Project goals6 are these:
- Convert approximately 132,500 microfilm exposures, the record of two newspapers: Cuba's Diario de la Marina and Haiti's Le Nouvelliste,7 to digital images;
- Provide multi-lingual indexing in the newspaper's native language (i.e., Spanish and French) and English;
- Implement cost recovery marketing in order to support conversion of additional titles; and
- Establish efficient, low-cost models for facilities and productivity, which would allow other institutions to share the burden of newspaper digitization.
Project completion would require examination of several additional issues.
Conversion issues were several, among them: selection and configuration of a facility; file characteristics and directory structures; source-document definition and condition; and work-force issues.
The definition of an archive was primary. As in Project Open Book, the microfilm would remain the archive of the source-document; both the quality of its images and its life expectancy under optimal storage conditions were known.8 Multiple master storage sites and a monitoring program based on national standards9 would ensure continued preservation. Moreover, the resolution of digitized images of newspapers would only approximate that of the microfilm.10
To safeguard investment in the digital product, DAT (i.e., digital tape) would archive the electronic files, with an additional copy maintained on CD-ROM, the format elected for distribution. Electronic archives would be placed in storage conditions meeting existing standards and monitored in accord with industry standards. In many ways, the management of an electronic archive has been with us for more than a decade in the form of locally held automated catalog-record tapes, census information, and other electronic files.
Distribution of images via the Internet was considered but rejected during the planning process. Internet distribution of both project titles would have required in excess of 197 GB of active storage space, which was not available at the Project's start. Moreover, conveyance of the images had its own problems. Though GIF-on-the-fly software would have made images browsable without additional labor, GIFs were large enough, in terms of bytes per image, to render remote access laboriously slow without large and dedicated bandwidth. The graphical size of images was yet another problem. Images would not fit, legibly, within a browser's viewing pane; awkward bi-directional scrolling was required. Further, conversion to GIF was "lossy," reducing image quality; this was evident, particularly, in image areas most dependent on fine resolution, such as the classifieds. While we continue to investigate Internet distribution, it was and remains our conclusion that this form of distribution will not be viable until the problems listed above can be resolved. Distribution of TIFF images bundled with a TIFF viewer and an index interface on a CD-ROM conforming to ISO 966012 was elected.
The decision to build an imaging facility within the University Libraries was made during the planning stage. At the time, the number of commercial facilities offering microfilm conversion services was small, and the fees charged by existing services were not considered economical. The University's Preservation Department had the requisite managerial and production experience from its in-house microfilming facility,13 and had been building the networking experience necessary to establish an in-house digitizing facility. Additional knowledge of electronic imaging and digital formats was gained through the Cornell University Digital Imaging Workshop,14 together with an exhaustive program of reading and experimentation. The characteristics of the space needed were similar to those of the Department's microphotography facility. A vibrationless, dust-free environment, darkened independently of adjacent offices, was carved from existing space.
Microfilm scanning equipment selected by the University of Florida would have to support intensive long-term use and produce images meeting a high image quality threshold as suggested by Project Open Book and the Cornell Workshop. Equipment also would have to be affordable in terms of producing images at the lowest possible cost. Several microfilm scanners capable of meeting the quality requirements were available but would have increased the final per-image cost several fold. The Mekel scanner, with software components, used by Project Open Book, cost more than $100,000. The Minolta MS1000 scanner, including software, with which the Caribbean Newspaper Imaging Project was begun, cost less than $25,000. A second scanner, the Minolta MS3000, was added to meet production targets less than one year after purchase of the MS1000 at less than $21,000.
The Minolta products provided acceptable dots-per-inch (dpi) resolution and gray scale. They lacked the Mekel scanner's several automated features, but these were deemed unnecessary owing to characteristics of the selected newspaper microfilms. The Minolta equipment was capable of scanning to a depth of 400 dots per inch (dpi), regardless of filming mode, but depended on resolution of the image projected on screen at the time of imaging. The Mekel equipment, in comparison, was capable of scanning materials filmed in two-up comic mode at 300 dpi and those filmed in two-up cine mode at 600 dpi.15 It had no dependence on projected screen resolution; images were made directly from the film. Characteristics of the microfilm (i.e., two-up comic mode) muted questions of selection. The Minolta equipment was sufficient if not, in some ways, more versatile for scanning newspapers on microfilm in two-up comic mode.
When the project began, a 486 CPU, 66 MHz workstation was the best available computer to drive the scanners. Each workstation ran with 8 MB RAM and temporarily saved scanned images to 2 GB hard-drives. While this configuration was adequate for Project start-up, it was quickly determined that a more powerful configuration was needed to increase productivity over scan-time. Each of the scanner workstations has been up-graded to an Intel Pentium CPU, 166 MHz, running with 32 MB RAM. Workstations were also outfitted with 20-inch monochrome monitors to facilitate image quality assessment. In addition, uninterruptible power supplies (UPS) became standard for all scanner and back-up workstations, as well as for the server, guarding against electrical malfunction, lightning strikes, etc.
Under a distributed computing model, other equipment was selected for remote indexing, 4mm DAT backup, and CD-ROM distribution-product creation. Microfilm scanners and other equipment were added to the Preservation Department's existing local area network (LAN), an Intel Pentium CPU, 166 MHz server with 128 MB RAM running NOVELL 3.11. A subsequent hardware up-grade and migration to a Windows NT platform increased speed and file management capabilities. At the project's start, the LAN consisted of 10 Windows 3.11 and Windows 95 workstations connected by thin-wire Ethernet, since up-graded to a dedicated hub using twisted-pair, fast Ethernet. The server has an 8 GB storage capacity with 4 GB dedicated to image file transfer, assessment, etc. This capacity is sufficient for file processing only and requires nearly constant file archiving. Throughput needs demanded similar attention be paid to bandwidth. Bandwidth limitations necessitated transfer of images from the server to the remote mastering workstation equipped for both CD-ROM mastering and DAT backup.
Because of the magnitude of the files and the complexities of maintaining multiple-user access for inputting index records, in-house digitizing requires a significant commitment of Systems staff. Networking and workstation requirements should be given serious consideration even by programs that opt to out-source scanning. The facilities and physical support structure required to perform image quality review alone are not insignificant.
File characteristics include scan depth; tonal qualities; file format; and compression. For optimal image quality in library applications, these characteristics are defined by the emerging standard established by the Cornell University Libraries.16 Scan depth (i.e., dpi) and tonal qualities determine resolution.17 File format and compression determine file size and "lossiness."18
It was determined that source-documents would be imaged at 400 dpi with 64 levels of gray, the maximum level allowed by the Minolta scanners.19 The microfilm used for newspaper filming is a high contrast medium which is essentially bitonal. Use of gray scale in imaging would maintain any tonal qualities captured by the film in illustration and fine or small print.20 Scanned files would be saved in the tagged image file format (TIFF), using ITU T.6 (formerly, CCITT Group 4) compression. TIFF images with ITU T.6 compression are "lossless." File sizes, ranging between 0.8 and 1.4 MB compressed, and the number of files to be saved, more than two hundred and sixty five thousand, obviated saving files uncompressed. With compression, there was a nearly one-to-one conversion. Production generated, on average, approximately one CD-ROM for every reel of microfilm converted.
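The arithmetic behind these figures can be checked with a short sketch. The page dimensions (roughly 15 x 22 inches, a broadsheet) are an assumption for illustration; the scan depth, bit depth, compressed file sizes, and file count come from the text:

```python
# Storage arithmetic for the Project's image files. The 400 dpi scan depth,
# 64 gray levels (6 bits per pixel), 0.8-1.4 MB compressed file sizes, and
# ~265,000-file count are from the text; the 15 x 22 inch page size is an
# assumed broadsheet dimension used only for illustration.
width_px = 15 * 400           # 6,000 pixels across
height_px = 22 * 400          # 8,800 pixels down
bits_per_pixel = 6            # 64 levels of gray

uncompressed_mb = width_px * height_px * bits_per_pixel / 8 / 1024**2
# Roughly 38 MB per page uncompressed -- hence the need for ITU T.6
# compression, which brings each file down to the reported 0.8-1.4 MB.

files = 265_000
low_gb = files * 0.8 / 1024   # total at the small end of the range
high_gb = files * 1.4 / 1024  # total at the large end
# The low end of this range is of the same order as the ~197 GB of active
# storage the Project estimated for Internet delivery of both titles.

print(round(uncompressed_mb), round(low_gb), round(high_gb))
```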
Article data tables used for indexing and abstracting were built as a FoxPro relational database application. Delphi programming was used to build both a multi-user interface for access to index and abstract entries and a viewer for access to images. Data elements allowed record of newspaper and article titles; enumeration, pagination and column numbers; author; subjects/index terms; and publication chronology and event dates, as well as, searchable keyword abstracts in English and the newspaper's native language, French or Spanish.
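The data elements above map naturally onto a relational schema. The following is a hypothetical reconstruction in SQLite, not the Project's actual FoxPro application; all table and column names are illustrative assumptions:

```python
import sqlite3

# Hypothetical reconstruction of the Project's article index. The Project
# used FoxPro with a Delphi front end; SQLite stands in here for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    id              INTEGER PRIMARY KEY,
    newspaper       TEXT NOT NULL,   -- e.g. 'Diario de la Marina'
    title           TEXT,            -- article title
    issue_date      TEXT,            -- publication chronology
    event_date      TEXT,            -- date of the reported event
    section_page    TEXT,            -- e.g. 'A01', links to the image file
    columns         TEXT,            -- column numbers on the page
    author          TEXT,
    abstract_en     TEXT,            -- searchable English abstract
    abstract_native TEXT             -- French or Spanish abstract
);
-- Controlled index terms: many terms may be assigned to one article.
CREATE TABLE article_subject (
    article_id INTEGER REFERENCES article(id),
    term       TEXT NOT NULL
);
""")

conn.execute(
    "INSERT INTO article (newspaper, title, issue_date, section_page) "
    "VALUES (?, ?, ?, ?)",
    ("Le Nouvelliste", "Sample entry", "1956-06-01", "A01"),
)
rows = conn.execute("SELECT newspaper, section_page FROM article").fetchall()
print(rows)  # [('Le Nouvelliste', 'A01')]
```

A separate subject table accommodates the many-to-one relationship between controlled index terms and articles, mirroring the multi-lingual term assignment described above.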
Newspapers are readily adaptable to a directory structure that is intuitive to any user insofar as their chronology suggests structure. Directories are arranged with title at the top level, followed in cascading order by year of publication, month of publication, date of publication, and section-and-page number. The front page of Le Nouvelliste's June 1, 1956 issue, for example, equates to the file located at [drive letter]:/Nouvelliste/1956/06/01/A01.tif. This scheme works well, in turn, when querying or parsing requests from index-interface (i.e., relational database) and image-viewer programs.
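The cascading scheme can be expressed as a small path-building helper. This is an illustrative sketch of the convention described above, not Project code; the function name and drive-letter default are assumptions:

```python
from datetime import date

def image_path(title: str, issue: date, section_page: str,
               drive: str = "D") -> str:
    """Build the cascading title/year/month/day/page path described above.

    A hypothetical helper illustrating the Project's directory convention;
    the function name and drive-letter default are assumptions.
    """
    return (
        f"{drive}:/{title}/{issue.year}/{issue.month:02d}/"
        f"{issue.day:02d}/{section_page}.tif"
    )

# Front page of Le Nouvelliste for 1 June 1956:
p = image_path("Nouvelliste", date(1956, 6, 1), "A01")
print(p)  # D:/Nouvelliste/1956/06/01/A01.tif
```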
This scheme, however, does not easily accommodate page-name anomalies in the source-document. Failure to anticipate anomalies aside,21 anomalies that occur as a result of printing or publication can be "corrected" only through indexing. Under the distributed computing model employed by the Project, correction through indexing requires coordination among indexing and imaging staff in referencing and naming anomalous files. Misprints resulting in incorrect publication of chronology and pagination require corrective action that is similar to but more proactive than attention shown to correct such problems during microfilming for preservation. Without indexing, directory structure and file naming conventions that do not impose a consecutive image numbering scheme are unforgiving of anomalies in chronology and pagination. At the same time, consecutive image numbering schemes without indexing prohibit intuitive image access; images must be "paged" or viewed image by image. Our experience suggests that a directory structure and file-naming scheme be standardized for serialized information and, particularly, for information in newspapers.
This scheme also is not favorable to the preservation practice of microfilming a single page at multiple densities, one optimized for the capture of text and the other optimized for the capture of illustration. In this Project, exposures optimized for text capture were deemed most important and, therefore, scanned and saved with the standard directory/file-name designation. Illustrations were rarely indexed, though several notable and important illustrations were recorded. If the microfilm does indeed capture graphic information better than the digital version, conversion of the exposure optimized for illustration might not serve its intended purpose. When the exposure optimized for illustration was scanned and saved, it was saved with additional designation, e.g., A01a.tif. Because files saved with the additional designation could not be parsed by the index-interface or viewer programs, their value was almost solely for purposes of quantifying differences between the microfilm and digital versions. No thought was given, beyond an initial test, to pasting the scan of the optimized illustration into the scan of the optimized text; the size of the relative parts was greater than the resources of the individual workstations (i.e., their CPU, RAM and virtual RAM).
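A naming convention of this kind (section letter, two-digit page number, optional trailing letter for an alternate exposure such as A01a.tif) lends itself to mechanical checking. The sketch below is hypothetical, intended only to illustrate how anomalous names could be flagged for the kind of manual reconciliation between imaging and indexing staff that the text describes:

```python
import re

# Hypothetical pattern for the Project's page file names: a section letter,
# a two-digit page number, and an optional trailing letter marking an
# alternate exposure (e.g. one optimized for illustration), as in 'A01a.tif'.
PAGE_NAME = re.compile(r"^(?P<section>[A-Z])(?P<page>\d{2})(?P<alt>[a-z]?)\.tif$")

def parse_page(name: str):
    """Return (section, page, is_alternate), or None for anomalous names
    that would need manual reconciliation through indexing."""
    m = PAGE_NAME.match(name)
    if m is None:
        return None
    return (m["section"], int(m["page"]), bool(m["alt"]))

print(parse_page("A01.tif"))    # ('A', 1, False)
print(parse_page("A01a.tif"))   # ('A', 1, True) -- alternate exposure
print(parse_page("front.tif"))  # None -- anomaly, corrected only via indexing
```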
It would also be advantageous to standardize, beyond the experience of this Project, the data-elements used during indexing and abstracting of newspapers. While the practice of this Project was to record information in a relational database that treated the image file as an object in a table, this information could be recorded as Standard Generalized Markup Language (SGML), metadata, or other file header information. The database method's advantage is that indexing can proceed separately from, or in advance of, imaging, assuming agreed-upon methods of relating the image object to the index. It afforded time to review entries by area and language specialists working at their own pace. Image objects could be committed to the electronic archive immediately. Other methods build an index through tagging an existing image. This would have required either indexing as imaging occurred or, more aptly suited to the Project's distributed computing model, maintaining images in active disk space or an intermediary file until tagged. Immediate indexing would have necessitated either input-ready indexing or a staff with an unlikely combination of imaging, indexing and language skills. Preparation of input-ready indexing would have required additional start-up time, which was not available. Maintaining images in active disk space would have required additional server and bandwidth resources, which would have slowed progress and decreased cost-efficiency. Use of intermediary files would have necessitated an additional layer of tracking and management.
A software interface, programmed specifically for the project, allows users to search the index and abstracts and to browse images. The directory structure is used to link index and abstract entries with the image objects. While this software, particularly the freeware TIFF image browser, was necessary to complete the project, it will likely soon become a once-convenient but no-longer-necessary tool. Standardization of newspaper article indexing and abstracting data-elements, and the subsequent mapping of these elements as an SGML or XML Document Type Definition (DTD), perhaps with crosswalks to other DTDs, will make the software obsolete.
Several source-document issues resulted from filming; issues related both to the source-documents and to the source-document microfilms had to be considered. Printer's effects; shipping, binding and storage effects; embrittlement effects; paper characteristics; and illustration and font sizes were issues of concern regarding the source-document. Source-document lighting; processing and storage effects; orientation; reduction; exposure and density; and resolution were issues of concern regarding the microfilm. Planning for the work required assessment of the source very much as would have been necessary to microfilm a source or to generate paper facsimile from a microfilm. Traditional use of random survey and interpolated data was made during planning. In retrospect, more detailed analysis was required: the great variety of source-document and microfilming characteristics proved assumptions based on survey to be inaccurate, and the sample's ±10% margin of error was inadequate.
The adage, "garbage in, garbage out," is a harsh truism: electronic technologies cannot reverse defects carried onto microfilm. Scanned directly from source-documents, image defects such as staining or those resulting from creases and folds are discernible from the text they obscure by gradient differentiation techniques.
Once committed to a high-contrast medium such as microfilm, however, differentiation between defect and text becomes unlikely. Text readable through stains on the original is often no longer readable on microfilm.
Effects such as bleed-through, transference, and uneven or over-inking had to be noted in order to assess the quality of individual scans. Minor but time-consuming corrections to improve hardware and software performance had to be made throughout the Project. The nature and number of corrections demonstrated that uniform conversion settings were arbitrary and would have rendered the automated features of the scanning equipment useless. More detailed assessment might not have reduced this burden but would have assured Project managers both of adequate initial staffing funds and of a workforce trained, from the start, to deal with the broadest range of image defects.
The titles selected for conversion were microfilmed between 1957 and 1987. Some were microfilmed by the University of Florida in their country of origin on portable equipment and others, at the University of Florida on stationary equipment. While the most consistently reported physical defect encountered was scratching, deterioration of the microfilms' acetate base was evidenced by tears, curling and separation of the emulsion from the base throughout the microfilm collection. Every imaginable effect of filming practices also was encountered. The thirty years between 1957 and 1987 was a period of increasing standardization; both the growth toward standard practice and every change in standards can be seen on the microfilms, together with the defects of filming. Even defects such as slight light imbalance on the surface of the source-document during filming become troublesome during scanning of newspapers reduced twenty-one times onto microfilm.
Not all problems noted could be corrected. Image enhancement techniques, e.g., dithering, despeckling, etc., could not be used effectively owing to the nature of high-contrast microfilm or the fine resolution of broadsheet newspapers on 35 mm microfilm. Removal of scratches and errant marks, for example, could not be automated without the loss or degradation of text, and manual removal was not cost effective. Moreover, when manual correction was completed, the task often required native-language skills. In review, the exercise proved unnecessary; native-language readers were able to discern words adequately from obscured text. Though enhancement and human intervention will likely remain a necessity if intelligent character recognition (ICR) or optical character recognition (OCR)23 are to be employed on scanned newspapers, improvements in software must first make the task more efficient and cost effective. Corrections that were cost effective were largely mechanical and similar to those undertaken during microfilming. Alignment problems, for example, required rotation or deskewing. Source-document microfilm density problems, including over- and under-exposure as well as inking effects, could be minimized by manipulating lighting conditions during scanning.
Some problems were the result of the mechanism itself. Residue of spent filament inside the scanner bulb's vacuum, for example, produced image effects that required the workforce to build expertise in differentiating the effects of bulb condition from those of unbalanced inking, wear or exposure.
Quality control of scanned images was performed through a process of benchmarking, which made visual comparisons between the digital quality of an optimized image and successive images, a method similar to that developed by Yale.24 Differences in method were necessitated by differences in microfilms and source-documents. Project Open Book assumed that scanned microfilm met or closely approximated current standard and contained images of average book size filmed at reduction normal for books. The Caribbean Newspaper Imaging Project microfilm was produced prior to current standard and contained images reduced at more than twice the ratio required for book microfilming.
The unit against which image-quality comparisons were made was the smallest "e," usually in the classifieds, on the microfilm. Benchmarking required "optimizing"25 the page containing the "e" and comparing the clarity of text on subsequent scans. Benchmarking, like the "quality e" measurement in preservation microfilming, was done approximately every tenth image. Images were optimized approximately every 300 scans or as the workforce changed. Benchmarking was partly art, requiring subjective judgment, particularly when image density varied across a single page; different scan settings improved the legibility only of different parts of a given page.26 Because much of the microfilm that libraries depend upon has not been produced to the level of current standard, problems associated with conversion of substandard microfilm require further consideration.
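The quality-control cadence described above amounts to a simple schedule, sketched here. The comparison itself was a subjective visual judgment, so the function below only decides when each step applies, under the stated intervals.

```python
# Sketch of the QC cadence described above: re-optimize roughly every
# 300 scans (or when the operator changes) and benchmark every tenth
# image against the current optimized reference. The intervals are from
# the report; the function itself is illustrative, not Project code.
OPTIMIZE_EVERY = 300
BENCHMARK_EVERY = 10

def qc_actions(scan_number, operator_changed=False):
    """Return which quality-control steps apply to this scan."""
    actions = []
    if operator_changed or scan_number % OPTIMIZE_EVERY == 0:
        actions.append("optimize")    # re-tune settings on the smallest "e"
    if scan_number % BENCHMARK_EVERY == 0:
        actions.append("benchmark")   # visual comparison to the reference
    return actions
```

A scan number divisible by 300 triggers both steps at once, which matches the practice of benchmarking against a freshly optimized reference.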
Display size of the scanned source-documents presented additional problems. Display at a one-to-one ratio was too large to fit and navigate easily on screen; display at a reduction that fit and navigated easily on screen rendered text illegible. The solution was to program a TIFF viewer containing a "magnifying glass".27 Images are opened to fit on screen in a "window" containing a magnifying glass that can be moved by dragging the device over the image. The image area beneath the magnifying glass is displayed legibly in a separate window. This solution also resolved the problem of fees and legal agreements associated with embedding image-viewer software on the CD-ROM with the images and index; the use of a viewer programmed by the Project would incur no additional costs.
Other than its indexing component and the use of older microfilms, the Caribbean Newspaper Imaging Project most differed from Yale's Project Open Book in staffing.28 Trained and managed by permanent staff, student assistants were hired to perform the bulk of tasks. Students were available in a large pool, inexpensive, easily trained and often highly computer literate or fluent in French or Spanish. While use of a student workforce had its disadvantages (e.g., high turnover, high levels of supervision, retraining, scheduling, consistency of product), its pay-off was low cost. Student staffing reduced costs to nearly two-thirds of what permanent staff would have incurred. Intensive training and review of performance and products assured quality and consistency while lowering per-image costs below those calculated for the employment of full-time staff during project planning.
Indexing routines were supervised and work reviewed by three Latin American studies and language specialists, who also defined indexing criteria and the select, controlled vocabulary derived from Library of Congress Subject Headings. Approximately 2 FTE of part-time staff were employed to index and abstract. Part-time staff were paid $6.50 per hour and did not accrue benefits. Native French speakers, mostly from Haiti but also from the French Caribbean and French north and west Africa, indexed and abstracted articles from Le Nouvelliste. The small pool of available French speakers slowed completion of the task. Native Spanish speakers, largely of Cuban descent, indexed and abstracted the Diario de la Marina. In both cases, indexing and abstracting were done in the native language and later translated into English, completing bilingual indexing requirements. More than 20,000 articles were indexed and a minimum of one article per issue was abstracted. Article selection was at the discretion of the indexer/abstracter within criteria established by the Latin American specialists. Quality control and editing were subsequently completed by the specialists.
Imaging routines were established and images reviewed by a reprographics specialist who also managed DAT archiving and CD-ROM production. Approximately 2 FTE of part-time staff, a sufficiently stable workforce, were employed to image the microfilm. Part-time staff were paid $5.00 per hour (i.e., slightly above the minimum wage at that time) and, for the most part, did not accrue benefits. Two-hour shifts were maintained in order to optimize attention and minimize the risks of eye-strain and repetitive stress injury. The average employee scanned at a rate of 1.25 images per minute (IPM). Staff whose productivity was low (0.5 IPM was the lowest recorded) or whose accuracy or image quality was consistently low were dismissed. The reprographics specialist, who worked regular shifts to maintain skills and demonstrate efficiencies, was frequently able to produce images of acceptable quality at rates in excess of 2.75 IPM. Most efficiencies, other than those gained through networking upgrades, were achieved through mechanical means, e.g., film advance techniques; other measures, such as the two-hour shift, resulted in equal gain. With both microfilm scanners operating, an average of 1.5 GB of scanned images was produced each day of operation. Scanners operated between 65 and 120 hours per week.
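As a rough check on the throughput figures above, the stated rates imply the following volumes. This is arithmetic only, not Project accounting.

```python
# Back-of-envelope throughput implied by the figures above: a 1.25 IPM
# average rate, two-hour shifts, and 65-120 combined scanner-hours/week.
AVG_IPM = 1.25                 # average operator rate, images per minute
SHIFT_MIN = 2 * 60             # two-hour shifts

images_per_shift = AVG_IPM * SHIFT_MIN
print(images_per_shift)        # 150.0 images per two-hour shift

# Combined scanner operation of 65-120 hours per week:
low = 65 * 60 * AVG_IPM
high = 120 * 60 * AVG_IPM
print(low, high)               # 4875.0 9000.0 images per week
```

At roughly 5,000 to 9,000 images per week, the Project's total of 265,000 images represents on the order of a year of sustained scanning, which squares with the scale of the effort described.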
Systems support staff included FoxPro and Delphi programmers as well as a network troubleshooter. Attempts to hire a computer programmer to develop both the multi-user indexing system and the public-user interface were fruitless; State of Florida staffing plans had been unable to compete with corporate market forces, leaving Systems Department programmers to assume responsibility at the cost of delay in other project schedules. Network software was configured by Systems staff but administered by Preservation staff; the network actually pre-dated the Project and was expanded to accommodate it. Insofar as programmers' work may be borrowed or adapted, other projects working from the experience of this Project should not require as much or the same type of programming assistance. Networking requirements, hardware, and bandwidth use grew rapidly throughout the Project, predominantly through upgrades to increase performance. Networking speed was the single most important factor in increasing productivity and decreasing costs.
Caribbean Newspaper Imaging Project digitization of Le Nouvelliste and Diario de la Marina comprises more than 20,000 index entries, 40,000 abstracts and 265,000 images. In total, indices, abstracts and images occupy more than 200 GB. Images alone fill 98 archival 2 GB DAT tapes or 329 distribution-ready 650 MB CD-ROMs. CD-ROMs contain images, a viewer, and the indices and abstracts for the images on each CD-ROM. Images are available by title, date or subject, supplied on CD-ROM, with other distribution formats negotiable.
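The storage figures above can be checked against one another with simple arithmetic (treating 1 GB as 1,000 MB and taking 200 GB as the reported round figure):

```python
# Consistency check of the reported storage figures; approximate only.
images = 265_000
total_gb = 200                  # "more than 200 GB", taken as a round lower bound

avg_mb_per_image = total_gb * 1_000 / images
print(round(avg_mb_per_image, 2))           # 0.75 MB per image, on average

dat_capacity_gb, cd_capacity_mb = 2, 650
print(98 * dat_capacity_gb)                 # 196 GB across 98 DAT tapes
print(round(329 * cd_capacity_mb / 1_000))  # 214 GB across 329 CD-ROMs
```

The three figures agree to within a few percent, with the CD-ROM total slightly larger as expected, since each disc also carries the viewer, indices and abstracts.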
Project costs were calculated to include labor, media and equipment costs. Labor costs included wages, salaries and benefits paid to part-time and full-time staff for indexing and abstracting, imaging and related functions, and software development and network support. The table below is a summary accounting of expenditures per image.
Media (DAT and CD-ROM)
Hardware & Software
Scanning & Archive Mastering
Indexing & Abstracting
Programming & Systems Support
Total per Image Cost
* Hardware and software costs, including purchases and upgrades, were based on an equipment life of five years and prorated for the life of the Project. Of the total hardware and software costs, $0.10 per image supported scanning and archive mastering; $0.01 per image supported indexing and abstracting.
** Calculated per article indexed and abstracted, the actual cost of Indexing & Abstracting was $0.56.
In relative terms, imaging costs are comparable to those reported by Yale.29 Comparison with Project Open Book is not exact; differences in the type of source documents, the quality of source microfilms, and the selection of equipment prohibit true comparisons. Caribbean Newspaper Imaging Project cost reports excluded network storage, transaction and maintenance fees and wire costs, which might have been included had a network not been previously owned and operated. These costs also appear to have been excluded from summary data produced by Yale.
The Caribbean Newspaper Imaging Project is a cost-recovery project by design, both as an incentive to efficiency and as a means of extending the project to subsequent titles. Assessment of efficiencies is still ongoing. Problems experienced as the model was implemented, however, suggest its imperfection. Indexing and abstracting, in particular, proved more costly than anticipated; at fifty-six cents per article indexed and abstracted, the model demands an alternate approach. Bilingual abstracting, especially, appears economically unfeasible.
The Caribbean Newspaper Imaging Project establishes yet another model for digitization, one of the first to deal with newspapers on microfilm. Among Project goals, only cost recovery through sales has yet to be achieved. In some ways, such as the creation of a large-image viewer, the Project exceeds its goals. The Project, while not directly comparable to other implementation demonstrations such as Yale University's Project Open Book, provides summarized cost data on par with the most cost-efficient of those projects.
The Caribbean Newspaper Imaging Project builds new experience for digitization of texts from microfilm predating current "standard" practice. It suggests means of classifying and naming, indexing and abstracting newspapers and places a price on these practices, albeit high. Building on this Project, related secondary projects, such as the ongoing Eric Williams/Trinidad Guardian Project, explore the possibilities and costs associated with optical character recognition (OCR), adding full-text for select, highly significant articles.
The technical experience of this Project and other projects warily suggests that microfilming guidelines be reviewed and revised for the benefit of future digitization. At the time current standards were written, microfilming was a child we wanted to raise correctly. Today, microfilming has entered an adulthood, about to become a parent whose bad habits may be passed on to the next generation of technology's products. In recent months, at its summer 1997 meeting, the Association of Research Libraries has authorized a task force to investigate this suggestion. It is hoped that the reports of this task force will effect changes in the practice of microfilming which will optimize and further reduce costs associated with digitizing microfilmed source-documents including newspapers.
- The Farmington Plan was a cooperative collection development plan begun in 1948 and joined voluntarily by American libraries as a means of increasing the number of resources, largely of foreign origin, available to researchers in the United States. The University of Florida assumed "country responsibilities" for materials published in the Caribbean basin. With its presence in the Seminar on Acquisitions of Latin American Library Materials (SALALM) and the Latin American Microfilming Project (LAMP), the University continues to meet these responsibilities.
- Cf, Anderson, Arthur James. "Faculty to library directory: we hate microfilm." Library Journal, v.113 (Oct. 15, 1988), p.50-52.
- While the University of Florida stores microfilm masters under exacting conditions prescribed by national standards (cf, http://karamelik.uflib.ufl.edu/repro/micrographics/manuals/storage1.html ), its storage of microfilm for research use is optimized for human comfort and inadequate for microfilm longevity.
- The Commission on Preservation and Access has published information about Project Open Book. Cf,
- Waters, Donald and Shari Weaver. The organizational phase of Project Open Book: a report to the Commission on Preservation and Access on the status of an effort to convert microfilm to digital imagery. (Washington, D.C.: Commission on Preservation and Access, 1992). Reprinted in: Microform Review. v.22,n.4 (Fall 1993), p. 152-159.
- Conway, Paul and Shari Weaver. The setup phase of Project Open Book: a report to the Commission on Preservation and Access on the status of an effort to convert microfilm to digital imagery. (Washington, D.C.: Commission on Preservation and Access, 1994). Reprinted in: Microform Review. v.23,n.3 (Summer 1994), p.110-119.
- Cf, the JSTOR web site at http://www.jstor.com/
- Additional information about the Caribbean Newspaper Imaging Project and its goals may be found at the Project's web site, http://karamelik.uflib.ufl.edu/projects/mellon/
- These titles were selected from the more than 100 in the University's archive of newspaper microfilm masters because of their relevance to current events and the importance of their countries of origin in the affairs of the United States and the history of the Caribbean basin. For more information particular to the selection of each title, see the Project's web site.
- Cf, Lauder, John. "Digitization of microfilm: a Scottish perspective." (Microform Review. v.24, n.4 (Fall 1995), p.178-181.)
- Association for Information and Image Management. Standard for information and image management : recommended practice for inspection of stored silver-gelatin microforms for evidence of deterioration. (ANSI/AIIM MS45-1990) Silver Spring, MD : the Association, 1990.
- White, William. "Image quality in analog and digital microtechniques." (Microform Review. v.20, n.1 (Winter 1991), p.30-32.)
- The Minolta Corporation's free TIFF viewer plug-in for Internet Explorer and Netscape (cf, http://www.minoltausa.com/low/static/tiff_plugin/tiff_view.html) alleviates some of the problems associated with both image size and browser access to TIFF files, but does not reduce download time; TIFF files are larger than those of other file formats.
- International Standards Organization. Information processing -- Volume and file structure of CD-ROM for information interchange. [ISO 9660:1988] Geneva, Switzerland: the Organization, 1988.
- The Preservation Department produces more than 500,000 exposures annually. Its managerial staff, who have served on industry and library standards committees, oversee the production of microfilm in compliance with American National Standards Institute (ANSI) and Association for Information and Image Management (AIIM) standards and Research Libraries Group guidelines.
- The Workshop manual, authored by Anne R. Kenney and Stephen Chapman, has been published as Digital imaging for libraries and archives (Ithaca, NY: Cornell University Library, 1996).
- Conway, Paul and Shari Weaver. The setup phase of Project Open Book: a report to the Commission on Preservation and Access on the status of an effort to convert microfilm to digital imagery. (Washington, D.C.: Commission on Preservation and Access, 1994), p.15. Reprinted in: Microform Review. v.23,n.3 (Summer 1994), p.115.
- Kenney, Anne R and Stephen Chapman. Digital imaging for libraries and archives. (Ithaca, NY: Cornell University Library, 1996).
- Resolution as it relates to photographic and electronic imaging. Technical report, TR26-1993. (Silver Spring, MD: Association for Information and Image Management, 1993).
- For definition, see: Glossary of imaging technology. Technical report, TR2-1992. (Silver Spring, MD: Association for Information and Image Management, 1992).
- Initially, Minolta software allowed a maximum of 16 levels of gray. Though early images from the Nouvelliste were made at 16 rather than 64 levels of gray, the difference is minimal, most tonal quality having been lost as a result of microfilm.
- Many images made from the Diario were bi-tonal rather than gray-scale. High contrast microfilming, necessitated for the capture of its faint print, virtually reduced illustrations to black and white. Bi-tonal imaging resulted in savings of file space which out-weighed the slight advantage of gray-scale imaging in this case.
- Conjunction of section letters with page numbers (e.g., A01, A02) in the file name results as a failure to fully review and define the characteristics of publication. While reasonably intuitive, the conjunction requires additional programming in the index-interface and image-viewer programs to distinguish and correctly query and parse numeric and alpha-numeric file names.
- For a more detailed description of source-document issues, see: Conway, Paul and Shari Weaver. The setup phase of Project Open Book: a report to the Commission on Preservation and Access on the status of an effort to convert microfilm to digital imagery. (Washington, D.C.: Commission on Preservation and Access, 1994), p.6-9. Reprinted in: Microform Review. v.23,n.3 (Summer 1994), p.111-112.
- For definition, see: Glossary of imaging technology. Technical report, TR2-1992. (Silver Spring, MD: Association for Information and Image Management, 1992).
Application of ICR or OCR on imaged newspapers, especially those converted from microfilm, is problematic also for other reasons, principally the digital resolution requirements of software currently available. The University of Florida is currently modeling an OCR application for newspapers converted from microfilm; results may be seen in its Eric Williams/Trinidad Guardian Reporting Project web site: http://karamelik.uflib.ufl.edu/williams/guardian/
- Conway, Paul and Shari Weaver. The setup phase of Project Open Book: a report to the Commission on Preservation and Access on the status of an effort to convert microfilm to digital imagery. (Washington, D.C.: Commission on Preservation and Access, 1994), p.10-11.
- "Optimization" entailed clarifying the digital image through manipulation of scan-settings. Scans of the image containing the "e" were enlarged, sometimes to the point of pixelation; the scan with the best settings produced the least blocking. Periodically, images were printed out and compared as described by Yale, but this method produced results no better than had been produced by visual comparison of on-screen enlargements.
- Albeit, as a single setting per frame. Minolta equipment does not support "windowing," i.e., the ability to optimize for illustration with one setting and for text with another setting in one scan. Yale reports similar limitation with Mekel equipment; cf, Conway, Paul and Shari Weaver. The setup phase of Project Open Book: a report to the Commission on Preservation and Access on the status of an effort to convert microfilm to digital imagery. (Washington, D.C.: Commission on Preservation and Access, 1994), p.15. Reprinted in: Microform Review. v.23,n.3 (Summer 1994), p.9.
Use of image composition software, e.g., Adobe Photoshop or Paintshop Pro, to achieve this result both was cost prohibitive and yielded inadequate results. The high contrast medium of microfilm had irrevocably damaged tonal qualities of most illustrations.
- The TIFF viewer was made available only on page-image CDs [no longer available - CNIP contents will be migrated to the Internet in the future]. Its interface has been programmed for the Project, but also allows use with other large digital documents such as maps.
- Cf, Conway, Paul. "Yale University Library's Project Open Book." D-Lib magazine (February 1996) [published electronically at: http://www.dlib.org/dlib/february96/yale/02conway.html] for discussion of staffing.
- Ibid. Project Open Book did not incur indexing and abstracting costs as did the Caribbean Newspaper Imaging Project. Caribbean Newspaper Imaging Project cost reporting separates indexing and abstracting costs from imaging costs in order to establish some degree of comparability.
Users of electronic images come to digital media with a set of expectations greater than those they have of other media. They anticipate extensive indexing, directly and interactively linked to the indexed information. With this second phase of the Caribbean Newspaper Imaging Project (CNIP2), the University of Florida tested the viability and costs associated with use of optical character recognition (OCR) as an alternative to manually indexing electronic newspapers.
With funding support from the Andrew W. Mellon Foundation, the University of Florida has scanned its microfilmed newspaper holdings of the Diario de la Marina (Havana, Cuba), 1947-1960, and Le Nouvelliste (Port-au-Prince, Haiti), 1899-1979. In the process, these newspapers were indexed selectively by reviewers knowledgeable in the languages. Selective indexing was not ideal, given that it is highly labor-intensive and far from comprehensive. CNIP2 was undertaken to assess the value and cost effectiveness of OCR indexing of these same newspapers.
CNIP2 evaluated OCR effectiveness within the following target groups:
- OCR software technologies;
- Digital image resolution;
- Bit depth;
- Language of the source newspaper;
- Publication dates; and
- Filming methods and technologies.
While OCR of page images smaller than a newspaper's folio dimensions has been successfully demonstrated and cost-effectively applied, OCR application to newspaper images had not been addressed when CNIP2 began in 1999.
Background. Phase One: The Feasibility of Image Capture.
Today, there are only three effective means of reproducing newspapers: (1) image conversion from film, (2) capture using a very-high resolution digital camera, or (3) rekeying from either source newspapers or from film.
Newspapers continue to be too large for extant flat-bed scanners. Lenzar, the Florida company that manufactured large-format linear-array flat-bed scanners, went out of business in 1997; it was the only manufacturer of such products. Alternately, newspaper stock, with its short fibers, is often too fragile for rotary plotter-scanners. And historic newspapers, universally embrittled, require a great deal of care in handling; it would be unthinkable to pass them through a rotary plotter-scanner, or even to lay them on a flat-bed scanner if one were available.
Rekeying, another alternative, is a labor-intensive chore. Though the costs of rekeying can be minimized by sending the work off-shore to nations with lower costs or standards of living, the costs of reproducing an entire run are enormous. While it might be every e-newspaper vendor's dream to make issues available retrospectively, the demand for retrospective issues would never be immediate enough to pay the bills. Not surprisingly, the backfiles of electronic newspapers maintained by vendors of e-newspapers are limited; none extends back before the date on which the vendor began making current newspapers available electronically.
Map digitization projects such as those at the University of Florida, employing very-high resolution cameras, have demonstrated the ability to capture great detail from oversized source documents. Digital camera-backs such as those manufactured by PhaseOne are capable of well exceeding minimum resolution guidelines promulgated by Cornell University. Yet, at resolution sufficient to meet these guidelines, the exposure time would average approximately 30 minutes per page.
Newspaper on microfilm is problematic for a number of reasons. The de facto "standard" for production of film intermediaries for oversized source documents calls for 105 mm film rather than the library-"standard" 35 mm film on which newspapers are currently microfilmed. Formulas for digitization of images on film, compared against scanner manufacturers' literature and claims, show that no microfilm scanner currently available, whether it scans from contact or from projection, can adequately scan newspapers from 35 mm microfilm.
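The arithmetic behind this claim can be illustrated as follows. The 21x reduction ratio is reported elsewhere in this report; the 17-inch broadsheet width and the 4,000-pixel scanner frame are illustrative assumptions, not Project figures.

```python
# Illustrative resolution arithmetic for newspapers on 35 mm microfilm.
# Assumed values: a 17-inch broadsheet page width and a scanner that
# resolves 4,000 pixels across the film frame.
original_width_in = 17          # assumed broadsheet page width
reduction = 21                  # 21x reduction onto 35 mm film (from the report)

# Pixels needed across the page to meet a 400 dpi guideline at original size:
pixels_needed = 400 * original_width_in
print(pixels_needed)            # 6800

# Equivalent film-plane resolution the scanner must resolve:
film_dpi_needed = 400 * reduction
print(film_dpi_needed)          # 8400 dpi at the film plane

# A scanner delivering 4,000 pixels across the frame yields only:
effective_dpi = 4000 / original_width_in
print(round(effective_dpi))     # 235 dpi at original size
```

Under these assumptions, a guideline-level scan would demand more than 8,000 dpi at the film plane, which explains why no contact or projection microfilm scanner of the day could adequately scan a broadsheet from 35 mm film.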
Regardless, phase one of the Caribbean Newspaper Imaging Project (CNIP1) demonstrated that readable newspaper images could be captured from film and displayed on computer monitors. The delivery of oversized images and the use of scrolling compensated for a scanner's inability to meet the resolution guidelines promulgated by Cornell University and commonly employed by library digitization projects. Today, though navigation of newspaper images that scroll vertically and horizontally beyond the average monitor's limits is still problematic, increasingly popular high-compression image formats (e.g., SID) make delivery of these large images via the Internet viable.
Need. Phase Two: OCR as a Means of Index Construction.
Though CNIP1 demonstrated the ability to deliver readable newspaper images economically, it reported a costly, labor-intensive indexing effort. At four-fifths of the total image delivery cost, indexing also under-represented the content of newspaper issues. While CNIP1 indexed only three articles per issue (three more than had been indexed previously), three articles fell far short of the expectations of researchers using the CNIP product. CNIP1 made obvious the need to explore more cost-effective and more representative means of indexing.
If the cost of selective indexing by human readers was high, the cost of constructing a comprehensive index through rekeying was out of the question. CNIP planners turned to optical character recognition (OCR) as a possible means of index construction. Phase two of the Caribbean Newspaper Imaging Project (CNIP2) would compare the utility of indices created through OCR with that of indices created by human readers. Additionally, CNIP2 would assess various off-the-shelf OCR products, their application to the several languages of the Caribbean Newspaper Collection at the University of Florida, and the extent to which "dirty" text could be cleaned cost effectively.
Targeted titles included Diario de la Marina, Le Nouvelliste and the Trinidad Guardian. Published in one of the three predominant Caribbean languages and extensive in holdings, each targeted newspaper would afford analysis of OCR application across a variety of language and printing variables. Because the titles were microfilmed over time to changing standards, comparison of OCR accuracy across images generated from these microfilms would also quantify the probability of successful OCR.
The Diario and Nouvelliste had been digitized in CNIP1. For this project, select page images were rescanned to test additional digital methodologies. Select page images of the Trinidad Guardian were digitized and indexed for the first time for purposes of this project.
The Trinidad Guardian was selected from among the University of Florida's English-language newspaper microfilm holdings for its documentation of the colonial British West Indies and of the various independence and republican movements of the English-speaking Caribbean nations. Trinidad and Tobago, persuaded by the rhetoric of Dr. Eric Williams, compelled the Caribbean toward a Caribbean identity and nationhood.
For each of three newspaper titles, target issues were selected as follows:
- For any given test group, 400 page images were selected in order to maintain statistical validity consistent with ±5% accuracy.
- For any given sub-sample, 200 page images were selected in order to maintain statistical validity consistent with ±10% accuracy.
- To afford comparison across titles, issues were selected from comparable dates, e.g., the first issue of every fourth month.
Quantity | Selection
1,200 | Diario de la Marina images, selected from the CNIP1 project
1,200 | Le Nouvelliste images, selected from the CNIP1 project
1,200 | Trinidad Guardian images, newly converted from newspaper microfilm
600 | Quarter-page scans, new images: 200 each of the three targeted newspaper titles
4,200 | Total images OCR processed
The target represents two categories of images:
- 3,600 whole-page 400 dpi scans, and
- 600 quarter-page 400 dpi scans.
Targeted page images were selected to represent date and language groups evenly within the bounds specified below. Whole- and quarter-page images were made of the same pages. All images were scans of projected pages using the same Minolta MS1000 and MS3000 microfilm scanners used in CNIP1. A 400 dpi whole-page newspaper image generated using Minolta projection scanning equipment is the equivalent of an image generated at 50% reduction relative to the original size of the source newspapers. A 400 dpi quarter-page image generated using this equipment afforded an image which, though partial, approximated the resolution recommended by Cornell University.
LANGUAGE
33% English: Trinidad Guardian (Port-of-Spain, Trinidad)
33% French (Française): Le Nouvelliste (Port-au-Prince, Haiti)
33% Spanish (Español): Diario de la Marina (Habana, Cuba)
Fonts by Language
FONT NAME | ENG | FRE | ESP
Times New Roman & other serif typefaces | 95% | 95% | 95%
Arial, Helvetica & other sans-serif typefaces | <5% | <5% | <5%
Engravers, Rockwell & other misc. typefaces | <1% | <1% | <1%
Fonts by Size (calculated for source newspaper)
FONT NAME | Smallest "e"
Times New Roman & other serif typefaces | 1.0 mm, 1.0 mm
Arial, Helvetica & other sans-serif typefaces | 1.0 mm, 3.0 mm
Engravers, Rockwell & other misc. typefaces | 3.0 mm, 3.0 mm
OCR Accuracy (Summary Findings)
CHARACTERIZATION OF TEXT | ACCURACY
Article text (serif text at 1.0 mm) | 33%
Article titles (sans-serif text at 3.0 mm) | 65%
Surnames & place names in article text (serif text at 1.0 mm) | 27%
Surnames & place names in article titles (sans-serif text at 3.0 mm) | 58%
Images were processed using each of four major off-the-shelf software packages: TextBridge (v.9), OmniPage Pro (v.9), TypeReader (v.5), and Adobe Capture/Exchange. Because of its cost, the Prime Recognition software used by the University of Michigan and JSTOR was not tested in this phase. Adobe Capture is the software engine used by some electronic distributors (e.g., NewsExpress) of current newspaper issues.
OCR software is optimized for measures of digital resolution (dpi) associated with the linear CCD arrays found in commonly available scanner hardware. Cornell states that OCR of images with dpi not consistent with these measures may not be as accurate as OCR of images consistent with the capacity of these arrays. Evaluation of the resulting text files found no meaningful statistical variation from one OCR package to another within either of the two categories: whole- and partial-page images. Comparing the results of the two categories, however, accuracy was greater, regardless of the OCR package used, for whole pages than for quarter pages -- a finding contrary to the Cornell guidelines. The digital resolution of quarter-page scans using Minolta microfilm projection scanners should have approximated the dpi suggested by Cornell for the source newspaper.
While bigger is better in setting digital resolution measured as dots per inch (dpi), currently manufactured microfilm scanners are not capable of meeting an adequate dpi per the Cornell formulas. Metering projected newspapers into segments for optimal capture was a creative solution; but, even had this test produced the anticipated results, the cost of the human intervention required in the workflow would likely have been prohibitive.
Research at Cornell University suggests that scanning at increased bit-depth may enhance the legibility of fine detail from the source document. It should be noted, however, that legibility here is relative to the human eye's ability, rather than to OCR's ability, to read a given document. While Adobe Capture and TypeReader are optimized only for bitonal image conversion, OmniPage Pro and TextBridge process both grayscale (8-bit) and color (24-bit) images. Regardless, the suggestion has little utility when scanning from high-contrast microfilm rather than from the newspapers themselves. While microfilm is high contrast, microfilmed images do capture tone between black and white. The Minolta equipment available to this project, however, was capable of bitonal capture only.
The adaptive use of a Microtek 9600 XL transparency scanner failed predictably, as interpolated dpi was unable to resolve newspaper print at 21:1 reduction with sufficient clarity. Using the scanner's interpolation software, 8,400 dpi resolution was theoretically required for a moderately good (Quality Index 5.5) scan under the Cornell formulas. An 8,400 dpi scan from film with 21:1 reduction should have been the equivalent of a 400 dpi scan from the newspaper itself. Interpolation was unable to compensate for the limitation of the CCD's native 600 dpi resolution.
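The resolution arithmetic above can be sketched in a few lines, assuming the commonly cited Cornell bitonal formula, QI = (dpi × e-height-in-inches) / 3, and the rule that film must be scanned at the paper-equivalent dpi multiplied by the filming reduction ratio. The function names here are illustrative, not from the project's software.

```python
# Sketch of the Quality Index (QI) and reduction-ratio arithmetic (assumed
# Cornell bitonal formula; roughly, QI 8 is excellent, ~5 moderately good).
MM_PER_INCH = 25.4

def quality_index(dpi: float, e_height_mm: float) -> float:
    """QI = (dpi * height of smallest lowercase 'e' in inches) / 3."""
    return (dpi * (e_height_mm / MM_PER_INCH)) / 3

def film_dpi_required(paper_dpi: float, reduction_ratio: float) -> float:
    """Scanning from film, the film dpi must equal the desired paper-equivalent
    dpi multiplied by the filming reduction ratio (e.g., 21:1)."""
    return paper_dpi * reduction_ratio

# A 400 dpi scan of a newspaper whose smallest "e" is 1.0 mm yields a QI of
# about 5.2, consistent with the "moderately good (Quality Index 5.5)" above.
print(round(quality_index(400, 1.0), 1))
# At 21:1 reduction, a 400 dpi paper-equivalent scan requires 8,400 dpi
# from the film itself -- the figure the Microtek could not deliver.
print(film_dpi_required(400, 21))
```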
With the failure of the Microtek, CNIP2 used a sample of 15 grayscale (8-bit) newspaper images procured from a vendor of microfilm conversion services using Sunrise high-speed microfilm scanners. The images were 200 dpi, the equivalent of those produced by the Minolta microfilm projection scanners. They were produced, however, to the current library "standard", with good image quality and lighting balance. The source newspaper, though North American, used typefaces and font sizes comparable to those of the CNIP newspapers. Though the sample was small and statistically inadequate, the results were worth noting. OCR of the grayscale images was less than 10% accurate. OCR of bitonal images of the same pages was 82% accurate.
While grayscale images were easier for the human eye to read than their bitonal duplicates, increased bit-depth was a disadvantage to those OCR packages capable of reading it.
OCR is software. One method of programming that software may be more or less effective than other methods in its approach to given image characteristics, including "noise", typeface, and language. It is reasonable to suggest that individual software packages are more or less reliable than others. Further, all of the OCR packages studied by CNIP2 are off-the-shelf programs written largely for English-language business and personal applications working with modern typefaces. While each is enabled with multilingual dictionaries, none of those dictionaries are equal. Evaluation of the text files representing the whole-page sub-sample found no meaningful statistical variation from one OCR package to another for any language tested: English, French, or Spanish. Relative to their dates of publication and a subjective assessment of image quality, no one language was converted any more accurately than another. Microfilm image quality, particularly lighting issues (e.g., contrast and light balance), was more likely to affect the accuracy of OCR than was any particular OCR package.
To assess their spell-check routines and to differentiate among otherwise equal OCR packages, a secondary human pass was made against a sub-sample of 300 text files generated by each OCR package. Human native-language readers, with the aid of Microsoft Word running the appropriate language dictionary, assessed the closeness of spelling mismatches, counting the number of incorrect letters in a word. Each of the OCR packages tested has the ability to "learn" from corrected errors. The OCR package which most often and most closely approximated the correct spelling of words might have an edge in increasing the accuracy of the resulting text file. This was a tedious chore at best, and it was complicated by the effects of poor microfilm image quality.
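The "closeness" measure described above -- counting the number of incorrect letters between an OCR output word and its correction -- can be formalized in several ways; edit distance is one reasonable sketch, shown here under that assumption (the function name is illustrative).

```python
# Count the letter-level errors between an OCR word and its correction,
# here formalized as Levenshtein edit distance (insertions, deletions,
# and substitutions each count as one incorrect letter).
def letter_errors(ocr_word: str, correct_word: str) -> int:
    m, n = len(ocr_word), len(correct_word)
    prev = list(range(n + 1))          # distances for the empty OCR prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ocr_word[i - 1] == correct_word[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution / match
        prev = curr
    return prev[n]

# A typical microfilm OCR confusion: capital "I" misread for lowercase "l",
# as in "Diario de Ia Marina" -- one incorrect letter.
print(letter_errors("Ia", "la"))
```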
While each OCR package converted areas with good image quality more accurately, within these areas their performance varied. Disabling the spell-check routines, in order to assess character recognition alone, produced "anecdotal" evidence. OCR packages with larger dictionaries, it appeared, were able to correct more text. However, it also appeared that OCR packages with smaller dictionaries (e.g., Adobe Capture) had better noise-reduction, line-formation, and other filters; they did not require larger dictionaries. Ultimately, the sub-set of images with good quality among the sub-sample was so small, and so unevenly distributed across languages, that the data was not meaningful.
Regardless of the particular language, OCR accuracy at the word-unit level, not surprisingly, was greater the shorter the word-unit. Unfortunately, shorter words -- articles (a, the, le, la, les, los, etc.), prepositions (for, from, in, to, à, de, dans, etc.), and pronouns (he, she, il, elle, etc.) -- are usually regarded as stop-words. Such words have virtually no meaning in an index created from "dirty" text. The words least often corrected, particularly among smaller fonts, were surnames and place names not commonly found in dictionaries. In a sub-sample of 400 items, these names were correctly converted to text at rates below the accuracy of text overall. Only 27% of such names in 1.0 mm serif fonts were accurately converted. Names, usually place names, found in the dictionaries were more frequently corrected than names not in the dictionaries. Unfortunately, these words are among those most commonly searched by researchers.
The condition and characteristics of the source newspaper set bounds on the quality of the film image. Microfilm is a non-additive technology; the film image is never better than the source newspaper. Printing technology, print defects, paper color and aging effects, type faces, and font sizes, among others, are all factors in image quality.
CNIP2 made assumptions about the characteristics of the target newspapers printed at different times. It established four date groups for purposes of analysis.
Date Group | Date Span | Titles in the Group
Early Modern | 1890-1920 | Le Nouvelliste, Trinidad Guardian
Modern | 1920-1950 | Le Nouvelliste, Trinidad Guardian
Late Modern | 1950-1970 | Diario de la Marina, Le Nouvelliste, Trinidad Guardian
Contemporary | 1970-1997 | Trinidad Guardian
(The Diario de la Marina is available only between 1947 and 1960. Le Nouvelliste is available only before 1960.)
The Early Modern period was characterized, in part, by moveable type and typefaces worn by repeated use. The Modern period was characterized by set type, as was the Late Modern period; the distinction between these two periods is somewhat artificial. The latter saw the increased use of sans-serif and stylized typefaces, albeit primarily in article titles. The Contemporary period saw the introduction of electronic typesetting and other automation, albeit largely within the last decade. Somewhat arbitrary as well, the Contemporary period serves as a control group, for which filming methods and techniques are known. Because copyright restrictions limited reproduction of newspapers in this group, the group was small and solely represented by the Trinidad Guardian, for which the University of Florida had negotiated copyright permissions.
(See also, this discussion as regards Filming Methods & Techniques, below.)
CNIP2 found very little deviation in typefaces or font sizes from one period to the next, and little more than what might be characterized as a standard deviation from one OCR package to the next. Article titles became easier to read, and were more accurately converted by OCR to text, with the introduction of sans-serif article titles. But, because of their size relative to article text, article titles are the more frequently accurate of the two, regardless of their age, typeface, or the OCR package used. Worn type, while more common in the Early Modern period, was in evidence only occasionally, and its detrimental effect on OCR was predictable. Because article titles and text follow standard formats and sizes, OCR accuracy does not necessarily decrease with the age of the newspaper issue. Again, microfilm image quality was a more accurate predictor of anticipated OCR accuracy than were age and the artifacts of printing processes.
Filming Methods & Technologies
Factors in microfilming and film characteristics are fundamental to optimal image capture and subsequent OCR accuracy. In newspaper microfilming, there have been four eras, each defined by a set of standards or the lack thereof:
Era | Date Span | Defining Practice & Titles in the Group
Pre-Modern | pre-1977 | Microfilming defined by "best practices" (Titles: Diario de la Marina, Le Nouvelliste, Trinidad Guardian)
Modern | 1977-1986 | Microfilming defined by the Library of Congress/ANSI standard (Titles: Trinidad Guardian)
Post-Modern | 1987-present | Microfilming defined by Research Libraries Group guidelines and a revision of the Library of Congress/ANSI standard (Titles: Trinidad Guardian)
Contemporary | -- | Microfilming defined by the so-called "OCR-optimized" standard, i.e., RLG guidelines modified for allowable 1% skew, fixed reduction, and "one-up" filming (Titles: Trinidad Guardian)
Microfilming in the Pre-Modern era was characterized by a set of best practices shared among microfilm technicians. Insofar as imaging practices were documented, they were found in recommendations from Eastman Kodak and in the MRD/MRE microfilm camera instruction pamphlets. Film processing, primarily during the early part of this era and outside the big cities, relied on locally mixed chemicals and the "shake-and-bake" method of fixing and washing exposed films still used today in home darkrooms. Microfilms produced by the University of Florida during this period -- from the early 1950s through the mid-1960s in particular, when a technician with an MRE microfilm camera was dispatched across the Caribbean on Rockefeller Foundation funding for the Farmington Plan -- were subject to environmental conditions, imbalanced lighting, and extended delays between exposure and processing.
Microfilming in the Modern era was marked by a concerted effort, centered at the Library of Congress, to standardize practice for newspaper microfilming. In Florida, the era was still without a standard and was characterized, also, by the use of acetate-base films that deteriorated for lack of cold, dry storage. Deteriorated films were replaced by copying one from another, sometimes in the nick of time. Image quality suffered threefold: (1) the inherently detrimental effects of acetate-base aging, (2) deterioration effects associated with climate, and (3) degradation effects of analog-to-analog copying.
Microfilming in the Post-Modern era was distinguished by a more complete set of standards, optimized for image quality and microfilm longevity. In Florida, it was marked by the first use of more durable polyester-base films and the adoption of standards for filming, film processing, and film storage. And the Contemporary era finds the University of Florida's on-going Caribbean newspaper microfilming in lock-step with the preservation standards as revised for digitization.
CNIP2 drew primarily from the Pre-Modern era of microfilming history. Copyright restrictions necessitated that the CNIP project be drawn from the public domain. An exception was made for the Trinidad Guardian: the University of Florida negotiated permissions with the newspaper's parent company, Trinidad Publishing Co. Ltd., as part of the University's Dr. Eric Eustace Williams project. Trinidad Guardian microfilms were examined through 1981, the year of Dr. Williams' death in office. This small group of Post-Modern and early Contemporary issues served as a control group.
As stated earlier, microfilm image quality was determined to be the most accurate predictor of anticipated OCR accuracy. Regardless of standards, film image quality is conditioned by focus and depth of field, reduction level, exposure levels and light dispersion, and the density of the imaged film. Microfilm is a high-contrast technology optimized for the capture of text, but unsuitable exposure or uneven lighting, in particular among these conditions, can erode the legibility of text.
In general, a microfilm's background density (i.e., density in areas without text, in the unprinted areas between letters) appeared to have no effect. Variations of background density within standard were not recorded in the electronic image. As microfilm images were captured, their white and black points balanced, and the results saved as bitonal images, much of this area became uniformly white while text became uniformly black. Quality Index assessment of the inner area of lower-case letter "e"s was within the tolerance of analog-to-analog reproduction for microfilms with good image quality. In a microfilm image of good quality, contrast between text and paper should accurately reflect the condition of the original newspaper; the density of text and the density of areas without text should each be relatively uniform. OCR, predictably, was less accurate for microfilms with moderately good and poor image quality, but further assessment of these conditions is a discussion of lighting at the time of microfilming. CNIP2 found two conditions most frequently resulted in poor OCR accuracy: depth of field and light balance.
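The bitonal capture step described above can be illustrated with a minimal sketch (not the project's actual software): once white and black points are balanced, each grayscale pixel is forced to pure white or pure black by a single threshold, which is why uneven lighting turns shadowed background into black "noise".

```python
# Illustrative bitonal conversion: map 8-bit grayscale values to
# 0 (black text) or 255 (white background) with a single threshold.
def to_bitonal(gray_pixels, threshold=128):
    return [0 if p < threshold else 255 for p in gray_pixels]

# Evenly lit background stays well above the threshold and is recorded
# as uniform white, while inked text is recorded as uniform black.
evenly_lit = [250, 245, 30, 240, 25]
print(to_bitonal(evenly_lit))
# Uneven lighting (e.g., gutter shadow) pushes background values below
# the threshold; background is then recorded as black noise that
# distorts letter shapes and degrades OCR.
shadowed = [120, 110, 30, 240, 25]
print(to_bitonal(shadowed))
```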
Nearly all microfilm cameras resolve a depth of field up to three and, in many cases, six inches. Text in the gutter margin can be microfilmed legibly, albeit frequently with shadow from the up-swelling of pages from the binding. When microfilmed pages with shadow are captured electronically as bitonal images, shadow is often recorded as noise, distorting the shapes of letters and reducing the accuracy of OCR. In these areas, the accuracy of OCR fell to less than 5%. Microfilming practices, inasmuch as possible, should be changed to require disbinding and flattening to facilitate future digitization. For volumes that cannot be disbound, microfilming stations should be equipped with additional near-overhead lighting, transforming them into the stations one might find in use by publication-quality photo-reproduction services. A drawback of this recommendation, however, is the need to increase the camera operator's skill set at a time when finding and training microfilm camera operators and supervisors is increasingly difficult. While light meters integrated with the camera station should ensure that an appropriate amount of light reaches the source newspaper, balancing an additional two sets of lights would be more problematic than balancing the sets currently in use. An ideal production workflow requiring microfilm for preservation and an electronic version for access would afford successive or simultaneous analog and digital imaging, such as that currently made possible by the Zeutschel 300/301 hybrid microfilm camera.
As has been suggested, the most common image quality issue detrimentally affecting OCR accuracy is light balance. Most microfilming stations are equipped with two sets of lights, one situated on each side of the camera head and source newspaper. Ideally, the lights are directed at areas opposite their position; if the beams of light can be envisioned as straight lines, they would all cross below the camera head, approximately equidistant between the lens and the source newspaper. Current RLG microfilming guidelines require that an even-illumination target the size of the source document be microfilmed at the start of each document and that this target be evaluated for light balance. Newspapers selected for CNIP2, however, predate this requirement, and no studies have been published to independently assess either compliance with the requirement or light balance in the target area of microfilms created since the requirement was established. Again, drawing on a small, statistically inadequate sample of newspapers reportedly microfilmed to the current library "standard", CNIP2 derived text that was 82% accurate.
In any case, while Caribbean Newspaper Collection microfilms are legible -- light imbalance is frequently noticeable but does not prevent reading -- the electronic text in raster images (e.g., TIFF files) and the text files resulting from their OCR were degraded. Lighting imbalances on the source microfilm produced a spot-light effect of uneven, sometimes starkly contrasting areas on the electronic images. Images were subjectively classed by the size of the spot-light into poor, moderate, and good balance. And, within images, areas were subjectively classed into regions of poor, moderate, and good digital background density. Regions of the raster images with poor digital background density were predominantly illegible; in these areas, OCR was wholly inaccurate. Regions with moderate digital background density were comparable to those produced by the up-swelling of pages from the binding; in these regions, near the outer corners of the page image, the accuracy of OCR was less than 5%. Regions with good digital background density were legible, though lights often appear to have been directed toward the center of the microfilm frame; in these regions, the accuracy of OCR was 38.5%. OCR of the subset of Trinidad Guardian microfilms, representing compliance with the Library of Congress/ANSI standard and evidencing more control of light balance, produced much higher OCR accuracy: approximately 79% -- a value close to the more anecdotal 82% accuracy reported from the small test of newspapers microfilmed to the current library "standard".
The accuracy of OCR on the retrospective newspaper collections targeted by the Caribbean Newspaper Imaging Project was disappointingly low. Overall, 33% of article text was accurately converted without any human intervention. The ills of past microfilming practice and the poor image quality of the target films are largely responsible for this poor rate. Anecdotal evidence drawn from contemporary microfilms created to the current library "standard" suggests that higher accuracy results from improved microfilming practice.
Human indexing as employed in CNIP1 covered merely three articles per issue. Relative to the number of articles published on average in each issue, the percentage of indexed articles is also low.
Title | Issue Size | Avg. Articles per Issue | Percent Indexed
Diario de la Marina | 1 section: 12 pages | 72 | 4.2%
Diario de la Marina | 3 sections: 36-48 pages | 200 | 1.5%
Le Nouvelliste | 1 section: 4 pages | 50 | 6%
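The coverage percentages above follow directly from three indexed articles per issue divided by the average article count; a quick check of the arithmetic:

```python
# Percentage of an issue's articles represented in a selective index.
def coverage(indexed: int, avg_articles: int) -> float:
    return round(100 * indexed / avg_articles, 1)

print(coverage(3, 72))    # Diario de la Marina, 1 section
print(coverage(3, 200))   # Diario de la Marina, 3 sections
print(coverage(3, 50))    # Le Nouvelliste, 1 section
```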
CNIP2 postulated that keyword searching of the "dirty" text resulting from OCR could provide greater access to newspaper content, and at less cost, than that provided through human indexing. The comparison may be apples and oranges. Tests using a sub-sample of articles with both human and machine indices provided no meaningful comparisons: searching against a word-base constructed from the dirty text of OCR requires different strategies from those used to search against an analytical index constructed by human indexers. Nonetheless, 33% accurate text appears to afford broader, if not more meaningful, access to the published newspaper content than did CNIP1's human indexing.
CNIP's networked data entry systems will eventually support both human and "machine" indexing. Currently, CNIP is attempting to build automated systems to remove nearly all human intervention from the process of generating dirty text from the extant image files. It is anticipated that this software will eventually remove unrecognized words lacking capitalized initial letters as well as stop words (articles, prepositions, and pronouns) in English, French, and Spanish. Adding "dirty" text as a search resource should immediately provide the layer of access needed to support additional newspapers and quickly build the content needed to make CNIP economically viable. With the time it buys, we will be able to build the more analytical index entries produced by CNIP1. OCR becomes another tool for indexing but does not, at this time, remove the human component.
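The planned filtering step can be sketched as follows. This is an illustration only: the stop-word list and the recognized-word lexicon here are tiny stand-ins for the English, French, and Spanish dictionaries the software would actually use.

```python
# Sketch of dirty-text filtering: drop stop words, and drop unrecognized
# words lacking a capitalized initial letter, keeping likely index terms.
STOP_WORDS = {"a", "the", "le", "la", "les", "los",
              "for", "from", "in", "to", "de", "dans",
              "he", "she", "il", "elle"}
LEXICON = {"independence", "movement"}   # stand-in for a full dictionary

def index_terms(dirty_text: str):
    terms = []
    for word in dirty_text.split():
        token = word.strip(".,;:!?").lower()
        if token in STOP_WORDS:
            continue                      # stop words carry no index meaning
        if word[:1].isupper() or token in LEXICON:
            terms.append(word.strip(".,;:!?"))
        # otherwise: unrecognized lowercase word, likely OCR noise -- dropped
    return terms

print(index_terms("The independence movement in Trinidad led by Williams"))
```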
Currently, the CNIP product is migrating from CD-ROMs to the Internet as the base for delivery of images. The new search resource will be integrated during this migration; as it is, we will be able to further test the viability of this new resource. CNIP2 still leaves many questions unanswered. There is still no good, cost-effective means of providing the researcher with full text or of connecting story lines broken by column and page breaks.