Creating Electronic Databases From Historical Periodicals

An article by W.H. Earle

[A version of this article appeared in Key Words, the newsletter of the American Society of Indexers, May/June and July/August 1997 (vol. 5, nos. 3 and 4).]

INTRODUCTION

Newspapers and magazines from bygone eras are among the printed materials that will be digitized and made available electronically in the coming decades. Historians, genealogists, and other researchers are already looking forward to being able to search the full texts of, say, Godey's Lady's Book or The National Intelligencer. If mere "full text" is what these electronic products offer, however, searchers are likely to be disappointed. In this arena, full text -- the summum bonum researchers hope for and database creators strive toward -- isn't good enough.

The problem is that the search strings likely to be used by a searcher will routinely fail to retrieve the material the periodical contains. Archaic spellings, styles, and usages will prevent most students and even most scholars from being able to devise appropriate search strings. As the simplest kind of example, imagine a student searching for references to electricity in periodicals of the late 18th and early 19th century. If the student uses "electricity" as a search term, he will retrieve nothing even if the full texts he is searching do indeed refer to electricity -- because the texts call it "galvanism" rather than "electricity." For the student's search to work, therefore, the editor of the database will have to have inserted the term "electricity" into any appropriate articles. In fact, as illustrated below, it will probably be necessary to insert indexing terms that are meaningful to modern searchers into virtually every single article from a historical periodical in order for a database created from it to be truly usable by a broad audience.
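The effect of inserting a modern indexing term can be sketched in a few lines of Python. This is only a minimal illustration, not a description of any real product: the Article structure, its index_terms field, the search function, and the sample sentence are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    text: str
    # Modern terms supplied by the database editor (hypothetical field).
    index_terms: list = field(default_factory=list)


def search(articles, term):
    """Return articles whose text or editor-inserted terms contain the term."""
    needle = term.lower()
    return [
        a for a in articles
        if needle in a.text.lower()
        or any(needle in t.lower() for t in a.index_terms)
    ]


# Invented sample text; the same article with and without an inserted term.
plain = Article("Experiments in galvanism were exhibited at Philadelphia.")
tagged = Article(plain.text, index_terms=["electricity"])

missed = search([plain], "electricity")    # the bare full text is not found
found = search([tagged], "electricity")    # the augmented article is found
```

The full text is left untouched in this sketch; the inserted terms live beside it, so a search can consult both.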

What follows is an outline of some of the problems that the creators of databases from historical periodicals must be prepared to deal with. The illustrations are not exhaustive, but they certainly cover the largest problems the database creator will encounter. They were derived from work on Niles' Register, the newsweekly that was published from 1811 until 1849. The product of that work was a cumulative index to the Register rather than a full-text database, but the problems involved in devising appropriate, meaningful index entries are precisely those that a database creator will encounter in devising artificial search strings to remedy the limitations of a full-text database. For simplicity's sake, the outline is organized into sections on Persons, Places, and Things.

PERSONS

Persons - Changes in Spelling. The spelling of words was much more variable in olden days than it is today, and this freedom extended to personal names. For example, in the earliest years of Niles' Register, James Monroe was generally referred to as James Munroe. Unless the database editor inserts "James Monroe" as an artificial indexing term, references to the individual will not be retrievable by the searcher who doesn't realize that multiple spellings are possible.

Foreign names are particularly prone to this difficulty, especially where a non-Roman alphabet is involved. Niles' Register usually referred to the great native war leader who resisted British incursions in India as "Tippoo" and called the great Egyptian pasha of the early 19th century "Mehemit Ali," but the forms of these names likely to be used by a modern searcher are "Tipu" and "Mohammed Ali." Unless these latter terms are artificially inserted in a full-text database, references to these men will not be retrievable.

Persons - Reference by Title. It was not uncommon in the nineteenth century for newspaper articles to refer to an individual only by a title, without mentioning any name, even a last name. Thus an article about John C. Calhoun might refer to him only as "the Secretary of War." Unless his name is artificially inserted into such an article, the article will not be retrievable by a search for his name. A searcher aware of this problem could attempt to date-limit a search for the term "Secretary of War" to the period when Calhoun was in office, but this is clumsy and not likely to work well. (For example, it will fail to retrieve articles where the title "Secretary of War" appears in the archaic form "Secretary at War," as it does in volume 16 on page 112 (April 3, 1819).) Merely linking "secretary" and "war" together with a Boolean AND would retrieve a fairly complete set of responses, but it would be corrupted with a large number of irrelevant responses, too -- for example, references to actions taken by the "Secretary of the Navy ... during the late war."
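One way an editor might partly automate the insertion of names for title-only references is to keep a table of office holders with their dates of service and to consult it using each article's date of publication. The Python sketch below is illustrative only: the table, the function name, and the service dates shown for Calhoun are assumptions made for the example, and they would need to be verified and greatly extended in real work.

```python
from datetime import date

# Illustrative table of office holders. The service dates are assumptions
# for this sketch and should be checked against an authoritative source.
OFFICE_HOLDERS = [
    # (title variants, holder, first day in office, last day in office)
    (("secretary of war", "secretary at war"), "John C. Calhoun",
     date(1817, 10, 8), date(1825, 3, 4)),
]


def names_for_titles(text, published):
    """Return the names to insert for titles found in an article of a given date."""
    # Lowercase and collapse whitespace so a title split across lines still matches.
    normalized = " ".join(text.lower().split())
    names = []
    for variants, holder, start, end in OFFICE_HOLDERS:
        in_office = start <= published <= end
        if in_office and any(v in normalized for v in variants):
            names.append(holder)
    return names
```

Because the lookup matches whole title phrases, including the archaic "Secretary at War," it avoids the irrelevant hits that a Boolean AND of "secretary" and "war" would produce. The result is best treated as a suggestion for the editor to confirm rather than as a final attribution.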

Changes in titles can also present problems for searchers in a database constructed from historical periodicals. One version of this problem relates to titles of nobility. If the right years of publication are cumulated together, the Duke of Wellington could appear in a text as Arthur Wesley (until 1798), Arthur Wellesley (1798 to 1809), Viscount Wellington (or Lord Wellington, 1809 to 1812), Earl Wellington (1812), Marquess Wellington or Marquis Wellington (1812 to 1814), and Duke of Wellington (1814 and thereafter). (He does in fact appear in the text of Niles' Register as both "Lord Wellington" and "Duke of Wellington.") Unless a predictable search term -- "Duke of Wellington" -- is artificially inserted into all the articles concerning him, many references to the duke would be effectively unretrievable. Furthermore, if the database's chronological reach is long enough, the search term will need to include such modifiers as "first" and his dates of birth and death -- "first Duke of Wellington, 1769-1852" -- to distinguish references to the great duke from references to his successors who bore the same title.

Another version of this problem relates to military officers. If enough years of text are cumulated together in one database, a given Army officer might be referred to as Lt. Smith, Capt. Smith, Maj. Smith, Col. Smith, and Gen. Smith. All the references would be to the same individual, but a definitive full name would need to be inserted into each article to make that fact apparent to a searcher.

Yet another version of this problem relates to the married names of women. One famous American beauty in the first half of the 19th century was at various times referred to as Mary Ann Caton, Mary Ann Caton Patterson, Mary Ann Patterson, Mrs. Robert Patterson, and the Marchioness of Wellesley. To complicate matters further, she sometimes used the name Marianne. Unless some consistent search string denoting her is inserted in a full-text database artificially, references to her would be extremely difficult to retrieve.

Persons - Abbreviations. Abbreviations of personal names in the text of Niles' Register generally took two forms. One form involved shorthand references like "Mr. E." (volume 66, page 415, August 24, 1844). Contemporary readers of the Register would know that this referred to Edward Everett, American ambassador to Great Britain, but a modern searcher would not be able to retrieve the article unless the full name were artificially inserted.

The more common form of abbreviation, however, involved the use of initials in articles containing letters or reports with signatures on them. For example, on page 300 of volume 30 of Niles' Register (June 24, 1826), there is some correspondence involving Henry Clay in which his name appears only as "H. Clay." This was a standard signature for him. Unless "Henry Clay" is inserted into the text of this article as an artificial search term, the article could easily be overlooked in a search for references to him.

Persons - Periphrastic References. "Periphrastic references" are references to individuals by catch phrases that would have been clear to a contemporary reader, but which will thwart a modern-day searcher seeking information about a given individual. They are not uncommon today in opinion columns in newspapers -- a given columnist, for example, may write about "Slick Willie" without ever mentioning the name Bill Clinton -- but they were more common in earlier eras, when journalistic writing was more florid and self-consciously stylish. For example, an article in volume 17 of Niles' Register (page 10, September 4, 1819) mentions the need for a new edition of Notes on Virginia "from the hands of their illustrious author." The "illustrious author," of course, was Thomas Jefferson, but neither the name Jefferson nor the full name Thomas Jefferson appeared in the article. The article cannot be retrieved by a searcher interested in Thomas Jefferson unless that name is artificially inserted as a search tag.

Persons - Ambiguous References. This difficulty is coming to be known as "the Mr. Smith problem." It refers to personal name references which would have been comprehensible to a contemporary reader, but which can only be interpreted with great difficulty nowadays. For example, Niles' Register might quote a speech by "Mr. Smith" in the Senate in a certain year. Contemporary readers would have known that the only Mr. Smith then serving in the Senate was Samuel Smith, the powerful Republican from Maryland. Thus no further identification would be necessary. A modern searcher retrieving that entry, however, would have to consult a dictionary of congressional biography to discover the identity of the "Mr. Smith" referred to. It is a very considerable chore.

Furthermore, the problem gets worse when one begins cumulating multiple volumes of a given historical periodical or of multiple periodicals. In the case of Niles' Register, a reference to "Mr. Smith" in the Senate in one volume might denote Samuel Smith while a reference to Mr. Smith in the House in the same volume might denote Caleb Blood Smith of Indiana. So long as there was only one Mr. Smith in the House and one in the Senate, no further identification would have been necessary, and the text would contain none. Ten volumes later, however, identical references to "Mr. Smith" in the Senate and House might have referred not to Samuel Smith and to Caleb Blood Smith, but to some other senator named Smith and some other representative named Smith. Again, if there was only one "Mr. Smith" in each house at the time, no further identification would have been necessary for a contemporary reader. However, when one begins cumulating multiple years of such references -- whether in an index or in full text -- they quickly become meaningless. Unless the editor has artificially distinguished the references by elaborating all the "Mr. Smith" references into full names (plus dates of birth and death when necessary), tens of thousands of entries will congregate under the ambiguous entry for "Mr. Smith." Such indiscriminate, hodgepodge attributions would of course be utterly useless -- and they will get worse and worse (and more and more useless) as the database is made "more powerful" by the incorporation into it of additional years of a given publication or of multiple publications.

PLACES

Places - Changes in Names. Listed below are some illustrative 19th-century geographic terms as they appeared in Niles' Register and their 20th-century equivalents for the same entities. Some are terms that most searchers would think of in trying to devise a search expression in a historical periodical database, but many are not. In general, therefore, modern equivalents will need to be artificially inserted into a full-text historical periodical database in order to make geographic entries retrievable.

19th Century Term     20th Century Term
Algiers               Algeria
Arabia                Saudi Arabia
Ava                   Burma (or Myanmar)
Banda Oriental        Uruguay
Candia                Crete
Cape Colony           South Africa
Cochin China          Vietnam
Constantinople        Istanbul
Ispahan               Esfahan
Jedo                  Tokyo
Kandy                 Sri Lanka (or Ceylon)
Otaheite              Tahiti
Persia                Iran
Sandwich Islands      Hawaii
Santa Fe              Bogota
Siam                  Thailand
Upper Canada          Ontario
Upper Peru            Bolivia
Yeddo                 Tokyo
York                  Toronto

Places - Changes in Spelling. Even when the name of a given place has not changed, its spelling might have -- and in ways that will thwart a modern-day searcher who is unaware of the variations. Listed below are some 19th-century geographic names as they appeared in Niles' Register with the altered spellings of their 20th-century equivalents.

19th Century Term     20th Century Term
Aapopka               Apopka
Arkansaw              Arkansas
Beyrout               Beirut
Chili                 Chile
Cooloosahatchee       Caloosahatchee
Faxyardo              Fajardo
Milwaukie             Milwaukee
Nangasaki             Nagasaki
Nepaul                Nepal
Ouisconsin            Wisconsin
Oural Mountains       Ural Mountains
Outhlachuchy          Withlacoochee
Pekin                 Peking (or Beijing)
Porto Rico            Puerto Rico
Wiskonsan             Wisconsin
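For spellings that are unambiguous, a list like the one above could be turned into a simple lookup that proposes modern place names for insertion. The following Python sketch assumes a small table copied from that list and a hypothetical function name; it is an illustration of the idea rather than a complete tool.

```python
import re

# Historical spelling -> modern spelling, copied from the list above.
HISTORICAL_TO_MODERN = {
    "Arkansaw": "Arkansas",
    "Beyrout": "Beirut",
    "Ouisconsin": "Wisconsin",
    "Porto Rico": "Puerto Rico",
    "Wiskonsan": "Wisconsin",
}


def modern_place_terms(text):
    """Return the modern place names to insert as index terms for an article."""
    found = set()
    for old, new in HISTORICAL_TO_MODERN.items():
        # \b keeps a historical spelling from matching inside a longer word.
        pattern = r"\b" + re.escape(old) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(new)
    return sorted(found)
```

Entries such as "York" (which also occurs in "New York") and "Chili" are deliberately left out of the sketch, since terms of that kind call for an editor's judgment rather than a mechanical match.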

Places - Abbreviations. Unless geographical abbreviations are artificially expanded by a database editor, many references containing them will be unretrievable. A searcher cannot reasonably be expected to know that he needs to search "I.T." in order to capture references to "Indian Territory." Furthermore, he can hardly be expected to know that "Ia." was a common abbreviation for "Indiana" or that "Ms." sometimes stood for "Massachusetts."

THINGS

Things - Changes in Spelling. As with persons and places, changes in spelling could throw off a searcher in a mere full-text historical periodical database that has not been appropriately augmented with modern spellings. In Niles' Register, the spelling of "molasses" was generally just that, but in an earlier era the word was spelled "melasses." As far as Niles' Register alone is concerned, a searcher interested in "cigars" would need to search that term and "segars" as well. Someone looking for references to the cloth that we call "crepe" would need to search both for that term and for "crape." A searcher interested in the "Comanche" Indians would need to search that term but also "Camanche."

Things - Abbreviations. There is a particular problem relating to displays of data in Niles' Register that certainly affects other publications as well. In listings of commercial transactions, something like the following might appear:

  • Flour ... $5.25 per barrel
  • Rye do. ... $4.50 per barrel

The entry "do." means "ditto," and it means that the quotation applies to rye flour -- but unless the term "rye flour" is artificially inserted into the text, it will not appear, and it will not be retrievable. Of course, the same problem arises when double quote marks are used to mean ditto.
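For listings laid out as simply as the one above, the expansion of "do." could be approximated mechanically. The Python sketch below assumes that each line has the form "item ... price" and that "do." always repeats the most recent item name that contained no ditto; both assumptions, along with the function name, are made only for illustration, and real listings will often need an editor's eye. Ditto marks written as double quotes are not handled.

```python
import re

# Matches the abbreviation "do." standing alone as a word.
DITTO = re.compile(r"\bdo\.")


def expand_dittos(lines):
    """Replace "do." in each item name with the last item that had no ditto."""
    expanded = []
    base = None  # most recent item name that contained no ditto
    for line in lines:
        item, sep, rest = line.partition(" ... ")
        if DITTO.search(item) and base is not None:
            item = DITTO.sub(base.lower(), item)
        else:
            base = item.strip()
        expanded.append(item + sep + rest)
    return expanded


listing = [
    "Flour ... $5.25 per barrel",
    "Rye do. ... $4.50 per barrel",
]
result = expand_dittos(listing)
# result[1] is "Rye flour ... $4.50 per barrel"
```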

Sometimes unfamiliarity with the term being abbreviated makes it difficult to interpret an abbreviation. It is not likely that modern searchers will search for such abbreviations as "crim. con." or "P.E. Church" or "M.E. Church," although they will need to do exactly that unless a database editor has expanded the terms to something a modern searcher will grasp. (For "crim. con." the useful expansion would be not the literal "criminal conversation," an old common-law term, but "adultery." "P.E." and "M.E." would translate into "Protestant Episcopal" and "Methodist Episcopal," archaic terms for the Episcopal Church and the Methodist Church.)

Things - Changes in Usage. We don't always use words in the same way that our forefathers did. For example, early in the period covered by Niles' Register, the word "convention" always meant a meeting concerning a constitution. Thus an article about a constitutional convention need not mention the word "constitution" in any form since readers would understand that that was what a "convention" was for.

Listed below are some common 19th-century terms as they appeared in Niles' Register and their 20th-century equivalents. The list is hardly exhaustive, but it illustrates the difficulty a searcher might have in trying to retrieve relevant items from a mere full-text database.

19th Century Term             20th Century Term
administration party          Democratic Party
amalgamation                  mixed-race sexual relations or marriage
anniversary                   annual meeting
aerostation                   experiments with dirigibles
cars                          train
caoutchouc                    rubber
cracker                       firecracker
defalcation                   embezzlement
duties                        taxes
emigrants                     immigrants
freshet                       flood
friends of Andrew Jackson     Democratic Party
galvanism                     electricity or electromagnetism
gum elastic                   rubber
ice island                    iceberg
incendiary                    arsonist
incendiary materials          abolitionist publications
inundation                    flood
national anniversary          Fourth of July
sulphureous gas               natural gas
passengers                    immigrants
railroad iron                 rails
receipt                       recipe
spermaceti oil                whale oil
torpedo                       naval mine
Van Buren party               Democratic Party

Things - Periphrastic References. As with persons, articles referring to a thing only metaphorically were common in bygone days. Thus an article on the abolition of slavery in the British West Indies (volume 47, page 17, September 13, 1834) might discuss "the vast project, to result in good or evil, [that] has just commenced" without mentioning "slaves," "slavery," "abolition," or "emancipation" anywhere in the article. Such terms would need to be inserted in a full-text database for the article to be appropriately retrievable.

Similarly, here is an example that refers to Centre College in Danville, Kentucky, without mentioning Centre College:

  • She [a recently deceased individual] also left property worth 50,000 dollars to the college at Danville. [volume 29, page 144, October 29, 1825]

Another form of this problem commonly applies to such entities as railroads, canals, and wars in Niles' Register. A modern searcher will generally need a text with a proper name in it to retrieve articles about a given railroad, canal, or war, but articles in Niles' frequently omitted such entries. Thus, in volume 41, page 249 (December 3, 1831), there is an article about the opening of the Baltimore and Ohio Railroad from Baltimore to Frederick, Maryland, that identifies the railroad only as "our rail road." It wasn't necessary to insert a full proper name since the B&O was the only railroad in Maryland at the time. Similarly, the term "the war" was generally used to describe the events of the War of 1812 while these events were going on because it wasn't necessary to tell the reader which war one was talking about (and because the term "War of 1812" hadn't been invented yet). Such references will thwart many retrievals from a historical periodical database unless some appropriate search term -- the proper name of the railroad or canal, the name historians have bestowed on the war in question, etc. -- is inserted by the database editor.

CONCLUSION

The discussion above identifies at least some of the difficulties that an editor will need to confront in order to convert the text of a historical periodical into a searchable electronic database that yields satisfactory search results to the vast majority of searchers. Future editors of this type of work will perhaps be able to identify additional problem areas, and they will certainly be able to add to the examples used as illustrations in this essay. None of these problems is insurmountable, however, and attention to them will produce databases in which users can actually find what they are seeking.