In the midst of endless report-writing, I was faced with an interesting challenge at work this week. We are trying to aggregate e-book usage data for the members of our consortium, and we were interested in figuring out how well the French-language content is faring compared to the English titles that make up the bulk of the collection.
Unfortunately, one of our vendors does not include language data in either its title lists or its usage reports. Before trying to reconcile the usage reports with the full e-book metadata I could get from the MARC records, I ran the title list through the guess_language library by way of a simple Python script:
```python
from guess_language import guess_language
import csv

with open('2015-01_ProQuest_titles.csv', 'rb') as csvfile:
    PQreader = csv.DictReader(csvfile)
    for row in PQreader:
        title = row['Title']
        language = guess_language(title.decode('utf-8'))
        print language, title
```
The results were a disaster:
```
pt How to Dotcom : A Step by Step Guide to E-Commerce
en My Numbers, My Friends : Popular Lectures on Number Theory
en Foundations of Differential Calculus
en Language and the Internet
en Hollywood & Anti-Semitism : A Cultural History, 1880-1941
de Agape, Eros, Gender : Towards a Pauline Sexual Ethic
la International Law in Antiquity
fr Delinquent-Prone Communities
en Modernist Writing & Reactionary Politics
```
guess_language works by identifying trigrams: three-character sequences whose frequencies differ from one language to another. While it works reasonably well on whole sentences and short text snippets, the particular construction of a book title seems to throw the method entirely off-kilter.
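To make the trigram idea concrete, here is a toy sketch of how such a detector scores text. The tiny hand-written "profiles" and the `guess` helper are my own illustrations, not guess_language's actual internals, which train on much larger corpora:

```python
from collections import Counter

def trigrams(text):
    """Extract overlapping three-character sequences from lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Toy profiles built from a single sentence each; a real detector
# would build these frequency tables from large training corpora.
profiles = {
    'en': Counter(trigrams("the quick brown fox jumps over the lazy dog and then")),
    'fr': Counter(trigrams("le renard brun saute par-dessus le chien paresseux et puis")),
}

def guess(text):
    """Score the input against each profile by summing shared trigram counts."""
    grams = trigrams(text)
    scores = {lang: sum(profile[g] for g in grams)
              for lang, profile in profiles.items()}
    return max(scores, key=scores.get)
```

With profiles this small you can already see the failure mode: a short title contributes only a handful of trigrams, so a few coincidental matches with the wrong profile are enough to tip the score.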
As I was pondering the next steps, I came to realize that I could also filter titles based on language directly in the vendor database and then export to a CSV file… which solved my issue in seconds but wasn’t half as fun as playing around with computational linguistics. Back to writing reports, I guess.