After exploring OCR a bit more and ruminating on potential projects for the Arab world, I found two ideas particularly intriguing. The first would mirror the digital project Linguistic Landscapes of Beirut, but in the Emirates. Because the Emirates is so multinational, linguistic data, drawn from either ambient recordings of conversations (which may be illegal) or available texts (signs, newspapers, etc.), could shed light on the nature of the Emirates’ multinationalism. For instance, the much older nations of France and the United States debate whether they have become assimilationist, with all immigrants conforming to the national ideal; “melting pots,” where immigrants assimilate to some extent and also influence their new society; or “salad bowls” of distinct cultures. The persistence of native languages in daily conversation or leisure reading (newspapers) would reflect the identities of the foreigners in the Emirates. Such a project may be particularly appealing because it can be crowd-sourced, both in the raw data (photos) and in the analysis (determining the language in each photo). Depending on the results, or possibly even independent of them, the largest problem would likely be legal, as the Emirates and other Gulf nations tend not to welcome other people analyzing their identity or the identities of people within their countries (no offense to them for this decision, of course). Another challenge would be the language used for publicity and for the data-submission interface. Indeed, using English or Arabic alone for the interface may prevent those living mostly in their native language and culture from discovering the project and participating, thereby biasing the results.
An alternative idea would be to create a corpus of little-known or hard-to-access Arabic literature. In countries outside the Arab world, the US for example, foreign-language texts are rarely readily available to begin with, and Arabic novels are particularly difficult to find. Consequently, a project to digitize Arabic texts, as TypeWright is doing for early English literature, could facilitate the acquisition, analysis and ultimately translation of Arabic texts for other scholars and eventually, inshallah, for a non-Arabic-speaking public.
Working with OCR to digitize texts on Thursday, I experienced the applications of OCR for the first time outside of the typical Chinese learner’s experience with Pleco. The challenges in document digitization are similar to those in recognition for dictionary purposes, but more pronounced due to the scale of the project. For example, the lack of spaces between characters, even though more than one character may form a word, is often cited as a language-specific challenge for OCR in Chinese. In Pleco, however, this issue rarely prevents successful use, because the user can manually adjust the combination of characters until it makes sense. For document digitization, by contrast, the OCR guesses the combination for you and prepares the information as a document, and you must then correct any errors after the fact. Additionally, the ease with which OCR identifies Chinese characters in Pleco contrasts sharply with the recognition of Arabic in ABBYY FineReader. This contrast testifies to both the utility and the progress of machine learning. However, given the struggles in recognizing handwriting, one must wonder whether ancient, more stylized Chinese texts pose problems similar to those of Arabic. Since OCR cannot yet read even English handwriting, it will be interesting to see which language acquires this capability first. Indeed, if and when it is accomplished, handwriting recognition could facilitate building a corpus of authors’ notes, about their literature or their personal lives, and presenting this added contextual information alongside the novels already available.
As the field of digital humanities continues to grow, it becomes harder and harder to describe: a concept that was vague to begin with becomes increasingly broad. In a review of digital projects, our class covered mapping projects, including language mapping and animations of ancient sites laid out over a timeline, as well as the digitization of pamphlets from a former French colony. Even this variety of projects doesn’t begin to represent the field of digital humanities, in which one can also include heritage gaming museums. Compared to traditional “ivory tower” academia, the benefits are also numerous, as digital humanities provides ample opportunity for collaboration, public peer review, transparency and wider dissemination. Beyond all these factors, since I began studying digital humanities just a few weeks ago, my largest takeaway has been the accessibility of the digital. Coming from a humanities background, I have always seen the digital realm as something populated by math geniuses, an idea reinforced by the math course requirements for Computer Science majors in university. I’m encouraged now to see not only that numerous free tools and plug-ins are available to streamline the process, but also that universities like mine offer ample resources to help the student body at large gain greater digital literacy.