Improving Accessibility of
Archived Raster Dictionaries of
Complex Script Languages

Sawood Alam
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529

Fateh ud din B Mehmood
National University of Sciences and Technology
Islamabad, Pakistan

Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529

The Time Travel

OK Google, Define Dictionary

a book or electronic resource that lists the words of a language (typically in alphabetical order) and gives their meaning, or gives the equivalent words in a different language, often also providing information about pronunciation, origin, and usage.

Dictionaries Are Different

  • Read: random access
  • Write: maintain sort order
  • The most compact mode to preserve a language

Problem: English Dictionary

Johnson's English dictionary

Problem: Urdu Dictionary

Farhang-e-Asifiyah

Related Work

Unicode Collation

  • Ordered assembly of written information
  • Unicode values != natural collation
  • Arabic script: U+0600 to U+06FF
  • Out of order alphabets in derived languages
  • Common Locale Data Repository (CLDR)
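
The gap between code point order and alphabet order can be seen with five Urdu letters. The sketch below is illustrative only: `URDU_ORDER` is a hand-written fragment of the alphabet and `urdu_key` is a hypothetical helper, not part of CLDR or of Dictionary Explorer. A CLDR-backed collator would normally supply this ordering.

```python
# Five Urdu letters, written as escapes so the code points are visible.
BEH, PEH, TEH, TTEH, THEH = "\u0628", "\u067E", "\u062A", "\u0679", "\u062B"

# Assumed fragment of the Urdu alphabet in dictionary order: be, pe, te, tte, se.
URDU_ORDER = [BEH, PEH, TEH, TTEH, THEH]
RANK = {letter: i for i, letter in enumerate(URDU_ORDER)}

def urdu_key(word):
    """Rank each letter by alphabet position; unknown characters sort last by code point."""
    return [(RANK.get(ch, len(RANK)), ch) for ch in word]

letters = [THEH, TTEH, TEH, PEH, BEH]
by_code_point = sorted(letters)               # raw Unicode value order
by_alphabet = sorted(letters, key=urdu_key)   # order a reader of Urdu expects

print([f"U+{ord(ch):04X}" for ch in by_code_point])
print([f"U+{ord(ch):04X}" for ch in by_alphabet])
```

Peh (U+067E) and tteh (U+0679) were added to the Arabic block after the original Arabic letters, so a plain code point sort pushes them to the end.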

Collation Discrepancies

  • Compound letters
  • Diacritical marks
  • Half letters
  • Prefixes
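
Diacritical marks are one discrepancy that is easy to demonstrate. The following is a minimal sketch, assuming marks should be ignored at the first comparison level and used only to break ties. `strip_marks` and `primary_key` are hypothetical helpers, and the remaining letters are still compared by code point here.

```python
import unicodedata

def strip_marks(word):
    """Drop combining marks (for example zabar, zer, pesh) without other normalization."""
    return "".join(ch for ch in word if unicodedata.combining(ch) == 0)

def primary_key(word):
    """Compare base letters first; use the marked spelling only as a tie-breaker."""
    return (strip_marks(word), word)

plain = "\u0628\u0646"          # beh + noon
marked = "\u0628\u064E\u0646"   # beh + fatha (zabar) + noon
later = "\u0628\u0648"          # beh + waw

words = [later, marked, plain]
naive = sorted(words)                     # the mark's code point decides the order
ordered = sorted(words, key=primary_key)  # marked spelling stays next to the plain one
```

In the naive sort the marked spelling lands after `later`, because U+064E is greater than U+0648.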

Nested Ordering

  • Root word sorting (Arabic)
    • Morphological derivation
    • Derived word simplification
  • Radicals and strokes (Chinese)
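
Root word sorting can be sketched as a two-level sort key. The entries below are transliterated and paired with their roots by hand for illustration. Deriving the root automatically would need morphological analysis, which this sketch does not attempt.

```python
# Hypothetical transliterated entries: (headword, root).
ENTRIES = [
    ("maktab", "ktb"),
    ("dars", "drs"),
    ("kitab", "ktb"),
    ("madrasa", "drs"),
    ("katib", "ktb"),
]

def nested_key(entry):
    """Sort by root first, then by headword within the root."""
    word, root = entry
    return (root, word)

nested = [word for word, _ in sorted(ENTRIES, key=nested_key)]
flat = sorted(word for word, _ in ENTRIES)

print(nested)  # words grouped under their roots
print(flat)    # plain alphabetical order
```

Under nested ordering "maktab" is filed with the other k-t-b words rather than under "m", so a reader must know the root before looking up the word.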

Indexing: Ordered Pages

Indexing: Sparse Index
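
A minimal sketch of a sparse index lookup, assuming the index records only the first headword of each scanned page. The word list, `FIRST_PAGE`, and `find_page` are hypothetical. Plain string comparison stands in for the language's collation key.

```python
from bisect import bisect_right

# Hypothetical sparse index: first headword on each page, in page order.
FIRST_WORDS = ["aardvark", "abbey", "abide", "able", "abound"]
FIRST_PAGE = 1  # page number of the first indexed page

def find_page(word):
    """Return the page whose headword range should contain `word`."""
    pos = bisect_right(FIRST_WORDS, word)
    # Words sorting before the first headword fall back to the first page.
    return FIRST_PAGE + max(pos - 1, 0)

print(find_page("abbot"))
print(find_page("abide"))
```

The reader is taken to the right page and scans it visually, so the index needs only one entry per page rather than one per word.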

Indexing: Full Index

Indexing: Location Index

Indexing State Transition

Annotation

Digitization

Dictionary Explorer

  • Multilingual Multi-dictionary Lookup
  • Searching and Exploring
  • Annotation and Digitization
  • User Contribution and Feedback
  • Open Source => GitHub:/urduweb/DictionaryExplorer

Dictionary Explorer: English

Dictionary Explorer: Urdu

Indexing Time

Dictionary                 Pages   Index    Mode                Time
English to Urdu              180   Sparse   Manual and Script   10 minutes
Monolingual Urdu           2,500   Sparse   Manual              2 hours
Monolingual Classic Urdu   3,200   Full*    Crowdsource**       60 days

  • * 75,000 words, phrases, proverbs, and idioms
  • ** 13 contributors

Prefix Permutations

Prefix: One

Prefix: Two

Prefix: Three

Prefix: Four

Prefix: Five

Prefix: Six
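
One way to see why short prefixes return too many matches is to count the headwords that share the first n characters of a query. This is a sketch under assumptions: the word list is hypothetical, `count_matches` is an illustrative helper, and plain string order stands in for proper collation.

```python
from bisect import bisect_left

# Hypothetical headword list, kept sorted so prefixes map to contiguous ranges.
WORDS = sorted(["dice", "dictate", "diction", "dictionary",
                "dictum", "did", "die", "diet"])

def count_matches(prefix):
    """Count headwords starting with `prefix` using two binary searches."""
    lo = bisect_left(WORDS, prefix)
    # "\uffff" sorts after any character used in the list, closing the range.
    hi = bisect_left(WORDS, prefix + "\uffff")
    return hi - lo

query = "dictionary"
counts = [count_matches(query[:n]) for n in range(1, 7)]
print(counts)  # matches for prefix lengths one to six
```

Each extra character can only keep or shrink the candidate set, so the counts never increase with prefix length.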

Conclusions and Future Work

  • Identified issues
    • Too many matches
    • Lack of fielded searching
    • Lack of OCR support
    • No input method assistance
  • Collation challenges
  • Accessibility levels: Ordered Pages, Sparse, Full, and Location indexes, annotation, and digitization
  • Implemented a multilingual multi-dictionary explorer
  • Effort and prefix evaluation
  • In future: elastic index and automatic region estimation
  • GitHub:/urduweb/DictionaryExplorer

Sawood Alam

@ibnesayeed