Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages
Sawood Alam
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Fateh ud din B Mehmood
National University of Sciences and Technology
Islamabad, Pakistan
Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
The Time Travel
OK Google, Define Dictionary
a book or electronic resource that lists the words of a language (typically in alphabetical order) and gives their meaning, or gives the equivalent words in a different language, often also providing information about pronunciation, origin, and usage.
Dictionaries Are Different
- Read: random access
- Write: maintain sort order
- The most compact mode to preserve a language
Related Work
Unicode Collation
- Ordered assembly of written information
- Unicode code point order != natural collation order
- Arabic script: U+0600 to U+06FF
- Alphabets of derived languages are out of order within the block
- Common Locale Data Repository (CLDR)
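The gap between code point order and alphabetical order can be seen with a few Urdu letters. The sketch below is only an illustration: `URDU_ORDER` is a hand-written fragment of the Urdu alphabet (the first six letters), and `urdu_key` is a hypothetical helper, not part of CLDR or of Dictionary Explorer. A production system would more likely rely on CLDR collation rules.

```python
# First six letters of the Urdu alphabet, in alphabetical order:
# alef, beh, peh, teh, tteh, theh
URDU_ORDER = ["\u0627", "\u0628", "\u067E", "\u062A", "\u0679", "\u062B"]

# Rank of each letter in the alphabet (hypothetical lookup table).
RANK = {ch: i for i, ch in enumerate(URDU_ORDER)}


def urdu_key(word):
    """Sort key that follows alphabet rank instead of code point.

    Characters outside the table are pushed after all known letters.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


# Plain sorting compares code points, so peh (U+067E) and tteh (U+0679)
# land after theh (U+062B) even though they come earlier in the alphabet.
by_code_point = sorted(URDU_ORDER)

# Sorting with the custom key restores alphabetical order.
by_alphabet = sorted(by_code_point, key=urdu_key)

print([f"U+{ord(ch):04X}" for ch in by_code_point])
print([f"U+{ord(ch):04X}" for ch in by_alphabet])
```

Peh and tteh were added to the Arabic block for derived languages, which is why their code points sit far from their alphabetical neighbours.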
Collation Discrepancies
- Compound letters
- Diacritical marks
- Half letters
- Prefixes
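One common way to handle the diacritical marks case is to ignore combining marks when building the sort or search key. The sketch below assumes that approach; `strip_marks` is a hypothetical helper name, and whether a given dictionary ignores marks at the first level of comparison is itself an assumption.

```python
import unicodedata


def strip_marks(word):
    """Drop nonspacing combining marks (Unicode category Mn).

    Arabic-script vowel signs such as fatha (U+064E) are in this
    category. No normalization is applied, so precomposed letters
    are left untouched.
    """
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


bare = "\u0628\u0646"          # beh + noon
marked = "\u0628\u064E\u0646"  # beh + fatha + noon

# Both spellings reduce to the same key, so they collate together.
print(strip_marks(marked) == bare)
```

Compound letters, half letters, and prefixes need language-specific rules and are not covered by this filter.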
Nested Ordering
- Root word sorting (Arabic)
- Morphological derivation
- Derived word simplification
- Radicals and strokes (Chinese)
Indexing: Ordered Pages
Indexing: Sparse Index
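A sparse index can be pictured as a short sorted list that maps the first headword of a few pages to their page numbers; a lookup then narrows a query to a range of page images. The sketch below is written under that assumption. The entries, the page count, and the name `page_range` are all made up for illustration, and plain string comparison stands in for a proper collation key.

```python
from bisect import bisect_right

# Hypothetical sparse index: (first headword on the page, page number).
SPARSE_INDEX = [("a", 1), ("cat", 40), ("house", 90), ("pear", 140)]
LAST_PAGE = 180  # assumed total number of page images

HEADWORDS = [word for word, _ in SPARSE_INDEX]


def page_range(word):
    """Return the (first, last) pages that may contain `word`."""
    # Index of the last entry whose headword is <= the query.
    i = max(bisect_right(HEADWORDS, word) - 1, 0)
    first = SPARSE_INDEX[i][1]
    if i + 1 < len(SPARSE_INDEX):
        # Stop just before the next indexed page.
        last = SPARSE_INDEX[i + 1][1] - 1
    else:
        last = LAST_PAGE
    return first, last


print(page_range("dog"))
print(page_range("zebra"))
```

Adding more entries shrinks each range, so the same structure moves gradually toward a full index.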
Indexing: Full Index
Indexing: Location Index
Indexing State Transition
Annotation
Digitization
Dictionary Explorer
- Multilingual Multi-dictionary Lookup
- Searching and Exploring
- Annotation and digitization
- User Contribution and Feedback
- Open Source => GitHub:/urduweb/DictionaryExplorer
Indexing Time
| Dictionary | Pages | Index | Mode | Time |
| --- | --- | --- | --- | --- |
| English to Urdu | 180 | Sparse | Manual and Script | 10 minutes |
| Monolingual Urdu | 2,500 | Sparse | Manual | 2 hours |
| Monolingual Classic Urdu | 3,200 | Full* | Crowdsource** | 60 days |
- * 75,000 words, phrases, proverbs, and idioms
- ** 13 contributors
Prefix Permutations
Prefix: One
Prefix: Two
Prefix: Three
Prefix: Four
Prefix: Five
Prefix: Six
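A prefix evaluation of this kind can be approximated by counting, for each prefix length from one to six, how many distinct prefixes exist and how many indexed words the busiest prefix matches. The sketch below is one possible way to compute such numbers, not the procedure used for the slides; the word list and the name `prefix_stats` are made up.

```python
from collections import Counter


def prefix_stats(words, max_len=6):
    """Map prefix length -> (distinct prefixes, largest match count)."""
    stats = {}
    for n in range(1, max_len + 1):
        # Words shorter than n characters have no prefix of length n.
        buckets = Counter(w[:n] for w in words if len(w) >= n)
        if buckets:
            stats[n] = (len(buckets), max(buckets.values()))
    return stats


# Hypothetical indexed headwords.
words = ["dictate", "diction", "dictionary", "digit", "digital", "index"]

for n, (distinct, worst) in prefix_stats(words).items():
    print(n, distinct, worst)
```

Short prefixes tend to return many matches, which relates to the "too many matches" issue listed in the conclusions.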
Conclusions and Future Work
- Identified issues
  - Too many matches
  - Lack of fielded searching
  - Lack of OCR support
  - No input method assistance
  - Collation challenges
- Accessibility levels: Ordered Pages, Sparse, Full, and Location indexes, annotation, and digitization
- Implemented a multilingual multi-dictionary explorer
- Effort and prefix evaluation
- Future work: elastic index and automatic region estimation
- GitHub:/urduweb/DictionaryExplorer