The following attributes will be expressed using a uniform set of integer measurement units: page@width, page@height, the bounding box attributes @l, @r, @t, @b (wherever occurring), @fs (font size, wherever occurring), para@li, and para@ri.
New required attribute page@meter=“n” where n is the number of measurement units contained in 1 meter.
Question: Why is this specified for each page instead of for the entire document?
Answer: It allows us to easily splice together pages from different sources (e.g., if we get a doc where some pages, possibly containing the sf298 form, are textPDF while the remaining pages came from OCR).
Example: Omnipage output uses measurement units of 1/300 in, so a page from Omnipage OCR would be marked as <page meter=“11811” …>
A common calculation will be converting from measurement units to pts when describing fonts, which would be computed as @fs * 2835 / page@meter. Here @fs is in measurement units, 2835 is the (rounded) number of pts in a meter, and the new attribute page@meter is in measurement units per meter.
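The two calculations above (deriving page@meter from a source resolution, and converting @fs to pts) can be sketched in Python as follows. The function and constant names are hypothetical and chosen only for illustration; the sketch assumes page@meter is obtained by rounding to the nearest integer.

```python
# Illustrative sketch only: these names are not part of the format.
INCHES_PER_METER = 1 / 0.0254  # 39.3700787... inches in one meter
POINTS_PER_METER = 2835        # rounded value used in the formula above


def meter_for_units_per_inch(units_per_inch: float) -> int:
    """Return a page@meter value for a source with the given units per inch.

    Assumption: the value is rounded to the nearest integer.
    """
    return round(units_per_inch * INCHES_PER_METER)


def units_to_points(value: float, meter: int) -> float:
    """Convert a length in measurement units to pts: value * 2835 / page@meter."""
    return value * POINTS_PER_METER / meter


omnipage_meter = meter_for_units_per_inch(300)  # Omnipage: 1/300 in units
print(omnipage_meter)                            # 11811
print(round(units_to_points(50, omnipage_meter), 1))  # 12.0
```

In this sketch, an @fs of 50 on an Omnipage page comes out at roughly 12 pts, which is what one would expect for 50/300 of an inch.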
New optional attribute page@source=“string”, where the string names the primary processor used to produce that page from PDF. Likely values would be “Omnipage”, “Abbyy”, “textPDF”.
The bounding box attributes @l, @r, @t, @b should be required on wd, not optional.
Rename the region attributes @left, @top, @right, @bottom as @l, @t, @r, @b for consistency with other elements.
Add new element <phrase> that can be used in the same context as <wd> with the exact same attributes. The difference between them is semantic:
A “wd” denotes a unit of text that is bounded on each side by whitespace or by the edge of the bounding box of its container.
A “phrase” may or may not be so separated. A “wd” should normally not contain internal whitespace. A “phrase” may.
Allow regions to contain other regions (as well as paragraphs, tables, etc).
Add an optional attribute @base=“yvalue” to wd, phrase, and line. When present, it indicates the y-value of the baseline on which the text was written.
Certain attributes are considered inherited. If the attribute is legal in some element E and in one or more of its descendants, the attribute may be supplied a value in E. Any descendants lacking an explicit value for that attribute are treated as if they have the value given in E.
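A reader of the format could resolve an inherited attribute by walking up from an element to its nearest ancestor that supplies a value. The following is a minimal sketch using Python's standard xml.etree.ElementTree. The helper name effective_attr and the sample document are hypothetical; the sample assumes @fs is legal on para and omits the bounding box attributes for brevity, and the sketch does not check whether a given attribute is actually one of the inherited ones.

```python
import xml.etree.ElementTree as ET


def effective_attr(elem, name, parents):
    """Return elem's explicit value for `name`, else the nearest ancestor's value."""
    node = elem
    while node is not None:
        if name in node.attrib:
            return node.attrib[name]
        node = parents.get(node)  # None once we step past the root
    return None


# Hypothetical sample: bounding boxes omitted, @fs on para is an assumption.
doc = ET.fromstring(
    '<page meter="11811"><para fs="50"><line>'
    '<wd fs="42">Title</wd><wd>text</wd>'
    '</line></para></page>'
)

# ElementTree has no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in doc.iter() for child in parent}

words = doc.findall(".//wd")
print([effective_attr(w, "fs", parents) for w in words])  # ['42', '50']
```

The first wd keeps its explicit value, while the second falls back to the value given on the enclosing para.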
The use of <vert-white-space> elements is deprecated as these are easily derived from a comparison of the @t and @b values of adjacent components of the region.
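As an illustration of that derivation, the sketch below computes the vertical gap between each pair of adjacent children of a region. It assumes that y-values increase down the page (so @t is less than @b) and that the children are already in top-to-bottom order; neither assumption is stated in this document. The function name vertical_gaps and the sample coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET


def vertical_gaps(region):
    """Return the gap between each adjacent pair of children of `region`.

    Assumption: y grows downward, so the gap is next@t minus previous@b.
    """
    children = list(region)
    return [
        int(nxt.get("t")) - int(prev.get("b"))
        for prev, nxt in zip(children, children[1:])
    ]


# Hypothetical sample region with two paragraphs.
region = ET.fromstring(
    '<region l="300" t="300" r="2250" b="900">'
    '<para l="300" t="300" r="2250" b="520"/>'
    '<para l="300" t="580" r="2250" b="900"/>'
    '</region>'
)
print(vertical_gaps(region))  # [60]
```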
Elements within a region/para/line/table are sorted geometrically in a fashion intended to approximate the “natural” reading order. Note that nested regions can be used to “force” an ordering over complicated columns and rows.