Embeddable Spreadsheet -- Requirements Definition
Steven J Zeil:
1 Overview
The spreadsheet is a now familiar model of calculation. Spreadsheets are typically envisioned as a rectangular arrangement of cells, each cell containing a mathematical expression. These expressions frequently contain references to other cells, and it is a defining characteristic of the spreadsheet model that the resulting computational order dependencies are resolved automatically rather than by explicit programming (e.g., as in a general programming language like C++ or Java).
Commonly, the GUI for a spreadsheet is often seen as integral to this model. But, in fact, there are many applications that could benefit from the ability to incorporate user-supplied calculations. The spreadsheet model suggests a means of doing so without requiring the embedding of an entire scripting language.
The embeddable spreadsheet is envisioned as a library that can be incorporated into applications to add a spreadsheet-like capability.
2 Background
A spreadsheet is generally envisioned as a rectangular arrangement of cells, each cell being either empty or containing a mathematical expression, possibly one referencing other cells. These expressions may range from simple constants to arbitrarily complex algebraic expressions. Built-in sets of functions provide common calculation such as square roots as well as more specialized calculations including simple conditionals and sums and other reducing operators over rectangular sub-regions of the spreadsheet.
In this rectangular view, cells are named by column (typically an alphabetic string) and row (numeric), for example, A1
or ABC23
.
The most common spreadsheets go beyond this simple 2D rectangular model. Most add a third dimension, the “sheet”, also identified by alphabetic strings, written before the column and row, with a separating character (‘!’) used to indicate the end of the first string and the start of the column string, e.g., Data!A1
or Summary!BB170
. In the absence of a sheet designation, the appropriate sheet is determined by context.
Commonly, this third dimension is treated differently in both the GUI and in calculations (one can, for example, use functions that operate over multiple rows or over multiple columns, but not over multiple sheets).
There is, however, no fundamental reason in the computation model why this third dimension should be treated differently from the other two. Nor, for that matter, is there any to limit the arrangments of cells to 3 dimensions, nor to preclude the possibility of one-dimensional naming (e.g., ‘X’, ‘Y’, etc.)
Setting aside the GUI, we could describe an entire spreadsheet as an unordered sequence of assignments, e.g.,
\[ \begin{align*} B4 & = sum(B1:B3) \\ A1 & = 23 \\ A2 & = 27 \\ A3 & = A2+4 \\ B1:B3 & = sqrt(2*a1) \\ \end{align*} \]
The final line in the above example is understood as a shorthand for
\[ \begin{align*} B1 & = sqrt(2*a1) \\ B2 & = sqrt(2*a2) \\ B3 & = sqrt(2*a3) \\ \end{align*} \]
In conventional spreadsheet terms, it suggests that we have entered the given assignment regarding B1
in terms of a1
, then copied and pasted that formula into B2
and B3
.
The embeddable spreadsheet should provide facilities for accepting computation descriptions in this format, via an API, or via a conventional GUI interface.
3 Requirements
3.1 Model
3.1.1 Cells and Cell Names
The system should model spreadsheet-style calculations of dimension $2\ldots N$, for some $N > 12$.
Cell names shall express a location in this N-dimensional space as follows:
- The dimensions appear in the cell name from right to left.
- The first dimension (rightmost in the cellname) is indexed by non-negative integers.
- These may optionally be omitted from a cell name, in which case a value of 0 is assumes. For example,
X
andA
are valid cell names, synonymous withX0
andA0
.
- These may optionally be omitted from a cell name, in which case a value of 0 is assumes. For example,
- The second dimension is indexed by an alphabetic string. For example,
A1
,BB23
,X
, andCOUNTER1
are all valid cell names. These strings are case-insensitive:AB12
andab12
denote the same cell.- For ordering purposes shorter strings in this dimension precede longer strings. Two strings of the same lengths are ordered by conventional alphabetic (lexicographic) ordering, treating all letters as upper case.
- All higher dimensions are indexed by alphanumeric strings and are separated from subsequent components within the name by a
!
character, e.g.,Summary!B12
,Data!Raw0!Count12
,Data!Selected!Count
, or0!00!1!A12
.- For ordering purposes, these dimensions are ordered by conventional lexicographic ordering rules, treating any alphabetic characters as upper-case.
- In a spreadsheet of dimension N, a cell name with fewer than $N$ dimensions is considered to be padded with enough leading "" strings to supply the correct dimensionality. For example, if the spreadsheet has dimension 4, then the cell name
A
is shorthand for!!A0
The dimension of a spreadsheet is the largest explicitly named dimension of any non-empty cell within it.
3.1.2 Cell References
Closely related to the idea of a cell name is a cell reference - the invocation of a cell name within a formula. Cell references differ from cell names in the following respects:
- Any dimension of the cell name may be preceded with
$
(e.g.,A$1
,Data!$X$3
,$Data!c0
. That dimension is said to be frozen, affecting its behavior when a formula is copied to a cell in a different locations. - Missing specifiers for dimensions $3\ldots$ are filled in “by context” from the corresponding dimensions of the cell containing that formula. For example, the sequence
Data!Raw!B1 = A1; Summary!B1 = A1;
is interpreted as shorthand for
Data!Raw!B1 = Data!Raw!A1; !Summary!B1 = !Summary!A1;
3.1.3 Areas
Both cell names and cell references can be generalized into area names and area references describing a rectangular (prismatic) region by replacing a sequence of one or more dimension values with a pair of such sequences, separated by ‘;’, denoting opposite corners of the desired region. . For example, A1:A20
describes part of a column, as does A1:20
. A three-dimensional region might be specified as Adams!A1:Zeil!B120
.
3.1.4 Data Types
Cells shall be capable for storing values holding numeric values and strings, and should be easily extendable to other data types as well. Internally, spreadsheets will also need to deal with values representing areas, although storing an area within a cell is not possible.
3.2 Input
The system shall accept binding of formulas to cells by API or by input from a file in the format illustrated in the various examples above. This format presents a sequence of assignments of the form
cell-or-area-name =
expression
where each expression is terminated by the end of line. The syntax for expressions is conventional algebraic expressions combined with functions written as a function name followed by a parenthesized, comma-separated parameter list.
- A useful consequence of this simple input format is that one can extend a spreadsheet by simply appending new assignments onto the end (or concatenating two files containing sets of assignments).
The spreadsheet should support, at a minimum, the usual arithmetic and relational operations on numbers, concatenation and relation operators on strings, and functions if
, sqrt
, min
, max
, count
, sum
, average
and stddev
. (Note that all but the first two operate on area references.)
Optional inputs would be formats designed for compatibility with other programs, such as CSV (Comma-Separated Value) formats.
3.3 Output
The system will permit queries of cell values by API and output in at least two reports. One report, the assignment report, would be the same as the main input format, and should in fact serve as a means to save and later re-load altered spreadsheet.
Another would be an evaluation report that would list all non-empty cells and give their values (instead of the formulas used to compute those values), in the same format as the assignment report but listing only constants in the place of the expressions.
3.4 Interface
The system should be accessible through a Java API that would allow construction, loading and saving, and modification of spreadsheets and the querying of values in the evaluated spreadsheet.
It should also be possible to launch a command line application to load, evaluate, and report on a spreadsheet.
Optional, but highly desirable would be a GUI to aid in the construction of spreadsheets, even if the primary subsequent use of those spreadsheets would subsequently be evaluation via the API or via command line operation.
3.5 Additional Requirements
The system should run on both Windows and Linux systems.
It should be capable of handling at least ten thousand non-empty cells, and should not be significantly slowed if the arrangement of non-empty cells within the multi-dimension space is exceedingly sparse.