Knuth linebreaking elements for Formatting Objects

Simon Pepping

Revision History
Revision 1.02006-04-01
First publication
Revision 1.12006-04-29
Small updates, together with a new release of the software
Revision 1.32006-05-03
Added box-penalty; added section on the basic building block; changed ‘generalized’ to ‘FO’
Revision 1.42006-05-27
Bells and whistles, and text-align implemented
Revision 1.52006-06-24
Text-align-last, a complex index, documentation of the implementation

Abstract

In this article I develop a new abstract representation of a paragraph of text for the purpose of line breaking. This reprepresentation is a reformulation of the abstract representation of Knuth and Plass, adapted to the features of FO texts.


Table of Contents

Introduction
The basic building block
Building blocks for FO texts
Text alignment
A superset of the original Knuth elements
Examples
A normal paragraph
A paragraph with borders at the start and end of each line
A paragraph consisting of a single non-breaking space
Spaces before and after a border
A complex index
The FO linebreaking algorithm
Changes compared to Knuth and Plass
Implementation
Bells and Whistles
The test class
The print representation of the test result
The complex index in the test class
Acknowledgements
Bibliography

Introduction

The line breaking algorithm of Donald E. Knuth and Michael F. Plass [KP] takes as its input an abstract representation of the text. The building blocks of this abstract representation are box, glue and penalty elements. A word or part of a word is represented by a box, and a white space character or a series of white space characters is represented by a glue element. A glue element is a legal linebreak if it immediately follows a box. Linebreak opportunities other than white space are represented by penalties. Boxes have a width, glue elements have an elastic width, which consists of a width, a stretch and a shrink, and penalties have a penalty value and also a width. In addition to this abstract representation, Knuth and Plass formulated rules to determine the amount of demerits of a line. This amount is based on the penalty value of the break point, on the amount of stretching or shrinking of the line, on the class of the line compared to that of the surrounding lines, and on the number of consecutive hyphenated lines. With this representation and these rules, they were able to formulate a total-fit algorithm which determines very good line breaks for a paragraph of text.

The linebreaking algorithm of Knuth and Plass, together with its abstract representation of the text using its three building blocks, has been applied in the Formatting Object Processor FOP, generally with good results. FOP not only uses this algorithm to determine its linebreaks but also to determine its page breaks. However, a more refined line breaking process must take into account the precise requirements of the XSL-FO specification for whitespace treatment and of the Unicode Standard Annex 14 for linebreaking opportunities. Efforts to achieve this have run into difficulties.

FOP's developers have identified the following problem cases, which they could fit into the Knuth linebreak algorithm only with great difficulty or not at all:

  1. …<fo:inline border="…">some text </fo:inline> more … (note the spaces before and after the end of the inline element)

  2. <fo:block>&nbsp;</fo:block>

  3. table header and footer repetition around line breaks

  4. white-space-treatment policy (preserve/ignore-if-before/surrounding/after-linefeed) and suppress-at-line-break property

  5. letter spacing

  6. The Unicode Annex 14 (UAX#14) specifies that linebreaks come between characters. According to the XSL-FO specification by default white space is suppressed around line breaks. In the Knuth algorithm, linebreaks come at the space, and the space is suppressed. Normally, the two are equivalent, but not always. For example, when white-space-collapse="preserve", the two are not equivalent.

Most of these problem cases can be treated with specially crafted sequences of elements. But the suppression of white space before a line break turns out to be inachievable. Knuth's algorithm suppresses whitespace and penalties after a linebreak but not before. Indeed, this non-suppression of whitespace before a linebreak has been turned into a feature, upon which the procedure to obtain centered and ragged-left texts relies. For the usual suppression of white space around a linebreak, the algorithm relies on the fact that a consecutive series of white space characters can be represented as a single glue element, which is dropped when it is the selected linebreak. This condition is not always satisfied in the texts of a FO processor, see item 1 above.

The basic building block

Every linebreak can be represented by the triple of ‘pre-break text’, ‘post-break text’, ‘nobreak text’. Note that this is exactly TeX's \discretionary control sequence. A paragraph of text can be represented by a series of such triple elements, provided their texts do not contain other linebreak opportunities than the one they represent.

There are a few extreme cases in which the triple elements reduce to simpler elements:

  1. The pre- and post-break texts are empty, and the triple does not represent a linebreak opportunity. This case reduces to a box. Note that the nobreak text is free to contain stretch or shrink.

  2. The nobreak text is empty. This case reduces to a penalty. Note that it may contain both a pre-break and a post-break text.

  3. The pre- and post-break texts are empty, and the triple does represent a linebreak opportunity. This case reduces to a suppressible box which also is a penalty. Let us call it a box-penalty. The glue element of Knuth and Plass is such a box-penalty. Note that a box-penalty contains suppressible items and the linebreak opportunity. In order to determine such box-penalty elements, the process that builds the abstract representation of the text must know the actual suppression policy.

  4. It is convenient to define another element, which is not directly derived from a triple element, viz. a suppressible box. A suppressible box is a box which contains suppressible items. The box may be suppressed if it is adjacent to a chosen linebreak, or has only other suppressible boxes between itself and a chosen linebreak. Whether the box is really suppressed is determined during the linebreaking process. It depends not only on the chosen linebreak, but also on the suppression policy. The building process can determine suppressible items without knowledge of the actual suppression policy, and the same building process can be applied regardless of the actual suppression policy.

Building blocks for FO texts

Knuth and Plass designed their set of building blocks for the types of text they were dealing with: Western text with a single white space character between words. The XSL-FO specification addresses a much wider range of texts, among others non-Western texts. These texts come with new features: non-collapsible sequences of white space characters, suppression of characters before, after or around line breaks. For such texts a modified set of building blocks is required.

Here I propose such a set:

  1. Box, with elastic width. A box has two boolean properties:

    1. suppress-at-linebreak, default value false. According to the FO specification, in the default case in an FO text it is true for the space character U+0020. The user may deviate from the default and set it to false for the space character, and to true for other characters.

    2. is-BP, default value false. This property indicates whether a box corresponds to a border and/or a padding width. It is true for boxes which are generated by padding widths and borders.

  2. Penalty, with a penalty value and two elastic widths. When the penalty element is the chosen linebreak, it contributes the first elastic width before the linebreak and the second elastic width after the linebreak.

  3. Box-penalty, with a penalty value, and three elastic widths. When the box-penalty element is the chosen linebreak, it behaves as a penalty, otherwise it behaves as a box.

Penalties and box-penalties are legal breakpoints. Boxes are not.

An elastic width is a set of (width, stretch, shrink). In FO terms this would be (optimum, maximum, minimum), where stretch = maximum − optimum, shrink = optimum − minimum.

The actual treatment of boxes with suppress-at-linebreak=true is subject to the value of the policy property white-space-treatment. Its possible values are:

ignore

All XML white space characters except linefeeds are suppressed. This should be taken care of in the construction of the abstract representation. The resulting abstract representation does not contain suppressible boxes, and the suppression policy during linebreaking is irrelevant.

preserve

Suppressible boxes are not suppressed before or after linebreaks.

ignore-if-surrounding-linefeed

Suppressible boxes are suppressed both before and after linebreaks.

ignore-if-before-linefeed

Suppressible boxes are suppressed before but not after linebreaks.

ignore-if-after-linefeed

Suppressible boxes are suppressed after but not before linebreaks.

Penalties are treated like boxes with suppress-at-linebreak=true.

In the following text I will denote these building blocks for FO texts as FO building blocks or FO elements.

Text alignment

FO defines the usual four values for text alignment: start, center, end and justify. start and end are also known as ragged right or flush right and ragged left or flush left. These alignments can easily be produced by inserting penalties at every linebreak opportunity. For start we insert a penalty whose width before the linebreak is zero with a non-zero stretch value. For center we insert a penalty whose widths before and after the linebreak are zero with a non-zero stretch value. For end we insert a penalty whose width after the linebreak is zero with a non-zero stretch value. For justify we do nothing special. Some linebreak opportunities may already have a penalty for other reasons, for example a hyphen. Then the stretch value for text alignment has to be added to the widths of that penalty.

The stretch values are in principle arbitrary. Knuth and Plass use a value of 18 machine units = 1 em. For the stretch values in the center case, Knuth and Plass use the same value before and after the linebreak. I prefer to use half that value, so that the total stretch added to a line due to text alignment is the same in the three non-justified cases. This has the consequence that the linebreak calculations in the three cases are identical. Only the stretch is distributed differently.

FO allows the user to specify a different value for the alignment of the last line. If the user does not specify a value, then it is equal to the text alignment value. The case of justified text alignment is an exception. In this case the default value for the alignment of the last line is start, which corresponds to the usual incomplete last line.

A different value for the alignment of the last line cannot be represented in the abstract representation of the paragraph, because the building blocks have not yet been assigned to a line. It is possible to take the alignment of the last line into account in the algorithm, as follows. All linebreak opportunities have a penalty whose width after the linebreak contains a stretch component corresponding to the text alignment value. For the linebreak opportunity of the last line but one, the value of this stretch component needs to be changed to that corresponding to the value for the alignment of the last line. I store the difference between these two values in the final penalty as the last-line-additional-before-stretch member.

When in the main loop the current linebreak opportunity is the last linebreak opportunity, I use this value. For each active node that is considered in the inner loop, I increase the line length with this additional stretch value. This provides the linebreak calculation with the desired stretch for the last line. After the main loop, when the best node for the paragraph has been determined, I modify the penalty of its previous node, and add the additional stretch value to its width after the linebreak. In this way, the following typesetting phase sees an abstract representation that has all the correct stretch values.

There is one feature that the above trick cannot get right. In the justified case it is a good idea to use stretchable boxes for white space characters. In the non-justified case it is better to use fixed width boxes for them, because all the stretch is placed at the start and/or end of the line. The stretch of the white space character boxes in the last line corresponds to the setting of the text alignment of the body of the paragraph and not to that of the last line. As a consequence, this method will usually not find a linebreaking solution for non-justified paragraphs with a justified last line. In this case the last line will often not have any stretch, and unless it happens to fit exactly, it will be infinitely bad. In my opinion such combinations are unrealistic anyway.

It is possible to extend the trick so that also the stretch in the last line agrees with the text alignment for that line. To achieve that one assigns an alignment dependent expression to the stretch of white space characters. The linebreaking calculation must know the values of the text alignment of the body of the paragraph and of that for the last line. During the linebreak calculations the alignment dependent expression is evaluated according to whether the linebreak opportunity is the last one of the paragraph or not. I will not explore this complicated possibility.

When the parameters of the building blocks are made line-dependent, one should take care that the dynamic programming “principle of optimality” remains valid. This is the case here because we modify the parameters only in the last line.

A superset of the original Knuth elements

The FO building blocks presented above form a superset of the original building blocks of Knuth and Plass. That means that any set of Knuth elements can be mapped to a set of our FO elements, with the same widths, elastic or not, and the same legal linebreak points and penalty values.

The mapping is as follows. We map each Knuth box to an FO box with the same width. The box has no stretch and shrink. We map each glue to an FO box with the same elastic width. The box's property suppress-at-linebreak is true. If the glue is a legal linebreak, that is, if it immediately follows a box, we insert a penalty before the corresponding FO box. The penalty has zero value and widths. Each Knuth penalty is mapped onto an FO penalty with the same penalty value. The width before the break is equal the original width. The width after the break is zero. Because the Knuth approach only suppresses glues and penalties after a linebreak, we use the whitespace treatment policy ignore-if-after-linefeed.

It is obvious that the above mapping produces the same legal linebreaks: Each glue that is a legal linebreak and each penalty are mapped onto an FO penalty, and no other FO penalties are generated. Each glue that is a legal linebreak is also mapped onto a suppressible FO box following an FO penalty. Therefore, if that FO penalty is a chosen linebreak, the following FO box is suppressed, which is equivalent to the dropping of the glue when it is a chosen linebreak.

After a linebreak a consecutive series of glues and penalties are dropped, up to the first Knuth box. Such a series is mapped onto a series of suppressible FO boxes and penalties. The glues do not generate FO penalties in the mapping, because none of them follows a box, and therefore none of them is a legal linebreak. Because our whitespace treatment policy is ignore-if-after-linefeed, all such suppressible FO boxes and penalties are suppressed, up to the first non-suppressible FO box, corresponding to the first Knuth box. This shows that also the suppression of elements around linebreaks is equivalent in the two cases.

This proves that any sequence of building blocks of Knuth and Plass is equivalent to a sequence of FO building blocks, and that the FO building blocks form a superset of the original building blocks. Our FO approach can therefore deal with any situation which can be dealt with using the approach of Knuth and Plass.

Before closing this section, I want to point out a different mapping between the two sets of building blocks. Its validity is limited to the case of a paragraph consisting of words and single white space characters. The words may be hyphenated. The alignment policy for the paragraph is justified. This is the most common case of western text. In this case the abstract representation of Knuth and Plass consists of a series of boxes interrupted by a penalty for a hyphen or a glue for a word space.

We map the boxes to FO boxes and the penalties to FO penalties, as before. However, now we map the glues (which are all legal linebreaks) to a suppressible FO box followed by an FO penalty. And our whitespace treatment policy is ignore-if-surrounding-linefeed. When a hyphenation penalty is a chosen linebreak, both cases are identical. When a glue is a chosen linebreak in the Knuth case, the corresponding FO penalty is the chosen linebreak in the FO case, and the preceding suppressible box is suppressed. This makes the behaviour in the two cases identical.

This shows that for default text, the Knuth case can also be mapped to the default whitespace treatment policy of XSL-FO and to linebreak opportunities following spaces, as prescribed in the Unicode Annex 14. It also shows that a Knuth glue can be thought of as a combination of a suppressible box for the space with a penalty for the linebreak opportunity. Only in special cases, such as centered text alignment, does the approach by Knuth and Plass deviate markedly from the XSL-FO and Unicode defaults.

Examples

A normal paragraph

The list of elements is:

(box, box(suppressible, elastic), penalty(0, 0, 0))+.

Linebreaks occur at some of the penalties. When the whitespace treatment policy is ignore-if-surrounding-linefeed or ignore-if-before-linefeed, the suppressible box before each linebreak (corresponding to a wordspace) is suppressed. Otherwise, when the whitespace treatment policy is preserve or ignore-if-after-linefeed, each line ends in a whitespace.

A paragraph with borders at the start and end of each line

The list of elements is:

(box, box(suppressible, elastic), penalty(0, (x,y,z), (x,y,z)))+.

Here x, y, and z are the width, stretch and shrink of the border. Linebreaks occur at some of the penalties. There is a border at the end and start of each line. When the whitespace treatment policy is ignore-if-surrounding-linefeed or ignore-if-before-linefeed, the suppressible box before each linebreak (corresponding to a wordspace) is suppressed. Otherwise, when the whitespace treatment policy is preserve or ignore-if-after-linefeed, each line ends in a whitespace.

The same approach can be used for borders and padding that are repeated around page breaks, and for table headers and footers that are repeated around page breaks. In the table case, the width of the header and footer may be different for each pagebreak opportunity, due to their interaction with the cell and row borders.

A paragraph consisting of a single non-breaking space

The list of elements is:

box(elastic).

The only linebreak comes at the end of the paragraph. The non-breaking space is not suppressible, and will be the only (non-printing) content of the line.

Spaces before and after a border

The list of elements around the border is:

…, box, box(suppressible,elastic), box(BP), box(suppressible,elastic), penalty(0), ….

There is no linebreak opportunity before the border because it may be considered as a close operator in the sense of UAX#14. There is no linebreak opportunity after the border, because it is forbidden by the following space. If the penalty shown is chosen as a linebreak and the whitespace treatment policy is ignore-if-surrounding-linefeed or ignore-if-before-linefeed, both suppressible boxes shown will be suppressed. The line will end in a word followed by the border.

This is the only example that can really not be solved with the approach of Knuth and Plass. That is so because there is no way to turn the space before the border into a glue. Therefore it must be suppressed before a linebreak, which the approach of Knuth and Plass does not provide.

A complex index

Under this title Knuth and Plass present their showcase, a complex index with precise typesetting requirements. Can the complex index be modelled with the FO building blocks? Above I have proven than any situation which can be modelled with the building blocks of Knuth and Plass, can be modelled with the FO building blocks. Therefore, the answer is yes, the complex index can be modelled by mapping Knuth and Plass' solution according to the above derived mapping recipe. But in this section I will model it by constructing a building block sequence that produces the same results in the breaking and non-breaking cases as the sequence of Knuth and Plass, with the default whitespace treatment policy ignore-if-surrounding-linefeed.

Each index entry represents a separate paragraph for the linebreaking algorithm. It consists of two parts, a names and a references part.

The names part is modelled as follows.

  1. A white space character is represented by a suppressible box with (min opt max) = (667 1000 1000). This corresponds to (width stretch shrink) = (6 0 2), i.e. glue(6,0,2) in the units of Knuth and Plass.

  2. A linebreak opportunity is represented by a penalty with a width before the linebreak equal to (1500 1500 4500). This corresponds to (width stretch shrink) = (9 18 0), i.e. glue(w2,18,0) in the units of Knuth and Plass, with w2 = 9.

When the white space character is the chosen linebreak, its box is suppressed, and the penalty's width-before is inserted. These results agree with those of Knuth and Plass:

  1. When the white space character is not the chosen linebreak, it contributes a glue(6,0,2).

  2. When the white space character is the chosen linebreak, it contributes a glue(w2,18,0) before the linebreak.

The references part is modelled as follows.

  1. A white space character is represented by a suppressible box with (min opt max) = (667 1000 1000). This corresponds to (width stretch shrink) = (6 0 2), i.e. glue(6,0,2) of Knuth and Plass.

  2. A linebreak opportunity is represented by a penalty with a value of 999 and a width after the linebreak equal to (0 0 3000). This corresponds to (width stretch shrink) = (0 18 0), i.e. glue(0,18,0) in the units of Knuth and Plass.

When the white space character is the chosen linebreak, its box is suppressed, and the penalty's width-after is inserted at the start of the next line. These results agree with those of Knuth and Plass:

  1. When the white space character is not the chosen linebreak, it contributes a glue(6,0,2).

  2. When the white space character is the chosen linebreak, it contributes a glue(0,18,0) after the linebreak.

The transition part is modelled as follows.

  1. A white space character, which is represented by a suppressible box with (min opt max) = (667 1000 1000). This corresponds to (width stretch shrink) = (6 0 2), i.e. glue(6,0,2) of Knuth and Plass.

  2. A linebreak opportunity, which is represented by a penalty with a width before the linebreak equal to(min opt max) = (1500 1500 4500). This corresponds to (width stretch shrink) = (9 18 0), i.e. glue(w2,18,0) in the units of Knuth and Plass, with w2 = 9.

  3. A leader box, which is represented by a box with (min opt max) = (0 3600 large-number). This corresponds to a leader box with (width stretch shrink) = (3600 large-number 3600), i.e. leaders(3w3, large-number, 3w3) in the units of Knuth and Plass, with w3 = 1200.

  4. A linebreak opportunity, which is represented by a penalty with a width before the linebreak equal to (min opt max) = (7500 7500 7500) and a width after the linebreak equal to (min opt max) = (0 0 3000). The width before the linebreak corresponds to (width stretch shrink) = (45 0 0), i.e. glue(w1,0,0) in the units of Knuth and Plass, with w1 = 45. The width after the linebreak corresponds to (width stretch shrink) = (0 18 0), i.e. glue(0,18,0) in the units of Knuth and Plass.

When none of the linebreak opportunities is the chosen linebreak, the penalties are dropped, and the result is as follows:

  1. The white space character contributes a glue(6,0,2).

  2. The leader box contributes leaders(3w3, large-number, 3w3).

This is also the result of Knuth and Plass.

When the first linebreak opportunity is the chosen linebreak, the white space character is suppressed, and the penalty's width-before is inserted, and the result is as follows:

  1. The penalty contributes a glue(w2,18,0) before the linebreak.

  2. The leader box contributes leaders(3w3, large-number, 3w3).

This is also the result of Knuth and Plass.

When the second linebreak opportunity is the chosen linebreak, the result is as follows:

  1. The white space character contributes a glue(6,0,2).

  2. The leader box contributes leaders(3w3, large-number, 3w3).

  3. The penalty contributes a glue(w1,0,0) before the linebreak.

  4. The penalty contributes a glue(0,18,0) after the linebreak.

This is also the result of Knuth and Plass.

The conclusion is that the complex index can easily be modelled in terms of the FO building blocks with the default white space treatment policy. The resulting sequences are simpler than those of Knuth and Plass. This is the result of the property that a penalty may contain a width after the linebreak.

The FO linebreaking algorithm

Changes compared to Knuth and Plass

The linebreaking algorithm of Knuth and Plass works for their elements: boxes, glues and penalties. The FO building blocks have a few properties which require changes to the algorithm.

  1. Elements are not only suppressible after but also before a linebreak. As a consequence the last element that contributes to the line no longer coincides with the linebreak element. The linebreak nodes must track this.

  2. Penalties not only contribute width before but also after the linebreak. As a consequence there is a contribution to the linewidth before the first element of the line. The linebreak nodes must track this.

  3. The sequence of elements that are suppressed at the start or end of a line may contain border and padding boxes, i.e. boxes for which is-BP is true. These boxes are not suppressed. The linebreak nodes must track this.

In Knuth and Plass' case a linebreaking node needs to track a line from its first element to the linebreaking element. In the FO case, a linebreaking node needs to track a line from its first to its last element, it needs to register the contributions of the linebreaking elements of the previous and of this line, and it needs to register the widths of the border and padding boxes before the first and after the last line elements.

Implementation

Together with this essay I publish a simple implementation of the FO linebreaking algorithm. It acts on an abstract representation of a paragraph consisting of FO building blocks. It suppresses elements both before and after a linebreak, according to the white-space-treatment policy set on the paragraph. Its nodes are modified to satisfy the above tracking needs.

The implementation follows the algorithm of Knuth and Plass closely. It is, however, a completely new implementation, programmed in an object oriented style. Knuth and Plass describe their algorithm in great detail. Here I will highlight the main points.

Before I start the description of the algorithm, I will define a few terms.

Linebreak opportunity

A position at which a linebreak may occur. Called legal linebreak by Knuth and Plass.

Feasible linebreak

A linebreak opportunity for which there exists a tolerable paragraph layout up to that point.

Node

A programmatic object that is created for each feasible linebreak. It contains information needed by the algorithm to evaluate further linebreak opportunities, and by the typesetting process to typeset the laid out paragraph.

Active list

A list of nodes for which there is a chance that the line between the linebreak opportunity of the node and the current linebreak opportunity is tolerable. At the start of the algorithm, a node is created for the pseudo-linebreak opportunity at the start of the paragraph, and the active list is seeded with it. For each feasible linebreak a node is created and added to the active list. Nodes may also be deactivated, i.e. removed from the active list. Deactived nodes may still be referred to by active nodes and therefore may still play a role in the layout.

Line elements

The consecutive series of building blocks in a line that are not suppressed. The border and padding items in the suppressed region are not counted as line elements.

The core of the algorithm consists of two nested loops. The main loop is a loop over the linebreak opportunities of the paragraph. For each linebreak opportunity, a loop is made over the nodes in the active list. For each active node, the line between the linebreak opportunity to which the node refers, and the current linebreak opportunity is evaluated. If the main loop has so far progressed through the paragraph that the line from the active node to the current linebreak opportunity has become too long and must be shrunken beyond its minimum length, the active node is deactivated. Otherwise, if the line from an active node to the current linebreak opportunity falls within the tolerance, a node is created for it. At the end of the nested loop all nodes created for this linebreak opportunity are compared to each other. The best node represents the best layout of the paragraph up to that linebreak opportunity. It is added to the active list.

When the main loop has reached the end node of the paragraph, the best node represents the best layout of the whole paragraph. It is the result of the linebreaking calculation. Through its reference to the previous node, i.e. the node describing the linebreak of the preceding line, and of all nodes to their previous nodes, the program is able to trace the linebreaks for all lines of the paragraph.

During the main loop it is possible that the active list becomes empty, because nodes are removed and insufficient new nodes are added. That means that the paragraph has no possible layout within the requested tolerance. In this situation my implementation gives up. Typesetting systems must define a strategy for such cases, e.g. they accept the last considered line although it exceeds the tolerance.

A node contains the following information:

previous

The node of the linebreak of the preceding line.

lineBeforeEndPos

The index of the paragraph element after the line elements on the line before the linebreak

totalBoxWidthBefore

The total width of all boxes up to the element at lineBeforeEndPos

BPWidthBefore

The border/padding width after the line elements of the line before the linebreak

lbPos

The index of the linebreak opportunity

BPWidthAfter

The border/padding width before the line elements of the line after the linebreak

lineAfterStartPos

The index of the first line element of the line after the linebreak

totalBoxWidthAfter

The total width of all boxes up to the element at lineAfterStartPos

demerits

The amount of demerits of the line before the linebreak

adjRatio

The adjustment ratio of the line before the linebreak

adjClass

The adjustment class of the line before the linebreak

lineNumber

The number of the line that is ended by the linebreak

There are three items before the linebreak, and three corresponding items after the linebreak. They hold calculated values so that these need not be recalculated each time they are needed. The last two items are needed by the ‘Bells and Whistles’ which will be described below.

Figure 1. The parts of a line

========================================================================

   |←previous.lbPos                                         lbPos→|
   +-+-------+--------------------------------------------+-------+-+
   | |  BAP  |               line elements                |  BAP  | | ← the line
   +-+-------+--------------------------------------------+-------+-+
    ↑    ↑   |                                            |   ↑    ↑
    |    |   |←lineAfterStartPos         lineBeforeEndPos→|   |    |
    |    |   |←totalBoxWidthAfter     totalBoxWidthBefore→|   |    |
    |    BPWidthAfter                             BPWidthBefore    |
    WidthAfter                                           WidthBefore

========================================================================

Bells and Whistles

Bells and Whistles

Knuth and Plass add a few bells and whistles to their algorithm:

  1. The assignment of extra demerits to consecutive hyphenated lines.

  2. The ability to deal with variable line lengths.

Under the title “More Bells and whistles” they present a few more features which improve the linebreaking result.

  1. The distinction of four classes of lines, according to their tightness.

  2. The ability to loosen a paragraph, that is, to make it one or more lines longer or shorter than the optimal solution.

All these bells and whistles are implemented.

Fitness classes

The line classes in item 3 are called fitness classes. When two consecutive lines differ by more than one fitness class, they are considered contrasting lines and get an extra amount of demerits. The adoption of fitness classes makes that the inner loop no longer results in a single best node. There may be a best node for each fitness class. The algorithm applies a small optimization: If the minimum amount of demerits of a certain fitness class differs from the minimum amount of demerits of all classes by more than the extra amount of demerits for contrasting lines, the best node of that class will never be part of the best layout. Therefore it need not be remembered.

Variable line lengths

Item 2 adds some complexity to the algorithm. The inner loop now may only compare nodes for the same line number with each other. Therefore it must only run over active nodes with the same line number. It may result in a best node for each line number for which the linebreak opportunity is a feasible linebreak.

Knuth and Plass implement this by keeping the nodes in the active list ordered on line number. They let the inner loop run over a part of the active list whose nodes have the same line number. I have chosen a more explicit mechanism. The program holds a list of active lists. The inner loop runs over one active list. The best node found by the inner loop is added to the next active list. This inner loop is repeated for all active lists.

There are always a limited number of line lengths; let us call this number L. Because the lines with line number L and higher have the same length, their nodes may be compared with each other. Therefore we may put the active nodes with line number L − 1 and higher together.

The program holds a list of L active lists. The first list contains only the start node, which pseudo-ends line number −1. The following property holds: At any time during the process each active list contains only nodes with the same line number. At the start this is trivially true because it is true for the first active list. The algorithm adds to each active list only nodes whose line number is 1 higher than the line number of the preceding active list. This keeps the above property true. The last active list is an exception. In the last active list we collect the active nodes with line number L − 1 and higher.

When the main loop has reached the final penalty, all best nodes are compared together, whether we have reached line L or not. As before, the algorithm results in a single best node.

Looseness

For item 4 the algorithm needs to know the line number of the final nodes. When it has determined the best node, it applies the required looseness by selecting the best node whose line number is ‘looseness’ higher or lower than that of the overall best node. Such a node is not always available, because there may not be a tolerable paragraph layout with that number of lines. Then the best node is selected whose line number is 1 closer to that of the overall best node, etc.

The implementation reuses the mechanism of variable line lengths. But there is no highest line number L. We keep the active nodes with different line numbers separate for all line numbers, and the number of active lists is unbounded. This ensures that we always select the best node for a certain line number, and in the end know the best final node for each line number.

The test class

The implementation contains a test class which can be used from the command line: java nl.leverkruid.spepping.gkplinebreaking.XMLTestCase file.xml. Here file.xml is an XML file which contains p elements in the no-name namespace. These p elements may have one or more of the following attributes:

  1. tolerance, with an integer value indicating the desired value for the tolerance. The default value is 5.

  2. white-space-treatment, with the values listed above. The default value is ignore-if-surrounding-linefeed. The value ignore is not implemented, because the test class does not suppress white space during building of the abstract representation.

  3. linewidth, with an integer value indicating the desired value for the linewidth. Each character gets a width of 1000. Therefore a linewidth of 50000 is equal to the width of 50 characters. If there is no linewidth attribute, the linewidths attribute is used. If that is also absent, the default value of 50000 is used.

  4. linewidths, with a space-separated list of integer values. The first value is the desired linewidth of the first line, and so on. The last value is used for all remaining lines. The linewidths attribute is not used if there is also a linewidth attribute.

  5. looseness, with an integer value indicating the desired number of lines by which the paragraph should be loosened (positive value) or tightened (negative value). The default value is 0. Due to Java, positive values must be entered without a plus sign.

  6. text-align, with the same values as the FO attribute of the same name. In the centered case, Knuth and Plass use the same stretch value as in the ragged-left and -right cases, but apply it both to the start and to the end of the line. I apply half that value to both ends. As a result, the stretch value in a line is the same in all three non-justified cases.

  7. text-align-last, with the same values as the FO attribute of the same name.

  8. ragged-stretch, with an integer value. Knuth and Plass use a value of 1 em, which corresponds to 3 times their width of a hyphen, for the stretch of the linebreak penalty for the non-justified cases. I use a similar default value, viz. 3 times 1000. But this attribute allows one to experiment with other values. A high value (up to Java's Integer.MAX_VALUE) produces a markedly different result.

The p elements may of course contain text. The text may contain span elements in the no-name namespace. These span elements may have three attributes:

  1. BAP-width, whose value should be one integer or a list of three integers separated by spaces, indicating the desired value for the width of the border and padding. If the value consists of three integers, they indicate, respectively, the minimum, optimum and maximum, of that width. The default value is 0.

  2. BAP-conditionality, with value "discard" or "retain", indicating the desired conditionality for the border and padding. The default value is "discard".

  3. letterspacing, whose value should be one integer or a list of three integers separated by spaces, indicating the desired value for the letterspacing. If the value consists of three integers, they indicate, respectively, the minimum, optimum and maximum of the letterspacing. The default value is no letterspacing.

The span elements may of course contain text. It is a limitation of the implementation of the test class that a span element should not be followed immediately by another span element.

Any other elements and attributes will be ignored. The p elements mimick fo:block elements, and the span elements mimick fo:inline elements in the fo:block element.

Linefeeds are treated according to the setting linefeed-treatment = "space". White space is treated according to the setting white-space-collapse = "false". white-space-treatment = "ignore" is not applied.

Linebreaks are determined by a LineInstance of class BreakIterator of the ICU library. The ICU4J library must be in the classpath. It can be obtained from http://icu.sourceforge.net/. Hyphenation must be inserted into the test file explicitly, by use of hard (U+2D) or soft (U+AD) hyphens. Hard hyphens will of course always be printed.

It is hoped that this arrangement allows easy testing of a wide range of texts.

The print representation of the test result

The test class uses several symbolic notations to give an impression of the linebreaking result.

  1. The end of a line is indicated by a vertical stroke ‘|’. This allows one to see unsuppressed spaces at the end of the line.

  2. The stretch is indicated by four spaces in start and end text alignment, and by two spaces on either side of the line in centered text alignment. When the adjustment ratio is zero, the stretch evaluates to zero and it is not printed.

  3. The infinite stretch in the last line is indicated by six spaces in start and end alignment of the last line, and by three spaces on either side of the line in centered text alignment of the last line.

  4. The start and end of a span are indicated by angle brackets ‘〈’ and ‘〉’.

  5. Letterspacing is indicated by spaces between the letters. The space at the start and end of a word is indicated by a ‘+’. In this way it can be seen whether the latter is suppressed around a linebreak.

The complex index in the test class

The test class has additional provisions to allow index test cases similar to the complex index example. Each index entry consists of a names and a references element. It is good XML practice to embed a pair of these elements in an entry element, and to use an index element as the container of the entry elements. But the test class does not enforce that. It acts on each names and each references element. It will only produce good results when one names element is followed by a references element. A references element finishes the entry.

The names element recognizes two attributes: linewidth or linewidths, and tolerance, with the same values as for the paragraph element. These values are also applied to the following references element. The index example has a large number of parameters, but the test class has no provisions to modify them in the test file by attributes. All parameters are coded as static values of the test class.

Due to a special treatment of infinite stretch values in the width calculations, the test class is able to use a leader box with an infinite stretch. In the test print-out the leaders are represented by an ellipsis …. Note that there is no space between the leaders and the following text, but there is a linebreak opportunity at that point.

There are two possible ways to implement hanging indentation: by means of a width-after in the penalties, of by means of different line lengths. The test class does not implement the penalties method. The user has to indicate the desired amount of indentation by indicating a second linewidth which is smaller than the first by that amount. The test print-out does not know that the shorter linewidth is an indentation. The linebreak calculations, however, do use the desired line lenghts.

Acknowledgements

This work would not have been possible if I had not been a member of the development team of Apache FOP. Through the work of the team I learnt much about programming of digital printing. My work directly builds on the work of Luca Furini, who brought the linebreaking algorithm of Knuth and Plass to FOP, and provided its implementation both in linebreaking and in page breaking. I was much inspired by the work of Jeremias Märki, who implemented much of FOP, partially based on Luca's work. My work was most directly prompted by the work of Manuel Mall, who tirelessly insisted on the remaining problems in FOP's white space handling, and by that of Andreas Delmelle, who together with Manuel worked on a solution for these problems.

Finally, I owe many thanks to Donald E. Knuth and Michael F. Plass for their linebreaking algorithm. Their clear description of it made much of my implementation easy; I simply followed their rules.

Bibliography

[KP] Donald E. Knuth and Michael F. Plass, Breaking Lines into Paragraphs, Software—Practice and Experience, 11 (1981) 1119–1184; reprinted in: Digital Typography, Ch. 3, pp. 67–155.