Chapter 8: Error Recovery

Usually it is not acceptable to have a program terminate on a parse error. For example, a compiler should recover sufficiently to parse the rest of the input file and check it for errors; a calculator should accept another expression. Such errors violate the grammar for which the parser was constructed and are called syntactical errors. Other types of errors are called semantical errors: here the intended meaning of the language is not observed. For example, a division by too small a numeric constant (e.g., 0) may be detected by the parser compile time. In general, what can be detected compile time should not left for the run-time to detect, and so the parser should flag an error when it detects a division by a very small numerical constant. Bisonc++'s parsers may detect both syntactical and semantical errors. Syntactical errors are detected automatically, while the parser performs its parsing-job, semantical errors must explicitly be defined when the grammar is constructed. The following sections cover the way Bisonc++'s parser may handle syntactical errors and semantical errors, respectively.

8.1: Syntactical Error Recovery

In a simple interactive command parser where each input is one line, it may be sufficient to allow parse() to return PARSE_ABORT on error and have the caller ignore the rest of the input line when that happens (and then call parse() again). But this is inadequate for a compiler, because it forgets all the syntactic context leading up to the error. A syntactical error deep within a function in the compiler input should not cause the compiler to treat the following line like the beginning of a source file.

It is possible to specify how to recover from a syntactical error by writing rules recognizing the special token error. This is a terminal symbol that is always defined (it must not be declared) and is reserved for error handling. The Bisonc++ parser generates an error token whenever a syntactical error is detected; if a rule was provided recognizing this token in the current context, the parse can continue. For example:

        
    statements:  
        // empty
    | 
        statements '\n'
    | 
        statements expression '\n'
    | 
        statements error '\n'
        
The fourth rule in this example says that an error followed by a newline makes a valid addition to any statements.

What happens if a syntactical error occurs in the middle of an expression? The error recovery rule, interpreted strictly, applies to the precise sequence of a statements, an error and a newline. If an error occurs in the middle of an expression, there will probably be some additional tokens and subexpressions on the parser's stack after the last statements, and there will be tokens waiting to be read before the next newline. So the rule is not applicable in the ordinary way.

Bisonc++, however, can force the situation to fit the rule, by discarding part of the semantic context and part of the input. When a (syntactical) error occurs the parsing algorithm will try to recover from the error in the following way: First it discards states from the stack until it encounters a state in which the error token is acceptable (meaning that the subexpressions already parsed are discarded, back to the last complete statements). At this point the error token is shifted. Then, if the available look-ahead token is not acceptable to be shifted next, the parser continues to read tokens and to discard them until it finds a token which is acceptable. I.e., a token which can follow an error token in the current state. In this example, Bisonc++ reads and discards input until the next newline was read so that the fourth rule can apply.

The choice of error rules in the grammar is a choice of strategies for error recovery. A simple and useful strategy is simply to skip the rest of the current input line or current statement if an error is detected:


    statement: 
        error ';'  // on error, skip until ';' is read 
        
Another useful recovery strategy is to recover to the matching close-delimiter of an opening-delimiter that has already been parsed. Otherwise the close-delimiter will probably appear to be unmatched, generating another, spurious error message:
    
    primary:  
        '(' expression ')'
    | 
        '(' error ')'
    |
        ...
    ;
        
Error recovery strategies are necessarily guesses. When they guess wrong, one syntactical error often leads to another. In the above example, the error recovery rule guesses that an error is caused by bad input within one statement. Suppose that instead a spurious semicolon is inserted in the middle of a valid statement. After the error recovery rule recovers from the first error, another syntactical error will be found straightaway, since the text following the spurious semicolon is also an invalid statement.

To prevent an outpouring of error messages, the parser may be configured in such a way that no error message will be generated for another syntactical error that happens shortly after the first. E.g., only after three consecutive input tokens have been successfully shifted error messages will be generated again. This configuration is currently not available in Bisonc++'s parsers.

Note that rules using the error token may have actions, just as any other rules can.

The token causing an error is re-analyzed immediately when an error occurs. If this is unacceptable, then the member function clearin may be called to skip this token. The function can be called by any member function of the Parser class. For example, suppose that on a parse error, an error handling routine is called that advances the input stream to some point where parsing should once again commence. The next symbol returned by the lexical scanner is probably correct. The previous token ought to be discarded using clearin.

8.2: Semantical Error Recovery

Semantical error recovery once again requires judgment on the part of the grammar-writer. For example, an assignment expression may be syntactically defined as

    expr '=' expr
        
The left-hand side must be a so-called lvalue. An lvalue is simply an addressable location, like a variable's identifier, a dereferenced pointer expression or some other address-expression. The right-hand side is a so-called rvalue: this may be any value: any expression will do.

A rule like the above leaves room for many different semantical errors:

A parser that should be able to detect semantic errors will normally use a counter counting the number of semantic errors, e.g., size_t/*unsigned*/ d_nSemanticErrors. It may be possible to test this counter's value once the input has been parsed, calling ABORT() (see section 6.3) if the counter isn't zero anymore. When the grammar's start symbol itself has multiple alternatives, it is probably easiest to augment the grammar with an additional rule, becoming the augmented grammar's start symbol which simply calls the former start symbol. For example, if input was the name of the original start-symbol, augment the grammar as follows to ensure a PARSE_ABORT return value of the parse() member when either syntactical or semantical errors were detected:

    semantic_input:                 // new start-symbol
        input
        {
            if (d_nSemanticErrors)  // return PARSE_ABORT
                ABORT();            // on semantic errors too.
        }
        
Returning from the parser's parse() member the number of syntactical and semantical errors could then be printed, whereupon the program might terminate.