ConfigScript

07 Nov 2022

Once upon a time I thought OGRE was overly complicated - all those unnecessary script files, custom formats, a ton of setup hassle. That was until I tried figuring out an entire modern 3D rendering stack from scratch. That’s when I realized that configuring a rendering process (a rendering pipeline) can be tricky, and that configuring an application with config files can be helpful. OGRE suddenly became very appealing to me and I really began to appreciate all the work the devs have put into it.

One good aspect of OGRE turned out to be those “unnecessary” script and configuration files. On top of that, the syntax of those files looked much cleaner than that of JSON:

// This is a comment
object_keyword Example/ObjectName
{
    attribute_name "some value"

    object_keyword2 "Nested Object"
    {
        other_attribute 1 2 3
        // and so on..
    }
}

I thought if I could harvest anything from OGRE into my OpenGL application, that would be the configuration based on this format (rather than Lua or whatever scripts).

Hence I crafted this simple grammar in ANTLR4 to parse these files:

grammar ConfigScript;

config : (object | comment)* EOF ;

object
    : Identifier '{' property* '}'
    | Identifier STRING '{' (property)* '}'
    ;

property : Identifier propertyValue ;

propertyValue
    : vector
    | INT
    | FLOAT
    | BOOL
    | STRING
    | objectValue
    ;

objectValue
    : '{' property* '}'
    | STRING '{' (property)* '}'
    ;

vector
    : INT+
    | FLOAT+
    ;

comment : LINE_COMMENT | BLOCK_COMMENT ;

STRING : DOUBLE_QUOTED_STRING | SINGLE_QUOTED_STRING ;

BOOL : 'true' | 'false' ;

DOUBLE_QUOTED_STRING : '"' DoubleQuoteStringChar* '"' ;
SINGLE_QUOTED_STRING : '\'' SingleQuoteStringChar* '\'' ;

Identifier : ALPHA (ALPHA | NUM)* ;

fragment SingleQuoteStringChar : ~['\r\n] ;
    // : ~['\\\r\n]
    // | SimpleEscapeSequence ;

fragment DoubleQuoteStringChar : ~["\r\n] ;
    // : ~["\\\r\n]
    // | SimpleEscapeSequence ;

// fragment SimpleEscapeSequence : '\\' ['"?abfnrtv\\] ;

INT : '0'
    | '-'? [1-9] [0-9]*
    ;

FLOAT : ('+' | '-')? NUM+ '.' NUM+ ;

WHITESPACE : [ \r\n\t]+ -> skip ;
ALPHA : [a-zA-Z_] ;
NUM : [0-9] ;

LINE_COMMENT : '//' ~[\r\n]* -> skip ;
BLOCK_COMMENT : '/*' .*? '*/' -> skip ;

The only difference from the OGRE format is that the object name can only be a quoted string:

// This is a comment
object_keyword "Example/ObjectName" // <--- this can not be just Example/ObjectName
{
    attribute_name "some value"

    object_keyword2 "Nested Object"
    {
        other_attribute 1 2 3
        // and so on..
        /* block comment */
    }
}
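The numeric lexer rules in the grammar (INT and FLOAT) are easy to misread, so here is a minimal sketch that restates them as regular expressions. The function names matchesIntRule and matchesFloatRule are made up for this illustration, and the regexes are my own translation of the rules. They only check whether one complete string fits a rule; they do not reproduce how the ANTLR lexer splits input into tokens.

```cpp
#include <regex>
#include <string>

// Regex restatements of two lexer rules (an assumed, hand-made translation):
//   INT   : '0' | '-'? [1-9] [0-9]* ;
//   FLOAT : ('+' | '-')? NUM+ '.' NUM+ ;

// True when the whole string fits the INT rule.
inline bool matchesIntRule(const std::string& text)
{
    static const std::regex pattern(R"(0|-?[1-9][0-9]*)");
    return std::regex_match(text, pattern);
}

// True when the whole string fits the FLOAT rule.
inline bool matchesFloatRule(const std::string& text)
{
    static const std::regex pattern(R"([+-]?[0-9]+\.[0-9]+)");
    return std::regex_match(text, pattern);
}
```

Two details become visible this way: INT accepts a leading minus but not a leading plus, and FLOAT requires digits on both sides of the dot, so neither "+5" nor "1." fits these rules.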

The only thing left is to generate the parser from this grammar, compile it and use it in a project.

ANTLR version incompatibility

It’s ranting time! For C++ users, life was never easy. Yet some good Samaritan created CMake. It is ugly as heck, it is barebones and it does not simplify life by a whole lot, yet it lets devs manage the build process a bit better than Makefiles do. But it is one more tool to learn, keep up-to-date and hate. Then Microsoft came up with a package manager for C++, vcpkg, and decided to integrate it with CMake, which had become quite popular by then. One more tool to learn, keep up-to-date and hate. But not as much as CMake - this one actually helps quite a bit. When you have to add a library or two to your project, it is much easier than using barebones CMake. And that’s where we come to the point of keeping the thing up-to-date. The vcpkg repository is technically community-driven, meaning people from all around the world are responsible for keeping the repository in good shape, but only Microsoft-approved (employed?) people can approve merges to the repo. And often an update only happens when somebody has to use a port (a dependency from the vcpkg repo), sees an issue and has enough time and passion to go and fix it.

That’s a lot of complaining, but the thing is: if you go to the ANTLR4 website, download the JAR and use it to generate the parser & lexer sources for your C++ project, you won’t get them to work with the runtime from vcpkg (unless somebody steps up and updates the port). The vcpkg port provides the runtime for ANTLR4 4.10.1, while the generator on the ANTLR4 website has version 4.11.1. This is an issue, since there were breaking changes in the ANTLR4 runtime source between these two versions, so the code generated by the newer version can not be used with the older version of the runtime. Classes were moved around or newly created, like ::antlr4::internal::OnceFlag, which never existed in 4.10.1 but was added in 4.11.1 together with an entire header file and namespace.

Luckily, some good guy, out of sheer good will, created a PR to fix the issue (by updating the version of ANTLR in the vcpkg repository), which should be merged by the time this blog post is published.

You have to match the ANTLR4 source generator version with the version of the runtime provided by the vcpkg port.
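To make the rule above concrete, here is a small sketch of a version check that a build script or a test could run. Both helpers, majorMinor and versionsLookCompatible, are hypothetical and belong to neither ANTLR nor vcpkg. Treating any difference in the major.minor part as incompatible is an assumption made for this illustration, chosen because the 4.10.1 and 4.11.1 pair from this post differs exactly there.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: extracts "major.minor" from a version string such as
// "4.10.1". Returns std::nullopt when the string does not start with
// <number>.<number>.
inline std::optional<std::string> majorMinor(const std::string& version)
{
    std::istringstream stream(version);
    int major = 0;
    int minor = 0;
    char dot = '\0';

    if (!(stream >> major >> dot >> minor) || dot != '.')
    {
        return std::nullopt;
    }

    return std::to_string(major) + "." + std::to_string(minor);
}

// Assumption for this sketch: generator and runtime are considered a match
// only when both versions parse and share the same major.minor part.
inline bool versionsLookCompatible(const std::string& generator, const std::string& runtime)
{
    const auto generatorVersion = majorMinor(generator);
    const auto runtimeVersion = majorMinor(runtime);

    return generatorVersion && runtimeVersion && *generatorVersion == *runtimeVersion;
}
```

With the versions mentioned above, versionsLookCompatible("4.11.1", "4.10.1") reports a mismatch, which is the situation you want to catch before the compiler does.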

Actual implementation

That aside, there are a few tricks to using the port in your code.

For the vcpkg port, the vcpkg.json file should contain the antlr4 dependency:

{
  "$schema": "https://raw.githubusercontent.com/microsoft/vcpkg/master/scripts/vcpkg.schema.json",
  "name": "antlr-configscript-cpp",
  "version-string": "0.1.0",
  "dependencies": [
      "antlr4"
  ]
}

The CMakeLists.txt, however, should look for the antlr4-runtime package, link the antlr4_shared or antlr4_static library and explicitly add the include directories from the variable ${ANTLR4_INCLUDE_DIR}:

cmake_minimum_required(VERSION 3.20 FATAL_ERROR)

project(antlr_configscript_cpp VERSION 0.1.0 LANGUAGES CXX)

add_executable(antlr_configscript_cpp
    "main.cpp"
    "gen_parser/ConfigScriptLexer.cpp"
    "gen_parser/ConfigScriptLexer.h"
    "gen_parser/ConfigScriptParser.cpp"
    "gen_parser/ConfigScriptParser.h"
    "gen_parser/ConfigScriptBaseListener.cpp"
    "gen_parser/ConfigScriptBaseListener.h"
)

set_property(TARGET antlr_configscript_cpp PROPERTY CXX_STANDARD 20)

# linking ANTLR4
find_package(antlr4-runtime CONFIG REQUIRED)
target_link_libraries(antlr_configscript_cpp PRIVATE antlr4_shared)
target_include_directories(antlr_configscript_cpp PRIVATE ${ANTLR4_INCLUDE_DIR})

With those issues sorted out, the code to parse the source for the grammar is relatively straightforward:

#include <iostream>

#include <antlr4-runtime.h>

#include "gen_parser/ConfigScriptLexer.h"
#include "gen_parser/ConfigScriptParser.h"
#include "gen_parser/ConfigScriptBaseListener.h"

#pragma execution_character_set("utf-8")

int main()
{
    const auto configSource = R"(
// This is a comment
object_keyword "Example/ObjectName" // <--- this can not be just Example/ObjectName
{
    attribute_name "some value"

    object_keyword2 "Nested Object"
    {
        other_attribute 1 2 3
        // and so on..
        /* block comment */
    }
}
)";

    antlr4::ANTLRInputStream input(configSource);

    ConfigScriptLexer lexer(&input);

    antlr4::CommonTokenStream tokens(&lexer);

    ConfigScriptParser parser(&tokens);

    antlr4::tree::ParseTree* tree = parser.config();

    auto s = tree->toStringTree(&parser);

    std::cout << "Parse Tree: " << s << std::endl;

    return 0;
}

This program would print a parse tree looking like this:

(config (object object_keyword "Example/ObjectName" { (property attribute_name (propertyValue "some value")) (property object_keyword2 (propertyValue (objectValue "Nested Object" { (property other_attribute (propertyValue (vector 1 2 3))) }))) }) <EOF>)

Now, what do we do with it? Well, first we need to implement our own version of ConfigScriptBaseListener:

class MyConfigListener : public ConfigScriptBaseListener {
public:
    virtual void enterConfig(ConfigScriptParser::ConfigContext* ctx) override { }
    virtual void exitConfig(ConfigScriptParser::ConfigContext* ctx) override { }

    virtual void enterObject(ConfigScriptParser::ObjectContext* ctx) override { }
    virtual void exitObject(ConfigScriptParser::ObjectContext* ctx) override { }

    virtual void enterProperty(ConfigScriptParser::PropertyContext* ctx) override { }
    virtual void exitProperty(ConfigScriptParser::PropertyContext* ctx) override { }

    virtual void enterPropertyValue(ConfigScriptParser::PropertyValueContext* ctx) override { }
    virtual void exitPropertyValue(ConfigScriptParser::PropertyValueContext* ctx) override { }

    virtual void enterObjectValue(ConfigScriptParser::ObjectValueContext* ctx) override { }
    virtual void exitObjectValue(ConfigScriptParser::ObjectValueContext* ctx) override { }

    virtual void enterVector(ConfigScriptParser::VectorContext* ctx) override { }
    virtual void exitVector(ConfigScriptParser::VectorContext* ctx) override { }

    virtual void enterComment(ConfigScriptParser::CommentContext* ctx) override { }
    virtual void exitComment(ConfigScriptParser::CommentContext* ctx) override { }

    virtual void enterEveryRule(antlr4::ParserRuleContext* ctx) override { }
    virtual void exitEveryRule(antlr4::ParserRuleContext* ctx) override { }
    virtual void visitTerminal(antlr4::tree::TerminalNode* node) override { }
    virtual void visitErrorNode(antlr4::tree::ErrorNode* node) override { }
};

Then we need to pass an instance of this new class to the ParseTreeWalker so that we can process the tree node-by-node:

auto listener = std::make_unique<MyConfigListener>();

auto walker = std::make_unique<antlr4::tree::ParseTreeWalker>();

walker->walk(listener.get(), tree);

But for that to work properly, we’d need to handle every context separately. Whenever we enter a nested tree node (like a vector), we would want a pointer to a temporary vector to add elements to. And when we exit this node, we would need a pointer to whatever the parent of that vector was, to be able to add the vector to that parent.

This might get out of hand quite quickly.
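To give an idea of the bookkeeping involved, here is a rough sketch of the state such a listener would have to carry around. ObjectStackBuilder is a hypothetical class written for this illustration; its method names mimic enter/exit callbacks, but it does not derive from any ANTLR type and is fed by hand.

```cpp
#include <any>
#include <map>
#include <stack>
#include <string>
#include <utility>

using PropertyMap = std::map<std::string, std::any>;

// Hypothetical stand-in for the state a listener would need to maintain.
class ObjectStackBuilder
{
public:
    // An object (top-level or nested) starts: open a fresh property map.
    void enterObject(std::string name)
    {
        open_.emplace(std::move(name), PropertyMap{});
    }

    // A scalar or vector property of the innermost open object.
    // Precondition: at least one object is open.
    void addProperty(const std::string& name, std::any value)
    {
        open_.top().second[name] = std::move(value);
    }

    // An object ends: attach it to its parent, or to the top-level result
    // when no parent is left. Precondition: at least one object is open.
    void exitObject()
    {
        auto finished = std::move(open_.top());
        open_.pop();

        auto& target = open_.empty() ? result_ : open_.top().second;
        target[finished.first] = std::move(finished.second);
    }

    const PropertyMap& result() const { return result_; }

private:
    std::stack<std::pair<std::string, PropertyMap>> open_;
    PropertyMap result_;
};
```

Even this stripped-down version needs a stack and a special case for the root, and a real listener would additionally have to convert tokens and keep track of which property a value belongs to.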

An alternative, and arguably more convenient, way to handle this is to use attributes and actions in the grammar itself:

grammar ConfigScript;

@header {
    #include <algorithm>
    #include <any>
    #include <map>
    #include <string>
    #include <variant>
    #include <vector>
}

config : objects=object* EOF ;

object
    returns [
        std::string name
    ]
    : Identifier objectValue { $name = $Identifier->getText(); }
    ;

property
    returns [
        std::string name,
        std::any value
    ]
    : Identifier propertyValue { $name = $Identifier->getText(); antlrcpp::downCast<ObjectValueContext*>(_localctx->parent)->propertyMap[$name] = $value; }
    ;

propertyValue
    : intVector { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $intVector.elements; }
    | floatVector { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $floatVector.elements; }
    | INT { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = std::stoi($INT.text); }
    | FLOAT { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = std::stof($FLOAT.text); }
    | BOOL { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = static_cast<bool>($BOOL.text == "true"); }
    | STRING { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $STRING.text; }
    | objectValue { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $objectValue.propertyMap; }
    ;

objectValue
    returns [
        std::string classifier,
        std::map<std::string, std::any> propertyMap
    ]
    : '{' property* '}'
    | STRING '{' (property)* '}' { $classifier = $STRING.text; }
    ;

intVector
    returns [ std::vector<int> elements ]
    : INT+ { auto v = $ctx->INT(); std::for_each(v.begin(), v.end(), [&](auto* node) { _localctx->elements.push_back(std::stoi(node->getText())); }); }
    ;

floatVector
    returns [ std::vector<float> elements ]
    : FLOAT+ { auto v = $ctx->FLOAT(); std::for_each(v.begin(), v.end(), [&](auto* node) { _localctx->elements.push_back(std::stof(node->getText())); }); }
    ;

This couples the grammar to a specific target language (C++ in this case), but it lets us write the bare minimum of code afterwards:

auto objects = antlrcpp::downCast<ConfigScriptParser::ConfigContext*>(tree)->object();

However, there are quite a few tricks involved.

First of all, the huge benefit is that this way we are free to specify code to be run when each rule alternative is matched, and we have access to the entire rule context:

propertyValue
    : intVector { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $intVector.elements; }
    | floatVector { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $floatVector.elements; }
    | INT { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = std::stoi($INT.text); }
    | FLOAT { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = std::stof($FLOAT.text); }
    | BOOL { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = static_cast<bool>($BOOL.text == "true"); }
    | STRING { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $STRING.text; }
    | objectValue { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $objectValue.propertyMap; }
    ;

See how the nested rules are referenced by their names: intVector is accessed with $intVector, INT is accessed with $INT, and the text of a rule is accessed via $RULE.text. The current rule’s context is also available through _localctx.

All of this logic is injected into the code generated by ANTLR, so you can access it as-is and see what is available and what not.

But the issue is: whatever code you write in the rules, it won’t be compiled when the parser code is generated - it is inserted into the generated code as-is. You will have to make sure it compiles as a separate step.

Rules also allow you to define what other data can be accessed on them (like the integer value of an INT rule or the elements of the intVector rule):

intVector
    returns [ std::vector<int> elements ]
    : INT+ { auto v = $ctx->INT(); std::for_each(v.begin(), v.end(), [&](auto* node) { _localctx->elements.push_back(std::stoi(node->getText())); }); }
    ;

This also acts as the initialization of the additional context fields.
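As a rough mental model, the returns clause ends up as a public field on the rule’s context object, and the action fills it. The sketch below is a hand-written stand-in, not the code ANTLR generates: IntVectorContextSketch and fillElements are names invented here, and the token texts are passed in as plain strings instead of terminal nodes.

```cpp
#include <string>
#include <vector>

// Hand-written stand-in for the context of the intVector rule. The real
// class is generated by ANTLR; only the field declared in the returns
// clause is mirrored here.
struct IntVectorContextSketch
{
    std::vector<int> elements; // default-constructed, so it starts out empty
};

// Mirrors what the action body does: convert the text of every matched INT
// token and append it to the context field.
inline void fillElements(IntVectorContextSketch& ctx, const std::vector<std::string>& tokenTexts)
{
    ctx.elements.reserve(tokenTexts.size());

    for (const auto& text : tokenTexts)
    {
        // std::stoi throws std::out_of_range when the value does not fit in int.
        ctx.elements.push_back(std::stoi(text));
    }
}
```

A plain range-based loop is used here in place of std::for_each with a lambda; inside a grammar action either form should work, the loop is just easier to read on one line.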

Second, all of these antlrcpp::downCast<Context*>(_localctx->parent) are required to access the parent context (so no $parent or something).

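Accessing an attribute: $vector.value is written with a dot even though that is not valid C++ on its own, since $vector resolves to a context pointer (VectorContext*). ANTLR rewrites such $rule.attribute references when generating the code, while the rest of the action is copied as-is, hence all those antlrcpp::downCast calls.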

Accessing repeated rules as a vector: $INT.begin() and $INT.end() would resolve to something else (see accessing an attribute, above). Hence the trick is to go through the context: $ctx->INT().begin().

Accessing parent attribute:

propertyValue
    : vector { $property::value.emplace($vector.elements); }

The $property::value won’t work and will throw missing code generation template NonLocalAttrRefHeader. I am unsure how to fix this correctly (this should be valid, according to documentation), so I hacked my way through:

propertyValue
    : vector { antlrcpp::downCast<PropertyContext*>(_localctx->parent)->value = $vector.elements; }
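The reason a cast is needed at all is that the parent pointer is typed as a base tree node, so the value field of the property context is not visible through it. Here is a minimal sketch of that pattern with simplified stand-in types. Everything ending in Sketch is hypothetical, and downCastSketch only illustrates the idea of a checked downcast; it is not the implementation of antlrcpp::downCast.

```cpp
#include <any>
#include <cassert>
#include <utility>

// Simplified stand-ins for the runtime and generated types (all hypothetical).
struct NodeSketch
{
    virtual ~NodeSketch() = default;
    NodeSketch* parent = nullptr; // typed as the base, so derived fields are hidden
};

struct PropertyNodeSketch : NodeSketch
{
    std::any value; // corresponds to the value field from the returns clause
};

struct PropertyValueNodeSketch : NodeSketch
{
};

// Checked downcast: verified with dynamic_cast while asserts are enabled,
// a plain static_cast otherwise.
template <typename To>
To* downCastSketch(NodeSketch* from)
{
    assert(dynamic_cast<To*>(from) != nullptr);
    return static_cast<To*>(from);
}

// What the propertyValue action does: write the parsed value into the parent.
inline void storeInParent(PropertyValueNodeSketch& node, std::any value)
{
    downCastSketch<PropertyNodeSketch>(node.parent)->value = std::move(value);
}
```

The downside of this hack is the same as in the grammar: the action silently assumes which rule the parent is, so reusing propertyValue under a different parent rule would break it.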

And a simple program that prints out the parsed config:

#include <any>
#include <iostream>
#include <map>
#include <string>
#include <vector>

#include <antlr4-runtime.h>

#include "gen_parser/ConfigScriptLexer.h"
#include "gen_parser/ConfigScriptParser.h"
#include "gen_parser/ConfigScriptBaseListener.h"

#pragma execution_character_set("utf-8")

void printAny(std::any value)
{
    if (value.type() == typeid(std::string))
    {
        std::cout << "(str){ " << std::any_cast<std::string>(value) << " };";
    }
    else if (value.type() == typeid(int))
    {
        std::cout << "(int){ " << std::any_cast<int>(value) << " };";
    }
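    else if (value.type() == typeid(float))
    {
        std::cout << "(float){ " << std::any_cast<float>(value) << " };";
    }
    else if (value.type() == typeid(bool))
    {
        // the BOOL alternative of propertyValue stores a bool
        std::cout << "(bool){ " << (std::any_cast<bool>(value) ? "true" : "false") << " };";
    }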
    else if (value.type() == typeid(std::vector<int>))
    {
        auto vec = std::any_cast<std::vector<int>>(value);

        std::cout << "(int[]){ ";

        for (auto val : vec)
        {
            std::cout << val << ", ";
        }

        std::cout << " };";
    }
    else if (value.type() == typeid(std::vector<float>))
    {
        auto vec = std::any_cast<std::vector<float>>(value);

        std::cout << "(float[]){ ";

        for (auto val : vec)
        {
            std::cout << val << ", ";
        }

        std::cout << " };";
    }
    else if (value.type() == typeid(std::map<std::string, std::any>))
    {
        auto vec = std::any_cast<std::map<std::string, std::any>>(value);

        std::cout << "(obj){ ";

        for (auto& val : vec)
        {
            std::cout << val.first << " = ";

            printAny(val.second);
        }

        std::cout << " };";
    }
}

int main()
{
    const auto configSource = R"(
// This is a comment
object_keyword "Example/ObjectName" // <--- this can not be just Example/ObjectName
{
    attribute_name "some value"

    object_keyword2 "Nested Object"
    {
        other_attribute 1 2 3
        // and so on..
        /* block comment */
    }
}
)";

    antlr4::ANTLRInputStream input(configSource);

    ConfigScriptLexer lexer(&input);

    antlr4::CommonTokenStream tokens(&lexer);

    ConfigScriptParser parser(&tokens);

    antlr4::tree::ParseTree* tree = parser.config();

    auto s = tree->toStringTree(&parser);

    std::cout << "Parse Tree: " << s << std::endl;

    auto objects = antlrcpp::downCast<ConfigScriptParser::ConfigContext*>(tree)->object();

    std::cout << "Objects found: " << objects.size() << std::endl;

    for (auto o : objects)
    {
        std::cout << "[" << o->name << "] { ";

        auto objectValue = o->objectValue();
        auto propertyMap = objectValue->propertyMap;

        for (auto& p : propertyMap)
        {
            std::cout << p.first << " = ";

            printAny(p.second);

            std::cout << std::endl;
        }

        std::cout << " } " << std::endl;
    }

    return 0;
}
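One thing to keep in mind with the typeid chain in printAny: std::any only matches the exact type that was stored, with no conversions. The sketch below shows this with two small helpers, holdsExactly and getIf, both invented for this illustration. It is the reason the grammar actions and the printing code have to agree on float versus double and on std::string versus a string literal.

```cpp
#include <any>
#include <string>
#include <typeinfo>

// True when the std::any holds exactly type T; no conversions are considered.
template <typename T>
bool holdsExactly(const std::any& value)
{
    return value.type() == typeid(T);
}

// Returns a pointer to the stored T, or nullptr when another type is stored.
// Uses the pointer overload of std::any_cast, which does not throw.
template <typename T>
const T* getIf(const std::any& value)
{
    return std::any_cast<T>(&value);
}
```

For example, a value produced by std::stof is held as a float and will not match a check for double, and a string literal stored directly is held as a const char* rather than a std::string.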

Hope this gives you a brief introduction to ANTLR4 and a few tips on implementing your own parsers.