Parser monad
Few years ago I did a course project for the “Functional programming” course at Jagiellonian University. The project was to implement a program for solving systems of equations… in Haskell.
For me back then, the entire concept of monads was quite foreign and scary, so I implemented a very crude state machine to parse the equations as strings from STDIN:
parseRow :: String -> Integer -> [Integer] -> [String] -> ([Integer], [String])
parseRow [] state coefficients var_names
| state == 7 = (reverse coefficients, reverse var_names)
| otherwise = error ("Invalid equation (state: " ++ show state ++ "; coefficients: " ++ show coefficients ++ "; var_names: " ++ show var_names ++ ")")
parseRow (c:cs) state coefficients var_names
| (state == 1) && (c == '-') = parseRow cs 2 ((-1) : coefficients) var_names
| state == 1 && isNum c = parseRow cs 3 (c_int : coefficients) var_names
| state == 1 && isAlpha c = parseRow cs 4 (1 : coefficients) ([c] : var_names)
| state == 2 && isNum c = parseRow cs 3 (repl_k : drop 1 coefficients) var_names
| state == 2 && isAlpha c = parseRow cs 4 coefficients ([c] : var_names)
| (state == 3) && (c == '=') = parseRow cs 5 coefficients var_names
| state == 3 && isAlpha c = parseRow cs 4 coefficients ([c] : var_names)
| state == 3 && isNum c = parseRow cs 3 (new_k : drop 1 coefficients) var_names
| (state == 4) && (c == '-') = parseRow cs 2 ((-1) : coefficients) var_names
| (state == 4) && (c == '+') = parseRow cs 2 (1 : coefficients) var_names
| state == 4 && isAlphaNum c = parseRow cs 4 coefficients (new_v : drop 1 var_names)
| (state == 4) && (c == '=') = parseRow cs 5 coefficients var_names
| (state == 5) && (c == '-') = parseRow cs 6 ((-1) : coefficients) var_names
| state == 5 && isNum c = parseRow cs 7 (c_int : coefficients) var_names
| state == 6 && isNum c = parseRow cs 7 (repl_k : drop 1 coefficients) var_names
| state == 7 && isNum c = parseRow cs 7 (new_k : drop 1 coefficients) var_names
| otherwise = parseRow cs state coefficients var_names
where
k = abs $ head coefficients
k_sign = sign (head coefficients)
c_int = atoi c
new_k = k_sign * ((k * 10) + c_int)
repl_k = k_sign * c_int
v = if not (null var_names) then head var_names else []
new_v = v ++ [c]
parseLine :: String -> ([Integer], [String])
parseLine s = parseRow s 1 [] []
However, recently I (hopefully) got a grip of monads, so I decided to modify my old project in Haskell to make it prettier.
One of the improvements I made was the parsing part. Since one of the requirements for the course project was to not use any third-party libraries, I decided to stick to this and implement a parser monad.
This proved to be an interesting task and made the entire parsing part quite neat:
equationFactor :: Parser Fraction
equationFactor = do
_sign <- factorSign
zeroOrMore (sat isSpace)
factor <- fmap (fromMaybe (1%1)) (zeroOrOne rationalFactor)
return (_sign * factor)
equationMember :: Parser (Fraction, String)
equationMember = do
factor <- equationFactor
zeroOrMore (sat isSpace)
zeroOrOne (sat (== '*'))
zeroOrMore (sat isSpace)
nameFirst <- oneOrMore (sat isAlpha)
nameRest <- zeroOrMore (sat isAlphaNum)
zeroOrMore (sat isSpace)
return (factor, nameFirst ++ nameRest)
-- An equation consists of a list of pairs (factor, variable name) and a free member
equation :: Parser ([(Fraction, String)], Fraction)
equation = do
members <- oneOrMore equationMember
zeroOrMore (sat isSpace)
sat (== '=')
zeroOrMore (sat isSpace)
freeMember <- rationalFactor
return (members, freeMember)
This is all possible thanks to the Parser
monad. Under the cut - the implementation details.
This material is heavily based on the Dave Sands’ lectures about parsing in Haskell.
The main idea is that a parser is an annotated function from an input string to maybe something (parsing result) and a string (the unparsed part):
newtype Parser a = P (String -> Maybe (a, String))
Here, P
is just the type constructor.
So the function wrapped by the Parser
monad takes a String
as an input, parses as much as possible and returns what it managed to parse
together with the rest of the string, which was not parser. The two possible results hence are Just (a, String)
representing the parsed data (a
part of Maybe (a, String)
)
and the unparsed remainder of a string (String
part of Maybe (a, String)
). The other possible result is Nothing
, meaning the input string can not be parsed
with a given parser function.
Based on this explanation, one can build combinations of the parsers using the chaining of the Maybe
monad and the boolean logic (AND
and OR
operations, predominantly).
Helper functions
There are few trivial parsers which to help on this endeavour:
success a
- always succeeds, returning aJust (a, str)
(some default value and an entire input string) for any input stringfailure
- always fails, returningNothing
for any input stringitem
- parses any one arbitrary character, returning this character wrapped inJust (chr, str)
(first character and the rest of the string) if the input string is non-empty orNothing
if the input string is emptyparser1 <|> parser2
- tries the first parser,parser1
and if it succeeds - returns the result, but if it fails - returns whatever running the second parser,parser2
returnsparser1 >>= (\x -> parser2)
- chains two parsers together, runsparser1
and passes its output string to the second parser,parser2
; but since a parser result is a tuple, not just a string that can be passed to the second parser, this function takes a function from the output of the first parser to the second parserparserFn2 <*> parser1
- chains two parsers together, runsparser1
(yes, in the reverse order) and appliesparserFn2
to the result; whereparserFn2
is a function wrapped in a parser (Parser (a -> a)
)
Since a parser is just a function String -> Maybe (a, String)
, in order to actually parse something, one needs to run it.
Running a parser is as easy as executing a function parse
, passing it a parser and an input string:
>>> parse (success 1) "abc"
Just (1,"abc")
>>> parse (failure) "abc"
Nothing
>>> parse item "abc"
Just ('a',"bc")
>>> parse (item <|> success '!') "abc"
Just ('a',"bc")
>>> parse (item <|> success '!') ""
Just ('!',"")
>>> parse (item >>= (\_ -> item)) "abc"
Just ('b',"c")
>>> parse (item >>= (\_ -> item)) "abc"
Just ('b',"c")
>>> parse (success (\x -> if x == '-' then (-1) else 1) <*> item) "-123"
Just (-1,"123")
>>> parse (success (\x -> if x == '-' then (-1) else 1) <*> item) "123"
Just (1,"23")
The aforementioned helper functions are defined as following:
success :: a -> Parser a
success a = P (\str -> Just (a, str))
failure :: Parser a
failure = P (\_ -> Nothing)
item :: Parser Char
item = P $ \str -> case str of
"" -> Nothing
(ch:chs) -> Just (ch, chs)
instance Alternative Parser where
empty = failure
p1 <|> p2 = P $ \str -> case parse p1 str of
Nothing -> parse p2 str
result -> result
instance Monad Parser where
p1 >>= p2 = P $ \str -> case parse p1 str of
Just (a, str') -> parse (p2 a) str'
Nothing -> Nothing
return = pure
Basic parsers
These basic building blocks are then extended with a bit more complex building blocks, which are a little bit more handy for building complex parsers:
sat (ch -> Bool)
- returns a parser which only succeeds if the first character of a (non-empty) input string satisfies the function (ch -> Bool
)zeroOrMore p
- returns a parser which succeeds if the parserp
(passed as a param) succeeds on the input string (potentially multiple times) or fails - regardless, the resulting parser will succeedzeroOrOne p
- returns a parser which succeeds if the parserp
(passed as a param) succeeds exactly once on an input string or if it failsoneOrOne p
- returns a parser which succeeds if the parserp
(passed as a param) succeeds once or more on an input string
With these building blocks, one can build more complex parsers:
Parsing a digit
>>> digit = sat isDigit
>>> parse digit "123abc"
Just ('1',"23abc")
Parsing a number (as a string)
>>> numberStr = oneOrMore (sat isDigit)
>>> parse numberStr "123abc"
Just ("123","abc")
Since Parser
is an instance of Monad
, you can use fmap
or <$>
to combine it with other functions:
Parsing a number (as a non-negative number)
>>> naturalNumber :: Parser Integer; naturalNumber = read <$> oneOrMore (sat isDigit)
>>> parse naturalNumber "123abc"
Just (123,"abc")
Parsiung a potentially negative number
>>> sign = fmap (maybe 1 (\_ -> -1)) (zeroOrOne (sat (== '-')))
>>> intNumber = ((*) <$> sign) <*> naturalNumber
>>> parse intNumber "-123abc"
Just (-123,"abc")
The code above requires a bit of an explanation, I guess. The first part,
sign = fmap (maybe 1 (\_ -> -1)) (zeroOrOne (sat (== '-')))
parses a potential -
sign at the beginning of a string and returns either 1
or -1
, aka the sign multiplier.
The tricky part is fmap (maybe 1 (\_ -> -1)) ...
- the zeroOrOne
parser will return Maybe a
– that is, Maybe (Maybe a, String)
.
So if the wrapped part (in this case - sat (== '-')
) is present in the string, it will return Just a
(in this case - Just '-'
).
Hence we need to cast this Maybe Char
to something reasonable - a number would do. We call the maybe
helper with two params - the first
one is what would be returned if it is applied to Nothing
and the second one is a function which will be called on the value wrapped in Just
the thing is applied to:
>>> maybe 1 (\_ -> -1) (Just '-')
-1
>>> maybe 1 (\_ -> -1) Nothing
-1
>>> maybe False (\ch -> ch == '-') (Just '-')
True
>>> maybe True (\ch -> ch == '-') Nothing
True
The fmap
bit then applies this function (returned by maybe 1 (\_ -> -1)
) to the value wrapped by the next argument:
>>> fmap (maybe 1 (\_ -> -1)) (Just (Just '-'))
Just (-1)
>>> fmap (maybe 1 (\_ -> -1)) (Just Nothing)
Just 1
This very same code could be rewritten as follows:
(maybe 1 (\_ -> -1)) <$> (Just Nothing)
Finally, the second part:
((*) <$> sign) <*> number
The left part of it is just like the previously rewritten function from fmap
to <$>
, so this entire line can be written down as follows:
(fmap (*) sign) <*> number
What it does is applies the (*)
function (multiplication) to the value wrapped in the second argument.
In this case it is sign
, which is a Parser Integer
. This is the tricky bit: the type of this expression is not Parser Integer
.
It is Parser (a -> a)
, a parser of a function. This function, wrapped in a Parser
type, can be applied to whatever other parser returns
and hence chained together.
The <*>
operator, as mentioned before, chains the two parsers. So the entire expression applies the multiplication operation to the value
returned by the sign
parser and the value returned by the number
parser.
TL;DR: the whole thing analyzes the first character of a string (without consuming it) and returns either 1
or -1
; it then multiplies this value by the number
returned by the number
parser.
>>> parse intNumber "-123"
Just (-123,"")
>>> parse intNumber "123"
Just 123
Using a <|>
operator, one can parse integer (both negative and non-negative) numbers in this weird manner:
negativeNumber = (sat (== '-')) >> (* (-1)) <$> read <$> oneOrMore (sat isDigit)
positiveNumber = read <$> oneOrMore (sat isDigit)
number2 = negativeNumber <|> positiveNumber
>>> parse number2 "-42"
Just (42,"")
>>> parse number2 "123"
Just (123,"")
Combining parsers
To give a few practical examples of combining the parsers into something meaningful, consider the parser of an integer equation:
-3x1 + 4*x2 = 5
There are few helpful smaller parsers mentioned above, like intNumber
.
But we would need a bit more than that: an equation is a series of members (-3x1
, 4*x2
), where each member is a combination of a factor (-3
, 4
) and a variable (x1
, x2
).
There might be either an addition or subtration between members.
Lastly, equation must end with an equal sign followed by a free-standing number.
Let’s see if we can come up with a plan for this.
We can start at the bottom and define parsers for each bit of an equation or start from the top and define the parser for an entire equation and go down to every member and factor. Let’s start with the small bits, the members:
equationMember :: Parser (Integer, String)
equationMember = do
-- first, the numeric factor
factor <- equationFactor -- TODO
-- then, optionally, a multiplication sign, potentially surrounded by any number of whitespace
zeroOrMore (sat isSpace)
zeroOrOne (sat (== '*'))
zeroOrMore (sat isSpace)
-- followed by the variable name - only letters first, then either letters or numbers
nameFirst <- oneOrMore (sat isAlpha)
nameRest <- zeroOrMore (sat isAlphaNum)
-- potentially surrounded by any number of whitespace
zeroOrMore (sat isSpace)
-- lastly, return the tuple - numeric factor and the name of a variable
return (factor, nameFirst ++ nameRest)
This should be relatively straightforward to follow.
We do not have the equationFactor
defined which is used in this parser, so here it is:
equationFactor :: Parser Integer
equationFactor = do
-- optionally, a sign (+ or -)
_sign <- factorSign -- TODO
-- surrounded by any number of whitespace
zeroOrMore (sat isSpace)
-- followed by a number
factor <- fmap (fromMaybe 1) (zeroOrOne intNumber)
-- combine the numeric factor and a member sign (+ or -) for a final factor number
return (_sign * factor)
The factorSign
is yet to be implemented. The reason why I extract it into a separate parser is because one-liner would look ugly:
factorSign :: Parser Integer
factorSign = positiveSign <|> negativeSign <|> (success 1)
positiveSign = fmap (\_ -> 1) (sat (== '+'))
negativeSign = fmap (\_ -> -1) (sat (== '-'))
This works by either matching the +
sign and converting it to 1
or matching the -
sign and converting it to -1
or succeeding with 1
(by default, when there is no sign).
Lastly, an entire equation parser would be a rather simple combination of the above parsers:
equation :: Parser ([(Integer, String)], Integer)
equation = do
-- first, match all the members as a list of tuples (factor, variable name)
members <- oneOrMore equationMember
-- potentially surrounded by any number of whitespace, the '=' sign
zeroOrMore (sat isSpace)
sat (== '=')
zeroOrMore (sat isSpace)
-- the last, free-standing member
freeMember <- intNumber
-- finally, return the tuple of members and a free-standing member
return (members, freeMember)
You can try it on the equation from the example above:
>> parse equation "-3x1 + 4*x2 = 5"
Just (([(-3,"x1"),(4,"x2")],5),"")
The finished project contains a little bit more - it implements rational calculus (rational factors instead of just pure integers). The entire code is also hosted on GitHub.