usage
creating and matching regular expressions
string regular expressions
You can create a regular expression via a String literal:
import ceedubs.irrec.regex._
import ceedubs.irrec.parse.{regex => r}
// a RegexC[A] is a regex that parses a `C`har input sequence into an `A` result
val unitCount: RegexC[Int] = r("""\d{1,3}""").map(_.toInt)
You'll even get a compile-time error if the regex is invalid:
val invalidUnitCount: RegexC[Int] = r("""\d{1,-3}""").map(_.toInt)
// error: Error compiling regular expression: Expected repeat count such as '{3}', '{1,4}', `{1, 4}?`, '{3,}', or `{3,}?:1:3, found "{1,-3}"
// val invalidUnitCount: RegexC[Int] = r("""\d{1,-3}""").map(_.toInt)
// ^^^^^^^^^^^^^^^^^
You can also build a regular expression using the methods in the combinator object, standard Applicative/Alternative methods, and irrec's DSL for combining regexes.
- | denotes that either the expression on the left or the right needs to match.
- .star denotes the Kleene star (repeat 0 to many times).
- Applicative methods such as <*, *>, and mapN indicate one match followed by another.
import ceedubs.irrec.regex.combinator._
import Greediness.Greedy
import cats.implicits._
import java.time.Duration
import java.time.temporal.ChronoUnit
val chronoUnit: RegexC[ChronoUnit] = r("ms|millis?").as(ChronoUnit.MILLIS) |
r("s|seconds?").as(ChronoUnit.SECONDS) |
r("m|mins?|minutes?").as(ChronoUnit.MINUTES)
val duration: RegexC[Duration] = (
r("-|negative ").optional(Greedy).map(_.isDefined),
unitCount <* lit(' ').?,
chronoUnit
).mapN{ (isNegative, count, unit) =>
val d = Duration.of(count.toLong, unit)
if (isNegative) d.negated else d
}
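The function passed to mapN relies only on java.time, so it can be tried on its own. The following is a minimal standalone sketch of that computation; the object name DurationSketch and the helper toDuration are hypothetical and not part of irrec.

```scala
import java.time.Duration
import java.time.temporal.ChronoUnit

object DurationSketch {
  // Hypothetical helper mirroring the body of the mapN call above.
  def toDuration(isNegative: Boolean, count: Int, unit: ChronoUnit): Duration = {
    val d = Duration.of(count.toLong, unit)
    if (isNegative) d.negated else d
  }

  // Values corresponding to the inputs "3ms" and "negative 10 seconds".
  val threeMillis: Duration =
    toDuration(isNegative = false, count = 3, unit = ChronoUnit.MILLIS)
  val minusTenSeconds: Duration =
    toDuration(isNegative = true, count = 10, unit = ChronoUnit.SECONDS)
}
```

Here DurationSketch.threeMillis.toString is "PT0.003S" and DurationSketch.minusTenSeconds.toString is "PT-10S", the ISO-8601 form that java.time uses when printing a Duration.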
Once you've built a regular expression, you can compile it and parse input with it.
val durationParser: ParseState[Char, Duration] = duration.compile
// parseOnly parses a sequence of input elements
durationParser.parseOnly(Stream('3', 'm', 's'))
// res1: Option[Duration] = Some(value = PT0.003S)
// parseOnlyS is specialized to String input
durationParser.parseOnlyS("3ms")
// res2: Option[Duration] = Some(value = PT0.003S)
durationParser.parseOnlyS("negative 10 seconds")
// res3: Option[Duration] = Some(value = PT-10S)
durationParser.parseOnlyS("eleventy buckets")
// res4: Option[Duration] = None
pretty printing
Regular expressions can be printed in a (hopefully) POSIX style:
duration.pprint
// res5: String = "(-|negative\\u0020)?[0-9]{1,3}\\u0020?(ms|millis?|s|seconds?|m|mins?|minutes?)"
Java Pattern
Regular expressions can be converted to a java.util.regex.Pattern:
duration.toPattern
// res6: java.util.regex.Pattern = (-|negative\u0020)?[0-9]{1,3}\u0020?(ms|millis?|s|seconds?|m|mins?|minutes?)
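A converted Pattern can be used like any other Java regular expression. The sketch below uses only the standard library: it compiles the pattern string from the pretty-printed output above rather than calling irrec, and the names PatternSketch and accepts are hypothetical.

```scala
import java.util.regex.Pattern

object PatternSketch {
  // Pattern string copied from the pretty-printed output of `duration`.
  val durationPattern: Pattern = Pattern.compile(
    "(-|negative\\u0020)?[0-9]{1,3}\\u0020?(ms|millis?|s|seconds?|m|mins?|minutes?)")

  // matches() requires the whole input to match, not just a prefix.
  def accepts(input: String): Boolean =
    durationPattern.matcher(input).matches()
}
```

With this sketch, PatternSketch.accepts("3ms") and PatternSketch.accepts("negative 10 seconds") are true, while PatternSketch.accepts("eleventy buckets") is false. Note that a Pattern only reports whether (and where) the input matches; it does not produce a Duration the way the compiled irrec parser does.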
generating random data
random matches for a regular expression
Irrec provides Scalacheck generators that produce values that match a regular expression. These can be useful for tests, or even just for glancing at random matches to ensure that your regular expression does what you intended. Check out the regex-explorer for interactive browser-based regular expression exploration powered by irrec.
import ceedubs.irrec.regex.gen._, CharRegexGen._
import org.scalacheck.Gen
import org.scalacheck.rng.Seed
val genDurationString: Gen[String] = genRegexMatchingString(duration)
Gen.listOfN(3, genDurationString).apply(Gen.Parameters.default, Seed(1046531L))
// res7: Option[List[String]] = Some(
// value = List("35 millis", "5seconds", "3milli")
// )
Were all of those results input that you intended your regular expression to accept?
This generation is done efficiently as opposed to generating a bunch of random values and then filtering the ones that don't match the regular expression (which would quickly lead to Scalacheck giving up on generating matching values).
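To see why filtering scales poorly, consider the \d{1,3} part of the regular expression alone. A random string of three printable ASCII characters consists only of digits with probability (10/95)^3, roughly 0.12%. The toy sketch below compares the two strategies using only the standard library; it illustrates the idea and is not irrec's actual implementation, and all names in it are hypothetical.

```scala
import scala.util.Random

object GenerationSketch {
  private val rng = new Random(42L)
  // The 95 printable ASCII characters, from space to tilde.
  private val printable: Vector[Char] = (' ' to '~').toVector

  // Strategy 1: generate arbitrary 3-character strings, then filter.
  def randomString(): String =
    List.fill(3)(printable(rng.nextInt(printable.size))).mkString

  val attempts: Int = 10000
  // Number of random strings that happen to match the regular expression.
  val accepted: Int =
    List.fill(attempts)(randomString()).count(_.matches("[0-9]{1,3}"))

  // Strategy 2: build a match directly by following the regex structure.
  def directMatch(): String = {
    val length = 1 + rng.nextInt(3) // 1 to 3 digits
    List.fill(length)(('0' + rng.nextInt(10)).toChar).mkString
  }

  val direct: List[String] = List.fill(100)(directMatch())
}
```

Every string in GenerationSketch.direct matches, while GenerationSketch.accepted is only a small fraction of the 10000 attempts. For a longer regular expression such as duration, the fraction of random strings that match would be smaller still.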
Sometimes you may want to generate both matches and non-matches for your regular expression to make sure that both cases are handled. There are various Gen instances that will generate input that matches the regular expression roughly half of the time.
val genDurationCandidateString: Gen[String] =
Gen.resize(12, genRegexCandidateString(duration))
val genExamples: Gen[List[(String, Option[Duration])]] =
Gen.listOfN(4, genDurationCandidateString).map(candidates =>
candidates.map(candidate => (candidate, durationParser.parseOnlyS(candidate)))
)
genExamples.apply(Gen.Parameters.default, Seed(1046513L))
// res8: Option[List[(String, Option[Duration])]] = Some(
// value = List(
// ("negative 2milli", Some(value = PT-0.002S)),
// ("7 minute", Some(value = PT7M)),
// ("55ms", Some(value = PT0.055S)),
// ("\u0002\u001ag#\u0000\u001c\u007f3f\\rs", None)
// )
// )
random regular expressions
Irrec provides support for creating random (valid) regular expressions along with potential matches for them.
val regexGen: Gen[RegexC[List[Long]]] = Gen.resize(16, genAsciiRegex)
val randomRegex1: RegexC[List[Long]] = regexGen.apply(Gen.Parameters.default, Seed(10570573L)).get
randomRegex1.pprint
// res9: String = "c[^#\\-=ESZadhy]\\^"
You can now generate random data to match this regular expression as described above under "random matches for a regular expression". Alternatively, you can generate a regular expression and a match for it in one step:
val regexAndMatchGen: Gen[RegexAndCandidate[Char, Double]] =
Gen.resize(12, genAlphaNumRegexAndMatch)
val regexesAndMatchesGen: Gen[List[RegexAndCandidate[Char, Double]]] =
Gen.listOfN(4, regexAndMatchGen)
val regexesAndMatches: List[RegexAndCandidate[Char, Double]] = regexesAndMatchesGen.apply(Gen.Parameters.default.withSize(30), Seed(105769L)).get
regexesAndMatches.map(x =>
(x.r.pprint, x.candidate.mkString)
)
// res10: List[(String, String)] = List(
// ("9[02CFKNUYu][^8C](k|M|w|[E])", "922M"),
// ("y[^7]", "yl"),
// ("(6mP)*?l", "6mP6mPl"),
// ("K3?", "K")
// )
Sometimes you may want to generate both matches and non-matches for your random regular expression to make sure that both cases are handled. There are various Gen instances for RegexAndCandidate that will generate random regular expressions along with data that matches the regular expression roughly half of the time.
val regexesAndCandidatesGen: Gen[List[RegexAndCandidate[Char, Double]]] =
Gen.listOfN(4, genAlphaNumRegexAndCandidate)
val regexesAndCandidates: List[RegexAndCandidate[Char, Double]] = regexesAndCandidatesGen.apply(Gen.Parameters.default.withSize(15), Seed(105361L)).get
regexesAndCandidates.map(x =>
(x.r.pprint, x.candidate.mkString, x.r.compile.parseOnly(x.candidate))
)
// res11: List[(String, String, Option[Double])] = List(
// ("(((VD)*?g)?){0}0.", "H74xSj2tHB", None),
// ("6k[^47Elo]", "sHsdx0su06hN", None),
// ("572Z", "572Z", Some(value = -3.515633435242795E-128)),
// ("1|[^8O]|f{0,1}", "c", Some(value = 6.438824451059825E-241))
// )
non-string regular expressions
While RegexC is the most common choice, irrec supports regular expressions for types other than chars/strings. For example, if your input is a stream of integers instead of a string:
import Greediness._
val numRegex: RegexM[Int, Unit] = lit(1).*.void <* range(2, 4).repeat(1, Some(3), Greedy) <* oneOf(5, 6).oneOrMore(Greedy)
val numMatcher: Stream[Int] => Boolean = numRegex.matcher[Stream]
numMatcher(Stream(1, 2, 5))
// res12: Boolean = true
numMatcher(Stream(1, 1, 1, 2, 4, 5, 6, 5))
// res13: Boolean = true
numMatcher(Stream(0, 5, 42))
// res14: Boolean = false
The M in RegexM stands for Match. It is useful for matches on any discrete input type such as characters or numbers.