This post summarizes the top questions asked about Java regular expressions. Because they are the most frequently asked, you will probably find them useful as well.
1. How to extract numbers from a string?
A common task is to extract all the numbers in a string into a list of integers.
In Java, \d is a predefined character class that matches any digit (0-9). Using the predefined classes whenever possible makes your code easier to read and eliminates errors introduced by malformed character classes. Please refer to Predefined character classes for more details. Note the extra backslash in \\d. If you use an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile. That is why the code uses \\d.
    // str is the input string
    List<Integer> numbers = new LinkedList<Integer>();
    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher(str);
    while (m.find()) {
        numbers.add(Integer.parseInt(m.group()));
    }
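As a self-contained variant of the snippet above, here is a minimal sketch. The class name ExtractNumbers and the method name extract are hypothetical, chosen only for illustration. It also shows that the Java literal "\\d+" is the three-character regex \d+. Note that Integer.parseInt throws NumberFormatException for digit runs that do not fit in an int, and that signs are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ExtractNumbers {
    // "\\d+" in Java source is the regex \d+ (one or more digits).
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every run of digits in the text as an Integer.
    static List<Integer> extract(String text) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println("\\d+".length());               // 3
        System.out.println(extract("abc 12 def 7, 2024")); // [12, 7, 2024]
    }
}
```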
2. How to split Java String by newlines?
There are at least three different conventions for ending a line, depending on the operating system that produced the text.
- \r represents CR (Carriage Return), which was used in classic Mac OS (version 9 and earlier)
- \n means LF (Line Feed), used in Unix, Linux, and macOS
- \r\n means CR + LF, used in Windows
Therefore the most straightforward way to split a string by newlines is
    // split() is an instance method; str is the input string
    String[] lines = str.split("\\r?\\n");
But if you don't want empty lines, you can use the following, which is also my favourite way:
    String[] lines = str.split("[\\r\\n]+");
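Another option is to split on the line separator of the platform the program is running on. Keep in mind that this is only reliable when the text was produced on that same platform, and that you will still get empty lines if two newline characters are placed side by side.
    String[] lines = str.split(System.getProperty("line.separator"));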
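To compare the options on text with mixed line endings, here is a minimal sketch. The class and method names are hypothetical. The third method uses \R, the linebreak matcher that java.util.regex.Pattern supports from Java 8 onward; it is not mentioned in the discussion above and is included only as a further alternative.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
class SplitLines {
    // Splits on LF or CR+LF; a lone CR is not treated as a line break.
    static String[] onLf(String s) {
        return s.split("\\r?\\n");
    }

    // Splits on any run of CR/LF characters, so empty lines disappear.
    static String[] skippingEmpty(String s) {
        return s.split("[\\r\\n]+");
    }

    // Splits on any single linebreak sequence (Java 8+), keeping empty lines.
    static String[] onAnyLinebreak(String s) {
        return s.split("\\R");
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\n\nthree\rfour";
        System.out.println(onLf(text).length);                     // 4
        System.out.println(Arrays.toString(skippingEmpty(text)));  // [one, two, three, four]
        System.out.println(onAnyLinebreak(text).length);           // 5
    }
}
```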
3. Importance of Pattern.compile()
A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The static Pattern.compile() methods are the only way to create such an instance, because Pattern has no public constructor. A typical invocation sequence is thus
    Pattern p = Pattern.compile("a*b");
    Matcher matcher = p.matcher("aaaaab");
    assert matcher.matches() == true;
Essentially, Pattern.compile() transforms a regular expression into a finite state machine (see Compilers: Principles, Techniques, and Tools (2nd Edition)). All of the state involved in performing a match resides in the matcher, however. As a result, the Pattern p can be reused, and many matchers can share the same pattern.
    Matcher anotherMatcher = p.matcher("aab");
    assert anotherMatcher.matches() == true;
The Pattern.matches() method is defined as a convenience for when a regular expression is used just once. It still calls compile() implicitly to obtain a Pattern instance and then matches the string. Therefore,
    boolean b = Pattern.matches("a*b", "aaaaab");
is equivalent to the first snippet above, though for repeated matches it is less efficient because it does not allow the compiled pattern to be reused.
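One common way to apply this is to compile the pattern once into a constant and create a new matcher per input. The following is a minimal sketch; the class name ReusePattern and the method countMatches are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ReusePattern {
    // Compiled once and shared; a fresh Matcher holds the state of each match.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts how many inputs match the whole pattern a*b.
    static int countMatches(String[] inputs) {
        int count = 0;
        for (String s : inputs) {
            if (A_STAR_B.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"aaaaab", "aab", "b", "abc", "ba"};
        System.out.println(countMatches(inputs)); // 3
    }
}
```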
4. How to escape text for regular expression?
In general, regular expressions use "\" to escape constructs, but it is painful to precede every backslash with another backslash for the Java string to compile. There is another way to pass a string literal, such as "$5", to a Pattern. Instead of writing \\$5 or [$]5, we can type
    Pattern.quote("$5");
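To see the difference in practice, here is a minimal sketch that searches for the literal text "$5" with and without quoting. The class name QuoteLiteral and the method containsLiteral are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class QuoteLiteral {
    // Treats 'literal' as plain text, not as a regular expression.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        // quote() wraps the text in \Q...\E
        System.out.println(Pattern.quote("$5"));                               // \Q$5\E
        // Unquoted, $ is the end-of-input anchor, so "$5" can never match.
        System.out.println(Pattern.compile("$5").matcher("price: $5").find()); // false
        System.out.println(containsLiteral("price: $5", "$5"));                // true
    }
}
```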
5. Why does String.split() need pipe delimiter to be escaped?
String.split() splits a string around matches of the given regular expression. Java regular expressions support special characters, called metacharacters, that affect the way a pattern is matched. | is one such metacharacter; it is used to match one regular expression out of several alternatives. For example, A|B means either A or B. Please refer to Alternation with The Vertical Bar or Pipe Symbol for more details. Therefore, to use | as a literal, you need to escape it by adding \ in front of it, which is written \\| in a Java string literal.
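The effect can be checked with a short sketch. The class and method names are hypothetical. Pattern.quote("|") is shown as an alternative to the manual escape.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class PipeSplit {
    // Escaped pipe: splits on the literal | character.
    static String[] splitOnPipe(String s) {
        return s.split("\\|");
    }

    // Same result, letting Pattern.quote do the escaping.
    static String[] splitOnQuotedPipe(String s) {
        return s.split(Pattern.quote("|"));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnPipe("a|b|c")));       // [a, b, c]
        System.out.println(Arrays.toString(splitOnQuotedPipe("a|b|c"))); // [a, b, c]
        // Unescaped, | means "empty OR empty", which matches between
        // characters, so the result is not the three fields.
        System.out.println("a|b|c".split("|").length == 3);              // false
    }
}
```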
6. How can we match a^n b^n with Java regex?
This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's, like ab, aabb, and aaabbb. This language is generated by the context-free grammar S → aSb | ab, and it is a classic example of a language that is context-free but not regular.
However, Java regex implementations can recognize more than just regular languages. That is, they are not "regular" in the sense of formal language theory. A combination of lookahead and a self-referencing group achieves it. Here I will give the final regular expression first, then explain it a little. For a comprehensive explanation, I refer you to How can we match a^n b^n with Java regex.
    Pattern p = Pattern.compile("(?x)(?:a(?= a*(\\1?+b)))+\\1");
    // true
    System.out.println(p.matcher("aaabbb").matches());
    // false
    System.out.println(p.matcher("aaaabbb").matches());
    // false
    System.out.println(p.matcher("aaabbbb").matches());
    // false
    System.out.println(p.matcher("caaabbb").matches());
Instead of explaining the syntax of this complex regular expression, I would rather say a little about how it works.
- In the first iteration, it stops at the first a, then looks ahead (after skipping some a's by using a*) to check whether there is a b. This is achieved by (?:a(?= a*(\\1?+b))). If it matches, \1, the self-referencing group, holds whatever the innermost parentheses matched, which is a single b in the first iteration.
- In the second iteration, the expression stops at the second a, then looks ahead (again skipping a's) to see whether there is a b. But this time \\1?+b is effectively equivalent to bb, therefore two b's have to be matched. If so, \1 is changed to bb after the second iteration.
- In the nth iteration, the expression stops at the nth a and checks whether there are n b's ahead.
In this way, the expression counts the a's and matches only if they are followed by the same number of b's.
7. How to replace 2 or more spaces with single space in string and delete leading spaces only?
String.replaceAll() replaces each substring that matches the given regular expression with the given replacement. A run of spaces can be expressed by the regular expression [ ]+ (strictly, "2 or more" would be [ ]{2,}, but replacing a single space with a single space changes nothing). The following code uses [\\s]+, which also covers tabs and line breaks. Note that this solution does not remove leading and trailing whitespace; it only collapses each run to a single space. If you would like it deleted, you can add String.trim() to the pipeline.
    String line = "  aa bbbbb   ccc     d  ";
    // " aa bbbbb ccc d "
    System.out.println(line.replaceAll("[\\s]+", " "));
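The heading also asks about deleting leading spaces only, which String.trim() does not do because it strips both ends. A minimal sketch of one way to handle that, assuming an anchored pattern ^\\s+ is acceptable; the class and method names are hypothetical.

```java
// Hypothetical helper class, for illustration only.
class NormalizeSpaces {
    // Collapses every run of two or more spaces into one space.
    static String collapse(String s) {
        return s.replaceAll(" {2,}", " ");
    }

    // Removes whitespace at the start of the string only.
    static String stripLeading(String s) {
        return s.replaceAll("^\\s+", "");
    }

    public static void main(String[] args) {
        String line = "  aa bbbbb   ccc     d  ";
        // Trailing space is kept, leading space is removed.
        System.out.println("[" + stripLeading(collapse(line)) + "]"); // [aa bbbbb ccc d ]
    }
}
```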
8. How to determine if a number is a prime with regex?
    public static void main(String[] args) {
        // false
        System.out.println(prime(1));
        // true
        System.out.println(prime(2));
        // true
        System.out.println(prime(3));
        // true
        System.out.println(prime(5));
        // false
        System.out.println(prime(8));
        // true
        System.out.println(prime(13));
        // false
        System.out.println(prime(14));
        // false
        System.out.println(prime(15));
    }

    public static boolean prime(int n) {
        return !new String(new char[n]).matches(".?|(..+?)\\1+");
    }
The function first generates a string of n characters and tests whether that string matches .?|(..+?)\\1+. If n is prime, the match fails and the ! reverses the result.
The first part, .?, makes sure that 0 and 1 are not reported as prime. The magic is in the second part, where a backreference is used: (..+?)\\1+ first matches a group of two or more characters, then requires that group to repeat one or more times by \\1+.
By definition, a prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself. That means if n = k*m with k and m both greater than 1, then n is not prime. k*m can be read as "repeat k characters m times", and that is exactly what the regular expression does: it matches k characters with (..+?), then repeats them m-1 more times with \\1+. Therefore, if the pattern matches, the number is not prime; otherwise it is. Remember that ! reverses the result.
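To make the "repeat a group" idea visible, the sketch below reports the length of the captured group, which is the smallest divisor greater than 1 that the lazy quantifier finds. The class name RegexDivisor and the method smallestDivisor are hypothetical, and the method assumes n is not negative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RegexDivisor {
    // Second alternative of the prime test: a group of 2+ characters, repeated.
    private static final Pattern COMPOSITE = Pattern.compile("(..+?)\\1+");

    // Returns the smallest divisor d (2 <= d < n) of n, or -1 if the
    // pattern does not match (n is prime, or n is 0 or 1). Assumes n >= 0.
    static int smallestDivisor(int n) {
        // A string of n identical characters, i.e. n written in unary.
        String unary = new String(new char[n]).replace('\0', 'x');
        Matcher m = COMPOSITE.matcher(unary);
        return m.matches() ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        System.out.println(smallestDivisor(15)); // 3
        System.out.println(smallestDivisor(49)); // 7
        System.out.println(smallestDivisor(13)); // -1
    }
}
```

Because the quantifier in (..+?) is lazy, the group starts at length 2 and grows only when the repetition fails to cover the whole string, which is why the first successful group length is the smallest divisor.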
9. How to split a comma-separated string but ignoring commas in quotes?
You have reached the point where regular expressions break down. It is better and neater to write a simple splitter and handle special cases as you wish.
Alternatively, you can mimic the operation of a finite state machine by using a switch statement or if-else. Below is a snippet of code.
    // requires: import java.util.ArrayList; import java.util.List;
    public static void main(String[] args) {
        String line = "aaa,bbb,\"c,c\",dd;dd,\"e,e";
        List<String> toks = splitComma(line);
        for (String t : toks) {
            System.out.println("> " + t);
        }
    }

    private static List<String> splitComma(String str) {
        int start = 0;
        List<String> toks = new ArrayList<String>();
        boolean withinQuote = false;
        for (int end = 0; end < str.length(); end++) {
            char c = str.charAt(end);
            switch (c) {
            case ',':
                // a comma ends the token only outside quotes
                if (!withinQuote) {
                    toks.add(str.substring(start, end));
                    start = end + 1;
                }
                break;
            case '\"':
                // each quote toggles the in-quote state
                withinQuote = !withinQuote;
                break;
            }
        }
        // add the last token, if any
        if (start < str.length()) {
            toks.add(str.substring(start));
        }
        return toks;
    }
10. How to use backreferences in Java Regular Expressions
Backreferences are another useful feature of Java regular expressions.
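As a small illustration, here is a minimal sketch that uses the backreference \1 to find a word repeated immediately after itself, and the replacement reference $1 to keep a single copy. The class name Backreferences and its methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class Backreferences {
    // Group 1 captures a word; \1 must then match the same text again.
    // The \b anchors prevent matches inside longer words ("the then").
    private static final Pattern REPEATED_WORD =
            Pattern.compile("\\b(\\w+)(?:\\s+\\1\\b)+");

    // True if some word is immediately followed by itself.
    static boolean hasRepeat(String text) {
        return REPEATED_WORD.matcher(text).find();
    }

    // Replaces each run of a repeated word with one copy ($1 is group 1).
    static String dedupe(String text) {
        return REPEATED_WORD.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(hasRepeat("this is is a test"));        // true
        System.out.println(dedupe("this is is a test test test")); // this is a test
    }
}
```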
What is an easy way to check whether a string is a URL? Please explain.
True. Modified
Very nice article, good explanation of how to solve the problems with regex.
Before using regex for everything, take a look at the Google Guava Splitter class and Apache Commons Lang StringUtils.
Pretty cool 🙂
As an alternative you can use “\s+” on item 7, it’ll handle tabs too.