Last modified: May 19, 2000 (Japanese Version)
Regular Expressions for Beginners
Yasumasa Someya
"Regular expressions" are combinations of special characters and symbols
used for pattern matching; i.e., you specify a particular combination
of such characters and symbols (= a regular expression) and the computer
will search the text data (the BLC in our case) for every string that
matches it. The following is a very short, and hopefully easy-to-follow,
introduction to some of the most useful regular expressions you will want
to know when using the BLC concordancer.
1. Ordinary alphanumerics: Ordinary letters (a, b, c, ...;
A, B, C, ...) and numerals (1, 2, 3, ...) will match "as is" -- as in the following
examples:
| you specify | and it will match |
| a | a |
| word | word |
| 123 | 123 |
| this word | this word |
2. Major regular expression symbols and their meanings
| RegEx Symbols | Meaning |
| * | Match the preceding item 0 or more times |
| + | Match the preceding item 1 or more times |
| . | Match any single character, including space |
| ^ | Match the beginning of a line (if used inside square brackets, this means "NOT") => See Note below. |
| [ ] | Character class (match any one of the enclosed characters) |
| ( ) | Grouping |
| | (vertical bar) | Alternation (match either alternative) |
Note: The ^ symbol (called "caret") is not accepted as a beginning-of-line
marker by the current BLC concordancer due to the particular data structure
of the corpus. It can, however, be used within square brackets.
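The two roles of the caret can be tried outside the concordancer. The sketch below uses Python's `re` module as a stand-in, which is an assumption: the BLC concordancer's own engine may behave differently, and the sample text is made up for illustration.

```python
import re

# Hypothetical two-line sample text (not taken from the BLC).
text = "Word one\nanother Word 42"

# Outside square brackets, ^ anchors the match to the start of a line.
# re.MULTILINE lets ^ match after each newline as well as at the start of the text.
line_start = [m.start() for m in re.finditer(r"^Word", text, flags=re.MULTILINE)]

# Inside square brackets, ^ negates the class: any character that is NOT a letter.
non_letters = re.findall(r"[^a-zA-Z]", text)

print(line_start)   # positions where "Word" begins a line
print(non_letters)  # spaces, the newline, and the digits
```

Only the first "Word" is reported by the anchored pattern, because the second one sits in the middle of its line.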
3. Examples
| you specify | and it will... |
| a* | match 0 or more occurrences of "a" (e.g. the empty string, a, aa, aaa, ...) |
| a+ | match 1 or more occurrences of "a" (e.g. a, aa, aaa, ...) |
| ... | match any sequence of three characters, including spaces (with a space added before and after, it means "any single word consisting of three characters") |
| ^Word | match "Word" at the beginning of a line/sentence (=> this symbol, however, is not accepted at the moment) |
| [abc] | match either "a" or "b" or "c" |
| [a-z] | match any one lowercase letter |
| [A-Z] | match any one UPPERCASE letter |
| [0-9] | match any one of the digits 0 through 9 |
| [a-zA-Z0-9] | match any one letter or digit |
| [a-z]+ | match a single word of any length consisting of lowercase letters |
| [A-Za-z]+ | match a single word of any length, consisting of uppercase and/or lowercase letters |
| [^a-zA-Z] | match any one character other than a letter (i.e., space, numbers, punctuation marks and symbols) |
| (aaa|bbb|ccc) | match either "aaa" or "bbb" or "ccc" |
| ab(c|cd|cde) | match either "abc" or "abcd" or "abcde" |
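The patterns in this table can be checked quickly with a short script. This is a minimal sketch that assumes Python's `re` module behaves like the concordancer for these basic symbols. The sample strings and the helper `check` are hypothetical and not drawn from the BLC. `re.fullmatch` is used so that each sample must match the pattern from start to end.

```python
import re

# Each entry: (pattern from the table, samples that should match in full,
# samples that should not). All sample strings are illustrative.
cases = [
    (r"a*", ["", "a", "aaa"], ["b"]),
    (r"a+", ["a", "aaa"], [""]),
    (r"...", ["abc", "a c"], ["ab"]),
    (r"[abc]", ["a", "b", "c"], ["d"]),
    (r"[a-z]+", ["word"], ["Word"]),
    (r"[A-Za-z]+", ["Word"], ["word1"]),
    (r"[^a-zA-Z]", [" ", "7", "!"], ["x"]),
    (r"(aaa|bbb|ccc)", ["aaa", "ccc"], ["abc"]),
    (r"ab(c|cd|cde)", ["abc", "abcd", "abcde"], ["ab"]),
]

def check(pattern, should_match, should_not_match):
    """Return True if the pattern fully matches every positive sample and no negative one."""
    hits = all(re.fullmatch(pattern, s) for s in should_match)
    misses = not any(re.fullmatch(pattern, s) for s in should_not_match)
    return hits and misses

results = {pattern: check(pattern, yes, no) for pattern, yes, no in cases}
for pattern, ok in results.items():
    print(pattern, "OK" if ok else "unexpected")
```

In a concordancer search the pattern is looked for anywhere in the text rather than against a whole string, so `a*` on its own would also "match" at every position where no "a" occurs; it is normally useful only as part of a longer pattern.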
Since RegEx symbols such as those listed in section 2 above (called
"metacharacters") have special meanings, they must be properly "escaped"
with a preceding backslash (\) if you want to match them literally. For
instance, if you want to find any sequence of digits preceded by a plus
sign, your search string should be:
\+[0-9]+
which will match, for instance, +123, +5427, and so on
(Note: There are no such instances in the BLC, however). Likewise,
if you want to search for instances of any single word within round
brackets, your search string would be:
\([a-zA-Z]+\)
which will match, for instance, (s), (txt), (Japan), etc. Due to the
particular data structure of the BLC, however, you need to add a full
stop (i.e., the RegEx symbol that also matches a space) before and after
every punctuation mark in cases like this. Thus, your RegEx search
string to match instances like ( s ), ( txt ), ( Japan ), etc. should be:
.\(.[a-zA-Z]+.\).
Or, if you want to allow two or more words within the brackets,
simply specify the same RegEx string without the closing bracket (it then
matches the opening bracket and the first word that follows it), as follows:
.\(.[a-zA-Z]+
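The escaped patterns above can be tried on a small sample. The following is a sketch only, again assuming Python's `re` module as a stand-in for the concordancer; the sample strings are invented, and the space-separated punctuation imitates the BLC layout as described in the text.

```python
import re

# Illustrative sample text. In `spaced`, punctuation is set off by spaces,
# which is how the text above describes the BLC data.
plain = "call +123 or +5427 (txt) (Japan)"
spaced = "see ( s ) and ( txt ) and ( Japan ) here"

# \+ matches a literal plus sign; the unescaped + means "one or more digits".
signed_numbers = re.findall(r"\+[0-9]+", plain)

# \( and \) match literal round brackets around a single word.
bracketed = re.findall(r"\([a-zA-Z]+\)", plain)

# With space-separated punctuation, a full stop stands in for each space.
bracketed_spaced = re.findall(r".\(.[a-zA-Z]+.\).", spaced)

# Without the closing bracket, the match stops after the first word.
opening_only = re.findall(r".\(.[a-zA-Z]+", "a ( New York ) b")

print(signed_numbers)
print(bracketed)
print(bracketed_spaced)
print(opening_only)
```

Because the full stop matches any character, the spaced pattern would also accept other characters around the brackets; it works for the BLC only because spaces are what actually appear there.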
I've tested all these regular expressions, and in most cases they work
fine and return the expected results. If you don't get what you want,
make sure your regular expression is correct and try again, or simply forget
it.
(c) 2000 Yasumasa Someya