NOTE: I was going to give this presentation at MacDMV this month; however, due to unforeseen circumstances I won’t be able to present. Here is what I was going to post immediately following the meet-up.


There is no better way to comprehend a new topic than by admitting you know nothing. Prior to this article I knew nothing about regex: how it was used, when to use it, or which programs could use its capabilities. What I did know is that regex could find “stuff” within text, whether that text was a simple paragraph, multi-line sentences, or a result string from CLI commands.

To start learning how to use regex I decided to use Apple’s “The Crazy Ones” text (Full version) 1 as my source and tried to extract information to better understand regex.

I am using two differently formatted versions of “The Crazy Ones”. One is a single long text string that wraps around, as seen here:

My second version is a multi lined version that can be seen here: 2

The Basics

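Below are some of the command line utilities that can use some principles of regex. However, there are limitations: depending on the tool and its options, grouping multiple searches with parentheses may need different syntax, and conditions that depend on whether an item comes before or after a result (the assertions described later) are not part of the POSIX regular expressions these tools implement.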

  • egrep or grep -E
  • sed -e
  • awk

Most of the tools that I have reviewed focus on programming languages such as Perl, Python, Ruby, JavaScript, PHP, Objective-C, etc. As you start learning new tools to assist in managing your Apple environment (munki, autopkg, puppet, Casper Suite, etc.), regex can definitely help, as these tools use programming languages to achieve their goals.

Beginning of words

Let’s play around with some examples. I’m taking my two source texts (single line and multi line) of “The Crazy Ones” to try to find “stuff” and review the results. In the pictures below you will first see an attempt to find the word “the” anywhere within my test sources, then the use of “^the” to find those letters only at the beginning.

To help illustrate, I’ve created a text table showing the patterns I searched for and the words that were matched. Below the text table are screenshots of the results, because in this case pictures are far better at explaining the results than text.

Test Options        Apple single line    Apple multi line
the (anywhere)      the, they, them      the, they, them
^the (beginning)    none                 none
The (anywhere)      The, They            The, They
^The (beginning)    none                 The, They

Single Line Results

Multi Line Results

As you can see, anchored searches are much more helpful when dealing with multi-line items than with one giant run-on string of text.
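The same difference can be reproduced in code. Here is a minimal Python sketch that uses a short excerpt of the quote as a stand-in for the full sample files (the variable names are illustrative, not part of my original tests). One thing to keep in mind: line-based tools such as grep test each line separately, whereas Python's re module only lets ^ match at the start of every line when re.MULTILINE is set.

```python
import re

# Short excerpt standing in for the full sample text (illustrative only).
single_line = "Here's to the crazy ones. The misfits. The rebels. They push the human race forward."
# Rough equivalent of the multi-line version: break after each sentence.
multi_line = single_line.replace(". ", ".\n")

# Unanchored: "The" plus any trailing word characters, anywhere in the text.
anywhere = re.findall(r"The\w*", single_line)

# Anchored on the single-line text: the string starts with "Here's", so nothing matches.
start_single = re.findall(r"^The\w*", single_line)

# Anchored on the multi-line text: with re.MULTILINE, ^ matches at the start of every line.
start_multi = re.findall(r"^The\w*", multi_line, flags=re.MULTILINE)

print(anywhere)      # ['The', 'The', 'They']
print(start_single)  # []
print(start_multi)   # ['The', 'The', 'They']
```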

End of words

To find items that exist at the end of a string, place a dollar sign ($) at the end of your search requirements, for example: foo$

Let’s try to find the periods at the end of our multi line sample text. First, we’ll deliberately try the wrong pattern, a bare “.” (period), to see what it matches.

Test Options    Apple Multi Line
.               everything!
\.$             periods as desired
\.              periods as desired

Sometimes you need to watch out for special characters and “escape” them so our program (or CLI command) understands what we want. The period is one such character: unescaped, “.” matches any character, while “\.” matches a literal period (and “\\” matches a literal backslash, escaping our escape character). Also note there is always more than one way to do something, and one way may be better than another. In this example, if I only wanted the periods at the end of sentences I should look for “\.$”, because in my writing I sometimes use an ellipsis (...) to indicate a more dramatic pause than a comma. Using “\.” alone would give me false positives on every ellipsis in my articles.
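To make the difference between the three patterns concrete, here is a small Python sketch. The two-line sample string is made up for illustration and contains an ellipsis typed as three periods.

```python
import re

# Illustrative two-line sample containing an ellipsis typed as three periods.
text = "Wait... what.\nThink different."

# Unescaped: matches every character except the newline.
any_char = re.findall(r".", text)
# Escaped: matches every literal period, including those in the ellipsis.
all_periods = re.findall(r"\.", text)
# Escaped and anchored: only the periods that end a line.
end_periods = re.findall(r"\.$", text, flags=re.MULTILINE)

print(len(any_char))     # 29
print(len(all_periods))  # 5
print(len(end_periods))  # 2
```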

Let’s find the special characters in our multi line sample text.

Test Options    Apple Multi Line
,               only commas
'               only apostrophes
(,|'|\.)        special characters

This time we’re using the pipe symbol to state “comma OR apostrophe OR period”. Notice we are not surrounding the individual characters with double quotes or single quotes. I believe this makes things harder to read, but that is the syntax being used.
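As a quick check of the “OR” behavior, here is a Python sketch on a one-line sample (the sample string is an illustrative excerpt typed with a straight apostrophe). It also shows a bracket expression, which is another way to write the same single-character choice.

```python
import re

# Illustrative excerpt, typed with a straight apostrophe.
text = "Here's to the crazy ones, the misfits."

# Alternation: comma OR apostrophe OR (escaped) period.
alternation = re.findall(r",|'|\.", text)

# A bracket expression lists the same single characters; the period needs no escape inside [].
char_class = re.findall(r"[,'.]", text)

print(alternation)  # ["'", ',', '.']
print(char_class)   # ["'", ',', '.']
```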

Words in the middle

This is where things get complicated. Slicing through line after line to find the right word, words, or data segment within a multi-line text can be difficult, as we need to spell out our conditions correctly. Furthermore, conditions may depend on what precedes or follows the desired text. The different types of requirements are as follows:

Assertions

  • foo(?=bar) Lookahead assertion. The pattern foo will only match if followed by a match of pattern bar.
  • foo(?!bar) Negative lookahead assertion. The pattern foo will only match if not followed by a match of pattern bar.
  • (?<=foo)bar Lookbehind assertion. The pattern bar will only match if preceded by a match of pattern foo.
  • (?<!foo)bar Negative lookbehind assertion. The pattern bar will only match if not preceded by a match of pattern foo.
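The four assertions can be tried out with a short Python sketch. The helper name starts and the test strings are made up for illustration. The point to notice is that an assertion only checks the surrounding text; it is not included in the match itself.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

ahead_text = "foobar foobaz"
behind_text = "foobar bazbar"

lookahead = starts(r"foo(?=bar)", ahead_text)         # only the foo in "foobar"
neg_lookahead = starts(r"foo(?!bar)", ahead_text)     # only the foo in "foobaz"
lookbehind = starts(r"(?<=foo)bar", behind_text)      # only the bar in "foobar"
neg_lookbehind = starts(r"(?<!foo)bar", behind_text)  # only the bar in "bazbar"

# The assertion is checked but not consumed: the match is just "foo".
matched_text = re.search(r"foo(?=bar)", ahead_text).group()

print(lookahead, neg_lookahead, lookbehind, neg_lookbehind)  # [0] [7] [3] [10]
print(matched_text)  # foo
```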

Character Classes

  • . Matches any character except newline. Will also match newline if single-line mode is enabled.
  • \s Matches white space characters.
  • \S Matches anything but white space characters.
  • \d Matches digits. Equivalent to [0-9].
  • \D Matches anything but digits. Equivalent to [^0-9].
  • \w Matches letters, digits and underscores. Equivalent to [A-Za-z0-9_].
  • \W Matches anything but letters, digits and underscores. Equivalent to [^A-Za-z0-9_].
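Here is a brief Python sketch of these classes applied to a made-up version string. Note that in Python 3, \w and \d also match non-ASCII letters and digits unless the re.ASCII flag is passed, so the bracket equivalents above only hold exactly for ASCII text.

```python
import re

# Made-up sample string for illustration.
text = "Mac OS X 10.10.2"

digits = re.findall(r"\d+", text)     # runs of digits
words = re.findall(r"\w+", text)      # runs of letters, digits and underscores
non_space = re.findall(r"\S+", text)  # runs of anything but white space
non_word = re.findall(r"\W", text)    # single characters that are not word characters

print(digits)     # ['10', '10', '2']
print(words)      # ['Mac', 'OS', 'X', '10', '10', '2']
print(non_space)  # ['Mac', 'OS', 'X', '10.10.2']
print(non_word)   # [' ', ' ', ' ', '.', '.']
```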

Bracket Expressions

  • [adf] Matches characters a or d or f.
  • [^adf] Matches anything but characters a, d and f.
  • [a-f] Match any lowercase letter between a and f inclusive.
  • [A-F] Match any uppercase letter between A and F inclusive.
  • [0-9] Match any digit between 0 and 9 inclusive. A bracket expression always matches a single character, so it cannot express a range of multi-digit numbers: [10-20] does not mean “10 through 20”.
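A short Python sketch of the bracket expressions above, on made-up sample strings. It also shows what [10-20] really matches, along with one possible way (an assumption of mine, not the only option) to match the numbers 10 through 20 by using alternation instead.

```python
import re

a_to_f = re.findall(r"[a-f]", "Think different")  # lowercase letters a through f
not_adf = re.findall(r"[^adf]", "fade")           # anything but a, d and f

numbers = "15 27 30 20 9"
# [10-20] is read as the single characters 1, 0-2 and 0, i.e. the set {0, 1, 2}.
single_chars = re.findall(r"[10-20]", numbers)
# One possible way to match whole numbers from 10 to 20.
ten_to_twenty = re.findall(r"\b(?:1[0-9]|20)\b", numbers)

print(a_to_f)         # ['d', 'f', 'f', 'e', 'e']
print(not_adf)        # ['e']
print(single_chars)   # ['1', '2', '0', '2', '0']
print(ten_to_twenty)  # ['15', '20']
```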

Quantifiers

  • * 0 or more. Matches will be as large as possible.
  • *? 0 or more, lazy. Matches will be as small as possible.
  • + 1 or more. Matches will be as large as possible.
  • +? 1 or more, lazy. Matches will be as small as possible.
  • ? 0 or 1. Matches will be as large as possible.
  • ?? 0 or 1, lazy. Matches will be as small as possible.
  • {2} 2 exactly.
  • {2,} 2 or more. Matches will be as large as possible.
  • {2,}? 2 or more, lazy. Matches will be as small as possible.
  • {2,4} 2, 3 or 4. Matches will be as large as possible.
  • {2,4}? 2, 3 or 4, lazy. Matches will be as small as possible.
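The difference between “as large as possible” and “as small as possible” is easiest to see side by side. A minimal Python sketch, using made-up sample strings:

```python
import re

markup = "<b>bold</b> and <i>italic</i>"
# Greedy: .+ grabs as much as it can, so one match spans the first < to the last >.
greedy = re.findall(r"<.+>", markup)
# Lazy: .+? stops at the first > it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", markup)

numbers = "1 22 333 4444 55555"
# Greedy counted quantifier: take up to four digits at a time.
counted = re.findall(r"\d{2,4}", numbers)
# Lazy counted quantifier: take only the minimum of two digits at a time.
counted_lazy = re.findall(r"\d{2,4}?", numbers)

print(greedy)        # ['<b>bold</b> and <i>italic</i>']
print(lazy)          # ['<b>', '</b>', '<i>', '</i>']
print(counted)       # ['22', '333', '4444', '5555']
print(counted_lazy)  # ['22', '33', '44', '44', '55', '55']
```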

String matching requirements are bundled together by writing their elements in-line. Use parentheses to group assertions or character classes, and add bracket expressions or curly-bracket quantifiers inside those parentheses.

Let’s find the words before periods in our multi line sample text.

Test Options             Apple Multi Line
(\.$)                    all the periods
(\S+)(\.$)               words plus periods
(\S{4})(\.$)             last four letters plus periods
(\s)(\S{4})(\.$)         preceding four-letter words ONLY, plus periods
(?<=\s)(\S{4})(?=\.$)    only the four-letter words that precede a period
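Two of these patterns can be compared with a Python sketch. The three lines below are a short excerpt standing in for the full multi-line sample, and re.MULTILINE is needed so that $ matches at the end of every line, not only at the end of the whole string.

```python
import re

# Three lines standing in for the multi-line sample text (illustrative excerpt).
text = "\n".join([
    "Here's to the crazy ones.",
    "The misfits.",
    "You can quote them, disagree with them, glorify or vilify them.",
])

# Last four non-space characters plus the period: also catches the tail of longer words.
last_four = re.findall(r"(\S{4})(\.$)", text, flags=re.MULTILINE)

# Lookbehind for white space and lookahead for the closing period:
# only whole four-letter words survive, and the period is not part of the match.
four_letter_words = re.findall(r"(?<=\s)(\S{4})(?=\.$)", text, flags=re.MULTILINE)

print(last_four)          # [('ones', '.'), ('fits', '.'), ('them', '.')]
print(four_letter_words)  # ['ones', 'them']
```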

Multiple words with the same meaning

Sometimes two (or more) different spellings can represent the same thing in a string; the biggest example is dates. You could write dates with the month as a number, the full name of the month, or just the abbreviation. You could also add “st”, “nd”, or “th” at the end of the day. Not knowing the pattern your strings will follow could work against you, so either a) force a pattern or b) take all possibilities into account.

The way we search for optional items is with a question mark placed after the optional letter or number. If a group of items needs to be optional, group them together with parentheses and put the question mark after the group. For example, I want to match all the ways of writing March 18th, which could be written as:

  • March 18th
  • March 18
  • Mar 18th
  • Mar 18

My regex search string would be: Mar(ch)? 18(th)? This gives me the option to include the “ch” at the end of March and the “th” at the end of 18th. If I also wanted to include “3/18” or “03/18” as possible date formats, I would need to expand my search string to: (\d{1,2}|Mar(ch)?)( ?/?)18(th)?

(
	\d{1,2}		# Search for a number that is 1-2 digits in length
	|		# "OR"
	Mar(ch)?	# Start with "Mar" but could have "ch" at the end.
)
(
	[ ]?		# There might be a space (bracketed so it survives this commented layout)
	/?		# There might be a forward slash
)
(
	18(th)?		# The last group will have 18, but might end with "th"
)
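A commented layout like this can be used directly in languages that support a “verbose” mode. Below is a minimal Python sketch using re.VERBOSE, in which white space and # comments inside the pattern are ignored (which is why the literal space is written as [ ]). The candidates list and variable names are made up for illustration.

```python
import re

date_pattern = re.compile(r"""
    ( \d{1,2} | Mar(ch)? )   # a 1-2 digit number OR "Mar" with an optional "ch"
    ( [ ]? /? )              # there might be a space, there might be a forward slash
    18(th)?                  # 18, which might end with "th"
""", re.VERBOSE)

candidates = ["March 18th", "March 18", "Mar 18th", "Mar 18", "3/18", "03/18", "April 18"]

# fullmatch requires the whole candidate string to fit the pattern.
matched = [c for c in candidates if date_pattern.fullmatch(c)]

print(matched)  # ['March 18th', 'March 18', 'Mar 18th', 'Mar 18', '3/18', '03/18']
```

Keep in mind that this pattern is permissive: \d{1,2} accepts any one- or two-digit number, so “12/18” would match as well. If only March should match, the number alternative would need to be tightened (for example to 0?3).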

Tools

Applications

Online

Footnotes

  1. There are actually three versions of “The Crazy Ones” per this Wikipedia article: Original, Full version, and Short version. 

  2. To create the multi-line version I used sed to substitute each “. ” with “.\n” (a period then a line break). There is an OS X sed issue when trying to substitute with line breaks that is outlined at: http://stackoverflow.com/questions/6111679/insert-linefeed-in-sed-mac-os-x. To work around it I used the following command: bash-3.2$ cat Apple.quote | sed 's/\. /\. \=/g' | tr "=" "\n"
