Masking PII with Ruby gsub with Regular Expression Named Match Groups, Non-Greedy

ruby · January 25, 2015 · by Justin Gordon

In this article, you'll learn:

  1. How to effectively use rubular.com and the Ruby console to get the correct regular expression syntax.
  2. The difference between .* and .*? (greedy and non-greedy matching).
  3. What regular expression named capture groups are and why you should use them.
  4. How to use String#gsub with and without the block syntax, and with and without named capture groups.

Suppose you have to filter PII (Personally Identifiable Information) out of log entries that look like the HTML below. We don't want the following PII fields to show their values: email, social_security_number, and date_of_birth.

Input HTML

Updated User (4)<br>
     changed first_name to &quot;Karina&quot;<br>
     changed last_name to &quot;Senger&quot;<br>
     changed phone to &quot;2133432154&quot;<br>
     changed email to &quot;brenna.treutel@runolfsdottirdonnelly.org&quot;<br>
     changed street to &quot;123 Main St&quot;<br>
     changed city to &quot;Paia&quot;<br>
     changed state to &quot;HI&quot;<br>
     changed zip_code to &quot;96677&quot;<br>
     changed social_security_number to &quot;555-33-4444&quot;<br>
     changed date_of_birth to &quot;2000-10-03&quot;

And the end result we want is:

Updated User (4)<br>
     changed first_name to &quot;Karina&quot;<br>
     changed last_name to &quot;Senger&quot;<br>
     changed phone to &quot;2133432154&quot;<br>
     changed email to XXXXXX<br>
     changed street to &quot;123 Main St&quot;<br>
     changed city to &quot;Paia&quot;<br>
     changed state to &quot;HI&quot;<br>
     changed zip_code to &quot;96677&quot;<br>
     changed social_security_number to XXXXXX<br>
     changed date_of_birth to XXXXXX

Reference Web Pages

I suggest you open the following reference web pages:

  1. Ruby doc for Class Regexp
  2. Ruby doc for String#gsub
  3. Interactive Ruby regular expression tester: rubular.com

Figure out the Regexp

First, let's figure out the right regular expression using rubular.com.

Basic Regexp

Copy the input string above (the box labeled "Input HTML") into the "Your test string" box, and then let's figure out a simple regexp to match an individual line.

changed email to &quot;(.*?)&quot;

image1

Why the (.*?) or Why Non-Greedy

The .*? syntax asks for a non-greedy match. To see this, paste the string below into the "Your test string" box (note that the text scrolls way to the right):

Updated User (4)<br>     changed first_name to &quot;Karina&quot;<br>     changed last_name to &quot;Senger&quot;<br>     changed phone to &quot;2133432154&quot;<br>     changed email to &quot;brenna.treutel@runolfsdottirdonnelly.org&quot;<br>     changed street to &quot;123 Main St&quot;<br>     changed city to &quot;Paia&quot;<br>     changed state to &quot;HI&quot;<br>     changed zip_code to &quot;96677&quot;<br>     changed social_security_number to &quot;555-33-4444&quot;<br>   changed date_of_birth to &quot;2000-10-03&quot;

Notice that you see the correct result for the match.

Now, remove the ? in the .*?, using this regexp:

changed email to &quot;(.*)&quot;

And you'll see this:

image2

Now, add back the ? after the .*, and you'll see the right value.

image3

The Ruby docs for Class Regexp explain this:

Repetition is greedy by default: as many occurrences as possible are matched while still allowing the overall match to succeed. By contrast, lazy matching makes the minimal amount of matches necessary for overall success. A greedy metacharacter can be made lazy by following it with ?.
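
If you'd rather see the difference in the console than in Rubular, here is a minimal sketch. The shortened line and the a@example.com address are made up for illustration; only the two patterns come from the steps above.

```ruby
# A shortened, single-line log entry with two quoted values (illustrative data).
line = 'changed email to &quot;a@example.com&quot;<br> changed city to &quot;Paia&quot;'

# String#[] with a regexp and a group number returns that capture group.
greedy = line[/changed email to &quot;(.*)&quot;/, 1]
lazy   = line[/changed email to &quot;(.*?)&quot;/, 1]

puts greedy  # runs on to the last &quot; in the line
puts lazy    # stops at the first &quot; after the email value
```

The greedy version swallows everything up to the final &quot;, which is exactly the over-match you see in Rubular.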

Match any of the 3 fields

How do we match any of the PII fields of email, social_security_number, date_of_birth?

The answer is to use alternation.

The vertical bar metacharacter (|) combines two expressions into a single one that matches either of the expressions. Each expression is an alternative.

changed (email|social_security_number|date_of_birth) to &quot;(.*?)&quot;

image4
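
If the list of PII fields lives in your code rather than in a hand-written pattern, one possible approach is to build the alternation with Regexp.union. This is only a sketch: the pii_fields variable and the sample lines are assumptions for illustration, not part of the original example.

```ruby
# Hypothetical list of sensitive field names; adjust to your own schema.
pii_fields = %w[email social_security_number date_of_birth]

# Regexp.union escapes each string and joins the alternatives with |.
field_alternation = Regexp.union(pii_fields)
pii_regexp = /changed (#{field_alternation}) to &quot;(.*?)&quot;/

email_line = "changed email to &quot;a@example.com&quot;"
phone_line = "changed phone to &quot;2133432154&quot;"

puts email_line.match(pii_regexp)[1]       # the field name that matched
puts phone_line.match(pii_regexp).inspect  # nil, because phone is not in the list
```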

Experiment in the Console (Pry)

Open up your Rails console (rails c) and paste the following two lines. This sets you up with what we've been testing in Rubular.

log_entry = "Updated User (4)<br>     changed first_name to &quot;Karina&quot;<br>     changed last_name to &quot;Senger&quot;<br>     changed phone to &quot;2133432154&quot;<br>     changed email to &quot;brenna.treutel@runolfsdottirdonnelly.org&quot;<br>     changed street to &quot;123 Main St&quot;<br>     changed city to &quot;Paia&quot;<br>     changed state to &quot;HI&quot;<br>     changed zip_code to &quot;96677&quot;<br>     changed social_security_number to &quot;555-33-4444&quot;<br>     changed date_of_birth to &quot;2000-10-03&quot;"
regexp = /changed (email|social_security_number|date_of_birth) to &quot;(.*?)&quot;/

Then enter the following. Feel free to experiment!

log_entry.match(regexp)
$~
$1
$2
$&

Here's the documentation for the globals set by a regexp match. These are thread-local and method-local variables, so they are safe in a multi-threaded environment.

Pattern matching sets some global variables:

$~ is equivalent to ::last_match;
$& contains the complete matched text;
$` contains string before match;
$' contains string after match;
$1, $2 and so on contain text matching first, second, etc capture group;
$+ contains last capture group.

image5

The $~ will come in particularly handy when we try to use gsub.
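
Here is a small sketch of those globals in action, using a made-up one-line sample so the "before" and "after" parts are easy to spot. The variable names are just for illustration.

```ruby
# Illustrative sample with a little text on either side of the match.
sample = "x changed email to &quot;a@example.com&quot; y"
sample_regexp = /changed (email|date_of_birth) to &quot;(.*?)&quot;/

sample.match(sample_regexp)

# Copy the globals right away; the next match in this method overwrites them.
whole  = $&   # the complete matched text
field  = $1   # first capture group
value  = $2   # second capture group
before = $`   # text before the match
after  = $'   # text after the match

puts field, value
puts $~[2] == Regexp.last_match(2)  # $~ and Regexp.last_match give the same data
```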

How Do We Get All the Matches?

String#scan does it!

log_entry.scan(regexp)

image6
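
Because the regexp has capture groups, String#scan returns one inner array per match, holding the groups in order. A minimal sketch, with a shortened log entry that is assumed for illustration:

```ruby
# Shortened log entry (illustrative data); phone is not a PII field here.
short_entry = "changed phone to &quot;2133432154&quot;<br> " \
              "changed email to &quot;a@example.com&quot;<br> " \
              "changed date_of_birth to &quot;2000-10-03&quot;"
scan_regexp = /changed (email|social_security_number|date_of_birth) to &quot;(.*?)&quot;/

pairs = short_entry.scan(scan_regexp)

# Each inner array is [field, value], so it destructures nicely in a block.
pairs.each { |field, value| puts "#{field} has a #{value.length}-character value" }
```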

Named Match Groups

$1 and $2 are not the most illuminating names for the capture group values. Ruby offers a way to give them readable names. Quoting the Ruby doc for Class Regexp:

Capture groups can be referred to by name when defined with the (?<name>) or (?'name') constructs.

/\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")
    #=> #<MatchData "$3.67" dollars:"3" cents:"67">
/\$(?<dollars>\d+)\.(?<cents>\d+)/.match("$3.67")[:dollars] #=> "3"

Named groups can be backreferenced with \k<name>, where name is the group name.

/(?<vowel>[aeiou]).\k<vowel>.\k<vowel>/.match('ototomy')
    #=> #<MatchData "ototo" vowel:"o">

Note: A regexp can't use named backreferences and numbered backreferences simultaneously.

When named capture groups are used with a literal regexp on the left-hand side of an expression and the =~ operator, the captured text is also assigned to local variables with corresponding names.

/\$(?<dollars>\d+)\.(?<cents>\d+)/ =~ "$3.67" #=> 0
dollars #=> "3"

Let's try that in rubular first by copying this regexp into rubular:

changed (?<field>email|social_security_number|date_of_birth) to &quot;(?<value>.*?)&quot;

image7

And then try this in the console:

regexp_named_captures = /changed (?<field>email|social_security_number|date_of_birth) to &quot;(?<value>.*?)&quot;/
match_data = log_entry.match(regexp_named_captures)
match_data[:field]
match_data[:value]
arr = log_entry.scan(regexp_named_captures)
arr[0]
arr[0][0]

image8
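
Note that String#scan still hands back plain arrays, even with named groups. If you want the names back, one option (a sketch, not the only way) is to zip each row with Regexp#names. The shortened entry and the rows variable are assumptions for illustration.

```ruby
named_regexp = /changed (?<field>email|social_security_number|date_of_birth) to &quot;(?<value>.*?)&quot;/

# Shortened log entry (illustrative data).
short_entry = "changed email to &quot;a@example.com&quot;<br> " \
              "changed date_of_birth to &quot;2000-10-03&quot;"

# Regexp#names lists the group names in order: ["field", "value"].
rows = short_entry.scan(named_regexp).map do |captures|
  named_regexp.names.zip(captures).to_h
end

rows.each { |row| puts "#{row['field']} => #{row['value']}" }
```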

Substitution with String#gsub

Simple (Non-block) String#gsub Syntax

Now, back to the task at hand, which was to convert the original log entry with PII so that the PII is redacted. We'll change the lines to something like this:

changed email to XXXXXX

Let's take a look at the documentation for String#gsub

If replacement is a String it will be substituted for the matched text. It may contain back-references to the pattern’s capture groups of the form \d, where d is a group number, or \k<n>, where n is a group name. If it is a double-quoted string, both back-references must be preceded by an additional backslash. However, within replacement the special match variables, such as $&, will not refer to the current match.

log_entry.gsub(regexp_named_captures, "changed \\k<field> to XXXXXX")

And that results in mission accomplished!

image9
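
The quoting rule from the gsub documentation is easy to trip over, so here is a minimal sketch comparing the two string styles. The one-line sample is made up for illustration.

```ruby
named_regexp = /changed (?<field>email|social_security_number|date_of_birth) to &quot;(?<value>.*?)&quot;/
line = "changed email to &quot;a@example.com&quot;"

# In a single-quoted string, \k reaches gsub unchanged.
single_quoted = line.gsub(named_regexp, 'changed \k<field> to XXXXXX')

# In a double-quoted string, the backslash itself must be escaped.
double_quoted = line.gsub(named_regexp, "changed \\k<field> to XXXXXX")

puts single_quoted
puts single_quoted == double_quoted
```

Both calls pass the same replacement string to gsub, so which one you use is mostly a matter of taste.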

Block String#gsub Syntax

Suppose you want to use the block syntax of String#gsub. Given the use case in this example, there's no particular reason to use it. However, you might come across a use case where you'd like some logic in the block to determine the substitution value. Here's how you do it.

In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.

Using the block syntax with named capture groups is not exactly obvious.

You might think that the value passed into the block is the match data. Instead, it's the full value of what was matched.

For example:

log_entry.gsub(regexp_named_captures) { |match| "XXXXXX" }

image10

That's not what we want. We want to show the field that was redacted.

Maybe we can use the same syntax as the non-block form:

log_entry.gsub(regexp_named_captures) { |match| "changed \\k<field> to XXXXXX" }

That doesn't work!

image11

The solution is that you have to use the globals mentioned above, like $1.

log_entry.gsub(regexp_named_captures) { |match| "changed #{$1} to XXXXXX" }

This works!

image12

But what if you want to use the named capture groups?

Then you have to use $~, which gives you the MatchData.

log_entry.gsub(regexp_named_captures) { |match| "changed #{$~[:field]} to XXXXXX" }

Nice! That works!

image13
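
To show where the block form earns its keep, here is a sketch with a bit of logic in the block: a different placeholder per field. The placeholders hash and the XXXX-XX-XX format are hypothetical choices for illustration, not something the task above requires.

```ruby
named_regexp = /changed (?<field>email|social_security_number|date_of_birth) to &quot;(?<value>.*?)&quot;/

# Shortened log entry (illustrative data).
short_entry = "changed email to &quot;a@example.com&quot;<br> " \
              "changed date_of_birth to &quot;2000-10-03&quot;"

# Hypothetical per-field placeholders; anything not listed gets XXXXXX.
placeholders = { "date_of_birth" => "XXXX-XX-XX" }

masked = short_entry.gsub(named_regexp) do
  field = $~[:field]  # named capture, read from the MatchData in $~
  "changed #{field} to #{placeholders.fetch(field, 'XXXXXX')}"
end

puts masked
```

The block ignores its argument entirely, since everything it needs comes from $~.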

Summary of Key Lessons

  1. The rubular.com site is super useful for testing regular expressions in Ruby.
  2. The Ruby console is awesome for trying out regular expressions in Ruby with methods such as String#match, String#scan, and String#gsub.
  3. String#match only returns the first match, in the form of a MatchData object. String#scan returns all matches, but the results come in the form of Arrays of Arrays.
  4. (.*) matches greedily. (.*?) is non-greedy. Non-greedy stops at the first possible place. Greedy goes to the last possible place. Both stay within a single line, because . does not match a newline by default.
  5. (?<some_name>.*?) is the syntax for a named capture group. Named capture groups make your regular expressions easier to read.
  6. You can use a named capture group in your replacement value for a String#gsub with the syntax \\k<some_name>. This is much clearer than \\1.
  7. If you use the block syntax for String#gsub, it does not work like the non-block syntax in terms of substitution. You need to be aware of:
    1. Value passed into the block is the full string matched, rather than a MatchData object.
    2. The value returned from the block is what is substituted for the whole string matched.
    3. The block is called once for each string matched.
    4. You have to use String interpolation within your code in the block, as this is normal ruby code, unlike the String value in the non-block gsub syntax. I.e., don't just return a String with $1 inside of it. You need to put #{$1} in the String.
    5. You can use the regexp globals like $1 to access a capture group.
    6. To use a named capture group inside the block, you need to use the $~[:some_name] syntax, where some_name is the name of your capture group. You will probably ignore the passed-in argument to the gsub block if using this syntax.
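
Putting the pieces together, here is one way the whole task might be wrapped up in a method. Treat it as a sketch: the mask_pii name and the trimmed-down sample input are assumptions, and the pattern drops the value capture group because the replacement never uses it.

```ruby
# Hypothetical helper that redacts the three PII fields in a log entry.
def mask_pii(log_entry)
  pii_regexp = /changed (?<field>email|social_security_number|date_of_birth) to &quot;.*?&quot;/
  log_entry.gsub(pii_regexp, 'changed \k<field> to XXXXXX')
end

# Trimmed-down version of the input HTML from the top of the article.
html = [
  "Updated User (4)<br>",
  "     changed phone to &quot;2133432154&quot;<br>",
  "     changed email to &quot;brenna.treutel@runolfsdottirdonnelly.org&quot;<br>",
  "     changed social_security_number to &quot;555-33-4444&quot;<br>",
  "     changed date_of_birth to &quot;2000-10-03&quot;"
].join("\n")

puts mask_pii(html)
```

The phone line passes through untouched, while the three PII lines end in XXXXXX.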