Regex Introduction - Reimart's Webpage

If you have been exposed to programming or Linux, chances are you have used some form of regular expressions (regex) before. In this post, I am going to go over how I have used regex in problems I have encountered.

Short anecdote

While reviewing for the board exam, I had an idea: scrape all the questions and answers from a few websites that host MCQs and compile them in a database. I used regular expressions to extract the questions and answers so that I did not have to check for each string in the file by hand. If you are interested, check here, although I have not updated it for a while.

Sample Problem

The best way to learn a new concept is through a sample problem. Suppose we have an HTML file and we are interested in the links it contains. How are we going to solve it? With regex.

Detect HTML links

I linked the HackerRank problem above, but you can read the rest of the article without following the link.

Problem

We are given an HTML file as input, and we are required to extract every link, together with its link text, and return them in the order they appear.

Suppose we have an input file:

<div class="portal" role="navigation" id='p-navigation'>
<h3>Navigation</h3>
<div class="body">
<ul>
 <li id="n-mainpage-description"><a href="/wiki/Main_Page" title="Visit the main page [z]" accesskey="z">Main page</a></li>
 <li id="n-contents"><a href="/wiki/Portal:Contents" title="Guides to browsing Wikipedia">Contents</a></li>
 <li id="n-featuredcontent"><a href="/wiki/Portal:Featured_content" title="Featured content  the best of Wikipedia">Featured content</a></li>
<li id="n-currentevents"><a href="/wiki/Portal:Current_events" title="Find background information on current events">Current events</a></li>
<li id="n-randompage"><a href="/wiki/Special:Random" title="Load a random article [x]" accesskey="x">Random article</a></li>
<li id="n-sitesupport"><a href="//donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en" title="Support us">Donate to Wikipedia</a></li>
</ul>
</div>
</div>

Then the output would be:

/wiki/Main_Page,Main page
/wiki/Portal:Contents,Contents
/wiki/Portal:Featured_content,Featured content
/wiki/Portal:Current_events,Current events
/wiki/Special:Random,Random article
//donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en,Donate to Wikipedia

Intuition:

It is a matter of extracting the href value from the anchor (<a>) tag.
We can write a regular expression that extracts both the href value and the text inside the anchor tag.
Afterwards, it is just a matter of formatting the output by inserting a comma between the extracted values.

Writing regular expressions

Regular expressions are terse by design, but they are not that difficult to read once you have seen a few. First, we need a regular expression that matches the link in the href attribute. This one does it:

href\=\"(.*?)\"

You may follow along with this link.

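In this expression, href matches the literal string “href”, and the same goes for the escaped characters \= and \", which match a literal = and ". Neither character is special in a regex, so the backslashes are optional. The most interesting part is the token (.*?). The parentheses form a capture group, . matches any character except a newline, * repeats it zero or more times, and ? makes the repetition lazy, so the group stops at the earliest point where the rest of the pattern can match. In plain English, the expression roughly means: find “href=” and capture everything between the quotation marks that follow it.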
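To see what the lazy group actually captures, here is a small Python sketch. The sample line comes from the input above; the greedy pattern is only there for contrast and is not part of the solution, and the variable names are my own.

```python
import re

line = '<li id="n-contents"><a href="/wiki/Portal:Contents" title="Guides to browsing Wikipedia">Contents</a></li>'

# Lazy: (.*?) stops at the first closing quote after href=".
lazy = re.search(r'href\=\"(.*?)\"', line)

# Greedy (for contrast only): (.*) runs on to the last quote on the line.
greedy = re.search(r'href\=\"(.*)\"', line)

lazy_value = lazy.group(1)
greedy_value = greedy.group(1)

print(lazy_value)    # just the link
print(greedy_value)  # the link plus the start of the title attribute
```

The lazy version returns only the link, while the greedy one swallows the title attribute as well, which is why the ? matters here.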

We can extend the expression to also capture everything between the characters > and <.

href\=\"(.*?)\"\>(.*?)\<

With this pattern, the first group has to run until it reaches a quotation mark that is immediately followed by >, so it inevitably picks up extra stuff that we do not want, like the title attribute. We can trim that off with a string method in Python: split on " and keep the first piece. There is more than one way to skin a cat. This solution may not be the most optimal, but it can get you started using regular expressions in your day-to-day programming tasks.
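A quick check of the extended pattern on one line from the sample input shows where the extra text ends up and how the split removes it. This is only a sketch, and the variable names are mine.

```python
import re

line = '<li id="n-randompage"><a href="/wiki/Special:Random" title="Load a random article [x]" accesskey="x">Random article</a></li>'

pattern = re.compile(r'href\=\"(.*?)\"\>(.*?)\<')
match = pattern.search(line)

raw_href = match.group(1)       # runs on through title and accesskey
text = match.group(2)           # text between > and <
href = raw_href.split('"')[0]   # keep only the part before the first quote

print(raw_href)
print(href + ',' + text)
```

The first print shows the href value with the other attributes still attached; the second shows the cleaned-up line in the format the problem asks for.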

The code snippet below shows the solution function that I have written. You can clone the solution and the unit test via the git repo below.

import re

def extract_links(input_html: str) -> list[str]:
    # Capture the href value (group 1) and the link text (group 2).
    pattern = re.compile(r"href\=\"(.*?)\"\>(.*?)\<")
    output = []
    # Process the input one line at a time; only the first link on each line is used.
    for line in input_html.splitlines():
        match = pattern.search(line)
        if match is not None:
            # Group 1 can spill into later attributes, so keep only the text before the first quote.
            href = match.group(1).split('"')[0]
            output.append(href + ',' + match.group(2))
    return output
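As one of the other ways to skin the cat, here is a variant that avoids the split step and handles more than one link per line. It is not the solution from the repo: the function name extract_links_all and the pattern are my own assumptions, written for Python 3.9 or later. It uses [^"]* so the href group cannot run past its closing quote, and finditer so that every match is returned.

```python
import re

# Hypothetical variant, not the solution from the repo.
# [^"]* cannot cross a quote, and [^>]* skips any remaining attributes.
LINK = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)<')

def extract_links_all(input_html: str) -> list[str]:
    # finditer scans the whole string, so several links on one line are all found.
    return [f"{m.group(1)},{m.group(2)}" for m in LINK.finditer(input_html)]

sample = (
    '<li><a href="/wiki/Main_Page" title="Visit the main page [z]">Main page</a></li> '
    '<li><a href="/wiki/Portal:Contents">Contents</a></li>'
)
links = extract_links_all(sample)
print(links)
```

Like the original, this still assumes the link text contains no nested tags; an anchor such as <a href="x"><b>Text</b></a> would yield an empty text value.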

Conclusion

You have learned to write regular expressions, particularly to extract strings between two characters.

Sources

GitHub Repo