• Pranav Kulkarni

An Introduction to Regular Expressions

The purpose of this blog is to give those with no prior experience or knowledge of regular expressions in Python a simple & intuitive introduction. Regexes, or regular expressions, are character sequences that are used to find a pattern in a string or series of strings. Several examples below help us better understand regex and how it works.

Accessing the Regex Module in Python

Python's built-in regex module is called re. The module and its search function can be imported as:


import re
re.search(<regex>, <string>)

Or we can directly import the search function:


from re import search 
search(<regex>, <string>)                                   

In both code blocks, <regex> is the pattern to search for and <string> is the string in which the search is conducted.



Using re.search()

Let's look at an example of how re.search() is used to find a specific pattern:


import re 
s='xYzff123' 
print(re.search('123', s))

Output - <re.Match object; span=(5, 8), match='123'>

This output tells us that the pattern was found in the given string. The span value gives the start and end indexes of the match inside the string; the end index is exclusive, so (5, 8) covers indexes 5, 6 and 7.

The search function is a useful tool: it tells us whether one string occurs inside a larger string and, if it does, where the match is located.
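As a small sketch of how the location can be read off the match object, the snippet below reuses the string from the example above. The variable names match, start and end are only illustrative.

```python
import re

s = 'xYzff123'
match = re.search('123', s)

# span() returns (start, end); end is one past the last matched character
start, end = match.span()
print(start, end)      # 5 8

# Slicing the original string with the span gives back the matched text
print(s[start:end])    # 123
print(match.group())   # 123
```

Because the end index is exclusive, the span can be used directly as a slice.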


Combining Boolean Statements and Regex

Because search returns a match object when the pattern is found and None when it is not, its result can be used directly in a conditional statement:


from re import search 
text = "Twitter is a platform that runs on the concept of #s"  # named text so the built-in str is not shadowed
if search('#', text):
    print("# found in the string") 
else:
    print("No # found in the string")

Output - # found in the string
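For comparison, here is a minimal sketch of the opposite case, using a made-up sentence that contains no # character. It shows what search returns when nothing is found and why that value takes the else branch.

```python
from re import search

text = "No hashtags in this sentence"  # illustrative input, not from the example above
result = search('#', text)

print(result)        # None
print(bool(result))  # False, so an if statement would take the else branch
```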

Complex Regex Queries with Metacharacters

Metacharacters are the building blocks of regular expressions. A regex treats each character either as a metacharacter with a special meaning or as a regular character with a literal meaning.

The following table describes what each metacharacter is used for:

| Metacharacter | Description | Examples |
| --- | --- | --- |
| \d | A single digit, 0 to 9 | \d\d\d = 444; \d\d = 21; \d = 8 |
| \w | A single word character: a letter, a digit or an underscore | \w\w\w = dog; \w\w\w = 467 |
| \W | A single non-word character, such as a symbol or a space | \W = %; \W = #; \W\W\W = @#$ |
| [a-z] | Character set; matches exactly one character from the set | pand[ora] = pando, pandr & panda (the set matches exactly 1 character) |
| [0-9] | Numeric set, with exactly the same logic | 012[12] = 0121 & 0122 |
| (abc) | Character group; matches the characters in the exact order | pand(ora) = pandora |
| (123) | Numeric group; matches the digits in the exact order | 0123(456) = 0123456 |
| \| | Boolean OR between two alternatives | pand(ora\|123) = pandora OR pand123 |
| ? | Matches when the preceding character occurs 0 or 1 time, making it optional | colou?r = colour (u found once); colou?r = color (u found 0 times) |
| * | Matches when the preceding character occurs 0 or more times | tre* = tree (e found twice); tre* = tre (e found once); tre* = tr (e found 0 times); tre* != trees (s doesn't match the regex) |
| + | Matches when the preceding character occurs 1 or more times, making it mandatory | tre+ = tree (e found twice); tre+ = tre (e found once); tre+ != tr (e found 0 times, hence no match) |
| . | Matches any single character except a newline: a letter, a digit or a symbol | ton. = tone, ton4, ton@ but ton. != tones (only a single character is matched) |
| .* | Combines . and *: any run of 0 or more characters | tr.* = tr, tre, tree, trees, trough, treadmill |

In the Examples column, = means the pattern matches the whole string and != means it does not.
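A few of the rows above can be checked with a short sketch. It assumes that the table's = and != refer to matching the whole string, so it uses re.fullmatch rather than re.search (search would also accept a match found anywhere inside a longer string). The checks list is only an illustrative selection.

```python
import re

# (pattern, candidate string, whether the whole string should match)
checks = [
    (r'\d\d\d', '444', True),
    (r'\w\w\w', 'dog', True),
    (r'\W\W\W', '@#$', True),
    (r'pand[ora]', 'pando', True),
    (r'pand(ora|123)', 'pand123', True),
    (r'colou?r', 'color', True),
    (r'tre*', 'tr', True),
    (r'tre*', 'trees', False),
    (r'tre+', 'tr', False),
    (r'ton.', 'ton@', True),
    (r'ton.', 'tones', False),
    (r'tr.*', 'treadmill', True),
]

for pattern, candidate, expected in checks:
    # fullmatch succeeds only if the pattern covers the entire string
    matched = re.fullmatch(pattern, candidate) is not None
    print(f'{pattern!r:18} {candidate!r:12} {matched}')
    assert matched == expected
```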

Regex Quantifiers

Quantifiers specify how many instances of a character, group, or character class must be present in the input for a match to be found. The following table describes the quantifiers used in regex and their usage.

| Quantifier | Description | Example |
| --- | --- | --- |
| {n} | Matches when the preceding character (or group) occurs exactly n times | \d{3} = 123 & 456 & 789; pand(ora){2} = pandoraora |
| {n,m} | Matches when the preceding character (or group) occurs at least n times and at most m times | \d{2,5} = 97430 & 9743 & 97 |
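The quantifier examples can be tried out in the same way. This is a minimal sketch that again uses re.fullmatch for whole-string checks; the six-digit input is an added illustration of what happens when the upper bound is exceeded.

```python
import re

print(bool(re.fullmatch(r'\d{3}', '123')))                # True
print(bool(re.fullmatch(r'\d{3}', '1234')))               # False, four digits is one too many
print(bool(re.fullmatch(r'pand(ora){2}', 'pandoraora')))  # True
print(bool(re.fullmatch(r'\d{2,5}', '97')))               # True
print(bool(re.fullmatch(r'\d{2,5}', '974301')))           # False, six digits exceeds the maximum

# search() looks for a match anywhere in the string and takes as many digits as allowed
longest = re.search(r'\d{2,5}', '974301').group()
print(longest)                                            # 97430
```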



Pattern Usage In Python

Suppose we want to find specific snippets in some input, or enforce strict rules on what may be typed into a text box, such as checking whether an entered email address is valid. In such cases we build a pattern that combines metacharacters and quantifiers in a specific sequence describing the text we want to match.

Let's take email validation as an example. Say you create a form with a text box that asks the user to enter their email address. How will you check whether the entered email is valid?

Let's see!

The Python code for checking the correct email pattern is:


from re import search 
s = "anyemail@regex.com" 
match = search(r'[\w.]+@[\w.]+', s) 
if match: 
    print(match.group()) 
else:
    print("Match not found")

In the above code, r'[\w.]+@[\w.]+' is the pattern used to check for an email address inside the string s (for convenience, s is hard-coded here rather than read from user input).


Now let's break down the pattern used for email detection. It is written as a raw string (the r prefix before the quotes), so Python passes the backslashes to the regex engine unchanged. Inside the quotes there is first a pair of square brackets, then a +, then the @ symbol, and then another pair of square brackets followed by another +.


Let's look at the role of the square brackets first and then move on to their contents.


From the metacharacters table, we can see that square brackets such as [a-z] define a character set, so the first part of the email may only consist of characters from that set. The + means the set must match 1 or more times, so the first part can be any non-empty run of allowed characters. The @ sign is a literal: exactly that character must come next. After it there is another character set and another +, which work the same way as the first pair.


Now let's look inside the square brackets. The first pair contains \w and a dot. From the metacharacters table, \w matches alphanumeric characters, so both letters and numbers are allowed in the pattern (as they should be, it's an email after all!). Note that inside square brackets the . loses its special meaning and matches only a literal dot, which lets the pattern accept addresses such as first.last.


The second pair of square brackets works in exactly the same way, except that it describes the part that comes after the @ character, such as regex.com.
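The difference between a dot outside and inside square brackets is easy to miss, so here is a minimal sketch of it. The strings a#c, a.c and first.last@regex.com are made-up inputs for illustration.

```python
import re

# Outside brackets, . matches any single character (except a newline)
print(bool(re.fullmatch(r'a.c', 'a#c')))    # True

# Inside brackets, . matches only a literal dot
print(bool(re.fullmatch(r'a[.]c', 'a#c')))  # False
print(bool(re.fullmatch(r'a[.]c', 'a.c')))  # True

# So [\w.]+ accepts letters, digits, underscores and dots, and stops at the @
local_part = re.search(r'[\w.]+', 'first.last@regex.com').group()
print(local_part)                           # first.last
```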


Running the above program gives:

Output - anyemail@regex.com


The .group() method returns the part of the string that matched the pattern. If nothing in the string matches, search returns None instead of a match object, and the else branch prints the "Match not found" message.


And we get the desired email by using the pattern r'[\w.]+@[\w.]+'


However, the best part of the above program is that even if s is a whole sentence that contains an email ID, the program extracts only the email ID, because the pattern matches only the characters immediately before and after the @ character.


So, if we set s = "Something Something Something anyemail@regex.com", the output of the program is only anyemail@regex.com.
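The sentence case can be sketched as follows. The second input string and the use of re.findall are additions for illustration, and the last line shows that this simple pattern is permissive: it checks the shape of the text around the @, not whether the address really exists or has a valid domain.

```python
import re

pattern = r'[\w.]+@[\w.]+'

s = "Something Something Something anyemail@regex.com"
first = re.search(pattern, s).group()
print(first)    # anyemail@regex.com

# findall() returns every non-overlapping match as a list of strings
text = "Write to first@regex.com or second.person@regex.org today"
emails = re.findall(pattern, text)
print(emails)   # ['first@regex.com', 'second.person@regex.org']

# The pattern accepts anything shaped like word@word, even without a domain suffix
loose = re.search(pattern, "a@b").group()
print(loose)    # a@b
```

For real form validation, a stricter pattern or a dedicated validation step would likely be needed on top of this.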
