Tuesday, 26 November 2024

Regular Expressions in Python




Q. What is a Regular Expression?

Ans: The word "regular" comes from the concept of regular languages in automata theory.

A regular language can be described by a finite set of rules, and these rules can be expressed as a regular expression.

Example:

Here, we want to build strings from the two letters a and b, with one condition: every string must end with 'a'. A string is in the regular language L only if it ends with 'a'; otherwise it is not in L.

So, let's define the regular language L:

L = (a | b)* a

This means the language contains strings made of the letters a and b, and every string must end with the letter 'a'.

Program in Python:


import re

# A regular expression (regex) is a sequence of characters that defines a search
# pattern, primarily used for pattern matching and manipulation in strings.

# Define the regular expression
pattern = r"(a|b)*a"

# Test some strings
strings = ["a", "bab", "bba", "b", "bb", "ba"]

for s in strings:
    # fullmatch() succeeds only if the entire string matches the pattern
    if re.fullmatch(pattern, s):
        print(f'"{s}" is in the language.')
    else:
        print(f'"{s}" is NOT in the language.')


------ end of program ------
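
The automata-theory view mentioned earlier can be made concrete. As a rough sketch (not part of the original program), the same language L = (a|b)*a can be recognized by a two-state finite automaton. The function name accepts and the state numbering are illustrative assumptions.

```python
import re


def accepts(s: str) -> bool:
    """Simulate a two-state DFA for L = (a|b)*a (hypothetical helper)."""
    # State 0: nothing read yet, or the last symbol was 'b'.
    # State 1: the last symbol was 'a' (the only accepting state).
    transitions = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
    state = 0
    for ch in s:
        if (state, ch) not in transitions:
            return False  # symbol outside the alphabet {a, b}
        state = transitions[(state, ch)]
    return state == 1


# The DFA and the regex should agree on every test string.
samples = ["a", "bab", "bba", "b", "bb", "ba", ""]
agreement = all(
    accepts(s) == bool(re.fullmatch(r"(a|b)*a", s)) for s in samples
)
print(agreement)  # → True
```

Both approaches decide membership in the same language; the regex is simply a compact way of writing down what the automaton does.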


Regular Expression Methods:

Regular expression (regex) methods are used for:

  • Searching,
  • Matching,
  • and Manipulating strings based on patterns.

How to use Regular Expression:

1. Defining Pattern:
    A regex pattern is composed of:

  • Literal Characters: Match exactly as they appear (e.g., cat matches the word "cat").
  • Special Characters (Meta-characters):
    • .: Matches any character except a newline.
    • *: Matches zero or more of the preceding element (a character, set, or group).
    • +: Matches one or more of the preceding element.
    • ?: Matches zero or one of the preceding element.
    • ^: Matches the start of a string.
    • $: Matches the end of a string.
    • []: Denotes a set of characters (e.g., [a-z] matches any lowercase letter).
    • |: Acts as an OR (e.g., cat|dog matches "cat" or "dog").
    • (): Groups patterns and captures matches.
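
To see each meta-character from the list above in action, here is a small sketch. The sample strings and variable names are made up for illustration.

```python
import re

# . matches any character except a newline
dot = re.findall(r"c.t", "cat cot c\nt")             # → ['cat', 'cot']

# * zero or more, + one or more, ? zero or one of the preceding element
star = re.findall(r"ab*", "a ab abbb")               # → ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")               # → ['ab', 'abbb']
optional = re.findall(r"colou?r", "color colour")    # → ['color', 'colour']

# ^ and $ anchor the pattern to the start and end of the string
anchored = bool(re.search(r"^cat$", "cat"))          # → True
not_anchored = bool(re.search(r"^cat$", "a cat"))    # → False

# [] is a character set, | is alternation, () captures groups
char_set = re.findall(r"[a-z]", "aB3c")              # → ['a', 'c']
either = re.findall(r"cat|dog", "cat, dog, cow")     # → ['cat', 'dog']
groups = re.search(r"(\d+)-(\d+)", "call 555-1234").groups()  # → ('555', '1234')

print(dot, star, plus, optional, anchored, not_anchored, char_set, either, groups)
```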

2. Using Regex Methods:

import re

# Searching for a pattern: finds the first occurrence of one or more digits
match = re.search(r'\d+', 'The year is 2024')
print(match.group()) # Output: 2024

# Checking for a pattern: the whole string must consist of letters
if re.match(r'^[a-zA-Z]+$', 'Hello'):
    print("It's a valid word")

# Finding all matches: every three-letter word
matches = re.findall(r'\b\w{3}\b', 'The cat sat on the mat')
print(matches) # Output: ['The', 'cat', 'sat', 'the', 'mat']

# Replacing matches
result = re.sub(r'\d+', 'YEAR', 'The year is 2024')
print(result) # Output: 'The year is YEAR'
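
The methods above differ mainly in where they look for the pattern. The following sketch contrasts re.match(), re.search() and re.fullmatch() on the same input. The sample text and variable names are illustrative.

```python
import re

text = "Year 2024"

# re.match() only tries to match at the beginning of the string
at_start = re.match(r"\d+", text)        # → None ("Year" is not a digit)

# re.search() scans the string and returns the first match anywhere
anywhere = re.search(r"\d+", text)       # → match object for '2024'

# re.fullmatch() requires the entire string to match the pattern
whole = re.fullmatch(r"\d+", text)       # → None
whole_ok = re.fullmatch(r"\d+", "2024")  # → match object for '2024'

# These functions return None on failure, so check before calling .group()
found = anywhere.group() if anywhere else None
print(at_start, found, whole, whole_ok.group())
```

Because a failed match returns None, calling .group() without a check raises AttributeError when nothing matches.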


Implementation 

1. re.split():

The re.split() function in Python is part of the re module and is used to split a string into a list based on a specified regular expression pattern.

Syntax: re.split(pattern, string, maxsplit=0, flags=0)

pattern: The regular expression pattern to split on.

string: The input string to be split.

maxsplit (optional): The maximum number of splits to perform. The default, 0, means no limit.

flags (optional): Flags to modify the behavior of the regular expression.

Example:

import re

text = "one,two;three four"

result = re.split(r'[,\s;]+', text)  # Split on commas, spaces, or semicolons

print(result)

# Output: ['one', 'two', 'three', 'four']

--------------------------------------------------

text = "abc123def456ghi"

result = re.split(r'\d+', text)  # Split on one or more digits

print(result)

# Output: ['abc', 'def', 'ghi']

-----------------------------------------------------

text = "one,two;three four"

result = re.split(r'[,\s;]+', text, maxsplit=2) # Limit to 2 splits

print(result)

# Output: ['one', 'two', 'three four']

-------------------------------------------------------

text = "HelloWorldHELLOworld"
result = re.split(r'world', text, flags=re.IGNORECASE) # Case-insensitive split
print(result)
# Output: ['Hello', 'HELLO', ''] (both "World" and "world" act as separators)

--------------------------------------------------------
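
One more behavior of re.split() worth knowing: if the pattern contains a capturing group, the matched separators are also returned in the result list. A minimal sketch, with made-up sample text:

```python
import re

text = "one,two;three"

# Without a capturing group the separators are discarded
plain = re.split(r"[,;]", text)         # → ['one', 'two', 'three']

# With a capturing group the separators are kept in the list
with_seps = re.split(r"([,;])", text)   # → ['one', ',', 'two', ';', 'three']

print(plain)
print(with_seps)
```

This can be handy when the text has to be reassembled later with the original separators.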

2. re.sub():
The re.sub() function in Python is used to replace parts of a string that match a regular expression pattern with a specified replacement. It’s part of the re module, which provides powerful tools for text processing.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

pattern: The regular expression pattern to search for.
repl: The replacement string or a function to generate the replacement dynamically.
string: The input string to perform the replacements on.
count (optional): The maximum number of replacements to make. Default is 0 (no limit).
flags (optional): Flags to modify the behavior of the pattern (e.g., re.IGNORECASE).

Example:

import re

text = "The rain in Spain falls mainly in the plain."
result = re.sub(r'in', 'out', text)
print(result)
# Output: 'The raout out Spaout falls maoutly out the plaout.'
-------------------------------------------------
# Limiting replacement
text = "123 456 789"
result = re.sub(r'\d', 'X', text, count=4) # Replace only the first 4 digits
print(result)
# Output: 'XXX X56 789'
----------------------------------------------------
# case insensitive replacement
text = "Python is great. python is fun!"
result = re.sub(r'python', 'JavaScript', text, flags=re.IGNORECASE)
print(result)
# Output: 'JavaScript is great. JavaScript is fun!'

------------------------------------------------------
# dynamic replacement using function.
def multiply_by_two(match):
    return str(int(match.group()) * 2)

text = "I have 3 apples and 5 oranges."
result = re.sub(r'\d+', multiply_by_two, text) # Multiply numbers by 2
print(result)
# Output: 'I have 6 apples and 10 oranges.'
----------------------------------------------------------

3. re.subn():
The re.subn() function in Python is similar to re.sub(), but it provides additional information about the number of substitutions made. It performs a search-and-replace operation using a regular expression and returns a tuple containing: 
The modified string.
The number of replacements made.

Returns: A tuple of the form as follows:
(modified_string, number_of_replacements)

Example:
import re

text = "one fish, two fish, red fish, blue fish"
result = re.subn(r'fish', 'whale', text)
print(result)
# Output: ('one whale, two whale, red whale, blue whale', 4)
--------------------------------------------------------
# Limiting replacement.
text = "apple apple apple"
result = re.subn(r'apple', 'orange', text, count=2) # Replace only the first 2 occurrences
print(result)
# Output: ('orange orange apple', 2)
---------------------------------------------------------
# using Regex group replacement
text = "123-456-789"
result = re.subn(r'(\d+)', r'[\1]', text)
print(result)
# Output: ('[123]-[456]-[789]', 3)
----------------------------------------------------------

4. re.compile():
The re.compile() function in Python is used to compile a regular expression pattern into a regex object. This object can then be reused multiple times for pattern matching, making it more efficient when the same pattern is used repeatedly.

Syntax: re.compile(pattern, flags=0)
pattern: The regular expression pattern to compile.
flags (optional): Flags to modify the behavior of the pattern, such as:
  • re.IGNORECASE or re.I: Case-insensitive matching.
  • re.MULTILINE or re.M: Multi-line matching.
  • re.DOTALL or re.S: Make . match newline characters.
  • re.VERBOSE or re.X: Allow more readable regex patterns with comments.

Benefits of Using re.compile():
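Efficiency: Compiling a regex once avoids looking up or recompiling the pattern every time it’s used. The re module also caches recently used patterns, so the gain is most noticeable when the same pattern is applied many times in a loop.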
Readability: You can create a reusable regex object, which improves code clarity.
Advanced Configuration: Predefine flags and patterns for later use.

Example:
import re

pattern = re.compile(r'\d+')  # Matches one or more digits
result = pattern.findall("There are 123 apples and 456 oranges.")
print(result)
# Output: ['123', '456']
------------------------------------------------------------
# Case-insensitive matching
pattern = re.compile(r'hello', re.IGNORECASE)  
result = pattern.search("Hello, how are you?")
print(result.group())
# Output: 'Hello'
-------------------------------------------------------------
# Matches words with exactly 3 characters
pattern = re.compile(r'\b\w{3}\b')  

text1 = "The cat sat on the mat."
text2 = "A bat flew by."

# Use the compiled pattern on multiple strings
print(pattern.findall(text1)) # Output: ['The', 'cat', 'sat', 'the', 'mat']
print(pattern.findall(text2)) # Output: ['bat']

----------------------------------------------------------------

Match Object:

A match object in Python is the result of using methods like re.match(), re.search(), or re.finditer() from the re module. It contains information about the part of the string that matched the regular expression, and it provides methods and attributes to extract useful details about the match.

Example:

import re

pattern = r'\d+'
text = "The number is 12345."

# Get a match object using re.search()
match = re.search(pattern, text)

if match:
    print ("Match found:", match.group()) # Access the matched text
else:
    print ("No match found.")

-------------------------------------------------------
# Extracting information with group
# Match a product code like ABC-1234

text = "The product code is ABC-1234 and costs $45."

pattern = r'(\w+)-(\d+)'  
match = re.search(pattern, text)

if match:
    print ("Full Match:", match.group(0)) # Output: 'ABC-1234'
    print ("Code:", match.group(1)) # Output: 'ABC'
    print ("Number:", match.group(2)) # Output: '1234'
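
Besides group(), a match object also reports where the match was found. The sketch below reuses the product-code text from the example above and shows a few more methods (start(), end(), span(), groups()) together with re.finditer(). The variable names are illustrative.

```python
import re

text = "The product code is ABC-1234 and costs $45."
match = re.search(r"(\w+)-(\d+)", text)

if match:
    start, end = match.start(), match.end()   # → 20, 28
    span = match.span()                       # → (20, 28)
    all_groups = match.groups()               # → ('ABC', '1234')
    matched_slice = text[start:end]           # → 'ABC-1234'
    print(span, all_groups, matched_slice)

# re.finditer() yields one match object per match
number_spans = [m.span() for m in re.finditer(r"\d+", text)]
print(number_spans)  # → [(24, 28), (40, 42)]
```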
-----------------------------------------------------------------------

Raw String with 'r' or 'R' Prefix:

In Python, the r or R prefix before a string denotes a raw string literal. This tells Python to interpret the string literally, without processing escape sequences like \n, \t, or \\.

Why Use Raw Strings in Regex?
Regular expressions often use backslashes (\) for special characters (e.g., \d for digits). Without raw strings, these backslashes would need to be escaped (\\d), making the code harder to read and prone to errors. Raw strings simplify this by treating backslashes as literal characters.

Example:
import re

# Without raw string
pattern = "\\d+"
result = re.findall(pattern, "123 abc 456")
print(result) # Output: ['123', '456']

# With raw string
pattern = r"\d+"
result = re.findall(pattern, "123 abc 456")
print(result) # Output: ['123', '456']
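
The difference matters most when a regex escape is also a Python string escape. As a small sketch (sample text is made up), \b means "backspace" in a normal string but "word boundary" in a regex, so forgetting the r prefix silently changes the pattern:

```python
import re

# "\n" is one character (a newline); r"\n" is two: a backslash and the letter n
lengths = (len("\n"), len(r"\n"))   # → (1, 2)

text = "cat concat cat"

# Normal string: Python turns \b into a backspace character before re sees it
plain = re.findall("\bcat\b", text)   # → []

# Raw string: re receives \b and treats it as a word boundary
raw = re.findall(r"\bcat\b", text)    # → ['cat', 'cat']

print(lengths, plain, raw)
```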

Example (quantifiers):
import re

# *: 0 or more (empty matches are reported at positions where no 'a' is found)
print(re.findall(r"a*", "aaabbb")) # Output: ['aaa', '', '', '', '']

# +: 1 or more
print(re.findall(r"a+", "aaabbb")) # Output: ['aaa']

# ?: 0 or 1
print(re.findall(r"a?", "aaabbb")) # Output: ['a', 'a', 'a', '', '', '', '']

# {n}: exact number
print(re.findall(r"a{2}", "aaabbb")) # Output: ['aa']

# {n,}: n or more
print(re.findall(r"a{2,}", "aaabbb")) # Output: ['aaa']

# {n,m}: between n and m
print(re.findall(r"a{1,2}", "aaabbb")) # Output: ['aa', 'a']

-----------------------------------------------------------------------

3. Greedy vs Non-Greedy Quantifiers

Greedy Quantifiers:
By default, regex quantifiers are greedy, meaning they match as much text as possible.
Examples: *, +, {n,}

Non-Greedy Quantifiers:
Non-greedy quantifiers match as little text as possible. Add ? after a quantifier to make it non-greedy.
Examples: *?, +?, {n,m}?

Example:

import re

text = "<tag>content</tag>"

# Greedy quantifier
result = re.findall(r"<.*>", text)
print(result)  
# Output: ['<tag>content</tag>'] (matches the entire string)

# Non-greedy quantifier
result = re.findall(r"<.*?>", text)
print(result)  
# Output: ['<tag>', '</tag>'] (matches the smallest possible matches)
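
The same idea applies to the other non-greedy forms listed above. A brief sketch with made-up sample strings:

```python
import re

# +? takes one character at a time instead of the whole run
greedy_plus = re.findall(r"a+", "aaa")      # → ['aaa']
lazy_plus = re.findall(r"a+?", "aaa")       # → ['a', 'a', 'a']

# {n,m}? prefers the lower bound n
greedy_range = re.findall(r"a{1,2}", "aaa")   # → ['aa', 'a']
lazy_range = re.findall(r"a{1,2}?", "aaa")    # → ['a', 'a', 'a']

# A typical use: extracting each quoted part separately
quoted = 'say "hi" and "bye"'
greedy_quotes = re.findall(r'".*"', quoted)   # → ['"hi" and "bye"']
lazy_quotes = re.findall(r'".*?"', quoted)    # → ['"hi"', '"bye"']

print(lazy_plus, lazy_range, lazy_quotes)
```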

---------------------------------------------------------------------------------

RegEx Flags:

Regex flags in Python modify the behavior of regular expression patterns. Flags are optional parameters you can pass to regex functions (re.compile(), re.search(), re.match(), etc.) to enable specific functionalities like case-insensitive matching, multi-line handling, and more.


Flags:

  • re.IGNORECASE (re.I): Makes the pattern matching case-insensitive.
  • re.MULTILINE (re.M): Makes ^ and $ match at the beginning and end of each line, not only of the whole string.
  • re.DOTALL (re.S): Makes . match newline characters as well.
  • re.VERBOSE (re.X): Allows whitespace and comments in the regex pattern for readability.
  • re.ASCII (re.A): Makes \w, \d, \s match only ASCII characters (ignores Unicode).
  • re.LOCALE (re.L): Makes \w, \b, and case-insensitive matching depend on the current locale; it can be used only with bytes patterns.
  • re.UNICODE (re.U): Makes \w, \d, \s match Unicode characters (the default for str patterns in Python 3, so the flag is redundant there).
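
A short sketch of several of these flags in use. The sample text and variable names are made up for illustration, and flags are combined with the | operator.

```python
import re

text = "first line\nSecond line"

# re.MULTILINE: ^ matches at the start of every line
one_line = re.findall(r"^\w+", text)                       # → ['first']
each_line = re.findall(r"^\w+", text, flags=re.MULTILINE)  # → ['first', 'Second']

# re.DOTALL: . also matches the newline between the two lines
without_dotall = re.search(r"first.*Second", text)                # → None
with_dotall = re.search(r"first.*Second", text, flags=re.DOTALL)  # → match object

# re.VERBOSE: whitespace and comments inside the pattern are ignored
date = re.compile(r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
""", re.VERBOSE)
parts = date.search("Posted on 2024-11-26").groups()  # → ('2024', '11', '26')

# Combining flags: case-insensitive match at the start of any line
combined = re.findall(r"^second", text, flags=re.IGNORECASE | re.MULTILINE)  # → ['Second']

# re.ASCII: \w no longer matches the accented letter
unicode_word = re.findall(r"\w+", "caf\u00e9")                # → ['café']
ascii_word = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)  # → ['caf']

print(each_line, parts, combined, unicode_word, ascii_word)
```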

