Strings in Python and Text Processing

Strings in Python are sequences of characters such as letters, digits, symbols, and even non-printable characters. A string is a series of characters treated as a single unit.

In Python, a string consists of characters enclosed in matching single or double quotation marks.

When you define a string, the opening and closing quotation marks must match. For instance, if you open a string with a single quotation mark, you must close it with a single quotation mark as well.


S1 = 'this is string'

S2 = "this is correct"


#the next two lines raise a SyntaxError because the quotes do not match

S3 = "this is incorrect'

S4 = 'this is incorrect"

Indexing and slicing

Strings are sequences of characters, so the individual characters in a string can be accessed using their positional indexes. Indexing starts at 0, so the first character is at index 0.

For example:


name = 'henry'

print(name[0])

print(name[1])

print(name[2])

print(name[3])

print(name[4])

#output

#h

#e

#n

#r

#y
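To make the valid index range concrete, here is a minimal sketch (the variable names are only illustrative). It assumes nothing beyond the built-in len() function, and shows that the last valid index is one less than the length of the string, while an index past the end raises an IndexError.

```python
name = 'henry'

# Valid indexes run from 0 up to len(name) - 1.
last_index = len(name) - 1
last_char = name[last_index]

# Indexing past the end raises IndexError instead of returning an empty string.
try:
    name[len(name)]
    out_of_range_failed = False
except IndexError:
    out_of_range_failed = True

print(last_index, last_char, out_of_range_failed)

#output

#4 y True
```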

Slicing strings

Slicing means extracting one or more characters from a string. The syntax is shown below:


color = 'yellow'

print(color[0:3])

#output

#yel

The slice above returns the characters of the string named color, starting at position 0 and ending at, but not including, position 3.

Therefore, color[3:6] will return ‘low’ as the result.
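The short sketch below illustrates two consequences of the "stop index is excluded" rule. The variable names are illustrative only, and the + operator used to join the pieces is the concatenation operator. Adjacent slices fit together without overlapping, and a slice that runs past the end of the string is clipped rather than raising an error.

```python
color = 'yellow'

first_half = color[0:3]    # positions 0, 1, 2
second_half = color[3:6]   # positions 3, 4, 5

# Because the stop index is excluded, the two slices join back with no overlap.
rejoined = first_half + second_half

# Unlike indexing, an out-of-range slice is clipped instead of raising an error.
clipped = color[3:100]
empty = color[10:20]

print(rejoined)
print(clipped)
print(empty == '')

#outputs

#yellow

#low

#True
```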

Negative indexes

Aside from regular positional indexing, which starts from the left side of the string, you can also access characters starting from the right side using negative indexes.

The last character has an index of -1, the second to last has an index of -2, and so on.


color = 'yellow'

print(color[-1])

print(color[-2])

print(color[-3])

print(color[-4])

print(color[-5])

print(color[-6])

#outputs

#w

#o

#l

#l

#e

#y
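One way to think about negative indexes is that -i refers to the same character as len(color) - i. The sketch below checks that relationship for every position; the loop and the list named matches are just an illustration, not something the language requires you to write.

```python
color = 'yellow'

matches = []
for i in range(1, len(color) + 1):
    negative = -i                # counts from the right: -1, -2, ...
    positive = len(color) - i    # the same position counted from the left
    matches.append(color[negative] == color[positive])

print(matches)

#output

#[True, True, True, True, True, True]
```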

Now, let’s apply this concept to slicing strings.


color = 'yellow'

print(color[0:-1])

print(color[0:-2])

print(color[0:-4])

print(color[-6:-1])

print(color[-6:5])

#output

#yello

#yell

#ye

#yello

#yello

In slicing, if the start index is zero, you can leave it empty as shown below:


S = 'hello'

print(S[0:5])

print(S[:5])


#outputs

#hello

#hello

Similarly, if you want the slice to run through the last item in a sequence, you can leave the stop index empty. This is particularly useful if you don’t know the position of the last item in a sequence.


S = 'hello'


print(S[0:5])

print(S[0:])

print(S[:])

#outputs

#hello

#hello

#hello
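Empty start and stop positions combine naturally with negative indexes. As a rough sketch, assuming a made-up file name used purely for illustration, the last few characters and everything before them can be taken like this:

```python
filename = 'report.txt'   # hypothetical file name, for illustration only

extension = filename[-4:]   # from the fourth-to-last character to the end
stem = filename[:-4]        # from the start up to, but not including, that character
copy = filename[:]          # both positions empty: the whole string

print(extension)
print(stem)
print(copy)

#outputs

#.txt

#report

#report.txt
```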

Slicing in steps

Adding a second colon after the stop index lets you specify a step value. The step is the amount by which the index increases on each move from the start index toward the stop index. If no step is provided, it defaults to 1, which means that every item is included. A step of 2 takes every second item, skipping one item each time, as shown below.


S = 'hello'

print(S[::])

print(S[::2])

print(S[::3])


#outputs

#hello

#hlo

#hl

If the step value is a negative number, the slice is taken in reverse, as shown below.


S = 'hello'

print(S[::-1])

#output

#olleh
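A common use of the reversing slice is checking whether a word reads the same in both directions. The function name is_palindrome below is a hypothetical helper written for this illustration; it is a minimal sketch that does not ignore case or spaces.

```python
def is_palindrome(text):
    # A palindrome equals its own reverse.
    return text == text[::-1]

print(is_palindrome('level'))
print(is_palindrome('hello'))

# A negative step other than -1 also walks backwards, skipping items.
every_other_reversed = 'hello'[::-2]
print(every_other_reversed)

#outputs

#True

#False

#olh
```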

Immutability of strings

Strings are immutable, meaning that you cannot remove or update characters in a string once it has been defined. Trying to do so raises an error. For instance:


color = 'yellow'

color[0] = 'Y'

#error message

#Traceback (most recent call last):

#File "/Users/ex.py", line 2, in <module>

#color[0] = 'Y'

#TypeError: 'str' object does not support item assignment
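Since a string cannot be changed in place, the usual approach is to build a new string instead. The sketch below shows two possible ways to do that, slicing plus the + operator and the built-in str.replace() method. In both cases the original string is left untouched.

```python
color = 'yellow'

# Build a new string from a new first character and the rest of the old one.
capitalized = 'Y' + color[1:]

# str.replace() also returns a new string rather than modifying the original.
replaced = color.replace('y', 'Y')

print(capitalized)
print(replaced)
print(color)

#outputs

#Yellow

#Yellow

#yellow
```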

Operations on strings

Strings and string manipulation are central to most programming languages, and Python is no exception. Let’s look at different operations that can be performed on a string.

Concatenation

You can add two or more strings together to get another string. This is done using the + operator.


S = 'hello' + 'world'

print(S)

S = 'hello' + ' ' + 'world' + '!'

print(S)


#outputs

#helloworld

#hello world!
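Two practical notes on concatenation, shown as a small sketch with illustrative variable names. The + operator only joins strings with other strings, so a number has to be converted with str() first. When joining many pieces, the built-in str.join() method is a common alternative to a chain of + operators.

```python
count = 3

# Adding a str and an int raises TypeError, so convert the number first.
try:
    message = 'apples: ' + count
except TypeError:
    message = 'apples: ' + str(count)

# str.join() places the separator between every item of the list.
words = ['hello', 'world']
joined = ' '.join(words)

print(message)
print(joined)

#outputs

#apples: 3

#hello world
```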

Repetition

You can repeat a string any number of times using the * operator. This works for a single character or a group of characters, as shown below:


print('h'*10)

print('hello'*3)


#output

#hhhhhhhhhh

#hellohellohello
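Repetition is handy for simple text formatting. As a small sketch, with a title chosen only for illustration, a line of dashes can be made exactly as long as the text above it. Repeating a string zero times gives an empty string.

```python
title = 'Strings in Python'

# Repeat '-' once for every character in the title.
underline = '-' * len(title)

nothing = 'hello' * 0

print(title)
print(underline)
print(nothing == '')

#outputs

#Strings in Python

#-----------------

#True
```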

Iteration on strings

Strings are sequences of characters and can be iterated over with a loop, like any other type of sequence.


color = 'yellow'

for char in color:

    print(char)

#output

#y

#e

#l

#l

#o

#w
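If you also need the position of each character while looping, the built-in enumerate() function yields the index together with the character. The sketch below, with illustrative variable names, collects the positions at which the letter l appears.

```python
color = 'yellow'

positions = []
for index, char in enumerate(color):
    # index is the position, char is the character at that position
    if char == 'l':
        positions.append(index)

print(positions)

#output

#[2, 3]
```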

Membership Test

Strings support membership tests. Using the in operator, you can determine whether a character or group of characters is in a given string. The outcome of this operation is either True or False.


color = 'yellow'

print('y' in color)

print('llo' in color)

print('x' in color)


#outputs

#True

#True

#False
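A few details of the membership test are easy to miss, so here is a short sketch of them with illustrative variable names. The test is case-sensitive, the characters must appear together and in the same order, and not in gives the opposite answer.

```python
color = 'yellow'

has_upper_y = 'Y' in color                    # case matters
has_y_any_case = 'Y'.lower() in color.lower() # compare in lower case instead
out_of_order = 'oll' in color                 # the letters exist, but not in this order
missing_x = 'x' not in color                  # opposite of the in test

print(has_upper_y)
print(has_y_any_case)
print(out_of_order)
print(missing_x)

#outputs

#False

#True

#False

#True
```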

Triple quotes

Triple quotes are used to create multiline strings. They are also the conventional way to write docstrings, which are strings that document a module, function or class. Even though some people use triple quotes as a way of writing comments, technically, any text in triple quotes is a string.

You can use three single quotes or three double quotes to define such a string, but the opening and closing quotes must be of the same kind, otherwise, you will get a syntax error.

However, it’s recommended that you use ordinary single or double quotes for defining short strings. Triple quotes are predominantly used for code documentation. Also, avoid using triple quotes for commenting unless the text is part of your program documentation.


"""

This is the documentation for the program

"""

#triple quotes used in defining strings

S = """Hello 1"""

print(S)

S = '''hello 2'''

print(S)

S = '''Wrong"""

#outputs

#Hello 1

#hello 2

#File "/Users/mac/Documents/portfolio/tools/ex.py", line 13

#S = '''Wrong"""

#^

#SyntaxError: unterminated triple-quoted string literal (detected at line 17)
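To illustrate both uses, the sketch below defines a multiline string and a function with a docstring. The function greet and the text in poem are made up for this example. A triple-quoted string placed as the first statement of a function is stored as its documentation and can be read back through the __doc__ attribute.

```python
def greet(name):
    """Return a greeting for the given name."""
    return 'Hello, ' + name

# A triple-quoted string may span several lines; the line break is kept.
poem = """line one
line two"""

line_count = len(poem.splitlines())
doc = greet.__doc__

print(line_count)
print(doc)

#outputs

#2

#Return a greeting for the given name.
```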

Escape sequences

Escape sequences are combinations of characters that start with a backslash and are used to represent non-printable or otherwise hard-to-type characters inside strings.

Here are some escape sequences and their meanings.

\' – single quote

\" – double quote

\\ – backslash

\n – new line

\r – carriage return

\t – horizontal tab

\b – backspace

\uxxxx – 16-bit Unicode hex value

\Uxxxxxxxx – 32-bit Unicode hex value

\ooo – character with the octal value ooo

Examples of the use of escape sequences

Now, we will demonstrate the use of escape sequences with the following code examples.

Single and double quotes

The following examples illustrate how to include single or double quotes in a string. If you want a double quote in your string, the simplest way is to use single quotes as the outer quotation marks. In the same manner, if you want a single quote in your string, use double quotes as the outer quotation marks. Alternatively, you can keep the same kind of outer quotes and escape the inner quote with a backslash.


S = 'There are "3" apples'

print(S)

S = "There are '3' apples"

print(S)

#outputs

#There are "3" apples

#There are '3' apples
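When a string needs both kinds of quotes, switching the outer quotes is not enough, and this is where the \' and \" escape sequences come in. The sketch below, with illustrative variable names, writes the same text twice, once inside single quotes and once inside double quotes, and checks that the two strings are equal.

```python
single_outer = 'It\'s "3" apples'    # \' escapes the single quote
double_outer = "It's \"3\" apples"   # \" escapes the double quotes

same = single_outer == double_outer

print(single_outer)
print(same)

#outputs

#It's "3" apples

#True
```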

Tabs and new line

This is how to use the tab and new line escape sequences in Python.


S = 'This is line 1\nThis is line 2\nThis is line 3'

print(S)

S = 'Word 1\tWord 2\tWord 3'

print(S)

#outputs

#This is line 1

#This is line 2

#This is line 3

#Word 1  Word 2  Word 3
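It can help to see that an escape sequence such as \t is stored as a single character, even though it is typed as two. In this sketch the built-in repr() function is used to display the escape sequence instead of the whitespace it produces, and split() and splitlines() break a string apart at tabs and new lines.

```python
S = 'Word 1\tWord 2'

length = len(S)             # 6 characters + 1 tab + 6 characters
columns = S.split('\t')     # split at every tab
lines = 'This is line 1\nThis is line 2'.splitlines()
shown = repr(S)             # shows \t rather than the tab itself

print(length)
print(columns)
print(lines)
print(shown)

#outputs

#13

#['Word 1', 'Word 2']

#['This is line 1', 'This is line 2']

#'Word 1\tWord 2'
```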

Backslash

The backslash has a special meaning to the Python interpreter, so in order to represent the character itself in your string, you have to escape it with another backslash as shown below:


S = 'C:\\Programs\\Hintacare'

print(S)

#output

#C:\Programs\Hintacare
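An alternative worth knowing, although not covered above, is the raw string. Prefixing a string with r tells Python to leave the backslashes as they are. The sketch below suggests that both spellings produce the same string; note that a raw string cannot end with a single backslash.

```python
escaped = 'C:\\Programs\\Hintacare'   # every backslash is doubled
raw = r'C:\Programs\Hintacare'        # raw string: backslashes are kept as typed

same_path = escaped == raw
backslashes = escaped.count('\\')     # count the single backslash characters

print(same_path)
print(backslashes)

#outputs

#True

#2
```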

Unicode and Octal Values

The following examples show how to represent Unicode and octal values in Python strings. In the octal escape sequence, ooo stands for up to three octal digits, so \043 is the character with the octal value 43, which is #.


S = 'This is an octal value - \043'

print(S)

S = 'This is a 32-bit Unicode hex value - \U00000023'

print(S)

S = 'This is a 16-bit Unicode hex value - \u0066'

print(S)

#outputs

#This is an octal value - #

#This is a 32-bit Unicode hex value - #

#This is a 16-bit Unicode hex value - f
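To see how these escape sequences relate to each other, the sketch below uses the built-in ord(), chr() and oct() functions; the variable names are illustrative. The character # has the code point 35, which is 43 in octal and 23 in hexadecimal, so the octal and 32-bit forms describe the same character.

```python
octal_hash = '\043'            # octal 43
long_hex_hash = '\U00000023'   # hexadecimal 23, written with eight digits
short_hex_f = '\u0066'         # hexadecimal 66, written with four digits

code_point = ord('#')          # the number behind the character
as_octal = oct(code_point)     # the same number written in octal
round_trip = chr(0x66)         # back from a number to a character

print(octal_hash == long_hex_hash)
print(code_point)
print(as_octal)
print(round_trip == short_hex_f)

#outputs

#True

#35

#0o43

#True
```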
