Extract the Title from Image Documents in Python — An Application of RLSA
In this post, we will discuss extracting the title from document images, mainly e-paper articles.
In Python, we have the computer vision library OpenCV (Open Source Computer Vision) and the PIL (Python Imaging Library) module to load an image and process it as a matrix of pixel values.
MATLAB is also popular among scientists working on image processing, but it is commercial software.
Can we extract the title with a single call? The answer is no: image-processing libraries do not provide built-in methods like title() or content(). All we have to do is perform matrix operations and apply the necessary algorithms to obtain our ROI (region of interest) from the document image. Here, our ROI is the title and the content of the given image.
Separating the title and the content from the input image.png, as shown in output.jpg, is the goal.
Terminology
- Contour — every connected pixel region is a contour, e.g. a character in a text.
- RLSA — Run Length Smoothing Algorithm
- ROI — Region Of Interest
Steps to follow
- RGB to Binary image Conversion.
- Apply Contour Height heuristic.
- Apply the RLSA algorithm on the image.
- Apply Contour width heuristic.
RGB to Binary Conversion:
import cv2

image = cv2.imread('image.png') # read the image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) # convert to grayscale
# with THRESH_OTSU, the threshold is computed automatically and the 150 is ignored
(thresh, binary) = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU) # convert to binary
cv2.imshow('binary', binary)
cv2.imwrite('binary.png', binary)
Apply Contour Height heuristic:
- Explaining contours
- Filter contours based on average height
Explaining contours:
What are contours, and how do we find them?
# OpenCV 3.x returns three values here; in OpenCV 4.x, findContours returns only (contours, hierarchy)
(_, contours, _) = cv2.findContours(~binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # find contours
for contour in contours:
    # draw a rectangle around each contour on the main image
    [x, y, w, h] = cv2.boundingRect(contour)
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)
cv2.imshow('contour', image)
cv2.imwrite('contours.png', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
From the binary image, we found the contours and drew their bounding boxes on the original image.
Filter contours based on average height:
- Create a blank image with the same dimensions as the original image.
- Find the average height from all the contour heights.
- Draw the contours taller than twice the average onto the blank image.
import numpy as np

mask = np.ones(image.shape[:2], dtype="uint8") * 255 # blank (white) image with the same dimensions as the original
(_, contours, _) = cv2.findContours(~binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
heights = [cv2.boundingRect(contour)[3] for contour in contours] # collect the height of each contour
avgheight = sum(heights) / len(heights) # average height
# applying the height heuristic: keep only the larger contours
for c in contours:
    [x, y, w, h] = cv2.boundingRect(c)
    if h > 2 * avgheight:
        cv2.drawContours(mask, [c], -1, 0, -1)
cv2.imshow('filter', mask)
cv2.imwrite('filter.png', mask)
We now have the title contours along with some noise. Let us perform RLSA to connect the title contours and filter out the noise with a few heuristics.
Apply the RLSA algorithm on the image:
Check the documentation of this algorithm here.
pip install pythonRLSA
Applying RLSA Horizontal on the image
from pythonRLSA import rlsa
import math

x, y = mask.shape
value = max(math.ceil(x / 100), math.ceil(y / 100)) + 20 # heuristic
mask = rlsa.rlsa(mask, True, False, value) # apply horizontal RLSA
cv2.imshow('rlsah', mask)
cv2.imwrite('rlsah.png', mask)
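For intuition, here is a minimal sketch of what horizontal RLSA does, assuming (as in the mask above) that text pixels are black (0) and the background is white (255); the pythonRLSA package implements this more efficiently, so this is only an illustration of the idea.

def horizontal_rlsa_sketch(img, value):
    """Illustrative horizontal RLSA: connect black pixels (0) on the same row
    when the white gap between them is shorter than `value` pixels."""
    out = img.copy()
    for row in out: # each row is a view into `out`, so edits apply in place
        last_black = None # column of the last black pixel seen in this row
        for col, pixel in enumerate(row):
            if pixel == 0:
                if last_black is not None and col - last_black <= value:
                    row[last_black:col] = 0 # smear the short white run
                last_black = col
    return out

# usage (illustrative): smeared = horizontal_rlsa_sketch(mask, value)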
Apply Contour width heuristic:
Since the title is usually a long run of text, we can filter it out by applying a width heuristic.
- Create a blank image
- Find contours on rlsah.png (the mask)
- Filter contours based on width
- Copy the title to a blank image
- Nullify the title in the original to get the content part
(_, contours, _) = cv2.findContours(~mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # find contours on the RLSA output
mask2 = np.ones(image.shape, dtype="uint8") * 255 # blank 3-channel image
for contour in contours:
    [x, y, w, h] = cv2.boundingRect(contour)
    if w > 0.60 * image.shape[1]: # width heuristic applied
        title = image[y: y + h, x: x + w]
        mask2[y: y + h, x: x + w] = title # copy the title contour onto the blank image
        image[y: y + h, x: x + w] = 255 # nullify the title contour on the original image
cv2.imshow('title', mask2)
cv2.imwrite('title.png', mask2)
cv2.imshow('content', image)
cv2.imwrite('content.png', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
# mask — ndarray we got after applying horizontal RLSA
# mask2 — blank image the title is copied onto
You can pass these images to pytesseract to get the text:
import pytesseract
from PIL import Image

title = pytesseract.image_to_string(Image.fromarray(mask2))
content = pytesseract.image_to_string(Image.fromarray(image))
In the snippet above, mask2 is the title ndarray and image is the content ndarray.
This post explained the basic steps involved in extracting the title using RLSA. We can add more pre-processing/cleaning steps to get the expected output from complex images. Let's analyze the approach above.
Analysis of the post:
- Why did we apply only horizontal RLSA on the image?
- What if the title is spread over two lines?
- How did we get the value passed to RLSA?
If the title is spread over two lines, we can perform both horizontal and vertical RLSA. But pictures and lines present in the image might cause issues: a picture that sits close to the title can get connected to it when we perform vertical RLSA. I suggest cleaning the image first.
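If the title does span two lines, the smearing can be done in both directions with the same call signature used above, where the second and third arguments toggle horizontal and vertical RLSA (only do this after cleaning, as noted):

from pythonRLSA import rlsa
mask = rlsa.rlsa(mask, True, True, value) # horizontal and vertical RLSA together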
Clean the Image:
To avoid these complications, remove the pictures and lines present in the image. Pictures can be filtered by the area, width, and height of their contours; they will usually be the largest contours in the image. Lines can be detected and separated out with Canny edge detection and the Hough lines transform.
Detect the lines and pictures, draw them on a separate binary mask, and do a bitwise AND with the original binary image to subtract out the noise.
We can start extracting the ROI after the cleaning process, i.e., from binary_noise_removal.png, for better results.
from typing import List

minLineLength = 100
maxLineGap = 50

def lines_extraction(gray: List[int]) -> List[int]:
    """
    This function extracts the lines from the binary image. Cleaning process.
    """
    edges = cv2.Canny(gray, 75, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 100, minLineLength, maxLineGap)
    return lines

mask = np.ones(image.shape[:2], dtype="uint8") * 255 # create a white image
lines = lines_extraction(gray) # extract lines
try:
    for line in lines: # write lines to mask
        x1, y1, x2, y2 = line[0]
        cv2.line(mask, (x1, y1), (x2, y2), (0, 255, 0), 3)
except TypeError:
    pass

(_, contours, _) = cv2.findContours(~binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE) # find contours
areas = [cv2.contourArea(c) for c in contours] # find the area of each contour
avgArea = sum(areas) / len(areas) # average area
for c in contours: # average area heuristic
    if cv2.contourArea(c) > 60 * avgArea:
        cv2.drawContours(mask, [c], -1, 0, -1)
binary = cv2.bitwise_and(binary, binary, mask=mask) # subtract the noise
cv2.imwrite('noise.png', mask)
cv2.imshow('mask', mask)
cv2.imshow('binary_noise_removal', ~binary)
cv2.imwrite('binary_noise_removal.png', ~binary)
cv2.waitKey(0)
cv2.destroyAllWindows()
Load the binary image from the code snippet provided earlier.
Value of RLSA:
The value passed to RLSA might change with the image structure and the text language. Since almost all English e-paper articles follow a similar structure, the value rarely needs to change.
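As a quick illustration, for a hypothetical 2000 x 1500 pixel scan, the heuristic from the earlier snippet gives:

import math

x, y = 2000, 1500 # hypothetical page dimensions
value = max(math.ceil(x / 100), math.ceil(y / 100)) + 20
print(value) # 40 — white gaps narrower than ~40 pixels get smeared together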
Along with RLSA, we can also use the smearing algorithm.
Code and examples are provided in this GitHub repository.
Thanks for reading! If you like the concept, please don't forget to endorse my skills on LinkedIn.
Love the article? Please clap and share if you like the post, and please comment if you have any questions about it.