KAUNAS UNIVERSITY OF
SIGNALS THEORY TERM PAPER
WORD RECOGNITION AND ITS METHODS
Author: Amir Salha
Lecturer: Marius Gudauskis
Table of Contents
I. Abstract
II. Introduction to Word Recognition
III. Visual Word Recognition
  i. Theoretical Approach
    Bayes Theorem
    Interactive Activation (IA) Model
    Lexical Competition
    Lexical Decision
    Masked Priming
    Neighbourhood Density
    Open Bigrams
    Reaction Time (RT) Distribution
    Word-Frequency Effect
  ii. Practical Approach
    Step 1: Detect Candidate Text Regions Using MSER
    Step 2: Remove Non-Text Regions Based on Basic Geometric Properties
    Step 3: Remove Non-Text Regions Based on Stroke Width Variation
    Step 4: Merge Text Regions for Final Detection Result
    Step 5: Recognize Detected Text Using OCR
    Text Recognition Using the OCR Function
    Challenges Obtaining Accurate Results
    Image Pre-processing Techniques to Improve Results
    ROI-based Processing to Improve Results
IV. Speech-Recognition Systems and an Example
    The Development Workflow
    Acquiring Speech
    Analyzing the Acquired Speech
    Developing a Speech-Detection Algorithm
    Developing the Acoustic Model
    Selecting a Classification Algorithm
    Building the User Interface
V. Conclusion
VI. References
I. Abstract:
This paper reviews what word recognition is and the methods used to perform it. It introduces visual and speech word recognition and their different approaches, with examples and real-life computations.
II. Introduction to word recognition:
Word recognition is a computational process that converts visual text (images, video) or speech (audio, live voice) into an actual text document. This conversion can be carried out in different ways, either by scanning printed text, as OCR tools do, or by capturing live images. It can also take the form of voice or speech recognition, where computational linguistics provides methodologies and technologies that enable the recognition and translation of spoken language into text by computers.
Visual word recognition:
Is a computational recognition refers to the
branch of computer science that involves reading text from different recourse and
translating the images, videos or live templates into a form that the computer can manipulate
(for example, into ASCII codes).
i. Theoretical Approach:
Bayes theorem:
A mathematical procedure for updating probabilities, or beliefs, in the light of new evidence. In the case of word recognition, the probability of a word given the input, or evidence, is updated according to this rule.
Figure 1: Mathematical equation used (Bayes' theorem).
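Written out, the rule the figure presumably shows is Bayes' theorem applied to word recognition (a reconstruction from the surrounding text, with P denoting probability):

```latex
P(\text{word}\mid\text{input}) = \frac{P(\text{input}\mid\text{word})\,P(\text{word})}{P(\text{input})}
```

The prior P(word) is where effects such as word frequency enter the computation.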
Connectionist models:
Models expressed as artificial neural networks; this includes, for example, the IA model. These models are intended to capture general properties of neurons, or neuronal populations.
Interactive activation (IA) model:
The first, and still most influential, form of connectionist model of word recognition. Words are represented as nodes in a network that are connected by inhibitory links.
Figure 2: The
top panel illustrates a simplified interactive activation model.
Lexical competition:
In both IA models and Bayesian models, neighbouring words compete with each other for recognition. In IA models, this is due to the inhibitory connections between word nodes.
Lexical decision:
The most common laboratory task for studying word recognition. Participants are required to decide whether a string of letters is a word or not.
Masked priming:
A variant of the lexical decision task in which the target is preceded by a briefly presented prime, which can be a word or a nonword. Participants are rarely aware of the prime. The prime is usually presented in lower case and the target in upper case to minimise physical overlap. Masked priming is most commonly used to address questions about the representation of orthography.
Neighbourhood density:
A measure of how similar a word is to other words. A common measure is Coltheart's N: how many other words can be formed by changing a single letter in a word? According to this definition, only words of the same length can be neighbours. A more flexible measure is given by the Levenshtein distance metric, which measures similarity in terms of the number of 'edits' (insertions, deletions, and substitutions), so WORD and WORDS are now considered neighbours. The OLD20 is the average distance of a word's 20 closest neighbours.
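As an illustrative sketch (not part of the original paper), both measures are easy to compute; the toy lexicon below is invented for the demo:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: counts the minimum
    # number of insertions, deletions, and substitutions to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def coltheart_n(word, lexicon):
    # Coltheart's N: same-length words differing in exactly one letter.
    return sum(1 for w in lexicon
               if len(w) == len(word)
               and sum(c1 != c2 for c1, c2 in zip(word, w)) == 1)

lexicon = ["WORD", "WARD", "CORD", "WORK", "WORDS", "SWORD"]
```

Under Coltheart's N, WORDS is not a neighbour of WORD (different length), but its Levenshtein distance from WORD is 1.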
Open bigrams:
A proposal that the order of letters in a word is coded in terms of a set of ordered letter pairs, which may be non-contiguous. WORD would be coded as WO, WR, WD, OR, OD, and RD.
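The open-bigram set is simply every ordered pair of letters, contiguous or not; a minimal sketch:

```python
from itertools import combinations

def open_bigrams(word):
    # All ordered letter pairs in reading order, contiguous or not.
    return ["".join(pair) for pair in combinations(word, 2)]
```

For WORD this yields exactly the six pairs listed above.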
Figure 3: Three
different representations of letter order.
Reaction time (RT) distribution:
RTs in tasks like lexical decision are generally positively skewed. Variables like word frequency rarely shift only the mean of the distribution, but usually change the form of the distribution, too. Accounting for these changes is a challenge for computational models.
Word-frequency effect:
By far the strongest influence on how readily a word can be identified is its frequency of occurrence in the language; words that occur very often in the language are recognised more quickly than low-frequency words. The speed and ease with which words can be recognised is an approximately logarithmic function of word frequency.
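The logarithmic relationship can be sketched as a linear-in-log model of recognition time; the coefficients below are arbitrary illustrative values, not estimates from the literature:

```python
import math

def predicted_rt(freq_per_million, a=900.0, b=80.0):
    # Illustrative model: recognition time (ms) falls roughly linearly
    # as the log of word frequency rises. a and b are invented for the demo.
    return a - b * math.log10(freq_per_million)

rt_common = predicted_rt(50000.0)  # a very high-frequency word
rt_rare = predicted_rt(1.0)        # a low-frequency word
```

A tenfold increase in frequency buys the same RT reduction anywhere on the frequency scale, which is what "approximately logarithmic" means in practice.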
ii. Practical Approach:
Automatically detecting and recognizing text in natural images:
This method shows how to detect regions in an image that contain text.
This is a common task performed on unstructured scenes. Unstructured scenes are
images that contain undetermined or random scenarios. For example, you can
detect and recognize text automatically from captured video to alert a driver
about a road sign. This is different from structured scenes, which contain known scenarios where the position of the text is known beforehand. The automated text detection algorithm detects a large number of text region candidates and progressively removes those less likely to contain text.
Step 1: Detect Candidate Text Regions Using MSER;
Figure 4: MSER regions.
Step 2: Remove Non-Text Regions Based on Basic Geometric Properties;
There are several geometric properties that are good for discriminating between text and non-text regions, including aspect ratio, eccentricity, Euler number, extent, and solidity.
Figure 5: After removing non-text regions based on basic geometric properties.
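The geometric filtering step can be sketched in Python as simple threshold tests on per-region shape statistics; the property names and cut-off values below are assumptions for illustration, not the toolbox's actual thresholds:

```python
def is_text_candidate(props):
    # Heuristic shape tests in the spirit of the step above: letters tend
    # to be compact, moderately elongated blobs. All thresholds are
    # invented for this demo.
    return (0.1 < props["aspect_ratio"] < 3.0 and
            props["eccentricity"] < 0.995 and
            0.2 < props["extent"] < 0.9 and
            props["solidity"] > 0.3)

regions = [
    {"aspect_ratio": 0.6, "eccentricity": 0.7,
     "extent": 0.5, "solidity": 0.8},    # letter-like blob
    {"aspect_ratio": 12.0, "eccentricity": 0.999,
     "extent": 0.95, "solidity": 0.2},   # thin line-like clutter
]
kept = [r for r in regions if is_text_candidate(r)]
```

Only the letter-like region survives the filter; the line-like clutter is discarded.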
Step 3: Remove Non-Text Regions Based on Stroke Width Variation;
Figure 6: After removing non-text regions based on stroke width variation.
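The idea behind this step is that text strokes have nearly constant width, while textures such as foliage do not. A minimal Python sketch of the variation metric, assuming the per-pixel stroke widths have already been measured (the 0.35 cut-off is one commonly used value, taken here as an assumption):

```python
def stroke_width_variation(widths):
    # Coefficient of variation (std / mean) of per-pixel stroke widths.
    # Low values indicate the near-constant width typical of text.
    mean = sum(widths) / len(widths)
    var = sum((w - mean) ** 2 for w in widths) / len(widths)
    return (var ** 0.5) / mean

text_like = [3, 3, 4, 3, 3, 4, 3]       # nearly constant stroke width
non_text = [1, 9, 2, 14, 3, 11, 5]      # wildly varying widths
```

A region is kept when its variation falls below the threshold, e.g. 0.35.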
Step 4: Merge Text Regions for Final Detection Result;
Step 5: Recognize Detected Text Using OCR;
Recognition of text using optical character recognition (OCR):
Optical character recognition is a branch of computer science: a method that involves reading text from paper and translating the images into a form that the computer can manipulate (for example, into ASCII codes). This example shows how to use the ocr function from the Computer Vision System Toolbox™ to perform optical character recognition.
Text Recognition Using the OCR Function:
Recognizing text in images is useful in many
computer vision applications such as image search, document analysis, and robot
navigation. The ocr function provides an easy way to add text recognition functionality to a wide range of applications. It returns the recognized text, the recognition confidence, and the location of the text in the original image. You can use this information to identify the location of misclassified text within the image.
Figure 9: The logo is incorrectly classified as a text character.
These kinds of OCR errors can be identified using the confidence values before any further processing takes place.
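Confidence-based filtering amounts to discarding words whose score falls below some cut-off. A Python sketch, assuming word/confidence lists like those the ocr function returns (the threshold and sample values are invented for the demo):

```python
def filter_low_confidence(words, confidences, threshold=0.5):
    # Keep only recognised words whose confidence clears the threshold.
    # The 0.5 cut-off is an arbitrary illustration, not a fixed rule.
    return [w for w, c in zip(words, confidences) if c >= threshold]

words = ["MathWorks", "®", "R2016a"]
confidences = [0.91, 0.18, 0.87]
reliable = filter_low_confidence(words, confidences)
```

The low-confidence logo misread ("®" here) is dropped before downstream processing.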
Challenges Obtaining Accurate Results:
OCR performs best when the text is located on a uniform background and is formatted like a document. When the text appears on a non-uniform background, additional pre-processing steps are required to get the best OCR results. In this part of the example, you will try to locate the digits on a keypad. Although the keypad image may appear easy for OCR, it is actually quite challenging because the text sits on a non-uniform background.
An empty text result indicates that no text was recognized. In the image, the text is sparse and located on an irregular background. In this case, the heuristics used for document layout analysis within ocr might be failing to find blocks of text within the image and, as a result, text recognition fails. In this situation, disabling the automatic layout analysis, using the 'TextLayout' parameter, may help improve the results.
Adjusting the 'TextLayout' parameter did not help, however. To understand why OCR continues to fail, you have to investigate the initial binarization step performed within ocr. You can use imbinarize to check this initial binarization step, because both ocr and the default 'global' method in imbinarize use Otsu's method for image binarization. After thresholding, the binary image contains no text.
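Otsu's method picks the grey-level threshold that maximises the between-class variance of the histogram. A self-contained Python sketch (the toy pixel values are invented; MATLAB's imbinarize implements the same idea internally):

```python
def otsu_threshold(pixels, levels=256):
    # Search every threshold t and keep the one maximising the
    # between-class variance w0*w1*(mu0 - mu1)^2 of the two classes.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]           # pixels at or below t
        if w0 == 0:
            continue
        w1 = total - w0         # pixels above t
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Bimodal toy "image": dark text pixels near 30, light background near 220.
pixels = [28, 30, 32, 29, 31] * 10 + [218, 220, 222, 219, 221] * 10
```

On a clean bimodal histogram the threshold lands between the two modes; on the keypad image the background variation destroys this bimodality, which is why the built-in binarization fails.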
OCR failed to recognize any text in the original image. You can help improve the results by pre-processing the image to improve the text segmentation.
Image Pre-processing Techniques to Improve Results:
The poor text segmentation seen above is caused by the non-uniform background in the image. The following pre-processing techniques remove the background variations and improve the text segmentation.
Step 1: Pre-processing using morphological reconstruction;
Morphological reconstruction removes the background artifacts and produces a cleaner image for OCR.
Step 2: The locateText method;
There is some "noise" in the results due to the smaller text next to the digits. Also, the digit 0 is falsely recognized as the letter 'o'. This type of error may happen when two characters have similar shapes and there is not enough surrounding text for the ocr function to determine the best classification for a specific character. Despite the "noisy" results, you can still find the digit locations in the original image using the locateText method with the OCR results.
Ignoring irrelevant text using the locateText method.
ROI-based Processing to Improve Results:
In some situations, pre-processing the image alone may not be sufficient to achieve good OCR results. One approach in this situation is to identify specific regions in the image that ocr should process. In the keypad example image, these regions would be those that contain just the digits. You may select the regions manually using imrect, or you can automate the process. One method for automating text detection is given in the example entitled Automatically Detect and Recognize Text in Natural Images. In this example, you will use vision.BlobAnalysis to find the digits on the keypad. Small connected regions within the keypad image are not likely to contain any text and can be removed using the area statistic returned by vision.BlobAnalysis. Here, regions having an area smaller than 300 pixels are removed.
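The area-based filtering can be sketched in pure Python as connected-component labelling followed by deleting small components; the tiny binary grid and min_area value below are invented for the demo:

```python
def label_and_filter(grid, min_area=300):
    # 4-connected component labelling via iterative flood fill, then
    # zero out every component whose pixel count is below min_area,
    # mirroring the area filter described above.
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and not seen[r][c]:
                stack, comp = [(r, c)], []
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                if len(comp) < min_area:
                    for y, x in comp:
                        out[y][x] = 0
    return out

# Toy 3x3 binary image: a 2-pixel blob and an isolated pixel.
grid = [[1, 1, 0],
        [0, 0, 0],
        [0, 0, 1]]
filtered = label_and_filter(grid, min_area=2)
```

With min_area=2, the two-pixel blob survives while the isolated pixel is erased.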
IV. Speech-recognition systems and an example:
A robust speech-recognition system combines accuracy of identification with the ability to filter out noise and adapt to other acoustic conditions, such as the speaker's speech rate and accent.
Speech-recognition systems are classified as isolated or continuous. Isolated word recognition requires a brief pause between each spoken word, whereas continuous speech recognition does not. These systems can be further classified as speaker-dependent or speaker-independent: a speaker-dependent system recognizes speech from only one speaker's voice, whereas a speaker-independent system can recognize speech from anybody.
Developing a robust speech-recognition algorithm is a complex task requiring detailed knowledge of signal processing and statistical modelling. There are two major stages within isolated word recognition: a training stage and a testing stage. Training involves "teaching" the system by building its dictionary, an acoustic model for each word that the system needs to recognize.
The Development Workflow:
The development workflow consists of three steps: speech acquisition, speech analysis, and user interface development.
Acquiring Speech:
For training, speech is acquired from a microphone and brought into the development environment for offline analysis. For testing, speech is continuously streamed into the environment for online processing.
During the training stage, it is necessary to record repeated utterances of each digit in the dictionary. For example, we repeat the word 'one' many times with a pause between each utterance.
In the testing stage, however, we
need to continuously acquire and buffer speech samples, and at the same time,
process the incoming speech frame by frame, or in continuous groups of samples.
We use Data Acquisition Toolbox™
to set up continuous acquisition of the speech signal and simultaneously
extract frames of data for processing.
Analyzing the Acquired Speech:
We begin by
developing a word-detection algorithm that separates each word from ambient
noise. We then derive an acoustic model that gives a robust representation of
each word at the training stage. Finally, we select an appropriate
classification algorithm for the testing stage.
Developing a Speech-Detection Algorithm:
- The speech-detection algorithm is developed by processing the pre-recorded speech frame by frame within a simple loop.
- To detect isolated digits, we use a combination of signal energy and zero-crossing counts for each speech frame. Signal energy works well for detecting voiced signals, while zero-crossing counts work well for detecting unvoiced signals.
- To avoid identifying ambient noise as speech, we assume that each isolated word will last at least 25 milliseconds.
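The two per-frame features are straightforward to compute. A minimal Python sketch (the toy frames are invented; real frames would be windowed samples from the microphone):

```python
def frame_energy(frame):
    # Short-term energy: large for voiced speech, small for silence/noise.
    return sum(s * s for s in frame)

def zero_crossings(frame):
    # Sign-change count: high for noise-like unvoiced sounds (e.g. /s/).
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

voiced_like = [0.8, 0.9, 0.7, 0.8, 0.6, 0.9]             # large, same-sign samples
unvoiced_like = [0.05, -0.04, 0.06, -0.05, 0.04, -0.06]  # small, rapidly alternating
```

A frame is flagged as speech when either feature exceeds its threshold, and a detection is kept only if enough consecutive frames are flagged to cover the minimum word duration.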
Developing the Acoustic Model:
A good acoustic model should be derived from speech characteristics that enable the system to distinguish between the different words in the dictionary.
- Different sounds are produced by varying the shape of the human vocal tract, and these different sounds have different frequency characteristics. To investigate these characteristics, we examine the power spectral density (PSD) estimates of various spoken digits.
- Because the human vocal tract can be modelled as an all-pole filter, we use the Yule-Walker parametric spectral estimation technique from Signal Processing Toolbox™ to calculate these PSDs.
- After importing an utterance of a single digit into the variable 'speech', we visualize the PSD estimate. Because the Yule-Walker algorithm fits an autoregressive linear prediction filter model to the signal, we must specify the order of this filter. We select an arbitrary value of 12, which is typical in speech applications.
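Under the hood, the Yule-Walker method solves a Toeplitz system built from the signal's autocorrelation for the AR coefficients, typically via the Levinson-Durbin recursion. A Python sketch on a synthetic AR(1) signal (order 2 here only to keep the demo small; the toolbox uses the order you specify, e.g. 12):

```python
import random

def autocorr(x, maxlag):
    # Biased sample autocorrelation r[0..maxlag].
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(maxlag + 1)]

def levinson_durbin(r, order):
    # Solve the Yule-Walker equations r[m] + sum_i a[i] r[m-i] = 0
    # for the AR coefficients via the Levinson-Durbin recursion.
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        k = -(r[m] + sum(a[i] * r[m - i] for i in range(1, m))) / err
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# Synthetic AR(1) process: x[n] = 0.9 x[n-1] + e[n].
random.seed(0)
x, prev = [], 0.0
for _ in range(20000):
    prev = 0.9 * prev + random.gauss(0.0, 1.0)
    x.append(prev)
a, err = levinson_durbin(autocorr(x, 2), 2)
```

The recovered coefficient a[1] is close to -0.9, matching the generating filter; the PSD estimate is then the variance err shaped by 1/|A(e^jw)|^2.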
Figure 14a: Yule-Walker PSD estimate of three different utterances of the word "ONE."
Figure 14b: Yule-Walker PSD estimate of three different utterances of the word "TWO."
- Mel Frequency Cepstral Coefficients (MFCCs) are widely used in speech applications because of their robustness. They give a measure of the energy within overlapping frequency bins of a spectrum with a warped (Mel) frequency scale.
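The warping itself is a simple frequency mapping, roughly linear below 1 kHz and logarithmic above, which mimics the resolution of human hearing. A sketch of the standard conversion formulas:

```python
import math

def hz_to_mel(f_hz):
    # Common mel mapping: near-linear below ~1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used to place the mel filter-bank edges in Hz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Filter-bank centres spaced evenly in mel become progressively wider in Hz, which is exactly the "warped frequency scale" mentioned above.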
- Because speech can be considered short-term stationary, MFCC feature vectors are calculated for each frame of detected speech. Using many utterances of a digit and combining all the feature vectors, we can estimate a multidimensional probability density function (PDF) of the vectors for that digit. Repeating this process for each digit, we obtain an acoustic model for each digit.
- During the testing stage, we extract the MFCC vectors from the test speech and use a probabilistic measure to determine the source digit with maximum likelihood. The challenge then becomes selecting an appropriate PDF to represent the MFCC feature vector distributions.
Figure 15: Distribution of the first dimension of
MFCC feature vectors for the digit one.
- Because no single standard distribution fits these vectors well, we fit a Gaussian mixture model (GMM), a sum of weighted Gaussians, which provides a good fit without an arbitrary choice of distribution.
Figure 16: Overlay of estimated Gaussian components (red) and the overall Gaussian mixture model (green) on the observed distribution.
- The complete Gaussian mixture density is parameterized by the mixture weights, mean vectors, and covariance matrices of all component densities. For isolated digit recognition, each digit is represented by the parameters of its GMM.
- To estimate the parameters of a GMM for a set of MFCC feature vectors extracted from training speech, we use an iterative expectation-maximization (EM) algorithm to obtain a maximum likelihood (ML) estimate.
- Given some MFCC training data in the variable MFCCtraindata, we use the Statistics and Machine Learning Toolbox gmdistribution function to estimate the GMM parameters. This function is all that is required to perform the iterative EM calculations.
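To make the EM idea concrete, here is a minimal one-dimensional, two-component version in Python. It is a toy stand-in for gmdistribution, not the toolbox algorithm; the synthetic data and component count are invented for the demo:

```python
import math
import random

def em_gmm_1d(data, n_iter=50):
    # Fit a two-component 1-D Gaussian mixture by EM:
    # E-step computes each component's responsibility for each point,
    # M-step re-estimates weights, means, and variances from them.
    mu = [min(data), max(data)]   # crude initialisation at the extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per point.
        resp = []
        for x in data:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: responsibility-weighted parameter updates.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return w, mu, var

# Synthetic bimodal data: two Gaussian clusters at -4 and +3.
random.seed(1)
data = ([random.gauss(-4.0, 1.0) for _ in range(300)]
        + [random.gauss(3.0, 1.0) for _ in range(300)])
w, mu, var = em_gmm_1d(data)
```

EM recovers the two cluster means; the real system does the same in the higher-dimensional MFCC space, one GMM per digit.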
Selecting a Classification Algorithm:
After estimating a GMM for each digit, we have a dictionary for use at the testing stage.
Given some test speech, we again extract the
MFCC feature vectors from each frame of the detected word.
The objective is to find the digit model with
the maximum a posteriori probability for the set of test feature vectors, which
reduces to maximizing a log-likelihood value.
Given a digit model gmmmodel and some test feature vectors testdata, the log-likelihood value is easily computed using the posterior function in Statistics and Machine Learning Toolbox.
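The decision rule itself reduces to summing per-frame log-likelihoods under each digit's model and taking the argmax. A Python sketch, with each "model" deliberately simplified to a single 1-D Gaussian (a stand-in for a full GMM over MFCC vectors; all names and values are invented for the demo):

```python
import math

def gauss_loglik(x, mu, var):
    # Log density of a 1-D Gaussian at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def classify(frames, models):
    # ML decision rule: total log-likelihood of the frames under each
    # digit model; the digit with the maximum total wins.
    scores = {digit: sum(gauss_loglik(x, m["mu"], m["var"]) for x in frames)
              for digit, m in models.items()}
    return max(scores, key=scores.get)

models = {"one": {"mu": 0.0, "var": 1.0},
          "two": {"mu": 5.0, "var": 1.0}}
```

With equal priors, maximising the summed log-likelihood is equivalent to maximising the a posteriori probability mentioned above.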
Building the User Interface:
After developing the isolated digit recognition system in an offline environment with prerecorded speech, we migrate the system to operate on streaming speech from a microphone input. We use MATLAB GUIDE tools to create an interface that displays the time-domain plot of each detected word as well as the classified digit.
Figure 17: Interface to the final application.
V. Conclusion:
- Visual word recognition is a technology that enables you to convert different types of documents, such as scanned paper documents, PDF files, and images captured by a digital camera, into editable and searchable data, using different methods and software (e.g. OCR, MICR, and others), each a different computational approach capable of turning scanned words into a real document. A live image that is unclear or of low resolution requires further steps to detect the text: removing all the unneeded parts and filtering the image with specific functions provided by the program used.
- Speech-recognition programs work by analysing sounds and converting them to text. They also use knowledge of the language being spoken (for example, English) to decide what the speaker most probably said. Once correctly set up, such systems should recognise what is said, provided you speak clearly, with a high degree of precision.
- A major challenge with speech-recognition technology is that an effective voice user interface requires solid error mitigation and the ability to actively demonstrate the capabilities of the system.
VI. References:
1. MathWorks newsletters, www.mathworks.com/company/newsletters/articles/
2. Chen, Huizhong, et al. "Robust Text Detection in Natural Images with Edge-Enhanced Maximally Stable Extremal Regions." Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011.
3. Gonzalez, Alvaro, et al. "Text location in complex images." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.
4. Li, Yao, and Huchuan Lu. "Scene text detection via stroke width." Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012.
5. Neumann, Lukas, and Jiri Matas. "Real-time scene text localization and recognition." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
6. Smith, Ray. "Hybrid Page Layout Analysis via Tab-Stop Detection." Proceedings of the 10th International Conference on Document Analysis and Recognition. 2009.
7. Morton, J. "The interaction of information in word recognition." Psychol. Rev. 1969;76:165-178.
8. Davis, C.J. "The spatial coding model of visual word identification." Psychol. Rev. 2010;117:713-758.