Feature extraction is another approach to dimensionality reduction, which involves
replacing a set of N features with a set of M features that are
combinations of the original N feature values [4, 32]. The new features are
supposed to be more informative and less redundant. For example, it can consist
of transforming text into numerical features that can then be used for machine
learning.
The following is a short theoretical description of the feature extraction techniques
we will implement (the implementation itself is described in Section
4): Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency
(TF-IDF). Both are well-known methods for representing text in the Vector
Space Model (VSM), that is, for transforming text into a vector of numbers, a format
that is easier for a machine to process. The components of a VSM vector can
be the frequency of a term (BoW) or its importance (TF-IDF) [33].
Bag-of-Words is a simple method of converting text into the VSM. The text
is represented as an unordered collection of words whose frequencies are used as
features for building a classification model. It takes into consideration neither
the grammar of the language used nor the word order [4]. BoW counts all words as equal,
and consequently some uninformative words end up more emphasized
than the important ones, which is not an ideal situation for the
classifier [33].
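As an illustration of the BoW representation described above, the following is a minimal sketch in plain Python rather than the implementation from Section 4. The function name bag_of_words, the whitespace tokenisation and the three example tweets are assumptions made only for this example.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for a list of text documents."""
    # Naive tokenisation (an assumption): lowercase and split on whitespace.
    tokenised = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word in the corpus, sorted alphabetically.
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # One component per vocabulary word; words absent from the text get 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical example tweets, not taken from the dataset.
tweets = ["good morning sunshine", "good night", "morning coffee is good coffee"]
vocabulary, vectors = bag_of_words(tweets)
```

Each tweet becomes a vector with one raw count per vocabulary word, so word order is lost and a frequent but uninformative word such as "good" is counted exactly like any other word.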
TF-IDF is a very commonly used method that measures the importance
(weight) of a term in a particular text within a corpus [4, 34]. In our case of tweets,
if a term appears in many tweets with different emojis, then
that term is not a very good feature for distinguishing between them and
deciding which kind of emoji is used with which kind of tweet. Therefore, these
kinds of terms are given less weight by TF-IDF than the ones that appear
in only a few tweets. In other words, if a term is very frequent in tweets with a
certain label but appears less frequently in tweets with different labels, then that
term is a good indicator of that label [34]. In sum, instead of taking into account
the raw term counts in each tweet (like BoW), TF-IDF looks at a normalized count:
each word count is divided by the number of documents (in our case tweets)
the word appears in [33].
To calculate these weights (w_ij), where i is a term and j is a document, as in
formula 1 below, we first need to find the term frequency (tf), which
is the number of times a term appears in a document divided by the total number
of words in that document. Then, we need to calculate the inverse document
frequency (idf) by taking the logarithm of the number of documents in the
corpus (N) divided by the number of documents in which the particular term appears
(df). The weight is the product of the two:
w_ij = tf_ij × log(N / df_i)    (1)
The TF-IDF value is always greater than or equal to 0. For terms appearing
in many documents, the ratio N/df approaches 1, so the value moves closer to 0
(the more documents a term appears in, the closer the value is to 0) [33].
The downside of TF-IDF that is often mentioned is the high dimensionality of the input,
which requires a lot of computation to weight all the terms in a dataset, since
the size of the feature set is actually the size of the vocabulary of the whole
corpus.
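To make the weighting and the dimensionality issue concrete, the following is a minimal sketch in plain Python rather than the implementation from Section 4. It assumes the weight of term i in document j is tf_ij × log(N / df_i) with the natural logarithm, where tf is the term count divided by the document length, N is the number of documents and df is the number of documents containing the term. The function name tf_idf, the whitespace tokenisation and the example tweets are assumptions made only for this example; library implementations often use smoothed variants of idf that give different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, weight vectors) using w = tf * log(N / df)."""
    # Naive tokenisation (an assumption): lowercase and split on whitespace.
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(tokenised)
    # df: number of documents in which each term appears at least once.
    df = Counter(word for tokens in tokenised for word in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        row = []
        for word in vocabulary:
            # tf: term count divided by document length (0 for an empty text).
            tf = counts[word] / total if total else 0.0
            idf = math.log(n_docs / df[word])
            row.append(tf * idf)
        weights.append(row)
    return vocabulary, weights


# Hypothetical example tweets, not taken from the dataset.
tweets = ["good morning sunshine", "good night", "morning coffee is good coffee"]
vocabulary, weights = tf_idf(tweets)
```

Under this definition a term that occurs in every tweet, here "good", receives weight 0, while a term confined to one tweet, such as "sunshine", receives a positive weight. Every tweet is still represented by a vector with one component per vocabulary word, which is where the high dimensionality comes from.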
