Unstructured data, as the name suggests, does not have a structured format and may contain data such as dates, numbers or facts.

This results in irregularities and ambiguities which make it difficult to understand using traditional programs when compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
Source: Wikipedia.
A few examples of unstructured data types are:
- Emails
- Word Processing Files
- PDF files
- Spreadsheets
- Digital Images
- Video
- Audio
- Social Media Posts etc.

Unstructured Data Classification Dataset Description
The unstructured dataset contains the customer usage patterns of a telecommunication company.
The following is a description of our dataset:
No. of Classes: 2 (Spam / Ham)
No. of attributes (Columns): 2
No. of instances (Rows) : 5574
To start with data loading, import the required Python packages and load the downloaded CSV file.
The data can be stored as a DataFrame for easy manipulation and analysis. Pandas is one of the most widely used libraries for this.
import pandas as pd
import csv
Data Loading
with open('dataset.csv') as f:
    messages = [line.rstrip() for line in f]
print(len(messages))
Appending column headers
messages = pd.read_csv('dataset.csv', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])
Unstructured Data Classification Analysis
Analyzing data is a must in any classification problem. The goal of data analysis is to derive useful information from the given data for making decisions.
In this section, we will look at the dataset's size, its column headers, a summary of the data, and a few sample rows.
You can see the dataset size using :
data_size=messages.shape
print(data_size)
Column names can be viewed by :
messages_col_names=list(messages.columns)
print(messages_col_names)
To understand aggregate statistics easily, use the following command :
print(messages.groupby('label').describe())
To see a sample of the data, use the following command :
print(messages.head(3))
Unstructured Data Classification Target Identification
Target is the class/category to which you will assign the data.
In this case, you aim to identify whether the message is spam or not.
Looking at the columns, we can see that the label column takes the values Spam or Ham. We can call this case study a binary classification problem, since it has only two possible outcomes.
Identifying the outcome/target variable.
message_target=messages['label']
print(message_target)
Unstructured Data Classification Algorithms
Various algorithms can be used to solve classification problems.
We will discuss the following :
- Decision Tree Classifier
- Stochastic Gradient Descent Classifier
- Support Vector Machine Classifier
- Random Forest Classifier
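As a minimal sketch of how these four classifiers might be tried out, the example below assumes scikit-learn is installed and uses a handful of invented toy messages rather than the real dataset. The TF-IDF vectorizer, the linear kernel, and every parameter value are illustrative assumptions, not settings taken from this article.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy messages invented for illustration; they are not from dataset.csv.
train_texts = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "urgent offer win cash",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lunch today",
]
train_labels = ["spam"] * 4 + ["ham"] * 4
test_texts = ["free prize offer", "call me tonight"]
test_labels = ["spam", "ham"]

# Convert raw text into numeric TF-IDF vectors.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Initialize each classifier (parameter values are arbitrary choices).
classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SGD": SGDClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)          # train
    predicted = clf.predict(X_test)         # predict
    results[name] = accuracy_score(test_labels, predicted)  # evaluate
    print(name, results[name])
```

With so few messages the scores say little about real performance; the point is only the shared fit / predict interface across the four models.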
Structured Data Classification vs. Unstructured Data Classification

Trending Unstructured Data Classification Interview Questions & Answers Asked in 2022
A classifier that can compute using numeric as well as categorical values is __
Choose the correct option from below list
a) Naive Bayes Classifier
b) Decision Tree Classifier
c) SVM Classifier
d) Random Forest Classifier
Right Answer is — d) Random Forest Classifier
The following are preprocessing methods used for unstructured data classification, except _
Choose the correct option from below list
a) Confusion_matrix
b) Stop word removal
c) Stemming
d) Lemmatization
Right Answer is — a) Confusion_matrix
TF and IDF use matrix representations.
Choose the correct option from below list
a) False
b) True
Right Answer is — b) True
An algorithm that counts how many times a word appears in a document is __
Choose the correct option from below list
a) Bag-of-Words (BOW)
b) TF-IDF
c) TDM
d) DTM
Right Answer is — a) Bag-of-Words (BOW)
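A rough illustration of the Bag-of-Words idea, using only the Python standard library. The helper name `bag_of_words` and the sample sentence are made up for this sketch, and tokenization is assumed to be a simple whitespace split.

```python
from collections import Counter

def bag_of_words(document):
    """Count how many times each lowercased word appears in a document."""
    tokens = document.lower().split()  # naive whitespace tokenization
    return Counter(tokens)

counts = bag_of_words("Free entry free prize call now")
print(counts["free"])   # 2
print(counts["prize"])  # 1
```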
Inverse Document frequency is used in the term document matrix.
Choose the correct option from below list
a) True
b) False
Right Answer is — b) False
The inverse document frequency is a measure of whether a term is common or rare across a given corpus of documents. It is obtained by dividing the total number of documents by the number of documents in the corpus that contain the term, and then taking the logarithm of that quotient.
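Written as a formula, one common form of the definition is shown below. The symbols are introduced here for illustration: N is the total number of documents in the corpus D, and the denominator counts the documents d that contain the term t.

```latex
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
```

A term that appears in every document gets an IDF of log(1) = 0, while a term that appears in only a few documents gets a larger value.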
Identify the stop word(s) from the following.
Choose the correct option from below list
a) Both “the” and “it”
b) “the”
c) “fragment”
d) “it”
e) “computer”
Right Answer is — a) Both “the” and “it”
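Stop word removal can be sketched in a few lines. The stop word list below is a small hand-picked set assumed purely for illustration; real projects usually rely on a much longer list supplied by an NLP library.

```python
# A tiny, hand-picked stop word list (an assumption for this example).
STOP_WORDS = {"the", "it", "a", "an", "is", "of", "and", "to"}

def remove_stop_words(text):
    """Return the lowercased words of `text` that are not stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

filtered = remove_stop_words("The computer read the fragment and it stopped")
print(filtered)  # ['computer', 'read', 'fragment', 'stopped']
```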
Which NLP technique uses a lexical knowledge base to obtain the correct base form of the words?
Choose the correct option from below list
a) lemmatization
b) tokenization
c) object standardization
d) stop word removal
Right Answer is — a) lemmatization
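The idea of looking words up in a lexical knowledge base can be shown with a toy dictionary. The `LEXICON` mapping below is a hypothetical stand-in invented for this sketch; a real lemmatizer would consult a full lexical database instead.

```python
# Hypothetical miniature lexicon mapping inflected forms to base forms.
LEXICON = {
    "mice": "mouse",
    "ran": "run",
    "running": "run",
    "better": "good",
    "was": "be",
}

def lemmatize(word):
    """Look the word up in the lexicon; fall back to the lowercased word."""
    lowered = word.lower()
    return LEXICON.get(lowered, lowered)

lemmas = [lemmatize(w) for w in "The mice ran".split()]
print(lemmas)  # ['the', 'mouse', 'run']
```

Note that "better" maps to "good", which a rule that merely strips suffixes could not produce; that is what the lookup contributes.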
The classification where each data is mapped to more than one class is called _
Choose the correct option from below list
a) Multi Class Classification
b) Binary Classification
c) Multi Label Classification
Right Answer is — c) Multi Label Classification
High classification accuracy always indicates a good classifier.
Choose the correct option from below list
a) False
b) True
Right Answer is — a) False
High accuracy alone does not mean the classifier is good overall. On an imbalanced dataset, for example, a classifier that always predicts the majority class can score high accuracy while misclassifying every instance of the minority class.
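A small worked example of why accuracy can mislead, using invented numbers: 95 ham messages, 5 spam messages, and a trivial "classifier" that labels everything as ham.

```python
# Invented, imbalanced labels: 95 ham and 5 spam messages.
actual = ["ham"] * 95 + ["spam"] * 5
# A trivial classifier that always predicts the majority class.
predicted = ["ham"] * 100

pairs = list(zip(actual, predicted))
accuracy = sum(a == p for a, p in pairs) / len(actual)

# Recall for spam: the fraction of real spam that was caught.
spam_caught = sum(a == "spam" and p == "spam" for a, p in pairs)
spam_recall = spam_caught / actual.count("spam")

print(accuracy, spam_recall)  # 0.95 0.0
```

The accuracy is 95%, yet not a single spam message is detected.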
Pruning is a technique associated with __
Choose the correct option from below list
a) SVM
b) Decision tree
c) Logistic regression
d) Linear regression
Right Answer is — b) Decision tree
Can we consider sentiment classification as a text classification problem?
Choose the correct option from below list
a) Yes
b) No
Right Answer is — a) Yes
A technique used to depict the performance in a tabular form that has 2 dimensions namely actual and predicted sets of data is _
Choose the correct option from below list
a) Confusion Matrix
b) Classification Accuracy
c) Cross Validation
d) Classification Report
Right Answer is — a) Confusion Matrix
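To make the actual-versus-predicted layout concrete, here is a minimal standard-library sketch. The function name, the label order, and the sample labels are assumptions for illustration; rows correspond to actual classes and columns to predicted classes.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Build a matrix with rows = actual class, columns = predicted class."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(a, p)] for p in labels] for a in labels]

actual    = ["spam", "spam", "ham", "ham",  "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]

matrix = confusion_matrix(actual, predicted, labels=["spam", "ham"])
print(matrix)  # [[2, 1], [1, 2]]
```

Here 2 spam messages were correctly flagged, 1 spam message was missed, 1 ham message was wrongly flagged, and 2 ham messages were correctly passed.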
Select the correct statement about Nonlinear classification.
Choose the correct option from below list
a) The concept of slack variables is used in SVM for Nonlinear classification
b) Kernel trick is used in SVM for non-linear classification
c) Kernel tricks are used by Nonlinear classifiers to achieve maximum-margin hyperplanes
Right Answer is — b) Kernel trick is used in SVM for non-linear classification
In machine learning, the “kernel trick” lets a linear classifier handle data that is not linearly separable. It implicitly maps the data into a higher-dimensional space in which the data can become linearly separable.
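The effect of moving to a higher dimension can be illustrated with an explicit feature map. This is a simplification: a kernel SVM obtains the same effect through a kernel function without ever computing the mapped coordinates. The data points, the map x -> (x, x^2), and the hand-chosen decision rule below are all assumptions made for this sketch.

```python
# 1-D points: class 1 lies on BOTH sides of class 0, so no single
# threshold on x can separate the two classes.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

def feature_map(x):
    """Explicitly lift a 1-D point into 2-D as (x, x^2)."""
    return (x, x * x)

def linear_classifier(features):
    """A linear rule in the lifted space: predict 1 when x^2 - 1 > 0."""
    _, squared = features
    return 1 if squared - 1.0 > 0 else 0

predictions = [linear_classifier(feature_map(x)) for x in points]
print(predictions == labels)  # True
```

In the lifted space a straight line (x^2 = 1) separates the classes, even though no such boundary exists on the original number line.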
Choose the correct sequence for classifier building from the following.
Choose the correct option from below list
a) None of the options
b) Initialize -> Evaluate -> Train -> Predict
c) Initialize -> Train -> Predict -> Evaluate
d) Train -> Test -> Initialize -> Predict
Right Answer is — c) Initialize -> Train -> Predict -> Evaluate
First, we need to initialize the classifier.
Then, we are required to train the classifier.
The next step is to predict the target.
And finally, we need to evaluate the classifier model.
Which numerical statistic is used to identify the importance of a rare word in a document?
Choose the correct option from below list
a) None of the options
b) TF-IDF
c) DF
d) TF
Right Answer is — b) TF-IDF
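A minimal sketch of how TF-IDF rewards rare words, using one common formulation (relative term frequency multiplied by the natural log of the inverse document frequency). The three tiny documents and the function names are invented for illustration, and other weighting variants exist.

```python
import math

# Three invented, pre-tokenized documents.
documents = [
    "free prize call now".split(),
    "call me now".split(),
    "call me later please".split(),
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

rare = tf_idf("prize", documents[0], documents)    # appears in 1 of 3 documents
common = tf_idf("call", documents[0], documents)   # appears in all 3 documents
print(rare > common)  # True
```

Both words occur once in the first document, but "call" appears in every document, so its IDF is log(1) = 0 and its TF-IDF score is zero.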
In document classification, each document has to be converted from full text to a document vector.
Choose the correct option from below list
a) True
b) False
Right Answer is — a) True
Email spam data is an example of __
Choose the correct option from below list
a) Unstructured data
b) Structured data
Right Answer is — a) Unstructured data
SVM is a _
Choose the correct option from below list
a) Supervised learning algorithm
b) Semi-supervised learning algorithm
c) Unsupervised learning algorithm
d) Weakly supervised learning algorithm
Right Answer is — a) Supervised learning algorithm
Clustering is supervised classification.
Choose the correct option from below list
a) True
b) False
Right Answer is — b) False
What is the purpose of lemmatization?
Choose the correct option from below list
a) To convert a sentence into words
b) To convert words into a proper base form
c) To remove redundant words
d) To split into sentences
Right Answer is — b) To convert words into a proper base form
The most widely used package for machine learning in Python is _
Choose the correct option from below list
a) django
b) pillow
c) bottle
d) sklearn
Right Answer is — d) sklearn
Choose the correct sequence from the following in unstructured data classification.
Choose the correct option from below list
a) Data Analysis -> Pre-Processing -> Predict -> Train
b) Pre-Processing -> Model Building -> Predict
c) Data Analysis -> Pre-Processing -> Model Building -> Predict
d) Pre-Processing -> Predict -> Train
Right Answer is — c) Data Analysis -> Pre-Processing -> Model Building -> Predict
The higher value of which of the following hyperparameters is better for the decision tree algorithm?
Choose the correct option from below list
a) Cannot say
b) Samples for leaf
c) Depth of tree
d) Number of samples used for split
Right Answer is — a) Cannot say
No single answer applies, because a higher value of a hyperparameter does not always improve performance.
Consider the depth of the tree: when the depth is too high, the tree can overfit the training data.
When the depth is too low, the tree can underfit the data.
True Positive is when the predicted instance and the actual instance are not negative.
Choose the correct option from below list
a) True
b) False
Right Answer is — a) True
What kind of classification is our case study “Spam Detection”?
Choose the correct option from below list
a) Multi class
b) Binary
c) Multi label
Right Answer is — b) Binary
Which of the given hyperparameters, when increased, may cause the random forest to overfit the data?
Choose the correct option from below list
a) Depth of Tree
b) Learning Rate
c) Number of Trees
Right Answer is — a) Depth of Tree
Which preprocessing technique is used to remove the most commonly used words?
Choose the correct option from below list
a) Lemmatization
b) Tokenization
c) Stopword removal
Right Answer is — c) Stopword removal
Which statistical technique deals with finding a structure in a collection of unlabeled data?
a) Time Series Analysis
b) Clustering
c) Classification
d) Association Rules Mining
Right Answer is — b) Clustering
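As one illustration of finding structure in unlabeled data, the sketch below runs a basic k-means loop on one-dimensional values. The values (imagined message lengths), the starting centroids, and the iteration count are all assumptions chosen for this example.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around centroids by repeated assign-and-average steps."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each value joins its nearest centroid.
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical message lengths with no labels attached.
lengths = [20, 22, 25, 140, 150, 155]
centroids, clusters = k_means_1d(lengths, centroids=[0.0, 100.0])
print(clusters)  # [[20, 22, 25], [140, 150, 155]]
```

No labels were supplied, yet the short and long messages end up in separate groups.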