Unstructured Data Classification Questions & Answers 2022

Unstructured data, as the name suggests, does not have a predefined structured format. It is typically text-heavy but may also contain data such as dates, numbers, or facts.

This results in irregularities and ambiguities which make it difficult to understand using traditional programs when compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.

Source: Wikipedia.

A few examples of unstructured data types are:

  • Emails
  • Word Processing Files
  • PDF files
  • Spreadsheets
  • Digital Images
  • Video
  • Audio
  • Social Media Posts etc.

Unstructured Data Classification Dataset Description

The dataset contains text messages, each labeled as spam or ham (not spam).

The following is a description of our dataset:

No. of Classes: 2 (Spam / Ham)

No. of attributes (Columns): 2

No. of instances (Rows) : 5574

To start with data loading, import the required Python packages and load the downloaded CSV file.

The data can be stored as a DataFrame for easy manipulation and analysis. Pandas is one of the most widely used libraries for this.

import pandas as pd
import csv

Data Loading

# Read the raw file line by line and count the messages
with open('dataset.csv') as f:
    messages = [line.rstrip() for line in f]
print(len(messages))

Appending column headers

# The file is tab-separated and has no header row, so supply the column names
messages = pd.read_csv('dataset.csv', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])

Unstructured Data Classification Analysis

Analyzing data is a must in any classification problem. The goal of data analysis is to derive useful information from the given data for making decisions.

In this section, we will look at the dataset's size, its column headers, a summary of the data, and a sample of the data.

You can see the dataset size using:

# (rows, columns) of the DataFrame
data_size = messages.shape
print(data_size)

Column names can be viewed by:

messages_col_names = list(messages.columns)
print(messages_col_names)

To understand aggregate statistics easily, use the following command:

# Summary statistics of the messages, grouped by label
print(messages.groupby('label').describe())

To see a sample of the data, use the following command:

# First three rows
print(messages.head(3))

Unstructured Data Classification Target Identification

Target is the class/category to which you will assign the data.

In this case, you aim to identify whether the message is spam or not.

Looking at the columns, the label column takes the values Spam or Ham. We can call this case study a binary classification problem, since it has only two possible outcomes.

Identifying the outcome/target variable.

message_target = messages['label']
print(message_target)

Unstructured Data Classification Algorithms


There are various algorithms for solving classification problems.

We will discuss the following:

  • Decision Tree Classifier
  • Stochastic Gradient Descent Classifier
  • Support Vector Machine Classifier
  • Random Forest Classifier

Structured Data Classification V/s UnStructured Data Classification

Trending Unstructured Data Classification Interview Questions & Answers Asked in 2022

A classifier that can compute using numeric as well as categorical values is __
Choose the correct option from below list
a) Naive Bayes Classifier
b) Decision Tree Classifier
c) SVM Classifier
d) Random Forest Classifier

Right Answer is — d) Random Forest Classifier

The following are preprocessing methods used for unstructured data classification, except _
Choose the correct option from below list
a) Confusion_matrix
b) Stop word removal
c) Stemming
d) Lemmatization

Right Answer is — a) Confusion_matrix

TF and IDF use matrix representations.
Choose the correct option from below list
a) False
b) True

Right Answer is — b) True

An algorithm that counts how many times a word appears in a document is __
Choose the correct option from below list
a) Bag-of-Words (BOW)
b) TF-IDF
c) TDM
d) DTM

Right Answer is — a) Bag-of-Words (BOW)
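
As a minimal sketch of the counting idea behind Bag-of-Words, the snippet below tallies word occurrences in a single message using only the standard library. The helper name `bag_of_words` and the sample sentence are assumptions made for illustration; real pipelines usually add tokenization and normalization steps that are omitted here.

```python
from collections import Counter


def bag_of_words(document):
    """Return a mapping from each lowercase word to its count in the document."""
    return Counter(document.lower().split())


# Hypothetical sample message
counts = bag_of_words("Free entry free prize call now")
print(counts["free"])   # 2
print(counts["prize"])  # 1
```

Word order is discarded; only the counts survive, which is why the representation is called a "bag".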

Inverse Document frequency is used in the term document matrix.
Choose the correct option from below list
a) True
b) False

Right Answer is — b) False

The inverse document frequency is a measure of whether a term is common or rare in a given document corpus. It is obtained by dividing the total number of documents by the number of documents in the corpus that contain the term, and then taking the logarithm of that quotient.
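
The calculation can be sketched as follows, assuming the plain textbook form idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and the four-message corpus are made up for this example, and libraries often use smoothed variants of the formula.

```python
import math


def inverse_document_frequency(term, corpus):
    """idf = log(N / df): N documents in total, df of them contain the term."""
    n_docs = len(corpus)
    doc_freq = sum(1 for doc in corpus if term in doc.lower().split())
    if doc_freq == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / doc_freq)


# Hypothetical corpus of four short messages
corpus = [
    "call me when you are free",
    "free prize waiting call now",
    "are you coming home",
    "call mom tonight",
]
idf_call = inverse_document_frequency("call", corpus)    # in 3 of 4 documents
idf_prize = inverse_document_frequency("prize", corpus)  # in 1 of 4 documents
print(idf_call < idf_prize)  # True: the rarer term gets the larger weight
```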

Identify the stop word(s) from the following.
Choose the correct option from below list
a) Both “the” and “it”
b) “the”
c) “fragment”
d) “it”
e) “computer”

Right Answer is — a) Both “the” and “it”
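
Stop word removal can be illustrated with a short sketch. The stop-word set below is a tiny hand-picked list chosen for this example only; real lists are much longer and depend on the language and the library used.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"the", "it", "a", "an", "as", "is", "and", "of", "to"}


def remove_stop_words(text):
    """Return the lowercase words of text with the stop words filtered out."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


filtered = remove_stop_words("The computer stored it as a fragment")
print(filtered)  # ['computer', 'stored', 'fragment']
```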

Which NLP technique uses a lexical knowledge base to obtain the correct base form of the words?
Choose the correct option from below list
a) lemmatization
b) tokenization
c) object standardization
d) stop word removal

Right Answer is — a) lemmatization
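
To show what "using a lexical knowledge base" means in practice, here is a toy sketch in which a small dictionary stands in for the knowledge base. The lexicon entries and the `lemmatize` helper are hypothetical; a real lemmatizer relies on a far larger resource and usually on part-of-speech information as well.

```python
# Toy lexicon standing in for a real lexical knowledge base (an assumption)
LEMMA_LEXICON = {
    "went": "go",
    "better": "good",
    "mice": "mouse",
    "running": "run",
    "messages": "message",
}


def lemmatize(word):
    """Look the word up in the lexicon; fall back to the lowercase word itself."""
    word = word.lower()
    return LEMMA_LEXICON.get(word, word)


lemmas = [lemmatize(w) for w in "The mice went running".split()]
print(lemmas)  # ['the', 'mouse', 'go', 'run']
```

Irregular forms such as "went" show why a lookup is needed: stripping suffixes, as a stemmer does, cannot produce "go".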

The classification where each data instance is mapped to more than one class is called _
Choose the correct option from below list
a) Multi Class Classification
b) Binary Classification
c) Multi Label Classification

Right Answer is — c) Multi Label Classification

High classification accuracy always indicates a good classifier.
Choose the correct option from below list
a) False
b) True

Right Answer is — a) False

High accuracy alone does not mean the classifier is good. On an imbalanced dataset, for example, a classifier that always predicts the majority class can reach high accuracy while misclassifying every instance of the minority class.
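
A small worked example makes the point concrete. The label counts are invented for illustration (95 ham, 5 spam), and the "classifier" simply predicts ham for everything; `accuracy` and `recall` are helper names introduced here.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def recall(actual, predicted, positive):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    actual_pos = sum(a == positive for a in actual)
    return true_pos / actual_pos


# Hypothetical imbalanced data: 95 ham messages, 5 spam messages
actual = ["ham"] * 95 + ["spam"] * 5
predicted = ["ham"] * 100  # a "classifier" that always predicts the majority class

acc = accuracy(actual, predicted)
spam_recall = recall(actual, predicted, "spam")
print(acc, spam_recall)  # 0.95 0.0
```

The accuracy looks excellent, yet not a single spam message is caught, so other measures are needed alongside accuracy.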

Pruning is a technique associated with __
Choose the correct option from below list
a) SVM
b) Decision tree
c) Logistic regression
d) Linear regression

Right Answer is — b) Decision tree

Can we consider sentiment classification as a text classification problem?
Choose the correct option from below list
a) Yes
b) No

Right Answer is — a) Yes

A technique used to depict the performance in a tabular form that has 2 dimensions namely actual and predicted sets of data is _
Choose the correct option from below list
a) Confusion Matrix
b) Classification Accuracy
c) Cross Validation
d) Classification Report

Right Answer is — a) Confusion Matrix
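
The table can be built by counting (actual, predicted) pairs. The sketch below assumes rows are actual classes and columns are predicted classes; conventions differ between tools, so check the orientation of whichever library you use. The labels and predictions are made up.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]


# Hypothetical labels and predictions for six messages
actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]

matrix = confusion_matrix(actual, predicted, labels=["spam", "ham"])
print(matrix)  # [[2, 1], [1, 2]]
```

Treating spam as the positive class, the first row holds the true positives (2) and false negatives (1), and the second row holds the false positives (1) and true negatives (2).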

Select the correct statement about Nonlinear classification.
Choose the correct option from below list
a) The concept of slack variables is used in SVM for Nonlinear classification
b) Kernel trick is used in SVM for non-linear classification
c) Kernel tricks are used by Nonlinear classifiers to achieve maximum-margin hyperplanes

Right Answer is — b) Kernel trick is used in SVM for non-linear classification

In machine learning, the “kernel trick” lets a linear classifier handle a dataset that is not linearly separable. It implicitly maps the data into a higher-dimensional space, where the data may become linearly separable, without computing the mapped coordinates explicitly.
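
The identity behind the trick can be checked numerically. The sketch below uses a degree-2 polynomial kernel on 2-D points, picked only because its explicit feature map is short to write down; the sample points are arbitrary. It shows that the kernel value equals a dot product in a 3-D space that is never constructed when the kernel is used directly.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 on 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, y)                    # computed in the original 2-D space
explicit = dot(feature_map(x), feature_map(y))  # computed in the mapped 3-D space
print(math.isclose(implicit, explicit))  # True
```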

Choose the correct sequence for classifier building from the following.
Choose the correct option from below list
a) None of the options
b) Initialize -> Evaluate -> Train -> Predict
c) Initialize -> Train -> Predict -> Evaluate
d) Train -> Test -> Initialize -> Predict

Right Answer is — c) Initialize -> Train -> Predict -> Evaluate

First, we need to initialize the classifier.

Then, we are required to train the classifier.

The next step is to predict the target.

And finally, we need to evaluate the classifier model.
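
The four steps can be sketched end to end with a toy classifier. `TinyNaiveBayes` below is a hypothetical, deliberately small Naive Bayes implementation written for this example; it is not one of the algorithms listed earlier and the training messages are invented. The `fit` and `predict` method names mirror the convention used by libraries such as sklearn.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.class_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.class_counts[label] += 1
            self.vocabulary.update(words)
        return self

    def predict(self, texts):
        return [self._predict_one(text) for text in texts]

    def _predict_one(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training and test messages
train_texts = ["win a free prize now", "free cash offer",
               "are we meeting today", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free prize offer", "lunch today"]
test_labels = ["spam", "ham"]

classifier = TinyNaiveBayes()                 # 1. Initialize
classifier.fit(train_texts, train_labels)     # 2. Train
predictions = classifier.predict(test_texts)  # 3. Predict
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)  # 4. Evaluate
print(predictions, accuracy)  # ['spam', 'ham'] 1.0
```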

Which numerical statistic is used to identify the importance of a rare word in a document?
Choose the correct option from below list
a) None of the options
b) TF-IDF
c) DF
d) TF

Right Answer is — b) TF-IDF

In document classification, each document has to be converted from full text to a document vector.
Choose the correct option from below list
a) True
b) False

Right Answer is — a) True
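
One simple way to do this conversion, shown here as a sketch, is to fix a shared vocabulary and count each vocabulary word per document, so that every document becomes a vector of the same length. The helper names and the three sample messages are assumptions for illustration; count vectors are only one of several possible document representations.

```python
def build_vocabulary(documents):
    """Sorted list of the distinct lowercase words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_vector(document, vocabulary):
    """Count vector with one entry per vocabulary word, in vocabulary order."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]


# Hypothetical messages
documents = ["free prize free entry", "call me now", "free call"]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]

print(vocabulary)  # ['call', 'entry', 'free', 'me', 'now', 'prize']
print(vectors[0])  # [0, 1, 2, 0, 0, 1]
```

Stacking these vectors row by row gives a document-term matrix that a classifier can consume.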

Email spam data is an example of __
Choose the correct option from below list
a) Unstructured data
b) Structured data

Right Answer is — a) Unstructured data

SVM is a _
Choose the correct option from below list
a) Supervised learning algorithm
b) Semi-supervised learning algorithm
c) Unsupervised learning algorithm
d) Weakly supervised learning algorithm

Right Answer is — a) Supervised learning algorithm

Clustering is supervised classification.
Choose the correct option from below list
a) True
b) False

Right Answer is — b) False

What is the purpose of lemmatization?
Choose the correct option from below list
a) To convert a sentence into words
b) To convert words into a proper base form
c) To remove redundant words
d) To split into sentences

Right Answer is — b) To convert words into a proper base form

The most widely used package for machine learning in Python is _
Choose the correct option from below list
a) django
b) pillow
c) bottle
d) sklearn

Right Answer is — d) sklearn

Choose the correct sequence from the following in unstructured data classification.
Choose the correct option from below list
a) Data Analysis -> Pre-Processing -> Predict -> Train
b) Pre-Processing -> Model Building -> Predict
c) Data Analysis -> Pre-Processing -> Model Building -> Predict
d) Pre-Processing -> Predict -> Train

Right Answer is — c) Data Analysis -> Pre-Processing -> Model Building -> Predict

The higher value of which of the following hyperparameters is better for the decision tree algorithm?
Choose the correct option from below list
a) Cannot say
b) Samples for leaf
c) Depth of tree
d) Number of samples used for split

Right Answer is — a) Cannot say

No general answer is possible, because a higher value of a hyperparameter does not always improve performance.

Take the depth of the tree as an example: if the depth is too high, the model can overfit the training data.

If the depth is too low, the model can underfit the data.

True Positive is when the predicted instance and the actual instance are not negative.
Choose the correct option from below list
a) True
b) False

Right Answer is — a) True

What kind of classification is our case study “Spam Detection”?
Choose the correct option from below list
a) Multi class
b) Binary
c) Multi label

Right Answer is — b) Binary

Which of the given hyperparameters, when increased, may cause the random forest to overfit the data?
Choose the correct option from below list
a) Depth of Tree
b) Learning Rate
c) Number of Trees

Right Answer is — a) Depth of Tree

Which preprocessing technique is used to remove the most commonly used words?
Choose the correct option from below list
a) Lemmatization
b) Tokenization
c) Stopword removal

Right Answer is — c) Stopword removal

Which statistical technique deals with finding a structure in a collection of unlabeled data?
a) Time Series Analysis
b) Clustering
c) Classification
d) Association Rules Mining

Right Answer is — b) Clustering
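
As an illustration of finding structure without labels, the sketch below runs a plain k-means loop on one-dimensional points. K-means is just one clustering algorithm, chosen here because it is short; the points and the starting centroids are made up for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 10.0])
print(centroids)  # [1.5, 9.5]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```

No labels are supplied at any point; the two groups emerge from the distances between the points alone.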
