Structured Data Classification Questions & Answers 2022

In a work environment, most of us create structured data every day. Because of its organised nature, structured data is straightforward to classify. It is often also the most sensitive data an organisation holds, and therefore the most reliant on classification for its protection.

Structured Data Classification Questions

What is structured data?


Structured data is data that, due to its highly organised nature, is typically hosted on critical, segregated data platforms such as SharePoint, Documentum or SAP. Because these platforms store vast amounts of data, they use data classification to help ensure that each item is stored in the appropriate section, with permissions that match its level of sensitivity. As a result, only those who hold the right permissions can access your valuable data.

A Simple Example for Structured Data Classification


The first step is to prepare your data. Here we use the Titanic dataset as an example.

import tensorflow as tf
import autokeras as ak  # Assumes an AutoKeras release that ships StructuredDataClassifier (1.x).

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

# Download both CSV files and keep their local paths.
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

The second step is to run the StructuredDataClassifier. As a quick demo, we set epochs to 10. You can also leave the epochs unspecified for an adaptive number of epochs.

# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(
    overwrite=True,
    max_trials=3,  # It tries 3 different models.
)

# Feed the structured data classifier with training data.
clf.fit(
    # The path to the train.csv file.
    train_file_path,
    # The name of the label column.
    "survived",
    epochs=10,
)

# Predict with the best model.
predicted_y = clf.predict(test_file_path)

# Evaluate the best model with testing data.
print(clf.evaluate(test_file_path, "survived"))

Defining unstructured data

Unstructured data, such as images and free-form documents, has no predefined organisation and is often left unclassified, which makes it much harder to order, segregate and track. As a result, unstructured data is much harder to protect and control.

Structured vs Unstructured Data Classification

Structured data Classification Questions and Answers

How many new columns does the following command return?
iris_series = pd.get_dummies(iris['species'])
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) 1
ii) 3
iii) 4
iv) 2

Right Answer : ii) 3
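
The answer follows from how one-hot encoding works: pd.get_dummies creates one indicator column per distinct value. The sketch below illustrates this on a tiny hand-made frame rather than the downloaded file; the six rows are hypothetical, and only the column name species and the three species names are assumed to match the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for iris.csv; only the species column matters here.
iris = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica",
                "setosa", "virginica", "versicolor"]
})

# get_dummies builds one indicator column per distinct category.
iris_series = pd.get_dummies(iris["species"])

print(list(iris_series.columns))  # the new column names
print(iris_series.shape[1])       # how many columns were created
```

Three distinct species give three new columns, regardless of how many rows the frame has.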

A process used to identify unusual data points is _
Choose the correct option from below list
i) Anomaly Detection
ii) Over Fitting
iii) Under fitting

Right Answer : i) Anomaly Detection

Images and documents are examples of _
Choose the correct option from below list
i) Structured Data
ii) Unstructured Data

Right Answer : ii) Unstructured Data

Which command is used to identify the unique values of a column?
Choose the correct option from below list
i) distinct()
ii) unique()
iii) value_counts()
iv) shape

Right Answer : ii) unique()

What kind of classification is our case study ‘Churn Analysis’?
Choose the correct option from below list
i) Binary
ii) Multi class
iii) Multi label

Right Answer : i) Binary

Identify the structured data from the following.
Choose the correct option from below list
i) Data from mySQL DB
ii) Image
iii) Excel data
iv) Data from mySQL DB and Excel
v) Video clip

Right Answer : iv) Data from mySQL DB and Excel

True Negative is when the predicted instance and the actual instance are positive.
Choose the correct option from below list
i) False
ii) True

Right Answer : i) False

Clustering is an example of _
Choose the correct option from below list
i) Unsupervised classification
ii) Supervised classification

Right Answer : i) Unsupervised classification

A technique used to depict the performance in a tabular form that has 2 dimensions namely actual and predicted sets of data is __
i) Classification Accuracy
ii) Confusion Matrix
iii) Classification Report
iv) Cross Validation

Right Answer : ii) Confusion Matrix
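
As a rough illustration, the four cells of a binary confusion matrix can be counted by pairing each actual label with its prediction. The labels below are made up for the example, and the variable names are not from the course material.

```python
from collections import Counter

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count every (actual, predicted) pair.
counts = Counter(zip(actual, predicted))

tp = counts[(1, 1)]  # true positive: actual and predicted are both positive
tn = counts[(0, 0)]  # true negative: actual and predicted are both negative
fp = counts[(0, 1)]  # false positive: predicted positive, actually negative
fn = counts[(1, 0)]  # false negative: predicted negative, actually positive

# Rows are actual classes, columns are predicted classes.
confusion = [[tn, fp],
             [fn, tp]]
print(confusion)
```

This also makes the true negative question concrete: a true negative is counted only when both the actual and the predicted label are negative.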

Which type of cross-validation is used for an imbalanced dataset?
Choose the correct option from below list
i) Stratified Shuffle Split
ii) Leave One Out
iii) K-Fold

Right Answer : i) Stratified Shuffle Split
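
A stratified split keeps the class proportions the same in the training and testing parts, which is why it suits imbalanced data. Here is a minimal sketch using scikit-learn's StratifiedShuffleSplit on invented labels (90 negatives and 10 positives); the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 90 negatives (0) and 10 positives (1).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-feature matrix

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Share of the positive class in each part of the split.
train_pos_share = y[train_idx].mean()
test_pos_share = y[test_idx].mean()
print(train_pos_share, test_pos_share)
```

Both parts keep the 10% share of positives found in the full label set, whereas a plain random split could leave the small test set with very few positives or none.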

Identify the command used to view the dataset SIZE, and what is the value returned?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) iris.size,(150,5)
ii) iris.size(),(150,6)
iii) iris.shape,(150,6)
iv) iris.shape(),(150,5)

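Right Answer : iv) iris.shape(),(150,5)
Note: in pandas, shape is an attribute rather than a method, so the expression that actually runs is iris.shape, which returns (150, 5) for this dataset.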
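
To check how shape and size behave, the sketch below uses a hypothetical four-row frame whose five column names are assumed to mirror the linked iris.csv. The values are illustrative, not a copy of the real file.

```python
import pandas as pd

# Hypothetical 4-row frame with the same five columns as iris.csv (assumed names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 2.5, 1.9],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

dims = iris.shape   # attribute: (number of rows, number of columns)
cells = iris.size   # attribute: rows * columns

# shape is a plain tuple, so calling it raises TypeError.
try:
    iris.shape()
    shape_is_callable = True
except TypeError:
    shape_is_callable = False

print(dims, cells, shape_is_callable)
```

On the full dataset the same attribute access gives 150 rows and 5 columns.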

Is there a class imbalance problem in the given data set?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) Yes
ii) No

Right Answer : ii) No

Which classifier converges easily with less training data?
Choose the correct option from below list
i) Decision Tree Classifier
ii) Random Forest Classifier
iii) Naive Bayes Classifier
iv) SVM Classifier

Right Answer : iii) Naive Bayes Classifier

The fit(X, y) is used to _
Choose the correct option from below list
i) Evaluate the classifier
ii) Train the classifier
iii) Test the classifier
iv) Initialize the classifier

Right Answer : ii) Train the classifier

The classification where each data is mapped to more than one class is called _
Choose the correct option from below list
i) Multi Label Classification
ii) Multi Class Classification
iii) Binary Classification

Right Answer : i) Multi Label Classification

Which preprocessing technique is used to make the data Gaussian with zero mean and unit variance?
Choose the correct option from below list
i) Normalization
ii) Standardization
iii) Binarization

Right Answer : ii) Standardization
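
The difference between the options can be sketched with the standard library alone. Standardization subtracts the mean and divides by the standard deviation; min-max scaling, one common meaning of normalization, maps values to the range 0 to 1. The sample values are invented for the example.

```python
from statistics import fmean, pstdev

# Hypothetical feature values.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = fmean(values)       # mean of the feature
sigma = pstdev(values)   # population standard deviation

# Standardization: zero mean, unit variance.
standardized = [(v - mu) / sigma for v in values]

# Min-max scaling: smallest value becomes 0, largest becomes 1.
lo, hi = min(values), max(values)
min_max_scaled = [(v - lo) / (hi - lo) for v in values]

print(standardized)
print(min_max_scaled)
```

Standardization only shifts and rescales the values; it does not change the shape of their distribution.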

Email spam detection is an example of __
Choose the correct option from below list
i) Unsupervised classification
ii) Supervised classification

Right Answer : ii) Supervised classification

How many classes (the target classes in the dataset) will the following command return?
classes = list(iris['species'].unique())

Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) 3
ii) 2
iii) 4
iv) 1

Right Answer : i) 3

Pruning is a technique associated with _
Choose the correct option from below list
i) Logistic regression
ii) SVM
iii) Linear regression
iv) Decision tree

Right Answer : iv) Decision tree

Choose the correct sequence from the following.
Choose the correct option from below list
i) Data Analysis -> Preprocessing -> Model Building -> Predict
ii) PreProcessing -> Model Building -> Predict
iii) Data Analysis -> Preprocessing -> Predict -> Train
iv) Preprocessing -> Predict -> Train

Right Answer : i) Data Analysis -> Preprocessing -> Model Building -> Predict

The commonly used package for machine learning in Python is _
Choose the correct option from below list
i) bottle
ii) django
iii) sklearn
iv) pillow

Right Answer : iii) sklearn

Cross-validation causes over-fitting.
Choose the correct option from below list
i) False
ii) True

Right Answer : i) False

Choose the correct sequence for the classifier building from the following.
Choose the correct option from below list
i) Initialize -> Train -> Predict -> Evaluate
ii) Train -> Test -> Initialize -> Predict
iii) None of the options
iv) Initialize -> Evaluate -> Train -> Predict

Right Answer : i) Initialize -> Train -> Predict -> Evaluate

Model Tuning helps to increase the accuracy.
Choose the correct option from below list
i) True
ii) False

Right Answer : i) True

Select the pre-processing technique(s) from the following.
Choose the correct option from below list
i) One-hot encoding
ii) Normalization
iii) All the options
iv) Standardization
v) Dimensionality reduction

Right Answer : iii) All the options

Let’s assume you are solving a classification problem with a highly imbalanced class.
The majority class is observed 99% of the time in the training data.
Which of the following is true when your model has 99% accuracy after taking the predictions on test data?
Choose the correct option from below list
i) For imbalanced class problems, precision and recall metrics are not good.
ii) For imbalanced class problems, the accuracy metric is not a good idea.
iii) For imbalanced class problems, the accuracy metric is a good idea

Right Answer : ii) For imbalanced class problems, the accuracy metric is not a good idea.
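
A small sketch shows why accuracy misleads here. A "model" that always predicts the majority class reaches 99% accuracy while finding none of the minority cases. The test set below is hypothetical (990 majority samples, 10 minority samples).

```python
# Hypothetical test set: 990 majority-class (0) and 10 minority-class (1) samples.
actual = [0] * 990 + [1] * 10

# A baseline that ignores its input and always predicts the majority class.
predicted = [0] * len(actual)

# Accuracy: share of predictions that match the actual label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall for the minority class: share of actual positives that were found.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
actual_positives = sum(a == 1 for a in actual)
recall = true_positives / actual_positives

print(accuracy, recall)
```

The accuracy looks excellent, yet recall for the minority class is zero, which is why precision and recall are the more informative metrics for imbalanced problems.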

Imputing is a strategy to handle __
Choose the correct option from below list
i) Class Imbalance
ii) Standardization
iii) Missing Values

Right Answer : iii) Missing Values
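
As an illustration of imputing, the sketch below fills missing entries with the column mean using scikit-learn's SimpleImputer. The small matrix is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The first column's gap becomes 2.0 (the mean of 1 and 3), and the second column's gap becomes 15.0 (the mean of 10 and 20).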

A classifier that can compute using numeric as well as categorical values is __
Choose the correct option from below list
i) Naive Bayes Classifier
ii) Random Forest Classifier
iii) SVM Classifier
iv) Decision Tree Classifier

Right Answer : ii) Random Forest Classifier

The cross-validation technique is used to evaluate a classifier by dividing the data set into a training set to train the classifier and a testing set to test the same.
Choose the correct option from below list
i) True
ii) False

Right Answer : i) True
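
In k-fold cross-validation this train/test division is repeated so that every sample is used for testing exactly once. The helper below is a hypothetical, plain-Python sketch of how the index sets could be produced without shuffling. It is not the course's code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Split 10 samples into 3 folds.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

A classifier would be trained on each train set and scored on the matching test set, and the scores averaged.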

To view the first 3 rows of the dataset, which of the following commands is used?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) iris.top(3)
ii) iris.head(3)
iii) iris.select(3)
iv) iris.get(3)

Right Answer : ii) iris.head(3)

Supervised learning differs from unsupervised learning as supervised learning requires __
Choose the correct option from below list
i) Labeled data
ii) None of the options
iii) Unlabeled data
iv) Raw data

Right Answer : i) Labeled data

What does the command iris['species'].value_counts() return?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) The total count of elements in the iris['species'] column
ii) The count of each unique value in the iris['species'] column
iii) The number of columns in the dataset
iv) The number of rows in the dataset

Right Answer : ii) The count of each unique value in the iris['species'] column
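
The difference between unique() and value_counts() can be seen on a small hypothetical frame; the six rows below are made up and deliberately unbalanced so that the counts differ.

```python
import pandas as pd

# Hypothetical stand-in for the species column of iris.csv.
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "setosa",
                "versicolor", "versicolor", "virginica"]
})

# unique() lists each distinct value once, in order of first appearance.
classes = list(iris["species"].unique())

# value_counts() reports how many rows hold each distinct value.
counts = iris["species"].value_counts()

print(classes)
print(counts)
```

On the real dataset, value_counts() is also a quick way to check for class imbalance, since it shows how many rows belong to each species.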

Ensemble learning is used when you build component classifiers that are more accurate and independent of each other.
Choose the correct option from below list
i) False
ii) True

Right Answer : ii) True

True Positive is when the predicted instance and the actual instance are positive.
Choose the correct option from below list
i) True
ii) False

Right Answer : i) True

Ordinal variables have __
Choose the correct option from below list
i) No logical order
ii) A clear logical order

Right Answer : ii) A clear logical order

Structured Data Classification Hands-on Solutions

Structured Data Classification Hands-on

The Course Id of the Structured Data Classification is 55941.

Structured_test

Step 1:

import pandas as pd
import numpy as np

Step 2:

# Load the weather data; weather.csv is expected in the working directory.
weather = pd.read_csv('weather.csv', sep=',')

Step 3:

# Inspect the size, column names, summary statistics and first rows.
data_size = weather.shape
print(data_size)

weather_col_names = list(weather.columns)
print(weather_col_names)

print(weather.describe())
print(weather.head(3))

Step 4:

# The label to predict.
weather_target = weather['RainTomorrow']
print(weather_target)

Step 5:

# Drop the date and the label so that only the features remain.
cols_to_drop = ['Date', 'RainTomorrow']
weather_feature = weather.drop(cols_to_drop, axis=1)
print(weather_feature.head(5))

Step 6:

# Select the categorical (object-typed) columns.
weather_categorical = weather.select_dtypes(include=[object])
print(weather_categorical.head(15))

Step 7:

# Convert the Yes/No column to booleans.
yes_no_cols = ["RainToday"]
weather_feature[yes_no_cols] = weather_feature[yes_no_cols] == 'Yes'
print(weather_feature.head(5))

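Step 8:

# One-hot encode the remaining categorical columns.
weather_dumm = pd.get_dummies(
    weather_feature,
    columns=["Location", "WindGustDir", "WindDir9am", "WindDir3pm"],
    prefix=["Location", "WindGustDir", "WindDir9am", "WindDir3pm"],
)

# np.float was removed in NumPy 1.24; the builtin float is equivalent.
weather_matrix = weather_dumm.values.astype(float)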

Step 9:

from sklearn.impute import SimpleImputer

# Replace missing values with the column mean; other arguments keep their defaults.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
weather_matrix = imp.fit_transform(weather_matrix)

Step 10:

from sklearn.preprocessing import StandardScaler

# Standardize the data by removing the mean and scaling to unit variance.
scaler = StandardScaler()

# Fit to data, then transform it.
weather_matrix = scaler.fit_transform(weather_matrix)

Step 11:

from sklearn.model_selection import train_test_split

seed = 5000

# Hold out 10% of the rows for testing.
train_data, test_data, train_label, test_label = train_test_split(
    weather_matrix, weather_target, test_size=0.1, random_state=seed
)

Step 12:

from sklearn.svm import SVC

# Train a linear SVM and score it on the held-out data.
classifier = SVC(kernel="linear", C=0.025, random_state=seed)
classifier = classifier.fit(train_data, train_label)

churn_predicted_target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('SVM Classifier : ', score)

# Write the score to a file.
with open('output.txt', 'w') as file:
    file.write(str(np.mean(score)))

Step 13:

from sklearn.ensemble import RandomForestClassifier

# Train a random forest and score it on the same held-out data.
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10, random_state=seed)
classifier = classifier.fit(train_data, train_label)

churn_predicted_target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('Random Forest Classifier : ', score)

# Write the score to a file.
with open('output1.txt', 'w') as file:
    file.write(str(np.mean(score)))
