In a work environment, most of us create structured data which, due to its organised nature, is straightforward to classify. Often, it's this structured data that is the most sensitive and therefore most reliant on classification for its protection.

What is structured data?
Structured data is data that, due to its highly organised nature, is typically hosted in enterprise repositories such as SharePoint, Documentum or SAP. These platforms, with the vast amount of data they store, use data classification to help ensure each item is stored in the appropriate section with the correct permissions for its level of sensitivity, so that only those with permission have access to your valuable data.
A Simple Example for Structured Data Classification
The first step is to prepare your data. Here we use the Titanic dataset as an example.
import tensorflow as tf

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
The second step is to run the StructuredDataClassifier. As a quick demo, we set epochs to 10. You can also leave the epochs unspecified for an adaptive number of epochs.
Initialize the structured data classifier.
import autokeras as ak

clf = ak.StructuredDataClassifier(
    overwrite=True, max_trials=3
)  # It tries 3 different models.
Feed the structured data classifier with training data.
clf.fit(
    # The path to the train.csv file.
    train_file_path,
    # The name of the label column.
    "survived",
    epochs=10,
)
Predict with the best model.
predicted_y = clf.predict(test_file_path)
Evaluate the best model with testing data.
print(clf.evaluate(test_file_path, "survived"))
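If you want to keep the best model that AutoKeras found, it can be exported as a plain Keras model for reuse; a minimal sketch (export_model is part of the AutoKeras API, while the file name here is an arbitrary choice):
# Export the best-performing pipeline as a standard Keras model.
model = clf.export_model()
# Save it for later reuse (the .keras format needs a recent Keras version).
model.save("model_autokeras.keras")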
Defining unstructured data Classification
Unstructured data is data that, because it lacks a predefined structure, is much harder to order, segregate and track. As a result, unstructured data is much harder to protect and control.
Supported platforms:
- SharePoint
- SAP
- Documentum

Structured Data Classification Questions and Answers
How many new columns does the following command return?
iris_series = pd.get_dummies(iris['species'])
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) 1
ii) 3
iii) 4
iv) 2
Right Answer : ii) 3
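A quick way to check this answer, assuming the iris CSV above has been downloaded and loaded with pandas:
import pandas as pd

iris = pd.read_csv("iris.csv")
# get_dummies creates one indicator column per unique value;
# 'species' has 3 unique values, so 3 columns come back.
iris_series = pd.get_dummies(iris["species"])
print(iris_series.shape[1])  # 3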
A process used to identify unusual data points is _
Choose the correct option from below list
i) Anomaly Detection
ii) Over Fitting
iii) Under fitting
Right Answer : i) Anomaly Detection
Images and documents are examples of _
Choose the correct option from below list
i) Structured Data
ii) Unstructured Data
Right Answer : ii) Unstructured Data
Which command is used to identify the unique values of a column?
Choose the correct option from below list
i) distinct()
ii) unique()
iii) value_counts()
iv) shape
Right Answer : ii) unique()
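For example, on the iris dataset (a sketch, assuming iris is loaded as above):
# unique() returns each distinct value of the column exactly once.
print(iris["species"].unique())
# ['setosa' 'versicolor' 'virginica']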
What kind of classification is our case study ‘Churn Analysis’?
Choose the correct option from below list
i) Binary
ii) Multi class
iii) Multi label
Right Answer : i) Binary
Identify the structured data from the following.
Choose the correct option from below list
i) Data from mySQL DB
ii) Image
iii) Excel data
iv) Data from mySQL DB and Excel
v) Video clip
Right Answer : iv) Data from mySQL DB and Excel
True Negative is when the predicted instance and the actual instance are positive.
Choose the correct option from below list
i) False
ii) True
Right Answer : i) False
Clustering is an example of _
Choose the correct option from below list
i) Unsupervised classification
ii) Supervised classification
Right Answer : i) Unsupervised classification
A technique used to depict the performance in a tabular form that has 2 dimensions namely actual and predicted sets of data is __
i) Classification Accuracy
ii) Confusion Matrix
iii) Classification Report
iv) Cross Validation
Right Answer : ii) Confusion Matrix
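A minimal sketch with scikit-learn's confusion_matrix (the label vectors are made up for illustration):
from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1]
# Rows are the actual classes, columns the predicted classes.
print(confusion_matrix(y_actual, y_predicted))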
Which type of cross-validation is used for an imbalanced dataset?
Choose the correct option from below list
i) Stratified Shuffle Split
ii) Leave One Out
iii) K-Fold
Right Answer : i) Stratified Shuffle Split
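Stratification preserves the class proportions of the full dataset in every split, which is why it suits imbalanced data; a minimal sketch (the toy arrays are made up):
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.arange(200).reshape(100, 2)     # 100 samples, 2 features
y = np.array([0] * 90 + [1] * 10)      # imbalanced: 90% class 0
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # Every test fold keeps the original 90/10 class ratio.
    print(y[test_idx].mean())          # ~0.1 each time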
Identify the command used to view the dataset SIZE, and what is the value returned?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) iris.size,(150,5)
ii) iris.size(),(150,6)
iii) iris.shape,(150,6)
iv) iris.shape(),(150,5)
Right Answer : iv) iris.shape(),(150,5)
Is there a class imbalance problem in the given data set?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) Yes
ii) No
Right Answer : ii) No
Which classifier converges easily with less training data?
Choose the correct option from below list
i) Decision Tree Classifier
ii) Random Forest Classifier
iii) Naive Bayes Classifier
iv) SVM Classifier
Right Answer : iii) Naive Bayes Classifier
The fit(X, y) is used to _
Choose the correct option from below list
i) Evaluate the classifier
ii) Train the classifier
iii) Test the classifier
iv) Initialize the classifier
Right Answer : ii) Train the classifier
The classification where each data is mapped to more than one class is called _
Choose the correct option from below list
i) Multi Label Classification
ii) Multi Class Classification
iii) Binary Classification
Right Answer : i) Multi Label Classification
Which preprocessing technique is used to make the data Gaussian with zero mean and unit variance?
Choose the correct option from below list
i) Normalization
ii) Standardization
iii) Binarization
Right Answer : ii) Standardization
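In scikit-learn this is StandardScaler; a minimal sketch with made-up numbers:
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0], [4.0]])
scaled = StandardScaler().fit_transform(data)
# The result has (approximately) zero mean and unit variance.
print(scaled.mean(), scaled.std())     # 0.0 1.0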
Email spam detection is an example of __
Choose the correct option from below list
i) Unsupervised classification
ii) Supervised classification
Right Answer : ii) Supervised classification
How many classes (target classes in the dataset) will the following command return?
classes = list(iris['species'].unique())
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) 3
ii) 2
iii) 4
iv) 1
Right Answer : i) 3
Pruning is a technique associated with _
Choose the correct option from below list
i) Logistic regression
ii) SVM
iii) Linear regression
iv) Decision tree
Right Answer : iv) Decision tree
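In scikit-learn, decision trees support cost-complexity pruning via the ccp_alpha parameter; a sketch on the built-in iris data:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# A non-zero ccp_alpha prunes branches that contribute little impurity reduction.
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)
full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(pruned.tree_.node_count, "<", full.tree_.node_count)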
Choose the correct sequence from the following.
Choose the correct option from below list
i) Data Analysis -> Preprocessing -> Model Building -> Predict
ii) PreProcessing -> Model Building -> Predict
iii) Data Analysis -> Preprocessing -> Predict -> Train
iv) Preprocessing -> Predict -> Train
Right Answer : i) Data Analysis -> Preprocessing -> Model Building -> Predict
The commonly used package for machine learning in Python is _
Choose the correct option from below list
i) bottle
ii) jango
iii) sklearn
iv) pillow
Right Answer : iii) sklearn
Cross-validation causes over-fitting.
Choose the correct option from below list
i) False
ii) True
Right Answer : i) False
Choose the correct sequence for the classifier building from the following.
Choose the correct option from below list
i) Initialize -> Train -> Predict -> Evaluate
ii) Train -> Test -> Initialize -> Predict
iii) None of the options
iv) Initialize -> Evaluate -> Train -> Predict
Right Answer : i) Initialize -> Train -> Predict -> Evaluate
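The same Initialize -> Train -> Predict -> Evaluate sequence sketched with scikit-learn on the built-in iris data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GaussianNB()                      # Initialize
model.fit(X_train, y_train)               # Train
predictions = model.predict(X_test)       # Predict
print(model.score(X_test, y_test))        # Evaluate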
Model Tuning helps to increase the accuracy.
Choose the correct option from below list
i) True
ii) False
Right Answer : i) True
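Model tuning is typically done with a hyperparameter search; a minimal sketch using scikit-learn's GridSearchCV (the parameter grid is an arbitrary example):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Try each parameter combination with 5-fold cross-validation.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)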
Select the pre-processing technique(s) from the following.
Choose the correct option from below list
i) One-hot encoding
ii) Normalization
iii) All the options
iv) Standardization
v) Dimensionality reduction
Right Answer : iii) All the options
Let's assume you are solving a classification problem with a highly imbalanced class.
The majority class is observed 99% of the time in the training data.
Which of the following is true when your model has 99% accuracy after taking the predictions on test data?
Choose the correct option from below list
i) For imbalanced class problems, precision and recall metrics are not good.
ii) For imbalanced class problems, the accuracy metric is not a good idea.
iii) For imbalanced class problems, the accuracy metric is a good idea
Right Answer : ii) For imbalanced class problems, the accuracy metric is not a good idea.
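The pitfall is easy to reproduce: a model that always predicts the majority class reaches 99% accuracy without ever finding the minority class; a sketch with made-up data:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

y = np.array([0] * 99 + [1])              # 99% majority class
X = np.zeros((100, 1))                    # features don't matter here
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)
print(accuracy_score(y, pred))            # 0.99
print(recall_score(y, pred))              # 0.0: the minority class is never found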
Imputing is a strategy to handle __
Choose the correct option from below list
i) Class Imbalance
ii) Standardization
iii) Missing Values
Right Answer : iii) Missing Values
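A minimal sketch of imputation with scikit-learn's SimpleImputer (toy data made up; the same class appears in step 9 of the hands-on below):
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1.0], [np.nan], [3.0]])
# The missing value is replaced by the column mean (here 2.0).
print(SimpleImputer(strategy="mean").fit_transform(data))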
A classifier that can compute using numeric as well as categorical values is __
Choose the correct option from below list
i) Naive Bayes Classifier
ii) Random Forest Classifier
iii) SVM Classifier
iv) Decision Tree Classifier
Right Answer : ii) Random Forest Classifier
The cross-validation technique is used to evaluate a classifier by dividing the data set into a training set to train the classifier and a testing set to test the same.
Choose the correct option from below list
i) True
ii) False
Right Answer : i) True
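A minimal sketch with scikit-learn's cross_val_score, which repeats that train/test split K times:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Each of the 5 folds trains on 4/5 of the data and tests on the rest.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())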
To view the first 3 rows of the dataset, which of the following commands is used?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) iris.top(3)
ii) iris.head(3)
iii) iris.select(3)
iv) iris.get(3)
Right Answer : ii) iris.head(3)
Supervised learning differs from unsupervised learning as supervised learning requires __
Choose the correct option from below list
i) Labeled data
ii) None of the options
iii) Unlabeled data
iv) Raw data
Right Answer : i) Labeled data
What does the command iris['species'].value_counts() return?
Download the dataset from https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv to answer the question.
Choose the correct option from below list
i) The total count of elements in the iris['species'] column
ii) The count of each unique value in the iris['species'] column
iii) The number of columns in the dataset
iv) The number of rows in the dataset
Right Answer : ii) The count of each unique value in the iris['species'] column
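For example (a sketch, assuming iris is loaded as above):
# value_counts() lists each unique value together with its frequency.
print(iris["species"].value_counts())
# each of the 3 species appears 50 times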
Ensemble learning is used when you build component classifiers that are more accurate and independent of each other.
Choose the correct option from below list
i) False
ii) True
Right Answer : ii) True
True Positive is when the predicted instance and the actual instance are positive.
Choose the correct option from below list
i) True
ii) False
Right Answer : i) True
Ordinal variables have __
Choose the correct option from below list
i) No logical order
ii) A clear logical order
Right Answer : ii) A clear logical order
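In pandas, an ordinal variable can be modelled as an ordered categorical; a minimal sketch with made-up sizes:
import pandas as pd

sizes = pd.Categorical(["small", "large", "medium"],
                       categories=["small", "medium", "large"],
                       ordered=True)
# The declared order makes comparisons such as min/max meaningful.
print(sizes.min(), sizes.max())           # small large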
Structured Data Classification Hands-on Solutions

The course ID for Structured Data Classification is 55941.
Structured_test
step 1:
import pandas as pd
import numpy as np
step 2:
weather = pd.read_csv('weather.csv', sep=',')
step 3:
data_size=weather.shape
print(data_size)
weather_col_names = list(weather.columns)
print(weather_col_names)
print(weather.describe())
print(weather.head(3))
step 4:
weather_target = weather['RainTomorrow']
print(weather_target)
step 5:
cols_to_drop = ['Date', 'RainTomorrow']
weather_feature = weather.drop(cols_to_drop, axis=1)
print(weather_feature.head(5))
step 6:
weather_categorical = weather.select_dtypes(include=[object])
print(weather_categorical.head(15))
step 7:
yes_no_cols = ["RainToday"]
weather_feature[yes_no_cols] = weather_feature[yes_no_cols] == 'Yes'
print(weather_feature.head(5))
step 8:
weather_dumm = pd.get_dummies(weather_feature, columns=["Location", "WindGustDir", "WindDir9am", "WindDir3pm"], prefix=["Location", "WindGustDir", "WindDir9am", "WindDir3pm"])
weather_matrix = weather_dumm.values.astype(float)
step 9:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
weather_matrix=imp.fit_transform(weather_matrix)
step 10:
from sklearn.preprocessing import StandardScaler
# Standardize the data by removing the mean and scaling to unit variance.
scaler = StandardScaler()
# Fit to data, then transform it.
weather_matrix = scaler.fit_transform(weather_matrix)
step 11:
from sklearn.model_selection import train_test_split
seed = 5000
train_data, test_data, train_label, test_label = train_test_split(weather_matrix, weather_target, test_size=0.1, random_state=seed)
step 12:
from sklearn.svm import SVC
classifier = SVC(kernel="linear", C=0.025, random_state=seed)
classifier = classifier.fit(train_data, train_label)
churn_predicted_target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('SVM Classifier : ', score)
with open('output.txt', 'w') as file:
    file.write(str(np.mean(score)))
step 13:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=10, random_state=seed)
classifier = classifier.fit(train_data, train_label)
churn_predicted_target = classifier.predict(test_data)
score = classifier.score(test_data, test_label)
print('Random Forest Classifier : ', score)
with open('output1.txt', 'w') as file:
    file.write(str(np.mean(score)))