Top 101 Python Pandas Interview Questions & Answers

Python Pandas is a fast, powerful, Multi-use, flexible and easy to use open source data analysis and manipulation tool, built on top of Python programming language.

101 Python Pandas interview questions

Install pandas now!

101 Python Pandas Interview Questions & Answers

1) Explain what is Series in Python Pandas?

A Series is defined as a one-dimensional array that is capable of storing various data types. The row labels of series are called the index. By using a ‘series’ method, we can easily convert the list, tuple, and dictionary into series. A Series cannot contain multiple columns.

2) How can we calculate the standard deviation from the Series?

The Python Pandas std() is defined as a function for calculating the standard deviation of the given set of numbers, DataFrame, column, and rows.

Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)

3) Explain what is DataFrame in Pandas?

A DataFrame is a widely used data structure of pandas and works with a two-dimensional array with labeled axes (rows and columns) DataFrame is defined as a standard way to store data and has two different indexes, i.e., row index and column index. It consists of the following properties:

The columns can be heterogeneous types like int and bool.

It can be seen as a dictionary of Series structure where both the rows and columns are indexed. It is denoted as “columns” in the case of columns and “index” in case of rows.

4) What are the significant features of the pandas Library?

The key features of the panda’s library are as follows:

Memory Efficient

Data Alignment

Reshaping

Merge and join

Time Series

5) Explain Categorical data in Pandas?

A Categorical data is defined as a Pandas data type that corresponds to a categorical variable in statistics. A categorical variable is generally used to take a limited and usually fixed number of possible values. Examples: gender, country affiliation, blood type, social class, observation time, or rating via Likert scales. All values of categorical data are either in categories or np.nan.

This data type is useful in the following cases:

It is useful for a string variable that consists of only a few different values. If we want to save some memory, we can convert a string variable to a categorical variable.

It is useful for the lexical order of a variable that is not the same as the logical order (?one?, ?two?, ?three?) By converting into a categorical and specify an order on the categories, sorting and min/max is responsible for using the logical order instead of the lexical order.

It is useful as a signal to other Python Pandas libraries because this column should be treated as a categorical variable.

6) How will you create a series from dict in Pandas?

A Series is defined as a one-dimensional array that is capable of storing various data types.

We can create a Pandas Series from Dictionary:

Create a Series from dict:

We can also create a Series from dict. If the dictionary object is being passed as an input and the index is not specified, then the dictionary keys are taken in a sorted order to construct the index.

If index is passed, then values correspond to a particular label in the index will be extracted from the dictionary.

import pandas as pd   

import numpy as np   

info = {‘x’ : 0., ‘y’ : 1., ‘z’ : 2.}   

a = pd.Series(info)   

print (a)   

Output:

x     0.0

y     1.0

z     2.0

dtype: float64

12) How can we create a copy of the series in Python Pandas?

We can create the copy of series by using the following syntax:

pandas.Series.copy

Series.copy(deep=True)

The above statements make a deep copy that includes a copy of the data and the indices. If we set the value of deep to False, it will neither copy the indices nor the data.

Note: If we set deep=True, the data will be copied, and the actual python objects will not be copied recursively, only the reference to the object will be copied.

7) How will you create an empty DataFrame in Pandas?

A DataFrame is a widely used data structure of pandas and works with a two-dimensional array with labeled axes (rows and columns) It is defined as a standard way to store data and has two different indexes, i.e., row index and column index.

Create an empty DataFrame:

The below code shows how to create an empty DataFrame in Pandas:

# importing the pandas library   

import pandas as pd   

info = pd.DataFrame()   

print (info)   

Output:

Empty DataFrame

Columns: []

Index: []

8) How will you add a column to a Python pandas DataFrame?

We can add any new column to an existing DataFrame. The below code demonstrates how to add any new column to an existing DataFrame:

# importing the pandas library   

import pandas as pd     

info = {‘one’ : pd.Series([1, 2, 3, 4, 5], index=[‘a’, ‘b’, ‘c’, ‘d’, ‘e’]),   

             ‘two’ : pd.Series([1, 2, 3, 4, 5, 6], index=[‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’])}   

info = pd.DataFrame(info)   

# Add a new column to an existing DataFrame object    

print (“Add new column by passing series”)   

info[‘three’]=pd.Series([20,40,60],index=[‘a’,’b’,’c’])   

print (info)   

print (“Add new column using existing DataFrame columns”)   

info[‘four’]=info[‘one’]+info[‘three’]   

print (info)   

Output:

Add new column by passing series

      one     two      three

a     1.0      1        20.0

b     2.0      2        40.0

c     3.0      3        60.0

d     4.0      4        NaN

e     5.0      5        NaN

f     NaN      6        NaN

Add new column using existing DataFrame columns

       one      two       three      four

a      1.0       1         20.0      21.0

b      2.0       2         40.0      42.0

c      3.0       3         60.0      63.0

d      4.0       4         NaN      NaN

e      5.0       5         NaN      NaN

f      NaN       6        NaN      NaN

9) How to add an Index, row, or column to a Pandas DataFrame?

Adding an Index to a DataFrame

Pandas allow adding the inputs to the index argument if you create a DataFrame. It will make sure that you have the desired index. If you don?t specify inputs, the DataFrame contains, by default, a numerically valued index that starts with 0 and ends on the last row of the DataFrame.

Adding Rows to a DataFrame

We can use .loc, iloc, and ix to insert the rows in the DataFrame.

The loc basically works for the labels of our index. It can be understood as if we insert in loc[4], which means we are looking for that values of DataFrame that have an index labeled 4.

The iloc basically works for the positions in the index. It can be understood as if we insert in iloc[4], which means we are looking for the values of DataFrame that are present at index ‘4`.

The ix is a complex case because if the index is integer-based, we pass a label to ix. The ix[4] means that we are looking in the DataFrame for those values that have an index labeled 4. However, if the index is not only integer-based, ix will deal with the positions as iloc.

Adding Columns to a DataFrame

If we want to add the column to the DataFrame, we can easily follow the same procedure as adding an index to the DataFrame by using loc or iloc.

10) How to Delete Indices, Rows or Columns From a Pandas Data Frame?

Deleting an Index from Your DataFrame

If you want to remove the index from the DataFrame, you should have to do the following:

Reset the index of DataFrame.

Executing del df.index.name to remove the index name.

Remove duplicate index values by resetting the index and drop the duplicate values from the index column.

Remove an index with a row.

Deleting a Column from Your DataFrame

You can use the drop() method for deleting a column from the DataFrame.

The axis argument that is passed to the drop() method is either 0 if it indicates the rows and 1 if it drops the columns.

You can pass the argument inplace and set it to True to delete the column without reassign the DataFrame.

You can also delete the duplicate values from the column by using the drop_duplicates() method.

Removing a Row from Your DataFrame

By using df.drop_duplicates(), we can remove duplicate rows from the DataFrame.

You can use the drop() method to specify the index of the rows that we want to remove from the DataFrame.

11) How to Rename the Index or Columns of a Pandas DataFrame?

You can use the .rename method to give different values to the columns or the index values of DataFrame.

12) How to iterate over a Pandas DataFrame?

You can iterate over the rows of the DataFrame by using for loop in combination with an iterrows() call on the DataFrame.

13) How can we convert a Series to DataFrame?

The Pandas Series.to_frame() function is used to convert the series object to the DataFrame.

Series.to_frame(name=None) 

name: Refers to the object. Its Default value is None. If it has one value, the passed name will be substituted for the series name.

s = pd.Series([“a”, “b”, “c”],   

name=”vals”)   

s.to_frame()   

Output:

       vals

0          a

1          b

2          c

14) What is Python Pandas NumPy array?

Numerical Python (Numpy) is defined as a Python package used for performing the various numerical computations and processing of the multidimensional and single-dimensional array elements. The calculations using Numpy arrays are faster than the normal Python array.

15) How can we convert DataFrame into a NumPy array?

For performing some high-level mathematical functions, we can convert Pandas DataFrame to numpy arrays. It uses the DataFrame.to_numpy() function.

The DataFrame.to_numpy() function is applied to the DataFrame that returns the numpy ndarray.

DataFrame.to_numpy(dtype=None, copy=False)    

16) How can we convert DataFrame into an excel file?

We can export the DataFrame to the excel file by using the to_excel() function.

To write a single object to the excel file, we have to specify the target file name. If we want to write to multiple sheets, we need to create an ExcelWriter object with target filename and also need to specify the sheet in the file in which we have to write.

17) How can we sort the DataFrame?

We can efficiently perform sorting in the DataFrame through different kinds:

By label

By Actual value

By label

The DataFrame can be sorted by using the sort_index() method. It can be done by passing the axis arguments and the order of sorting. The sorting is done on row labels in ascending order by default.

By Actual Value

It is another kind through which sorting can be performed in the DataFrame. Like index sorting, sort_values() is a method for sorting the values.

It also provides a feature in which we can specify the column name of the DataFrame with which values are to be sorted. It is done by passing the ‘by’ argument.

18) What is Time Series in Pandas?

The Time series data is defined as an essential source for information that provides a strategy that is used in various businesses. From a conventional finance industry to the education industry, it consists of a lot of details about the time.

Time series forecasting is the machine learning modeling that deals with the Time Series data for predicting future values through Time Series modeling.

19) What is Time Offset?

The offset specifies a set of dates that conform to the DateOffset. We can create the DateOffsets to move the dates forward to valid dates.

20) Explain what is Time Periods?

The Time Periods represent the time span, e.g., days, years, quarter or month, etc. It is defined as a class that allows us to convert the frequency to the periods.

21) How to convert String to date?

The below code demonstrates how to convert the string to date:

fromdatetime import datetime   

# Explain what is dates as the strings      

dmy_str1 = ‘Wednesday, July 14, 2018’   

dmy_str2 = ’14/7/17′   

dmy_str3 = ’14-07-2017′   

# Explain what is dates as the datetime objects   

dmy_dt1 = datetime.strptime(date_str1, ‘%A, %B %d, %Y’)    

dmy_dt2 = datetime.strptime(date_str2, ‘%m/%d/%y’)   

dmy_dt3 = datetime.strptime(date_str3, ‘%m-%d-%Y’)   

#Print the converted dates   

print(dmy_dt1)   

print(dmy_dt2)   

print(dmy_dt3)   

Output:

2017-07-14 00:00:00

2017-07-14 00:00:00

2018-07-14 00:00:00

22) What is Data Aggregation?

The main task of Data Aggregation is to apply some aggregation to one or more columns. It uses the following:

sum: It is used to return the sum of the values for the requested axis.

min: It is used to return a minimum of the values for the requested axis.

max: It is used to return a maximum values for the requested axis.

23) What is Python Pandas Index?

Pandas Index is defined as a vital tool that selects particular rows and columns of data from a DataFrame. Its task is to organize the data and to provide fast accessing of data. It can also be called a Subset Selection.

24) Explain what is Multiple Indexing?

Multiple indexing is defined as essential indexing because it deals with data analysis and manipulation, especially for working with higher dimensional data. It also enables us to store and manipulate data with the arbitrary number of dimensions in lower-dimensional data structures like Series and DataFrame.

25) Explain what is ReIndexing?

Reindexing is used to change the index of the rows and columns of the DataFrame. We can reindex the single or multiple rows by using the reindex() method. Default values in the new index are assigned NaN if it is not present in the DataFrame.

DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)

26) How to Set the index?

We can set the index column while making a data frame. But sometimes, a data frame is made from two or more data frames, and then the index can be changed using this method.

27) How to Reset the index?

The Reset index of the DataFrame is used to reset the index by using the ‘reset_index’ command. If the DataFrame has a MultiIndex, this method can remove one or more levels.

28) Describe Data Operations in Pandas?

In Pandas, there are different useful data operations for DataFrame, which are as follows:

Row and column selection

We can select any row and column of the DataFrame by passing the name of the rows and columns. When you select it from the DataFrame, it becomes one-dimensional and considered as Series.

Filter Data

We can filter the data by providing some of the boolean expressions in DataFrame.

Null values

A Null value occurs when no data is provided to the items. The various columns may contain no values, which are usually represented as NaN.

29) Explain what is GroupBy in Pandas?

In Python Pandas, groupby() function allows us to rearrange the data by utilizing them on real-world data sets. Its primary task is to split the data into various groups. These groups are categorized based on some criteria. The objects can be divided from any of their axes.

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

30) What is Pandas?

Ans: Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python Pandas.

31) What is Python pandas used for?

Ans: Python Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Python pandas is free software released under the three-clause BSD license.

32) Explain Reindexing in python pandas?

Reindexing is used to conform DataFrame to a new index with optional filling logic. It places NA/NaN in that location where the values are not present in the previous index. It returns a new object unless the new index is produced as equivalent to the current one, and the value of copy becomes False. It is used to change the index of the rows and columns of the DataFrame.

33) What is the name of Pandas library tools used to create a scatter plot matrix?

Scatter_matrix

34) Explain what is the different ways a DataFrame can be created in pandas?

We can create a DataFrame using following ways:

Lists

Dict of ndarrays

Example-1: Create a DataFrame using List:

import pandas as pd   

# a list of strings   

a = [‘Python’, ‘Pandas’]   

# Calling DataFrame constructor on list   

info = pd.DataFrame(a)   

print(info)   

Output:

                0

0   Python

1   Pandas

Example-2: Create a DataFrame from dict of ndarrays:

import pandas as pd   

info = {‘ID’ :[101, 102, 103],’Department’ :[‘B.Sc’,’B.Tech’,’M.Tech’,]}   

info = pd.DataFrame(info)   

print (info)  

Output:

       ID      Department

0      101        B.Sc

1      102        B.Tech

2      103        M.Tech

35) What is a Series in Pandas?

Ans: python Pandas Series is a one-dimensional labelled array capable of holding data of any type (integer, string, float, python Pandas objects, etc.). The axis labels are collectively called index. python Pandas Series is nothing but a column in an excel sheet.

36) Mention the different Types of Data structures in python pandas??

Ans: There are two data structures supported by pandas library, Series and DataFrames. Both of the data structures are built on top of Numpy. Series is a one-dimensional data structure in pandas and DataFrame is the two-dimensional data structure in pandas. There is one more axis label known as Panel which is a three-dimensional data structure and it includes items, major_axis, and minor_axis.

37) Explain Reindexing in pandas?

Ans: Re-indexing means to conform DataFrame to a new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. It changes the row labels and column labels of a DataFrame.

38) What are the key features of pandas library ?

Ans: There are various features in pandas library and some of them are mentioned below

Data Alignment

Memory Efficient

Reshaping

Merge and join

Time Series

39) What is pandas Used For ?

Ans: This library is written for the Python programming language for performing operations like data manipulation, data analysis, etc. The library provides various operations as well as data structures to manipulate time series and numerical tables.

40) How can we create copy of series in python Pandas?

Ans: pandas.Series.copy

Series.copy(deep=True)

pandas.Series.copy. Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices or the data are copied. Note that when deep=True data is copied, actual python objects will not be copied recursively, only the reference to the object.

41) What is Time Series in pandas?

Ans: A time series is an ordered sequence of data which basically represents how some quantity changes over time. pandas contains extensive capabilities and features for working with time series data for all domains.

pandas supports:

Parsing time series information from various sources and formats

Generate sequences of fixed-frequency dates and time spans

Manipulating and converting date time with timezone information

Resampling or converting a time series to a particular frequency

Performing date and time arithmetic with absolute or relative time increments

42) Explain Categorical Data in Pandas?

Ans: Categorical are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales. All values of categorical data are either in categories or np.nan.

The categorical data type is useful in the following cases:

A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory,

The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order,

As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

43) How will you create a series from dict in Python Pandas?

Ans: A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python Pandas objects, etc.). It has to be remembered that unlike Python Pandas lists, a Series will always contain data of the same type.

Let’s see how to create a Pandas Series from Dictionary.

Using Series() method without index parameter.

44) What are operations on Series in pandas?

Ans: Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index. Pandas Series is nothing but a column in an excel sheet.

Creating a Pandas Series-

In the real world, a Pandas Series will be created by loading the datasets from existing storage, storage can be SQL Database, CSV file, and Excel file. Pandas Series can be created from the lists, dictionary, and from a scalar value etc. Series can be created in different ways, here are some ways by which we create a series:

Creating a series from array: In order to create a series from array, we have to import a numpy module and have to use array() function.

# import pandas as pd

import pandas as pd

# import numpy as np

import numpy as np

# simple array

data = np.array([‘g’,’e’,’e’,’k’,’s’])

ser = pd.Series(data)

print(ser)

Output :

45) What is a DataFrame in pandas?

Ans: Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.

Creating a Pandas DataFrame-

n the real world, a Pandas DataFrame will be created by loading the datasets from existing storage, storage can be SQL Database, CSV file, and Excel file. Pandas DataFrame can be created from the lists, dictionary, and from a list of dictionary etc. Dataframe can be created in different ways here are some ways by which we create a dataframe:

Creating a dataframe using List: DataFrame can be created using a single list or a list of lists.

# import pandas as pd

import pandas as pd

# list of strings

lst = [‘Geeks’, ‘For’, ‘Geeks’, ‘is’,

‘portal’, ‘for’, ‘Geeks’]

# Calling DataFrame constructor on list

df = pd.DataFrame(lst)

print(df)

Output:

46) What are the different ways in which a DataFrame can be created in Pandas?

Ans: Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is generally the most commonly used pandas object.

Pandas DataFrame can be created in multiple ways. Let’s discuss different ways to create a DataFrame one by one.

Creating Pandas DataFrame from lists of lists.

Import pandas library

import pandas as pd

# initialize list of lists

data = [[‘tom’, 10], [‘nick’, 15], [‘juli’, 14]]

# Create the pandas DataFrame

df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])

# print dataframe.

df

Output:

47) How will you create an empty DataFrame in pandas?

Ans: To create a completely empty Pandas dataframe, we use do the following:

import pandas as pd

MyEmptydf = pd.DataFrame()

This will create an empty dataframe with no columns or rows.

To create an empty dataframe with three empty column (columns X, Y and Z), we do:

df = pd.DataFrame(columns=[‘X’, ‘Y’, ‘Z’])

48) How will you add a column to a pandas DataFrame?

Ans: Adding new column to existing DataFrame in Pandas

Import pandas package

import pandas as pd

# Explain what is a dictionary containing Students data

data = {‘Name’: [‘Jai’, ‘Princi’, ‘Gaurav’, ‘Anuj’],

‘Height’: [5.1, 6.2, 5.1, 5.2],

‘Qualification’: [‘Msc’, ‘MA’, ‘Msc’, ‘Msc’]}

# Convert the dictionary into DataFrame

df = pd.DataFrame(data)

# Declare a list that is to be converted into a column

address = [‘Delhi’, ‘Bangalore’, ‘Chennai’, ‘Patna’]

# Using ‘Address’ as the column name

# and equating it to the list

df[‘Address’] = address

# Observe the result

df

Output:

49) How will you retrieve a single column from pandas DataFrame?

Ans: To start a project in Django, use the command $django-admin.py and then use the following command:

Project

_init_.py

manage.py

settings.py

urls.py

50) range ()  vs and xrange () functions in Python Pandas?

Ans:

In Python 2 we have the following two functions to produce a list of numbers within a given range.

range()

xrange()

in Python 3, xrange() is deprecated, i.e. xrange() is removed from python 3.x.

Now In Python 3, we have only one function to produce the numbers within a given range i.e. range() function.

But, range() function of python 3 works same as xrange() of python 2 (i.e. internal implementation of range() function of python 3 is same as xrange() of Python 2).

So The difference between range() and xrange() functions becomes relevant only when you are using python 2.

range() and xrange() function values

a). range() creates a list i.e., range returns a Python list object, for example, range (1,500,1) will create a python list of 499 integers in memory. Remember, range() generates all numbers at once.

b).xrange() functions returns an xrange object that evaluates lazily. That means xrange only stores the range arguments and generates the numbers on demand. It doesn’t generate all numbers at once like range(). Furthermore, this object only supports indexing, iteration, and the len() function.

On the other hand xrange() generates the numbers on demand. That means it produces number one by one as for loop moves to the next number. In every iteration of for loop, it generates the next number and assigns it to the iterator variable of for loop.

Printing return type of range() function:

range_numbers = range(2,10,2)

print (“The return type of range() is : “)

print (type(range_numbers ))

Output:

The return type of range() is :

<type ‘list’>

Printing return type of xrange() function:

xrange_numbers = xrange(1,10,1)

print “The return type of xrange() is : ”

print type(xrange_numbers )

Output:

The return type of xrange() is :

<type ‘xrange’>

51) What is the name of pandas library tools used to create a scatter plot matrix?

Ans: Scatter_matrix

52) What is pylab?

Ans:  PyLab is a package that contains NumPy, SciPy, and Matplotlib into a single namespace.

53) Explain what is the different ways a DataFrame can be created in pandas?

Ans: We can create a DataFrame using following ways:

Lists

Dict of ndarrays

Example-1: Create a DataFrame using List:

importpandas as pd

# a list of strings

a = [‘Python’, ‘Pandas’]

# Calling DataFrame constructor on list

info = pd.DataFrame(a)

print(info)

Output:

0

0   Python

1   Pandas

Example-2: Create a DataFrame from dict of ndarrays:

importpandas as pd

info = {‘ID’:[101, 102, 103],’Department’ :[‘B.Sc’,’B.Tech’,’M.Tech’,]}

info = pd.DataFrame(info)

print (info)

Output:

ID      Department

0      101        B.Sc

1      102        B.Tech

2      103        M.Tech

54) Explain Categorical data in Pandas?

Ans: A Categorical data is defined as a Pandas data type that corresponds to a categorical variable in statistics. A categorical variable is generally used to take a limited and usually fixed number of possible values. Examples: gender, country affiliation, blood type, social class, observation time, or rating via Likert scales. All values of categorical data are either in categories or np.nan.

This data type is useful in the following cases:

It is useful for a string variable that consists of only a few different values. If we want to save some memory, we can convert a string variable to a categorical variable.

It is useful for the lexical order of a variable that is not the same as the logical order (?one?, ?two?, ?three?) By converting into a categorical and specify an order on the categories, sorting and min/max is responsible for using the logical order instead of the lexical order.

It is useful as a signal to other Python Pandas libraries because this column should be treated as a categorical variable.

Parameters-

val : [list-like] The values of categorical. categories : [index like] Unique categorization of the categories. ordered : [boolean] If false, then the categorical is treated as unordered. dtype : [CategoricalDtype] an instance. Error- ValueError :  If the categories do not validate. TypeError :  If an explicit ordered = True but categorical can’t be sorted. Return- Categorical variable

Code :

# Python code explaining

# numpy.pandas.Categorical()

# importing libraries

import numpy as np

import pandas as pd

# Categorical using dtype

c = pd.Series([“a”, “b”, “d”, “a”, “d”], dtype =”category”)

print (“\nCategorical without pandas.Categorical() : \n”, c)

c1 = pd.Categorical([1, 2, 3, 1, 2, 3])

print (“\n\nc1 : “, c1)

c2 = pd.Categorical([‘e’, ‘m’, ‘f’, ‘i’,

‘f’, ‘e’, ‘h’, ‘m’ ])

print (“\nc2 : “, c2)

OutPut:

55) How will you create a series from dict in Pandas?

Ans:  A Series is defined as a one-dimensional array that is capable of storing various data types.

We can create a Pandas Series from Dictionary:

Create a Series from dict:

We can also create a Series from dict. If the dictionary object is being passed as an input and the index is not specified, then the dictionary keys are taken in a sorted order to construct the index.

If index is passed, then values correspond to a particular label in the index will be extracted from the dictionary.

importpandas as pd

importnumpy as np

info = {‘x’: 0., ‘y’ : 1., ‘z’ : 2.}

a = pd.Series(info)

print (a)

Output:

x     0.0

y     1.0

z     2.0

dtype: float64

56) How can we create a copy of the series in Pandas?

Ans: We can create the copy of series by using the following syntax:

pandas.Series.copy

Series.copy(deep=True)

The above statements make a deep copy that includes a copy of the data and the indices. If we set the value of deep to False, it will neither copy the indices nor the data.

57) How will you create an empty DataFrame in Pandas?

Ans:  A DataFrame is a widely used data structure of pandas and works with a two-dimensional array with labeled axes (rows and columns) It is defined as a standard way to store data and has two different indexes, i.e., row index and column index.

Create an empty DataFrame:

The below code shows how to create an empty DataFrame in Pandas:

# importing the pandas library

importpandas as pd

info = pd.DataFrame()

print (info)

Output:

Empty DataFrame

Columns: [ ]

Index: [ ]

58)  How will you add a column to a pandas DataFrame?

Ans: We can add any new column to an existing DataFrame. The below code demonstrates how to add any new column to an existing DataFrame:

# importing the pandas library

import pandas as pd

info = {‘one’: pd.Series([1, 2, 3, 4, 5], index=[‘a’, ‘b’, ‘c’, ‘d’, ‘e’]),

‘two’ : pd.Series([1, 2, 3, 4, 5, 6], index=[‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’])}

info = pd.DataFrame(info)

# Add a new column to an existing DataFrame object

print (“Add new column by passing series”)

info[‘three’]=pd.Series([20,40,60],index=[‘a’,’b’,’c’])

print (info)

print (“Add new column using existing DataFrame columns”)

info[‘four’]=info[‘one’]+info[‘three’]

print (info)

Output:

Add new column by passing series

one     two      three

a     1.0      1        20.0

b     2.0      2        40.0

c     3.0      3        60.0

d     4.0      4        NaN

e     5.0      5        NaN

f     NaN      6        NaN

Add new column using existing DataFrame columns

one      two       three      four

a      1.0       1         20.0      21.0

b      2.0       2         40.0      42.0

c      3.0       3         60.0      63.0

d      4.0       4         NaN      NaN

e      5.0       5         NaN      NaN

f      NaN       6        NaN      NaN

59) How to add an Index, row, or column to a Pandas DataFrame?

Ans: Adding an Index to a DataFrame

Pandas allow adding the inputs to the index argument if you create a DataFrame. It will make sure that you have the desired index. If you don?t specify inputs, the DataFrame contains, by default, a numerically valued index that starts with 0 and ends on the last row of the DataFrame.

Adding Rows to a DataFrame

We can use .loc, iloc, and ix to insert the rows in the DataFrame.

The loc basically works for the labels of our index. It can be understood as if we insert in loc[4], which means we are looking for that values of DataFrame that have an index labeled 4.

The iloc basically works for the positions in the index. It can be understood as if we insert in iloc[4], which means we are looking for the values of DataFrame that are present at index ‘4`.

The ix is a complex case because if the index is integer-based, we pass a label to ix. The ix[4] means that we are looking in the DataFrame for those values that have an index labeled 4. However, if the index is not only integer-based, ix will deal with the positions as iloc.

Adding Columns to a DataFrame

If we want to add the column to the DataFrame, we can easily follow the same procedure as adding an index to the DataFrame by using loc or iloc.

Add row with specific index name:

import pandas as pd

employees = pd.DataFrame(

    data={‘Name’: [‘John Doe’, ‘William Spark’],

          ‘Occupation’: [‘Chemist’, ‘Statistician’],

          ‘Date Of Join’: [‘2018-01-25’, ‘2018-01-26’],

          ‘Age’: [23, 24]},

    index=[‘Emp001’, ‘Emp002’],

    columns=[‘Name’, ‘Occupation’, ‘Date Of Join’, ‘Age’])

print(“\n———— BEFORE —————-\n”)

print(employees)

employees.loc[‘Emp003’] = [‘Sunny’, ‘Programmer’, ‘2018-01-25’, 45]

print(“\n———— AFTER —————-\n”)

print(employees)

OUTPUT :

C:\pandas>python example22.py

———— BEFORE —————-

                 Name    Occupation Date Of Join  Age

Emp001       John Doe       Chemist   2018-01-25   23

Emp002  William Spark  Statistician   2018-01-26   24

———— AFTER —————-

                 Name    Occupation Date Of Join  Age

Emp001       John Doe       Chemist   2018-01-25   23

Emp002  William Spark  Statistician   2018-01-26   24

Emp003          Sunny    Programmer   2018-01-25   45

28. How to Delete Indices, Rows or Columns From a Pandas Data Frame?

Ans:  Deleting an Index from Your DataFrame

If you want to remove the index from the DataFrame, you should have to do the following:

Reset the index of DataFrame.

Executing del df.index.name to remove the index name.

Remove duplicate index values by resetting the index and drop the duplicate values from the index column.

Remove an index with a row.

Deleting a Column from Your DataFrame

You can use the drop() method for deleting a column from the DataFrame.

The axis argument that is passed to the drop() method is either 0 if it indicates the rows and 1 if it drops the columns.

You can pass the argument inplace and set it to True to delete the column without reassign the DataFrame.

You can also delete the duplicate values from the column by using the drop_duplicates() method.

Removing a Row from Your DataFrame

By using df.drop_duplicates(), we can remove duplicate rows from the DataFrame.

You can use the drop() method to specify the index of the rows that we want to remove from the DataFrame.

60) How to Rename the Index or Columns of a Pandas DataFrame?

Ans: You can use the .rename method to give different values to the columns or the index values of DataFrame.

There are the following ways to change index / columns names (labels) of pandas.DataFrame.

Use pandas.DataFrame.rename()

Change any index / columns names individually with dict

Change all index / columns names with a function

Use pandas.DataFrame.add_prefix(), pandas.DataFrame.add_suffix()

Add prefix and suffix to columns name

Update the index / columns attributes of pandas.DataFrame

Replace all index / columns names

set_index() method that sets an existing column as an index is also provided. See the following post for detail.

Specify the original name and the new name in dict like {original name: new name} to index / columns of rename().

index is for index name and columns is for the columns name. If you want to change either, you need only specify one of index or columns.

A new DataFrame is returned, the original DataFrame is not changed.

df_new = df.rename(columns={‘A’: ‘a’}, index={‘ONE’: ‘one’})

print(df_new)

#         a   B   C

# one    11  12  13

# TWO    21  22  23

# THREE  31  32  33

print(df)

#         A   B   C

# ONE    11  12  13

# TWO    21  22  23

# THREE  31  32  33

61) How to iterate over a Pandas DataFrame?

Ans. You can iterate over the rows of the DataFrame by using for loop in combination with an iterrows() call on the DataFrame.

import pandas as pd

 import numpy as np

df = pd.DataFrame([{‘c1’:10, ‘c2’:100}, {‘c1′:11,’c2’:110}, {‘c1′:12,’c2’:120}])

for index, row in df.iterrows():

    print(row[‘c1’], row[‘c2’])

Output:

   10 100

   11 110

   12 120

62) How to get the items of series A not present in series B?

Ans: We can remove items present in p2 from p1 using isin() method.

import pandas as pd

p1 = pd.Series([2, 4, 6, 8, 10])

p2 = pd.Series([8, 10, 12, 14, 16])

p1[~p1.isin(p2)]

Solution

0    2

1    4

2    6

dtype: int64

63) How to get the items not common to both series A and series B?

Ans:  We get all the items of p1 and p2 not common to both using below example:

import pandas as pd

import numpy as np

p1 = pd.Series([2, 4, 6, 8, 10])

p2 = pd.Series([8, 10, 12, 14, 16])

p1[~p1.isin(p2)]

p_u = pd.Series(np.union1d(p1, p2))  # union

p_i = pd.Series(np.intersect1d(p1, p2))  # intersect

p_u[~p_u.isin(p_i)]

Output:

0     2

1     4

2     6

5    12

6    14

7    16

dtype: int64

64) How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?

Ans: We can compute the minimum, 25th percentile, median, 75th, and maximum of p as below example:

import pandas as pd

import numpy as np

p = pd.Series(np.random.normal(14, 6, 22))

state = np.random.RandomState(120)

p = pd.Series(state.normal(14, 6, 22))

percentile(p, q=[0, 25, 50, 75, 100])

Output:

array([ 4.61498692, 12.15572753, 14.67780756, 17.58054104, 33.24975515])

65) How to get frequency counts of unique items of a series?

Ans: We can calculate the frequency counts of each unique value p as below example:

import pandas as pd

import numpy as np

p= pd.Series(np.take(list(‘pqrstu’), np.random.randint(6, size=17)))

p = pd.Series(np.take(list(‘pqrstu’), np.random.randint(6, size=17)))

value_counts()

Output:

s    4

r    4

q    3

p    3

u    3

66) How to convert a numpy array to a dataframe of given shape?

Ans. We can reshape the series p into a dataframe with 6 rows and 2 columns as below example:

import pandas as pd

import numpy as np

p = pd.Series(np.random.randint(1, 7, 35))

# Input

p = pd.Series(np.random.randint(1, 7, 35))

info = pd.DataFrame(p.values.reshape(7,5))

print(info)

Output:

0  1  2  3  4

0  3  2  5  5  1

1  3  2  5  5  5

2  1  3  1  2  6

3  1  1  1  2  2

4  3  5  3  3  3

5  2  5  3  6  4

6  3  6  6  6  5

67) How can we convert a Series to DataFrame?

Ans: The Pandas Series.to_frame() function is used to convert the series object to the DataFrame.

to_frame(name=None)

name: Refers to the object. Its Default value is None. If it has one value, the passed name will be substituted for the series name.

s = pd.Series([“a”, “b”, “c”],

name=”vals”)

to_frame()

Output:

vals

0          a

1          b

2          c

68) How can we sort the DataFrame?

Ans: We can efficiently perform sorting in the DataFrame through different kinds:

By label

By Actual value

1). By label

The DataFrame can be sorted by using the sort_index() method. It can be done by passing the axis arguments and the order of sorting. The sorting is done on row labels in ascending order by default.

Using the sort_index() method, by passing the axis arguments and the order of sorting, DataFrame can be sorted. By default, sorting is done on row labels in ascending order.

import pandas as pd

import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu

   mns = [‘col2′,’col1’])

sorted_df=unsorted_df.sort_index()

print sorted_df

Its output is as follows −

        col2       col1

0   0.208464   0.627037

1   0.641004   0.331352

2  -0.038067  -0.464730

3  -0.638456  -0.021466

4   0.014646  -0.737438

5  -0.290761  -1.669827

6  -0.797303  -0.018737

7   0.525753   1.628921

8  -0.567031   0.775951

9   0.060724  -0.322425

Order of Sorting

By passing the Boolean value to ascending parameter, the order of the sorting can be controlled. Let us consider the following example to understand the same.

import pandas as pd

import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu

   mns = [‘col2′,’col1’])

sorted_df = unsorted_df.sort_index(ascending=False)

print sorted_df

Its output is as follows −

         col2        col1

9    0.825697    0.374463

8   -1.699509    0.510373

7   -0.581378    0.622958

6   -0.202951    0.954300

5   -1.289321   -1.551250

4    1.302561    0.851385

3   -0.157915   -0.388659

2   -1.222295    0.166609

1    0.584890   -0.291048

0    0.668444   -0.061294

Sort the Columns

By passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0, sort by row. Let us consider the following example to understand the same.

import pandas as pd

import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],colu

   mns = [‘col2′,’col1’])

sorted_df=unsorted_df.sort_index(axis=1)

print sorted_df

Its output is as follows −

         col1        col2

1   -0.291048    0.584890

4    0.851385    1.302561

6    0.954300   -0.202951

2    0.166609   -1.222295

3   -0.388659   -0.157915

5   -1.551250   -1.289321

9    0.374463    0.825697

8    0.510373   -1.699509

0   -0.061294    0.668444

7    0.622958   -0.581378

2). By Actual Value

It is another kind through which sorting can be performed in the DataFrame. Like index sorting, sort_values() is a method for sorting the values.

It also provides a feature in which we can specify the column name of the DataFrame with which values are to be sorted. It is done by passing the ‘by‘ argument.

import pandas as pd

import numpy as np

unsorted_df = pd.DataFrame({‘col1′:[2,1,1,1],’col2’:[1,3,2,4]})

   sorted_df = unsorted_df.sort_values(by=’col1′)

print sorted_df

Its output is as follows −

   col1  col2

1    1    3

2    1    2

3    1    4

0    2    1

Observe, col1 values are sorted and the respective col2 value and row index will alter along with col1. Thus, they look unsorted.

‘by’ argument takes a list of column values.

import pandas as pd

import numpy as np

unsorted_df = pd.DataFrame({‘col1′:[2,1,1,1],’col2’:[1,3,2,4]})

   sorted_df = unsorted_df.sort_values(by=[‘col1′,’col2’])

print sorted_df

Its output is as follows −

  col1 col2

2   1   2

1   1   3

3   1   4

0   2   1

Sorting Algorithm

sort_values() provides a provision to choose the algorithm from mergesort, heapsort and quicksort. Mergesort is the only stable algorithm.

import pandas as pd

import numpy as np

unsorted_df = pd.DataFrame({‘col1′:[2,1,1,1],’col2’:[1,3,2,4]})

sorted_df = unsorted_df.sort_values(by=’col1′ ,kind=’mergesort’)

print sorted_df

Its output is as follows −

  col1 col2

1    1    3

2    1    2

3    1    4

0    2    1

69) How to convert String to date?

Ans: The below code demonstrates how to convert the string to date:

From datetime import datetime

# Explain what is dates as the strings

dmy_str1 = ‘Wednesday, July 14, 2018’

dmy_str2 = ’14/7/17′

dmy_str3 = ’14-07-2017′

# Explain what is dates as the datetime objects

dmy_dt1 = datetime.strptime(date_str1, ‘%A, %B %d, %Y’)

dmy_dt2 = datetime.strptime(date_str2, ‘%m/%d/%y’)

dmy_dt3 = datetime.strptime(date_str3, ‘%m-%d-%Y’)

#Print the converted dates

print(dmy_dt1)

print(dmy_dt2)

print(dmy_dt3)

Output:

2017-07-14 00:00:00

2017-07-14 00:00:00

2018-07-14 00:00:00

70) What is Data Aggregation?

Ans: The main task of Data Aggregation is to apply some aggregation to one or more columns. It uses the following:

sum: It is used to return the sum of the values for the requested axis.

min: It is used to return a minimum of the values for the requested axis.

max: It is used to return a maximum values for the requested axis.

Screenshot of the pandas aggregations

Examples

import pandas as pd

import numpy as np

df = pd.DataFrame([[1, 2, 3],

                   [4, 5, 6],

                    [7, 8, 9],

                   [np.nan, np.nan, np.nan]],

columns=[‘A’, ‘B’, ‘C’])

print(df)

# Aggregate these functions over the rows.

print(df.agg([‘sum’, ‘min’]))

# Different aggregations per column.

print(df.agg({‘A’ : [‘sum’, ‘min’], ‘B’ : [‘min’, ‘max’]}))

# Aggregate over the columns.

print(df.agg(“mean”, axis=”columns”))

# Aggregate over the rows.

print(df.agg(“mean”, axis=”rows”))

OUTPUT :

A  B    C

0  1.0  2  3.0

1  4.0  5  6.0

2  7.0  8  9.0

3  NaN  NaN  NaN

        A   B     C

sum  12.0  15.0  18.0

min   1.0  2.0   3.0  

        A    B

max   NaN  8.0

min   1.0  2.0

sum  12.0  NaN

0    2.0

1    5.0

2    8.0

3    NaN

dtype: float64

A    4.0

B    5.0

C    6.0

dtype: float64

71) What is Pandas Index?

Ans:

Indexing in Pandas :

Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.

Pandas Indexing using  [ ], .loc[], .iloc[ ], .ix[ ]

There are a lot of ways to pull the elements, rows, and columns from a DataFrame. There are some indexing method in Pandas which help in getting an element from a DataFrame. These indexing methods appear very similar but behave very differently. Pandas support four types of Multi-axes indexing they are:

Dataframe.[ ] ; This function also known as indexing operator

Dataframe.loc[ ] : This function is used for labels.

Dataframe.iloc[ ] : This function is used for positions or integer based

Dataframe.ix[] : This function is used for both label and integer based

Collectively, they are called the indexers. These are by far the most common ways to index data. These are four function which help in getting the elements, rows, and columns from a DataFrame.

1) Indexing a Dataframe using indexing operator [] :

Indexing operator is used to refer to the square brackets following an object. The .loc and .iloc indexers also use the indexing operator to make selections. In this indexing operator to refer to df[].

In order to select a single column, we simply put the name of the column in-between the brackets

filter_nonebrightness_4

# importing pandas package

importpandas as pd

# making data frame from csv file

data =pd.read_csv(“nba.csv”, index_col =”Name”)

# retrieving columns by indexing operator

first =data[“Age”]

print(first)

Output:

2. Indexing a DataFrame using .loc[ ] :

This function selects data by the label of the rows and columns. The df.loc indexer selects data in a different way than just the indexing operator. It can select subsets of rows or columns. It can also simultaneously select subsets of rows and columns.

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv(“nba.csv”, index_col =”Name”)

# retrieving row by loc method

first = data.loc[“Avery Bradley”]

second = data.loc[“R.J. Hunter”]

print(first, “\n\n\n”, second)

Output:

As shown in the output image, two series were returned since there was only one parameter both of the times.

3. Indexing a DataFrame using .iloc[ ] :

This function allows us to retrieve rows and columns by position. In order to do that, we’ll need to specify the positions of the rows that we want, and the positions of the columns that we want as well. The df.iloc indexer is very similar to df.loc but only uses integer locations to make its selections.

In order to select a single row using .iloc[], we can pass a single integer to .iloc[] function.

importpandas as pd

# making data frame from csv file

data =pd.read_csv(“nba.csv”, index_col =”Name”)

# retrieving rows by iloc method

row2 =data.iloc[3]

print(row2)

Output :

4. Indexing a using Dataframe.ix[ ] :

Early in the development of pandas, there existed another indexer, ix. This indexer was capable of selecting both by label and by integer location. While it was versatile, it caused lots of confusion because it’s not explicit. Sometimes integers can also be labels for rows or columns. Thus there were instances where it was ambiguous. Generally, ix is label based and acts just as the .loc indexer. However, .ix also supports integer type selections (as in .iloc) where passed an integer. This only works where the index of the DataFrame is not integer based .ix will accept any of the inputs of .loc and .iloc.

Note: The .ix indexer has been deprecated in recent versions of Pandas.

Selecting a single row using .ix[] as .loc[]

In order to select a single row, we put a single row label in a .ix function. This function act similar as .loc[ ] if we pass a row label as a argument of a function.

# importing pandas package

importpandas as pd

# making data frame from csv file

data =pd.read_csv(“nba.csv”, index_col =”Name”)

# retrieving row by ix method

first =data.ix[“Avery Bradley”]

print(first)

Output :

72) Explain what is ReIndexing?

Ans:  Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through indexing like −

Reorder the existing data to match a new set of labels.

Insert missing value (NA) markers in label locations where no data for the label existed.

Example

import pandas as pd

import numpy as np

N=20

df = pd.DataFrame({

   ‘A’: pd.date_range(start=’2016-01-01′,periods=N,freq=’D’),

   ‘x’: np.linspace(0,stop=N-1,num=N),

   ‘y’: np.random.rand(N),

   ‘C’: np.random.choice([‘Low’,’Medium’,’High’],N).tolist(),

   ‘D’: np.random.normal(100, 10, size=(N)).tolist()

})

#reindex the DataFrame

df_reindexed = df.reindex(index=[0,2,5], columns=[‘A’, ‘C’, ‘B’])

print df_reindexed

Its output is as follows −

            A    C     B

0  2016-01-01  Low   NaN

2  2016-01-03  High  NaN

5  2016-01-06  Low   NaN

73) Explain what is Multiple Indexing?

Ans: Multiple indexing is defined as essential indexing because it deals with data analysis and manipulation, especially for working with higher dimensional data. It also enables us to store and manipulate data with the arbitrary number of dimensions in lower-dimensional data structures like Series and DataFrame.

Multiple index Column

In this example, two columns will be made as index column. Drop parameter is used to Drop the column and append parameter is used to append passed columns to the already existing index column.

filter_nonebrightness_4

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv(“employees.csv”)

# setting first name as index column

data.set_index([“First Name”, “Gender”], inplace = True,

                            append = True, drop = False)

# display

data.head()

Output:

As shown in the output Image, the data is having 3 index columns.

74) How to Set the index?

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas set_index() is a method to set a List, Series or Data frame as index of a Data Frame. Index column can be set while making a data frame too. But sometimes a data frame is made out of two or more data frames and hence later index can be changed using this method.

Syntax:

DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False

Parameters:

keys: Column name or list of column name.

drop: Boolean value which drops the column used for index if True.

append: Appends the column to existing index column if True.

inplace: Makes the changes in the dataframe if True.

verify_integrity: Checks the new index column for duplicates if True.

Changing Index column

In this example, First Name column has been made the index column of Data Frame.

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv(“employees.csv”)

# setting first name as index column

data.set_index(“First Name”, inplace = True)

# display

data.head()

Output:

As shown in the output images, earlier the index column was a series of number but later it has been replaced with First name.

Before Operation –

After Operation

75) How to Reset the index?

Ans: Pandas series is a One-dimensional ndarray with axis labels. The labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.

Pandas Series.reset_index() function generate a new DataFrame or Series with the index reset. This comes handy when index is need to be used as a column.

Syntax: Series.reset_index(level=None, drop=False, name=None, inplace=False)

Parameter :

level : For a Series with a MultiIndex

drop : Just reset the index, without inserting it as a column in the new DataFrame.

name : The name to use for the column containing the original Series values.

inplace : Modify the Series in place

Returns : result : Series

Example #1: Use Series.reset_index() function to reset the index of the given Series object.

# importing pandas as pd

import pandas as pd

# Creating the Series

sr = pd.Series([10, 25, 3, 11, 24, 6])

# Create the Index

index_ = [‘Coca Cola’, ‘Sprite’, ‘Coke’, ‘Fanta’, ‘Dew’, ‘ThumbsUp’]

# set the index

sr.index = index_

# Print the series

print(sr)

Output:

Now we will use Series.reset_index() function to reset the index of the given series object.

# reset the index

result = sr.reset_index()

# Print the result

print(result)

Output :

As we can see in the output, the Series.reset_index() function has reset the index of the given Series object to default. It has preserved the index and it has converted it to a column.

76) Describe Data Operations in Pandas?

Ans: In Pandas, there are different useful data operations for DataFrame, which are as follows:

Row and column selection

We can select any row and column of the DataFrame by passing the name of the rows and columns. When you select it from the DataFrame, it becomes one-dimensional and considered as Series.

Filter Data

We can filter the data by providing some of the boolean expressions in DataFrame.

Null values

A Null value occurs when no data is provided to the items. The various columns may contain no values, which are usually represented as NaN.

77) Explain what is GroupBy in Pandas?

Ans:  Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas dataframe.groupby() function is used to split the data into groups based on some criteria. pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.

Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

Parameters :

by : mapping, function, str, or iterable

axis : int, default 0

level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels

as_index : For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output

sort : Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.

group_keys : When calling apply, add group keys to index to identify pieces

squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent type

Returns : GroupBy object

78) How will you append new rows to a pandas DataFrame?

Ans: Pandas dataframe.append() function is used to append rows of other dataframe to the end of the given dataframe, returning a new dataframe object. Columns not in the original dataframes are added as new columns and the new cells are populated with NaN value.

Syntax: DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)

Parameters :

other : DataFrame or Series/dict-like object, or list of these

ignore_index : If True, do not use the index labels.

verify_integrity : If True, raise ValueError on creating index with duplicates.

sort : Sort columns if the columns of self and other are not aligned. The default sorting is deprecated and will change to not-sorting in a future version of pandas. Explicitly pass sort=True to silence the warning and sort. Explicitly pass sort=False to silence the warning and not sort.

Returns: appended : DataFrame

Example #1: Create two data frames and append the second to the first one.

# Importing pandas as pd

importpandas as pd

# Creating the first Dataframe using dictionary

df1 =df =pd.DataFrame({“a”:[1, 2, 3, 4],

“b”:[5, 6, 7, 8]})

# Creating the Second Dataframe using dictionary

df2 =pd.DataFrame({“a”:[1, 2, 3],

“b”:[5, 6, 7]})

# Print df1

print(df1, “\n”)

# Print df2

df2

Now append df2 at the end of df1.

# to append df2 at the end of df1 dataframe

df1.append(df2)

Notice the index value of second data frame is maintained in the appended data frame. If we do not want it to happen then we can set ignore_index=True.

# A continuous index value will be maintained

# across the rows in the new appended data frame.

df.append(df2, ignore_index = True)

Output :

79) How will you delete rows from a pandas DataFrame?

Ans:

Import modules

import pandas as pd

Create a dataframe

data = {‘name’: [‘Jason’, ‘Molly’, ‘Tina’, ‘Jake’, ‘Amy’],

        ‘year’: [2012, 2012, 2013, 2014, 2014],

        ‘reports’: [4, 24, 31, 2, 3]}

df = pd.DataFrame(data, index = [‘Cochice’, ‘Pima’, ‘Santa Cruz’, ‘Maricopa’, ‘Yuma’])

df

name      reports   year

Cochice  Jason      4              2012

Pima       Molly      24           2012

Santa Cruz             Tina         31           2013

Maricopa               Jake         2              2014

Yuma      Amy        3              2014

Delete a row

df.drop([‘Cochice’, ‘Pima’])

Output :

name      reports   year

Santa Cruz             Tina         31           2013

Maricopa               Jake         2              2014

Yuma      Amy        3              2014

80) How will you get the number of rows and columns of a DataFrame in pandas?

Ans:

import pandas as pd

import numpy as np

raw_data = {‘name’: [‘Willard Morris’, ‘Al Jennings’, ‘Omar Mullins’, ‘Spencer McDaniel’],

‘age’: [20, 19, 22, 21],

‘favorite_color’: [‘blue’, ‘red’, ‘yellow’, “green”],

‘grade’: [88, 92, 95, 70]}

df = pd.DataFrame(raw_data, columns = [‘name’, ‘age’, ‘favorite_color’, ‘grade’])

df

Output:

name      age          favorite_color       grade

0              Willard Morris      20           blue        88

1              Al Jennings             19           red          92

2              Omar Mullins        22           yellow     95

3              Spencer McDaniel                21           green      70

get the row and column count of the df

df.shape

(4, 4)

81) What is Pandas ml?

Ans: pandas_ml is a package which integrates pandas, scikit-learn, xgboost into one package for easy handling of data and creation of machine learning models

Installation

$ pip install pandas_ml

Example

>>> import pandas_ml as pdml

>>> import sklearn.datasets as datasets

# create ModelFrame instance from sklearn.datasets

>>> df = pdml.ModelFrame(datasets.load_digits())

>>> type(df)

<class ‘pandas_ml.core.frame.ModelFrame’>

# binarize data (features), not touching target

>>> df.data = df.data.preprocessing.binarize()

>>> df.head()

.target  0  1  2  3  4  5  6  7  8 …  54  55  56  57  58  59  60  61  62  63

0        0  0  0  1  1  1  1  0  0  0 …   0   0   0   0   1   1   1   0   0   0

1        1  0  0  0  1  1  1  0  0  0 …   0   0   0   0   0   1   1   1   0   0

2        2  0  0  0  1  1  1  0  0  0 …   1   0   0   0   0   1   1   1   1   0

3        3  0  0  1  1  1  1  0  0  0 …   1   0   0   0   1   1   1   1   0   0

4        4  0  0  0  1  1  0  0  0  0 …   0   0   0   0   0   1   1   1   0   0

[5 rows x 65 columns]

# split to training and test data

>>> train_df, test_df = df.model_selection.train_test_split()

# create estimator (accessor is mapped to sklearn namespace)

>>> estimator = df.svm.LinearSVC()

# fit to training data

>>> train_df.fit(estimator)

# predict test data

>>> test_df.predict(estimator)

0     4

1     2

2     7

448    5

449    8

Length: 450, dtype: int64

# Evaluate the result

>>> test_df.metrics.confusion_matrix()

Predicted   0   1   2   3   4   5   6   7   8   9

Target

0          52   0   0   0   0   0   0   0   0   0

1           0  37   1   0   0   1   0   0   3   3

2           0   2  48   1   0   0   0   1   1   0

3           1   1   0  44   0   1   0   0   3   1

4           1   0   0   0  43   0   1   0   0   0

5           0   1   0   0   0  39   0   0   0   0

6           0   1   0   0   1   0  35   0   0   0

7           0   0   0   0   2   0   0  42   1   0

8           0   2   1   0   1   0   0   0  33   1

9           0   2   1   2   0   0   0   0   1  38

82) What is Pandas Charm?

Ans: pandas-charm is a small Python Pandas package for getting character matrices (alignments) into and out of pandas. Use this library to make pandas interoperable with BioPython and DendroPy.

Convert between the following objects:

BioPython Multiple Seq Alignment <-> pandas DataFrame

DendroPy Character Matrix <-> pandas DataFrame

“Sequence dictionary” <-> pandas DataFrame

The code has been tested with Python 2.7, 3.5 and 3.6.

Installation :

$ pip install pandas-charm

You may consider installing pandas-charm and its required Python packages within a virtual environment in order to avoid cluttering your system’s Python Pandas path. See for example the environment management system conda or the package virtualenv.

Running the tests

Testing is carried out with pytest:

$ pytest -v test_pandascharm.py

Test coverage can be calculated with Coverage.py using the following commands:

$ coverage run -m pytest

$ coverage report -m pandascharm.py

The code follow style conventions in PEP8, which can be checked with pycodestyle:

$ pycodestyle pandascharm.py test_pandascharm.py setup.py

Usage

The following examples show how to use pandas-charm. The examples are written with Python 3 code, but pandas-charm should work also with Python 2.7+. You need to install BioPython Pandas and/or DendroPy manually before you start:

$ pip install biopython

$ pip install dendropy

DendroPy CharacterMatrix to pandas DataFrame

>>> import pandas as pd

>>> import pandascharm as pc

>>> import dendropy

>>> dna_string = ‘3 5\nt1  TCCAA\nt2  TGCAA\nt3  TG-AA\n’

>>> print(dna_string)

3 5

t1  TCCAA

t2  TGCAA

t3  TG-AA

>>> matrix = dendropy.DnaCharacterMatrix.get(

…     data=dna_string, schema=’phylip’)

>>> df = pc.from_charmatrix(matrix)

>>> df

  t1 t2 t3

0  T  T  T

1  C  G  G

2  C  C  –

3  A  A  A

4  A  A  A

By default, characters are stored as rows and sequences as columns in the DataFrame. If you want rows to hold sequences, just transpose the matrix in pandas:

>>> df.transpose()

    0  1  2  3  4

t1  T  C  C  A  A

t2  T  G  C  A  A

t3  T  G  –  A  A

83) How will you add a scalar column with same value for all rows to a pandas DataFrame?

Ans: 

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Dataframe.add() method is used for addition of dataframe and other, element-wise (binary operator add). Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.

Syntax: DataFrame.add(other, axis=’columns’, level=None, fill_value=None)

Parameters:

other :Series, DataFrame, or constant

axis :{0, 1, ‘index’, ‘columns’} For Series input, axis to match Series index on

fill_value : [None or float value, default None] Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.

level : [int or name] Broadcast across a level, matching Index values on the passed MultiIndex level

Returns: result DataFrame

# Importing Pandas as pd

import pandas as pd

# Importing numpy as np

import numpy as np

# Creating a dataframe

# Setting the seed value to re-generate the result.

np.random.seed(25)

df = pd.DataFrame(np.random.rand(10, 3), columns =[‘A’, ‘B’, ‘C’])

# np.random.rand(10, 3) has generated a

# random 2-Dimensional array of shape 10 * 3

# which is then converted to a dataframe

df

Output :

Output

84) How can we select a column in pandas DataFrame?

Ans: 

Python Pandas is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python Pandas packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Let’s discuss all different ways of selecting multiple columns in a pandas DataFrame.

Method #1: Basic Method

Given a dictionary which contains Employee entity as keys and list of those entity as values.

# Import pandas package

import pandas as pd

# Explain what is a dictionary containing employee data

data = {‘Name’:[‘Jai’, ‘Princi’, ‘Gaurav’, ‘Anuj’],

        ‘Age’:[27, 24, 22, 32],

        ‘Address’:[‘Delhi’, ‘Kanpur’, ‘Allahabad’, ‘Kannauj’],

        ‘Qualification’:[‘Msc’, ‘MA’, ‘MCA’, ‘Phd’]}

# Convert the dictionary into DataFrame

df = pd.DataFrame(data)

# select two columns

df[[‘Name’, ‘Qualification’]]

Output:

Select Second to fourth column.

# Import pandas package

import pandas as pd

# Explain what is a dictionary containing employee data

data = {‘Name’:[‘Jai’, ‘Princi’, ‘Gaurav’, ‘Anuj’],

        ‘Age’:[27, 24, 22, 32],

        ‘Address’:[‘Delhi’, ‘Kanpur’, ‘Allahabad’, ‘Kannauj’],

        ‘Qualification’:[‘Msc’, ‘MA’, ‘MCA’, ‘Phd’]}

# Convert the dictionary into DataFrame

df = pd.DataFrame(data)

# select all rows

# and second to fourth column

df[df.columns[1:4]]

Output :

85) How can we retrieve a row in pandas DataFrame ?

Ans:  Pandas provide a unique method to retrieve rows from a Data frame. DataFrame.loc[] method is a method that takes only index labels and returns row or dataframe if the index label exists in the caller data frame.

Syntax: pandas.DataFrame.loc[ ]

Parameters:

Index label: String or list of string of index label of rows

Return type: Data frame or Series depending on parameters

Example #1 : Extracting single Row

In this example, Name column is made as the index column and then two single rows are extracted one by one in the form of series using index label of rows.

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv(“nba.csv”, index_col =”Name”)

# retrieving row by loc method

first = data.loc[“Avery Bradley”]

second = data.loc[“R.J. Hunter”]

print(first, “\n\n\n”, second)

Output:

As shown in the output image, two series were returned since there was only one parameter both of the times.

Example #2: Multiple parameters

In this example, Name column is made as the index column and then two single rows are extracted at the same time by passing a list as parameter.

# importing pandas package

import pandas as pd

# making data frame from csv file

data = pd.read_csv(“nba.csv”, index_col =”Name”)

# retrieving rows by loc method

rows = data.loc[[“Avery Bradley”, “R.J. Hunter”]]

# checking data type of rows

print(type(rows))

# display

rows

Output:

As shown in the output image, this time the data type of returned value is a data frame. Both of the rows were extracted and displayed like a new data frame.

86) How to get the items of series A not present in series B?

We can remove items present in p2 from p1 using isin() method.

import pandas as pd 

p1 = pd.Series([2, 4, 6, 8, 10]) 

p2 = pd.Series([8, 10, 12, 14, 16]) 

p1[~p1.isin(p2)] 

Solution

0    2

1    4

2    6

dtype: int64

87) How to get the items not common to both series A and series B?

We get all the items of p1 and p2 not common to both using below example:

import pandas as pd 

import numpy as np 

p1 = pd.Series([2, 4, 6, 8, 10]) 

p2 = pd.Series([8, 10, 12, 14, 16]) 

p1[~p1.isin(p2)] 

p_u = pd.Series(np.union1d(p1, p2))  # union 

p_i = pd.Series(np.intersect1d(p1, p2))  # intersect 

p_u[~p_u.isin(p_i)] 

Output:

0     2

1     4

2     6

5    12

6    14

7    16

dtype: int64

88) How to get the minimum, 25th percentile, median, 75th, and max of a numeric series?

We can compute the minimum, 25th percentile, median, 75th, and maximum of p as below example:

import pandas as pd 

import numpy as np 

p = pd.Series(np.random.normal(14, 6, 22)) 

state = np.random.RandomState(120) 

p = pd.Series(state.normal(14, 6, 22)) 

np.percentile(p, q=[0, 25, 50, 75, 100]) 

Output:

array([ 4.61498692, 12.15572753, 14.67780756, 17.58054104, 33.24975515])

89)  How to get frequency counts of unique items of a series?

We can calculate the frequency counts of each unique value p as below example:

import pandas as pd 

import numpy as np 

p= pd.Series(np.take(list(‘pqrstu’), np.random.randint(6, size=17))) 

p = pd.Series(np.take(list(‘pqrstu’), np.random.randint(6, size=17))) 

p.value_counts() 

Output:

s    4

r    4

q    3

p    3

u    3

90) How to convert a numpy array to a dataframe of given shape?

We can reshape the series p into a dataframe with 6 rows and 2 columns as below example:

import pandas as pd 

import numpy as np 

p = pd.Series(np.random.randint(1, 7, 35)) 

# Input 

p = pd.Series(np.random.randint(1, 7, 35)) 

info = pd.DataFrame(p.values.reshape(7,5)) 

print(info) 

Output:

0  1  2  3  4

0  3  2  5  5  1

1  3  2  5  5  5

2  1  3  1  2  6

3  1  1  1  2  2

4  3  5  3  3  3

5  2  5  3  6  4

6  3  6  6  6  5

91) How will you convert a DataFrame to an array in pandas?

Ans:

For performing some high-level mathematical functions, we can convert Pandas DataFrame to numpy arrays. It uses the DataFrame.to_numpy() function.

The DataFrame.to_numpy() function is applied on the DataFrame that returns the numpy ndarray.

Syntax:

DataFrame.to_numpy(dtype=None, copy=False)

Parameters

dtype: It is an optional parameter that pass the dtype to numpy.asarray().

copy: It returns the boolean value that has the default value False.

It ensures that the returned value is not a view on another array.

Returns

It returns the numpy.ndarray as an output.

Example1:

import pandas as pd

pd.DataFrame({“P”: [2, 3], “Q”: [4, 5]}).to_numpy()

info = pd.DataFrame({“P”: [2, 3], “Q”: [4.0, 5.8]})

info.to_numpy()

info[‘R’] = pd.date_range(‘2000’, periods=2)

info.to_numpy()

Output :

array([[2, 4.0, Timestamp(‘2000-01-01 00:00:00’)],

[3, 5.8, Timestamp(‘2000-01-02 00:00:00’)]], dtype=object)

Example 2:

import pandas as pd

#initializing the dataframe

info = pd.DataFrame([[17, 62, 35],[25, 36, 54],[42, 20, 15],[48, 62, 76]],

columns=[‘x’, ‘y’, ‘z’])

print(‘DataFrame\n———-\n’, info)

#convert the dataframe to a numpy array

arr = info.to_numpy()

print(‘\nNumpy Array\n———-\n’, arr)

Output:

DataFrame

———-

x    y   z

0  17  62  35

1  25  36  54

2  42  20  15

3  48  62  76

Numpy Array

———-

[[17 62 35]

[25 36 54]

[42 20 15]

[48 62 76]]

92) How can you check if a DataFrame is empty in pandas?

Ans : Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. This is the primary data structure of the Pandas.

Pandas DataFrame.empty attribute checks if the dataframe is empty or not. It return True if the dataframe is empty else it return False.

Syntax: DataFrame.empty

Parameter : None

Returns : bool

Example #1: Use DataFrame.empty attribute to check if the given dataframe is empty or not

# importing pandas as pd

import pandas as pd

# Creating the DataFrame

df = pd.DataFrame({‘Weight’:[45, 88, 56, 15, 71],

                   ‘Name’:[‘Sam’, ‘Andrea’, ‘Alex’, ‘Robin’, ‘Kia’],

                   ‘Age’:[14, 25, 55, 8, 21]})

# Create the index

index_ = [‘Row_1’, ‘Row_2’, ‘Row_3’, ‘Row_4’, ‘Row_5’]

# Set the index

df.index = index_

# Print the DataFrame

print(df)

Output :

Now we will use DataFrame.empty attribute to check if the given dataframe is empty or not.

# check if there is any element

# in the given dataframe or not

result = df.empty

# Print the result

print(result)

Output :

As we can see in the output, the DataFrame.empty attribute has returned False indicating that the given dataframe is not empty.

Example #2: Use DataFrame.empty attribute to check if the given dataframe is empty or not.

# importing pandas as pd

import pandas as pd

# Creating an empty DataFrame

df = pd.DataFrame(index = [‘Row_1’, ‘Row_2’, ‘Row_3’, ‘Row_4’, ‘Row_5’])

# Print the DataFrame

print(df)

Output :

Now we will use DataFrame.empty attribute to check if the given dataframe is empty or not.

# check if there is any element

# in the given dataframe or not

result = df.empty

# Print the result

print(result)

Output :

As we can see in the output, the DataFrame.empty attribute has returned True indicating that the given dataframe is empty.

93) How can you get the sum of values of a column in pandas DataFrame?

Ans: Pandas dataframe.sum() function return the sum of the values for the requested axis. If the input is index axis then it adds all the values in a column and repeats the same for all the columns and returns a series containing the sum of all the values in each column. It also provides support to skip the missing values in the dataframe while calculating the sum in the dataframe.

Syntax: DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

Parameters :

axis : {index (0), columns (1)}

skipna : Exclude NA/null values when computing the result.

level : If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

min_count : The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

Returns : sum : Series or DataFrame (if level specified)

Example #1: Use sum() function to find the sum of all the values over the index axis.

# importing pandas as pd

import pandas as pd

# Creating the dataframe

df = pd.read_csv(“nba.csv”)

# Print the dataframe

df

Now find the sum of all values along the index axis. We are going to skip the NaN values in the calculation of the sum.

# finding sum over index axis

# By default the axis is set to 0

df.sum(axis = 0, skipna = True)

Output:

94) Explain what is the Pandas/Python pandas?

Pandas is defined as an open-source library that provides high-performance data manipulation in Python Pandas. The name of Pandas is derived from the word Panel Data, which means an Econometrics from Multidimensional data. It can be used for data analysis in Python Pandas and developed by Wes McKinney in 2008. It can perform five significant steps that are required for processing and analysis of data irrespective of the origin of the data, i.e., load, manipulate, prepare, model, and analyze.

95) Mention the different types of Data Structures in Pandas?

Pandas provide two data structures, which are supported by the pandas library, Series, and DataFrames. Both of these data structures are built on top of the NumPy.

96) How will you get the average of values of a column in pandas DataFrame?

Ans:

Pandas dataframe.mean() function return the mean of the values for the requested axis. If the method is applied on a pandas series object, then the method returns a scalar value which is the mean value of all the observations in the dataframe. If the method is applied on a pandas dataframe object, then the method returns a pandas series object which contains the mean of the values over the specified axis.

Syntax: DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Parameters :

axis : {index (0), columns (1)}

skipna : Exclude NA/null values when computing the result

level : If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series

numeric_only : Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Returns : mean : Series or DataFrame (if level specified)

Example : Use mean() function to find the mean of all the observations over the index axis.

# importing pandas as pd

import pandas as pd

# Creating the dataframe

df = pd.DataFrame({“A”:[12, 4, 5, 44, 1],

                   “B”:[5, 2, 54, 3, 2],

                   “C”:[20, 16, 7, 3, 8],

                   “D”:[14, 3, 17, 2, 6]})

# Print the dataframe

df

Let’s use the dataframe.mean() function to find the mean over the index axis.

# Even if we do not specify axis = 0,

# the method will return the mean over

# the index axis by default

df.mean(axis = 0)

Output:

97) How will you apply a function to every data element in a DataFrame?

Ans:

One can use apply() function in order to apply function to every row in given dataframe. Let’s see the ways we can do this task.

Example

# Import pandas package

import pandas as pd

# Function to add

def add(a, b, c):

    return a + b + c

def main():

    # create a dictionary with

    # three fields each

    data = {

            ‘A’:[1, 2, 3],

            ‘B’:[4, 5, 6],

            ‘C’:[7, 8, 9] }

    # Convert the dictionary into DataFrame

    df = pd.DataFrame(data)

    print(“Original DataFrame:\n”, df)

    df[‘add’] = df.apply(lambda row : add(row[‘A’],

                     row[‘B’], row[‘C’]), axis = 1)

    print(‘\nAfter Applying Function: ‘)

    # printing the new dataframe

    print(df)

if __name__ == ‘__main__’:

    main()

Output:

98) How will you get the top 2 rows from a DataFrame in pandas?

# Select the first 2 rows of the Dataframe

dfObj1 = empDfObj.head(2)

print(“First 2 rows of the Dataframe : “)

print(dfObj1)

Output:

First 2 rows of the Dataframe :

Name Age City Experience

a jack 34 Sydney 5

b Riti 31 Delhi 7

99) List major features of the Python pandas?

Ans:

Some of the major features of Python Pandas are,

Fast and efficient in handling the data with its DataFrame object.

It provides tools for loading data into in-memory data objects from various file formats.

It has high-performance in merging and joining data.

It has Time Series functionality.

It provides functions for Data set merging and joining.

It has functionalities for label-based slicing, fancy indexing, and subsetting of large data sets.

It provides functionalities for reshaping and pivoting of data sets.

100) Enlist different types of Data Structures available in Pandas?

Ans: Different types of data structures available in Pandas are,

Series – It is immutable in size and homogeneous one-dimensional array data structure.

DataFrame – It is a tabular data structure which comprises of rows and columns. Here, data and size are mutable.

Panel – It is a three-dimensional data structure to store the data heterogeneously.

101) How To Write a Pandas DataFrame to a File

When you have done your data munging and manipulation with Pandas, you might want to export the DataFrame to another format. This section will cover two ways of outputting your DataFrame: to a CSV or to an Excel file.

Outputting a DataFrame to CSV

To output a Pandas DataFrame as a CSV file, you can use to_csv().

Writing a DataFrame to Excel

Very similar to what you did to output your DataFrame to CSV, you can use to_excel() to write your table to Excel.

102) When, Why And How You Should Reshape Your Pandas DataFrame

Ans: Reshaping your DataFrame is basically transforming it so that the resulting structure makes it more suitable for your data analysis.

In other words, reshaping is not so much concerned with formatting the values that are contained within the DataFrame, but more about transforming the shape of it.

This answers the when and why. Now onto the how of reshaping your DataFrame.

There are three ways of reshaping that frequently raise questions with users: pivoting, stacking and unstacking and melting.

Keep on reading to find out more!

Remember that if you want to see code examples and want to practice your DataFrame skills in our interactive DataCamp environment, go here.

Pivoting Your DataFrame

You can use the pivot() function to create a new derived table out of your original one. When you use the function, you can pass three arguments:

Values: this argument allows you to specify which values of your original DataFrame you want to see in your pivot table.

Columns: whatever you pass to this argument will become a column in your resulting table.

Index: whatever you pass to this argument will become an index in your resulting table.

When you don’t specifically fill in what values you expect to be present in your resulting table, you will pivot by multiple columns. Note that your data can not have rows with duplicate values for the columns that you specify. If this is not the case, you will get an error message. If you can’t ensure the uniqueness of your data, you will want to use the pivot_table method instead .

Using stack() and unstack() to Reshape Your Pandas DataFrame

You have already seen an example of stacking in the answer to question 5!

Good news, you already know why you would use this and what you need to do to do it.

To repeat, when you stack a DataFrame, you make it taller. You move the innermost column index to become the innermost row index. You return a DataFrame with an index with a new inner-most level of row labels.

Go back to the full walk-through of the answer to question 5 “Splitting Text Into Multiple Columns” if you’re unsure of the workings of `stack().

The inverse of stacking is called unstacking. Much like stack(), you use unstack() to move the innermost row index to become the innermost column index.

Reshaping Your DataFrame With Melt()

Melting is considered to be very useful for when you have a data that has one or more columns that are identifier variables, while all other columns are considered measured variables.

These measured variables are all “unpivoted” to the row axis. That is, while the measured variables that were spread out over the width of the DataFrame, the melt will make sure that they will be placed in the height of it. Or, yet in other words, your DataFrame will now become longer instead of wider.

As a result, you just have two non-identifier columns, namely, ‘variable’ and ‘value’.

103) Does Pandas Recognize Dates When Importing Data?

Ans:

Pandas can recognize it, but you need to help it a tiny bit: add the argument parse_dates when you’reading in data from, let’s say, a comma-separated value (CSV) file.

There are, however, always weird date-time formats.

(Honestly, who has never had this?)

In such cases, you can construct your own parser to deal with this. You could, for example, make a lambda function that takes your DateTime and controls it with a format string.

104) How To Format The Data in Your Pandas DataFrame?

Ans:

Most of the times, you will also want to be able to do some operations on the actual values that are in your DataFrame.

Keep on reading to find out what the most common Pandas questions are when it comes to formatting your DataFrame’s values!

Replacing All Occurrences of a String in a DataFrame

To replace certain Strings in your DataFrame, you can easily use replace(): pass the values that you would like to change, followed by the values you want to replace them by.

Note that there is also a regex argument that can help you out tremendously when you’re faced with strange string combinations. In short, replace() is mostly what you need to deal with when you want to replace values or strings in your DataFrame by others.

Removing Parts From Strings in the Cells of Your DataFrame

Removing unwanted parts of strings is cumbersome work. Luckily, there is a solution in place! You use map() on the column result to apply the lambda function over each element or element-wise of the column. The function in itself takes the string value and strips the + or — that’s located on the left, and also strips away any of the six aAbBcC on the right.

Splitting Text in a Column into Multiple Rows in a DataFrame

Splitting your text into multiple rows is quite complex. For a complete walkthrough, go here.

Applying A Function to Your Pandas DataFrame’s Columns or Rows

You might want to adjust the data in your DataFrame by applying a function to it. Go to this page for the code chunks that explain how to apply a function to a DataFrame.

105) How To Add an Index, Row or Column to a Pandas DataFrame?

Ans: Now that you have learned how to select a value from a DataFrame, it’s time to get to the real work and add an index, row or column to it!

Adding an Index to a DataFrame

When you create a DataFrame, you have the option to add input to the ‘index’ argument to make sure that you have the index that you desire. When you don’t specify this, your DataFrame will have, by default, a numerically valued index that starts with 0 and continues until the last row of your DataFrame.

However, even when your index is specified for you automatically, you still have the power to re-use one of your columns and make it your index. You can easily do this by calling set_index() on your DataFrame.

Adding Rows to a DataFrame

Before you can get to the solution, it’s first a good idea to grasp the concept of loc and how it differs from other indexing attributes such as .iloc and .ix:

loc works on labels of your index. This means that if you give in loc[2], you look for the values of your DataFrame that have an index labeled 2.

iloc works on the positions in your index. This means that if you give in iloc[2], you look for the values of your DataFrame that are at index ’2`.

ix is a more complex case: when the index is integer-based, you pass a label to ix. ix[2] then means that you’re looking in your DataFrame for values that have an index labeled 2. This is just like loc! However, if your index is not solely integer-based, ix will work with positions, just like iloc.

Now that the difference between iloc, loc and ix is clear, you are ready to give adding rows to your DataFrame a go!

As a consequence of what has just been explained, you understand that the general recommendation is that you use .loc to insert rows in your DataFrame.

If you would use df.ix[], you might try to reference a numerically valued index with the index value and accidentally overwrite an existing row of your DataFrame.

You better avoid this!

Adding a Column to Your DataFrame

In some cases, you want to make your index part of your DataFrame. You can easily do this by taking a column from your DataFrame or by referring to a column that you haven’t made yet and assigning it to the .index property.

However, if you want to append columns to your DataFrame, you could also follow the same approach as adding an index to your DataFrame: you use loc or iloc.

Note that the observation that was made earlier about loc still stays valid also for when you’re adding columns to your DataFrame!

Resetting the Index of Your DataFrame

When your index doesn’t look entirely the way you want it to, you can opt to reset it. This can easily ben done with .reset_index().

106) How To Select an Index or Column From a Python Pandas DataFrame?

Before you start with adding, deleting and renaming the components of your DataFrame, you first need to know how you can select these elements.

So, how do you do this?

Well, in essence, selecting an index, column or value from your DataFrame isn’t that hard. It’s really very similar to what you see in other languages that are used for data analysis (and which you might already know!).

Let’s take R for example. You use the [,] notation to access the data frame’s values. In Pandas DataFrames, this is not too much different: the most important constructions to use are, without a doubt, loc and iloc. The subtle differences between these two will be discussed in the next sections. For now, it suffices to know that you can either access the values by calling them by their label or by their position in the index or column.

107) How will you get the average of values of a column in pandas DataFrame?

Ans;

Steps to get the Average for each Column and Row in Pandas DataFrame

Step 1: Gather the data

To start, gather the data that needs to be averaged.

For example, I gathered the following data about the commission earned by 3 employees (over the first 6 months of the year):

Excel Table

The goal is to get the average of the commission earned:

For each employee over the first 6 months (average by column)

For each month across all employees (average by row)

Step 2: Create the DataFrame

Next, create the DataFrame in order to capture the above data in Python:

import pandas as pd

data = {‘Month’: [‘Jan ‘,’Feb ‘,’Mar ‘,’Apr ‘,’May ‘,’Jun ‘],

‘Jon Commission’: [7000,5500,6000,4500,8000,6000],

‘Maria Commission’: [10000,7500,6500,6000,9000,8500],

‘Olivia Commission’: [3000,6000,4500,4500,4000,5500],

}

df = pd.DataFrame(data,columns=[‘Month’,’Jon Commission’,’Maria Commission’,’Olivia Commission’])

print (df)

Run the code in Python, and you’ll get the following DataFrame:

Pandas DataFrame

Step 3: Get the Average for each Column and Row in Pandas DataFrame

You can then apply the following syntax to get the average for each column:

df.mean(axis=0)

For our example, this is the complete Python Pandas code to get the average commission earned for each employee over the 6 first months (average by column):

import pandas as pd

data = {‘Month’: [‘Jan ‘,’Feb ‘,’Mar ‘,’Apr ‘,’May ‘,’Jun ‘],

‘Jon Commission’: [7000,5500,6000,4500,8000,6000],

‘Maria Commission’: [10000,7500,6500,6000,9000,8500],

‘Olivia Commission’: [3000,6000,4500,4500,4000,5500]

}

df = pd.DataFrame(data,columns=[‘Month’,’Jon Commission’,’Maria Commission’,’Olivia Commission’])

av_column = df.mean(axis=0)

print (av_column)

Run the code, and you’ll get the average commission per employee:

Average of each Column and Row in Pandas DataFrame

Alternatively, you can get the average for each row using the following syntax:

df.mean(axis=1)

Here is the code that you can use to get the average commission earned for each month across all employees (average by row):

import pandas as pd

data = {‘Month’: [‘Jan ‘,’Feb ‘,’Mar ‘,’Apr ‘,’May ‘,’Jun ‘],

‘Jon Commission’: [7000,5500,6000,4500,8000,6000],

‘Maria Commission’: [10000,7500,6500,6000,9000,8500],

‘Olivia Commission’: [3000,6000,4500,4500,4000,5500],

}

df = pd.DataFrame(data,columns=[‘Month’,’Jon Commission’,’Maria Commission’,’Olivia Commission’], index =[‘Jan ‘,’Feb ‘,’Mar ‘,’Apr ‘,’May ‘,’Jun ‘])

av_row = df.mean(axis=1)

print (av_row)

Once you run the code in Python, you’ll get the average commission earned per month:

Get the Average of each Column and Row in Pandas DataFrame

You may also want to check the following source that explains the steps to get the sum for each column and row in pandas DataFrame.

108) How to Apply function to every row in a Python Pandas DataFrame?

Ans: Python Pandas is a great language for performing data analysis tasks. It provides with a huge amount of Classes and function which help in analyzing and manipulating data in an easier way.

One can use apply() function in order to apply function to every row in given dataframe. Let’s see the ways we can do this task.

Example

# Import pandas package

import pandas as pd

# Function to add

def add(a, b, c):

    return a + b + c

def main():

    # create a dictionary with

    # three fields each

    data = {

            ‘A’:[1, 2, 3],

            ‘B’:[4, 5, 6],

            ‘C’:[7, 8, 9] }

    # Convert the dictionary into DataFrame

    df = pd.DataFrame(data)

    print(“Original DataFrame:\n”, df)

    df[‘add’] = df.apply(lambda row : add(row[‘A’],

                     row[‘B’], row[‘C’]), axis = 1)

    print(‘\nAfter Applying Function: ‘)

    # printing the new dataframe

    print(df)

if __name__ == ‘__main__’:

    main()

Output:

Leave a Reply