This page uses Pyodide. That means all the Python code you see here is
executed in your browser using WebAssembly. You can run the code
snippets by clicking the "Run Code" button. If you want to run all the
code snippets at once, click the "Run All" button.
“We owe a lot to the Indians, who taught us how to count, without which no
worthwhile scientific discovery could have been made.”
As discussed in the previous chapter, Pandas is a well-established Python library used for the manipulation, processing and analysis of data. We have already discussed the basic operations on Series and DataFrame, such as creating them and accessing data from them. Pandas provides many more powerful and useful functions for data analysis.
In this chapter, we will be working with more advanced features of DataFrame
like sorting data, answering analytical questions using the data, cleaning data
and applying different useful functions on the data. Below is the example data
on which we will be applying the advanced features of Pandas.
Case Study
Let us consider the data of marks scored in unit tests held in school. For each unit test, the marks scored by all students of the class are recorded. Maximum marks are 25 in each subject. The subjects are Maths, Science, Social Studies (S.St.), Hindi, and English. For simplicity, we assume there are 4 students in the class, and the table below shows their marks in Unit Test 1, Unit Test 2 and Unit Test 3. Table 3.1 shows this data.
Table 3.1 Case Study (Result)

| Name | Unit Test | Maths | Science | S.St. | Hindi | Eng |
| --- | --- | --- | --- | --- | --- | --- |
| Raman | 1 | 22 | 21 | 18 | 20 | 21 |
| Raman | 2 | 21 | 20 | 17 | 22 | 24 |
| Raman | 3 | 14 | 19 | 15 | 24 | 23 |
| Zuhaire | 1 | 20 | 17 | 22 | 24 | 19 |
| Zuhaire | 2 | 23 | 15 | 21 | 25 | 15 |
| Zuhaire | 3 | 22 | 18 | 19 | 23 | 13 |
| Aashravy | 1 | 23 | 19 | 20 | 15 | 22 |
| Aashravy | 2 | 24 | 22 | 24 | 17 | 21 |
| Aashravy | 3 | 12 | 25 | 19 | 21 | 23 |
| Mishti | 1 | 15 | 22 | 25 | 22 | 22 |
| Mishti | 2 | 18 | 21 | 25 | 24 | 23 |
| Mishti | 3 | 17 | 18 | 20 | 25 | 20 |
Let us store the data in a DataFrame, as shown in Program 3.1:
Program 3-1
Store the Result data in a DataFrame called marksUT.
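One way to build this DataFrame from Table 3.1 is sketched below. The text names it marksUT, while the following sections refer to it as df, so the sketch binds both names. The column S.St is written without the trailing dot, matching the column names used in the later statements.
import pandas as pd

marksUT = pd.DataFrame({
    "Name": ["Raman"] * 3 + ["Zuhaire"] * 3 + ["Aashravy"] * 3 + ["Mishti"] * 3,
    "UT": [1, 2, 3] * 4,
    "Maths": [22, 21, 14, 20, 23, 22, 23, 24, 12, 15, 18, 17],
    "Science": [21, 20, 19, 17, 15, 18, 19, 22, 25, 22, 21, 18],
    "S.St": [18, 17, 15, 22, 21, 19, 20, 24, 19, 25, 25, 20],
    "Hindi": [20, 22, 24, 24, 25, 23, 15, 17, 21, 22, 24, 25],
    "Eng": [21, 24, 23, 19, 15, 13, 22, 21, 23, 22, 23, 20],
})
df = marksUT  # later sections use the name df
print(df)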
3.2 Descriptive Statistics
Descriptive statistics are used to summarise the given data. In other words, they refer to the methods used to get some basic idea about the data. In this section, we will discuss descriptive statistical methods that can be applied to a DataFrame: max, min, count, sum, mean, median, mode, quartiles and variance. In each case, we will use the DataFrame df created above.
3.2.1 Calculating Maximum Values
DataFrame.max() is used to calculate the maximum values from the DataFrame,
regardless of its data types. The following statement outputs the maximum value
of each column of the DataFrame:
print(df.max())
If we want to output maximum value for the columns having only numeric values,
then we can set the parameter numeric_only=True in the max() method, as shown
below:
print(df.max(numeric_only=True))
Program 3-2
Write the statements to output the maximum marks obtained in each subject in
Unit Test 2.
dfUT2 = df[df.UT == 2]
print("Result of Unit Test 2:\n")
print(dfUT2)
print("Maximum Marks obtained in Each Subject in Unit Test 2:\n")
print(dfUT2.max(numeric_only=True))
By default, the max() method finds the maximum value of each column (which
means, axis=0). However, to find the maximum value of each row, we have to
specify axis=1 as its argument.
# maximum marks for each student in each unit test among all the subjects
df.max(axis=1, numeric_only=True)
3.2.2 Calculating Minimum Values
DataFrame.min() is used to display the minimum values from the DataFrame, regardless of the data types. That is, it shows the minimum value of each column or row. The following line of code outputs the minimum value of each column of the DataFrame:
print(df.min())
Program 3-3
Write the statements to display the minimum marks obtained by a particular
student ‘Mishti’ in all the unit tests for each subject.
dfMishti = df.loc[df.Name == "Mishti"]
print("Marks obtained by Mishti in all the Unit Tests:\n")
print(dfMishti)
print("Minimum Marks obtained by Mishti in each subject across the unit tests:\n")
print(dfMishti[["Maths", "Science", "S.St", "Hindi", "Eng"]].min())
3.2.3 Calculating Sum of Values
DataFrame.sum() will display the sum of the values from the DataFrame regardless
of its datatype. The following line of code outputs the sum of each column of
the DataFrame:
print(df.sum())
We may not be interested in summing text values. To print the sum of a particular column, we specify the column name in the call to sum(). The following statement prints the total marks in Mathematics:
print(df["Maths"].sum())
print(df["Maths"].sum())
To calculate total marks of a particular student, the name of the student needs
to be specified.
Program 3-4
Write the python statement to print the total marks secured by Raman in each unit test.
dfRaman = df[df["Name"] == "Raman"]
print("Marks obtained by Raman in each test are:\n")
print(dfRaman)
# To print total marks scored by Raman in all subjects in each Unit Test
dfRaman[["Maths", "Science", "S.St", "Hindi", "Eng"]].sum(axis=1)
Activity 3.1
Write the python statements to print the sum of the English marks scored by Mishti.
Think and Reflect
Can you write a shortened code to get the output of Program 3.4?
3.2.4 Calculating Number of Values
DataFrame.count() will display the total number of values for each column or row of a DataFrame. To count along the rows, we need to use the argument axis=1, as shown in Program 3.5 below.
print(df.count())
Program 3-5
Write a statement to count the number of values in a row.
df.count(axis=1)
3.2.5 Calculating Mean
DataFrame.mean() will display the mean (average) of the values of each column of
a DataFrame. It is only applicable for numeric values.
df.mean(numeric_only=True)
Program 3-6
Write the statements to get an average of marks obtained by Zuhaire in all the
Unit Tests.
dfZuhaire = df[df.Name == "Zuhaire"]
dfZuhaireMarks = dfZuhaire.loc[:, "Maths":"Eng"]
print("Slicing of the DataFrame to get only the marks:\n")
print(dfZuhaireMarks)
print("Average of marks obtained by Zuhaire in all Unit Tests:\n")
print(dfZuhaireMarks.mean(axis=1))
In the above output, 20.4 is the average of marks obtained by Zuhaire in Unit Test 1. Similarly, 19.8 and 19.0 are the averages of marks in Unit Tests 2 and 3, respectively.
Think and Reflect
Try to write a short code to get the above output. Remember to print the
relevant headings of the output.
3.2.6 Calculating Median
DataFrame.median() will display the middle value of the data. This function gives the median of the values of each column of a DataFrame. It is only applicable for numeric values.
print(df.median(numeric_only=True))
Program 3-7
Write the statements to print the median marks of mathematics in UT1.
dfMaths = df["Maths"]
dfMathsUT1 = dfMaths[df.UT == 1]
print("Displaying the marks scored in Mathematics in UT1:\n")
print(dfMathsUT1)
dfMathMedian = dfMathsUT1.median()
print("Displaying the median of Mathematics in UT1:\n")
print(dfMathMedian)
Here, the number of values is even, so there are two middle values, i.e., 20 and 22. Hence, the median is their average, 21.
Activity 3.2
Find the median of the values of the rows of the DataFrame.
3.2.7 Calculating Mode
DataFrame.mode() will display the mode. The mode is defined as the value that appears most often in the data. This function gives the mode of each column or row of the DataFrame. To get the mode of Hindi marks, the following statements can be used.
df["Hindi"]
df["Hindi"]
df["Hindi"].mode()
df["Hindi"].mode()
Note that 24 marks in Hindi occur three times, while 25 and 22 occur twice each, and 23, 21, 20, 17 and 15 occur once each. Hence, the mode is 24.
Activity 3.3
Calculate the mode of marks scored in Maths.
3.2.8 Calculating Quartile
DataFrame.quantile() is used to get the quartiles. Quartiles divide an ordered dataset into four parts: the first quartile is the 25th percentile (parameter q=0.25), the second quartile is the 50th percentile (the median), and the third quartile is the 75th percentile (parameter q=0.75). By default, it displays the second quartile (median) of all numeric values.
# by default, median is the output
df.quantile(numeric_only=True)
df.quantile(q=0.25, numeric_only=True)
df.quantile(q=0.75, numeric_only=True)
Program 3-8
Write the statement to display the first and third quartiles of all subjects.
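One possible answer, passing both quartile levels to quantile() in a single call and selecting only the subject columns:
print(df[["Maths", "Science", "S.St", "Hindi", "Eng"]].quantile(q=[0.25, 0.75]))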
3.2.9 DataFrame.describe() Function
The DataFrame.describe() function displays the descriptive statistical values in a single command. These values help us describe a set of data in a DataFrame.
df.describe()
3.3 Data Aggregations
Aggregation means transforming the dataset to produce a single summary value from an array of values. Aggregation can be applied to one or more columns together. Commonly used aggregate functions are max(), min(), sum(), count(), std() and var().
df.aggregate("max")
df.aggregate("max")
# To use multiple aggregate functions in a single statement
df.aggregate(["max", "count"])
df["Maths"].aggregate(["max", "min"])
df["Maths"].aggregate(["max", "min"])
# Using the above statement with axis=0 gives the same result
df["Maths"].aggregate(["max", "min"], axis=0)
# Total marks of Maths and Science obtained by each student.
# Use sum() with axis=1 (row-wise summation)
df[["Maths", "Science"]].aggregate("sum", axis=1)
3.4 Sorting a DataFrame
Sorting refers to arranging data elements in a specified order, which can be either ascending or descending. Pandas provides the sort_values() function to sort the data values of a DataFrame. The syntax of the function is as follows:
DataFrame.sort_values(by, axis=0, ascending=True)
Here, a column list (by), an axis argument (0 for rows and 1 for columns) and the order of sorting (ascending=True or False) are passed as arguments. By default, sorting is done along rows (axis=0) in ascending order.
Consider a scenario, where the teacher is interested in arranging a list
according to the names of the students or according to marks obtained in a
particular subject. In such cases, sorting can be used to obtain the desired
results. Following is the python code for sorting the data in the DataFrame created in Program 3.1.
To sort the entire data on the basis of attribute ‘Name’, we use the following
command:
# By default, sorting is done in ascending order.
print(df.sort_values(by=["Name"]))
Now, to obtain sorted list of marks scored by all students in Science in Unit
Test 2, the following code can be used:
# Get the data corresponding to Unit Test 2
dfUT2 = df[df.UT == 2]
# Sort according to ascending order of marks in Science
print(dfUT2.sort_values(by=["Science"]))
Program 3-9
Write the statement which will sort the marks in English in the DataFrame df
based on Unit Test 3, in descending order.
# Get the data corresponding to Unit Test 3
dfUT3 = df[df.UT == 3]
# Sort according to descending order of marks in English
print(dfUT3.sort_values(by=["Eng"], ascending=False))
A DataFrame can be sorted based on multiple columns. Following is the code of
sorting the DataFrame df based on marks in Science in Unit Test 3 in ascending
order. If marks in Science are the same, then sorting will be done on the basis
of marks in Hindi.
# Get the data corresponding to marks in Unit Test 3
dfUT3 = df[df.UT == 3]
# Sort the data according to Science and then according to Hindi
print(dfUT3.sort_values(by=["Science", "Hindi"]))
Here, we can see that the list is sorted on the basis of marks in Science. Two
students namely, Zuhaire and Mishti have equal marks (18) in Science. Therefore
for them, sorting is done on the basis of marks in Hindi.
3.5 GROUP BY Functions
In pandas, DataFrame.groupby() function is used to split the data into groups
based on some criteria. Pandas objects like a DataFrame can be split on any of
their axes. The groupby function works based on a split-apply-combine strategy
which is shown below using a 3-step process:
1. Split the data into groups by creating a groupby object from the original DataFrame.
2. Apply the required function to each group.
3. Combine the results to form a new DataFrame.
To understand this better, let us consider the data shown in the diagram given
below. Here, we have a two-column DataFrame (key, data). We need to find the
sum of the data column for a particular key, i.e. sum of all the data elements
with key A, B and C, respectively. To do so, we first split the entire DataFrame
into groups by key column. Then, we apply the sum function on the respective
groups. Finally, we combine the results to form a new DataFrame that contains
the desired result.
Figure 3.1: A DataFrame with two columns
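The diagram itself is not reproduced here; the following minimal sketch, with placeholder keys and values, shows the same split-apply-combine idea:
import pandas as pd

# A two-column DataFrame like the one described above
dfKeys = pd.DataFrame({"key": ["A", "B", "C", "A", "B", "C"],
                       "data": [1, 2, 3, 4, 5, 6]})
# Split by key, apply sum to each group, combine into a new DataFrame
print(dfKeys.groupby("key").sum())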
The following statements show how to apply the groupby() function on our DataFrame df created in Program 3.1:
# Create a groupby Name of the student from DataFrame df
g1 = df.groupby("Name")
# Displaying the first entry from each group
g1.first()
# Displaying the size of each group
g1.size()
# Displaying group data, i.e., group_name, row indexes
# corresponding to the group and their data type
g1.groups
# Printing data of a single group
g1.get_group("Raman")
# Grouping with respect to multiple attributes
# Creating a groupby Name and UT
g2 = df.groupby(["Name", "UT"])
g2.first()
The above statements show how we create groups by splitting a DataFrame using groupby(). The next step is to apply functions over the groups just created. This is done using aggregation.
Aggregation is a process in which an aggregate function is applied on each group created by groupby(). It returns a single aggregated statistical value corresponding to each group. It can be used to apply multiple functions over an axis; by default, functions are applied over columns. Aggregation can be performed using the agg() or aggregate() function.
# Calculating average marks scored by all
# students in each subject for each UT
df.groupby(["UT"]).aggregate("mean", numeric_only=True)
# Calculate average marks scored in Maths in each UT
group1 = df.groupby(["UT"])
group1["Maths"].aggregate("mean")
Program 3-10
Write the python statements to print the mean, variance, standard deviation and
quartile of the marks scored in Mathematics by each student across the UTs.
Write the python statements to print average marks in Science by all the
students in each UT.
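One possible solution, using the DataFrame df of Program 3.1:
# Mean, variance and standard deviation of Maths marks of each student
g = df.groupby("Name")
print(g["Maths"].aggregate(["mean", "var", "std"]))
# Quartiles of Maths marks of each student
print(g["Maths"].quantile([0.25, 0.5, 0.75]))
# Average marks in Science scored by all the students in each UT
print(df.groupby("UT")["Science"].aggregate("mean"))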
3.6 Altering the Index
We use indexing to access the elements of a DataFrame. It is used for fast
retrieval of data. By default, a numeric index starting from 0 is created as a
row index, as shown below:
df # With default Index
Here, the integer number in the first column starting from 0 is the index.
However, depending on our requirements, we can select some other column to be
the index or we can add another index column.
When we slice the data, we get the original index which is not continuous, e.g.
when we select marks of all students in Unit Test 1, we get the following
result:
dfUT1 = df[df.UT == 1]
print(dfUT1)
Notice that the first column is a non-continuous index, since the rows are a slice of the original data. We can create a new continuous index alongside it using the reset_index() function, as shown below:
dfUT1.reset_index(inplace=True)
print(dfUT1)
A new continuous index is created while the original one is also kept intact. We can drop the original index by using the drop function, as shown below:
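A minimal sketch: after the reset_index() call above, the old index survives as a column named index, which drop() can remove (alternatively, reset_index(drop=True) discards it in one step):
dfUT1 = dfUT1.drop(columns=["index"])
print(dfUT1)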
3.7 More About Pandas
In this section, we will learn more techniques and functions that can be used to manipulate and analyse data in a DataFrame.
3.7.1 Reshaping Data
The way a dataset is arranged into rows and columns is referred to as the shape
of data. Reshaping data refers to the process of changing the shape of the
dataset to make it suitable for some analysis problems. The example given in the section below explains the utility of reshaping data.
For reshaping data, two basic functions are available in Pandas, pivot and
pivot_table. This section covers them in detail.
(A) Pivot
The pivot function is used to reshape and create a new DataFrame from the
original one. Consider the following example of sales and profit data of four
stores: S1, S2, S3 and S4 for the years 2016, 2017 and 2018.
Notice that, without reshaping, we have to slice the data corresponding to a particular store each time we want to answer a query about it. Now, let us reshape the data using pivot and see the difference.
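The original table is not reproduced here, so the sketch below uses placeholder figures; the column names Store, Year, Total_sales(Rs) and Total_profit(Rs) are assumptions. Note that it deliberately has no row for store S2 in 2016:
import pandas as pd

# Illustrative sales and profit data (placeholder values)
dfSales = pd.DataFrame({
    "Store": ["S1", "S3", "S4", "S1", "S2", "S3", "S4", "S1", "S2", "S3", "S4"],
    "Year": [2016, 2016, 2016, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    "Total_sales(Rs)": [12000, 20000, 10000, 30000, 45000, 75000, 11000, 27000, 18000, 11000, 9000],
    "Total_profit(Rs)": [1100, 3200, 1200, 2700, 5600, 7800, 1500, 3400, 2100, 900, 1000],
})
# Reshape: stores as index, years as columns, sales as values
pivot1 = dfSales.pivot(index="Store", columns="Year", values="Total_sales(Rs)")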
Here, index specifies the column that will act as the index of the pivot table, columns specifies the column whose unique values become the new column headers, and values specifies the column whose values will be displayed. In this particular case, store names act as the index, years become the headers for columns, and sales values are displayed as the values of the pivot table.
print(pivot1)
As can be seen above, the value of Total_sales (Rs) for every row in the
original table has been transferred to the new table: pivot1, where each row has
data of a store and each column has data of a year. Those cells in the new pivot
table which do not have a matching entry in the original one are filled with
NaN. For instance, we did not have values corresponding to sales of Store S2 in
2016, thus the appropriate cell in pivot1 is filled with NaN.
Now the python statements for the above queries will be as follows:
What was the total sale of store S1 in all the years?
pivot1.loc["S1"].sum()
What is the maximum sale value by store S3 in any year?
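One possible statement; the row for S3 holds its sales in every year, so max() over that row answers the query:
pivot1.loc["S3"].max()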
We can notice that reshaping has transformed the structure of the data, making it more readable and easier to analyse.
Activity 3.6
Consider the data of unit test marks given in Program 3.1 and write the python statements to print name-wise UT marks in Mathematics.
(B) Pivoting by Multiple Columns
For pivoting by multiple columns, we need to specify multiple column names in
the values parameter of pivot() function. If we omit the values parameter, it
will display the pivoting for all the numeric values.
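A sketch using the sales DataFrame from the previous example, pivoting both numeric columns at once:
# Pivot both sales and profit in a single statement
pivot2 = dfSales.pivot(index="Store", columns="Year",
                       values=["Total_sales(Rs)", "Total_profit(Rs)"])
print(pivot2)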
However, calling pivot() on data that has duplicate rows for the specified columns results in an error: "ValueError: Index contains duplicate entries, cannot reshape". Duplicate data cannot be reshaped using the pivot function. Hence, before calling pivot(), we need to ensure that our data does not have rows with duplicate values for the specified columns. If we cannot ensure this, we may have to use the pivot_table function instead.
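A minimal sketch of such a failure, using hypothetical stock data in which the blue pen appears twice; this DataFrame is reused in the pivot_table examples below:
import pandas as pd

# Illustrative stock data (placeholder values); the blue pen
# appears twice, with prices 50 and 20
dfProd = pd.DataFrame({
    "Item": ["Pen", "Pen", "Pencil", "Pen"],
    "Color": ["Blue", "Blue", "Red", "Black"],
    "Price(Rs)": [50, 20, 10, 45],
    "Units_in_stock": [200, 150, 300, 100],
})
# Two rows share the ("Pen", "Blue") combination, so this raises
# ValueError: Index contains duplicate entries, cannot reshape
dfProd.pivot(index="Item", columns="Color", values="Price(Rs)")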
(C) Pivot Table
It works like a pivot function, but aggregates the values from rows with
duplicate entries for the specified columns. In other words, we can use
aggregate functions like min, max, mean etc, wherever we have duplicate entries.
The default aggregate function is mean.
Syntax:
pandas.pivot_table(
data,
values=None,
index=None,
columns=None,
aggfunc="mean",
)
The parameter aggfunc can have values among sum, max, min, len, np.mean,
np.median. We can apply index to multiple columns if we don’t have any unique
column to act as index.
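Applying pivot_table() to the hypothetical stock data sketched above, the duplicate blue pen rows are aggregated with the default mean:
df1 = pd.pivot_table(dfProd, index=["Item", "Color"], values="Price(Rs)")
# The blue pen's price becomes (50 + 20) / 2 = 35 in df1
print(df1)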
Please note that mean has been used as the default aggregate function. The price of the blue pen in the original data is 50 and 20; with mean as the aggregate, the price of the blue pen becomes 35 in df1.
We can use multiple aggregate functions on the data. The example below shows the use of the sum, max and mean functions.
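A sketch on the same stock data; recent pandas versions prefer the string "mean" over np.mean, and both behave the same here:
print(pd.pivot_table(dfProd, index="Item", values="Price(Rs)",
                     aggfunc=["sum", "max", "mean"]))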
Pivoting can also be done on multiple columns. Further, different aggregate functions can be applied to different columns. The following example demonstrates pivoting on two columns, Price(Rs) and Units_in_stock, applying the len() function to the column Price(Rs) and the mean() function to the column Units_in_stock. Note that the aggregate function len returns the number of rows corresponding to that entry.
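A sketch of such a call on the stock data, passing a dict to aggfunc to apply len to Price(Rs) and mean to Units_in_stock:
print(pd.pivot_table(dfProd, index="Item",
                     values=["Price(Rs)", "Units_in_stock"],
                     aggfunc={"Price(Rs)": len, "Units_in_stock": "mean"}))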
3.8 Handling Missing Values
As we know, a DataFrame can consist of many rows (objects), where each row can have values for various columns (attributes). If a value corresponding to a column is not present, it is considered to be a missing value. A missing value is denoted by NaN.
In real-world datasets, it is common for an object to have some missing attributes. There may be several reasons for that. In some cases, data was not collected properly, resulting in missing data, e.g., some people did not fill in all the fields while taking a survey. Sometimes, some attributes are not relevant to all. For example, if a person is unemployed, then the salary attribute will be irrelevant and hence may not have been filled in.
Missing values create a lot of problems during data analysis and have to be
handled properly. The two most common strategies for handling missing values
explained in this section are:
i) drop the object having missing values,
ii) fill or estimate the missing value
Let us refer to the previous case study given in Table 3.1.
Suppose the students have now appeared for Unit Test 4 also, but Raman could not appear for the Science, Maths and English tests, and suppose there is no possibility of a re-test. Therefore, marks obtained by him in these subjects will be missing. The dataset after Unit Test 4 is shown in Table 3.2. Note that the attributes 'Maths', 'Science' and 'English' have missing values in Unit Test 4 for Raman.
Table 3.2: Case study data after UT4 (Result)

| Name | Unit Test | Maths | Science | S.St. | Hindi | Eng |
| --- | --- | --- | --- | --- | --- | --- |
| Raman | 1 | 22 | 21 | 18 | 20 | 21 |
| Raman | 2 | 21 | 20 | 17 | 22 | 24 |
| Raman | 3 | 14 | 19 | 15 | 24 | 23 |
| Raman | 4 | NaN | NaN | 19 | 18 | NaN |
| Zuhaire | 1 | 20 | 17 | 22 | 24 | 19 |
| Zuhaire | 2 | 23 | 15 | 21 | 25 | 15 |
| Zuhaire | 3 | 22 | 18 | 19 | 23 | 13 |
| Zuhaire | 4 | 19 | 20 | 17 | 19 | 16 |
| Aashravy | 1 | 23 | 19 | 20 | 15 | 22 |
| Aashravy | 2 | 24 | 22 | 24 | 17 | 21 |
| Aashravy | 3 | 12 | 25 | 19 | 21 | 23 |
| Aashravy | 4 | 15 | 20 | 20 | 20 | 17 |
| Mishti | 1 | 15 | 22 | 25 | 22 | 22 |
| Mishti | 2 | 18 | 21 | 25 | 24 | 23 |
| Mishti | 3 | 17 | 18 | 20 | 25 | 20 |
| Mishti | 4 | 14 | 20 | 19 | 20 | 18 |
To calculate the final result, teachers are asked to submit the percentage of marks obtained by all students. In the case of Raman, the Maths teacher decides to compute the marks obtained in 3 tests and then find the percentage of marks from the total score of 75 marks; in a way, she decides to drop the marks of Unit Test 4. However, the English teacher decides to give Raman the same marks in the 4th test as scored in the 3rd test. The Science teacher decides to give Raman zero marks in the 4th test and then compute the percentage of marks obtained. The following sections explain the code for checking missing values and the code for replacing those missing values with appropriate values.
3.8.1 Checking Missing Values
Pandas provide a function isnull() to check whether any value is missing or not
in the DataFrame. This function checks all attributes and returns True in case
that attribute has missing values, otherwise returns False.
The following code stores the data of marks of all the Unit Tests in a DataFrame
and checks whether the DataFrame has missing values or not.
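One way to build this DataFrame from Table 3.2, with NaN for Raman's missing Unit Test 4 marks:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Raman"] * 4 + ["Zuhaire"] * 4 + ["Aashravy"] * 4 + ["Mishti"] * 4,
    "UT": [1, 2, 3, 4] * 4,
    "Maths": [22, 21, 14, np.nan, 20, 23, 22, 19, 23, 24, 12, 15, 15, 18, 17, 14],
    "Science": [21, 20, 19, np.nan, 17, 15, 18, 20, 19, 22, 25, 20, 22, 21, 18, 20],
    "S.St": [18, 17, 15, 19, 22, 21, 19, 17, 20, 24, 19, 20, 25, 25, 20, 19],
    "Hindi": [20, 22, 24, 18, 24, 25, 23, 19, 15, 17, 21, 20, 22, 24, 25, 20],
    "Eng": [21, 24, 23, np.nan, 19, 15, 13, 16, 22, 21, 23, 17, 22, 23, 20, 18],
})
# True marks a missing value, False a present one
print(df.isnull())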
One can check for each individual attribute also, e.g. the following statement
checks whether attribute ‘Science’ has a missing value or not. It returns True
for each row where there is a missing value for attribute ‘Science’, and False
otherwise.
print(df["Science"].isnull())
print(df["Science"].isnull())
To check whether a column (attribute) has a missing value in the entire dataset,
any() function is used. It returns True in case of missing value else returns
False.
print(df.isnull().any())
The any() function can be used for a particular attribute also. The following statements return True in case an attribute has a missing value; otherwise they return False.
print(df["Science"].isnull().any())
print(df["Science"].isnull().any())
print(df["Hindi"].isnull().any())
print(df["Hindi"].isnull().any())
To find the number of NaN values corresponding to each attribute, one can use
the sum() function along with isnull() function, as shown below:
print(df.isnull().sum())
To find the total number of NaN in the whole dataset, one can use
df.isnull().sum().sum().
print(df.isnull().sum().sum())
Program 3-12
Write a program to find the percentage of marks scored by Raman in Hindi.
dfRaman = df[df["Name"] == "Raman"]
print("Marks Scored by Raman:\n")
print(dfRaman)
dfHindi = dfRaman["Hindi"]
print("Marks Scored by Raman in Hindi:\n")
print(dfHindi)
# Number of Unit Tests held. Here, row will be 4
row = len(dfHindi)
print("Percentage of Marks Scored by Raman in Hindi:\n")
print((dfHindi.sum() * 100) / (25 * row), "%", sep="")
The denominator in the above formula represents the maximum aggregate marks of all the tests. Here, row is 4 (the number of tests) and 25 is the maximum marks for one test.
Program 3-13
Write a python program to find the percentage of marks obtained by Raman in
Maths subject.
dfMaths = dfRaman["Maths"]
print("Marks Scored by Raman in Maths:\n")
print(dfMaths)
# Here, row will be 4, the number of Unit Tests
row = len(dfMaths)
print("Percentage of Marks Scored by Raman in Maths:\n")
print(dfMaths.sum() * 100 / (25 * row), "%", sep="")
Here, notice that Raman was absent in Unit Test 4 in Maths Subject. While
computing the percentage, marks of the fourth test have been considered as 0.
3.8.2 Dropping Missing Values
Missing values can be handled by either dropping the entire row having the missing value or replacing it with an appropriate value.
Dropping will remove the entire row (object) having the missing value(s). This strategy reduces the size of the dataset used in data analysis; hence, it should be used only when a few objects have missing values. The dropna() function can be used to drop such rows from the DataFrame. For example, calling dropna() on the previous example will remove the row of Raman's Unit Test 4, which has NaN values.
df1 = df.dropna()
print(df1)
Now, let us consider the following code:
# marks obtained by Raman in all the unit tests
# (copy() ensures we modify a copy, not a slice of the original df)
dfRaman = df[df.Name == "Raman"].copy()
# inplace=True makes changes in the DataFrame itself, i.e., dfRaman here
dfRaman.dropna(inplace=True, how="any")
# get the marks scored in Maths
dfMaths = dfRaman["Maths"]
print("Marks Scored by Raman in Maths:\n")
print(dfMaths)
row = len(dfMaths)
print("Percentage of Marks Scored by Raman in Maths:\n")
print(dfMaths.sum() * 100 / (25 * row), "%", sep="")
Note that the number of rows in dfRaman is 3 after using dropna. Hence
percentage is computed from marks obtained in 3 Unit Tests.
3.8.3 Estimating Missing Values
Missing values can be filled by using estimations or approximations, e.g., a value just before (or after) the missing value, or the average/minimum/maximum of the values of that attribute. In some cases, missing values are replaced by zeros (or ones).
The fillna(num) function can be used to replace missing value(s) with the value specified in num. For example, fillna(0) replaces missing values with 0. Similarly, fillna(1) replaces missing values with 1. The following code replaces missing values with 0 and computes the percentage of marks scored by Raman in Science.
# Marks Scored by Raman in all the subjects across the tests
dfRaman = df.loc[df["Name"] == "Raman"]
(row, col) = dfRaman.shape
dfScience = dfRaman.loc[:, "Science"]
print("Marks Scored by Raman in Science:\n")
print(dfScience)
dfFillZeroScience = dfScience.fillna(0)
print("Marks Scored by Raman in Science with Missing Values Replaced with Zero:\n")
print(dfFillZeroScience)
print("Percentage of Marks Scored by Raman in Science:\n")
print("Percentage of Marks Scored by Raman in Science:\n")
print(dfFillZeroScience.sum() * 100 / (25 * row), "%", sep="")
df.ffill() (forward fill, formerly written as df.fillna(method='pad')) replaces a missing value with the value that comes before it, while df.bfill() (backward fill, formerly df.fillna(method='bfill')) replaces it with the value that comes after it. The following code replaces the missing value in Unit Test 4 of the English test with the marks of Unit Test 3 and then computes the percentage of marks obtained by Raman.
dfEng = dfRaman.loc[:, "Eng"]
print("Marks Scored by Raman in English:\n")
print(dfEng)
dfFillPadEng = dfEng.ffill()
print("Marks Scored by Raman in English with Missing Values Replaced by Previous Test Marks:\n")
print(dfFillPadEng)
print("Percentage of Marks Scored by Raman in English:\n")
print("Percentage of Marks Scored by Raman in English:\n")
print(dfFillPadEng.sum() * 100 / (25 * row), "%", sep="")
In this section, we have discussed various ways of handling missing values. A missing value represents a loss of information, and replacing missing values by some estimation will surely change the dataset. In all such cases, the data analysis results will not be the actual results, but a good approximation of them.
3.9 Import and Export of Data Between Pandas and MySQL
So far, we have directly entered data, created a DataFrame, and learned how to analyse data in a DataFrame. However, in actual scenarios, data need not be typed or copy-pasted every time. Rather, data is usually available in a file (text or CSV) or in a database. Thus, in real-world scenarios, we will be required to bring data directly from a database and load it into a DataFrame. This is called importing data from a database. Likewise, after analysis, we will be required to store data back in a database. This is called exporting data to a database.
Data in a DataFrame can be read from and written to a MySQL database. To do this, a connection is required with the MySQL database using the pymysql database driver, which should be installed in the Python environment using the following command:
pip install pymysql
sqlalchemy is a library used to interact with the MySQL database by providing
the required credentials. This library can be installed using the following
command:
pip install sqlalchemy
Once it is installed, sqlalchemy provides the create_engine() function that enables this connection to be established. The string inside the function is known as the connection string. The connection string is composed of multiple parameters: the driver, the username, the password, the host, the port number and finally the name of the database. The function returns an engine object based on this connection string. The components of the connection string are as follows:
driver = mysql+pymysql
username = user name of the MySQL account (normally it is root)
password = password of the MySQL account
host:port = usually localhost with port number 3306 (the default port number)
database = the name of your database
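A sketch of establishing the connection; the credentials are placeholders to be replaced with your own, and CARSHOWROOM is the database created in Chapter 1:
import pandas as pd
from sqlalchemy import create_engine

# mysql+pymysql driver, user root, placeholder password,
# localhost:3306, database CARSHOWROOM
engine = create_engine(
    "mysql+pymysql://root:password@localhost:3306/CARSHOWROOM")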
In the following subsections, importing and exporting data between Pandas and
MySQL applications are demonstrated. For this, we will use the same database
CARSHOWROOM and Table INVENTORY created in Chapter 1 of this book.
3.9.1 Importing Data from MySQL to Pandas
Importing data from MySQL to Pandas basically refers to the process of reading a table from a MySQL database and loading it into a pandas DataFrame. After establishing the connection, in order to fetch data from a table of the database, we have the following three functions:
pandas.read_sql_query(query, sql_conn): It is used to read an SQL query (query) into a DataFrame using the connection identifier (sql_conn) returned from create_engine().
pandas.read_sql_table(table_name, sql_conn): It is used to read an SQL table (table_name) into a DataFrame using the connection identifier (sql_conn).
pandas.read_sql(sql, sql_conn): It is used to read either an SQL query or an SQL table (sql) into a DataFrame using the connection identifier (sql_conn).
df = pd.read_sql_query("SELECT * FROM INVENTORY", engine)
print(df)
CarId CarName Price Model YearManufacture Fueltype
0 D001 Car1 582613.00 LXI 2017 Petrol
1 D002 Car1 673112.00 VXI 2018 Petrol
2 B001 Car2 567031.00 Sigma1.2 2019 Petrol
3 B002 Car2 647858.00 Delta1.2 2018 Petrol
4 E001 Car3 355205.00 5STR STD 2017 CNG
5 E002 Car3 654914.00 CARE 2018 CNG
6 S001 Car4 514000.00 LXI 2017 Petrol
7 S002 Car4 614000.00 VXI 2018 Petrol
3.9.2 Exporting Data from Pandas to MySQL
Exporting data from Pandas to MySQL basically refers to the process of writing a pandas DataFrame to a table of a MySQL database. For this purpose, we have the DataFrame.to_sql() function:
DataFrame.to_sql(name, con, if_exists="fail", index=True)
Here, name specifies the table in which we want to create or append the DataFrame values, and con is the connection identifier (the engine object) returned from create_engine().
The parameter if_exists specifies the way data from the DataFrame should be entered in the table. It can have the following three values: "fail", "replace" and "append".
"fail" is the default value; it raises a ValueError if the table already exists in the database.
"replace" specifies that the previous content of the table should be replaced by the contents of the DataFrame.
"append" specifies that the contents of the DataFrame should be appended to the existing table; the DataFrame must match the format of the table (column name sequence).
index: by default, index is True, which means the DataFrame index will be copied to the MySQL table. If False, the DataFrame index is ignored.
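A sketch of such a script, reusing the placeholder credentials from the import example; the DataFrame contents here are just a couple of rows taken from the INVENTORY output shown earlier:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://root:password@localhost:3306/CARSHOWROOM")
df = pd.DataFrame({"CarId": ["D001", "D002"],
                   "CarName": ["Car1", "Car1"],
                   "Price": [582613.00, 673112.00]})
# Create (or replace) the table showroom_info, without the DataFrame index
df.to_sql("showroom_info", con=engine, if_exists="replace", index=False)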
After running this python script, a MySQL table with the name “showroom_info”
will be created in the database.
Summary
Exercise
Question 1
Write the statement to install the python connector to connect MySQL i.e.
pymysql.
Question 2
Explain the difference between the pivot() and pivot_table() functions.
Question 3
What is sqlalchemy?
Question 4
Can you sort a DataFrame with respect to multiple columns?
Question 5
What are missing values? What are the strategies to handle them?
Question 6
Define the following terms: Median, Standard Deviation and variance.
Question 7
What do you understand by the term MODE? Name the function which is used to
calculate it.
Question 8
Write the purpose of Data aggregation.
Question 9
Explain the concept of groupby with the help of an example.
Question 10
Write the steps required to read data from a MySQL database to a DataFrame.
Question 11
Explain the importance of reshaping of data with an example.
Question 12
Why is estimation an important concept in data analysis?
Question 13
Consider the given table: Product. Write the python code for the following:

| Item | Rupees | USD |
| --- | --- | --- |
| TV LG | 12000 | 700 |
| TV VIDEOCON | 10000 | 650 |
| TV LG | 15000 | 800 |
| AC SONY | 14000 | 750 |
a) To create the data frame for the above table.
b) To add the new rows in the data frame.
c) To display the maximum price of LG TV.
d) To display the Sum of all products.
e) To display the median of the USD of Sony products.
f) To sort the data according to the Rupees and transfer the data to MySQL.
g) To transfer the new dataframe into the MySQL with new values.
Question 14
Write the python statement for the following question on the basis of given
dataset:
|   | Name | Degree | Score |
| --- | --- | --- | --- |
| 0 | Aparna | MBA | 90.0 |
| 1 | Pankaj | BCA | NaN |
| 2 | Ram | M.Tech | 80.0 |
| 3 | Ramesh | MBA | 98.0 |
| 4 | Naveen | NaN | 97.0 |
| 5 | Krrishnav | BCA | 78.0 |
| 6 | Bhawna | MBA | 89.0 |
a) To create the above DataFrame.
b) To print the Degree and maximum marks in each stream.
c) To fill the NaN with 76.
d) To set the index to Name.
e) To display the name and degree wise average marks of each student.
f) To count the number of students in MBA.
g) To print the mode of marks in BCA.
Solved Case Study Based on Open Datasets
The UCI repository is a collection of open datasets, available to the public for experimentation and research purposes. 'auto-mpg' is one such open dataset.
It contains data related to fuel consumption by automobiles in a city.
Consumption is measured in miles per gallon (mpg), hence the name of the dataset
is auto-mpg. The data has 398 rows (also known as items or instances or objects)
and nine columns (also known as attributes).
The attributes are: mpg, cylinders, displacement, horsepower, weight,
acceleration, model year, origin, car name. Three attributes, cylinders, model
year and origin have categorical values, car name is a string with a unique
value for every row, while the remaining five attributes have numeric values.