What is Pandas?

Pandas is an open-source Python Library providing high-performance data manipulation and analysis tool using its powerful data structures. The name Pandas is derived from the word Panel Data.

In 2008, developer Wes McKinney started developing pandas when in need of high performance, flexible tool for analysis of data.

Prior to Pandas, Python was majorly used for data munging and preparation. It had very little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of data — load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of fields including academic and commercial domains including finance, engineering, chemistry etc.

Key Features of Pandas

Data Structures

Pandas deals with the following three data structures:

These data structures are built on top of Numpy array, which means they are fast. The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. For example, DataFrame is a container of Series, Panel is a container of DataFrame.

Data Structure Dimensions Description
Series 1 1D labeled homogeneous array, sizeimmutable.
Data Frames 2 General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel 3 General 3D labeled, size-mutable array.

Building and handling two or more dimensional arrays is a tedious task, burden is placed on the user to consider the orientation of the data set when writing functions. But using Pandas data structures, the mental effort of the user is reduced.

For example, with tabular data DataFrame it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.

?> All Pandas data structures are value mutable (can be changed) and except Series all are size mutable. Series is size immutable.

!> DataFrame is widely used and one of the most important data structures. Panel and Series is used much less.

Series

Series is a one-dimensional array like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, …

0 1 2 3 4 5 6 7 8 9
10 23 56 17 52 61 73 90 26 72

DataFrame

DataFrame is a two-dimensional array with heterogeneous data. For example:

Name Age Gender Rating
Steve 32 Male 3.45
Lia 28 Female 4.6
Vin 45 Male 3.9
Katie 38 Female 2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person. The data types of the four columns are as follows:

Column Type
sName String
sAge Integer
sGender String
sRating Float

Panel

Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent the panel in graphical representation. But a panel can be illustrated as a container of DataFrame. Panel will not be explained.

Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.

Create a Series

Empty Series

A basic series, which can be created is an empty Series.

#import the pandas library and aliasing as pd import pandas as pd s = pd.Series() print(s) # output: "Series([], dtype: float64)"

From ndarray

If data is an ndarray, then index passed must be of the same length. If no index is passed, then by default index will be range(n) where n is array length, i.e., [0,1,2,3... range(len(array))-1].

data = np.array(['a','b','c','d']) s = pd.Series(data) print(s)

Output is:

index value
0 a
1 b
2 c
3 d

We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1, i.e., 0 to 3. If we passed the index values we can see the customized indexed values in the output:

data = np.array(['a','b','c','d']) s = pd.Series(data,index=[100,101,102,103]) print(s)
index value
100 a
101 b
102 c
103 d

From dictionary

A dictionary can be passed as input and if no index is specified, then the dictionary keys are taken in a sorted order to construct index. If index is passed, the values in data corresponding to the labels in the index will be pulled out.

data = {'a' : 0., 'b' : 1., 'c' : 2.} s = pd.Series(data) print(s)

Which gives:

index value
a 0.0
b 1.0
c 2.0

From scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index:

s = pd.Series(5, index=[0, 1, 2, 3]) print(s)

Outputs:

index value
0 5
1 5
2 5
3 5

Retrieve with Position

Data in the series can be accessed similar to that in an ndarray.

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e']) print(s[0]) # 1

Retrieve the first three elements in the Series. If a : is inserted in front of it, all items from that index onwards will be extracted. If two parameters (with : between them) is used, items between the two indexes (not including the stop index):

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e']) print(s[:3]) # 3 4 5

Retrieve with Index

A Series is like a fixed-size dictionary in that you can get and set values by index label.

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e']) print(s['a']) # 1

Retrieve the first three elements in the Series. If a : is inserted in front of it, all items from that index onwards will be extracted. If two parameters (with : between them) is used, items between the two indexes (not including the stop index):

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e']) print([['a','c','d']]) # 3 4 5

DataFrame

A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. You can think of it as a spreadsheet data representation.

Create a DataFrame

Empty DataFrame

A basic DataFrame, which can be created is an empty DataFrame.

s = pd.DataFrame() print(s) # output: # Columns: [] # Index: []

From ndarray

The DataFrame can be created using a single list or a list of lists:

data = [['Alex',10],['Bob',12],['Clarke',13]] df = pd.DataFrame(data,columns=['Name','Age']) print(df)

Output is:

index Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0

?> Observe, the type of Age column is a floating point.

From dictionary of lists

All the ndarrays must be of same length. If index is passed, then the length of the index should equal to the length of the arrays.

If no index is passed, then by default, index will be range(n), where n is the array length.

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]} df = pd.DataFrame(data,columns=['Name','Age']) print(df)

Output is:

index Name Age
0 28 Tome
1 34 Jack
2 29 Steve
3 42 Ricky

?> Observe the values 0,1,2,3. They are the default index assigned to each using the function range(n).

From list of dictionaries

List of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}] df = pd.DataFrame(data) print(df)

Output is:

index a b c
0 1 2 NaN
1 5 20 20.0

!> Column c is NaN at index 0.

From dictionary of series

Dictionary of series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])} df = pd.DataFrame(data) print(df)

Output is:

index one two
0 1.0 1
1 2.0 2
2 3.0 3
3 NaN 4

!> For the series one, there is no label "d" passed, but in the result, for the d label, NaN is appended with NaN.

Add column

Adding a new column is as easy as passing a Series:

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])} df = pd.DataFrame(data) df['three'] = pd.Series([10,20,30],index=['a','b','c']) print(df)

Output is:

index one two three
0 1.0 1 10.0
1 2.0 2 20.0
2 3.0 3 30.0
3 NaN 4 NaN

Adding a new column using the existing columns in DataFrame:

df['four'] = df['one'] + df['three'] print(df)

Output as:

index one two three four
0 1.0 1 10.0 11.0
1 2.0 2 20.0 22.0
2 3.0 3 30.0 33.0
3 NaN 4 NaN NaN

Delete column

Columns can be deleted:

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])} df = pd.DataFrame(data) df.pop('two') print(df)

Output is:

index value
one 1.0
two 2.0
2 3.0
3 NaN

Row selection

By label

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])} df = pd.DataFrame(data) print(df.loc('b'))

Output is:

index value
one 2.0
two 2.0

The result is a series with labels as column names of the DataFrame. And, the ame of the series is the label with which it is retrieved.

By integer location

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])} df = pd.DataFrame(data) print(df.iloc(2))

Output is:

index value
one 3.0
two 3.0

Multiple rows can be selected using : operator:

print(df.iloc(2:4))

Output is:

index one two
c 3.0 3
d NaN 4

Add row

Add new rows to a DataFrame using the append function. This function will append the rows at the end.

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b']) df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b']) df = df.append(df2) print(df)

Output is:

index a b
0 1 2
1 3 4
0 5 6
1 7 8

!> The index will also append to the DataFrame. To reset the index use the reset_index() function. For this example: print(df.reset_index()).

Delete row

Use index label to delete or drop rows from a DataFrame. If label is duplicated, then multiple rows will be dropped.

If you observe, in the above example, the labels are duplicate. Let us drop a label and will see how many rows will get dropped.

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b']) df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b']) df = df.append(df2) df = df.drop(0) print(df)

Output is:

index a b
1 3 4
1 7 8

!> In the above example, two rows were dropped because those two contain the same label 0.

Basic functionality

Head and tail

Add new rows to a DataFrame using the append function. This function will append the rows at the end.

head() returns the first n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number. tail() returns the last n rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.

s = pd.Series(np.random.randn(4)) print(s.tail(2)) print(s.head(2))

Transpose

Returns the transpose of the DataFrame. The rows and columns will interchange:

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']), 'Age':pd.Series([25,26,25,23,30,29,23]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])} df = pd.DataFrame(d) print(df.T)

The transpose of the DataFrame is:

0 1 2 3 4 5 6
Age 25 26 25 23 30 29 23
Name Tom James Ricky Vin Steve Smith Jack
Rating 4.23 3.24 3.98 2.56 3.2 4.6 3.8

Shape

Returns a tuple representing the dimensionality of the DataFrame. Tuple (a, b), where a represents the number of rows and b represents the number of columns.

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']), 'Age':pd.Series([25,26,25,23,30,29,23]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])} df = pd.DataFrame(d) print(df.shape) # (7, 3)

Size

Returns the number of elements in the DataFrame.

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']), 'Age':pd.Series([25,26,25,23,30,29,23]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])} df = pd.DataFrame(d) print(df.size) # 21

Descriptive statistics

The following table list down the important functions available on the DataFrame object:

function description
count() number of non-null observations
sum() sum of values
mean() mean of values
median() median of values
mode() mode of values
std() standard deviation of the values
min() minimum value
max() maximum value
abs() absolute value
prod() product of values
cumsum() cumulative Sum
cumprod() cumulative product
describe() summary of statistics

For example, sum() returns the sum of the values:

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 'Lee','David','Gasper','Betina','Andres']), 'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65]) } df = pd.DataFrame(d) print(df.sum())

Output is:

column value
Age 382
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Rating 44.92

sum() takes one argument which is the axis to take the sum of. By default it takes the sum for every column.

mean() returns the average value. Using the same example as above:

print(df.mean())

Output is:

column value
Age 31.833333
Rating 3.743333

!> Notice that string columns are ignored.

The mean can also be found using:

print(df['Age'].sum() / df['Age'].count())

The describe() function computes a summary of statistics pertaining to the DataFrame columns. It gives the count, mean, std and IQR values. Using the same DataFrame as above:

print(df.describe())

Output is:

Age Rating
count 12.00000 12.00000
mean 31.83333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000

Sorting

Sorting a DataFrame by column:

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 'Lee','David','Gasper','Betina','Andres']), 'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]), 'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65]) } df = pd.DataFrame(d) print(df.sort_values(by='Age')) # DataFrame will be sorted on Age column

Rename columns

Rename columns using rename(). Using the DataFrame as above:

df = df.rename(columns={ 'Name': 'N', 'Age': 'A', 'Rating': 'R' }) print(df)

CSV

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format and is called delimiter. Other delimiters such as | can also be used for CSV files. CSV files can be used to store our DataFrame object as a file.

Write

To save a DataFrame as CSV file using | as delimiter/separator:

df = pd.DataFrame( np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'] ) df.to_csv('random.csv', sep='|')

Read

Reading a CSV file in pandas is as easy as:

df = pd.read_csv('random.csv', delimiter='|') print(df)