What is Pandas?

Pandas is an open-source Python library that provides high-performance tools for data manipulation and analysis, built around its powerful data structures. The name Pandas is derived from the term "panel data".

In 2008, developer Wes McKinney started developing pandas when he needed a high-performance, flexible tool for data analysis.

Prior to Pandas, Python was mainly used for data munging and preparation and contributed very little to data analysis itself. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of academic and commercial fields, including finance, engineering, and chemistry.

Key Features of Pandas

Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.

Data Structures

Pandas deals with the following three data structures:

Series
DataFrame
Panel

These data structures are built on top of NumPy arrays, which means they are fast. The best way to think of these data structures is that the higher-dimensional data structure is a container of its lower-dimensional data structure. For example, a DataFrame is a container of Series, and a Panel is a container of DataFrames.

Data Structure	Dimensions	Description
Series	1	1D labeled homogeneous array, size-immutable.
DataFrame	2	General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel	3	General 3D labeled, size-mutable array.
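
The container relationship can be sketched as follows. The column names and values are made up for illustration only:

```python
import pandas as pd

# Hypothetical two-column table used only for illustration
df = pd.DataFrame({'Age': [32, 28], 'Rating': [3.45, 4.6]})

# Selecting one column of the 2D DataFrame yields a 1D Series
col = df['Age']
print(type(col).__name__)  # Series
print(col.ndim, df.ndim)   # 1 2
```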

Building and handling arrays of two or more dimensions is a tedious task: the burden is on the user to consider the orientation of the data set when writing functions. Pandas data structures reduce that mental effort.

For example, with tabular data in a DataFrame, it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.

?> All Pandas data structures are value-mutable (their values can be changed), and all except Series are size-mutable. Series is size-immutable.

!> DataFrame is the most widely used and one of the most important data structures. Panel and Series are used much less.

Series

Series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of the integers 10, 23, 56, …

0	1	2	3	4	5	6	7	8	9
10	23	56	17	52	61	73	90	26	72
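
A minimal sketch of building this series with pandas, assuming the default integer index is acceptable:

```python
import pandas as pd

# The ten integers from the table above; pandas assigns the default index 0..9
s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])
print(s.dtype)        # an integer dtype (int64 on most platforms)
print(list(s.index))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```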

DataFrame

DataFrame is a two-dimensional array with heterogeneous data. For example:

Name	Age	Gender	Rating
Steve	32	Male	3.45
Lia	28	Female	4.6
Vin	45	Male	3.9
Katie	38	Female	2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person. The data types of the four columns are as follows:

Column	Type
Name	String
Age	Integer
Gender	String
Rating	Float
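
As a rough sketch, the same table can be built from a dictionary of columns, and the dtypes attribute then reports the type pandas inferred for each column. How string columns are reported depends on the pandas version, so the printed output is not shown here:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78],
})

# One inferred type per column: integer for Age, float for Rating,
# and an object or string dtype for Name and Gender
print(df.dtypes)
```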

Panel

Panel is a three-dimensional data structure with heterogeneous data. A panel is hard to represent graphically, but it can be illustrated as a container of DataFrames. Panel was deprecated and then removed from pandas in version 0.25, so it will not be explained further.

Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.

Create a Series

Empty Series

The most basic Series that can be created is an empty Series.

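# import the pandas library under the alias pd
import pandas as pd

# pass dtype explicitly: recent pandas versions no longer default an empty Series to float64
s = pd.Series(dtype='float64')
print(s)

# output: "Series([], dtype: float64)"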

From ndarray

If data is an ndarray, then any index passed must be of the same length. If no index is passed, the default index is range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array) - 1].

import numpy as np

data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Output is:

index	value
0	a
1	b
2	c
3	d

We did not pass any index, so by default pandas assigned indexes ranging from 0 to len(data)-1, i.e., 0 to 3. If we pass index values, the customized index appears in the output:

data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

index	value
100	a
101	b
102	c
103	d

From dictionary

A dictionary can be passed as input. If no index is specified, the dictionary keys are used to construct the index (in insertion order in current pandas versions; older versions sorted the keys). If an index is passed, the values in data corresponding to the labels in the index are pulled out.

data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

Which gives:

index	value
a	0.0
b	1.0
c	2.0
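
The case where an index is passed along with the dictionary can be sketched as follows. The label 'd' is deliberately absent from the dictionary to show what happens to labels that have no matching key:

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}

# Labels are looked up in the dictionary; 'd' has no entry, so it becomes NaN
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
```

The result follows the order of the index that was passed, not the order of the dictionary keys.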

From scalar

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index:

s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Outputs:

index	value
0	5
1	5
2	5
3	5

Retrieve with Position

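Data in a Series can be accessed by position, much like an ndarray.

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
# iloc makes the positional lookup explicit; plain s[0] is deprecated when the index is not integer-based
print(s.iloc[0]) # 1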

Retrieve the first three elements in the Series. A slice written as :n extracts all items before position n, n: extracts all items from position n onwards, and start:stop extracts the items between the two positions (not including the stop position):

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
print(s.iloc[:3]) # 1 2 3
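
The other two slice forms can be sketched in the same way, here using iloc so that the slices are unambiguously positional:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

last_three = s.iloc[-3:]  # from the third-to-last position onwards
middle = s.iloc[1:3]      # positions 1 and 2; the stop position 3 is excluded

print(last_three.tolist())  # [3, 4, 5]
print(middle.tolist())      # [2, 3]
```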

Retrieve with Index

A Series is like a fixed-size dictionary in that you can get and set values by index label.

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
print(s['a']) # 1

s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
# pass a list of labels to retrieve several elements at once
print(s[['a','c','d']]) # 1 3 4
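
The dictionary analogy also covers setting values and testing for labels. A minimal sketch, in which the label 'z' and the default value -1 are arbitrary choices for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

s['b'] = 20            # overwrite the value stored under label 'b'
print(s['b'])          # 20
print('z' in s)        # False: membership tests check the index labels, like dict keys
print(s.get('z', -1))  # -1: get() returns a default instead of raising KeyError
```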

DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. You can think of it as a spreadsheet-like representation of data.

Create a DataFrame

Empty DataFrame

The most basic DataFrame that can be created is an empty DataFrame.

df = pd.DataFrame()
print(df)

# output:
# Empty DataFrame
# Columns: []
# Index: []

From lists

The DataFrame can be created using a single list or a list of lists:

data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Output is:

index	Name	Age
0	Alex	10
1	Bob	12
2	Clarke	13

?> Observe that pandas infers an integer type for the Age column. Convert it with astype(float) if floating-point values are needed.
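
The single-list case mentioned above can be sketched as follows. The column name 'Value' is a hypothetical choice; without it pandas labels the column 0:

```python
import pandas as pd

# A flat list becomes a single column with one row per element
df = pd.DataFrame([1, 2, 3, 4, 5], columns=['Value'])
print(df.shape)  # (5, 1)
```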

From dictionary of lists

All the lists must be of the same length. If an index is passed, its length should equal the length of the lists.

If no index is passed, then by default, index will be range(n), where n is the array length.

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)

Output is:

index	Name	Age
0	Tom	28
1	Jack	34
2	Steve	29
3	Ricky	42

?> Observe the values 0,1,2,3. They are the default index assigned to each using the function range(n).
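
Passing a custom index replaces the default range(n) labels. A minimal sketch, where the labels rank1 to rank4 are hypothetical and there must be exactly one label per row:

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}

# One label per row; a different number of labels raises a ValueError
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df.loc['rank2', 'Name'])  # Jack
```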

From list of dictionaries

A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Output is:

index	a	b	c
0	1	2	NaN
1	5	10	20.0

!> Column c is NaN at index 0.

From dictionary of series

Dictionary of series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)

Output is:

index	one	two
a	1.0	1
b	2.0	2
c	3.0	3
d	NaN	4

!> Series one has no label "d", so in the result the d label holds NaN in column one.

Add column

Adding a new column is as easy as passing a Series:

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df['three'] = pd.Series([10,20,30],index=['a','b','c'])
print(df)

Output is:

index	one	two	three
a	1.0	1	10.0
b	2.0	2	20.0
c	3.0	3	30.0
d	NaN	4	NaN

Adding a new column using the existing columns in DataFrame:

df['four'] = df['one'] + df['three']
print(df)

Output as:

index	one	two	three	four
a	1.0	1	10.0	11.0
b	2.0	2	20.0	22.0
c	3.0	3	30.0	33.0
d	NaN	4	NaN	NaN

Delete column

Columns can be deleted:

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df.pop('two')
print(df)

Output is:

index	one
a	1.0
b	2.0
c	3.0
d	NaN
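
pop() also returns the column it removes, and the del statement is another way to delete a column in place. A small sketch with a made-up three-column frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6], 'three': [7, 8, 9]})

removed = df.pop('two')  # removes the column in place and returns it as a Series
del df['three']          # del also removes a column in place

print(removed.tolist())  # [4, 5, 6]
print(list(df.columns))  # ['one']
```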

Row selection

By label

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df.loc['b'])

Output is:

index	value
one	2.0
two	2.0

The result is a Series whose labels are the column names of the DataFrame. The name of the Series is the label with which it was retrieved.

By integer location

data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df.iloc[2])

Output is:

index	value
one	3.0
two	3.0

Multiple rows can be selected using the : operator:

print(df.iloc[2:4])

Output is:

index	one	two
c	3.0	3
d	NaN	4
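
Slices behave differently for the two selectors: a positional slice excludes its stop position, while a label slice includes its stop label. A minimal sketch using a one-column frame with the same row labels:

```python
import pandas as pd

df = pd.DataFrame({'two': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

by_position = df.iloc[1:3]  # positions 1 and 2; stop position excluded
by_label = df.loc['b':'d']  # labels b, c and d; stop label included

print(list(by_position.index))  # ['b', 'c']
print(list(by_label.index))     # ['b', 'c', 'd']
```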

Add row

Add new rows to a DataFrame using the pd.concat function, which places the rows of the second DataFrame after those of the first. (The older DataFrame.append method was removed in pandas 2.0.)

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])
print(df)

Output is:

index	a	b
0	1	2
1	3	4
0	5	6
1	7	8

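!> The original index labels are carried over into the combined DataFrame. To reset the index use the reset_index() function; passing drop=True discards the old labels instead of keeping them as a column. For this example: print(df.reset_index(drop=True)).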
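
Duplicated labels can also be avoided at the moment the rows are combined. A minimal sketch using pd.concat with ignore_index=True on the same two frames:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# ignore_index=True discards the original labels and numbers the rows 0..n-1
combined = pd.concat([df, df2], ignore_index=True)
print(list(combined.index))  # [0, 1, 2, 3]
```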

Delete row

Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, multiple rows will be dropped.

Notice that in the example above the labels are duplicated. Let us drop a label and see how many rows get dropped.

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])

df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
df = df.drop(0)
print(df)

Output is:

index	a	b
1	3	4
1	7	8

!> In the above example, two rows were dropped because those two contain the same label 0.

Basic functionality

Head and tail

head() returns the first n rows and tail() returns the last n rows (observe the index values). For both, the default number of elements to display is five, but you may pass a custom number.

s = pd.Series(np.random.randn(4))
print(s.tail(2))
print(s.head(2))
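
Because the example above uses random values, its output changes on every run. A deterministic sketch with a ten-element series shows the default and a custom n:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

print(s.head().tolist())   # [0, 1, 2, 3, 4]  (default n=5)
print(s.tail(3).tolist())  # [7, 8, 9]
```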

Transpose

Returns the transpose of the DataFrame. The rows and columns will interchange:

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df.T)

The transpose of the DataFrame is:

	0	1	2	3	4	5	6
Name	Tom	James	Ricky	Vin	Steve	Smith	Jack
Age	25	26	25	23	30	29	23
Rating	4.23	3.24	3.98	2.56	3.2	4.6	3.8

Shape

Returns a tuple representing the dimensionality of the DataFrame. Tuple (a, b), where a represents the number of rows and b represents the number of columns.

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df.shape) # (7, 3)

Size

Returns the number of elements in the DataFrame.

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
   'Age':pd.Series([25,26,25,23,30,29,23]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df.size) # 21

Descriptive statistics

The following table lists the important functions available on the DataFrame object:

function	description
`count()`	number of non-null observations
`sum()`	sum of values
`mean()`	mean of values
`median()`	median of values
`mode()`	mode of values
`std()`	standard deviation of the values
`min()`	minimum value
`max()`	maximum value
`abs()`	absolute value
`prod()`	product of values
`cumsum()`	cumulative sum
`cumprod()`	cumulative product
`describe()`	summary of statistics

For example, sum() returns the sum of the values:

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.sum())

Output is:

column	value
Name	TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age	382
Rating	44.92

sum() accepts an axis argument that selects the direction of the sum. By default (axis=0) it returns the sum of every column; axis=1 returns the sum of every row.
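
The effect of the axis argument is easiest to see on an all-numeric frame. A minimal sketch with a small made-up table (the column names x and y are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

print(df.sum().tolist())        # [6, 60]       column sums (axis=0, the default)
print(df.sum(axis=1).tolist())  # [11, 22, 33]  row sums
```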

mean() returns the average value. Using the same example as above:

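# numeric_only=True restricts the mean to numeric columns; recent pandas versions raise an error on the string column otherwise
print(df.mean(numeric_only=True))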

Output is:

column	value
Age	31.833333
Rating	3.743333

!> Notice that string columns are ignored.

The mean can also be found using:

print(df['Age'].sum() / df['Age'].count())

The describe() function computes a summary of statistics pertaining to the DataFrame columns. It gives the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. Using the same DataFrame as above:

print(df.describe())

Output is:

	Age	Rating
count	12.000000	12.000000
mean	31.833333	3.743333
std	9.232682	0.661628
min	23.000000	2.560000
25%	25.000000	3.230000
50%	29.500000	3.790000
75%	35.500000	4.132500
max	51.000000	4.800000
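
describe() reports the quartiles rather than a single interquartile range (IQR) figure, but the IQR can be derived from them. A minimal sketch using the Age values from the example:

```python
import pandas as pd

ages = pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46])

summary = ages.describe()              # a Series indexed by count, mean, std, min, 25%, ...
iqr = summary['75%'] - summary['25%']  # interquartile range from the reported quartiles
print(iqr)  # 10.5
```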

Sorting

Sorting a DataFrame by column:

d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.sort_values(by='Age')) # DataFrame will be sorted on Age column
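
sort_values() also accepts several columns and a sort direction for each. A minimal sketch on the first four rows of the example data, sorting by Age in descending order and breaking ties by Rating in ascending order:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky', 'Vin'],
                   'Age': [25, 26, 25, 23],
                   'Rating': [4.23, 3.24, 3.98, 2.56]})

# One ascending flag per column listed in `by`
result = df.sort_values(by=['Age', 'Rating'], ascending=[False, True])
print(result['Name'].tolist())  # ['James', 'Ricky', 'Tom', 'Vin']
```

The original row labels travel with the rows, so the index of the result is no longer in order.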

Rename columns

Rename columns using rename(). Using the same DataFrame as above:

df = df.rename(columns={
    'Name': 'N',
    'Age': 'A',
    'Rating': 'R'
})
print(df)

CSV

A comma-separated values (CSV) file is a delimited text file that stores tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas. The comma acts as the field separator, or delimiter, and gives the format its name. Other delimiters such as | can also be used. CSV files can be used to store a DataFrame object as a file.

Write

To save a DataFrame as a CSV file using | as the delimiter/separator:

df = pd.DataFrame(
    np.random.rand(10, 4),
    columns=['a', 'b', 'c', 'd']
)
df.to_csv('random.csv', sep='|')

Read

Reading a CSV file in pandas is as easy as:

df = pd.read_csv('random.csv', delimiter='|')
print(df)
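
By default to_csv() also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows one way to get a clean round trip by passing index=False. It writes to an in-memory io.StringIO buffer instead of a file on disk, which is an assumption made only to keep the example self-contained:

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

buffer = io.StringIO()                   # in-memory stand-in for a file on disk
df.to_csv(buffer, sep='|', index=False)  # index=False leaves the row labels out of the output
buffer.seek(0)                           # rewind before reading

restored = pd.read_csv(buffer, sep='|')
print(restored.equals(df))  # True
```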