Pandas is an open-source Python Library providing high-performance data manipulation and analysis tool using its powerful data structures. The name Pandas is derived from the word Panel Data.
In 2008, developer Wes McKinney started developing pandas when in need of high performance, flexible tool for analysis of data.
Prior to Pandas, Python was majorly used for data munging and preparation. It had very little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of data — load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of fields including academic and commercial domains including finance, engineering, chemistry etc.
Pandas deals with the following three data structures:
These data structures are built on top of Numpy array, which means they are fast. The best way to think of these data structures is that the higher dimensional data structure is a container of its lower dimensional data structure. For example, DataFrame
is a container of Series
, Panel
is a container of DataFrame
.
Data Structure | Dimensions | Description |
---|---|---|
Series | 1 | 1D labeled homogeneous array, sizeimmutable. |
Data Frames | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns. |
Panel | 3 | General 3D labeled, size-mutable array. |
Building and handling two or more dimensional arrays is a tedious task, burden is placed on the user to consider the orientation of the data set when writing functions. But using Pandas data structures, the mental effort of the user is reduced.
For example, with tabular data DataFrame
it is more semantically helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.
?> All Pandas data structures are value mutable (can be changed) and except Series all are size mutable. Series is size immutable.
!> DataFrame is widely used and one of the most important data structures. Panel and Series is used much less.
Series is a one-dimensional array like structure with homogeneous data. For example, the following series is a collection of integers 10, 23, 56, …
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
10 | 23 | 56 | 17 | 52 | 61 | 73 | 90 | 26 | 72 |
DataFrame is a two-dimensional array with heterogeneous data. For example:
Name | Age | Gender | Rating |
---|---|---|---|
Steve | 32 | Male | 3.45 |
Lia | 28 | Female | 4.6 |
Vin | 45 | Male | 3.9 |
Katie | 38 | Female | 2.78 |
The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person. The data types of the four columns are as follows:
Column | Type |
---|---|
sName | String |
sAge | Integer |
sGender | String |
sRating | Float |
Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent the panel in graphical representation. But a panel can be illustrated as a container of DataFrame. Panel will not be explained.
Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.
A basic series, which can be created is an empty Series.
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series()
print(s)
# output: "Series([], dtype: float64)"
If data is an ndarray, then index passed must be of the same length. If no index is passed, then by default index will be range(n)
where n
is array length, i.e., [0,1,2,3... range(len(array))-1]
.
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)
Output is:
index | value |
---|---|
0 | a |
1 | b |
2 | c |
3 | d |
We did not pass any index, so by default, it assigned the indexes ranging from 0
to len(data)-1
, i.e., 0 to 3. If we passed the index values we can see the customized indexed values in the output:
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
index | value |
---|---|
100 | a |
101 | b |
102 | c |
103 | d |
A dictionary can be passed as input and if no index is specified, then the dictionary keys are taken in a sorted order to construct index. If index is passed, the values in data corresponding to the labels in the index will be pulled out.
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Which gives:
index | value |
---|---|
a | 0.0 |
b | 1.0 |
c | 2.0 |
If data is a scalar value, an index must be provided. The value will be repeated to match the length of index:
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
Outputs:
index | value |
---|---|
0 | 5 |
1 | 5 |
2 | 5 |
3 | 5 |
Data in the series can be accessed similar to that in an ndarray.
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
print(s[0]) # 1
Retrieve the first three elements in the Series. If a : is inserted in front of it, all items from that index onwards will be extracted. If two parameters (with : between them) is used, items between the two indexes (not including the stop index):
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
print(s[:3]) # 3 4 5
A Series is like a fixed-size dictionary in that you can get and set values by index label.
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
print(s['a']) # 1
Retrieve the first three elements in the Series. If a : is inserted in front of it, all items from that index onwards will be extracted. If two parameters (with : between them) is used, items between the two indexes (not including the stop index):
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
print([['a','c','d']]) # 3 4 5
A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. You can think of it as a spreadsheet data representation.
A basic DataFrame, which can be created is an empty DataFrame.
s = pd.DataFrame()
print(s)
# output:
# Columns: []
# Index: []
The DataFrame can be created using a single list or a list of lists:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Output is:
index | Name | Age |
---|---|---|
0 | Alex | 10.0 |
1 | Bob | 12.0 |
2 | Clarke | 13.0 |
?> Observe, the type of Age column is a floating point.
All the ndarrays must be of same length. If index is passed, then the length of the index should equal to the length of the arrays.
If no index is passed, then by default, index will be range(n)
, where n is the array length.
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
Output is:
index | Name | Age |
---|---|---|
0 | 28 | Tome |
1 | 34 | Jack |
2 | 29 | Steve |
3 | 42 | Ricky |
?> Observe the values 0,1,2,3. They are the default index assigned to each using the function range(n)
.
List of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are by default taken as column names.
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)
Output is:
index | a | b | c |
---|---|---|---|
0 | 1 | 2 | NaN |
1 | 5 | 20 | 20.0 |
!> Column c is NaN at index 0.
Dictionary of series can be passed to form a DataFrame. The resultant index is the union of all the series indexes passed.
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)
Output is:
index | one | two |
---|---|---|
0 | 1.0 | 1 |
1 | 2.0 | 2 |
2 | 3.0 | 3 |
3 | NaN | 4 |
!> For the series one, there is no label "d" passed, but in the result, for the d label, NaN is appended with NaN.
Adding a new column is as easy as passing a Series:
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df['three'] = pd.Series([10,20,30],index=['a','b','c'])
print(df)
Output is:
index | one | two | three |
---|---|---|---|
0 | 1.0 | 1 | 10.0 |
1 | 2.0 | 2 | 20.0 |
2 | 3.0 | 3 | 30.0 |
3 | NaN | 4 | NaN |
Adding a new column using the existing columns in DataFrame:
df['four'] = df['one'] + df['three']
print(df)
Output as:
index | one | two | three | four |
---|---|---|---|---|
0 | 1.0 | 1 | 10.0 | 11.0 |
1 | 2.0 | 2 | 20.0 | 22.0 |
2 | 3.0 | 3 | 30.0 | 33.0 |
3 | NaN | 4 | NaN | NaN |
Columns can be deleted:
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df.pop('two')
print(df)
Output is:
index | value |
---|---|
one | 1.0 |
two | 2.0 |
2 | 3.0 |
3 | NaN |
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df.loc('b'))
Output is:
index | value |
---|---|
one | 2.0 |
two | 2.0 |
The result is a series with labels as column names of the DataFrame. And, the ame of the series is the label with which it is retrieved.
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df.iloc(2))
Output is:
index | value |
---|---|
one | 3.0 |
two | 3.0 |
Multiple rows can be selected using :
operator:
print(df.iloc(2:4))
Output is:
index | one | two |
---|---|---|
c | 3.0 | 3 |
d | NaN | 4 |
Add new rows to a DataFrame using the append function. This function will append the rows at the end.
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df.append(df2)
print(df)
Output is:
index | a | b |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
0 | 5 | 6 |
1 | 7 | 8 |
!> The index will also append to the DataFrame. To reset the index use the reset_index()
function. For this example: print(df.reset_index())
.
Use index label to delete or drop rows from a DataFrame. If label is duplicated, then multiple rows will be dropped.
If you observe, in the above example, the labels are duplicate. Let us drop a label and will see how many rows will get dropped.
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = df.append(df2)
df = df.drop(0)
print(df)
Output is:
index | a | b |
---|---|---|
1 | 3 | 4 |
1 | 7 | 8 |
!> In the above example, two rows were dropped because those two contain the same label 0.
Add new rows to a DataFrame using the append function. This function will append the rows at the end.
head()
returns the first n
rows (observe the index values). The default number of elements to display is five, but you may pass a custom number. tail()
returns the last n
rows (observe the index values). The default number of elements to display is five, but you may pass a custom number.
s = pd.Series(np.random.randn(4))
print(s.tail(2))
print(s.head(2))
Returns the transpose of the DataFrame. The rows and columns will interchange:
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df.T)
The transpose of the DataFrame is:
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|---|
Age | 25 | 26 | 25 | 23 | 30 | 29 | 23 |
Name | Tom | James | Ricky | Vin | Steve | Smith | Jack |
Rating | 4.23 | 3.24 | 3.98 | 2.56 | 3.2 | 4.6 | 3.8 |
Returns a tuple representing the dimensionality of the DataFrame. Tuple (a, b)
, where a
represents the number of rows and b
represents the number of columns.
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df.shape) # (7, 3)
Returns the number of elements in the DataFrame.
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack']),
'Age':pd.Series([25,26,25,23,30,29,23]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8])}
df = pd.DataFrame(d)
print(df.size) # 21
The following table list down the important functions available on the DataFrame object:
function | description |
---|---|
count() |
number of non-null observations |
sum() |
sum of values |
mean() |
mean of values |
median() |
median of values |
mode() |
mode of values |
std() |
standard deviation of the values |
min() |
minimum value |
max() |
maximum value |
abs() |
absolute value |
prod() |
product of values |
cumsum() |
cumulative Sum |
cumprod() |
cumulative product |
describe() |
summary of statistics |
For example, sum()
returns the sum of the values:
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.sum())
Output is:
column | value |
---|---|
Age | 382 |
Name | TomJamesRickyVinSteveSmithJackLeeDavidGasperBe... |
Rating | 44.92 |
sum()
takes one argument which is the axis to take the sum of. By default it takes the sum for every column.
mean()
returns the average value. Using the same example as above:
print(df.mean())
Output is:
column | value |
---|---|
Age | 31.833333 |
Rating | 3.743333 |
!> Notice that string columns are ignored.
The mean can also be found using:
print(df['Age'].sum() / df['Age'].count())
The describe()
function computes a summary of statistics pertaining to the DataFrame columns. It gives the count, mean, std and IQR values. Using the same DataFrame as above:
print(df.describe())
Output is:
Age | Rating | |
---|---|---|
count | 12.00000 | 12.00000 |
mean | 31.83333 | 3.743333 |
std | 9.232682 | 0.661628 |
min | 23.000000 | 2.560000 |
25% | 25.000000 | 3.230000 |
50% | 29.500000 | 3.790000 |
75% | 35.500000 | 4.132500 |
max | 51.000000 | 4.800000 |
Sorting a DataFrame by column:
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(d)
print(df.sort_values(by='Age')) # DataFrame will be sorted on Age column
Rename columns using rename()
. Using the DataFrame as above:
df = df.rename(columns={
'Name': 'N',
'Age': 'A',
'Rating': 'R'
})
print(df)
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format and is called delimiter. Other delimiters such as |
can also be used for CSV files. CSV files can be used to store our DataFrame object as a file.
To save a DataFrame as CSV file using |
as delimiter/separator:
df = pd.DataFrame(
np.random.rand(10, 4),
columns=['a', 'b', 'c', 'd']
)
df.to_csv('random.csv', sep='|')
Reading a CSV file in pandas is as easy as:
df = pd.read_csv('random.csv', delimiter='|')
print(df)