Pandas dataframes
Reading CSV tabular data
Pandas is a widely-used Python library for working with tabular data, borrows heavily from R’s dataframes,
built on top of NumPy. Let’s try reading some public-domain data about Jeopardy questions with pandas
(31MB
file, so it might take a while):
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv")
data.shape # shape is a member variable inside data => 216930 rows, 7 columns
print(data)
data # this prints out the table nicely in Jupyter Notebook!
data.info() # info is a *member method inside data*
data.head(10) # first 10 rows
data.tail() # last 5 rows
data.columns # names of the columns
Let’s download the same online data as a file into our current directory:
!wget https://raw.githubusercontent.com/razoumov/publish/master/jeopardy.csv
This time let’s read from this file naming the rows by their respective answer:
data = pd.read_csv("jeopardy.csv", index_col='Answer')
data
data.shape # one fewer column
data.columns
data.index # row names
Subsetting
Pandas lets you subset elements using either their numerical indices or their row/column names. Long time ago
Pandas used to have a single function to do both. Now there are two separate functions, .iloc
and
.loc
. Let’s print one element:
data.iloc[0,5] # using row/column numbers
data.loc["Sinking of the Titanic", "Question"] # using row/column names
Printing a row:
data.loc['Sinking of the Titanic',:] # usual Python's slicing notation - show all columns in that row
data.loc['Sinking of the Titanic'] # exactly the same
data.loc['Sinking of the Titanic',] # exactly the same
Printing a column:
data.loc[:,'Category'] # show all rows in that column
data['Category'] # exactly the same; single index refers to columns
data.Category # most compact notation; does not work with numerical-only names
Combining .iloc
and .loc
– let’s say we want to retrieve a cell by its row number and its column name:
data.iloc[0] # returns a pandas series (with many rows) => use .loc to select its row
data.iloc[0].loc["Question"] # combining numbers and names
data.iloc[0]["Question"] # the same
Printing a range:
data.iloc[10:15] # rows 10,11,12,13,14
data.columns
data.iloc[10:15].loc[:,'Category':'Question'] # rows 10-14, columns 'Category' - 'Question'
data.iloc[10:15].loc[:,['Category','Question']] # rows 10-14, columns 'Category' and 'Question'
data.loc['Copernicus', 'Air Date':'Category'] # a range of columns ... why so many lines?
data.loc['Copernicus', ['Air Date','Question']] # print two selected columns
data['Category'] # many different categories, truncated output ...
list(data['Category']) # print all categories ... huge list
len(set(data['Category'])) # 27,983 unique categories
data['Category']=='HISTORY' # return either True or False for each row
You can use the last expression as a mask to return only those rows where Category
is “HISTORY”:
data.loc[data['Category']=='HISTORY']
data.loc[data['Category']=='HISTORY'].shape # 349 matches
data.loc[data['Category']=='HISTORY'].to_csv("history.csv") # write to a file
Let’s take a look at the value column:
list(data['Value']) # some of these contain nan
data.shape # original table: 216,930 rows
clean = data.dropna() # drop rows with nan's
clean.shape # after dropping rows with missing values: 213,144 rows
clean['Value'] # one column
clean['Value'].apply(lambda x: type(x)) # show the type of each element (fixed for each column)
clean['Value'].apply(lambda x: x.replace('$','')) # remove all $ signs
values = clean['Value'].apply(lambda x: int(x.replace('$','').replace(',','')))
mask = values>5000
clean[mask] # only show rows with Value > $5000
clean[mask].shape # thera are 345 such rows
Let’s replace the “Values” column in-place – here using the same expression on the right-hand side as before:
clean.loc[:,'Value'] = clean['Value'].apply(lambda x: int(x.replace('$','').replace(',','')))
clean
Now clean
’s Value
column is all numerical.
Question 11.1
Explain in simple terms what .idxmin()
and .idxmax()
do in the short program below. When would you use these
methods?
clean.idxmin()
clean.idxmax()
Hint: Try running these:
clean.loc["Freddie And The Dreamers"]
clean.loc["Suriname"]
or use help pages. This simpler example could also help:
col1 = ["1","a","A"]
col2 = ["4","b","B"]
data = pd.DataFrame({'a': col1, 'b': col2}) # dataframe from a dictionary
data.idxmin()
Finally, let’s check what time period is covered by these data:
data["Air Date"]
data["Air Date"][0][-2:] # first row, last two digits is the year
year = data["Air Date"].apply(lambda x: x[-2:]) # last two digits of the year from all rows
year.min(); year.max() # '00' and '99' - not very informative, wraps at the turn of the century
for y in range(100):
twoDigits = str(y).zfill(2)
print(twoDigits, sum(year==twoDigits))
This shows that this table covers years from 1984 to 2012.
Creating a dataframe from scratch
How do you create a dataframe from scratch? There are many ways; perhaps, the easiest is by defining columns (as you saw in the last exercise):
col1 = [1,2,3]
col2 = [4,5,6]
pd.DataFrame({'a': col1, 'b': col2}) # dataframe from a dictionary
We can index (assign names to) the rows with this syntax:
pd.DataFrame({'a': col1, 'b': col2}, index=['a1','a2','a3'])
Three solutions to a classification problem
Fizz buzz is a children’s game to practice divisions. Players take turn counting out loud while replacing:
- any number divisible by 3 with the word “Fizz”,
- any number divisible by 5 with the word “Buzz”,
- any number divisible by both 3 and 5 with the word “FizzBuzz”.
Let’s implement this in pandas! First, create a simple dataframe from scratch:
import pandas as pd
import numpy as np
df = pd.DataFrame()
size = 10_000
df['number'] = np.arange(1, size+1) # create a column called "number" containing 1,2,...,size
Define for pretty printing:
def show(frame):
print(df.tail(15).to_string(index=False)) # print last 15 rows without the row index
show(df)
Let’s built a new column response
containing either “Fizz” or “Buzz” or “FizzBuzz” or the original
number, based on the number
value in that row. Let’s start by defining a function to process a row:
def count(row):
if (row['number'] % 3 == 0) and (row['number'] % 5 == 0):
return 'FizzBuzz'
elif row['number'] % 3 == 0:
return 'Fizz'
elif row['number'] % 5 == 0:
return 'Buzz'
else:
return str(row['number'])
This is how you would use this function:
count(df.iloc[2]) # returns 'Fizz'
count(df.iloc[14]) # returns 'FizzBuzz'
(1) We can apply this function to each row in a loop:
%%timeit
for index, row in df.iterrows():
df.loc[index, 'response'] = count(row)
413 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
show(df)
(2) We can use df.apply()
to apply this function to each row:
%%timeit
df['response'] = df.apply(count, axis=1) # axis=1 means apply along the column
69.1 ms ± 380 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
show(df)
(3) Or we could use a mask to only assign correct responses to the corresponding rows:
%%timeit
df['response'] = df['number'].astype(str)
df.loc[df['number'] % 3==0, 'response'] = 'Fizz'
df.loc[df['number'] % 5==0, 'response'] = 'Buzz'
df.loc[(df['number'] % 3==0) & (df['number'] % 5==0), 'response'] = 'FizzBuzz'
718 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
show(df)