One of the essential features that a data analysis tool must provide users for working with large data-sets is the ability to select, slice, and filter data easily. Pandas provide this feature through the use of DataFrames. A data frame consists of data, which is arranged in rows and columns, and row and column labels. You can easily select, slice or take a subset of the data in several different ways, for example by using labels, by index location, by value and so on. Here we demonstrate some of these operations using a sample DataFrame.
First and foremost, let's create a DataFrame with a dataset that contains 5 rows and 4 columns and values from ranging from 0 to 19. We will use the arange()
and reshape()
functions from NumPy library to create a two-dimensional array and this array is passed to the Pandas DataFrame constructor function.
import numpy as np import pandas as pd df = pd.DataFrame(np.arange(20).reshape(5,4), columns=["A","B","C","D"]) print(df)
Output:
A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 3 12 13 14 15 4 16 17 18 19
Select rows and columns using labels
You can select rows and columns in a Pandas DataFrame by using their corresponding labels.
To select a single column.
df.loc[:,"A"] or df["A"] or df.A
Output:
0 0 1 4 2 8 3 12 4 16 Name: A, dtype: int32
To select multiple columns.
df.loc[:, ["A", "C"]] or df[["A", "C"]]
Output:
A C 0 0 2 1 4 6 2 8 10 3 12 14 4 16 18
Select a row by its label.
df.loc[0]
Output:
A 0 B 1 C 2 D 3 Name: 0, dtype: int32
Select multiple rows by label.
df.loc[[0,1]]
Output:
A B C D 0 0 1 2 3 1 4 5 6 7
Accessing values by row and column label.
df.loc[0,"D"]
Output:
3
Accessing values from multiple columns of same row.
df.loc[1,["A", "C"]]
Output:
A 4 C 6 Name: 1, dtype: int32
Accessing values from multiple rows but same column.
df.loc[[0,1],"B"]
Output:
0 1 1 5 Name: B, dtype: int32
Select by Index Position
You can select data from a Pandas DataFrame by its location. Note, Pandas indexing starts from zero.
Select a row by index location.
df.iloc[0]
Output:
A 0 B 1 C 2 D 3 Name: 0, dtype: int32
Select a column by index location.
df.iloc[:, 3]
Output:
0 3 1 7 2 11 3 15 4 19 Name: D, dtype: int32
Select data at the specified row and column location.
df.iloc[0,3]
Output:
3
Select list of rows and columns.
df.iloc[[1,2],[0, 1]]
Output:
A B 1 4 5 2 8 9
Slicing Rows and Columns using labels
You can select a range of rows or columns using labels or by position. To slice by labels you use loc attribute of the DataFrame.
Slice rows by label.
df.loc[1:3, :]
Output:
A B C D 1 4 5 6 7 2 8 9 10 11 3 12 13 14 15
Slice columns by label.
df.loc[:, "B":"D"]
Output:
B C D 0 1 2 3 1 5 6 7 2 9 10 11 3 13 14 15 4 17 18 19
Slice rows and columns by label.
df.loc[1:3, "B":"D"]
Output:
B C D 1 5 6 7 2 9 10 11 3 13 14 15
Slicing Rows and Columns by position
To slice a Pandas dataframe by position use the iloc attribute. Remember index starts from 0 to (number of rows/columns - 1).
To slice rows by index position.
df.iloc[0:2,:]
Output:
A B C D 0 0 1 2 3 1 4 5 6 7
To slice columns by index position.
df.iloc[:,1:3]
Output:
B C 0 1 2 1 5 6 2 9 10 3 13 14 4 17 18
To slice row and columns by index position.
df.iloc[1:2,1:3]
Output:
B C 1 5 6
df.iloc[:2,:2]
Output:
A B 0 0 1 1 4 5
Subsetting by boolean conditions
You can use boolean conditions to obtain a subset of the data from the DataFrame.
Select rows based on column value
To select all rows whose column contain the specified value(s).
df[df.B == 9] or df.loc[df.B == 9]
Output:
A B C D 2 8 9 10 11
You can also use the isin() method
df[df.B.isin([9,13])]
Output:
A B C D 2 8 9 10 11 3 12 13 14 15
Rows that match multiple boolean conditions.
df[(df.B == 5) | (df.C == 10)]
Output:
A B C D 1 4 5 6 7 2 8 9 10 11
Select rows whose column does not contain the specified values.
df[~df.B.isin([9,13])]
Output:
A B C D 0 0 1 2 3 1 4 5 6 7 4 16 17 18 19
Select columns based on row value
To select columns whose rows contain the specified value.
df.loc[:,df.isin([9,12]).any()]
Output:
A B 0 0 1 1 4 5 2 8 9 3 12 13 4 16 17
Subsetting using filter method
Subsets can be created using the filter method like below.
df.filter(items=["A","D"])
Output:
A D 0 0 3 1 4 7 2 8 11 3 12 15 4 16 19
df.filter(like="2", axis=0)
Output:
A B C D 2 8 9 10 11
df.filter(regex="[^AB]")
Output:
C D 0 2 3 1 6 7 2 10 11 3 14 15 4 18 19