Selecting, Slicing and Filtering data in a Pandas DataFrame

Posted on 16th October 2019

One of the essential features that a data analysis tool must provide users for working with large data-sets is the ability to select, slice, and filter data easily. Pandas provide this feature through the use of DataFrames. A data frame consists of data, which is arranged in rows and columns, and row and column labels. You can easily select, slice or take a subset of the data in several different ways, for example by using labels, by index location, by value and so on. Here we demonstrate some of these operations using a sample DataFrame.

First and foremost, let's create a DataFrame with a dataset that contains 5 rows and 4 columns and values from ranging from 0 to 19. We will use the arange() and reshape() functions from NumPy library to create a two-dimensional array and this array is passed to the Pandas DataFrame constructor function.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5,4), columns=["A","B","C","D"])
print(df)

Output:

    A   B   C   D
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

Select rows and columns using labels

You can select rows and columns in a Pandas DataFrame by using their corresponding labels.

  • To select a single column.

    df.loc[:,"A"]
      or
    df["A"]
      or
    df.A
    

    Output:

    0     0
    1     4
    2     8
    3    12
    4    16
    Name: A, dtype: int32
    
  • To select multiple columns.

    df.loc[:, ["A", "C"]]
      or
    df[["A", "C"]]
    

    Output:

        A   C
    0   0   2
    1   4   6
    2   8  10
    3  12  14
    4  16  18
    
  • Select a row by its label.

     df.loc[0]
    

    Output:

    A    0
    B    1
    C    2
    D    3
    Name: 0, dtype: int32
    
  • Select multiple rows by label.

    df.loc[[0,1]]
    

    Output:

       A  B  C  D
    0  0  1  2  3
    1  4  5  6  7
    
  • Accessing values by row and column label.

    df.loc[0,"D"]
    

    Output:

    3
    
  • Accessing values from multiple columns of same row.

    df.loc[1,["A", "C"]]
    

    Output:

    A    4
    C    6
    Name: 1, dtype: int32
    
  • Accessing values from multiple rows but same column.

    df.loc[[0,1],"B"]
    

    Output:

    0    1
    1    5
    Name: B, dtype: int32
    

Select by Index Position

You can select data from a Pandas DataFrame by its location. Note, Pandas indexing starts from zero.

  • Select a row by index location.

    df.iloc[0]
    

    Output:

    A    0
    B    1
    C    2
    D    3
    Name: 0, dtype: int32
    
  • Select a column by index location.

    df.iloc[:, 3]
    

    Output:

    0     3
    1     7
    2    11
    3    15
    4    19
    Name: D, dtype: int32
    
  • Select data at the specified row and column location.

    df.iloc[0,3]
    

    Output:

    3
    
  • Select list of rows and columns.

    df.iloc[[1,2],[0, 1]]
    

    Output:

       A  B
    1  4  5
    2  8  9
    

Slicing Rows and Columns using labels

You can select a range of rows or columns using labels or by position. To slice by labels you use loc attribute of the DataFrame.

  • Slice rows by label.

    df.loc[1:3, :]
    

    Output:

        A   B   C   D
    1   4   5   6   7
    2   8   9  10  11
    3  12  13  14  15
    
  • Slice columns by label.

    df.loc[:, "B":"D"]
    

    Output:

        B   C   D
    0   1   2   3
    1   5   6   7
    2   9  10  11
    3  13  14  15
    4  17  18  19
    
  • Slice rows and columns by label.

     
    df.loc[1:3, "B":"D"]
    

    Output:

        B   C   D
    1   5   6   7
    2   9  10  11
    3  13  14  15
    

Slicing Rows and Columns by position

To slice a Pandas dataframe by position use the iloc attribute. Remember index starts from 0 to (number of rows/columns - 1).

  • To slice rows by index position.

    df.iloc[0:2,:]
    

    Output:

       A  B  C  D
    0  0  1  2  3
    1  4  5  6  7
    
  • To slice columns by index position.

    df.iloc[:,1:3]
    

    Output:

        B   C
    0   1   2
    1   5   6
    2   9  10
    3  13  14
    4  17  18
    
  • To slice row and columns by index position.

    df.iloc[1:2,1:3]
    

    Output:

       B  C
    1  5  6
    
    df.iloc[:2,:2]
    

    Output:

       A  B
    0  0  1
    1  4  5
    

Subsetting by boolean conditions

You can use boolean conditions to obtain a subset of the data from the DataFrame.

  • Select rows based on column value

    To select all rows whose column contain the specified value(s).

    df[df.B == 9]
     or
    df.loc[df.B == 9]
    

    Output:

      A  B   C   D
    2  8  9  10  11
    

    You can also use the isin() method

    df[df.B.isin([9,13])]

    Output:

        A   B   C   D
    2   8   9  10  11
    3  12  13  14  15
    
  • Rows that match multiple boolean conditions.

    df[(df.B == 5) | (df.C == 10)]
    

    Output:

       A  B   C   D
    1  4  5   6   7
    2  8  9  10  11
    
  • Select rows whose column does not contain the specified values.

     df[~df.B.isin([9,13])]
    

    Output:

        A   B   C   D
    0   0   1   2   3
    1   4   5   6   7
    4  16  17  18  19
    
  • Select columns based on row value

    To select columns whose rows contain the specified value.

    df.loc[:,df.isin([9,12]).any()]
    

    Output:

        A   B
    0   0   1
    1   4   5
    2   8   9
    3  12  13
    4  16  17
    
  • Subsetting using filter method

    Subsets can be created using the filter method like below.

    df.filter(items=["A","D"])
    

    Output:

        A   D
    0   0   3
    1   4   7
    2   8  11
    3  12  15
    4  16  19
    
    df.filter(like="2", axis=0)
    

    Output:

       A  B   C   D
    2  8  9  10  11
    
    df.filter(regex="[^AB]")
    

    Output:

        C   D
    0   2   3
    1   6   7
    2  10  11
    3  14  15
    4  18  19
    

Post a comment

Comments

Nothing yet..be the first to share wisdom.