Different ways to create a Pandas DataFrame

Posted on 14th October 2019

Pandas is a Python library that provides high-performance data structures and functions for data manipulation and analysis. It is one of the most popular libraries among data scientists who work in Python. A DataFrame is a Pandas data structure for storing data in tabular form (rows and columns). A DataFrame consists of an index, column names and the data itself. The index is a label for each row and, together with the column names, acts as an address for each data element. The data in a DataFrame can be heterogeneous, i.e., different columns can hold different data types. In this post we discuss the following ways of creating a Pandas DataFrame in Python.

Creating a DataFrame from a NumPy array

NumPy is another Python library, used for scientific computing, that provides functions for manipulating large multidimensional arrays and matrices. You can create a DataFrame from a NumPy array as below:

# Import libraries
import numpy as np
import pandas as pd

# Create a NumPy array
array = np.array([[5, 6], [7, 8]])
print("Numpy Array:\n", array)

# Create a DataFrame from the array
df = pd.DataFrame(array)
print("Pandas DataFrame:\n", df)

Output

Numpy Array:
 [[5 6]
 [7 8]]
Pandas DataFrame:
   0  1
0  5  6
1  7  8

Specifying the index and columns for the dataframe

In the above example, the index (0, 1) and the column names (0, 1) are assigned automatically, as illustrated in the figure below.

[Figure: Pandas DataFrame]

You can also specify the index and column names while creating the DataFrame.

pd.DataFrame(array, columns=["Col1","Col2"], index=["Row1","Row2"])

Output

      Col1  Col2
Row1     5     6
Row2     7     8
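
Since the index and column names together act as an address for each element, a labelled DataFrame can be queried by those labels. The following is a minimal sketch that reuses the array from above; the variable name value is only illustrative.

```python
import numpy as np
import pandas as pd

array = np.array([[5, 6], [7, 8]])
df = pd.DataFrame(array, columns=["Col1", "Col2"], index=["Row1", "Row2"])

# .loc looks up a value by row label and column label
value = df.loc["Row2", "Col1"]
print(value)  # 7

# Selecting a single column returns a Series indexed by the row labels
print(df["Col2"])
```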

Taking index and columns from the array

You may have a NumPy array in which the first row contains the column names and the first column contains the index labels. For example, you have a NumPy array like this:

array =  np.array([["","Col1","Col2"],["Row1",1,2],["Row2",1,2]])
print(array)

Output

[['' 'Col1' 'Col2']
 ['Row1' '1' '2']
 ['Row2' '1' '2']]
 

To create a DataFrame from the above NumPy array, call the Pandas DataFrame constructor as below:

df = pd.DataFrame(data=array[1:,1:], index=array[1:,0], columns=array[0,1:])
print(df)

Output

     Col1 Col2
Row1    1    2
Row2    1    2
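
One thing to watch for: a NumPy array holds a single data type, so mixing labels and numbers in one array turns every element into a string (note the quotes around '1' and '2' in the printed array). The DataFrame built from it therefore holds strings too. A possible way to get numbers back, sketched here with the illustrative name df_int, is to convert with astype:

```python
import numpy as np
import pandas as pd

array = np.array([["", "Col1", "Col2"], ["Row1", 1, 2], ["Row2", 1, 2]])
df = pd.DataFrame(data=array[1:, 1:], index=array[1:, 0], columns=array[0, 1:])

# The values are still strings at this point, e.g. "1" rather than 1
print(type(df.loc["Row1", "Col1"]))

# Convert every column to integers
df_int = df.astype(int)
print(df_int.dtypes)
```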

Creating DataFrame from a Python List

You can create a Pandas DataFrame from a Python list as below:

lst = ["Apple", "Orange", "Grapes", "Banana"]
pd.DataFrame(lst, columns=["Fruit"])

Output

    Fruit
0   Apple
1  Orange
2  Grapes
3  Banana

You can also create one from nested lists, where each inner list becomes a row:

lst = [[1, 2, 3], [4, 5, 6]]
pd.DataFrame(lst, columns=["A", "B", "C"])

Output

   A  B  C
0  1  2  3
1  4  5  6
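
When the data starts out as separate lists, one per column, the lists can be paired up into rows first. This is a small sketch of that idea using the built-in zip; the lists names and ages are made-up sample data.

```python
import pandas as pd

names = ["Tom", "Bob", "Phil"]
ages = [25, 26, 27]

# zip pairs the lists element by element; each pair becomes one row
rows = list(zip(names, ages))
df = pd.DataFrame(rows, columns=["Name", "Age"])
print(df)
```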

Creating a DataFrame from a Python dictionary object

When you pass a Python dictionary object to the DataFrame constructor function, the resulting dataframe will have the keys in the dictionary as column names.

# Named "data" rather than "dict" to avoid shadowing the built-in dict type
data = {"name": ["Tom", "Bob", "Phil"], "Age": [25, 26, 27]}
pd.DataFrame(data)

Output

   name  Age
0   Tom   25
1   Bob   26
2  Phil   27
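
A dictionary is also a convenient way to see that a DataFrame can hold heterogeneous data: each column keeps its own data type. The sketch below additionally passes the optional index argument to label the rows; the labels "a", "b" and "c" are made up for illustration.

```python
import pandas as pd

data = {"name": ["Tom", "Bob", "Phil"], "Age": [25, 26, 27]}

# The optional index argument supplies row labels (these labels are made up)
df = pd.DataFrame(data, index=["a", "b", "c"])

# Each column keeps its own data type
print(df.dtypes)
print(df.loc["b", "Age"])  # 26
```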

Creating Empty DataFrames

Calling the DataFrame function without any arguments will create an empty DataFrame.

pd.DataFrame()

Output

Empty DataFrame
Columns: []
Index: []

Optionally, you can pass a list to the columns argument to define the columns in the DataFrame.

pd.DataFrame(columns=["Col1", "Col2"])

Output

Empty DataFrame
Columns: [Col1, Col2]
Index: []
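
An empty DataFrame with named columns is often used as a starting point that gets filled later. One possible way to do that, shown here as a rough sketch with made-up values, is to assign to a row label that does not exist yet.

```python
import pandas as pd

df = pd.DataFrame(columns=["Col1", "Col2"])

# Assigning to a row label that does not exist yet adds a new row
df.loc[0] = [5, 6]
df.loc[1] = [7, 8]
print(df)
```

Growing a DataFrame one row at a time tends to be slow for large data, so collecting the rows in a list and building the DataFrame once is usually the better choice.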

Empty DataFrame of NaNs

You can create a DataFrame filled with NaNs by also passing the index argument as below.

pd.DataFrame(columns=["Col1", "Col2"], index=range(3))

Output

  Col1 Col2
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
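
If the columns are meant to hold numbers, an alternative worth considering is to pass NaN explicitly as the data, which gives floating-point columns. This is a minimal sketch of that variant, not the only way to do it; the value 1.5 is arbitrary.

```python
import numpy as np
import pandas as pd

# Passing a scalar as data fills every cell with that value
df = pd.DataFrame(np.nan, index=range(3), columns=["Col1", "Col2"])
print(df.dtypes)

# Cells can then be filled in by label
df.loc[0, "Col1"] = 1.5
print(df)
```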

Creating DataFrame from a CSV file

The read_csv function in Pandas allows you to create a DataFrame with data populated from a CSV (comma-separated values) file.

Let's say you have a CSV file named myfile.csv with the following data.

ID,Name,Score
5010,Peter,75
8321,Sandra,95
1532,Kumar,98

You can create a dataframe from this CSV file by calling the read_csv function.

import pandas as pd

df = pd.read_csv("myfile.csv")
print(df)

Output

     ID    Name  Score
0  5010   Peter     75
1  8321  Sandra     95
2  1532   Kumar     98

Note that in the above example, the first row is taken as the column names. If the CSV file does not contain column names (like below), you can specify them explicitly using the names argument.

5010,Peter,75
8321,Sandra,95
1532,Kumar,98

df = pd.read_csv("myfile.csv", names=["ID","Name","Score"])

Output

     ID    Name  Score
0  5010   Peter     75
1  8321  Sandra     95
2  1532   Kumar     98
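
To try read_csv without creating a file on disk, the same data can be supplied through an in-memory text buffer. The sketch below does that with io.StringIO and also uses the index_col argument so that the ID column becomes the index; the variable name csv_text is only illustrative.

```python
import io
import pandas as pd

csv_text = "ID,Name,Score\n5010,Peter,75\n8321,Sandra,95\n1532,Kumar,98\n"

# read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text), index_col="ID")
print(df)

# With ID as the index, rows can be looked up by their ID
print(df.loc[8321, "Name"])  # Sandra
```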
