Pandas is a Python library that provides high-performance data structures and functions for data manipulation and analysis. It is one of the most popular libraries among data scientists who work in Python.

A DataFrame is a Pandas data structure for storing data in tabular form (rows and columns). A DataFrame consists of an index, column names, and the data itself. The index acts as a label for each row and, together with the column names, serves as an address for each data element. The data in a DataFrame can be heterogeneous, i.e., different columns can hold different data types.

In this post, we discuss the following ways of creating a Pandas DataFrame in Python:
- Creating DataFrame from a NumPy array
- Creating DataFrame from a Python List
- Creating DataFrame from a Python Dictionary
- Creating Empty DataFrames
- Creating DataFrame from a CSV file
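Before going through each method, here is a quick sketch of what "heterogeneous data" and "addressing an element" look like in practice. The sample values, row labels, and variable names are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data: one text column and one integer column
df = pd.DataFrame(
    {"Name": ["Tom", "Bob"], "Age": [25, 26]},
    index=["r1", "r2"],
)

# Each column keeps its own data type
print(df.dtypes)

# The row label and column name together address a single element
age_of_bob = df.loc["r2", "Age"]
print(age_of_bob)
```

Here loc takes the row label first and the column name second.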
Creating a DataFrame from a NumPy array
NumPy is another Python library used for scientific calculations and provides functions for manipulating large multidimensional arrays and matrices. You can create a DataFrame from a NumPy array as below:
# Import libraries
import numpy as np
import pandas as pd

# Create a NumPy array
array = np.array([[5, 6], [7, 8]])
print("Numpy Array:\n", array)

# Create DataFrame from array
df = pd.DataFrame(array)
print("Pandas DataFrame:\n", df)
Output
Numpy Array:
 [[5 6]
 [7 8]]
Pandas DataFrame:
    0  1
0  5  6
1  7  8
Specifying the index and columns for the dataframe
In the above example, the index (0, 1) and the column names (0, 1) are assigned automatically, as illustrated in the figure below.
You can also specify the index and column names when creating the DataFrame:
pd.DataFrame(array, columns=["Col1","Col2"], index=["Row1","Row2"])
Output
      Col1  Col2
Row1     5     6
Row2     7     8
Taking index and columns from the array
You may have a NumPy array in which the first row contains the column names and the first column contains the index labels. For example, suppose you have a NumPy array like this:
array = np.array([["", "Col1", "Col2"], ["Row1", 1, 2], ["Row2", 1, 2]])
print(array)
Output
[['' 'Col1' 'Col2']
 ['Row1' '1' '2']
 ['Row2' '1' '2']]
To create a DataFrame from the above NumPy array, slice out the data, the index labels, and the column names, and pass them to the Pandas DataFrame constructor:
df = pd.DataFrame(data=array[1:, 1:], index=array[1:, 0], columns=array[0, 1:])
print(df)
Output
     Col1 Col2
Row1    1    2
Row2    1    2
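A NumPy array holds a single data type, so mixing strings and numbers as above turns every value into a string (note the quotes in the printed array). If numeric columns are needed, one option is to convert them after building the DataFrame. A minimal sketch, assuming every data cell holds integer text; the name df_int is only illustrative:

```python
import numpy as np
import pandas as pd

array = np.array([["", "Col1", "Col2"], ["Row1", 1, 2], ["Row2", 1, 2]])
df = pd.DataFrame(data=array[1:, 1:], index=array[1:, 0], columns=array[0, 1:])

# The cells are strings at this point; convert every column to integers
df_int = df.astype(int)
print(df_int.dtypes)
```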
Creating DataFrame from a Python List
You can create a Pandas DataFrame from a Python list as below:
lst = ["Apple", "Orange", "Grapes", "Banana"]
pd.DataFrame(lst, columns=["Fruit"])
    Fruit
0   Apple
1  Orange
2  Grapes
3  Banana
You can also create one from nested lists, where each inner list becomes a row:
lst = [[1, 2, 3], [4, 5, 6]]
pd.DataFrame(lst, columns=["A", "B", "C"])
   A  B  C
0  1  2  3
1  4  5  6
Creating DataFrame from a Python Dictionary
When you pass a Python dictionary object to the DataFrame constructor function, the resulting dataframe will have the keys in the dictionary as column names.
data = {"name": ["Tom", "Bob", "Phil"], "Age": [25, 26, 27]}
pd.DataFrame(data)
   name  Age
0   Tom   25
1   Bob   26
2  Phil   27
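The index argument shown earlier for NumPy arrays can be combined with a dictionary as well. A short sketch, with row labels made up for illustration:

```python
import pandas as pd

data = {"name": ["Tom", "Bob", "Phil"], "Age": [25, 26, 27]}

# Keys become column names; the index list supplies the row labels
df = pd.DataFrame(data, index=["a", "b", "c"])
print(df)
```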
Creating Empty DataFrames
Calling the DataFrame constructor without any arguments creates an empty DataFrame.
pd.DataFrame()
Empty DataFrame
Columns: []
Index: []
Optionally, you can pass a list to the columns argument to define the columns of the DataFrame.
pd.DataFrame(columns=["Col1", "Col2"])
Empty DataFrame
Columns: [Col1, Col2]
Index: []
Empty DataFrame of NaNs
You can create a DataFrame filled with NaNs by also passing the index argument, as below.
pd.DataFrame(columns=["Col1", "Col2"], index=range(3))
  Col1 Col2
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
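Note that a DataFrame filled with NaNs still has rows, so Pandas does not report it as empty. The empty attribute makes the difference visible. A small sketch, with variable names chosen only for illustration:

```python
import pandas as pd

no_rows = pd.DataFrame(columns=["Col1", "Col2"])
nan_rows = pd.DataFrame(columns=["Col1", "Col2"], index=range(3))

print(no_rows.empty)   # True: there are columns but no rows
print(nan_rows.empty)  # False: three rows exist, even though every value is NaN
print(nan_rows.shape)  # (3, 2)
```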
Creating DataFrame from a CSV file
The read_csv function in Pandas allows you to create a DataFrame with data populated from a CSV (comma-separated values) file.
Let's say you have a CSV file named myfile.csv with the following data:
ID,Name,Score
5010,Peter,75
8321,Sandra,95
1532,Kumar,98
You can create a dataframe from this CSV file by calling the read_csv function.
import pandas as pd

df = pd.read_csv("myfile.csv")
print(df)
Output
     ID    Name  Score
0  5010   Peter     75
1  8321  Sandra     95
2  1532   Kumar     98
Note that in the above example, the first row is taken as the column names. If the CSV file does not contain column names (as in the data below), you can specify them explicitly using the names argument.
5010,Peter,75
8321,Sandra,95
1532,Kumar,98
df = pd.read_csv("myfile.csv", names=["ID","Name","Score"])
Output
     ID    Name  Score
0  5010   Peter     75
1  8321  Sandra     95
2  1532   Kumar     98
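To try this without creating a file on disk, read_csv also accepts a file-like object. The sketch below feeds the same headerless rows through io.StringIO (an in-memory stand-in for myfile.csv, used here only for illustration) and supplies the column names with the names argument.

```python
import io

import pandas as pd

# In-memory stand-in for a headerless CSV file
csv_text = "5010,Peter,75\n8321,Sandra,95\n1532,Kumar,98\n"

# names provides the column labels, so the first row is read as data
df = pd.read_csv(io.StringIO(csv_text), names=["ID", "Name", "Score"])
print(df)
```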