8.2 Introduction to Pandas

Pandas is an open-source data manipulation and analysis library for Python. It provides high-performance, easy-to-use data structures (like DataFrames and Series) and data analysis tools, making it an essential library for data scientists and analysts.

1. What is Pandas?

Pandas is built on top of NumPy and integrates seamlessly with many other Python libraries, including Matplotlib for data visualization and scikit-learn for machine learning. The library provides two primary data structures:

  • Series: A one-dimensional labeled array, like a column in a spreadsheet or a database table.
  • DataFrame: A two-dimensional table of data, similar to an Excel spreadsheet or a SQL table, consisting of rows and columns.
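
To make the relationship between the two structures concrete, here is a minimal sketch. The names and values are illustrative only. It builds a labeled Series and then a DataFrame whose columns are themselves Series.

```python
import pandas as pd

# A Series: one-dimensional values, each with a label in the index
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(ages['Bob'])  # look up a value by its label

# A DataFrame: a table in which every column is a Series sharing one index
people = pd.DataFrame({
    'Age': ages,
    'City': pd.Series(['New York', 'Los Angeles', 'Chicago'], index=ages.index),
})
print(type(people['Age']))  # selecting a single column gives back a Series
print(people.shape)         # (rows, columns)
```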

Pandas helps with various tasks such as data cleaning, exploration, transformation, and analysis. It also supports handling time-series data and merging and joining datasets.
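
As a rough illustration of merging and joining, the sketch below combines two small, made-up tables on a shared 'Name' column using pd.merge. The table contents are assumptions chosen only to show how an inner join and a left join differ.

```python
import pandas as pd

# Two hypothetical tables that share a 'Name' column
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
cities = pd.DataFrame({'Name': ['Alice', 'Bob'], 'City': ['New York', 'Los Angeles']})

# Inner join: keep only the names present in both tables
inner = pd.merge(people, cities, on='Name', how='inner')

# Left join: keep every row of 'people'; a missing City becomes NaN
left = pd.merge(people, cities, on='Name', how='left')

print(inner)
print(left)
```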

2. Why Use Pandas?

  • Data Cleaning: Simplifies handling missing data, duplicate entries, and data type conversions.
  • Data Analysis: Provides concise ways to aggregate, summarize, and analyze large datasets.
  • Data Transformation: Supports reshaping, grouping, and pivoting datasets.
  • Data I/O: Reads data from and writes data to many formats, including CSV, Excel, SQL databases, and JSON.
  • Time Series Support: Includes built-in features for time-series data, such as resampling, frequency conversion, and time-based indexing.
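
The time-series features in the last item can be sketched as follows. The daily readings are invented for illustration. The example selects a date range by string and then resamples daily values into weekly totals.

```python
import pandas as pd

# Hypothetical daily readings: 14 days starting Monday 2024-01-01
days = pd.date_range('2024-01-01', periods=14, freq='D')
readings = pd.Series(range(14), index=days)

# Time-based indexing: slice by date strings (both endpoints are included)
first_week = readings.loc['2024-01-01':'2024-01-07']

# Resampling: aggregate daily values into weekly totals (weeks end on Sunday)
weekly = readings.resample('W').sum()

print(first_week.sum())
print(weekly)
```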

3. Installing Pandas

You can install Pandas using pip:

pip install pandas

Alternatively, if you're using Anaconda, Pandas can be installed through conda:

conda install pandas
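
One simple way to check that the installation worked is to import the library and print its version. The version string you see depends on your environment.

```python
import pandas as pd

# If this import succeeds, pandas is installed in the active environment
print(pd.__version__)
```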

4. Creating and Manipulating DataFrames

A DataFrame is the core data structure in Pandas. It stores and manipulates two-dimensional data. DataFrames can be created from in-memory objects such as dictionaries and lists, or by reading external files such as CSVs.

a. Creating a DataFrame
import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
b. Creating a DataFrame from CSV
df = pd.read_csv('data.csv')
print(df.head())  # Display first 5 rows of the DataFrame
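
The snippet above assumes a file named data.csv exists. For a version that runs without any file on disk, the sketch below feeds read_csv an in-memory string instead. The CSV contents are made up and simply mirror the earlier dictionary example.

```python
import io
import pandas as pd

# Stand-in for the contents of data.csv (hypothetical)
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df.head(2))  # first 2 rows
print(df.dtypes)   # column types inferred from the data

# to_csv with no path returns the CSV text; index=False omits the row labels
out = df.to_csv(index=False)
print(out)
```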

5. Basic Operations on DataFrames

Pandas allows you to perform various operations on DataFrames.

a. Viewing the Data
  • df.head(): Returns the first 5 rows of the DataFrame by default; pass a number to change the count, e.g. df.head(10).
  • df.tail(): Returns the last 5 rows by default.
  • df.shape: An attribute (not a method) holding a (rows, columns) tuple.
  • df.info(): Prints a summary of the DataFrame, including column data types and non-null counts.
print(df.head())
print(df.shape)
df.info()  # info() prints its summary directly and returns None, so no print() is needed
b. Accessing Columns

You can access specific columns of a DataFrame as follows:

# Accessing a single column
print(df['Name'])

# Accessing multiple columns
print(df[['Name', 'City']])
c. Accessing Rows

Rows can be accessed using .iloc[] (by integer position) or .loc[] (by index label). With the default integer index, position and label happen to coincide.

# Accessing the first row (index 0)
print(df.iloc[0])

# Accessing a specific row by label (index 1)
print(df.loc[1])
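
The difference between position and label is easier to see when the index is not the default 0, 1, 2. The following sketch uses hypothetical string labels to contrast the two accessors.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=['alice', 'bob', 'charlie'],  # hypothetical string labels
)

print(df.iloc[1])             # second row, selected by position
print(df.loc['bob'])          # the same row, selected by label
print(df.loc['alice':'bob'])  # label slices include both endpoints (2 rows)
print(df.iloc[0:1])           # position slices exclude the end (1 row)
```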

6. Data Manipulation

Pandas offers a wide range of methods for manipulating and transforming data.

a. Sorting Data

You can sort a DataFrame by one or more columns.

# Sort by Age in ascending order
df_sorted = df.sort_values(by='Age')
print(df_sorted)

# Sort by Age in descending order
df_sorted_desc = df.sort_values(by='Age', ascending=False)
print(df_sorted_desc)
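
Sorting by more than one column works the same way, with a list passed to by. A minimal sketch, using made-up rows with tied ages so the second sort key matters:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [30, 25, 30, 25],
})

# Sort by Age (descending), then break ties by Name (ascending)
df_multi = df.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(df_multi)

# The original row labels are kept; reset_index renumbers them from 0
print(df_multi.reset_index(drop=True))
```
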
b. Filtering Data

You can filter the rows of a DataFrame based on conditions.

# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
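
Conditions can also be combined. The sketch below reuses the sample data from earlier in this lesson. Each condition needs its own parentheses, and the element-wise operators & (and) and | (or) are used in place of Python's and / or.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Both conditions must hold: Age at least 30 AND City is not Chicago
older_not_chicago = df[(df['Age'] >= 30) & (df['City'] != 'Chicago')]
print(older_not_chicago)

# Match any value from a list with isin()
in_cities = df[df['City'].isin(['New York', 'Chicago'])]
print(in_cities)
```
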
c. Adding/Modifying Columns
# Adding a new column
df['Salary'] = [50000, 60000, 70000]

# Modifying an existing column
df['Age'] = df['Age'] + 1  # Add 1 to each value in the 'Age' column
print(df)

7. Handling Missing Data

Pandas has powerful tools to deal with missing or NA values.

a. Checking for Missing Data
print(df.isnull())  # Returns True if the value is missing, otherwise False
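
On larger tables a full True/False grid is hard to read, so a per-column count is often more useful. A small sketch with invented data containing two gaps:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35],                 # None is stored as NaN
    'City': ['New York', None, 'Chicago'],
})

# Count missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Show only the rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```
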
b. Dropping Missing Data

You can drop rows or columns with missing data using .dropna().

df = df.dropna()  # Drops rows with any missing data
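
The call above drops rows. The following sketch, on made-up data, also shows dropping columns with axis=1 and restricting the check to chosen columns with subset.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 35],
    'City': ['New York', 'Los Angeles', None],
})

rows_complete = df.dropna()            # keep rows with no missing values
cols_complete = df.dropna(axis=1)      # keep columns with no missing values
age_known = df.dropna(subset=['Age'])  # only require 'Age' to be present

print(rows_complete)
print(cols_complete)
print(age_known)
```
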
c. Filling Missing Data

You can fill missing data with specific values using .fillna().

df = df.fillna(0)  # Fill missing values with 0
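
A single fill value such as 0 is not always appropriate for every column. As a sketch under assumed data, fillna also accepts a dictionary that maps column names to fill values, here the column mean for 'Age' and a placeholder string for 'City'.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, None, 35],
    'City': ['New York', None, 'Chicago'],
})

# Fill each column with its own value: mean age, and a placeholder city
filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})
print(filled)
```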

8. Groupby Operations

Pandas allows you to group data by one or more columns, which is particularly useful for aggregation or summarization.

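# Group by 'City' and average the numeric columns; numeric_only=True skips
# text columns such as 'Name', which would otherwise raise an error in pandas 2.x
grouped = df.groupby('City').mean(numeric_only=True)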
print(grouped)
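
When different columns need different summaries, agg with named aggregations is one option. The data and the output column names (mean_age, max_salary, people) below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
    'Age': [25, 35, 31, 45],
    'Salary': [50000, 70000, 60000, 90000],
})

# One row per city; each keyword becomes an output column
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    max_salary=('Salary', 'max'),
    people=('Age', 'count'),
)
print(summary)
```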

9. Conclusion

Pandas is a core library for data manipulation and analysis in Python. Its two data structures (Series and DataFrame) and the operations covered here, including selection, sorting, filtering, missing-data handling, and grouping, make it practical to work with large tabular datasets. Because it integrates with libraries such as NumPy, Matplotlib, and scikit-learn, it is a common component of data science and analysis workflows.
