8.2.1 DataFrames and data manipulation

In Pandas, a DataFrame is the primary data structure used for storing and manipulating tabular data. It can be thought of as a 2-dimensional array or table consisting of rows and columns, similar to an Excel spreadsheet or a SQL table.

DataFrames provide many powerful methods for data manipulation, including filtering, sorting, adding new columns, handling missing values, and more. Below is an overview of DataFrames and common data manipulation techniques in Pandas.

1. Creating a DataFrame

A DataFrame can be created from various data sources, such as dictionaries, lists, or external data files (e.g., CSV, Excel).

a. Creating a DataFrame from a Dictionary

import pandas as pd

# Create a dictionary with data
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

b. Creating a DataFrame from a List of Lists

data = [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])

# Display the DataFrame
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
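
External files are the third source mentioned at the start of this section. The sketch below reads CSV data with pd.read_csv(). To stay self-contained it reads from an in-memory string through io.StringIO instead of a file on disk; the csv_text content is made up for illustration, and with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; normally this would live in a file on disk
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (3, 3)
print(df)
```

The first line of the CSV becomes the column names, and the numeric 'Age' column is parsed as integers.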

2. Accessing and Selecting Data

Data can be accessed through both columns and rows.

a. Accessing Columns

You can access a column by referring to its name (as a string):

# Accessing a single column (returns a Series)
print(df['Name'])

# Accessing multiple columns with a list of names (returns a DataFrame)
print(df[['Name', 'City']])

b. Accessing Rows

  • By position: use .iloc[] to select rows by integer position, counting from 0.
# Accessing the first row (position 0)
print(df.iloc[0])
  • By label: use .loc[] to select rows by index label. With the default integer index used here, labels and positions happen to coincide.
# Accessing the row whose index label is 1
print(df.loc[1])
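
The difference between .iloc[] and .loc[] is easier to see when the index is not the default 0, 1, 2. The following is a minimal sketch that assumes a hypothetical string index ('a', 'b', 'c'), which is not part of the examples above.

```python
import pandas as pd

# Same kind of data as above, but with hypothetical string labels as the index
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c'],
)

# .iloc selects by integer position, regardless of the labels
first_by_position = df.iloc[0]

# .loc selects by index label
row_b = df.loc['b']

# .loc also accepts a row label and a column name together
bob_age = df.loc['b', 'Age']

print(first_by_position['Name'])  # Alice
print(row_b['Name'])              # Bob
print(bob_age)                    # 30
```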

3. Data Manipulation

a. Adding/Modifying Columns

  • Add a new column:
df['Salary'] = [50000, 60000, 70000]  # Adding a new column
print(df)
  • Modify an existing column:
df['Age'] = df['Age'] + 1  # Increment each value in the 'Age' column by 1
print(df)

b. Sorting Data

You can sort a DataFrame by one or more columns using .sort_values(). It returns a new, sorted DataFrame and leaves the original unchanged.

# Sort DataFrame by 'Age' in ascending order
df_sorted = df.sort_values(by='Age')
print(df_sorted)

# Sort DataFrame by 'Age' in descending order
df_sorted_desc = df.sort_values(by='Age', ascending=False)
print(df_sorted_desc)
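
Sorting by more than one column works the same way: pass lists to by and ascending. This sketch uses a small made-up table with repeated ages so that the second sort key has a visible effect.

```python
import pandas as pd

# Hypothetical data with ties in 'Age'
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [30, 25, 30, 25],
    'City': ['New York', 'Chicago', 'Chicago', 'Boston'],
})

# Sort by 'Age' descending, then by 'Name' ascending within equal ages
df_multi = df.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(df_multi)

# The original DataFrame keeps its row order
print(df)
```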

c. Filtering Data

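Rows can be filtered with a boolean condition. The comparison produces a True/False value for each row, and indexing the DataFrame with that result keeps only the rows marked True.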

# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
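
Conditions can also be combined. The sketch below rebuilds a small DataFrame so it runs on its own, then shows the boolean mask itself, a combined condition, and .isin() for membership tests. Note that pandas combines masks with & and |, not Python's and/or, and each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# The condition alone is a boolean Series, one value per row
mask = df['Age'] > 26
print(mask.tolist())  # [False, True, True]

# Combine conditions with & (and) or | (or); parenthesize each one
older_not_chicago = df[(df['Age'] >= 30) & (df['City'] != 'Chicago')]
print(older_not_chicago)

# .isin() keeps rows whose value appears in a list
in_cities = df[df['City'].isin(['New York', 'Chicago'])]
print(in_cities)
```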

4. Handling Missing Data

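Pandas provides several methods to handle missing data (values such as NaN or None) in a DataFrame. The df built in the earlier sections has no missing values, so the calls below return it unchanged; they become useful once real data contains gaps.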

a. Checking for Missing Data

# Check for missing values: returns a DataFrame of True/False flags, True where a value is missing
print(df.isnull())

b. Dropping Missing Data

# Drop every row that contains at least one missing value
df_cleaned = df.dropna()
print(df_cleaned)

# Drop every column that contains at least one missing value
df_cleaned_columns = df.dropna(axis=1)
print(df_cleaned_columns)

c. Filling Missing Data

You can fill missing values using .fillna().

# Fill missing data with a specific value (e.g., 0)
df_filled = df.fillna(0)
print(df_filled)
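
To see these methods change something, the data needs actual gaps. The following sketch uses a hypothetical df_missing with one missing age and one missing city. Filling 'Age' with the column mean and 'City' with 'Unknown' is only one possible choice for illustration, not a general recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': ['New York', 'Los Angeles', None],
})

# Count missing values per column
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)

# Keep only rows without any missing value (only Alice's row remains)
rows_complete = df_missing.dropna()

# Keep only columns without any missing value (only 'Name' remains)
cols_complete = df_missing.dropna(axis=1)

# Fill each column with its own replacement value
filled = df_missing.fillna({
    'Age': df_missing['Age'].mean(),  # mean of 25 and 35
    'City': 'Unknown',
})
print(filled)
```

Passing a dictionary to .fillna() lets each column get a replacement that suits its type, instead of one value for the whole table.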

5. GroupBy Operations

Pandas allows grouping data by one or more columns and performing aggregation or transformations. This is commonly used for summarizing and analyzing data.

# Grouping by 'City' and calculating the mean of 'Age'
grouped = df.groupby('City')['Age'].mean()
print(grouped)
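
Grouping is more informative when a group holds more than one row. This sketch assumes a made-up table in which two people share each city, and uses .agg() to compute several statistics per group at once.

```python
import pandas as pd

# Hypothetical data: two people per city
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 45],
    'City': ['New York', 'Chicago', 'Chicago', 'New York'],
})

# Mean age per city
mean_age = df.groupby('City')['Age'].mean()
print(mean_age)

# Several aggregations per city in one call
summary = df.groupby('City')['Age'].agg(['mean', 'max', 'count'])
print(summary)
```

The result is indexed by the group key, so a single value can be read with, for example, summary.loc['Chicago', 'max'].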

6. Merging DataFrames

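You can combine two DataFrames on one or more shared columns using pd.merge() (or the equivalent df.merge() method). By default this is an inner join, which keeps only the keys present in both DataFrames.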

# Example: Merging two DataFrames based on a common column 'Name'
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'City': ['New York', 'Los Angeles']})

# Merging the DataFrames on 'Name'
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
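
When the two DataFrames do not contain exactly the same keys, the how parameter decides which rows survive. A minimal sketch, assuming hypothetical inputs where 'Charlie' appears only on the left and 'Dana' only on the right:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'],
                    'City': ['New York', 'Los Angeles', 'Boston']})

# Inner join (the default): only names found in both DataFrames
inner = pd.merge(df1, df2, on='Name')

# Left join: every row of df1; 'City' is missing where df2 has no match
left = pd.merge(df1, df2, on='Name', how='left')

# Outer join: every name from either DataFrame
outer = pd.merge(df1, df2, on='Name', how='outer')

print(len(inner), len(left), len(outer))  # 2 3 4
```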

7. Pivot Tables

Pivot tables summarize data by grouping rows on one or more index keys (and optionally column keys) and applying an aggregation function to the values in each group.

# Pivot table: mean 'Age' for each 'City'
pivot_df = df.pivot_table(values='Age', index='City', aggfunc='mean')
print(pivot_df)
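
A pivot table differs most visibly from a plain groupby when a second key is spread across the columns. The sketch below assumes a hypothetical 'Department' column alongside made-up 'Salary' values; neither dataset appears earlier in this section.

```python
import pandas as pd

# Hypothetical data: salaries by city and department
df = pd.DataFrame({
    'City': ['New York', 'New York', 'New York', 'Chicago', 'Chicago'],
    'Department': ['Sales', 'Sales', 'IT', 'Sales', 'IT'],
    'Salary': [50000, 70000, 60000, 55000, 65000],
})

# Rows are cities, columns are departments, cells hold the mean salary
pivot = df.pivot_table(values='Salary', index='City',
                       columns='Department', aggfunc='mean')
print(pivot)
```

The two New York sales salaries are averaged into a single cell, which is the aggregation step that pivot_table() performs for every city and department pair.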

8. Conclusion

DataFrames in Pandas offer a powerful way to store, manipulate, and analyze data. With functions to perform operations like filtering, sorting, grouping, merging, and handling missing values, Pandas provides a comprehensive toolset for data analysis. The flexibility and efficiency of Pandas make it an essential tool in data science, machine learning, and general data manipulation tasks.
