8.2.1 DataFrames and data manipulation
In Pandas, a DataFrame is the primary data structure used for storing and manipulating tabular data. It can be thought of as a 2-dimensional array or table consisting of rows and columns, similar to an Excel spreadsheet or a SQL table.
DataFrames provide many powerful methods for data manipulation, including filtering, sorting, adding new columns, handling missing values, and more. Below is an overview of DataFrames and common data manipulation techniques in Pandas.
1. Creating a DataFrame
A DataFrame can be created from various data sources, such as dictionaries, lists, or external data files (e.g., CSV, Excel).
a. Creating a DataFrame from a Dictionary
import pandas as pd

# Create a dictionary with data
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
b. Creating a DataFrame from a List of Lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])

# Display the DataFrame
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
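External files such as CSV are mentioned above as another data source. The following is a minimal sketch, assuming the CSV text is held in memory with io.StringIO as a stand-in for a real file (the file name people.csv in the comment is hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would live in a file such as 'people.csv'
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# pd.read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
```

With a real file, the call would typically take the path directly, for example pd.read_csv('people.csv').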
2. Accessing and Selecting Data
Data can be accessed through both columns and rows.
a. Accessing Columns
You can access a column by referring to its name (as a string):
# Accessing a single column
print(df['Name'])

# Accessing multiple columns
print(df[['Name', 'City']])
b. Accessing Rows
- By Position: Use .iloc[] to select rows by integer position (0 is the first row).
# Accessing the first row (position 0)
print(df.iloc[0])
- By Label: Use .loc[] to select rows by index label. With the default integer index, labels and positions happen to coincide.
# Accessing a specific row by label (index label 1)
print(df.loc[1])
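The difference between .iloc[] and .loc[] is easier to see when the index is not the default 0, 1, 2. The sketch below assumes an illustrative string index ('a', 'b', 'c') that is not part of the example above:

```python
import pandas as pd

df_people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c'],  # illustrative string labels (an assumption for this sketch)
)

# .iloc selects by integer position: the second row
by_position = df_people.iloc[1]

# .loc selects by index label: the row labelled 'b'
by_label = df_people.loc['b']

print(by_position['Name'], by_label['Name'])

# .loc can also pick a single cell by row label and column name
print(df_people.loc['c', 'Age'])
```

Both selections return the same row here, but only because position 1 and label 'b' refer to the same record.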
3. Data Manipulation
a. Adding/Modifying Columns
- Add a new column:
df['Salary'] = [50000, 60000, 70000]  # Adding a new column
print(df)
- Modify an existing column:
df['Age'] = df['Age'] + 1  # Increment each value in the 'Age' column by 1
print(df)
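New columns are often computed from existing ones rather than typed in by hand. A small sketch of that idea follows; the 'Bonus' and 'Total' columns and the 10% rate are illustrative assumptions, not part of the example above:

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 70000],
})

# Derive a new column from an existing one (the 10% rate is an illustrative assumption)
df_people['Bonus'] = df_people['Salary'] * 0.10

# .assign() returns a new DataFrame and leaves the original unchanged
df_with_total = df_people.assign(Total=df_people['Salary'] + df_people['Bonus'])
print(df_with_total)
```

Direct assignment changes the DataFrame in place, while .assign() is convenient when the original should stay untouched.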
b. Sorting Data
You can sort a DataFrame by one or more columns using .sort_values(). The method returns a new, sorted DataFrame and leaves the original unchanged.
# Sort DataFrame by 'Age' in ascending order
df_sorted = df.sort_values(by='Age')
print(df_sorted)

# Sort DataFrame by 'Age' in descending order
df_sorted_desc = df.sort_values(by='Age', ascending=False)
print(df_sorted_desc)
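Sorting by more than one column works by passing lists to by and ascending. A minimal sketch, assuming an extra illustrative row ('Dana') so that two people share the same age:

```python
import pandas as pd

df_sort = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # 'Dana' is an added illustrative row
    'Age': [25, 30, 35, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Chicago'],
})

# Sort by Age descending, then by Name ascending to break ties
df_multi = df_sort.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(df_multi)
```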
c. Filtering Data
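Rows can be filtered with a boolean condition: the expression inside the brackets produces a True/False value per row, and only the rows marked True are kept.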
# Filter rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
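Filters can also combine several conditions. The sketch below rebuilds the example data so it runs on its own; the variable names are illustrative:

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_in_chicago = df_people[(df_people['Age'] > 30) & (df_people['City'] == 'Chicago')]
print(older_in_chicago)

# .isin() tests membership in a list of values
coastal = df_people[df_people['City'].isin(['New York', 'Los Angeles'])]
print(coastal)
```

Python's and / or keywords do not work element-wise on a Series, which is why & and | are used here.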
4. Handling Missing Data
Pandas provides several methods to handle missing data (i.e., NaN values) in a DataFrame.
a. Checking for Missing Data
# Check for missing values in the DataFrame
print(df.isnull())
b. Dropping Missing Data
# Drop rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)

# Drop columns with missing values
df_cleaned_columns = df.dropna(axis=1)
print(df_cleaned_columns)
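The running example has no missing values, so the calls above leave it unchanged. The sketch below assumes a small DataFrame with deliberately missing entries to show what checking and dropping actually do:

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps (not part of the running example)
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': ['New York', 'Los Angeles', None],
})

# Count missing values per column
missing_counts = df_missing.isnull().sum()
print(missing_counts)

# Drop rows that contain any missing value
rows_kept = df_missing.dropna()
print(rows_kept)

# Drop rows only when a specific column is missing
age_known = df_missing.dropna(subset=['Age'])
print(age_known)
```

Using subset is often a gentler choice than a plain dropna(), which discards a row if any column at all is missing.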
c. Filling Missing Data
You can fill missing values using .fillna().
# Fill missing data with a specific value (e.g., 0)
df_filled = df.fillna(0)
print(df_filled)
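A single fill value such as 0 rarely suits every column. One common alternative, sketched here on illustrative data, is to pass .fillna() a dictionary with a separate replacement per column:

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps (not part of the running example)
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
    'City': ['New York', None, 'Chicago'],
})

# Pass a dict to fill each column with its own replacement value
df_filled = df_missing.fillna({
    'Age': df_missing['Age'].mean(),  # mean of the non-missing ages
    'City': 'Unknown',                # placeholder text chosen for this sketch
})
print(df_filled)
```

Whether the mean is a sensible replacement depends on the data; it is used here only as an example of a computed fill value.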
5. GroupBy Operations
Pandas allows grouping data by one or more columns and performing aggregation or transformations. This is commonly used for summarizing and analyzing data.
# Grouping by 'City' and calculating the mean of 'Age'
grouped = df.groupby('City')['Age'].mean()
print(grouped)
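In the running example every city appears only once, so each group holds a single row. The sketch below assumes illustrative data with repeated cities and uses .agg() to compute several summaries at once:

```python
import pandas as pd

# Illustrative data with two people per city (not part of the running example)
df_staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 45],
    'City': ['New York', 'Chicago', 'Chicago', 'New York'],
})

# Compute several aggregations of 'Age' for each city in one call
summary = df_staff.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result has one row per city and one column per aggregation.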
6. Merging DataFrames
You can combine two DataFrames on a shared column using pd.merge() or the equivalent DataFrame method .merge().
# Example: Merging two DataFrames based on a common column 'Name'
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'City': ['New York', 'Los Angeles']})

# Merging the DataFrames on 'Name'
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
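The example above matches every row on both sides. When keys do not line up, the how parameter decides which rows survive. A minimal sketch, assuming an extra unmatched row ('Charlie') on the left side:

```python
import pandas as pd

df_left = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df_right = pd.DataFrame({'Name': ['Alice', 'Bob'], 'City': ['New York', 'Los Angeles']})

# Default is how='inner': keep only names present in both DataFrames
inner = pd.merge(df_left, df_right, on='Name')
print(inner)

# how='left': keep every row from the left DataFrame; unmatched cities become NaN
left = pd.merge(df_left, df_right, on='Name', how='left')
print(left)
```

The other accepted values include 'right' and 'outer', which keep all rows from the right side or from both sides respectively.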
7. Pivot Tables
Pivot tables are useful for summarizing data and performing aggregation in a more flexible way.
# Pivot table for aggregating the data by 'City'
pivot_df = df.pivot_table(values='Age', index='City', aggfunc='mean')
print(pivot_df)
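With only an index, a pivot table gives much the same result as a groupby. The extra flexibility shows up once a second dimension is spread across the columns. The sketch below assumes a hypothetical 'Department' column that is not in the running example:

```python
import pandas as pd

# Illustrative data; 'Department' is a hypothetical column added for this sketch
df_staff = pd.DataFrame({
    'City': ['New York', 'New York', 'Chicago', 'Chicago'],
    'Department': ['Sales', 'IT', 'Sales', 'Sales'],
    'Age': [25, 45, 30, 40],
})

# Rows are cities, columns are departments, cells hold the mean age
pivot = df_staff.pivot_table(
    values='Age', index='City', columns='Department', aggfunc='mean'
)
print(pivot)
```

Combinations with no data, such as IT in Chicago here, appear as NaN in the result.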
8. Conclusion
DataFrames in Pandas offer a powerful way to store, manipulate, and analyze data. With functions to perform operations like filtering, sorting, grouping, merging, and handling missing values, Pandas provides a comprehensive toolset for data analysis. The flexibility and efficiency of Pandas make it an essential tool in data science, machine learning, and general data manipulation tasks.