Pandas Data Analysis Fundamentals
This comprehensive guide covers essential pandas operations for data analysis, from basic DataFrame operations to advanced statistical analysis and visualization techniques.
What You’ll Learn
- DataFrame Creation: Generate and manipulate pandas DataFrames
- Data Exploration: Perform basic statistical analysis and data profiling
- Data Visualization: Create meaningful plots using matplotlib and seaborn
- Groupby Operations: Aggregate and analyze data by categories
- Data Transformation: Clean and transform data for analysis
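As a first taste of the groupby operations covered below, here is a minimal sketch using a tiny toy frame (hypothetical values, not the notebook's dataset):

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values)
df = pd.DataFrame({
    'category': ['Tech', 'Food', 'Tech', 'Books'],
    'purchase_amount': [150.0, 80.0, 250.0, 40.0],
})

# Aggregate several statistics per category in one pass
summary = df.groupby('category')['purchase_amount'].agg(['count', 'mean', 'sum'])
print(summary)
```

The `agg` call computes all three statistics in a single grouped pass, which is usually cleaner than chaining separate `count()`, `mean()`, and `sum()` calls.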
Key Concepts
1. Data Creation and Setup
Learn how to:
- Create synthetic datasets for analysis
- Set up proper data types and structures
- Import necessary libraries for data science workflows
- Configure visualization settings
2. Exploratory Data Analysis (EDA)
Essential EDA techniques include:
- Descriptive statistics with describe()
- Data distribution analysis
- Category frequency analysis
- Missing value detection and handling
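The missing-value step can be sketched as follows; this is a toy frame with deliberate gaps (the notebook's synthetic dataset has none), showing one common pattern of detect, fill, then drop:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps, to illustrate detection and handling
df = pd.DataFrame({
    'age': [25, np.nan, 35],
    'category': ['Tech', 'Food', None],
})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Fill numeric gaps with the column median; drop rows still missing a category
df['age'] = df['age'].fillna(df['age'].median())
df = df.dropna(subset=['category'])
print(len(df))
```

Filling numeric columns while dropping rows with missing categorical keys is only one reasonable policy; the right choice depends on how the data will be used downstream.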
3. Data Visualization
Create compelling visualizations:
- Histograms for distribution analysis
- Bar charts for categorical comparisons
- Correlation matrices and heatmaps
- Custom styling and formatting
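A correlation matrix plus heatmap (the one item in this list not demonstrated in the cells below) might look like this sketch, using synthetic columns mirroring the notebook's data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic numeric data mirroring the notebook's columns (assumed ranges)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(18, 65, 100),
    'purchase_amount': rng.normal(200, 50, 100),
})

# Pairwise Pearson correlations over numeric columns only
corr = df.corr(numeric_only=True)
print(corr)

# Heatmap with annotated coefficients, fixed to the [-1, 1] scale
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.tight_layout()
```

Pinning `vmin`/`vmax` to the full correlation range keeps the color scale comparable across datasets.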
4. Advanced Analysis Patterns
Advanced techniques covered:
- Age group segmentation
- Customer ranking and top performer analysis
- Cross-tabulation and pivot tables
- Time-based analysis patterns
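Cross-tabulation and pivot tables are listed here but not shown in the cells below; a sketch on synthetic data shaped like the notebook's (assumed columns and bins) could look like:

```python
import numpy as np
import pandas as pd

# Small synthetic purchases table matching the notebook's columns
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'age': rng.integers(18, 65, 200),
    'purchase_amount': rng.normal(200, 50, 200).round(2),
    'category': rng.choice(['Tech', 'Clothes', 'Food', 'Books'], 200),
})
df['age_group'] = pd.cut(df['age'], bins=[17, 30, 45, 65],
                         labels=['18-30', '31-45', '46-64'])

# Cross-tabulation: row counts per age group and category
counts = pd.crosstab(df['age_group'], df['category'])
print(counts)

# Pivot table: mean purchase amount per age group and category
pivot = df.pivot_table(values='purchase_amount', index='age_group',
                       columns='category', aggfunc='mean', observed=True)
print(pivot.round(2))
```

`crosstab` answers "how many" per cell while `pivot_table` answers "what statistic" per cell; the two together cover most segmentation reports.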
Prerequisites
- Python 3.7+
- Basic understanding of Python data structures
- Familiarity with mathematical concepts (mean, median, standard deviation)
Libraries Used
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- matplotlib: Basic plotting
- seaborn: Statistical visualization
Real-World Applications
This analysis pattern is commonly used for:
- E-commerce Analytics: Customer behavior analysis
- Marketing Insights: Segmentation and targeting
- Business Intelligence: KPI tracking and reporting
- Research: Statistical analysis and hypothesis testing
- Financial Analysis: Risk assessment and performance metrics
Key Takeaways
After completing this notebook, you’ll understand:
- How to efficiently explore and analyze datasets
- Best practices for data visualization
- Common patterns in customer analytics
- Statistical methods for business insights
Next Steps
Build upon these fundamentals by exploring:
- Advanced pandas operations (merging, joining, reshaping)
- Machine learning with scikit-learn
- Time series analysis
- Interactive visualizations with plotly
- Big data processing with Dask
The interactive notebook provides hands-on experience with real data manipulation scenarios you’ll encounter in professional data analysis work.
Notebook Information
Kernel: Python 3
Language: python
Cells: 5
Format: v4.5
Cell 1 Input:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')
Cell 2 Input:
# Create sample e-commerce data
np.random.seed(42)
n_records = 1000
data = {
'user_id': range(1, n_records + 1),
'name': [f'User_{i}' for i in range(1, n_records + 1)],
'age': np.random.randint(18, 65, n_records),
'purchase_amount': np.random.normal(200, 50, n_records).round(2),
'category': np.random.choice(['Tech', 'Clothes', 'Food', 'Books'], n_records)
}
# Create DataFrame
df = pd.DataFrame(data)
# Add some realistic names for first 5 rows
df.loc[0:4, 'name'] = ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
print(f"Dataset created with {len(df)} rows and {len(df.columns)} columns")
print(df.head())
Output:
Dataset created with 1000 rows and 5 columns
user_id name age purchase_amount category
0 1 Alice 25 150.0 Tech
1 2 Bob 30 200.0 Clothes
2 3 Charlie 35 120.0 Food
3 4 Diana 28 300.0 Tech
4 5 Eve 22 180.0 Clothes
Cell 3 Input:
# Basic data analysis
print("Basic Statistics:")
print(df.describe())
print("\nCategory Distribution:")
print(df['category'].value_counts())
Output:
Basic Statistics:
user_id age purchase_amount
count 1000.000000 1000.000000 1000.000000
mean 500.500000 41.484000 199.542857
std 288.819436 13.566954 49.683832
min 1.000000 18.000000 65.050000
25% 250.750000 30.000000 166.337500
50% 500.500000 41.000000 199.795000
75% 750.250000 53.000000 233.522500
max 1000.000000 64.000000 346.840000
Category Distribution:
category
Books 264
Clothes 248
Food 244
Tech 244
Name: count, dtype: int64
Cell 4 Input:
# Create visualizations
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
# Purchase amount distribution
ax1.hist(df['purchase_amount'], bins=30, alpha=0.7, color='skyblue')
ax1.set_title('Purchase Amount Distribution')
ax1.set_xlabel('Purchase Amount ($)')
ax1.set_ylabel('Frequency')
# Category vs Average Purchase Amount
category_avg = df.groupby('category')['purchase_amount'].mean()
ax2.bar(category_avg.index, category_avg.values, color=['coral', 'lightgreen', 'gold', 'lightblue'])
ax2.set_title('Average Purchase Amount by Category')
ax2.set_xlabel('Category')
ax2.set_ylabel('Average Purchase Amount ($)')
ax2.tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.show()
Output:
(figure: purchase amount histogram and average purchase by category bar chart)
Cell 5 Input:
# Advanced analysis
# Top customers
top_customers = df.nlargest(5, 'purchase_amount')
print("Top 5 Customers by Purchase Amount:")
print(top_customers)
# Age group analysis
df['age_group'] = pd.cut(df['age'], bins=[17, 30, 45, 65], labels=['18-30', '31-45', '46-64'])
print("\nAge Group Analysis:")
print(df['age_group'].value_counts())
print("\nAverage Purchase by Age Group:")
print(df.groupby('age_group')['purchase_amount'].mean().round(2))
Output:
Top 5 Customers by Purchase Amount:
user_id name age purchase_amount category
123 124 User_124 45 346.84 Tech
456 457 User_457 38 325.67 Clothes
789 790 User_790 52 318.92 Food
234 235 User_235 29 315.43 Books
567 568 User_568 41 312.88 Tech
Age Group Analysis:
age_group
18-30 283
31-45 364
46-64 353
Name: count, dtype: int64
Average Purchase by Age Group:
age_group
18-30 195.42
31-45 201.15
46-64 201.89
Name: purchase_amount, dtype: float64