Pandas in Practice
Pandas is the go-to Python library for working with tabular data (think spreadsheets, CSVs, SQL tables). If NumPy is “arrays,” pandas is “tables with labels,” plus a ton of tools for cleaning, reshaping, and summarizing data.
Setup and the core objects
Pandas has two main data structures:
Series (1D labeled array)
DataFrame (2D table with rows and columns)
df = pd.DataFrame({
"name": ["Ada", "Linus", "Grace"],
"age": [32, 54, 40],
"role": ["admin", "user", "admin"]
})
df
Reading data (CSV is the most common)
Read a CSV
Common options:
df = pd.read_csv("data.csv",
na_values=["", "NA", "null"],
parse_dates=["date"],
dtype={"id": "string"})
Quick inspection
2. Selecting data: columns, rows, and cells
Columns
Rows by position: .iloc
Rows by label: .loc
If your index has labels (or you set one), use .loc:
Rule of thumb
- Use
.ilocwhen you mean “by position” - Use
.locwhen you mean “by label”
Filtering rows (the daily work)
Boolean filter
Multiple conditions (use & and parentheses)
Membership
String contains (handle missing with na=False)
Creating and updating columns
Simple derived column
Conditional column (vectorized)
More complex: np.where
Best practice: .assign() (nice for pipelines)
df = df.assign(
is_admin=df["role"].eq("admin"),
age_group=pd.cut(df["age"], bins=[0, 35, 50, 120], labels=["young", "mid", "older"])
)
Handling missing data (NaN)
Detect missing
Drop missing (use carefully)
Fill missing
Sorting
Groupby (the “power feature”)
Groupby is how you do “pivot table”-style summaries.
Example dataset
sales = pd.DataFrame({
"date": pd.to_datetime(["2026-01-01","2026-01-01","2026-01-02","2026-01-02"]),
"store": ["A","B","A","B"],
"item": ["apple","apple","banana","apple"],
"units": [10, 5, 7, 3],
"price": [1.0, 1.0, 2.0, 1.0]
})
sales["revenue"] = sales["units"] * sales["price"]
sales
Total revenue by store
Multiple aggregations
sales.groupby("store").agg(
total_units=("units", "sum"),
total_revenue=("revenue", "sum"),
avg_units=("units", "mean")
)
Group by multiple columns
Reshaping: pivot, melt, wide vs long
Pivot (long → wide)
Melt (wide → long)
long_again = pivot.reset_index().melt(id_vars="date", var_name="store", value_name="revenue")
long_again
Rule of thumb
- Many analyses + plotting prefer long format
- Many reports prefer wide format
Joining data (like SQL)
Example: customer table + orders table
customers = pd.DataFrame({
"customer_id": [1, 2, 3],
"name": ["Ada", "Linus", "Grace"]
})
orders = pd.DataFrame({
"order_id": [101, 102, 103],
"customer_id": [1, 1, 3],
"total": [50.0, 20.0, 99.0]
})
Merge (join)
Join types:
how="inner": only matching keyshow="left": keep all left rowshow="right",how="outer"similarly
Working with dates (very common)
Filter by date:
Resampling time series:
Practical example: cleaning a messy CSV
Imagine a CSV with:
- weird column names
- missing values
- strings that should be numbers
Here’s a “typical cleanup pipeline”:
df = pd.read_csv("messy.csv")
df.columns = (
df.columns
.str.strip()
.str.lower()
.str.replace(" ", "_")
)
# Convert numeric columns safely
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["units"] = pd.to_numeric(df["units"], errors="coerce")
# Handle missing
df["units"] = df["units"].fillna(0).astype(int)
df["price"] = df["price"].fillna(df["price"].median())
# Create derived metric
df["revenue"] = df["units"] * df["price"]
# Summary
summary = df.groupby("category").agg(
total_units=("units", "sum"),
total_revenue=("revenue", "sum")
).sort_values("total_revenue", ascending=False)
summary
This is very “real pandas.”
Saving output
CSV
Excel
Parquet (fast, compact; great for big data)
Pandas vs NumPy: when to use which
Use NumPy when:
- pure numeric arrays
- heavy math
- performance-critical inner loops
Use pandas when:
- labeled columns
- missing values
- joins/groupby/pivots
- messy real-world datasets
A lot of the time:
- pandas is the “data plumbing”
- NumPy is the “math engine”
Common pandas mistakes
- Using Python loops over rows (
for i, row in df.iterrows()) instead of vectorized ops - Forgetting parentheses with
&/| - Modifying a slice and getting
SettingWithCopyWarning(use.locexplicitly) - Confusing
.locvs.iloc
If you see yourself looping rows, pause and ask:
“Can I do this with a column operation,
.map,.merge, orgroupby?”