String formatting is how you build messages, logs, and output from variables. String methods help you clean, transform, and validate text data. In Data Engineering, you’ll constantly parse CSVs, validate input, construct SQL queries, and log pipeline events—all requiring solid string skills.


Modern String Formatting: F-Strings

F-strings (formatted string literals, Python 3.6+) are the modern, preferred way to format strings. They’re faster, more readable, and less error-prone than older methods.

Basic F-String Syntax

name = "Alice"
age = 30
 
# Simple variable insertion
message = f"Hello, {name}!"  # "Hello, Alice!"
 
# Expressions inside braces
result = f"In 5 years, {name} will be {age + 5}"  # "In 5 years, Alice will be 35"
 
# Multiple variables
data = f"Name: {name}, Age: {age}"  # "Name: Alice, Age: 30"

Formatting Numbers in F-Strings

# Decimal places for floats
price = 19.99
print(f"Price: ${price:.2f}")  # "Price: $19.99"
 
# Padding integers with zeros
user_id = 42
print(f"ID: {user_id:05d}")  # "ID: 00042"
 
# Thousands separator
population = 1000000
print(f"Population: {population:,}")  # "Population: 1,000,000"
 
# Percentage
success_rate = 0.9543
print(f"Success: {success_rate:.1%}")  # "Success: 95.4%"

Formatting with Expressions

# Method calls inside f-strings
name = "alice"
print(f"Welcome, {name.upper()}!")  # "Welcome, ALICE!"
 
# List comprehensions
numbers = [1, 2, 3, 4, 5]
print(f"Squares: {[x**2 for x in numbers]}")  # "Squares: [1, 4, 9, 16, 25]"
 
# Conditional expressions
age = 20
status = f"Adult" if age >= 18 else "Minor"
print(f"Status: {status}")  # "Status: Adult"

Multi-line F-Strings

user = {"name": "Alice", "age": 30, "city": "Paris"}
message = f"""
User Profile:
  Name: {user['name']}
  Age: {user['age']}
  City: {user['city']}
"""
print(message)

Older String Formatting (For Legacy Code)

You may encounter older code using these methods. Understand them but prefer f-strings in new code.

.format() Method

name = "Bob"
age = 25
 
# Basic usage
print("{} is {} years old".format(name, age))  # "Bob is 25 years old"
 
# Named parameters
print("{name} is {age} years old".format(name="Bob", age=25))
 
# Formatting numbers
print("Price: ${:.2f}".format(19.99))  # "Price: $19.99"

Modulo Operator (%)

name = "Charlie"
age = 35
 
# %s for strings, %d for integers, %f for floats
print("%s is %d years old" % (name, age))  # "Charlie is 35 years old"
 
# Formatting floats
print("Price: $%.2f" % 19.99)  # "Price: $19.99"

Essential String Methods

String methods transform, validate, and analyze text data.

Cleaning and Whitespace

# Remove leading/trailing whitespace
text = "  hello world  "
print(text.strip())      # "hello world"
print(text.lstrip())     # "hello world  " (left only)
print(text.rstrip())     # "  hello world" (right only)
 
# Remove leading/trailing specific characters
path = "/data/file.csv/"
print(path.strip("/"))   # "data/file.csv"

Case Conversion

text = "Hello World"
 
print(text.upper())      # "HELLO WORLD"
print(text.lower())      # "hello world"
print(text.capitalize()) # "Hello world" (first char only)
print(text.title())      # "Hello World" (each word)
print(text.swapcase())   # "hELLO wORLD"

Searching and Checking

text = "data science"
 
# Check if substring exists
"science" in text           # True
text.startswith("data")     # True
text.endswith("science")    # True
text.find("science")        # 5 (index of first occurrence)
text.count("a")             # 2 (occurrences of "a")
 
# Check character types
"123".isdigit()             # True
"hello".isalpha()           # True
"  ".isspace()              # True
"Hello123".isalnum()        # True

Splitting and Joining

# Split string into list
csv_line = "Alice,30,Engineer"
parts = csv_line.split(",")  # ["Alice", "30", "Engineer"]
 
# Split with limit
text = "a:b:c:d"
limited = text.split(":", 2)  # ["a", "b", "c:d"] (max 2 splits)
 
# Join list into string
names = ["Alice", "Bob", "Charlie"]
print(", ".join(names))  # "Alice, Bob, Charlie"
 
# Split lines
csv_data = "name,age\nAlice,30\nBob,25"
lines = csv_data.split("\n")  # ["name,age", "Alice,30", "Bob,25"]

Finding and Replacing

text = "The quick brown fox jumps over the lazy dog"
 
# Find substring (returns index, or -1 if not found)
print(text.find("brown"))      # 10
print(text.find("missing"))    # -1
 
# Replace occurrences
print(text.replace("fox", "cat"))      # "The quick brown cat..."
print(text.replace("the", "a", 1))     # Limit replacements: "a quick brown fox..."
 
# Replace with method chaining (common in ETL)
dirty = "  john doe  "
clean = dirty.strip().lower().replace(" ", "_")  # "john_doe"

Data Engineering Patterns

Parsing CSV Data

def parse_csv_line(line: str) -> dict[str, str]:
    """Parse CSV line into dictionary."""
    # Remove whitespace, split on comma
    parts = [p.strip() for p in line.split(",")]
    keys = ["id", "name", "age", "city"]
    return dict(zip(keys, parts))
 
# Usage
line = "  1  , alice  , 30 , Paris"
record = parse_csv_line(line)
# {"id": "1", "name": "alice", "age": "30", "city": "Paris"}

Validating Input Strings

def validate_email(email: str) -> bool:
    """Basic email validation."""
    # Remove whitespace
    email = email.strip().lower()
    
    # Check format
    if "@" not in email or "." not in email.split("@")[1]:
        return False
    
    return True
 
def validate_phone(phone: str) -> bool:
    """Validate phone number (digits and hyphens only)."""
    phone = phone.strip()
    
    # Remove common separators
    cleaned = phone.replace("-", "").replace(" ", "")
    
    # Check if all digits
    return cleaned.isdigit() and 10 <= len(cleaned) <= 15

Normalizing Data

def normalize_record(record: dict[str, str]) -> dict[str, str]:
    """Normalize string fields in a record."""
    normalized = {}
    
    for key, value in record.items():
        # Strip whitespace, lowercase
        normalized[key] = value.strip().lower()
        
        # Remove special characters from specific fields
        if key == "phone":
            normalized[key] = value.replace("-", "").replace(" ", "")
    
    return normalized
 
# Usage
raw = {"name": "  ALICE  ", "email": "  [email protected]  ", "phone": "555-123-4567"}
clean = normalize_record(raw)
# {"name": "alice", "email": "[email protected]", "phone": "5551234567"}

Building Dynamic Queries

def build_sql_filter(filters: dict[str, str]) -> str:
    """Build WHERE clause from filter dict."""
    conditions = []
    
    for key, value in filters.items():
        # Safely handle string values
        value_safe = value.strip().replace("'", "''")  # Escape quotes
        conditions.append(f"{key} = '{value_safe}'")
    
    return " AND ".join(conditions)
 
# Usage
filters = {"status": "active", "country": "France"}
where = build_sql_filter(filters)
# "status = 'active' AND country = 'France'"

Logging Pipeline Events

import logging
from datetime import datetime
 
logger = logging.getLogger(__name__)
 
def log_pipeline_event(
    stage: str,
    status: str,
    count: int,
    duration: float,
    error: str | None = None
) -> None:
    """Log ETL pipeline event."""
    timestamp = datetime.now().isoformat()
    
    message = (
        f"[{timestamp}] {stage:15} | "
        f"Status: {status:10} | "
        f"Records: {count:8,} | "
        f"Duration: {duration:6.2f}s"
    )
    
    if error:
        message += f" | Error: {error}"
        logger.error(message)
    else:
        logger.info(message)
 
# Usage
log_pipeline_event("Extract", "Success", 50000, 12.5)
# [2026-01-22T14:30:15.123456] Extract      | Status: Success    | Records:    50,000 | Duration:  12.50s

Comparison: F-String vs Other Methods

MethodSyntaxSpeedReadabilityUse Case
F-stringf"{var}"⚡ Fastest⭐⭐⭐ ExcellentModern code (Python 3.6+)
.format()"{0}".format(var)⭐ Medium⭐⭐ GoodLegacy code, older Python
% operator"%s" % var⭐ Slower⭐ PoorOld code only

Tips & Gotchas

  • F-strings are expressions, not strings. They’re evaluated at runtime.
# ✅ Works
price = 19.99
message = f"Price: ${price:.2f}"
 
# ❌ Doesn't work (can't use f-string in raw string)
raw_sql = fr"SELECT * FROM users WHERE id = {user_id}"  # rf"..." is ok
  • Escape braces with double braces in f-strings.
# ❌ Wrong
print(f"JSON: {{'name': 'Alice'}}")  # SyntaxError
 
# ✅ Correct
print(f"JSON: {{'name': 'Alice'}}")  # "JSON: {'name': 'Alice'}"
  • .strip() removes whitespace, not specific characters by default.
# ❌ Wrong expectation
text = "data.csv"
print(text.strip(".csv"))  # "data" (removes 'd', 'a' from ends too!)
 
# ✅ For removing file extensions, use different approach
print(text.rsplit(".", 1)[0])  # "data"
  • String methods don’t modify in-place—they return new strings.
text = "  hello  "
text.strip()      # ❌ Doesn't modify text
text = text.strip()  # ✅ Reassign to update
  • Use .lower() for case-insensitive comparison, not .upper().
# Compare emails (case-insensitive)
email1 = "[email protected]"
email2 = "[email protected]"
 
# ✅ Correct
print(email1.lower() == email2.lower())  # True
  • Split can produce empty strings—filter them if needed.
text = "a,,b,c"
parts = text.split(",")           # ["a", "", "b", "c"]
parts = [p for p in parts if p]   # ["a", "b", "c"] (filter empty)


Key Takeaway:
Use f-strings for modern code—they’re fast, readable, and expressive. Master string methods like .strip(), .split(), .join(), and .replace() for data cleaning. Test your formatting and validation logic thoroughly; data quality depends on it.