ADR-004: Error Handling and Recovery Strategy

Status

Accepted

Context

Sifaka interacts with multiple external systems (LLM APIs, storage backends, web services) and processes user input, making it susceptible to various failure modes:

- Network timeouts and connection errors
- API rate limiting and authentication failures
- Invalid user input and configuration errors
- Out-of-memory conditions and resource exhaustion
- Plugin failures and compatibility issues

We need a comprehensive error handling strategy that:

- Provides clear, actionable error messages
- Enables graceful degradation when possible
- Supports retry logic for transient failures
- Maintains system stability under adverse conditions

Decision

We will implement a hierarchical exception system with structured error handling, automatic retry mechanisms, and graceful degradation strategies.

# Structured exceptions with suggestions
try:
    result = await improve("text")
except ModelProviderError as e:
    print(f"LLM API error: {e.message}")
    print(f"Suggestion: {e.suggestion}")
    print(f"Provider: {e.provider}")
    print(f"Error code: {e.error_code}")

Exception Hierarchy

Base Exception

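from typing import Optional

class SifakaError(Exception):
    """Base class for all Sifaka errors; carries an optional remediation hint."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        self.message = message
        self.suggestion = suggestion
        super().__init__(message)

    def __str__(self) -> str:
        # Append the suggestion only when one was provided
        if self.suggestion:
            return f"{self.message}\n💡 Suggestion: {self.suggestion}"
        return self.message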

Specific Exception Types

ConfigurationError: Invalid configuration parameters
ModelProviderError: LLM API failures
CriticError: Critic evaluation failures
ValidationError: Text validation failures
StorageError: Storage backend issues
PluginError: Plugin loading/execution failures
TimeoutError: Operation time limits exceeded
MemoryError: Memory bounds reached
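The subclasses listed above are not defined in this record. As a rough sketch, two of them might extend the base class as follows. The extra fields (`provider`, `error_code`, `retryable`, `parameter`, `valid_range`) are inferred from how the exceptions are used in the examples in this ADR and are assumptions, not confirmed Sifaka signatures. The default suggestion built in `ConfigurationError` is likewise hypothetical. `TimeoutError` and `MemoryError` are left out of the sketch because they share names with Python built-ins, so a module that imports them would shadow the built-in classes.

```python
from typing import Optional


class SifakaError(Exception):
    """Base class, repeated here so the sketch is self-contained."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        self.message = message
        self.suggestion = suggestion
        super().__init__(message)

    def __str__(self) -> str:
        if self.suggestion:
            return f"{self.message}\n💡 Suggestion: {self.suggestion}"
        return self.message


class ModelProviderError(SifakaError):
    """LLM API failure. Field names are assumptions based on this ADR's examples."""

    def __init__(
        self,
        message: str,
        provider: Optional[str] = None,
        error_code: Optional[str] = None,
        retryable: bool = False,
        suggestion: Optional[str] = None,
    ):
        super().__init__(message, suggestion)
        self.provider = provider
        self.error_code = error_code
        self.retryable = retryable


class ConfigurationError(SifakaError):
    """Invalid configuration parameter. Field names are assumptions."""

    def __init__(
        self,
        message: str,
        parameter: Optional[str] = None,
        valid_range: Optional[str] = None,
        suggestion: Optional[str] = None,
    ):
        # Hypothetical convenience: derive a suggestion from the valid range
        if suggestion is None and parameter and valid_range:
            suggestion = f"Set {parameter} to a value in the range {valid_range}"
        super().__init__(message, suggestion)
        self.parameter = parameter
        self.valid_range = valid_range
```

Because every subclass derives from `SifakaError`, callers can catch the base class to handle any Sifaka failure, or a specific subclass to read its extra fields.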

Error Classification

1. Transient Errors (Retryable)

Network timeouts
Rate limiting
Server errors (5xx)
Temporary resource unavailability

2. Permanent Errors (Non-retryable)

Authentication failures
Invalid requests (4xx)
Configuration errors
Missing resources

3. Partial Errors (Recoverable)

Single critic failures
Optional feature unavailability
Non-critical validation failures
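One way to apply the transient/permanent split to HTTP responses is a small classifier. This is a sketch only: `ErrorClass` and `classify_http_status` are hypothetical names, not part of Sifaka. It also makes one subtlety explicit: rate limiting is usually signalled with HTTP 429 and request timeouts with 408, both of which are 4xx codes, so they need a special case ahead of the general "4xx is permanent" rule.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"   # safe to retry
    PERMANENT = "permanent"   # retrying will not help


def classify_http_status(status: int) -> ErrorClass:
    """Map an HTTP error status onto the retryable / non-retryable split."""
    # 408 (Request Timeout) and 429 (Too Many Requests) are 4xx codes
    # that nevertheless describe temporary conditions.
    if status in (408, 429):
        return ErrorClass.TRANSIENT
    if 500 <= status <= 599:
        return ErrorClass.TRANSIENT
    if 400 <= status <= 499:
        return ErrorClass.PERMANENT
    raise ValueError(f"{status} is not an HTTP error status")
```

A retry wrapper could consult this function before sleeping, so that authentication failures (401, 403) and missing resources (404) fail immediately.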

Retry Strategy

Configuration

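from dataclasses import dataclass

@dataclass
class RetryConfig:
    max_attempts: int = 3   # upper bound on attempts
    delay: float = 1.0      # base delay in seconds
    backoff: float = 2.0    # multiplier applied per attempt

    def calculate_delay(self, attempt: int) -> float:
        # With a zero-based attempt index, the defaults yield 1.0, 2.0, 4.0 seconds.
        # Jitter is not applied here; it has to be added by the caller.
        return self.delay * (self.backoff ** attempt)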

Implementation

@with_retry(RetryConfig(max_attempts=3, delay=1.0, backoff=2.0))
async def call_llm_api(prompt: str) -> str:
    # API call implementation
    pass
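The `with_retry` decorator is used above but not defined in this record. A minimal sketch of how it could work is shown below. `TransientError` and the `retry_on` parameter are hypothetical, introduced only to illustrate selective retry, and the jitter scheme (a uniformly random sleep between zero and the computed delay) is one common choice rather than a documented Sifaka behavior.

```python
import asyncio
import functools
import random
from dataclasses import dataclass


@dataclass
class RetryConfig:
    max_attempts: int = 3
    delay: float = 1.0
    backoff: float = 2.0

    def calculate_delay(self, attempt: int) -> float:
        return self.delay * (self.backoff ** attempt)


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def with_retry(config: RetryConfig, retry_on=(TransientError,)):
    """Retry an async function when it raises one of the retry_on types."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            attempts = max(1, config.max_attempts)  # always call at least once
            for attempt in range(attempts):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == attempts - 1:
                        raise  # attempts exhausted: surface the last error
                    ceiling = config.calculate_delay(attempt)
                    # Jitter: sleep a random duration in [0, ceiling]
                    await asyncio.sleep(random.uniform(0, ceiling))

        return wrapper

    return decorator
```

Exceptions outside `retry_on` propagate on the first failure, which keeps permanent errors such as invalid requests from being retried.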

Retry Logic

Exponential backoff with jitter
Selective retry based on error type
Configurable retry limits
Circuit breaker pattern for persistent failures
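The circuit breaker mentioned in the list above is not specified further in this record. The following is a small sketch of the pattern under assumed names (`CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, `reset_timeout`): after a number of consecutive failures the breaker rejects calls outright, and once the timeout has elapsed it lets a trial call through.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls flow normally
        # Half-open: once the timeout has elapsed, allow a trial call
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial

    async def call(self, func, *args, **kwargs):
        if not self.allow_request():
            raise CircuitOpenError("circuit open; call skipped")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
        self.record_success()
        return result
```

Failing fast while the circuit is open avoids spending the retry budget on a dependency that is known to be down.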

Graceful Degradation

1. Critic Failures

When a critic fails:

- Log the error with context
- Continue with remaining critics
- Include failure information in results
- Provide fallback suggestions

2. Storage Failures

When storage fails:

- Fall back to memory storage
- Warn about data loss risk
- Continue processing
- Attempt to recover on next operation
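The storage fallback described above could be sketched as a wrapper around the primary backend. The `save`/`load` interface and the class names `MemoryStorage` and `FallbackStorage` are assumptions for illustration; the actual Sifaka storage API is not shown in this record.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


class StorageError(Exception):
    """Stand-in for the StorageError in the exception hierarchy."""


class MemoryStorage:
    """In-process store; contents are lost when the process exits."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, key: str, value: str) -> None:
        self._data[key] = value

    def load(self, key: str) -> Optional[str]:
        return self._data.get(key)


class FallbackStorage:
    """Tries the primary backend first and falls back to memory on StorageError."""

    def __init__(self, primary) -> None:
        self.primary = primary
        self.memory = MemoryStorage()
        self.degraded = False

    def save(self, key: str, value: str) -> None:
        try:
            # The primary is tried on every call, so recovery is automatic
            self.primary.save(key, value)
        except StorageError as exc:
            if not self.degraded:
                logger.warning("Storage failed (%s); using memory, data may be lost", exc)
            self.degraded = True
            self.memory.save(key, value)
            return
        if self.degraded:
            logger.info("Primary storage recovered")
        self.degraded = False

    def load(self, key: str) -> Optional[str]:
        try:
            value = self.primary.load(key)
        except StorageError:
            value = None
        # Values written while degraded only exist in memory
        return value if value is not None else self.memory.load(key)
```

Entries written during an outage stay in memory in this sketch; copying them back to the primary after recovery would be a further design decision.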

3. Validation Failures

When validation fails:

- Log validation errors
- Continue with text improvement
- Include validation status in results
- Provide best-effort quality assessment

4. Tool Failures

When external tools fail:

- Disable tool-dependent features
- Use cached results if available
- Continue with available tools
- Provide reduced functionality notifications
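As an illustration of the tool-failure behavior above, the sketch below caches successful tool results and serves them when the tool later fails. `ToolRunner` and its `run` signature are hypothetical, and tools are modeled as plain callables for brevity.

```python
import logging
from typing import Callable, Dict, Optional, Set, Tuple

logger = logging.getLogger(__name__)


class ToolRunner:
    """Runs tools, remembers their results, and degrades when they fail."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, str], str] = {}
        self.disabled: Set[str] = set()  # tools currently marked unavailable

    def run(self, name: str, tool: Callable[[str], str], query: str) -> Optional[str]:
        try:
            result = tool(query)
        except Exception as exc:
            self.disabled.add(name)
            logger.warning(
                "Tool %s unavailable (%s); continuing with reduced functionality",
                name,
                exc,
            )
            # Serve a cached result if one exists, otherwise None
            return self._cache.get((name, query))
        self.disabled.discard(name)  # a successful call re-enables the tool
        self._cache[(name, query)] = result
        return result
```

Callers can inspect `disabled` to decide which tool-dependent features to switch off and what to tell the user.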

Error Recovery Mechanisms

1. Automatic Recovery

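import asyncio

class ErrorRecovery:
    async def recover_from_api_failure(self, error: ModelProviderError):
        if error.error_code == "rate_limit":
            # Honor the provider's retry-after hint, defaulting to 60 seconds
            await asyncio.sleep(getattr(error, "retry_after", None) or 60)
            return await self.retry_operation()

        if error.error_code == "authentication":
            # Authentication failures are retried only after refreshing the key
            await self.refresh_api_key()
            return await self.retry_operation()

        # No recovery path for this error code: propagate the original failure
        raise error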

2. Fallback Strategies

Alternative API providers
Cached responses
Simplified operations
Default configurations
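The first fallback strategy, alternative API providers, can be sketched as an ordered chain that is tried until one call succeeds. `call_with_fallback` and the `(name, callable)` provider pairs are hypothetical; the way Sifaka actually registers providers is not described in this record.

```python
import logging
from typing import Awaitable, Callable, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)


class ModelProviderError(Exception):
    """Stand-in for the ModelProviderError in the exception hierarchy."""


Provider = Callable[[str], Awaitable[str]]


async def call_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> str:
    """Try each provider in order and return the first successful response."""
    last_error: Optional[ModelProviderError] = None
    for name, call in providers:
        try:
            return await call(prompt)
        except ModelProviderError as exc:
            logger.warning("Provider %s failed (%s); trying the next one", name, exc)
            last_error = exc
    if last_error is None:
        raise ValueError("no providers configured")
    raise last_error  # every provider failed: surface the most recent error
```

Only `ModelProviderError` triggers the fallback here, so programming errors in a provider adapter still fail loudly instead of being hidden by the chain.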

3. Recovery Workflows

Health check mechanisms
Automatic failover
Connection pooling
Resource cleanup

Error Reporting

1. Structured Logging

logger.error(
    "Critic failure",
    extra={
        "critic": critic.name,
        "error_type": type(error).__name__,
        "error_code": getattr(error, 'error_code', None),
        "retryable": getattr(error, 'retryable', False),
        "text_length": len(text),
        "iteration": result.iteration,
    }
)
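Fields passed through `extra` become attributes on the log record, but the standard formatter does not print them unless the format string names them. One possible way to surface them, sketched here with a hypothetical `JsonFormatter`, is to emit every non-standard record attribute as JSON.

```python
import json
import logging

# Attribute names present on every LogRecord, used to tell them apart
# from fields supplied by the caller through `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Render the level, the message, and any `extra` fields as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-serializable values from raising
        return json.dumps(payload, default=str)
```

Attaching this formatter to a handler makes fields such as `critic` and `error_type` searchable in log aggregation tools.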

2. Error Metrics

Error rate by type
Recovery success rate
Performance impact
User impact assessment
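The first two metrics in this list can be tracked with simple in-process counters. The sketch below uses a hypothetical `ErrorMetrics` class; a production setup would more likely export these values to a metrics system, which this record does not specify.

```python
from collections import Counter


class ErrorMetrics:
    """In-process counters for error rate by type and recovery success rate."""

    def __init__(self) -> None:
        self.operations = 0
        self.errors: Counter = Counter()
        self.recovery_attempts = 0
        self.recovery_successes = 0

    def record_operation(self) -> None:
        self.operations += 1

    def record_error(self, error: BaseException) -> None:
        # Key by exception class name, e.g. "ModelProviderError"
        self.errors[type(error).__name__] += 1

    def record_recovery(self, succeeded: bool) -> None:
        self.recovery_attempts += 1
        if succeeded:
            self.recovery_successes += 1

    def error_rate(self, error_type: str) -> float:
        """Errors of the given type per recorded operation."""
        if self.operations == 0:
            return 0.0
        return self.errors[error_type] / self.operations

    def recovery_success_rate(self) -> float:
        if self.recovery_attempts == 0:
            return 0.0
        return self.recovery_successes / self.recovery_attempts
```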

3. User Feedback

Clear error messages
Actionable suggestions
Progress indicators
Status updates

Implementation Examples

1. Configuration Validation

def validate_config(config: Config):
    # The chained comparison also rejects NaN, which the two separate checks let through
    if not 0 <= config.temperature <= 2:
        raise ConfigurationError(
            f"Temperature {config.temperature} is invalid",
            parameter="temperature",
            valid_range="0.0-2.0"
        )

2. API Error Handling

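import openai  # legacy SDK (openai<1.0), which provides ChatCompletion.acreate

async def call_openai_api(prompt: str):
    try:
        response = await openai.ChatCompletion.acreate(...)
        return response
    # In the legacy SDK the exception classes live in openai.error;
    # from 1.0 onward the class is openai.RateLimitError and acreate is removed
    except openai.error.RateLimitError as e:
        raise ModelProviderError(
            "Rate limit exceeded",
            provider="OpenAI",
            error_code="rate_limit"
        ) from e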

3. Graceful Critic Failure

import logging
from typing import List

logger = logging.getLogger(__name__)

async def run_critics(text: str, critics: List[Critic]) -> List[CritiqueResult]:
    results = []
    for critic in critics:
        try:
            result = await critic.critique(text)
            results.append(result)
        except Exception as e:
            # A single critic failure must not abort the run
            logger.warning("Critic %s failed: %s", critic.name, e)
            # Continue with other critics
    return results
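The example above logs a failed critic but does not return any record of the failure, whereas the critic-failure policy in this ADR calls for including failure information in results. One possible shape for that is sketched below; `CriticFailure`, `CritiqueReport`, and `run_critics_with_report` are hypothetical names, and critics are assumed to expose `name` and an async `critique` method as in the example above.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, List

logger = logging.getLogger(__name__)


@dataclass
class CriticFailure:
    critic: str
    error_type: str
    message: str


@dataclass
class CritiqueReport:
    results: List[Any] = field(default_factory=list)
    failures: List[CriticFailure] = field(default_factory=list)


async def run_critics_with_report(text: str, critics: List[Any]) -> CritiqueReport:
    """Run every critic, collecting both successful critiques and failures."""
    report = CritiqueReport()
    for critic in critics:
        try:
            report.results.append(await critic.critique(text))
        except Exception as exc:
            logger.warning("Critic %s failed: %s", critic.name, exc)
            # Record the failure so callers can see which critics were skipped
            report.failures.append(
                CriticFailure(critic.name, type(exc).__name__, str(exc))
            )
    return report
```

Returning the failures alongside the results lets the caller distinguish "no issues found" from "a critic did not run".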

Consequences

Positive

Robust error handling improves reliability
Clear error messages reduce user confusion
Automatic recovery reduces manual intervention
Graceful degradation maintains functionality
Structured logging aids debugging

Negative

Additional complexity in error handling code
Potential performance impact from retry logic
Risk of masking underlying problems
Complexity in testing error scenarios

Mitigation

Comprehensive error handling tests
Performance monitoring for retry logic
Clear documentation of error behaviors
Configurable error handling strategies
Regular review of error patterns