ADR-004: Error Handling and Recovery Strategy

Status

Accepted

Context

Sifaka interacts with multiple external systems (LLM APIs, storage backends, web services) and processes user input, making it susceptible to various failure modes:

  • Network timeouts and connection errors
  • API rate limiting and authentication failures
  • Invalid user input and configuration errors
  • Out-of-memory conditions and resource exhaustion
  • Plugin failures and compatibility issues

We need a comprehensive error handling strategy that:

  • Provides clear, actionable error messages
  • Enables graceful degradation when possible
  • Supports retry logic for transient failures
  • Maintains system stability under adverse conditions

Decision

We will implement a hierarchical exception system with structured error handling, automatic retry mechanisms, and graceful degradation strategies.

# Structured exceptions with suggestions
try:
    result = await improve("text")
except ModelProviderError as e:
    print(f"LLM API error: {e.message}")
    print(f"Suggestion: {e.suggestion}")
    print(f"Provider: {e.provider}")
    print(f"Error code: {e.error_code}")

Exception Hierarchy

Base Exception

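from typing import Optional

class SifakaError(Exception):
    """Base class for all Sifaka errors."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        self.message = message
        self.suggestion = suggestion
        super().__init__(message)

    def __str__(self):
        # Append the suggestion, when present, so it shows up in tracebacks and logs
        if self.suggestion:
            return f"{self.message}\n💡 Suggestion: {self.suggestion}"
        return self.message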

Specific Exception Types

  • ConfigurationError: Invalid configuration parameters
  • ModelProviderError: LLM API failures
  • CriticError: Critic evaluation failures
  • ValidationError: Text validation failures
  • StorageError: Storage backend issues
  • PluginError: Plugin loading/execution failures
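  • TimeoutError: Operation time limits exceeded (this name shadows Python's built-in TimeoutError when imported directly)
  • MemoryError: Memory bounds reached (this name likewise shadows Python's built-in MemoryError)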
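The examples in this ADR read fields such as `provider`, `error_code`, `parameter`, and `valid_range` that the base class does not define. A minimal sketch of how two subclasses might carry them follows. The constructor signatures and the `retry_after` and `retryable` attributes are assumptions for illustration, not Sifaka's confirmed API.

```python
from typing import Optional


class SifakaError(Exception):
    """Base error, repeated here so the sketch is self-contained."""

    def __init__(self, message: str, suggestion: Optional[str] = None):
        self.message = message
        self.suggestion = suggestion
        super().__init__(message)


class ModelProviderError(SifakaError):
    """LLM API failure. Field names are assumed for illustration."""

    def __init__(self, message, provider=None, error_code=None,
                 retry_after=None, suggestion=None):
        super().__init__(message, suggestion)
        self.provider = provider
        self.error_code = error_code
        self.retry_after = retry_after
        # Assumption: only these codes are treated as transient.
        self.retryable = error_code in {"rate_limit", "timeout", "server_error"}


class ConfigurationError(SifakaError):
    """Invalid configuration. Builds a suggestion from the valid range."""

    def __init__(self, message, parameter=None, valid_range=None):
        suggestion = None
        if parameter and valid_range:
            suggestion = f"Set {parameter} to a value in the range {valid_range}"
        super().__init__(message, suggestion)
        self.parameter = parameter
        self.valid_range = valid_range
```

Deriving `retryable` inside the exception keeps the retry decision next to the error code, so callers can check one attribute instead of matching on strings.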

Error Classification

1. Transient Errors (Retryable)

  • Network timeouts
  • Rate limiting
  • Server errors (5xx)
  • Temporary resource unavailability

2. Permanent Errors (Non-retryable)

  • Authentication failures
  • Invalid requests (4xx)
  • Configuration errors
  • Missing resources

3. Partial Errors (Recoverable)

  • Single critic failures
  • Optional feature unavailability
  • Non-critical validation failures
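For HTTP-based providers, the transient/permanent split above can be sketched as a small classifier. `ErrorClass` and `classify_http_status` are hypothetical names, and partial errors are left out because they are decided by the caller rather than by a status code. Note that 429 (rate limiting) and 408 (request timeout) are 4xx codes but belong to the transient class.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"  # safe to retry
    PERMANENT = "permanent"  # retrying will not help


def classify_http_status(status: int) -> ErrorClass:
    """Map an HTTP error status to a retry class."""
    # 408 and 429 are client-error codes, but both describe temporary conditions.
    if status in (408, 429) or 500 <= status <= 599:
        return ErrorClass.TRANSIENT
    if 400 <= status <= 499:
        return ErrorClass.PERMANENT
    raise ValueError(f"{status} is not an error status")
```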

Retry Strategy

Configuration

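from dataclasses import dataclass

@dataclass
class RetryConfig:
    max_attempts: int = 3
    delay: float = 1.0    # base delay in seconds
    backoff: float = 2.0  # multiplier applied per attempt

    def calculate_delay(self, attempt: int) -> float:
        # With attempt=0 this returns `delay`; each later attempt multiplies it by `backoff`
        return self.delay * (self.backoff ** attempt)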

Implementation

@with_retry(RetryConfig(max_attempts=3, delay=1.0, backoff=2.0))
async def call_llm_api(prompt: str) -> str:
    # API call implementation
    pass
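The `with_retry` decorator is used above but not defined in this ADR. One possible shape is sketched below, assuming a hypothetical `TransientError` marker class and a `retry_on` parameter that are not part of the original. It applies "full jitter" by sleeping a random time between zero and the computed backoff delay.

```python
import asyncio
import functools
import random
from dataclasses import dataclass


@dataclass
class RetryConfig:
    """Repeated from the Configuration section so the sketch runs on its own."""
    max_attempts: int = 3
    delay: float = 1.0
    backoff: float = 2.0

    def calculate_delay(self, attempt: int) -> float:
        return self.delay * (self.backoff ** attempt)


class TransientError(Exception):
    """Hypothetical marker for retryable failures."""


def with_retry(config: RetryConfig, retry_on=(TransientError,)):
    """Retry an async function on the given exception types."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(config.max_attempts):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == config.max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    base = config.calculate_delay(attempt)
                    # Full jitter: sleep a random time in [0, base].
                    await asyncio.sleep(random.uniform(0, base))
        return wrapper
    return decorator
```

Exceptions outside `retry_on` propagate immediately, which is how permanent errors skip the retry loop.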

Retry Logic

  • Exponential backoff with jitter
  • Selective retry based on error type
  • Configurable retry limits
  • Circuit breaker pattern for persistent failures
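The circuit breaker item above can be illustrated with a small sketch. `CircuitBreaker`, `CircuitOpenError`, and the threshold and timeout defaults are hypothetical and chosen only for illustration. After a run of consecutive failures the breaker rejects calls without touching the backend, then lets a single trial call through once the timeout has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock        # injectable so tests can control time
        self._failures = 0
        self._opened_at = None     # None means the circuit is closed

    async def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```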

Graceful Degradation

1. Critic Failures

When a critic fails:

  • Log the error with context
  • Continue with remaining critics
  • Include failure information in results
  • Provide fallback suggestions

2. Storage Failures

When storage fails:

  • Fall back to memory storage
  • Warn about data loss risk
  • Continue processing
  • Attempt to recover on next operation
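A minimal sketch of the storage fallback, assuming a backend interface with async `save` and `load` methods. `FallbackStorage`, `MemoryStorage`, and the `degraded` flag are hypothetical names, and `StorageError` is a stand-in for the exception listed earlier. The wrapper retries the primary backend on every call, which is one way to read "attempt to recover on next operation".

```python
import logging

logger = logging.getLogger(__name__)


class StorageError(Exception):
    """Stand-in for the StorageError listed in this ADR."""


class MemoryStorage:
    """In-process storage; contents are lost when the process exits."""

    def __init__(self):
        self._data = {}

    async def save(self, key, value):
        self._data[key] = value

    async def load(self, key):
        return self._data.get(key)


class FallbackStorage:
    """Tries the primary backend on every save; falls back to memory on failure."""

    def __init__(self, primary):
        self.primary = primary
        self.memory = MemoryStorage()
        self.degraded = False

    async def save(self, key, value):
        try:
            await self.primary.save(key, value)
            self.degraded = False  # primary is healthy again
        except StorageError as exc:
            if not self.degraded:
                # Warn once per outage rather than on every write.
                logger.warning(
                    "Primary storage failed (%s); using memory storage. "
                    "Data will be lost on exit.", exc)
            self.degraded = True
            await self.memory.save(key, value)
```

This sketch does not replay memory-held items to the primary after recovery; whether to do so is a separate design choice.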

3. Validation Failures

When validation fails:

  • Log validation errors
  • Continue with text improvement
  • Include validation status in results
  • Provide best-effort quality assessment

4. Tool Failures

When external tools fail:

  • Disable tool-dependent features
  • Use cached results if available
  • Continue with available tools
  • Provide reduced functionality notifications

Error Recovery Mechanisms

1. Automatic Recovery

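import asyncio

class ErrorRecovery:
    async def recover_from_api_failure(self, error: ModelProviderError):
        if error.error_code == "rate_limit":
            # Honor the provider's retry hint when present; default to 60 seconds
            await asyncio.sleep(getattr(error, "retry_after", None) or 60)
            return await self.retry_operation()

        if error.error_code == "authentication":
            # Retry with refreshed credentials
            await self.refresh_api_key()
            return await self.retry_operation()

        # No recovery path for this error: re-raise instead of silently returning None
        raise error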

2. Fallback Strategies

  • Alternative API providers
  • Cached responses
  • Simplified operations
  • Default configurations
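The first two fallbacks above (alternative providers, then cached responses) could be chained as in the sketch below. `complete_with_fallback`, the `Provider` type alias, and the plain `dict` cache are hypothetical and stand in for whatever provider and cache abstractions the system actually uses.

```python
from typing import Awaitable, Callable, Dict, Sequence

# A provider is any async callable that turns a prompt into a completion.
Provider = Callable[[str], Awaitable[str]]


async def complete_with_fallback(prompt: str, providers: Sequence[Provider],
                                 cache: Dict[str, str]) -> str:
    """Try each provider in order, then fall back to a cached response."""
    errors = []
    for provider in providers:
        try:
            result = await provider(prompt)
            cache[prompt] = result  # remember the latest good answer
            return result
        except Exception as exc:
            errors.append(exc)  # keep going; report everything if all fail
    if prompt in cache:
        return cache[prompt]
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")
```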

3. Recovery Workflows

  • Health check mechanisms
  • Automatic failover
  • Connection pooling
  • Resource cleanup

Error Reporting

1. Structured Logging

logger.error(
    "Critic failure",
    extra={
        "critic": critic.name,
        "error_type": type(error).__name__,
        "error_code": getattr(error, 'error_code', None),
        "retryable": getattr(error, 'retryable', False),
        "text_length": len(text),
        "iteration": result.iteration,
    }
)

2. Error Metrics

  • Error rate by type
  • Recovery success rate
  • Performance impact
  • User impact assessment
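As an illustration of the first two metrics, an in-process counter might look like the sketch below. `ErrorMetrics` is a hypothetical class; a production system would more likely export these counts to its existing metrics backend.

```python
from collections import Counter


class ErrorMetrics:
    """Counts errors by exception type and tracks how many were recovered."""

    def __init__(self):
        self.errors_by_type = Counter()
        self.recovered = 0

    def record(self, error: Exception, recovered: bool) -> None:
        self.errors_by_type[type(error).__name__] += 1
        if recovered:
            self.recovered += 1

    def recovery_rate(self) -> float:
        # Fraction of recorded errors that were recovered; 0.0 when none recorded.
        total = sum(self.errors_by_type.values())
        return self.recovered / total if total else 0.0
```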

3. User Feedback

  • Clear error messages
  • Actionable suggestions
  • Progress indicators
  • Status updates

Implementation Examples

1. Configuration Validation

def validate_config(config: Config):
    if config.temperature < 0 or config.temperature > 2:
        raise ConfigurationError(
            f"Temperature {config.temperature} is invalid",
            parameter="temperature",
            valid_range="0.0-2.0"
        )

2. API Error Handling

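import openai  # legacy SDK (openai<1.0), which provides ChatCompletion.acreate

async def call_openai_api(prompt: str):
    try:
        response = await openai.ChatCompletion.acreate(...)
        return response
    # In the legacy SDK, provider exceptions live in the openai.error module
    except openai.error.RateLimitError as e:
        # Wrap the provider exception; `from e` keeps the original as __cause__
        raise ModelProviderError(
            "Rate limit exceeded",
            provider="OpenAI",
            error_code="rate_limit"
        ) from e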

3. Graceful Critic Failure

import logging
from typing import List

logger = logging.getLogger(__name__)

async def run_critics(text: str, critics: List[Critic]) -> List[CritiqueResult]:
    results = []
    for critic in critics:
        try:
            result = await critic.critique(text)
            results.append(result)
        except Exception as e:
            # Log and continue: one failing critic must not abort the others
            logger.warning("Critic %s failed: %s", critic.name, e)
    return results
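The loop above logs failures but does not return them, while the Critic Failures section asks for failure information in the results. One way to carry it is sketched below, assuming critics expose `name` and an async `critique` method as in the example. `CriticRun` and `run_critics_with_failures` are hypothetical names. The sketch also runs critics concurrently with `asyncio.gather(..., return_exceptions=True)`, which is a different choice from the sequential loop.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CriticRun:
    results: List[Any] = field(default_factory=list)
    failures: List[dict] = field(default_factory=list)


async def run_critics_with_failures(text: str, critics) -> CriticRun:
    """Run all critics concurrently and report failures alongside results."""
    # return_exceptions=True yields exceptions as values instead of raising,
    # so one failing critic cannot cancel the rest.
    outcomes = await asyncio.gather(
        *(critic.critique(text) for critic in critics),
        return_exceptions=True,
    )
    run = CriticRun()
    for critic, outcome in zip(critics, outcomes):
        if isinstance(outcome, BaseException):
            run.failures.append({
                "critic": critic.name,
                "error_type": type(outcome).__name__,
                "message": str(outcome),
            })
        else:
            run.results.append(outcome)
    return run
```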

Consequences

Positive

  • Robust error handling improves reliability
  • Clear error messages reduce user confusion
  • Automatic recovery reduces manual intervention
  • Graceful degradation maintains functionality
  • Structured logging aids debugging

Negative

  • Additional complexity in error handling code
  • Potential performance impact from retry logic
  • Risk of masking underlying problems
  • Complexity in testing error scenarios

Mitigation

  • Comprehensive error handling tests
  • Performance monitoring for retry logic
  • Clear documentation of error behaviors
  • Configurable error handling strategies
  • Regular review of error patterns