torrent-gateway/docs/troubleshooting.md
enki 76979d055b
Transcoding and Nip71 update
2025-08-21 19:32:26 -07:00

Troubleshooting Guide

Common Issues and Solutions

Service Startup Issues

Gateway Won't Start

Symptoms: Container exits immediately or health checks fail

Diagnostic Steps:

# Check container logs
docker-compose -f docker-compose.prod.yml logs gateway

# Check database file
ls -la data/metadata.db

# Test database connection
sqlite3 data/metadata.db "SELECT COUNT(*) FROM files;"

Common Causes & Solutions:

  1. Database permissions:

    sudo chown -R $USER:$USER data/
    chmod -R 755 data/
    
  2. Port conflicts:

    # Check what's using port 9876
    sudo netstat -tulpn | grep 9876
    # Kill conflicting process or change port
    
  3. Insufficient disk space:

    df -h
    # Free up space or add storage
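
The disk-space check in item 3 can be turned into a pass/fail test. This is a minimal sketch: the helper names `disk_usage_percent` and `disk_has_headroom` are hypothetical, and the default threshold of 90% is an assumed value, not a gateway setting.

```shell
# Hypothetical helper: print the used-space percentage (without the % sign)
# of the file system that holds the given path.
disk_usage_percent() {
    # -P selects the POSIX output format: one line per file system, with the
    # capacity in column 5.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed only when usage is below the threshold (default 90, an assumed value).
disk_has_headroom() {
    local used
    used="$(disk_usage_percent "$1")"
    [ -n "$used" ] && [ "$used" -lt "${2:-90}" ]
}
```

For example, `disk_has_headroom data/ 90 || echo "data/ is nearly full"` prints a warning when the file system holding `data/` is at least 90% used.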
    

Redis Connection Issues

Symptoms: Gateway logs show Redis connection errors

Solutions:

# Check Redis container
docker-compose -f docker-compose.prod.yml logs redis

# Test Redis connection
docker-compose -f docker-compose.prod.yml exec redis redis-cli ping

# Restart Redis
docker-compose -f docker-compose.prod.yml restart redis

Performance Issues

High CPU Usage

Diagnostic:

# Check container resource usage
docker stats

# Check system resources
top
htop

Solutions:

  1. Scale gateway instances:

    docker-compose -f docker-compose.prod.yml up -d --scale gateway=2
    
  2. Optimize database:

    ./scripts/migrate.sh  # Runs VACUUM and ANALYZE
    
  3. Add resource limits:

    services:
      gateway:
        deploy:
          resources:
            limits:
              cpus: '1.0'
              memory: 1G
    

High Memory Usage

Diagnostic:

# Check memory usage by container
docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"

# Check for memory leaks in logs
docker-compose -f docker-compose.prod.yml logs gateway | grep -iE "memory|leak|oom"

Solutions:

  1. Restart affected containers:

    docker-compose -f docker-compose.prod.yml restart gateway
    
  2. Implement memory limits:

    services:
      gateway:
        deploy:
          resources:
            limits:
              memory: 2G
    

Slow Response Times

Diagnostic:

# Test API response time
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:9876/api/health

# Check database performance
sqlite3 data/metadata.db "EXPLAIN QUERY PLAN SELECT * FROM files LIMIT 10;"
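
The `-w "@curl-format.txt"` option above reads its output template from a file that this guide does not define. One possible template is sketched below. The choice of fields and the layout are assumptions, but the `%{...}` variable names are curl's own `--write-out` variables, all reported in seconds.

```shell
# Write a curl --write-out template into the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the template.
cat > curl-format.txt <<'EOF'
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
 time_starttransfer:  %{time_starttransfer}s\n
         time_total:  %{time_total}s\n
EOF
```

With this file in place, the curl command above prints one timing per line. A large gap between `time_connect` and `time_starttransfer` suggests the delay is in the gateway itself rather than in the network.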

Solutions:

  1. Add database indexes:

    ./scripts/migrate.sh  # Applies performance indexes
    
  2. Optimize storage:

    # Check storage I/O
    iostat -x 1 5
    

Database Issues

Database Corruption

Symptoms: SQLite errors, integrity check failures

Diagnostic:

# Check database integrity
sqlite3 data/metadata.db "PRAGMA integrity_check;"

# Check database size and structure
sqlite3 data/metadata.db ".schema"
ls -lh data/metadata.db

Recovery:

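# Attempt to rebuild the database file. VACUUM rewrites the file and may
# itself fail on a corrupt database, so keep a copy of the original first.
cp data/metadata.db data/metadata.db.corrupt
sqlite3 data/metadata.db "VACUUM;"

# If the rebuild fails, restore from the most recent backup
./scripts/restore.sh "$(ls backups/ | grep gateway_backup | tail -1 | sed 's/gateway_backup_\(.*\)\.tar\.gz/\1/')"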
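
The backup-selection pipeline above can also be written as a small function that reports an error when no backup exists. This is a sketch: the helper name `latest_backup_timestamp` is hypothetical, and it assumes archives are named `gateway_backup_<timestamp>.tar.gz` with timestamps that sort lexicographically, as the `sed` pattern implies.

```shell
# Hypothetical helper: print the timestamp of the newest backup archive in a
# directory (default: backups), or return 1 if there is none.
latest_backup_timestamp() {
    local dir="${1:-backups}" newest="" f
    for f in "$dir"/gateway_backup_*.tar.gz; do
        [ -e "$f" ] || continue            # the glob matched nothing
        newest="$f"                        # globs expand in sorted order
    done
    [ -n "$newest" ] || return 1
    newest="${newest##*/gateway_backup_}"  # strip directory and name prefix
    printf '%s\n' "${newest%.tar.gz}"      # strip the extension
}
```

Usage: `ts="$(latest_backup_timestamp backups)" && ./scripts/restore.sh "$ts"` runs the restore only when a backup was found.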

Database Lock Issues

Symptoms: "database is locked" errors

Solutions:

# Find processes using database
lsof data/metadata.db

# Force unlock (dangerous: stop the gateway first; deleting the -wal file
# discards any transactions not yet checkpointed into metadata.db)
docker-compose -f docker-compose.prod.yml stop gateway
rm -f data/metadata.db-wal data/metadata.db-shm

Storage Issues

Disk Space Full

Diagnostic:

# Check disk usage
df -h
du -sh data/*

# Find large files
find data/ -type f -size +100M -exec ls -lh {} \;

Solutions:

  1. Clean up old files:

    # Remove files older than 30 days
    find data/blobs/ -type f -mtime +30 -delete
    find data/chunks/ -type f -mtime +30 -delete
    
  2. Clean up orphaned data:

    ./scripts/migrate.sh  # Removes orphaned chunks
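
Because `find ... -delete` is irreversible, it can help to preview what the cleanup in item 1 would remove. This is a minimal sketch using a hypothetical helper `list_old_files`. The 30-day default mirrors the commands above.

```shell
# Hypothetical helper: list regular files older than N days (default 30)
# under a directory, without deleting anything.
list_old_files() {
    local dir="$1" days="${2:-30}"
    find "$dir" -type f -mtime +"$days" -print
}
```

Running `list_old_files data/blobs/ 30 | wc -l` shows how many files the delete command would remove. Check that none of them are still referenced before running it.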
    

Storage Corruption

Symptoms: File integrity check failures

Diagnostic:

# Run E2E tests to verify storage
./test/e2e/run_all_tests.sh

# Check the file system (unmount it first; running fsck on a mounted file system can cause damage)
sudo fsck /dev/disk/by-label/data

Network Issues

API Timeouts

Diagnostic:

# Test network connectivity
curl -v http://localhost:9876/api/health

# Check Docker network
docker network ls
docker network inspect torrentgateway_default

Solutions:

# Recreate the containers and the Docker network
docker-compose -f docker-compose.prod.yml down
docker-compose -f docker-compose.prod.yml up -d

# Increase timeouts in client
curl --connect-timeout 30 --max-time 60 http://localhost:9876/api/health
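
After a restart the gateway may take a few seconds to answer, so a single probe can fail spuriously. The sketch below retries a command a fixed number of times. The helper name `retry` and the attempt and delay values in the usage note are illustrative assumptions.

```shell
# Hypothetical helper: run a command until it succeeds, up to ATTEMPTS times,
# sleeping DELAY seconds between attempts.
# Usage: retry ATTEMPTS DELAY command [args...]
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Sleep only if another attempt follows.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry 10 3 curl -fsS --max-time 5 http://localhost:9876/api/health` probes the health endpoint up to ten times, three seconds apart, and returns non-zero if every attempt fails.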

Port Binding Issues

Symptoms: "Port already in use" errors

Diagnostic:

# Check port usage
sudo netstat -tulpn | grep :9876
sudo lsof -i :9876

Solutions:

# Kill conflicting process
sudo kill $(sudo lsof -t -i:9876)

# Or change the published port in docker-compose.prod.yml

Monitoring Issues

Prometheus Not Scraping

Diagnostic:

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets

# Check metrics endpoint
curl -s http://localhost:9876/metrics

Solutions:

# Restart Prometheus
docker-compose -f docker-compose.prod.yml restart prometheus

# Check configuration
docker-compose -f docker-compose.prod.yml exec prometheus cat /etc/prometheus/prometheus.yml
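
The targets endpoint returns a large JSON document. A quick way to see whether any target is failing is to count targets by their `health` field. This is a sketch using only standard text tools. The helper name `count_targets_by_health` is hypothetical, and it assumes each target carries a `"health":"up"`, `"down"`, or `"unknown"` string.

```shell
# Hypothetical helper: read the /api/v1/targets JSON on stdin and print one
# "<count> <state>" line per health state.
count_targets_by_health() {
    grep -o '"health" *: *"[a-z]*"' |   # isolate each health field
        sed 's/.*"\([a-z]*\)"$/\1/' |   # keep only the state value
        sort | uniq -c |                # count occurrences per state
        awk '{ print $1, $2 }'          # normalize uniq's padding
}
```

Usage: `curl -s http://localhost:9090/api/v1/targets | count_targets_by_health`. Any line reporting `down` means Prometheus can see the target but the scrape is failing.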

Grafana Dashboard Issues

Common Problems:

  1. No data in dashboards:

    • Check Prometheus data source configuration
    • Verify metrics are being collected
  2. Dashboard import failures:

    • Check JSON syntax
    • Verify dashboard version compatibility

Log Analysis

Finding Specific Errors

# Gateway application logs
docker-compose -f docker-compose.prod.yml logs gateway | grep -i error

# Logs from all services, with timestamps
docker-compose -f docker-compose.prod.yml logs --timestamps

# Follow logs in real-time
docker-compose -f docker-compose.prod.yml logs -f gateway
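
When several services log errors at once, a per-service count shows where to look first. The sketch below assumes the usual `service_1  | message` prefix that docker-compose puts on each line. The helper name `errors_per_service` is hypothetical.

```shell
# Hypothetical helper: read docker-compose log lines on stdin and print
# "<count> <service>" for lines containing "error", busiest service first.
errors_per_service() {
    grep -i 'error' |
        awk -F'[|]' '{
            gsub(/ +$/, "", $1)   # trim padding after the service name
            count[$1]++
        }
        END { for (s in count) print count[s], s }' |
        sort -rn
}
```

Usage: `docker-compose -f docker-compose.prod.yml logs --no-color | errors_per_service`. The `--no-color` flag keeps escape sequences out of the service names.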

Log Rotation Issues

# Check log sizes
docker-compose -f docker-compose.prod.yml exec gateway ls -lh /app/logs/

# Manually rotate logs (-f forces rotation even when the configured thresholds are not met)
docker-compose -f docker-compose.prod.yml exec gateway logrotate -f /etc/logrotate.conf

Emergency Procedures

Complete Service Failure

  1. Stop all services:

    docker-compose -f docker-compose.prod.yml down
    
  2. Check system resources:

    df -h
    free -h
    top
    
  3. Restore from backup:

    ./scripts/restore.sh <timestamp>
    

Data Recovery

  1. Create immediate backup:

    ./scripts/backup.sh emergency
    
  2. Assess data integrity:

    sqlite3 data/metadata.db "PRAGMA integrity_check;"
    
  3. Restore if necessary:

    ./scripts/restore.sh <last_good_backup>
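
Steps 2 and 3 can be chained so that a restore runs only when the integrity check fails. This is a sketch with a hypothetical helper `integrity_ok`. It relies on `PRAGMA integrity_check` printing the single line `ok` for a healthy database.

```shell
# Hypothetical helper: succeed only if stdin is exactly the single line "ok".
# Any other output, including empty output when sqlite3 cannot open the file,
# counts as a failed check.
integrity_ok() {
    [ "$(cat)" = "ok" ]
}

# Example (not run here):
#   sqlite3 data/metadata.db "PRAGMA integrity_check;" | integrity_ok \
#       || echo "integrity check failed: restore from the last good backup"
```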
    

Getting Help

Log Collection

Before reporting issues, collect relevant logs:

# Create diagnostics package
mkdir -p diagnostics
docker-compose -f docker-compose.prod.yml logs > diagnostics/service_logs.txt
./scripts/health_check.sh > diagnostics/health_check.txt 2>&1
cp data/metadata.db diagnostics/ 2>/dev/null || echo "Database not accessible"
tar -czf diagnostics_$(date +%Y%m%d_%H%M%S).tar.gz diagnostics/

Health Check Output

Always include health check results:

./scripts/health_check.sh | tee health_status.txt

System Information

# Collect system info
echo "Docker version: $(docker --version)" > system_info.txt
echo "Docker Compose version: $(docker-compose --version)" >> system_info.txt
echo "System: $(uname -a)" >> system_info.txt
echo "Memory: $(free -h)" >> system_info.txt
echo "Disk: $(df -h)" >> system_info.txt
echo "FFmpeg: $(command -v ffmpeg >/dev/null 2>&1 && ffmpeg -version | head -1 || echo 'Not installed')" >> system_info.txt

Video Transcoding Issues

FFmpeg Not Found

Symptoms: Transcoding fails with "ffmpeg not found" errors

Solution:

# Install FFmpeg
sudo apt install ffmpeg  # Ubuntu/Debian
sudo yum install ffmpeg  # CentOS/RHEL (not in the base repositories; enable a third-party repository such as RPM Fusion first)
brew install ffmpeg      # macOS

# Verify installation
ffmpeg -version

Transcoding Jobs Stuck

Symptoms: Videos remain in "queued" or "processing" status

Diagnostic Steps:

# Check transcoding status
curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:9877/api/users/me/files/$HASH/transcoding-status

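# Check process resources (the [f] keeps grep from matching itself;
# -d, joins multiple PIDs with commas, as top -p expects)
ps aux | grep '[f]fmpeg'
top -p "$(pgrep -d, ffmpeg)"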

Common Causes:

  • Insufficient disk space in work directory
  • Memory limits exceeded
  • Invalid video format
  • Corrupted source file

High Resource Usage

Symptoms: System slow during transcoding, high CPU/memory usage

Solutions:

# Apply any of the following under a single transcoding key
transcoding:
  concurrent_jobs: 2        # Reduce concurrent jobs (lower from 4)
  max_cpu_percent: 50       # Limit CPU usage (reduce from 80)
  nice_level: 15            # Lower scheduling priority (increase from 10)
  min_file_size: 200MB      # Raise the minimum file size threshold to skip more small files

Failed Transcoding Jobs

Symptoms: Jobs marked as "failed" in status API

Diagnostic Steps:

# Check transcoding logs
grep "transcoding" /var/log/torrent-gateway.log

# Check FFmpeg error output
journalctl -u torrent-gateway | grep ffmpeg

Common Solutions:

  • Verify source file is not corrupted
  • Check available disk space
  • Ensure FFmpeg supports input format
  • Review resource limits