Troubleshooting Guide
Common Issues and Solutions
Service Startup Issues
Gateway Won't Start
Symptoms: Container exits immediately or health checks fail
Diagnostic Steps:
# Check container logs
docker-compose -f docker-compose.prod.yml logs gateway
# Check database file
ls -la data/metadata.db
# Test database connection
sqlite3 data/metadata.db "SELECT COUNT(*) FROM files;"
Common Causes & Solutions:
- Database permissions:
sudo chown -R $USER:$USER data/
chmod -R 755 data/
- Port conflicts:
# Check what's using port 9876
sudo netstat -tulpn | grep 9876
# Kill the conflicting process or change the port
- Insufficient disk space:
df -h
# Free up space or add storage
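The disk-space check above can be turned into a small pre-start guard. This is a minimal sketch, not part of the project's scripts: the helper names `usage_percent` and `disk_ok` and the 90% threshold are illustrative assumptions.

```shell
# usage_percent PATH: print the used-space percentage of the filesystem holding PATH.
# Relies on the POSIX output format of `df -P` (capacity is the fifth column).
usage_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_ok PATH MAX: succeed if usage of PATH's filesystem is at or below MAX percent.
disk_ok() {
  [ "$(usage_percent "$1")" -le "$2" ]
}

# Example: warn when the data directory's filesystem is more than 90% full.
# The 90% threshold is an arbitrary illustrative value.
if [ -d data ] && ! disk_ok data 90; then
  echo "Warning: data/ filesystem is over 90% full" >&2
fi
```

Running this before `docker-compose up` gives an early warning instead of a container that exits on its first write.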
Redis Connection Issues
Symptoms: Gateway logs show Redis connection errors
Solutions:
# Check Redis container
docker-compose -f docker-compose.prod.yml logs redis
# Test Redis connection
docker exec -it torrentgateway_redis_1 redis-cli ping
# Restart Redis
docker-compose -f docker-compose.prod.yml restart redis
Performance Issues
High CPU Usage
Diagnostic:
# Check container resource usage
docker stats
# Check system resources
top
htop
Solutions:
- Scale gateway instances:
docker-compose -f docker-compose.prod.yml up -d --scale gateway=2
- Optimize database:
./scripts/migrate.sh # Runs VACUUM and ANALYZE
- Add resource limits:
services:
  gateway:
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
High Memory Usage
Diagnostic:
# Check memory usage by container
docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Check for memory leaks in logs
docker-compose -f docker-compose.prod.yml logs gateway | grep -i "memory\|leak\|oom"
Solutions:
- Restart affected containers:
docker-compose -f docker-compose.prod.yml restart gateway
- Implement memory limits:
services:
  gateway:
    deploy:
      resources:
        limits:
          memory: 2G
Slow Response Times
Diagnostic:
# Test API response time
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:9876/api/health
# Check database performance
sqlite3 data/metadata.db "EXPLAIN QUERY PLAN SELECT * FROM files LIMIT 10;"
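The response-time test reads its output template from `curl-format.txt`, which this guide does not define. One possible version is sketched below. The field selection is an assumption, but each `%{...}` name is a standard curl `--write-out` variable.

```shell
# Write a curl write-out template to the current directory.
# The selection of timings here is illustrative; adjust as needed.
cat > curl-format.txt <<'EOF'
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
 time_starttransfer:  %{time_starttransfer}s\n
         time_total:  %{time_total}s\n
EOF

# Then run the timing test from the same directory:
#   curl -w "@curl-format.txt" -o /dev/null -s http://localhost:9876/api/health
```

A large gap between `time_connect` and `time_starttransfer` points at server-side processing rather than the network.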
Solutions:
- Add database indexes:
./scripts/migrate.sh # Applies performance indexes
- Optimize storage:
# Check storage I/O
iostat -x 1 5
Database Issues
Database Corruption
Symptoms: SQLite errors, integrity check failures
Diagnostic:
# Check database integrity
sqlite3 data/metadata.db "PRAGMA integrity_check;"
# Check database size and structure
sqlite3 data/metadata.db ".schema"
ls -lh data/metadata.db
Recovery:
# Attempt repair by rebuilding the database file (this can fail if corruption is severe)
sqlite3 data/metadata.db "VACUUM;"
# If repair fails, restore from the most recent backup
./scripts/restore.sh $(ls backups/ | grep gateway_backup | tail -1 | sed 's/gateway_backup_\(.*\)\.tar\.gz/\1/')
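The one-line pipeline that picks the newest backup is hard to read and prints nothing useful when no backup exists. A sketch of the same lookup as a function follows. The name `latest_backup_timestamp` is hypothetical, and the sketch assumes backup timestamps sort lexically (for example `YYYYMMDD_HHMMSS`), which this guide does not confirm.

```shell
# latest_backup_timestamp [DIR]: print the timestamp part of the newest
# gateway_backup_<timestamp>.tar.gz in DIR (default: backups).
# Returns 1 if no matching backup exists.
latest_backup_timestamp() {
  dir=${1:-backups}
  latest=""
  for f in "$dir"/gateway_backup_*.tar.gz; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    latest=$f                 # glob results are sorted, so the last match wins
  done
  [ -n "$latest" ] || return 1
  name=${latest##*/}              # strip the directory
  name=${name#gateway_backup_}    # strip the prefix
  printf '%s\n' "${name%.tar.gz}" # strip the extension
}

# Usage (restores only if a backup was found):
#   ts=$(latest_backup_timestamp backups) && ./scripts/restore.sh "$ts"
```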
Database Lock Issues
Symptoms: "database is locked" errors
Solutions:
# Find processes using database
lsof data/metadata.db
# Force unlock (dangerous: stop the gateway first; deleting the WAL file discards any changes not yet checkpointed into the database)
docker-compose -f docker-compose.prod.yml stop gateway
rm -f data/metadata.db-wal data/metadata.db-shm
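A gentler alternative to deleting the WAL files is to ask SQLite to checkpoint them once the gateway is stopped. The sketch below assumes the database runs in WAL mode (suggested by the `-wal` and `-shm` files) and that no other process holds it open. The function name `checkpoint_wal` is illustrative.

```shell
# checkpoint_wal [DB]: flush the write-ahead log into the main database file
# and truncate the WAL. Run only after the gateway has been stopped.
checkpoint_wal() {
  db=${1:-data/metadata.db}
  sqlite3 "$db" "PRAGMA wal_checkpoint(TRUNCATE);"
}

# Usage:
#   docker-compose -f docker-compose.prod.yml stop gateway
#   checkpoint_wal data/metadata.db
```

If the checkpoint succeeds, pending changes are preserved in `metadata.db`, so removing the WAL files by hand should rarely be necessary.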
Storage Issues
Disk Space Full
Diagnostic:
# Check disk usage
df -h
du -sh data/*
# Find large files
find data/ -type f -size +100M -exec ls -lh {} \;
Solutions:
- Clean up old files:
# Remove files older than 30 days
find data/blobs/ -type f -mtime +30 -delete
find data/chunks/ -type f -mtime +30 -delete
- Cleanup orphaned data:
./scripts/migrate.sh # Removes orphaned chunks
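Because `find ... -delete` is irreversible, it can help to list the candidates before removing anything. A minimal sketch of such a dry run, with `list_old_files` as a hypothetical helper name:

```shell
# list_old_files DIR [DAYS]: print regular files under DIR that were last
# modified more than DAYS days ago (default: 30). Nothing is deleted.
list_old_files() {
  find "$1" -type f -mtime +"${2:-30}" -print
}

# Review the candidates first:
#   list_old_files data/blobs 30
# Then delete only once the list looks right:
#   find data/blobs/ -type f -mtime +30 -delete
```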
Storage Corruption
Symptoms: File integrity check failures
Diagnostic:
# Run E2E tests to verify storage
./test/e2e/run_all_tests.sh
# Check file system (unmount it first; running fsck on a mounted filesystem can cause damage)
fsck /dev/disk/by-label/data
Network Issues
API Timeouts
Diagnostic:
# Test network connectivity
curl -v http://localhost:9876/api/health
# Check Docker network
docker network ls
docker network inspect torrentgateway_default
Solutions:
# Restart networking
docker-compose -f docker-compose.prod.yml down
docker-compose -f docker-compose.prod.yml up -d
# Increase timeouts in client
curl --connect-timeout 30 --max-time 60 http://localhost:9876/api/health
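After a restart the gateway may need a few seconds before the health endpoint answers, so a single probe can fail spuriously. A generic retry wrapper, sketched here with the hypothetical name `retry`, lets you repeat the check with a fixed delay between attempts.

```shell
# retry ATTEMPTS DELAY CMD [ARGS...]: run CMD until it succeeds, at most
# ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$delay"
    i=$((i + 1))
  done
}

# Usage: probe the health endpoint up to 5 times, 2 seconds apart.
#   retry 5 2 curl -fsS --connect-timeout 30 --max-time 60 http://localhost:9876/api/health
```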
Port Binding Issues
Symptoms: "Port already in use" errors
Diagnostic:
# Check port usage
sudo netstat -tulpn | grep :9876
sudo lsof -i :9876
Solutions:
# Kill conflicting process
sudo kill $(sudo lsof -t -i:9876)
# Or change port in docker-compose.yml
Monitoring Issues
Prometheus Not Scraping
Diagnostic:
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets
# Check metrics endpoint
curl -s http://localhost:9876/metrics
Solutions:
# Restart Prometheus
docker-compose -f docker-compose.prod.yml restart prometheus
# Check configuration
docker-compose -f docker-compose.prod.yml exec prometheus cat /etc/prometheus/prometheus.yml
Grafana Dashboard Issues
Common Problems:
- No data in dashboards:
  - Check Prometheus data source configuration
  - Verify metrics are being collected
- Dashboard import failures:
  - Check JSON syntax
  - Verify dashboard version compatibility
Log Analysis
Finding Specific Errors
# Gateway application logs
docker-compose -f docker-compose.prod.yml logs gateway | grep -i error
# System logs with timestamps
docker-compose -f docker-compose.prod.yml logs --timestamps
# Follow logs in real-time
docker-compose -f docker-compose.prod.yml logs -f gateway
Log Rotation Issues
# Check log sizes
docker-compose -f docker-compose.prod.yml exec gateway ls -lh /app/logs/
# Manually rotate logs
docker-compose -f docker-compose.prod.yml exec gateway logrotate /etc/logrotate.conf
Emergency Procedures
Complete Service Failure
- Stop all services:
docker-compose -f docker-compose.prod.yml down
- Check system resources:
df -h
free -h
top
- Restore from backup:
./scripts/restore.sh <timestamp>
Data Recovery
- Create immediate backup:
./scripts/backup.sh emergency
- Assess data integrity:
sqlite3 data/metadata.db "PRAGMA integrity_check;"
- Restore if necessary:
./scripts/restore.sh <last_good_backup>
Getting Help
Log Collection
Before reporting issues, collect relevant logs:
# Create diagnostics package
mkdir -p diagnostics
docker-compose -f docker-compose.prod.yml logs > diagnostics/service_logs.txt
./scripts/health_check.sh > diagnostics/health_check.txt 2>&1
cp data/metadata.db diagnostics/ 2>/dev/null || echo "Database not accessible"
tar -czf diagnostics_$(date +%Y%m%d_%H%M%S).tar.gz diagnostics/
Health Check Output
Always include health check results:
./scripts/health_check.sh | tee health_status.txt
System Information
# Collect system info
echo "Docker version: $(docker --version)" > system_info.txt
echo "Docker Compose version: $(docker-compose --version)" >> system_info.txt
echo "System: $(uname -a)" >> system_info.txt
echo "Memory: $(free -h)" >> system_info.txt
echo "Disk: $(df -h)" >> system_info.txt
echo "FFmpeg: $(command -v ffmpeg >/dev/null 2>&1 && ffmpeg -version | head -1 || echo 'Not installed')" >> system_info.txt
Video Transcoding Issues
FFmpeg Not Found
Symptoms: Transcoding fails with "ffmpeg not found" errors
Solution:
# Install FFmpeg
sudo apt install ffmpeg # Ubuntu/Debian
sudo yum install ffmpeg # CentOS/RHEL (usually requires a third-party repository such as RPM Fusion)
brew install ffmpeg # macOS
# Verify installation
ffmpeg -version
Transcoding Jobs Stuck
Symptoms: Videos remain in "queued" or "processing" status
Diagnostic Steps:
# Check transcoding status
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:9877/api/users/me/files/$HASH/transcoding-status
# Check process resources
ps aux | grep '[f]fmpeg'
top -p "$(pgrep -d, ffmpeg)"
Common Causes:
- Insufficient disk space in work directory
- Memory limits exceeded
- Invalid video format
- Corrupted source file
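The last two causes (invalid format, corrupted source) can often be ruled out by probing the source file directly. The sketch below uses `ffprobe`, which ships with FFmpeg. The function name `check_video` is hypothetical, and a successful probe only shows that the container metadata is readable, not that every frame will decode.

```shell
# check_video FILE: succeed if ffprobe can parse FILE and report a duration field.
# Returns 2 if ffprobe is not installed, 1 if the file is missing or unreadable as media.
check_video() {
  command -v ffprobe >/dev/null 2>&1 || return 2
  [ -f "$1" ] || return 1
  duration=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1" 2>/dev/null) || return 1
  [ -n "$duration" ]
}

# Usage:
#   check_video /path/to/source.mp4 || echo "source file looks invalid or corrupted"
```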
High Resource Usage
Symptoms: System slow during transcoding, high CPU/memory usage
Solutions:
# Reduce concurrent jobs
transcoding:
  concurrent_jobs: 2 # Lower from 4
# Limit CPU usage
transcoding:
  max_cpu_percent: 50 # Reduce from 80
  nice_level: 15 # Increase from 10
# Increase minimum file size threshold
transcoding:
  min_file_size: 200MB # Skip more small files
Failed Transcoding Jobs
Symptoms: Jobs marked as "failed" in status API
Diagnostic Steps:
# Check transcoding logs
grep "transcoding" /var/log/torrent-gateway.log
# Check FFmpeg error output
journalctl -u torrent-gateway | grep ffmpeg
Common Solutions:
- Verify source file is not corrupted
- Check available disk space
- Ensure FFmpeg supports input format
- Review resource limits