torrent-gateway/docs/troubleshooting.md
enki 76979d055b
Transcoding and Nip71 update
2025-08-21 19:32:26 -07:00


# Troubleshooting Guide
## Common Issues and Solutions
### Service Startup Issues
#### Gateway Won't Start
**Symptoms:** Container exits immediately or health checks fail
**Diagnostic Steps:**
```bash
# Check container logs
docker-compose -f docker-compose.prod.yml logs gateway
# Check database file
ls -la data/metadata.db
# Test database connection
sqlite3 data/metadata.db "SELECT COUNT(*) FROM files;"
```
**Common Causes & Solutions:**
1. **Database permissions:**
```bash
sudo chown -R $USER:$USER data/
chmod -R 755 data/
```
2. **Port conflicts:**
```bash
# Check what's using port 9876
sudo netstat -tulpn | grep 9876
# Kill conflicting process or change port
```
3. **Insufficient disk space:**
```bash
df -h
# Free up space or add storage
```
#### Redis Connection Issues
**Symptoms:** Gateway logs show Redis connection errors
**Solutions:**
```bash
# Check Redis container
docker-compose -f docker-compose.prod.yml logs redis
# Test Redis connection
docker-compose -f docker-compose.prod.yml exec redis redis-cli ping
# Restart Redis
docker-compose -f docker-compose.prod.yml restart redis
```
### Performance Issues
#### High CPU Usage
**Diagnostic:**
```bash
# Check container resource usage
docker stats
# Check system resources
top
htop
```
**Solutions:**
1. **Scale gateway instances:**
```bash
docker-compose -f docker-compose.prod.yml up -d --scale gateway=2
```
2. **Optimize database:**
```bash
./scripts/migrate.sh # Runs VACUUM and ANALYZE
```
3. **Add resource limits:**
```yaml
services:
  gateway:
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
```
#### High Memory Usage
**Diagnostic:**
```bash
# Check memory usage by container
docker stats --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}"
# Check for memory leaks in logs
docker-compose -f docker-compose.prod.yml logs gateway | grep -iE "memory|leak|oom"
```
**Solutions:**
1. **Restart affected containers:**
```bash
docker-compose -f docker-compose.prod.yml restart gateway
```
2. **Implement memory limits:**
```yaml
services:
  gateway:
    deploy:
      resources:
        limits:
          memory: 2G
```
#### Slow Response Times
**Diagnostic:**
```bash
# Test API response time
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:9876/api/health
# Check database performance
sqlite3 data/metadata.db "EXPLAIN QUERY PLAN SELECT * FROM files LIMIT 10;"
```
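The `curl -w "@curl-format.txt"` command above reads its output template from a file that this guide does not show. The following is a minimal sketch of such a file, using curl's standard `--write-out` variables. The choice of fields and labels is an assumption for illustration, not necessarily the template the project ships.

```bash
# Create a write-out template for curl.
# The fields and labels below are an illustrative assumption.
cat > curl-format.txt <<'EOF'
   dns_lookup: %{time_namelookup}s\n
  tcp_connect: %{time_connect}s\n
   first_byte: %{time_starttransfer}s\n
        total: %{time_total}s\n
  http_status: %{http_code}\n
EOF
```

With this file in the current directory, the timing command above prints one line per phase, which helps separate connection delays from slow request handling.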
**Solutions:**
1. **Add database indexes:**
```bash
./scripts/migrate.sh # Applies performance indexes
```
2. **Optimize storage:**
```bash
# Check storage I/O
iostat -x 1 5
```
### Database Issues
#### Database Corruption
**Symptoms:** SQLite errors, integrity check failures
**Diagnostic:**
```bash
# Check database integrity
sqlite3 data/metadata.db "PRAGMA integrity_check;"
# Check database size and structure
sqlite3 data/metadata.db ".schema"
ls -lh data/metadata.db
```
**Recovery:**
```bash
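# Attempt to rebuild the database file (VACUUM may itself fail on a badly corrupted database)
sqlite3 data/metadata.db "VACUUM;"
# If the rebuild fails, restore from the most recent backup
./scripts/restore.sh "$(ls backups/ | grep gateway_backup | tail -1 | sed 's/gateway_backup_\(.*\)\.tar\.gz/\1/')"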
```
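The restore command above derives the timestamp argument from the newest archive name. The same idea can be sketched with shell parameter expansion instead of `ls | grep | sed`. This assumes archives are named `gateway_backup_<timestamp>.tar.gz` and that timestamps sort correctly as text; the function names are hypothetical helpers, not part of the project's scripts.

```bash
# Extract the timestamp from a backup archive name.
# Assumes names of the form gateway_backup_<timestamp>.tar.gz.
backup_timestamp() {
    local name="${1##*/}"            # strip any leading directory
    name="${name#gateway_backup_}"   # strip the prefix
    printf '%s\n' "${name%.tar.gz}"  # strip the extension
}

# Print the timestamp of the newest archive in a directory (default: backups).
# Relies on glob expansion being sorted, so the last match is the newest
# as long as timestamps sort lexicographically.
latest_backup_timestamp() {
    local dir="${1:-backups}" newest="" f
    for f in "$dir"/gateway_backup_*.tar.gz; do
        [ -e "$f" ] || continue      # the glob matched nothing
        newest="$f"
    done
    [ -n "$newest" ] || return 1     # no backups found
    backup_timestamp "$newest"
}

# Example usage:
#   ./scripts/restore.sh "$(latest_backup_timestamp backups)"
```

Returning a non-zero status when no archive exists lets a calling script stop before invoking the restore with an empty argument.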
#### Database Lock Issues
**Symptoms:** "database is locked" errors
**Solutions:**
```bash
# Find processes using database
lsof data/metadata.db
# Force unlock (dangerous: stop the gateway first; deleting the -wal file discards any changes not yet checkpointed into the main database)
docker-compose -f docker-compose.prod.yml stop gateway
rm -f data/metadata.db-wal data/metadata.db-shm
```
### Storage Issues
#### Disk Space Full
**Diagnostic:**
```bash
# Check disk usage
df -h
du -sh data/*
# Find large files
find data/ -type f -size +100M -exec ls -lh {} \;
```
**Solutions:**
1. **Clean up old files:**
```bash
# Remove files older than 30 days
find data/blobs/ -type f -mtime +30 -delete
find data/chunks/ -type f -mtime +30 -delete
```
2. **Clean up orphaned data:**
```bash
./scripts/migrate.sh # Removes orphaned chunks
```
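The age-based deletion in step 1 cannot be undone, so it can be worth previewing what would be removed before running it. The sketch below lists the same files without deleting them; `list_old_files` is a hypothetical helper name, and the example paths are the ones used in the cleanup commands above.

```bash
# List regular files older than a given number of days, without deleting them.
# Usage: list_old_files <directory> <days>
list_old_files() {
    local dir="$1" days="$2"
    find "$dir" -type f -mtime +"$days" -print
}

# Example usage:
#   list_old_files data/blobs/ 30            # show candidates
#   list_old_files data/blobs/ 30 | wc -l    # count candidates
```

Once the list looks right, the original `find ... -delete` commands can be run with the same directory and age.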
#### Storage Corruption
**Symptoms:** File integrity check failures
**Diagnostic:**
```bash
# Run E2E tests to verify storage
./test/e2e/run_all_tests.sh
# Check file system (unmount it first; running fsck on a mounted file system can damage it)
fsck /dev/disk/by-label/data
```
### Network Issues
#### API Timeouts
**Diagnostic:**
```bash
# Test network connectivity
curl -v http://localhost:9876/api/health
# Check Docker network
docker network ls
docker network inspect torrentgateway_default
```
**Solutions:**
```bash
# Restart networking
docker-compose -f docker-compose.prod.yml down
docker-compose -f docker-compose.prod.yml up -d
# Increase timeouts in client
curl --connect-timeout 30 --max-time 60 http://localhost:9876/api/health
```
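After a restart the gateway may need a few seconds before it answers, so a single health request can fail even when the restart worked. A small retry wrapper is one way to handle this; the sketch below is a generic helper (the name `retry` and its argument order are assumptions, not something the project provides).

```bash
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0                              # success: stop retrying
        [ "$i" -lt "$attempts" ] && sleep "$delay"    # wait before the next try
    done
    return 1                                          # every attempt failed
}

# Example usage (endpoint taken from the commands above):
#   retry 5 2 curl -fsS --max-time 10 http://localhost:9876/api/health
```

The `-f` flag makes curl exit non-zero on HTTP error responses, so the wrapper keeps retrying until the endpoint returns a successful status.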
#### Port Binding Issues
**Symptoms:** "Port already in use" errors
**Diagnostic:**
```bash
# Check port usage
sudo netstat -tulpn | grep :9876
sudo lsof -i :9876
```
**Solutions:**
```bash
# Kill conflicting process
sudo kill $(sudo lsof -t -i:9876)
# Or change port in docker-compose.yml
```
### Monitoring Issues
#### Prometheus Not Scraping
**Diagnostic:**
```bash
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets
# Check metrics endpoint
curl -s http://localhost:9876/metrics
```
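The targets endpoint returns a large JSON document, and the field that matters most here is each target's `health` value. The sketch below summarizes those values with plain `grep`. It assumes the compact JSON form (no space after the colon) and is only a rough text match rather than a real JSON parse; `count_target_health` is a hypothetical helper name.

```bash
# Count scrape targets by health state, reading the /api/v1/targets JSON on stdin.
# Rough text match only: assumes compact JSON such as "health":"up".
count_target_health() {
    grep -o '"health":"[a-z]*"' | sort | uniq -c
}

# Example usage:
#   curl -s http://localhost:9090/api/v1/targets | count_target_health
```

Any count next to `"health":"down"` points at a target Prometheus can reach in its configuration but fails to scrape.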
**Solutions:**
```bash
# Restart Prometheus
docker-compose -f docker-compose.prod.yml restart prometheus
# Check configuration
docker-compose -f docker-compose.prod.yml exec prometheus cat /etc/prometheus/prometheus.yml
```
#### Grafana Dashboard Issues
**Common Problems:**
1. **No data in dashboards:**
- Check Prometheus data source configuration
- Verify metrics are being collected
2. **Dashboard import failures:**
- Check JSON syntax
- Verify dashboard version compatibility
### Log Analysis
#### Finding Specific Errors
```bash
# Gateway application logs
docker-compose -f docker-compose.prod.yml logs gateway | grep -i error
# System logs with timestamps
docker-compose -f docker-compose.prod.yml logs --timestamps
# Follow logs in real-time
docker-compose -f docker-compose.prod.yml logs -f gateway
```
#### Log Rotation Issues
```bash
# Check log sizes
docker-compose -f docker-compose.prod.yml exec gateway ls -lh /app/logs/
# Manually rotate logs
docker-compose -f docker-compose.prod.yml exec gateway logrotate -f /etc/logrotate.conf
```
## Emergency Procedures
### Complete Service Failure
1. **Stop all services:**
```bash
docker-compose -f docker-compose.prod.yml down
```
2. **Check system resources:**
```bash
df -h
free -h
top
```
3. **Restore from backup:**
```bash
./scripts/restore.sh <timestamp>
```
### Data Recovery
1. **Create immediate backup:**
```bash
./scripts/backup.sh emergency
```
2. **Assess data integrity:**
```bash
sqlite3 data/metadata.db "PRAGMA integrity_check;"
```
3. **Restore if necessary:**
```bash
./scripts/restore.sh <last_good_backup>
```
## Getting Help
### Log Collection
Before reporting issues, collect relevant logs:
```bash
# Create diagnostics package
mkdir -p diagnostics
docker-compose -f docker-compose.prod.yml logs > diagnostics/service_logs.txt
./scripts/health_check.sh > diagnostics/health_check.txt 2>&1
cp data/metadata.db diagnostics/ 2>/dev/null || echo "Database not accessible"
tar -czf diagnostics_$(date +%Y%m%d_%H%M%S).tar.gz diagnostics/
```
### Health Check Output
Always include health check results:
```bash
./scripts/health_check.sh | tee health_status.txt
```
### System Information
```bash
# Collect system info
echo "Docker version: $(docker --version)" > system_info.txt
echo "Docker Compose version: $(docker-compose --version)" >> system_info.txt
echo "System: $(uname -a)" >> system_info.txt
echo "Memory: $(free -h)" >> system_info.txt
echo "Disk: $(df -h)" >> system_info.txt
echo "FFmpeg: $( (ffmpeg -version 2>/dev/null || echo 'Not installed') | head -1)" >> system_info.txt
```
## Video Transcoding Issues
### FFmpeg Not Found
**Symptoms:** Transcoding fails with "ffmpeg not found" errors
**Solution:**
```bash
# Install FFmpeg
sudo apt install ffmpeg # Ubuntu/Debian
sudo yum install ffmpeg # CentOS/RHEL
brew install ffmpeg # macOS
# Verify installation
ffmpeg -version
```
### Transcoding Jobs Stuck
**Symptoms:** Videos remain in "queued" or "processing" status
**Diagnostic Steps:**
```bash
# Check transcoding status
curl -H "Authorization: Bearer $TOKEN" \
http://localhost:9877/api/users/me/files/$HASH/transcoding-status
# Check process resources
ps aux | grep '[f]fmpeg'
top -p "$(pgrep -d, ffmpeg)"
```
**Common Causes:**
- Insufficient disk space in work directory
- Memory limits exceeded
- Invalid video format
- Corrupted source file
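The first cause in the list, insufficient disk space, can be checked directly. The sketch below reports free space on the file system that holds a given path. The guide does not name the transcoding work directory, so the path in the usage example is a placeholder, and both function names are hypothetical.

```bash
# Print the free space, in kilobytes, on the file system holding a path.
# Uses the POSIX output format of df, where column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least <min_kb> kilobytes are free under <path>.
# Usage: has_free_space <path> <min_kb>
has_free_space() {
    local path="$1" min_kb="$2"
    [ "$(free_kb "$path")" -ge "$min_kb" ]
}

# Example usage (placeholder path; 10485760 KB is 10 GiB):
#   has_free_space /path/to/work_dir 10485760 || echo "Less than 10 GiB free"
```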
### High Resource Usage
**Symptoms:** System slow during transcoding, high CPU/memory usage
**Solutions:**
```yaml
transcoding:
  # Reduce concurrent jobs
  concurrent_jobs: 2      # Lower from 4
  # Limit CPU usage
  max_cpu_percent: 50     # Reduce from 80
  nice_level: 15          # Increase from 10
  # Increase minimum file size threshold
  min_file_size: 200MB    # Skip more small files
```
### Failed Transcoding Jobs
**Symptoms:** Jobs marked as "failed" in status API
**Diagnostic Steps:**
```bash
# Check transcoding logs
grep "transcoding" /var/log/torrent-gateway.log
# Check FFmpeg error output
journalctl -u torrent-gateway | grep ffmpeg
```
**Common Solutions:**
- Verify source file is not corrupted
- Check available disk space
- Ensure FFmpeg supports input format
- Review resource limits