# Performance Tuning Guide
## Overview
This guide covers tuning Torrent Gateway performance across workloads and deployment sizes, including video transcoding.
## Database Optimization
### Indexes
The migration script applies performance indexes automatically:
```sql
-- File lookup optimization
CREATE INDEX idx_files_owner_pubkey ON files(owner_pubkey);
CREATE INDEX idx_files_storage_type ON files(storage_type);
CREATE INDEX idx_files_access_level ON files(access_level);
CREATE INDEX idx_files_size ON files(size);
CREATE INDEX idx_files_last_access ON files(last_access);
-- Chunk optimization
CREATE INDEX idx_chunks_chunk_hash ON chunks(chunk_hash);
-- User statistics
CREATE INDEX idx_users_storage_used ON users(storage_used);
-- Transcoding status optimization
CREATE INDEX idx_transcoding_status ON transcoding_status(status);
CREATE INDEX idx_transcoding_updated ON transcoding_status(updated_at);
```
### Database Maintenance
```bash
# Run regular maintenance
./scripts/migrate.sh
# Manual optimization
sqlite3 data/metadata.db "VACUUM;"
sqlite3 data/metadata.db "ANALYZE;"
```
### Connection Pooling
Configure connection limits in your application:
```go
// Production pool settings (database/sql)
db.SetMaxOpenConns(25)
db.SetMaxIdleConns(5)
db.SetConnMaxLifetime(300 * time.Second)
```
## Application Tuning
### Memory Management
**Go Runtime Settings:**
```bash
# Set garbage collection target
export GOGC=100
# Set memory limit
export GOMEMLIMIT=2GB
```
**Container Limits:**
```yaml
services:
  gateway:
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 1G
```
### File Handling
**Large File Optimization:**
- Files >10MB use torrent storage (chunked)
- Files <10MB use blob storage (single file)
- Chunk size: 256KB (configurable)
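The size-based routing above can be sketched in Go (hypothetical helper names; the gateway's internal functions may differ):

```go
package main

import "fmt"

const (
	blobThreshold = 10 << 20  // 10 MiB: files at or below this use blob storage
	chunkSize     = 256 << 10 // 256 KiB default chunk size
)

// pickBackend mirrors the size threshold described above.
func pickBackend(size int64) string {
	if size > blobThreshold {
		return "torrent"
	}
	return "blob"
}

// chunkCount returns how many 256 KiB chunks a torrent-stored file needs.
func chunkCount(size int64) int64 {
	return (size + chunkSize - 1) / chunkSize
}

func main() {
	fmt.Println(pickBackend(5<<20))  // 5 MiB file → blob
	fmt.Println(chunkCount(50 << 20)) // 50 MiB file → 200 chunks
}
```

Note that chunk count grows linearly with file size, which is why torrent storage benefits from the bulk-HDD path below while the database and blobs stay on SSD.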
**Storage Path Optimization:**
```bash
# Use SSD for database and small files
ln -s /fast/ssd/path data/blobs
# Use HDD for large file chunks
ln -s /bulk/hdd/path data/chunks
```
## Network Performance
### Connection Limits
**Reverse Proxy (nginx):**
```nginx
upstream gateway {
    server 127.0.0.1:9876 max_fails=3 fail_timeout=30s;
    keepalive 32;
}
server {
    location / {
        proxy_pass http://gateway;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
    }
}
```
### Rate Limiting
Configure rate limits based on usage patterns:
```yaml
# In docker-compose.prod.yml
environment:
  - RATE_LIMIT_UPLOAD=10/minute
  - RATE_LIMIT_DOWNLOAD=100/minute
  - RATE_LIMIT_API=1000/minute
```
## Storage Performance
### Storage Backend Selection
**Blob Storage (< 10MB files):**
- Best for: Documents, images, small media
- Performance: Direct file system access
- Scaling: Limited by file system performance
**Torrent Storage (> 10MB files):**
- Best for: Large media, archives, datasets
- Performance: Parallel chunk processing
- Scaling: Horizontal scaling via chunk distribution
### File System Tuning
**For Linux ext4:**
```bash
# Optimize for many small files
tune2fs -o journal_data_writeback /dev/sdb1
mount -o noatime,data=writeback /dev/sdb1 /data
```
**For ZFS:**
```bash
# Optimize for mixed workload
zfs set compression=lz4 tank/data
zfs set atime=off tank/data
zfs set recordsize=64K tank/data
```
## Monitoring and Metrics
### Key Metrics to Watch
**Application Metrics:**
- Request rate and latency
- Error rates by endpoint
- Active connections
- File upload/download rates
- Storage usage growth
**System Metrics:**
- CPU utilization
- Memory usage
- Disk I/O and space
- Network throughput
### Prometheus Queries
**Request Rate:**
```promql
rate(http_requests_total[5m])
```
**95th Percentile Latency:**
```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
**Error Rate:**
```promql
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
```
**Storage Growth:**
```promql
increase(storage_bytes_total[24h])
```
### Alert Thresholds
**Critical Alerts:**
- Error rate > 5%
- Response time > 5s
- Disk usage > 90%
- Memory usage > 85%
**Warning Alerts:**
- Error rate > 1%
- Response time > 2s
- Disk usage > 80%
- Memory usage > 70%
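The critical thresholds above map onto Prometheus alerting rules roughly as follows (a sketch reusing the metric names from the queries above; adjust to your actual rule files):

```yaml
groups:
  - name: gateway-performance
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: critical
```

The warning tier uses the same expressions with the lower thresholds (0.01 and 2) and `severity: warning`.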
## Load Testing
### Running Load Tests
```bash
# Start with integration load test
go test -v -tags=integration ./test/... -run TestLoadTesting -timeout 15m

# Custom load test with specific parameters
go test -v -tags=integration ./test/... -run TestLoadTesting \
  -concurrent-users=100 \
  -test-duration=300s \
  -timeout 20m
```
### Interpreting Results
**Good Performance Indicators:**
- 95th percentile response time < 1s
- Error rate < 0.1%
- Throughput > 100 requests/second
- Memory usage stable over time
**Performance Bottlenecks:**
- High database response times → Add indexes or scale database
- High CPU usage → Scale horizontally or optimize code
- High memory usage → Check for memory leaks or add limits
- High disk I/O → Use faster storage or optimize queries
## Scaling Strategies
### Vertical Scaling
**Increase Resources:**
```yaml
services:
  gateway:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
```
### Horizontal Scaling
**Multiple Gateway Instances:**
```bash
# Scale to 3 instances
docker-compose -f docker-compose.prod.yml up -d --scale gateway=3
```
**Load Balancer Configuration:**
```nginx
upstream gateway_cluster {
    server 127.0.0.1:9876;
    server 127.0.0.1:9877;
    server 127.0.0.1:9878;
}
```
### Database Scaling
**Read Replicas:**
- Implement read-only database replicas
- Route read queries to replicas
- Use primary for writes only
**Sharding Strategy:**
- Shard by user pubkey hash
- Distribute across multiple databases
- Implement shard-aware routing
## Caching Strategies
### Application-Level Caching
**Redis Configuration:**
```yaml
redis:
  image: redis:7-alpine
  command: redis-server --maxmemory 1gb --maxmemory-policy allkeys-lru
```
**Cache Patterns:**
- User session data (TTL: 24h)
- File metadata (TTL: 1h)
- API responses (TTL: 5m)
- Authentication challenges (TTL: 10m)
### CDN Integration
For public files, consider CDN integration:
- CloudFlare for global distribution
- AWS CloudFront for AWS deployments
- Custom edge servers for private deployments
## Configuration Tuning
### Environment Variables
**Production Settings:**
```bash
# Application tuning
export MAX_UPLOAD_SIZE=1GB
export CHUNK_SIZE=256KB
export MAX_CONCURRENT_UPLOADS=10
export DATABASE_TIMEOUT=30s
# Performance tuning
export GOMAXPROCS=4
export GOGC=100
export GOMEMLIMIT=2GB
# Logging
export LOG_LEVEL=info
export LOG_FORMAT=json
```
### Docker Compose Optimization
```yaml
services:
  gateway:
    # Use host networking for better performance
    network_mode: host
    # Optimize logging
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    # Resource reservations
    deploy:
      resources:
        reservations:
          memory: 512M
          cpus: '0.5'
```
## Benchmarking
### Baseline Performance Tests
```bash
# API performance
ab -n 1000 -c 10 http://localhost:9876/api/health

# Upload performance
for i in {1..10}; do
  time curl -X POST -F "file=@test/testdata/small.txt" http://localhost:9876/api/upload
done

# Download performance
time curl -O http://localhost:9876/api/download/[hash]
```
### Continuous Performance Monitoring
**Set up automated benchmarks:**
```bash
# Add to cron
0 2 * * * /path/to/performance_benchmark.sh
```
**Track performance metrics over time:**
- Response time trends
- Throughput capacity
- Resource utilization patterns
- Error rate trends
## Optimization Checklist
### Application Level
- [ ] Database indexes applied
- [ ] Connection pooling configured
- [ ] Caching strategy implemented
- [ ] Resource limits set
- [ ] Garbage collection tuned
### Infrastructure Level
- [ ] Fast storage for database
- [ ] Adequate RAM allocated
- [ ] Network bandwidth sufficient
- [ ] Load balancer configured
- [ ] CDN setup for static content
### Monitoring Level
- [ ] Performance alerts configured
- [ ] Baseline metrics established
- [ ] Regular load testing scheduled
- [ ] Capacity planning reviewed
- [ ] Performance dashboards created
## Video Transcoding Performance
### Hardware Requirements
**CPU:**
- 4+ cores recommended for concurrent transcoding
- Modern CPU with hardware encoding support (Intel QuickSync, AMD VCE)
- Higher core count = more concurrent jobs
**Memory:**
- 2GB+ RAM per concurrent transcoding job
- Additional 1GB+ for temporary file storage
- Consider SSD swap for large files
**Storage:**
- Fast SSD for work directory (`transcoding.work_dir`)
- Separate from main storage to avoid I/O contention
- Plan for 2-3x the video file size in temporary space
### Configuration Optimization
```yaml
transcoding:
  enabled: true
  concurrent_jobs: 4                 # Match CPU cores
  work_dir: "/fast/ssd/transcoding"  # Use fastest storage
  max_cpu_percent: 80                # Limit CPU usage
  nice_level: 10                     # Lower priority than main service
  min_file_size: 100MB               # Skip small files
```
### Performance Monitoring
**Key Metrics:**
- Queue depth and processing time
- CPU usage during transcoding
- Storage I/O patterns
- Memory consumption per job
- Failed job retry rates
**Alerts:**
- Queue backlog > 50 jobs
- Average processing time > 5 minutes per GB
- Failed job rate > 10%
- Storage space < 20% free
### Optimization Strategies
1. **Priority System**: Smaller files processed first for user feedback
2. **Resource Limits**: Prevent transcoding from affecting main service
3. **Smart Serving**: Original files served while transcoding in progress
4. **Batch Processing**: Group similar formats for efficiency
5. **Hardware Acceleration**: Use GPU encoding when available
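The smallest-first priority system (strategy 1) can be sketched in Go (hypothetical `job` shape; the real queue entry carries more fields):

```go
package main

import (
	"fmt"
	"sort"
)

// job is a queued transcoding request (illustrative shape only).
type job struct {
	fileHash string
	size     int64
}

// prioritize orders the queue smallest-first: short jobs finish
// quickly, giving users early feedback while large files wait.
func prioritize(queue []job) {
	sort.Slice(queue, func(i, j int) bool { return queue[i].size < queue[j].size })
}

func main() {
	queue := []job{
		{"aaa", 4 << 30},   // 4 GiB archive
		{"bbb", 150 << 20}, // 150 MiB clip
		{"ccc", 1 << 30},   // 1 GiB episode
	}
	prioritize(queue)
	for _, j := range queue {
		fmt.Println(j.fileHash)
	}
}
```

Combined with the `min_file_size` setting above, this keeps the queue responsive: files under the threshold skip transcoding entirely, and the rest are served in ascending size order.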