Multi-Speaker Pitch Tracking via Embodied Self-Supervised Learning