Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data
A recruitment-domain semantic reranking system that uses LLM-synthesized supervision and boundary-aware reranking to improve candidate retrieval recall.
Overview
Abstract
Candidate sourcing for recruiters is best viewed as a two-stage retrieval and reranking pipeline with recall as the primary objective under a limited review budget. Mira-Embeddings-V1 reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head. Starting from real job descriptions, the system uses a five-stage prompt pipeline to generate positive and hard-negative samples, followed by two-round LoRA adaptation and a BoundaryHead MLP. On 300 real job descriptions with candidates from a production retriever, it improves Recall@50 from 68.89% to 77.55% and Precision@10 from 35.77% to 39.62%. On a global pool of 44,138 candidates judged with a Qwen3-32B rubric, Recall@200 reaches 0.7047 versus 0.5969 for the baseline.
Evidence
Key findings
- Recall@50 improved from 68.89% to 77.55% on local candidate pools drawn from a production retriever for 300 real job descriptions.
- Precision@10 increased from 35.77% to 39.62% without a heavy cross-encoder.
- On a global pool of 44,138 candidates judged with a Qwen3-32B rubric, Recall@200 reached 0.7047 versus 0.5969 for the baseline.
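For readers unfamiliar with the metrics in these findings, the sketch below shows one common way to compute Recall@k and Precision@k for a single job description. The function names, the candidate IDs, and the choice to divide precision by k are illustrative assumptions; the source does not state how relevance labels are assigned or how scores are averaged across job descriptions.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant candidates that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(ranked[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by relevant candidates.

    Assumption: the denominator is k even when fewer than k results exist.
    """
    if k <= 0:
        return 0.0
    hits = sum(1 for candidate in ranked[:k] if candidate in relevant)
    return hits / k


# Hypothetical ranking for one job description (IDs are made up).
ranked = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c2", "c5", "c9"}  # "c9" was never retrieved

print(recall_at_k(ranked, relevant, 5))
print(precision_at_k(ranked, relevant, 5))
```

Under a fixed review budget, recall answers "how many of the suitable candidates did the recruiter get to see", which is why the findings lead with Recall@50 and Recall@200 rather than precision.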
Research design
Methodology
The study expands a modest set of real job descriptions (JDs) into training supervision, in the form of positive and hard-negative samples, through a five-stage LLM prompt pipeline. It then applies JD-to-JD contrastive adaptation and JD-to-CV triplet alignment, and adds a lightweight BoundaryHead MLP that reranks candidates who share a job title but differ in role scope.
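To make the triplet alignment step concrete, here is a minimal sketch of a margin-based triplet loss over a JD embedding, a matching CV embedding, and a hard-negative CV embedding. It is an illustration under assumptions, not the authors' implementation: the use of cosine similarity, the margin of 0.2, the function names, and the toy vectors are all hypothetical, and the source does not specify the exact loss.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_loss(
    jd: Sequence[float],
    cv_pos: Sequence[float],
    cv_neg: Sequence[float],
    margin: float = 0.2,  # assumed value, not taken from the source
) -> float:
    """Hinge loss that is zero once the positive CV beats the negative by `margin`."""
    return max(0.0, margin - cosine(jd, cv_pos) + cosine(jd, cv_neg))


# Toy 2-D embeddings (made up for illustration).
jd = [1.0, 0.0]
cv_match = [1.0, 0.0]      # same direction as the JD
cv_unrelated = [0.0, 1.0]  # orthogonal to the JD
cv_hard = [1.0, 0.1]       # near-duplicate, e.g. same title but different scope

print(triplet_loss(jd, cv_match, cv_unrelated))  # easy negative contributes no loss
print(triplet_loss(jd, cv_match, cv_hard))       # hard negative still incurs loss
```

The toy example shows why hard negatives matter: an unrelated CV is already separated by more than the margin and yields no training signal, while a near-duplicate CV keeps the loss positive.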
Subjects
Research topics
- AI recruiting
- semantic reranking
- candidate retrieval
- LLM-synthesized data
- domain-adapted embeddings
Reference
How to cite
Liang, Z., Wang, Z., Cao, R., & Zhang, Y. (2026). Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data. arXiv:2604.17738.