New arXiv preprint

Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data

Zhaohua Liang, Zhilin Wang, Renjie Cao, Yining Zhang · April 2026

A recruitment-domain semantic reranking system that uses LLM-synthesized supervision and boundary-aware reranking to improve candidate retrieval recall.


Overview

Abstract

Candidate sourcing for recruiters is best viewed as a two-stage retrieval and reranking pipeline with recall as the primary objective under a limited review budget. Mira-Embeddings-V1 reshapes the embedding space with LLM-synthesized training data and corrects boundary confusions with a lightweight reranking head. Starting from real job descriptions, the system uses a five-stage prompt pipeline to generate positive and hard-negative samples, followed by two-round LoRA adaptation and a BoundaryHead MLP. On 300 real job descriptions with candidates from a production retriever, it improves Recall@50 from 68.89% to 77.55% and Precision@10 from 35.77% to 39.62%. On a global pool of 44,138 candidates judged with a Qwen3-32B rubric, Recall@200 reaches 0.7047 versus 0.5969 for the baseline.

Evidence

Key findings

Recall@50 improved from 68.89% to 77.55% on a local pool built from 300 real job descriptions.
Precision@10 increased from 35.77% to 39.62% without a heavy cross-encoder.
On 44,138 global candidates, Recall@200 reached 0.7047 versus 0.5969 for the baseline.
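
For readers unfamiliar with the metrics above, the following is a minimal sketch of how Recall@K and Precision@K are commonly computed for one job description. The function names, the candidate IDs, and the choice to divide precision by K are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of all relevant candidates that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the top-k results that are relevant (assumed denominator: k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for cid in ranked_ids[:k] if cid in relevant_ids)
    return hits / k


# Hypothetical ranking for a single job description.
ranked = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c2", "c5", "c9"}  # "c9" was never retrieved

r5 = recall_at_k(ranked, relevant, 5)     # 2 of 3 relevant candidates found
p5 = precision_at_k(ranked, relevant, 5)  # 2 of 5 returned candidates relevant
```

Under a fixed review budget of K candidates, recall measures how many suitable candidates the recruiter gets to see at all, which is why the abstract treats it as the primary objective.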

Research design

Methodology

The study expands a modest set of real job descriptions into training supervision through a five-stage LLM synthesis pipeline that generates positive and hard-negative samples. It then applies JD-to-JD contrastive adaptation, JD-to-CV triplet alignment, and a lightweight BoundaryHead MLP to rerank candidates that share titles but differ in role scope.
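
As a rough illustration of the JD-to-CV triplet alignment step, the sketch below computes a generic cosine-similarity triplet margin loss for one (job description, positive CV, hard-negative CV) triple. The function names, the margin value of 0.2, and the toy two-dimensional vectors are assumptions made for illustration; the paper's actual loss formulation and hyperparameters are not stated in this summary.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def triplet_loss(jd: Sequence[float],
                 pos_cv: Sequence[float],
                 neg_cv: Sequence[float],
                 margin: float = 0.2) -> float:
    """Generic triplet margin loss (assumed form, not the paper's exact loss).

    The loss is zero once the positive CV is at least `margin` more similar
    to the job description than the hard-negative CV.
    """
    return max(0.0, margin - cosine(jd, pos_cv) + cosine(jd, neg_cv))


# Toy embeddings (hypothetical): an easy negative yields zero loss,
# while a hard negative that sits closer to the JD than the positive does not.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard = triplet_loss([1.0, 0.0], [1.0, 0.5], [1.0, 0.1])
```

Hard negatives, such as candidates who share a title with the role but differ in scope, are the cases where this loss stays positive and therefore contributes a training signal.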

Subjects

Research topics

AI recruiting
semantic reranking
candidate retrieval
LLM-synthesized data
domain-adapted embeddings

Reference

How to cite

Liang, Z., Wang, Z., Cao, R., & Zhang, Y. (2026). Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data. arXiv:2604.17738.