Let’s talk about the latest release from Nomic AI: a text embedding model called Nomic Embed. It is trained with a fully open-source, auditable multi-stage pipeline and positions itself as an upgrade over popular closed models like OpenAI’s text-embedding-ada-002. Its headline feature is an extended context length of 8,192 tokens, which suits tasks such as retrieval-augmented generation (RAG) and semantic search.
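To make that concrete, here is a minimal semantic-search sketch. It assumes the `nomic-ai/nomic-embed-text-v1` checkpoint on Hugging Face, the `sentence-transformers` library, and the `search_query:`/`search_document:` task prefixes described in the model’s documentation; treat it as an illustration rather than official usage code.

```python
# Minimal semantic-search sketch (assumes sentence-transformers is installed
# and the nomic-ai/nomic-embed-text-v1 checkpoint is available).
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Nomic Embed expects task-specific prefixes on its inputs.
docs = [
    "search_document: Nomic Embed is an open-source text embedding model.",
    "search_document: Rotary position embeddings help extend context length.",
]
query = "search_query: What is Nomic Embed?"

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_emb @ query_emb
best = int(np.argmax(scores))
print(f"Best match (score {scores[best]:.3f}): {docs[best]}")
```

The same pattern scales to RAG: embed your corpus once with the `search_document:` prefix, store the vectors in an index, and embed each incoming question with the `search_query:` prefix to retrieve context.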
The model is built through a multi-stage contrastive learning pipeline that starts by training a long-context BERT backbone with a context length of 2,048 tokens. The training recipe folds in a number of modern techniques, including rotary position embeddings, SwiGLU activations, DeepSpeed, FlashAttention, and BF16 precision. The contrastive stages then train on large collections of text pairs, followed by fine-tuning on high-quality labeled datasets to squeeze out the best possible performance.
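The core idea behind contrastive training on text pairs is an InfoNCE-style loss with in-batch negatives: each query should score highest against its own paired document, with every other document in the batch acting as a negative. The PyTorch sketch below illustrates that generic loss; it is not Nomic’s actual training code, and every name in it is a placeholder.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of `query_emb` should match
    row i of `doc_emb`; every other row in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)  # diagonal entries are the positives

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 768
loss = info_nce_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```

In a real pipeline the two tensors would come from encoding both sides of each text pair with the model being trained, and the batch size would be large so that each example sees many negatives.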
The truth is that Nomic Embed outperforms existing models on three benchmark suites (MTEB, LoCo, and the Jina Long Context Benchmark), showing its strength on both short- and long-context tasks. Just as important, the commitment to transparency and reproducibility, backed by the release of model weights, training code, and curated data, is a big deal for the open-source community. Nomic Embed’s performance, together with its call for improved evaluation paradigms, highlights its significance in advancing the field of text embeddings.
And if you want to learn more, our consulting intern Pragati Jhunjhunwala at MarktechPost wrote a wonderful article on the release. She is a tech enthusiast with a keen interest in software and data science applications. Make sure to give it a read.