Question: You are building semantic search for your company in a specific domain, like legal or e-commerce. You take a model such as BERT or another transformer from Hugging Face, feed in your corpus to generate embeddings, then run kNN on sample queries, and you find that recall and precision are both bad. What do you do? I'm not sure how to approach this beyond: (1) train on a larger, more domain-specific corpus, and (2) use larger embedding dimensions. What else could I say?
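For context, recall@k and precision@k over a set of labeled queries are the usual way to quantify the problem described above. A minimal sketch (the doc ids and relevance set here are made up for illustration):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are actually relevant."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k

# One query: kNN returned doc ids [3, 7, 1, 9]; ground truth is {7, 2}.
print(recall_at_k([3, 7, 1, 9], {7, 2}, k=4))     # 0.5
print(precision_at_k([3, 7, 1, 9], {7, 2}, k=4))  # 0.25
```

Averaging these over a held-out query set gives you a baseline number to beat as you try the fixes suggested below.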
I'm not a hardcore ML engineer, but these are my thoughts: if you use larger embedding dimensions you risk running into the curse of dimensionality, which kNN is especially vulnerable to. On that front I think you should focus on feature engineering instead. Something else I would consider is the distance metric your kNN model uses; evaluate your options there, or perhaps use a different model altogether, especially because kNN gives you little insight into the target's relationship with the data.
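On the distance-metric point: for text embeddings, cosine similarity on L2-normalized vectors usually behaves better than raw Euclidean distance, and it's a cheap thing to try first. A small sketch (the toy corpus and query vectors are made up):

```python
import numpy as np

def knn_cosine(query, corpus_embs, k=5):
    """Return indices of the k nearest corpus vectors by cosine similarity.

    Both sides are L2-normalized first, so the dot product equals cosine
    similarity; this discards magnitude, which is often noise for
    transformer embeddings.
    """
    q = query / np.linalg.norm(query)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q                   # cosine similarity to each document
    return np.argsort(-sims)[:k]   # highest similarity first

# Toy example: 4 "documents" in a 3-d embedding space.
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(knn_cosine(query, corpus, k=2))  # the two docs pointing the same way
```

If you're using a library like scikit-learn or FAISS instead of rolling your own, the same idea applies: normalize the vectors and use inner-product/cosine search rather than plain L2.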
A few buckets to experiment with:
1. The model used to generate embeddings - try other SOTA embedding models.
2. How you chunked your corpus (sentence, paragraph, tokenization, etc.) before generating embeddings.
3. All the tunable stuff in kNN - distance metric, choice of k.
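On bucket 2, chunking granularity is easy to A/B test: embed sentence-level chunks and paragraph-level chunks separately and compare retrieval metrics. A minimal chunker sketch (the splitting heuristics and `max_chars` cap are illustrative choices, not a standard API):

```python
import re

def chunk_text(text, mode="sentence", max_chars=200):
    """Split a corpus document into chunks before embedding.

    mode="sentence": naive split on sentence-ending punctuation.
    mode="paragraph": split on blank lines.
    Overlong chunks are further broken at max_chars so one embedding
    doesn't have to summarize too much text.
    """
    if mode == "paragraph":
        parts = [p.strip() for p in text.split("\n\n") if p.strip()]
    else:
        parts = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for p in parts:
        while len(p) > max_chars:
            chunks.append(p[:max_chars])
            p = p[max_chars:]
        chunks.append(p)
    return chunks

doc = "Clause 1 applies. Clause 2 is void.\n\nSchedule A lists fees."
print(chunk_text(doc, mode="sentence"))
```

Smaller chunks tend to help precision (each embedding is about one thing), while larger chunks help recall for queries that span multiple sentences, so it's worth measuring both settings on your own queries.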
Look at lectures on contrastive methods.
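To make the pointer above concrete: contrastive fine-tuning with in-batch negatives (an InfoNCE-style loss) is the standard way to adapt an off-the-shelf encoder to a domain where its embeddings retrieve poorly. A minimal NumPy sketch of the loss itself (in practice you'd compute this in a framework like PyTorch and backpropagate through the encoder):

```python
import numpy as np

def info_nce_loss(query_embs, pos_embs, temperature=0.05):
    """Batch contrastive (InfoNCE) loss with in-batch negatives.

    Row i of query_embs should match row i of pos_embs; every other row
    in the batch serves as a negative. Minimizing this pulls true
    query/document pairs together and pushes the rest apart.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature             # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-probability of the diagonal (the true pairs).
    return -np.mean(np.diag(log_probs))
```

The training pairs can come from your own logs (query, clicked document) or from weak supervision like (title, body) pairs in the domain corpus.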
You can use a stacked approach on top of candidate retrieval: add a logistic regression model or a small NN to predict the rank of each document. You can also perform query expansion to add more context-specific fields to the query (RAG, if you're using an LLM).
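A sketch of that stacked second stage, with hand-picked weights standing in for a trained logistic-regression model (the feature names and example documents are hypothetical):

```python
import numpy as np

def rerank(candidates, feature_fn, weights, bias=0.0):
    """Second-stage ranker stacked on top of first-stage kNN retrieval.

    candidates: documents returned by the candidate-retrieval step.
    feature_fn: maps a candidate to a feature vector (e.g. embedding
    similarity, query-term overlap, recency). In practice weights/bias
    come from a logistic regression trained on click or relevance labels.
    """
    feats = np.array([feature_fn(c) for c in candidates])
    scores = feats @ weights + bias
    probs = 1.0 / (1.0 + np.exp(-scores))  # P(relevant | features)
    order = np.argsort(-probs)
    return [candidates[i] for i in order]

# Hypothetical features: embedding cosine similarity and query-term overlap.
docs = [{"id": "a", "cos": 0.70, "overlap": 0.1},
        {"id": "b", "cos": 0.65, "overlap": 0.9}]
ranked = rerank(docs, lambda d: [d["cos"], d["overlap"]],
                weights=np.array([2.0, 3.0]))
print([d["id"] for d in ranked])  # term overlap promotes doc "b"
```

The point of the stack is that the cheap kNN stage only has to produce a decent candidate pool; the reranker can then use richer features (including expanded-query matches) that would be too expensive to score against the whole corpus.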