Yandex leak of search

Roel M. Hogervorst

2023/02/10

Categories: blog Tags: search inspiration production Catboost

Someone leaked a massive private source code repository from Yandex. What can you, as a data scientist, learn from this leak? How is advanced machine learning done at these massive companies? Specifically, what does the infrastructure for search look like?

I have not downloaded the source code, so I’m basing my ideas on what has been reported by Ars Technica and in several Twitter threads.

Search on the internet is essentially an oligopoly (a few companies do search and everyone else is bought up or destroyed; a handful of sellers controls the market). In the English-speaking world (and most of Europe) there is Google, in Chinese there is Baidu, and in Russian there is Yandex. Regardless, search is a specific problem with a few well-known parts.

Search, or information retrieval, is a set of problems related to presenting the ‘right’ information in response to a query. The founders of Google wrote a seminal paper about this problem, in which they presented the PageRank algorithm to determine which web pages are most important.

In general, search engines work like this: given a search query, you retrieve a huge number of results and use machine learning to rerank them (‘learning to rank’ if you look for it in academic papers) so that the best results appear on top^[1].
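To make the two stages concrete, here is a minimal sketch in Python. It assumes a toy in-memory corpus, and the `score` function is a hand-written stand-in for a learned ranking model; the names `DOCS`, `retrieve`, `score` and `search` are made up for this illustration and have nothing to do with the leaked code.

```python
from collections import defaultdict

# Toy corpus; the documents are invented for illustration.
DOCS = {
    "d1": "catboost gradient boosting library",
    "d2": "gradient descent tutorial",
    "d3": "catboost ranking tutorial for search",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def retrieve(index, query):
    """Stage 1: cheap recall step, keep any document sharing a term with the query."""
    candidates = set()
    for term in query.split():
        candidates |= index.get(term, set())
    return candidates


def score(query, text):
    """Stage 2 stand-in for a learned ranker: fraction of query terms matched."""
    terms = query.split()
    words = set(text.split())
    return sum(t in words for t in terms) / len(terms)


def search(docs, index, query, k=10):
    """Retrieve candidates, then rerank them by score (ties broken by id)."""
    candidates = retrieve(index, query)
    ranked = sorted(candidates, key=lambda d: (-score(query, docs[d]), d))
    return ranked[:k]


INDEX = build_index(DOCS)
print(search(DOCS, INDEX, "catboost tutorial"))
```

In a real system the first stage would hit a large index and the second stage would be a trained model over many features, but the shape (wide retrieval, then narrow reranking) stays the same.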

Websites in particular generate a lot of data: every click is recorded, so you can use that click information to rank the links and to train a model that reranks the results.
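As a rough sketch of how recorded clicks can become a training signal, the snippet below aggregates a hypothetical impression log into a smoothed click-through rate per (query, URL) pair. The log format, the function name `click_labels` and the prior values are all assumptions for the example, not something taken from the leak.

```python
from collections import Counter


def click_labels(log, prior_clicks=1.0, prior_views=10.0):
    """Turn (query, url, clicked) impressions into smoothed click-through rates.

    The priors are arbitrary illustration values; they pull rarely shown
    results toward a 10% baseline so that one lucky click does not dominate.
    """
    views = Counter()
    clicks = Counter()
    for query, url, clicked in log:
        views[(query, url)] += 1
        clicks[(query, url)] += int(clicked)
    return {
        key: (clicks[key] + prior_clicks) / (views[key] + prior_views)
        for key in views
    }


# Hypothetical impression log: (query, url, was_clicked)
LOG = [
    ("catboost", "a.example", True),
    ("catboost", "a.example", True),
    ("catboost", "b.example", False),
    ("catboost", "b.example", True),
    ("catboost", "b.example", False),
]

labels = click_labels(LOG)
for key, value in sorted(labels.items()):
    print(key, round(value, 3))
```

Raw clicks are biased (results shown higher get clicked more regardless of quality), so production systems usually correct for that. The general idea is simply that click logs can serve as relevance labels for a learning-to-rank model.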

According to some docs this is what Yandex does (or did in the past):

What can we see in the Yandex leaks?

Like every machine learning problem, you need several things: features (input data), an ML model, serving infrastructure, etc.

Approximately 17,000 features go into the ranking algorithm of Yandex. There are link-specific features, site-specific features, and also features related to the searcher and query.

What is also inspirational is that every feature is documented with a name, a description, a link to the internal wiki, links to the people responsible for it, and tags that categorize it, for example to include or exclude the feature for certain languages.
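If you want to copy this habit in your own projects, a small feature registry is enough to start. The sketch below is my own guess at what such metadata could look like in Python; the field names, the example features, the tags and the `example.invalid` URLs are all hypothetical and do not reflect how Yandex stores this information.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureSpec:
    """Documentation record for a single model feature (hypothetical schema)."""
    name: str
    description: str
    wiki_url: str
    owners: tuple
    tags: frozenset = field(default_factory=frozenset)


# Invented example entries, for illustration only.
REGISTRY = [
    FeatureSpec(
        name="url_depth",
        description="Number of path segments in the URL.",
        wiki_url="https://wiki.example.invalid/features/url_depth",
        owners=("alice",),
        tags=frozenset({"link", "lang:all"}),
    ),
    FeatureSpec(
        name="query_is_cyrillic",
        description="1 if the query contains Cyrillic characters, else 0.",
        wiki_url="https://wiki.example.invalid/features/query_is_cyrillic",
        owners=("bob", "carol"),
        tags=frozenset({"query", "lang:ru"}),
    ),
]


def select(registry, include=(), exclude=()):
    """Return names of features with at least one include tag and no exclude tag.

    An empty include list means "no restriction".
    """
    include, exclude = set(include), set(exclude)
    return [
        spec.name
        for spec in registry
        if (not include or spec.tags & include) and not (spec.tags & exclude)
    ]


print(select(REGISTRY, exclude=["lang:ru"]))
```

Because the metadata lives next to the feature definitions, selecting the feature set for a language-specific model becomes a one-line query instead of tribal knowledge.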

The SEO worshippers on the internet (see links below) have summarized those 17,000 features into the underlying concepts.

According to the excellent Ars Technica article, Yandex ranks results higher that:

very specific things

Conclusion

So search queries are enriched and all relevant documents are retrieved. The results are then scored and sorted, and the highest-scoring ones are shown first.

Like many other ML problems, the key to great results is thinking about what information you want the ML model to learn and turning that information into features.

Notes

Generic technical news article by Ars Technica: Massive Yandex code leak reveals Russian search engine’s ranking factors

A group of people dug through the source code to see what they could learn for search engine optimization (SEO):

More info about other Yandex things in the leaks:

References

[1]: Whatever that objective is. Right now I feel the objective for Google is showing the right ads, and for Yandex showing the right website results. For Bing, I don’t know what their objective is.