GitHub has introduced its new code search feature, including a redesigned search interface, a new code view, and a search engine rebuilt from scratch to be faster, more capable, and to better understand code, says GitHub software engineer Colin Merkel.
Our goal with the new code search and code view is to enable developers to quickly search, navigate and understand their code, put critical information into context, and ultimately make them more productive.
According to Merkel, the new search engine is twice as fast as the previous one. It is also more flexible, supporting substring queries, regular expressions, and symbol search. For example, you could search for a string across all repos belonging to your organization without having to clone them beforehand:
org:my_org "string to look for"
You can also restrict your query to a specific repo or to files written in a specific language, exclude specific paths, or use the many other qualifiers supported by GitHub's search query syntax.
The new code view integrates search with a file browser and supports code navigation and browsing, allowing you to jump to symbol definitions in over 10 languages.
GitHub engineer Timothy Clem provided a detailed overview of how the new search engine works behind the scenes to achieve its goals in terms of flexibility, performance, and scalability.
At the heart of GitHub's search engine lies a powerful indexer, which is a prerequisite for running queries fast. The search index is specialized for code: for example, it distinguishes between programming languages, does not ignore punctuation, and does not strip stop words. The index must also include ngrams, i.e. sequences of characters of a given length, to support substring queries.
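To make the role of ngrams concrete, here is a minimal sketch of a trigram index used to answer substring queries. It is an illustration only, not GitHub's implementation: the function names (`ngrams`, `build_index`, `substring_search`) and the choice of n = 3 are assumptions made for this example. Note that the text is indexed as-is, so punctuation and common words remain searchable.

```python
from collections import defaultdict


def ngrams(text: str, n: int = 3) -> set[str]:
    """Return every distinct run of n consecutive characters in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs: dict[str, str], n: int = 3) -> dict[str, set[str]]:
    """Map each ngram to the IDs of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, content in docs.items():
        for gram in ngrams(content, n):
            index[gram].add(doc_id)
    return index


def substring_search(query: str, docs: dict[str, str],
                     index: dict[str, set[str]], n: int = 3) -> set[str]:
    """Find documents containing query, using the index to narrow candidates."""
    grams = ngrams(query, n)
    if grams:
        # A matching document must contain every ngram of the query.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Query shorter than n: the index cannot help, so scan everything.
        candidates = set(docs)
    # Sharing all ngrams is necessary but not sufficient, so verify each hit.
    return {doc_id for doc_id in candidates if query in docs[doc_id]}


docs = {
    "a.py": "def limits(x): return x",
    "b.py": "rate_limit = 10",
    "c.txt": "nothing here",
}
index = build_index(docs)
print(sorted(substring_search("limit", docs, index)))
print(sorted(substring_search("(x):", docs, index)))
```

The final verification step matters: the index only produces candidates, and a document can contain all of a query's ngrams without containing the query itself.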
GitHub builds its search index by analyzing 45 million repositories, amounting to 115TB of content across 15.5 billion documents, which is a daunting task. Luckily, explains Clem, there are two factors that make it possible to reduce the amount of work to do: using Git blob object IDs to distribute unique documents evenly across shards, and the fact that GitHub hosts a lot of duplicate content.
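The following sketch shows why blob object IDs help with both factors. A Git blob ID is the SHA-1 hash of a short header plus the file content, so identical files in different repositories share one ID and need to be indexed only once. How GitHub actually maps IDs to shards is not described here; the modulo scheme and the names `shard_for` and `assign` are assumptions for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the Git blob object ID: SHA-1 over 'blob <size>\\0' + content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


def shard_for(blob_id: str, num_shards: int) -> int:
    """Pick a shard from the ID (assumed scheme: hash value modulo shard count)."""
    return int(blob_id, 16) % num_shards


def assign(files: dict[str, bytes], num_shards: int) -> dict[int, dict[str, list[str]]]:
    """Group file paths by blob ID within each shard, so duplicates collapse."""
    shards: dict[int, dict[str, list[str]]] = {i: {} for i in range(num_shards)}
    for path, content in files.items():
        blob_id = git_blob_id(content)
        shards[shard_for(blob_id, num_shards)].setdefault(blob_id, []).append(path)
    return shards


files = {
    "repo1/LICENSE": b"MIT",
    "repo2/LICENSE": b"MIT",  # same content, so same blob ID
    "repo1/main.py": b"print('hi')\n",
}
shards = assign(files, num_shards=4)
unique_blobs = sum(len(blobs) for blobs in shards.values())
print(f"{len(files)} files, {unique_blobs} unique blobs to index")
```

Because SHA-1 output is effectively uniform, content-addressed assignment spreads documents evenly without any coordination between shards.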
When a new query is received, it is parsed into an abstract syntax tree and transformed into n concurrent requests sent to distinct shards in the search cluster. The shards carry out low-level processing such as translating regex into substring queries on the ngram indices. Finally, shards return their results to the query service, which aggregates them and selects the top 100.
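The fan-out and aggregation step can be sketched as a simple scatter-gather. This is a toy model under stated assumptions: shards are in-memory dictionaries, query parsing is skipped, and the scoring rule (occurrence count) and the names `search_shard` and `search` are invented for the example.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

Hit = tuple[float, str]  # (score, document ID)


def search_shard(shard: dict[str, str], term: str, limit: int) -> list[Hit]:
    """Toy per-shard search: score each matching document by occurrence count."""
    hits = [(float(text.count(term)), doc_id)
            for doc_id, text in shard.items() if term in text]
    # Each shard returns only its own best results.
    return heapq.nlargest(limit, hits)


def search(shards: list[dict[str, str]], term: str, limit: int = 100) -> list[Hit]:
    """Query all shards concurrently, then merge and keep the overall top hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term, limit), shards)
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(limit, merged)


shards = [{"a": "foo foo bar"}, {"b": "foo"}, {"c": "bar"}]
print(search(shards, "foo"))
```

Since every shard returns at most `limit` results, the aggregator never has to merge more than `limit` times the number of shards, regardless of how many documents matched.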
Our p99 response times from individual shards are on the order of 100 ms, but total response times are a bit longer due to aggregating responses, checking permissions, and things like syntax highlighting. A query ties up a single CPU core on the index server for that 100 ms, so our 64 core hosts have an upper bound of something like 640 queries per second.
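The 640 figure follows from simple arithmetic: if each query occupies one core for about 100 ms, a core can serve roughly 10 queries per second, and 64 cores can serve 64 times that. The helper below is a hypothetical name used only to spell out the calculation.

```python
def max_queries_per_second(cores: int, core_time_per_query_ms: float) -> float:
    """Upper bound on throughput when each query ties up one core for a fixed time."""
    queries_per_core_per_second = 1000.0 / core_time_per_query_ms
    return cores * queries_per_core_per_second


print(max_queries_per_second(cores=64, core_time_per_query_ms=100))
```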
Thanks to this approach, GitHub can re-index the entire repository corpus in about 18 hours. The overall index size is 25TB, which is roughly a quarter of the original data.
The new code search is available for free to all GitHub users.