I made a search engine, mostly out of curiosity about what such an undertaking might entail, and also to build my skills with Go.
Update: I ended up shutting down the search engine, after stumbling into altogether too many bugs with robots.txt and crawling pages that should have been ignored. This post remains as a historical record.
I’ve been coding in Go a bit lately, as it’s extremely well-suited to server-side workloads. This blog is served by a Go binary, which gives me full HTTP/2 support as well as room for further customizations as I see fit.
Integrating with systems like Prometheus to capture metrics for monitoring and statistical analysis (e.g. requests per second) has also been far smoother than my experience with similar systems in C++, and it has been a great opportunity to play with Grafana for graphing the metrics Prometheus scrapes.
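For a sense of what that looks like, here is a minimal sketch rather than the actual code behind this blog; the handler, metric name, and certificate paths are my own placeholders. Go’s net/http negotiates HTTP/2 automatically over TLS, and the official Prometheus client exposes a scrapeable /metrics endpoint in a few lines.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts every request served; Prometheus can turn this into
// a requests-per-second rate and Grafana can graph it.
var requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "blog_http_requests_total",
	Help: "Total HTTP requests served by the blog.",
})

// countRequests increments the counter before delegating to the real handler.
func countRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Inc()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	mux.Handle("/", countRequests(http.FileServer(http.Dir("./public"))))

	// ListenAndServeTLS enables HTTP/2 automatically for TLS connections.
	log.Fatal(http.ListenAndServeTLS(":443", "cert.pem", "key.pem", mux))
}
```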
The search engine is at https://search.ideasandcode.xyz/ and crawls take place every 6 hours.
The system builds on a couple of third-party libraries from the thriving Go ecosystem:
- https://github.com/temoto/robotstxt provides robots.txt handling, allowing my custom crawler to respect sites’ specifications for crawlers.
- https://github.com/robfig/cron lets the crawler run on a schedule while the process stays alive and keeps reporting metrics about crawls at all times.
And, of course, it also builds on the extremely powerful native support for HTTP (including HTTP/2) in the Go standard library to crawl the web.
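To give a feel for how those pieces fit together, here is a rough sketch of a scheduled, robots.txt-aware fetch; the function names, user agent, and seed URL are illustrative rather than the crawler’s actual code.

```go
package main

import (
	"io"
	"log"
	"net/http"

	"github.com/robfig/cron/v3"
	"github.com/temoto/robotstxt"
)

const userAgent = "my-crawler" // placeholder agent name

// allowedByRobots fetches a site's robots.txt and checks whether the
// given path may be crawled by our user agent.
func allowedByRobots(host, path string) bool {
	resp, err := http.Get("https://" + host + "/robots.txt")
	if err != nil {
		return false // be conservative if robots.txt is unreachable
	}
	defer resp.Body.Close()

	robots, err := robotstxt.FromResponse(resp)
	if err != nil {
		return false
	}
	return robots.TestAgent(path, userAgent)
}

// crawl fetches a single page if robots.txt allows it.
func crawl(host, path string) {
	if !allowedByRobots(host, path) {
		log.Printf("skipping %s%s: disallowed by robots.txt", host, path)
		return
	}
	resp, err := http.Get("https://" + host + path)
	if err != nil {
		log.Printf("fetch failed: %v", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	log.Printf("fetched %s%s (%d bytes)", host, path, len(body))
	// ... parse links, extract title/description, update the index ...
}

func main() {
	c := cron.New()
	// Run a crawl every six hours while the process stays alive.
	if _, err := c.AddFunc("@every 6h", func() { crawl("www.example.com", "/") }); err != nil {
		log.Fatal(err)
	}
	c.Start()
	select {} // block forever; metrics/HTTP serving would live alongside this
}
```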
Initially, this blog is the main starting point for crawls, but I would like to also crawl Lobsters and Hacker News to bring a fairly tech-news-heavy dataset into the index.
You can also explore some interesting data from the engine, including navigating the full index, inspecting the TLDs seen by the crawler, or even enumerating the hostnames seen during crawls.
These extra data pages do reveal some of the internal structure of the index.
URL paths, page descriptions, and titles are stored with full-text search indexes to enable querying.
Each entry in the index is associated with a single URL and stores the URL path as well as a link to both a TLD and a hostname record. The use of TLD and hostname records allows many URLs to share the same domain name and TLD (e.g. hostname “google” and TLD “com”) without duplicating either unnecessarily in the database.
This means that a 500-URL index takes around 200 KiB of storage, most of which is consumed by the full-text titles and descriptions from crawled webpages.
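The storage layer itself isn’t shown here, but the shape of those records might be sketched in Go roughly like this; the type and field names are illustrative, not the real schema.

```go
package index

// TLD is stored once per top-level domain, e.g. "com".
type TLD struct {
	ID   int64
	Name string
}

// Hostname is stored once per host, e.g. "google", and points at its TLD.
type Hostname struct {
	ID    int64
	Name  string
	TLDID int64
}

// Entry is one indexed URL: the path plus the full-text fields used for search.
type Entry struct {
	ID          int64
	HostnameID  int64 // many entries can share one hostname/TLD pair
	Path        string
	Title       string // full-text indexed
	Description string // full-text indexed
}
```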
It’s certainly been an interesting learning journey to get here. In particular, it has been fascinating to discover the subtle nuances of robots.txt, to crawl many URLs in a scalable way, and to build something useful from the ground up in Go.