About the HN Buddies Data
This app is a processed similarity index, not a complete HN username directory. It starts from public Hacker News comments, converts author comment histories into keyword vectors, and stores authors only when enough usable keyword signal remains for similarity search.
What Data Is Included
The current pipeline uses public HN comments from Jan 1, 2020 through May 31, 2026. Deleted comments, dead comments, comments without text, and rows without an author are excluded before keyword processing begins.
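The exclusion rules above can be sketched as a single row filter. This is an illustration only: the field names (`by`, `text`, `deleted`, `dead`, `time` as a Unix timestamp) and the use of UTC for the window boundaries are assumptions, not the pipeline's confirmed schema.

```python
from datetime import datetime, timezone

# Assumed window boundaries in UTC; the end bound is exclusive so that
# all of May 31, 2026 is included.
WINDOW_START = datetime(2020, 1, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2026, 6, 1, tzinfo=timezone.utc)


def is_usable(row: dict) -> bool:
    """Return True if a comment row should reach keyword processing."""
    if row.get("deleted") or row.get("dead"):
        return False  # deleted or dead comment
    if not (row.get("text") or "").strip():
        return False  # no comment text
    if not row.get("by"):
        return False  # no author
    if row.get("time") is None:
        return False  # cannot place the comment in the window
    posted = datetime.fromtimestamp(row["time"], tz=timezone.utc)
    return WINDOW_START <= posted < WINDOW_END
```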
How The Index Is Built
- Extract comment text by author. Comment HTML is stripped, text is lowercased, and candidate keyword tokens are extracted.
- Remove low-signal words. Common stopwords, filler words, URL fragments, and generic discussion terms are removed.
- Keep usable keywords. Keywords used by too few authors (too rare) or by too many authors (too broad) are filtered out before scoring.
- Build author keyword vectors. Remaining keywords are weighted with TF-IDF, then each author keeps their strongest keyword features.
- Find similar authors. Author pairs need enough shared usable keywords to receive a cosine similarity score.
- Store the strongest matches. The app stores a bounded set of top matches per author, with shared keywords kept for explanation.
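The steps above can be sketched end to end in plain Python. This is a minimal sketch, not the app's actual code: the stopword list, the regular expressions, the smoothed IDF formula, and the thresholds `min_df`, `max_df_ratio`, `top_k`, `min_shared`, and `top_n` are all hypothetical choices made for illustration.

```python
import html
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real list would be far larger.
STOPWORDS = {"the", "and", "that", "with", "for", "this", "you", "are"}
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"\b[a-z][a-z0-9]{2,}\b")


def tokenize(comment_html):
    """Strip HTML, lowercase, drop URLs and stopwords, return tokens."""
    text = html.unescape(TAG_RE.sub(" ", comment_html)).lower()
    text = URL_RE.sub(" ", text)
    return [t for t in TOKEN_RE.findall(text) if t not in STOPWORDS]


def build_vectors(comments_by_author, min_df=2, max_df_ratio=0.5, top_k=50):
    """Build TF-IDF keyword vectors, keeping each author's top_k features."""
    counts = {a: Counter(t for c in cs for t in tokenize(c))
              for a, cs in comments_by_author.items()}
    n = len(counts)
    # Document frequency: number of authors who used each keyword.
    df = Counter(t for c in counts.values() for t in c)
    keep = {t for t, d in df.items() if d >= min_df and d / n <= max_df_ratio}
    vectors = {}
    for author, c in counts.items():
        # Smoothed IDF (an assumption) keeps every weight positive.
        weights = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, tf in c.items() if t in keep}
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        if top:  # authors with no usable keywords are not stored
            vectors[author] = dict(top)
    return vectors


def cosine(u, v):
    """Return (cosine similarity, sorted shared keywords) for two vectors."""
    shared = sorted(u.keys() & v.keys())
    dot = sum(u[t] * v[t] for t in shared)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return (dot / norm if norm else 0.0), shared


def top_matches(vectors, min_shared=2, top_n=10):
    """Score pairs with enough shared keywords; keep top_n per author."""
    matches = {a: [] for a in vectors}
    authors = sorted(vectors)
    for i, a in enumerate(authors):
        for b in authors[i + 1:]:
            score, shared = cosine(vectors[a], vectors[b])
            if len(shared) >= min_shared:
                matches[a].append((score, b, shared))
                matches[b].append((score, a, shared))
    return {a: sorted(m, key=lambda x: (-x[0], x[1]))[:top_n]
            for a, m in matches.items()}
```

The all-pairs loop is quadratic in the number of authors, which is fine for a demonstration but would need an inverted index or similar candidate pruning at the scale of real HN data.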
Why A User May Be Missing
- No comments in the source window. Authors with no public comments inside the exported date range are not represented.
- Not enough usable keyword signal. An author can have comments but still lose most terms to stopword, generic-word, rare-word, or broad-word filters.
- No qualifying keyword overlap. Similarity search depends on shared usable keywords. Authors without enough overlap may not appear in match results.
- No stored top match. The UI searches only the prepared similarity tables, so an author whose matches were all weak or filtered out may be absent from results even though the account exists on HN.
How To Read The Results
Similarity scores compare keyword profiles, not personal identity, reputation, or overall HN activity. Higher scores mean two authors used a more similar set of weighted keywords in the processed corpus. Shared keyword counts help separate stronger matches from scores based on only a small overlap.
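A small worked example shows why the shared keyword count matters alongside the score. The vectors and weights below are made up for illustration, and `cosine_with_overlap` is a hypothetical helper, not part of the app. In the app itself, pairs below the overlap threshold receive no score at all; the example only illustrates why such a threshold is useful.

```python
import math


def cosine_with_overlap(u, v):
    """Return (cosine similarity, number of shared keywords)."""
    shared = u.keys() & v.keys()
    dot = sum(u[t] * v[t] for t in shared)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return (dot / norm if norm else 0.0), len(shared)


# Two authors who each kept a single keyword: the score is maximal,
# but it rests on one shared term.
narrow = cosine_with_overlap({"kubernetes": 2.0}, {"kubernetes": 4.0})

# Two authors with broader profiles: a lower score, backed by four
# shared keywords.
broad = cosine_with_overlap(
    {"rust": 2.0, "compiler": 1.0, "llvm": 1.0, "wasm": 1.0},
    {"rust": 1.0, "compiler": 2.0, "llvm": 1.0, "wasm": 0.5, "python": 2.0},
)

print(narrow)  # highest possible score, one shared keyword
print(broad)   # lower score, four shared keywords
```

The narrow pair outranks the broad pair on score alone, even though the broad pair is the better-supported match, which is why the two numbers are best read together.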