I work in rail vehicle engineering consulting. It’s a small industry, with maybe a thousand people in the US who do this kind of work. Knowing who’s who is useful when staffing a proposal, figuring out who the competition is fielding, or looking for staff.

I used to keep track of this stuff manually. Then I automated some of it via Obsidian and email. Since the industry is so small, I built a tool to do it systematically and pull in everyone.

What it does

collect.py takes a seed list of consultant names, firms, and projects from a YAML file. For each one, it calls the Claude and Perplexity APIs with web search enabled and asks them to find whatever’s publicly available. The results come back as structured markdown and get saved to disk.

Firms are processed first, because a firm search usually turns up individual names and project names worth following up on. Those get automatically queued for the current run and appended to the targets file for next time.
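Here is a minimal sketch of that firms-first ordering with a queue that grows during the run. The target shape (dicts with `kind` and `name`), the `KIND_ORDER` table, and the `search` callback are all assumptions for illustration, not the actual collect.py internals.

```python
from collections import deque

# Hypothetical processing order: firms before projects and people.
KIND_ORDER = {"firm": 0, "project": 1, "person": 2}


def run_queue(targets, search):
    """Process targets firms-first.

    `search` is a stand-in for the real API calls: it takes one target
    and returns any newly discovered targets as a list of dicts.
    """
    # sorted() is stable, so targets of the same kind keep their file order.
    queue = deque(sorted(targets, key=lambda t: KIND_ORDER[t["kind"]]))
    seen = {(t["kind"], t["name"]) for t in queue}
    processed = []
    while queue:
        target = queue.popleft()
        processed.append(target)
        for found in search(target):
            key = (found["kind"], found["name"])
            if key not in seen:  # skip anything already queued or done
                seen.add(key)
                queue.append(found)
    return processed
```

Anything in the returned list that was not in the original seed list is a discovery from this run, which is what would get appended to the targets file.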

The whole pipeline is: read targets → check if we’ve searched each one recently → Perplexity web search → Claude web search → write markdown → scan for PDF links → repeat.

By default, nothing gets re-searched within 30 days.
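The recency check could look something like the sketch below. It assumes each target has a known last-searched timestamp (for example, taken from the markdown file's modification time); the function name and the `max_age_days` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def needs_search(last_searched, now=None, max_age_days=30):
    """Return True if a target has never been searched or its result is stale.

    `last_searched` is a timezone-aware datetime, or None if the target
    has no saved result yet.
    """
    if last_searched is None:
        return True
    now = now or datetime.now(timezone.utc)
    return now - last_searched >= timedelta(days=max_age_days)
```

Passing `now` explicitly keeps the function easy to test, and a smaller `max_age_days` would force a refresh sooner.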

PDF parsing

A lot of information lives in PDFs: conference programs, committee rosters, and procurement documents. The tool scans every markdown file it writes for PDF links in the Sources section. When it finds one, it fetches the PDF, extracts the text with pypdf, and sends it to Claude (without web search this time) to identify anyone who looks like a rail vehicle consultant. New names get added to the YAML queue.
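The link scan can be sketched with a couple of regular expressions. This assumes the Sources section is introduced by a markdown heading named "Sources" and runs until the next heading, which may not match the real output format exactly. Fetching the PDF and the pypdf text extraction are left out.

```python
import re

# Assumed layout: a markdown heading called "Sources", ended by the next heading.
SOURCES_HEADING = re.compile(r"^#{1,6}[ \t]*Sources[ \t]*$", re.IGNORECASE | re.MULTILINE)
NEXT_HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)
URL = re.compile(r"https?://[^\s)\]>\"']+")


def pdf_links(markdown):
    """Return PDF URLs found in the Sources section, in order, without duplicates."""
    match = SOURCES_HEADING.search(markdown)
    if not match:
        return []
    rest = markdown[match.end():]
    following = NEXT_HEADING.search(rest)
    section = rest[:following.start()] if following else rest
    links = []
    for url in URL.findall(section):
        url = url.rstrip(".,;")  # drop trailing sentence punctuation
        # Ignore any query string or fragment when checking the extension.
        path = url.split("?", 1)[0].split("#", 1)[0]
        if path.lower().endswith(".pdf") and url not in links:
            links.append(url)
    return links
```

Matching on the file extension misses PDFs served from URLs without a `.pdf` suffix; checking the response's content type after fetching would catch those.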

There’s a Wayback Machine fallback for PDFs that have gone offline.
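One way to implement a fallback like this is the Internet Archive's availability endpoint, which returns the closest archived snapshot for a URL as JSON. Whether collect.py uses this endpoint or builds Wayback URLs some other way is an assumption here, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(pdf_url):
    """Build the availability query for a dead PDF link."""
    return f"{AVAILABILITY_API}?{urlencode({'url': pdf_url})}"


def snapshot_url(payload):
    """Pull the closest snapshot URL out of the decoded JSON response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available") and closest.get("url"):
        return closest["url"]
    return None


def find_snapshot(pdf_url, timeout=30):
    """Ask the Wayback Machine for an archived copy (makes a network call)."""
    with urlopen(availability_url(pdf_url), timeout=timeout) as response:
        return snapshot_url(json.load(response))
```

The caller would try the original URL first and only call `find_snapshot` after a failed fetch, then download from the returned snapshot URL if there is one.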

What I actually use it for

Competitive intelligence, mostly. If a transit agency puts out an RFP for vehicle procurement consulting, I want to know who’s likely bidding and who they’d staff. If a name comes up in a project meeting that I don’t recognize, I can add it and have their info in a few minutes.

Data format

Everything is markdown in a data/ directory with subdirectories for resumes, projects, and firms.
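As a rough sketch of how a target could map to a file in that layout: the three subdirectory names come from the description above, while the `kind` labels, the assumption that people go under resumes, and the slug rule for filenames are all hypothetical.

```python
import re
from pathlib import Path

# Assumed mapping from target kind to subdirectory under data/.
SUBDIRS = {"person": "resumes", "project": "projects", "firm": "firms"}


def slugify(name):
    """Lowercase a name and collapse anything non-alphanumeric into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def output_path(kind, name, root="data"):
    """Return the markdown path for a target, e.g. data/firms/<slug>.md."""
    return Path(root) / SUBDIRS[kind] / f"{slugify(name)}.md"
```

For example, `output_path("firm", "Acme Rail, Inc.")` gives `data/firms/acme-rail-inc.md`. Plain files like this stay easy to grep and to open in a notes tool such as Obsidian.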