Hacker News

A library to easily scrape metadata from an article on the web

by Sharma on 5/21/2020, 4:27:06 AM with 1 comment

by bryanrasmussen on 5/21/2020, 4:43:28 AM
Hah, I recently had a project with a (non-technical) ex-cofounder who wanted me to make a scraper for some sites.
He said he wanted to scrape the metadata, and he even suggested this library. I could hardly see what I needed to do, or why his current programmers couldn't do it.
After going back and forth for what seemed like forever, it turned out he didn't want metadata at all. When he used terms like "title" he meant an h1 at a particular position on some pages and a div on other pages, and "description" was a div sibling of the h1 on site 1 but a span with a randomly generated id on another site, and so on.
It was easy enough to do in the end. As is often the case, the difficulty was in communication, but it did highlight one point: generally, the metadata of a modern web page is not that interesting to scrape.
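
For anyone curious what "title means an h1 here and a div there" looks like in practice, here is a minimal sketch of per-site extraction rules using only Python's standard library. Everything specific in it is an assumption made up for illustration: the site names, the class names (`summary`, `headline`, `blurb`), and the `FirstMatch` and `extract` helpers are hypothetical, and matching by class is a simplification of the sibling and random-id cases described above, not what the actual project did.

```python
from html.parser import HTMLParser


class FirstMatch(HTMLParser):
    """Collect the text of the first element matching a tag and optional class."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # > 0 while we are inside the matched element
        self.done = False   # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Track nested elements of the same tag so we close at the right one.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (self.css_class is None or self.css_class in classes):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Hypothetical per-site rules: field name -> (tag, class or None).
SITE_RULES = {
    "site1.example": {"title": ("h1", None), "description": ("div", "summary")},
    "site2.example": {"title": ("div", "headline"), "description": ("span", "blurb")},
}


def extract(html, site):
    """Apply the rules for one site and return a dict of field -> text."""
    fields = {}
    for field, (tag, css_class) in SITE_RULES[site].items():
        parser = FirstMatch(tag, css_class)
        parser.feed(html)
        parser.close()
        # Join the collected text and collapse runs of whitespace.
        fields[field] = " ".join("".join(parser.parts).split())
    return fields
```

The point of the rules table is that the same field name maps to a different element on each site, which is exactly what a generic metadata library (reading `<title>` or `<meta>` tags) would not capture.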