Ask HN: Scripts/commands for extracting URL article text? (links -dump but)

  • > I imagine it starts with "links -dump", but then there's using the title as the filename,

    The title tag may exceed the filename length limit, be the same for nested pages, or contain newlines that must be escaped.
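
    A rough sanitizer along these lines may help (just a sketch; the 120-character cap is an arbitrary margin under the usual 255-byte filesystem limit):

      import re

      def title_to_filename(title, max_len=120):
          # drop newlines and characters that are unsafe in filenames
          name = re.sub(r'[\r\n/\\:*?"<>|]+', ' ', title).strip()
          name = re.sub(r'\s+', '-', name).lower()
          # stay well under typical filesystem filename limits
          return name[:max_len] or 'untitled'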

    These might be helpful for your use case:

    "Newspaper3k: Article scraping & curation" https://github.com/codelucas/newspaper

    lazyNLP "Library to scrape and clean web pages to create massive datasets" https://github.com/chiphuyen/lazynlp/blob/master/README.md#s...

    scrapinghub/extruct https://github.com/scrapinghub/extruct

    > extruct is a library for extracting embedded metadata from HTML markup.

    > It also has a built-in HTTP server to test its output as JSON.

    > Currently, extruct supports:

    > - W3C's HTML Microdata

    > - embedded JSON-LD

    > - Microformat via mf2py

    > - Facebook's Open Graph

    > - (experimental) RDFa via rdflib
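
    For pulling that metadata out, extruct's basic usage is a call to extruct.extract (a minimal sketch, assuming requests and w3lib are also installed; the URL is a placeholder):

      import requests
      import extruct
      from w3lib.html import get_base_url

      url = 'https://example.com/some-article'
      resp = requests.get(url)
      # extract() returns a dict of metadata keyed by syntax
      # (microdata, json-ld, opengraph, microformat, rdfa)
      data = extruct.extract(resp.text, base_url=get_base_url(resp.text, url))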

  • Just for the record in case anyone digs this up on a later Google search: install the newspaper3k and unidecode Python libraries (pip3 install); re is part of the standard library. Then:

      from os.path import expanduser
      from sys import argv
      from unidecode import unidecode
      from newspaper import Article
      import re

      script, arturl = argv

      # download and parse the article
      article = Article(arturl)
      article.download()
      article.parse()

      # transliterate the title to plain ASCII and slugify it for the filename
      title2 = unidecode(article.title)
      fname2 = title2.lower()
      fname2 = re.sub(r"[^\w\s]", '', fname2)
      fname2 = re.sub(r"\s+", '-', fname2)

      # collapse runs of blank lines in the body text
      text2 = unidecode(article.text)
      text2 = re.sub(r'\n\s*\n', '\n\n', text2)

      # open() does not expand '~', so expand it explicitly
      with open(expanduser('~/Desktop/') + fname2 + '.txt', 'w') as f:
          f.write(title2 + '\n\n')
          f.write(text2 + '\n')
    
    I execute it from the shell via:

      #!/bin/bash
      /usr/local/opt/python3/Frameworks/Python.framework/Versions/3.7/bin/python3 ~/bin/url2txt.py "$1"
    
    
    If I want to run it on all the URLs in a text file:

      #!/bin/bash
      while IFS='' read -r l || [ -n "$l" ]; do
        ~/bin/u2t "$l"
      done < "$1"
      
    I'm sure most of the coders here are wincing at one or more mistakes or badly formatted things I've done here, but I'm open to feedback ...

  • I don't know of a specific script, but you might be able to make something in Python using the requests, beautifulsoup4, and markdownify modules.

    Use requests to fetch the page, BeautifulSoup to grab the tags you care about (title info), and markdownify to turn the raw HTML into Markdown.
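
    A rough sketch of that pipeline (assuming pip3 install requests beautifulsoup4 markdownify; the URL is a placeholder):

      import requests
      from bs4 import BeautifulSoup
      from markdownify import markdownify

      url = 'https://example.com/some-article'
      html = requests.get(url).text

      # grab the title, falling back if the page has none
      soup = BeautifulSoup(html, 'html.parser')
      title = soup.title.string.strip() if soup.title and soup.title.string else 'untitled'

      # convert the raw HTML to markdown
      md = markdownify(html)

      # naive filename; see the title-sanitizing caveats in the first comment
      with open(title + '.md', 'w') as f:
          f.write(md)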