Ask HN: Is there a Hacker News takeout to export my comments / upvotes, etc.?

  • You can export the whole dataset as described here: https://github.com/ClickHouse/ClickHouse/issues/29693

    Or query one of the preloaded datasets: https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

        curl https://clickhouse.com/ | sh
    
        ./clickhouse client --host play.clickhouse.com --user play --secure --query "SELECT * FROM hackernews WHERE by = 'thyrox' ORDER BY time" --format JSON
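
    The play server also answers plain HTTP requests, so a rough Python equivalent of the query above might look like this. Treat it as a sketch: the `hackernews` table and `play` user come from the commands above, and the HTTP endpoint behavior is an assumption based on how the web UI talks to the server.

        import requests

        # FORMAT JSONEachRow emits one JSON object per row,
        # which is easy to stream and parse line by line.
        query = """
        SELECT *
        FROM hackernews
        WHERE by = 'thyrox'
        ORDER BY time
        FORMAT JSONEachRow
        """

        resp = requests.post('https://play.clickhouse.com/',
                             params={'user': 'play'}, data=query, timeout=60)
        resp.raise_for_status()
        for line in resp.text.splitlines():
            print(line)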

  • Here's a small, crude Scrapy spider, hardcoded values and all. You can set `DOWNLOAD_DELAY` in `settings.py` for courtesy (or pass it at run time; see the runner sketch after the code). It writes the comments into a `posts` directory as `.html` files.

    It doesn't handle upvotes or submitted stories/links (those have the type `story` in the response, whereas comments have the type `comment` and carry their body in a `text` field). You can easily tweak it.

      from pathlib import Path

      import scrapy
      import requests
      import html
      import json

      USER = 'Jugurtha'
      LINKS = f'https://hacker-news.firebaseio.com/v0/user/{USER}.json?print=pretty'
      BASE_URL = 'https://hacker-news.firebaseio.com/v0/item/'
      POSTS_DIR = Path('posts')

      class HNSpider(scrapy.Spider):
          name = "hn"

          def start_requests(self):
              # Make sure the output directory exists before anything is written.
              POSTS_DIR.mkdir(exist_ok=True)
              # One blocking call to fetch the ids of everything the user submitted.
              submitted = requests.get(LINKS).json()['submitted']
              urls = [f'{BASE_URL}{sub}.json?print=pretty' for sub in submitted]
              for url in urls:
                  item = url.split('/item/')[1].split('.json')[0]
                  filepath = POSTS_DIR / f'{item}.html'
                  if not filepath.exists():
                      yield scrapy.Request(url=url, callback=self.parse)
                  else:
                      self.log(f'Skipping already downloaded {url}')

          def parse(self, response):
              item = response.url.split('/item/')[1].split('.json')[0]
              filename = f'{item}.html'
              # Comments carry their body in the `text` field; stories don't.
              content = json.loads(response.text).get('text')
              if content is not None:
                  text = html.unescape(content)
                  with open(POSTS_DIR / filename, 'w') as f:
                      f.write(text)
                      self.log(f'Saved file {filename}')
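
    To run it without a full Scrapy project, a small runner can pass `DOWNLOAD_DELAY` programmatically instead of via `settings.py`. This assumes the spider above is saved as `hn_spider.py` (a hypothetical filename):

      # run_spider.py -- minimal runner; CrawlerProcess is Scrapy's standard
      # way to start a crawl from a plain script.
      from scrapy.crawler import CrawlerProcess

      from hn_spider import HNSpider

      process = CrawlerProcess(settings={'DOWNLOAD_DELAY': 2})  # be polite
      process.crawl(HNSpider)
      process.start()  # blocks until the crawl finishes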

  • I wrote a JS one years ago. It still seems to work but it might need some more throttling.

    https://news.ycombinator.com/item?id=34110624

    Edit: I see I added a sleep on line 83 a few years ago.

    Edit 2: I just fixed a big bug; I'm not sure if it was there before.

    Edit 3: I wrote a Python one, too, but I haven't tested it and it most likely needs throttling. It's also not authenticated, so it's only useful for public pages unless you add authentication.

    https://github.com/gabrielsroka/gabrielsroka.github.io/blob/...

  • There are a few tests for this script, which isn't packaged: https://github.com/westurner/dlhn/ https://github.com/westurner/dlhn/tree/master/tests https://github.com/westurner/hnlog/blob/master/Makefile

    Ctrl-F over the single document in a browser tab works, but it isn't regex search (or `grep -i -C`) without a browser extension.

    Dogsheep / Datasette has a SQLite query web UI.

    HackerNews/API: https://github.com/HackerNews/API
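
    For anyone wiring their own exporter against that API: it's plain JSON over HTTPS, no auth required. A minimal sketch (the username is just an example; any public user works):

      import requests

      V0 = 'https://hacker-news.firebaseio.com/v0'

      # A user record lists the ids of everything they submitted.
      user = requests.get(f'{V0}/user/whoishiring.json').json()

      # Fetch the most recent item and show what kind it is.
      item = requests.get(f"{V0}/item/{user['submitted'][0]}.json").json()
      print(item['type'], item.get('title') or item.get('text'))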

  • https://gist.github.com/verdverm/23aefb64ee981e17452e95dd5c4...

    Fetches pages and then converts them to JSON; a rough sketch of that general approach follows below.

    There might be an HN API now. I know they've wanted one, and I thought I'd seen posts more recently that made me think it now exists, but I haven't looked for it myself (the HackerNews/API repo linked in another comment above appears to be exactly that).
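
    A rough sketch of the page-scraping approach, not the gist's exact code. It assumes BeautifulSoup is installed and that HN still marks comment bodies with the `commtext` class:

      import json

      import requests
      from bs4 import BeautifulSoup

      # A user's comments live on their /threads page.
      page = requests.get('https://news.ycombinator.com/threads?id=thyrox').text
      soup = BeautifulSoup(page, 'html.parser')

      # Convert each comment body to a JSON record.
      records = [{'text': span.get_text(' ', strip=True)}
                 for span in soup.select('span.commtext')]
      print(json.dumps(records, indent=2))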

  • Nothing out of the box.

    There's a copy of the data in bigquery: https://console.cloud.google.com/bigquery?p=bigquery-public-...

    But the latest post is from Nov 2022; not sure if/when it gets reloaded. A query sketch follows below.
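
    A hedged sketch with the official client, assuming the public table is still `bigquery-public-data.hacker_news.full` and application-default credentials are configured:

      from google.cloud import bigquery

      client = bigquery.Client()
      # `by` is backticked because it's a reserved word in standard SQL.
      sql = """
          SELECT id, timestamp, text
          FROM `bigquery-public-data.hacker_news.full`
          WHERE `by` = 'thyrox'
          ORDER BY timestamp
      """
      for row in client.query(sql):
          print(dict(row))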

  • Partially. https://github.com/dogsheep/hacker-news-to-sqlite
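
    Per that project's README you run something like `hacker-news-to-sqlite user hacker-news.db <username>`, and the result is a plain SQLite file you can query directly. The `items` table name below is my recollection of what the tool creates, so treat it as an assumption:

      import sqlite3

      # Query the database produced by hacker-news-to-sqlite.
      con = sqlite3.connect('hacker-news.db')
      for (text,) in con.execute(
              "SELECT text FROM items WHERE type = 'comment' LIMIT 5"):
          print(text)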