Paperless: Scan and index paper documents

  • For me the problem is stability and future-proofness. Technology changes very quickly. If the maintainer loses interest, the software may rot away as the dependencies change, etc.

    Important documents often need to be stored for 5, 10, or 20 years. Why put everything in this shiny new software, when it may change in 1 or 2 years?

    I think it's best to just put scanned PDFs in folders based on year and topic. Those can be easily and transparently backed up and searched.

    On a timescale of a few months, though, this software could be useful.
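
    The year-and-topic folder layout suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `archive/<year>/<topic>/` layout and the helper names `file_scan` and `find_scans` are hypothetical, and the search matches path names only, not OCR text.

```python
from pathlib import Path
from typing import List
import shutil


def file_scan(pdf: Path, archive: Path, year: int, topic: str) -> Path:
    """Move a scanned PDF into archive/<year>/<topic>/ and return its new path.

    The layout is an assumption for illustration, not a standard.
    """
    dest_dir = archive / str(year) / topic
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / pdf.name
    if dest.exists():
        # Refuse to overwrite an already archived document.
        raise FileExistsError(dest)
    shutil.move(str(pdf), str(dest))
    return dest


def find_scans(archive: Path, keyword: str) -> List[Path]:
    """Return archived PDFs whose path under the archive contains keyword.

    Matching is case-insensitive and looks at folder and file names only.
    """
    needle = keyword.lower()
    return sorted(
        p for p in archive.rglob("*.pdf")
        if needle in str(p.relative_to(archive)).lower()
    )
```

    Because the result is plain files in plain folders, any backup tool or desktop search can work on it without knowing about this script.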

  • I have been doing something similar for a couple of years.

    My printer can scan to a shared drive on my home LAN, saving files as PDFs. These are then uploaded to Google Drive, where everything else happens automatically (e.g. if you search for something, it will find it in scanned PDFs automatically).

    It's super-useful, especially since the mobile clients for Drive are rock solid. I can be on the phone to someone and pull up basically any document I've had since the 90s in a couple of seconds, for free. It's kinda fun being on the phone to a call centre and being able to pull up data quicker than they can. Tax returns are an absolute doddle when everything is paperless.

    The only thing that is missing for me from Google Drive is something like a "Knowledge Graph" for my own documents. I can search by keyword or filename etc., sure, but I'd like to get some "intelligence" next, like we're used to with Google Now, but for my scanned docs: "show me my bank statements with a payment to Amazon in the last 3 months", etc.
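
    As a rough illustration of the kind of query described above, here is a minimal sketch, assuming each scanned document already carries a type label, a date, and its OCR text. The `ScannedDoc` record and `recent_matches` function are hypothetical names, not part of Google Drive or any other product, and "3 months" is approximated as 90 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class ScannedDoc:
    path: str
    kind: str         # hypothetical label, e.g. "bank statement"
    scanned_on: date  # date associated with the document
    text: str         # OCR text extracted from the scan


def recent_matches(docs: List[ScannedDoc], kind: str, keyword: str,
                   today: date, days: int = 90) -> List[ScannedDoc]:
    """Return docs of the given kind, dated within `days` of `today`,
    whose OCR text mentions `keyword` (case-insensitive)."""
    cutoff = today - timedelta(days=days)
    needle = keyword.lower()
    return [
        d for d in docs
        if d.kind == kind
        and d.scanned_on >= cutoff
        and needle in d.text.lower()
    ]
```

    A plain keyword filter like this is far from a knowledge graph: it would also match a statement that merely mentions the word, and it relies on something else having labelled and dated each document.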

  • If you don't want to buy a document scanner, just use your mobile phone for this.

    I personally use Scanbot for this; it automatically recognizes, crops and OCRs documents (on the device) and stores them as PDFs with the extracted text in the location of your choosing. Works well enough.

  • I've been using Evernote's Scannable for receipts and single pages. I had been using a scanner w/ ADF, but it was slow and I never automated it.

    Scannable works really fast and Evernote indexes PDFs.

    If only Evernote's editor didn't make me want to switch away every time I use it...

  • I use Google Docs for it. You can upload scanned documents to Google Docs. Documents are automatically OCRed, you can search by keywords and you can still access the original image.

    Disclaimer: I work at Google, although not on the Google Docs team.

  • Last time I checked, it was much cheaper to get a document scanner with ADF built together with a laser printer than to buy one standalone. I was quite surprised.

  • Nice combination of technologies to solve a problem -- could be very useful for a business that needs to be able to archive and access paper records.

    But for a household -- there are very few documents you need to keep long term. Better to just keep those in a fireproof file box, and shred and discard everything else rather than devote any resources or mental energy to keeping them around in either paper or digital form.

  • It would be nice if this joined forces with Camlistore to hurry up the Scanning Cabinet replacement :-)

  • I bought a high-speed scanner with OCR a few years ago. MacOS automatically indexes PDFs, so I can easily search through my scanned documents in Finder.

    A magic folder system, like Dropbox or Syncplicity, makes sure that the PDFs are safely backed up for me.

  • You can use Docady's scanner, which also does OCR and recognizes the documents' content. It then stores your documents and encrypts them. At the moment it's available on iOS, but it should be available soon on Android too.

    Demo: https://www.youtube.com/watch?v=cN_Zw6xoUaw

    App: https://itunes.apple.com/US/app/id921250909?mt=8

    (Full disclosure: I work at Docady and am part of its team.)

  • I've found Adobe Acrobat X to be great for OCR and indexing of PDFs. Nothing else I've used comes close to what it can recognise.

  • It seems somewhat ironic to me that someone built this whole paper-to-OCR system, and then says "hey, use it with a scanner like X", which has OCR capabilities (producing searchable PDFs) built in.

  • OP, great job. I have been trying to solve this very same problem for over a year now, and have a business plan based on it. Is there a way I can PM you and get some clarifications? Thanks.

  • Any chance of a wiki to group-collaborate on getting different scanners to work with this?

    I have an HP Envy I'd like to glue to the cloud.

  • No Dockerfile? I have a Dockerfile for handling webcam images sent by FTP: https://github.com/kaihendry/camftp2web/blob/master/Dockerfi...

  • This is all well and good if you're tolerant of occasional random errors coming out of the OCR process.

  • What about just taking a picture?