Ask HN: How to efficiently import Excel/CSV files

  • There is an old post by patio11 https://www.kalzumeus.com/2015/01/28/design-and-implementati... (HN https://news.ycombinator.com/item?id=8960280 | 107 points | Jan 28, 2015 | 35 comments) Perhaps it has something relevant.

  • Check out PowerShell. In my experience, it handles CSV files the best. Also, you can open the file using the Excel module, so you can do pretty much anything you need to with it.

    You will always get false positives, so what I do is dump the PS results into another database, run a comparison, and then import. It slows you down a bit, but I've noticed that every few months I still catch one or two new errors from data sanitization.
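    The stage-compare-import workflow described above can be sketched in Python. Everything here is hypothetical (table names, columns, and the in-memory SQLite standing in for whatever database you actually use); the point is the shape: stage raw text first, validate, then import only the rows that pass.

```python
import csv
import io
import sqlite3

# Invented sample data: one row that will fail validation.
CSV_DATA = "id,amount\n1,10.50\n2,not-a-number\n3,7.25\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id TEXT, amount TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# Stage: dump raw text into the staging table, no type coercion yet.
rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
conn.executemany("INSERT INTO staging VALUES (?, ?)",
                 [(r["id"], r["amount"]) for r in rows])

# Compare/validate: reject rows whose fields do not convert cleanly.
clean, rejected = [], []
for r in conn.execute("SELECT id, amount FROM staging"):
    try:
        clean.append((int(r[0]), float(r[1])))
    except ValueError:
        rejected.append(r)

# Import only the validated rows.
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)
print(len(clean), len(rejected))  # 2 1
```

    The rejected rows stay in staging, so they can be reported back to the client instead of silently disappearing.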

  • I would want to make sure your working version of Excel & its settings open the file exactly as your clients' setups do.

    This could be high-touch but it could also build more client trust and pay off more than other non-automatable efforts.

    Then first actually open each file in Excel.

    I would imagine many clients would expect you to be more advanced at Excel than they are anyway, so do not fall behind.

    That way you and the client always start on the same page, and you have the full power of Excel at your disposal, just as they do.

    Matching the columns is no longer on your plate, fait accompli.

    And a bad XLSX can be caught by Excel itself, without ever being subject to further incorrect processing.

    Maybe the data will be cleaner or better confirmed to begin with, and there might be fewer errors to send back in the first place.

    The spreadsheet itself is the database for that one file, in your client's familiar format: an opened XLSX.

    Rather than parsing, you could then consider reading the data directly from the cells with a straightforward macro that writes a binary or text file in your desired database format, or an intermediate CSV if you prefer.

    Excel's built-in macro language should still be capable of this, and I would expect you could also call native code written in other languages as an Excel macro or subroutine if desired.

    A PowerShell script should then be able to run this structured file conversion with a single click from a menu, against any target XLSX (or folder of XLSXs) that conforms to the client layout you have encoded into the custom macro converting to your proprietary file structure.

    Whether this counts as efficient, given the combined overhead of Windows and Excel, only you can decide.

    I used a common office PC for this process so I could encode modern data for use by antique scientific instruments, then sent the files to the equipment in their own obsolete format using the COM port.

    Only decoding the incoming proprietary data stream from the antique gear required a parser.
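    For anyone who would rather stay outside Excel, the read-the-cells-directly idea above can be sketched in Python with the third-party openpyxl package. The workbook contents here are invented for illustration; the example round-trips a tiny workbook in memory to stand in for a client file on disk:

```python
from io import BytesIO
from openpyxl import Workbook, load_workbook  # third-party: pip install openpyxl

# Build a tiny workbook in memory to stand in for a client file.
wb = Workbook()
ws = wb.active
ws.append(["sku", "qty"])  # header row, invented for illustration
ws.append(["A-1", 3])
ws.append(["B-2", 5])
buf = BytesIO()
wb.save(buf)
buf.seek(0)

# Read cell values directly -- no CSV round trip, no hand-written parser.
rows = list(load_workbook(buf).active.iter_rows(values_only=True))
header, data = rows[0], rows[1:]
print(header, data)  # ('sku', 'qty') [('A-1', 3), ('B-2', 5)]
```

    For a real client file you would pass the path to load_workbook instead of a BytesIO buffer.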

  • I've used the opencsv Java library on many occasions. It's pretty neat.

    http://opencsv.sourceforge.net/

  • Don't write the parser yourself. Whatever language you're using very likely already has one that's much better than what you can build and maintain by yourself.
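    To illustrate why: a naive split on commas falls over on quoted commas and embedded newlines, which any decent stock parser (Python's stdlib csv module here) already handles correctly:

```python
import csv
import io

# A record that breaks naive splitting: a quoted comma and an embedded newline.
raw = 'name,notes\n"Smith, Jane","line one\nline two"\n'

naive = raw.splitlines()[1].split(",")       # wrong: 3 mangled fields
parsed = list(csv.reader(io.StringIO(raw)))  # right: header + one 2-field row

print(naive)      # ['"Smith', ' Jane"', '"line one']
print(parsed[1])  # ['Smith, Jane', 'line one\nline two']
```

    Quoting rules, escape characters, and multi-line fields are exactly the edge cases a homegrown parser tends to get wrong.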

  • I import and export to Excel all the time. I never do any fancy tricks, like exporting formatting or formulas. I use C#, and there are libraries available to make life easy.

  • I’m not sure I understand the problem space here, but consider AWK. It really changed how I think about any problem involving tabular data parsing.

  • https://datafetcher.io/

    Python + Pandas would be my choice to develop something.

  • 1. Come to an agreed-upon format with clients 2. SSIS

  • pandas read_csv
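    One caveat worth knowing with read_csv: its type inference can silently mangle identifier-like columns. A defensive dtype=str keeps the values verbatim (column names here are invented for illustration):

```python
import io
import pandas as pd  # third-party: pip install pandas

# ZIP codes with leading zeros are a classic casualty of type inference.
raw = "zip,city\n00731,Adjuntas\n10001,New York\n"

df_default = pd.read_csv(io.StringIO(raw))          # zip inferred as int
df_safe = pd.read_csv(io.StringIO(raw), dtype=str)  # zip kept verbatim

print(df_default["zip"].tolist())  # [731, 10001]
print(df_safe["zip"].tolist())     # ['00731', '10001']
```

    Reading everything as strings first and converting types explicitly afterwards is the same stage-then-validate idea other commenters describe.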

  • What language are you using?