Ask HN: Extract Any Date Format

  • I don't know the fastest method, but you might search StackExchange [1] and test some of the answers there. I see examples of people converting dirty date inputs that would surely be faster than a heavy LLM, especially if the work is broken into batches and distributed across multiple machines according to core/thread counts.

    [1] - https://stackoverflow.com/questions/63371125/python-how-to-c...

  • > I am using this simple task to explore how LLM adaptation capabilities can be made performant for scalable extraction

    Divide and conquer:

    - look at the first 50 or so unhandled cases, and pick the most common pattern in them.

    - find a way to handle that pattern with a simple parser (e.g. a JVM DateTimeFormatter, a regex, or whatever works decently in your preferred language)

    - repeat

    That will probably cut that 10 million to a million, then to 100,000, fairly rapidly.

    Once you’re down to a manageable number, get your LLM to handle those.

    (Also: this task can likely be run in parallel easily, so if you have money, you won’t need 10M+ seconds)
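
    A minimal sketch of one pass of that loop, assuming Python and its standard datetime.strptime. The FORMATS list and the function names are made up for illustration; you would grow the list each time you look at the unhandled pile, and only whatever is still left at the end goes to the LLM:

```python
from datetime import date, datetime

# Hypothetical format list, most common pattern first. Extend it on each pass.
# Note: "%d/%m/%Y" assumes day-first input; "05/04/2023" is ambiguous and the
# right reading depends on where your data comes from.
# "%B" and "%b" match month names in the current locale (English by default).
FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y"]


def parse_with_formats(text):
    """Return a date if any known format matches the whole string, else None."""
    cleaned = text.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


def split_handled(raw_values):
    """Split inputs into (parsed, unhandled); unhandled feeds the next pass."""
    parsed, unhandled = {}, []
    for raw in raw_values:
        result = parse_with_formats(raw)
        if result is None:
            unhandled.append(raw)
        else:
            parsed[raw] = result
    return parsed, unhandled
```

    Each input is independent, so split_handled can be run on separate chunks of the data in separate processes or machines without any coordination.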

  • I would transform the most commonly occurring formats programmatically. The rest could probably be handled by GPT-3. Alternatively, divide the task between as many cloud VMs hosting something like LLaMa as it takes to fit your time constraints.
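
    One rough way to find out which formats occur most often, assuming Python: reduce each string to a "shape" (digits become 9, runs of letters become A) and count the shapes. The helper names are hypothetical, and the shape is only a heuristic, since it cannot tell day-first from month-first dates:

```python
import re
from collections import Counter


def shape(text):
    """Collapse a raw date string to a coarse pattern, e.g. '9999-99-99'."""
    s = re.sub(r"[A-Za-z]+", "A", text.strip())  # any word becomes a single A
    return re.sub(r"\d", "9", s)                 # every digit becomes 9


def most_common_shapes(values, top=5):
    """Return the `top` most frequent shapes with their counts."""
    return Counter(shape(v) for v in values).most_common(top)
```

    The shapes at the top of that list are the ones worth writing a dedicated parser for; the long tail is what you would hand to the model.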