I wonder how this would perform on the M4 Makridakis competitions (time series competitions)
https://github.com/Mcompetitions/M4-methods
https://en.wikipedia.org/wiki/Makridakis_Competitions
Makridakis' conclusion remained true for many years: "statistically sophisticated and complex methods do not necessarily provide more accurate forecasts than simpler ones."
Maybe things have changed?
(side: Nixtla showed a simple ensemble outperforming Chronos, and the Chronos team responded, but there's some back and forth in the comments: https://www.linkedin.com/pulse/extended-comparison-chronos-a...)
Look, I'm optimistic about time-series foundation models too, but this post is hard to take seriously when the test is so flawed:
- Forward-filling short periods of missing values. Why keep this in when you explicitly mention this is not normal? Either remove it all or don't impute anything
- Claiming superiority over classic models and then not including any in the results table
- And let's not forget the cardinal sin of using MAPE as an evaluation metric
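To make the MAPE complaint concrete, here is a minimal sketch with made-up numbers (not the post's data). It assumes a metric that sometimes sits near zero, which seems plausible for pod metrics but is my assumption. Two forecasts each miss by exactly 1.0 on one point; MAPE differs by two orders of magnitude depending on which point was missed, while a scaled metric such as MASE treats them the same. The function names are just illustrative.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, train, season=1):
    """Mean absolute scaled error: forecast MAE divided by the in-sample
    MAE of a (seasonal) naive forecast on the training series."""
    naive_errors = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


# Made-up numbers for illustration only.
train = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
actual = [0.1, 10.0, 12.0]

forecast_a = [1.1, 10.0, 12.0]  # off by 1.0 on the near-zero point
forecast_b = [0.1, 11.0, 12.0]  # off by 1.0 on an ordinary point

print(f"MAPE a: {mape(actual, forecast_a):.1f}%  b: {mape(actual, forecast_b):.1f}%")
print(f"MASE a: {mase(actual, forecast_a, train):.3f}  b: {mase(actual, forecast_b, train):.3f}")
```

Same absolute error, wildly different MAPE, so a ranking based on MAPE can end up dominated by whichever model happens to do well on the low-valued stretches.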
> Our dataset consisted of Kubernetes pod metrics collected from a production retail checkout application.
That sums it up, and it’s no surprise that Datadog’s Toto model, which was built for observability metrics, performed exceptionally well.
The results would have been much more useful had they opted for a heterogeneous mix of data sets. I am thinking of census data and statistics, or financial forecasting (GDP, interest rates), or clinical trial drop-out rates, etc. So many interesting problems out there.
I'd be curious what the results would be with AutoGluon's automated fit/eval pipeline. I suspect, given the results here, that a weighted average model would likely win out.
Interesting, what are the use cases you're using the models for? I'd like to know more about that, e.g. anomaly detection.
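For anyone unfamiliar with what a "weighted average model" means here, a rough sketch of the idea: weight each model's forecast by the inverse of its validation error. To be clear, this is a simple heuristic of my own choosing for illustration, not AutoGluon's actual ensembling procedure, and the model names and numbers are made up.

```python
def inverse_error_weights(val_errors):
    """Turn per-model validation errors into weights that sum to 1.
    Lower error means higher weight. Assumes all errors are > 0."""
    inverse = {name: 1.0 / err for name, err in val_errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def weighted_forecast(forecasts, weights):
    """Combine per-model forecasts (equal-length lists) step by step."""
    horizon = len(next(iter(forecasts.values())))
    return [
        sum(weights[name] * forecasts[name][t] for name in forecasts)
        for t in range(horizon)
    ]


# Hypothetical validation errors and forecasts for two models.
val_errors = {"model_a": 2.0, "model_b": 4.0}
forecasts = {"model_a": [10.0, 12.0], "model_b": [16.0, 18.0]}

weights = inverse_error_weights(val_errors)
combined = weighted_forecast(forecasts, weights)
print(weights)
print(combined)
```

The weights should be estimated on a validation window that is separate from the test window, otherwise the ensemble gets an unfair advantage in the comparison.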
I'm a bit confused by the results table. Were these models tested against the same dataset? Also, a visualization of the test data and forecasts would be helpful as well.
I think the concept of a "foundation model" for time series, as presented in this blog post, is a bit flawed. A foundation model is interesting because it is capable of many tasks _beyond the target tasks_ it was trained on. What the author is looking for is a time-series model that can make out-of-distribution predictions without re-training, which is, in my opinion, a problem that existing ARIMA and (especially) Prophet models already solve pretty well. (Yes, you have to re-fit the model on your distribution, but that is nothing like training or fine-tuning an LLM: it takes seconds on a modern CPU. And yes, some hyperparameters may need to be selected, but they are fairly minimal.)
But making out-of-distribution predictions does not make a model a foundation model for time series; that's just the basic task every time-series forecasting model performs. A more interesting question is whether an LLM architecture improves univariate or multivariate time-series prediction. I don't think the answer is yes. Depending on your domain, being able to feed language inputs to your model may help, and the best way to incorporate language inputs is certainly a transformer architecture, but that isn't what this post addresses.
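On the "re-fit on your distribution" point, a toy sketch of what per-series re-fitting looks like. I'm using simple exponential smoothing with a grid search over the smoothing parameter as a stand-in, since it fits in a few lines of plain Python; it is not ARIMA or Prophet, and the series and function names are made up for illustration.

```python
def ses_sse_and_level(series, alpha):
    """Run simple exponential smoothing over the series.
    Returns the sum of squared one-step-ahead errors and the final level."""
    level = series[0]
    sse = 0.0
    for y in series[1:]:
        sse += (y - level) ** 2                  # error of the forecast made before seeing y
        level = alpha * y + (1 - alpha) * level  # update the level with the new observation
    return sse, level


def fit_ses(series, grid=None):
    """'Re-fit' on one series: pick the alpha with the lowest in-sample SSE."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_alpha = min(grid, key=lambda a: ses_sse_and_level(series, a)[0])
    _, level = ses_sse_and_level(series, best_alpha)
    return best_alpha, level


# Made-up series with a level shift halfway through.
series = [5.0] * 10 + [9.0] * 10
alpha, level = fit_ses(series)
print(f"alpha={alpha:.2f}, next-step forecast={level:.3f}")
```

The whole "training" step is a loop over 19 candidate values, which is the sense in which classical re-fitting is cheap compared with fine-tuning a large model.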