I have a couple of questions for clarification:
1. When you say "thousands of inserts/updates per second," do you mean thousands of both? Or thousands of inserts and a much smaller number of updates? It makes a difference, because an update has to find the existing record first, and some data stores (like Hadoop's HDFS) are essentially append-only.
2. Are these "aggregations" going to be run on whole database tables at once (like the top score of each game), or on small chunks of them (like an individual player's scores)? For small chunks, strategies like indexing that speed up queries are essential, but for big batch processes, the MapReduce algorithm is the way to go.
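To make that distinction concrete, here's a minimal map/reduce sketch in plain Python. The event fields (`game`, `player`, `score`) are made up for illustration, but the shape is exactly what a Spark or Hadoop job does at scale: a map step emits (key, value) pairs, and a reduce step folds them into one aggregate per key.

```python
from collections import defaultdict

# Hypothetical score events -- field names are illustrative only.
events = [
    {"game": "chess",    "player": "alice", "score": 1200},
    {"game": "chess",    "player": "bob",   "score": 1450},
    {"game": "checkers", "player": "alice", "score": 300},
]

# Map: emit a (game, score) pair from every event.
pairs = ((e["game"], e["score"]) for e in events)

# Reduce: fold the pairs into the top score per game.
top_score = defaultdict(int)
for game, score in pairs:
    top_score[game] = max(top_score[game], score)

print(dict(top_score))  # {'chess': 1450, 'checkers': 300}
```

Note the trade-off: an indexed query can answer "all of alice's scores" without touching anyone else's rows, while the map/reduce pass above scans every row, which is fine for batch analytics but wasteful for per-player lookups.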
I'm going to give my two cents based on a couple of alternative scenarios. In the first, you're capturing gamers' scores/achievements/whatever, and it's really inserts only: no updates or deletes except to correct errors. In that case you might go with a basic star schema implemented in a relational database, or something like BigTable if you need it faster. Read the Kimball book (The Data Warehouse Toolkit) if you don't know the model. It's optimized for aggregation: sums/averages/counts sliced, diced, and pivoted any way you want to see them. At the center of the star schema is a big "fact" table that records every event/transaction in a write-once-then-read-only manner, surrounded by small dimension tables (player, game, date) that you join against to slice the facts.
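For a concrete feel, here's a toy star schema using Python's built-in sqlite3. Table and column names are mine, purely for illustration; a real deployment would sit on Postgres, BigTable, or the like, but the pattern is the same: insert-only facts, and aggregations that join out to the dimensions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables: small, descriptive, rarely changed.
    CREATE TABLE dim_player (player_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_game   (game_id   INTEGER PRIMARY KEY, title TEXT);

    -- Fact table: one row per score event, write-once-then-read-only.
    CREATE TABLE fact_score (
        player_id INTEGER REFERENCES dim_player(player_id),
        game_id   INTEGER REFERENCES dim_game(game_id),
        score     INTEGER,
        scored_at TEXT
    );
""")
con.executemany("INSERT INTO dim_player VALUES (?, ?)",
                [(1, "alice"), (2, "bob")])
con.executemany("INSERT INTO dim_game VALUES (?, ?)",
                [(10, "chess")])
con.executemany("INSERT INTO fact_score VALUES (?, ?, ?, ?)",
                [(1, 10, 1200, "2024-01-01"),
                 (2, 10, 1450, "2024-01-02")])

# Aggregate the facts, sliced by a dimension: top score per game.
for row in con.execute("""
        SELECT g.title, MAX(f.score)
        FROM fact_score f JOIN dim_game g USING (game_id)
        GROUP BY g.title"""):
    print(row)  # ('chess', 1450)
```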
In the second scenario, you're storing gamers' saved games or something similar, and you really do need updates and deletes as well as inserts. Yet you want to run aggregations at the level of individual gamers and individual games. In that case I'd use a NoSQL database with very fast access and built-in redundancy, based on Amazon's Dynamo. Riak is based on that model, and Cassandra borrows Dynamo's distribution design too (paired with BigTable's data model). You could have two "tables," one keyed by player and one keyed by game, and insert each data point twice. That way you can get all the data for a game with one quick hash lookup, or all the data for a player, and run your aggregation at that level of detail. Really big system-wide analyses could still run over the whole database with MapReduce (you mentioned Spark; that would do it). A sketch of the dual-write pattern follows.
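Here's that dual-write idea as a minimal in-memory sketch. Plain dicts stand in for the two key spaces (in Riak or Cassandra these would be two buckets/column families), and `save_event`, `by_player`, and `by_game` are names I made up for illustration:

```python
from collections import defaultdict

# Two "tables", each a hash map from key to a list of events.
by_player = defaultdict(list)
by_game = defaultdict(list)

def save_event(player, game, payload):
    """Write the same event under both keys (the dual insert)."""
    event = {"player": player, "game": game, **payload}
    by_player[player].append(event)
    by_game[game].append(event)

save_event("alice", "chess",    {"score": 1200})
save_event("alice", "checkers", {"score": 300})
save_event("bob",   "chess",    {"score": 1450})

# One hash lookup fetches everything needed for a per-key aggregate.
alice_total = sum(e["score"] for e in by_player["alice"])  # 1500
chess_best = max(e["score"] for e in by_game["chess"])     # 1450
print(alice_total, chess_best)
```

The price of the dual write is extra storage and keeping the two copies consistent; the payoff is that each per-player or per-game aggregation touches exactly one key's data instead of scanning the whole store.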