smbrian
Joined in 2018
39 Karma
5 posts
(Replying to PARENT post)
Under the hood they are based on building immutable layers of data that are implicitly merged. Clones that share data are cheap (check out rocksdb-cloud, a RocksDB fork which adds this copy-on-write-esque snapshotting). When a key is overwritten, the old value gets lazily garbage-collected, but there are ways around that.
I haven't explored it for this use case, but it seems like it might work...
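To make the mechanism concrete, here's a minimal Python sketch of the idea -- immutable layers merged on read, and a clone that just shares the existing layers. Illustrative only; this is not rocksdb-cloud's actual API.

```python
# Minimal sketch of LSM-style immutable layers and copy-on-write clones.
# Illustrative only -- not rocksdb-cloud's API.

class LSMStore:
    TOMBSTONE = object()  # marker for deleted keys

    def __init__(self, layers=None):
        # layers[0] is the mutable memtable; the rest are immutable,
        # newest first. Overwritten values linger in older layers until
        # a compaction (garbage collection) merges them away.
        self.layers = layers if layers is not None else [{}]

    def put(self, key, value):
        self.layers[0][key] = value

    def delete(self, key):
        self.layers[0][key] = self.TOMBSTONE

    def get(self, key, default=None):
        # Implicit merge: the newest layer containing the key wins.
        for layer in self.layers:
            if key in layer:
                value = layer[key]
                return default if value is self.TOMBSTONE else value
        return default

    def freeze(self):
        # Seal the current memtable as an immutable layer.
        self.layers.insert(0, {})

    def clone(self):
        # Cheap copy-on-write-esque snapshot: the clone shares the
        # immutable layers and only gets its own empty memtable.
        self.freeze()
        return LSMStore([{}] + self.layers[1:])


store = LSMStore()
store.put("k", "v1")
snapshot = store.clone()
store.put("k", "v2")
print(store.get("k"), snapshot.get("k"))  # v2 v1
```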
(Replying to PARENT post)
Consider someone who analyzes medium-sized volumes of (slowly-changing) data -- OLAP, not OLTP. People who need to do this have two main alternatives:
* a columnar database (Redshift, Snowflake, BigQuery)
* a data lake architecture (Spark, Presto, Hive)
The latter can be slow and wasteful, because the data is stored in a form that allows very limited indexing. So imagine you want query speeds that only the former can deliver.
Traditional databases can be hugely wasteful for this use case -- space overhead due to no compression, slow inserts due to transactions. The best analytics databases are closed-source and come with vendor lock-in (there are very few good open-source column stores -- ClickHouse is one, DuckDB is another). Most solutions are multi-node, so they come with operational complexity. So DuckDB could fill a niche here -- data that's big enough to be unwieldy, but not big enough to need something like Redshift. It's analogous to the niche SQLite fills in the transactional database world.
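To give a feel for that niche, here's a rough sketch using DuckDB's Python API (the events.parquet file and its columns are made up): a single-process, columnar query straight over a Parquet file, with no cluster and no load step.

```python
# Rough sketch of the "SQLite for analytics" niche via DuckDB's Python API.
# The file events.parquet and its columns are hypothetical.
import duckdb

con = duckdb.connect()  # in-process, in-memory; no server to operate

# DuckDB scans the Parquet file directly and runs the aggregation with a
# columnar, vectorized engine.
rows = con.execute("""
    SELECT user_id, count(*) AS n_events, avg(duration_ms) AS avg_ms
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").fetchall()

for user_id, n_events, avg_ms in rows:
    print(user_id, n_events, avg_ms)
```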
(Replying to PARENT post)
This tutorial looks good and is well written.
(Replying to PARENT post)
But at that point, they also have a lot in common with linear models. Those also seem practical in that domain (though I have less experience here, tbh). And they're performant when trained with SGD + feature hashing, e.g. Vowpal Wabbit.
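For concreteness, here's a minimal sketch of that recipe with scikit-learn's FeatureHasher and SGDClassifier standing in for Vowpal Wabbit (the feature dicts below are made up):

```python
# Sketch of a linear model trained with SGD + hashed features,
# scikit-learn standing in for Vowpal Wabbit. The data is made up.
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

raw = [
    {"country": "US", "device": "mobile", "age": 34},
    {"country": "DE", "device": "desktop", "age": 51},
    {"country": "US", "device": "desktop", "age": 23},
    {"country": "FR", "device": "mobile", "age": 41},
]
y = [1, 0, 0, 1]

def to_features(row):
    # Categorical levels become one-hot-style "name=value" features;
    # numeric fields keep their magnitude.
    return {(f"{k}={v}" if isinstance(v, str) else k):
            (1.0 if isinstance(v, str) else float(v))
            for k, v in row.items()}

# Hash features into a fixed-width sparse vector: no vocabulary to fit
# or store, so training stays cheap and the model stays small.
hasher = FeatureHasher(n_features=2**18)
X = hasher.transform(to_features(r) for r in raw)

# A linear model (default hinge loss, i.e. a linear SVM) fit with SGD.
clf = SGDClassifier(max_iter=1000, tol=1e-3)
clf.fit(X, y)

new = {"country": "US", "device": "mobile", "age": 30}
print(clf.predict(hasher.transform([to_features(new)])))
```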
My beef with non-linear kernels and structured data is a longer discussion, but I find kernel methods for structured data (which is usually high-dimensional but low-rank -- lots of shared structure between features, and shared structure between the missingness of features) to be highly problematic.
(Replying to PARENT post)
They're the perfect blend of theoretically elegant and practically impractical. Training scales as O(n^3), serialized models are heavyweight, prediction is slow. They're like Gaussian Processes, except warped and without any principled way of choosing the kernel function. Applying them to structured data (mix of categorical & continuous features, missing values) is difficult. The hyperparameters are non-intuitive and tuning them is a black art.
GBMs/Random Forests are a better default choice, and far more performant. Even simpler than that, linear models & generalized linear models are my go-to most of the time. And if you genuinely need the extra predictiveness, deep learning seems like better bang for your buck right now. Fast.ai is a good resource if that's interesting to you.
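To illustrate the "better default" point, here's a rough sketch on synthetic data comparing an out-of-the-box random forest with an RBF-kernel SVM -- illustrative only, not a benchmark:

```python
# Rough comparison of defaults on synthetic data: random forest vs an
# RBF-kernel SVM. Illustrative only -- not a benchmark.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("random forest", RandomForestClassifier(random_state=0)),
                    ("RBF-kernel SVM", SVC())]:
    start = time.time()
    model.fit(X_train, y_train)
    print(f"{name}: accuracy={model.score(X_test, y_test):.3f}, "
          f"fit time={time.time() - start:.1f}s")
```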