smbrian

Joined in 2018 · 39 Karma · 5 posts (all comments)

(Replying to PARENT post)

Have you looked at LSM k-v stores (RocksDB being the obvious one)?

Under the hood, they're built from immutable, sorted layers of data that are merged implicitly. Clones that share data are cheap (check out rocksdb-cloud, a RocksDB fork that adds this kind of copy-on-write snapshotting). When you overwrite a key, the old value gets lazily garbage-collected, but there are ways around that.

I haven't explored it for this use case, but it seems like it might work...
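Not RocksDB's actual internals or API, but a toy Python sketch of the LSM idea described above -- immutable layers merged on read, newest value wins, and clones that are cheap because they just share existing layers (all class and variable names are made up for illustration):

```python
# Toy sketch of the LSM idea (not RocksDB's real API or internals):
# writes go to a mutable memtable, which gets frozen into an immutable
# layer; reads check newest data first; clones share the layer objects.

class ToyLSM:
    def __init__(self, layers=None):
        self.memtable = {}                 # mutable, in-memory writes
        self.layers = list(layers or [])   # immutable layers, newest first

    def put(self, key, value):
        self.memtable[key] = value

    def get(self, key):
        # Newest data wins: memtable first, then layers in order.
        if key in self.memtable:
            return self.memtable[key]
        for layer in self.layers:
            if key in layer:
                return layer[key]          # older values below are "garbage"
        return None

    def flush(self):
        # Freeze the memtable into a new immutable layer.
        if self.memtable:
            self.layers.insert(0, dict(self.memtable))
            self.memtable = {}

    def clone(self):
        # Cheap copy-on-write-style clone: shares the immutable layers and
        # diverges only through its own memtable / future flushes.
        self.flush()
        return ToyLSM(self.layers)


db = ToyLSM()
db.put("k", "v1")
db.flush()
snapshot = db.clone()   # shares the layer containing "v1"
db.put("k", "v2")       # the snapshot still sees "v1"
print(db.get("k"), snapshot.get("k"))  # v2 v1
```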

smbrian · 5y · 0 points · 0 replies

(Replying to PARENT post)

The pitch is that it's faster and more space-efficient, since column stores are far better for analytics than row stores. Here are some benchmarks that found a ~5-10x speedup: https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html

Consider someone who analyzes medium-sized volumes of (slowly changing) data -- OLAP, not OLTP. People who need to do this primarily have two alternatives:

* a columnar database (Redshift, Snowflake, BigQuery)

* a data lake architecture (Spark, Presto, Hive)

The latter can be slow and wasteful, because the data is stored in a form that allows very limited indexing. So imagine you want query speeds that require the former.

Traditional databases can be hugely wasteful for this use case -- space overhead due to no compression, slow inserts due to transactions. The best analytics databases are closed-source and come with vendor lock-in (there are very few good open-source column stores -- ClickHouse is one, DuckDB is another). Most solutions are multi-node, so they come with operational complexity. So DuckDB could fill a niche here -- data that's big enough to be unwieldy, but not big enough to need something like Redshift. It's analogous to the niche SQLite fills in the transactional database world.
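To make the "SQLite for analytics" comparison concrete, here's roughly what the embedded workflow looks like from Python -- a sketch only, with a made-up Parquet file name; the point is that it runs in-process and queries columnar files directly:

```python
import duckdb

# In-process, single-file analytics database -- the "SQLite for OLAP" niche.
con = duckdb.connect("analytics.duckdb")  # or duckdb.connect() for in-memory

# Query a Parquet file directly (columnar storage, no load step required).
# "events.parquet" is a made-up example file.
result = con.execute("""
    SELECT user_id, count(*) AS n_events
    FROM 'events.parquet'
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").fetchall()
print(result)
```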

smbrian · 5y · 0 points · 0 replies

(Replying to PARENT post)

Sorry, I should've been clearer! Beginner to ML? Stay away from SVMs.

This tutorial looks good and is well written.

smbrian · 5y · 0 points · 0 replies

(Replying to PARENT post)

Agreed -- text processing applications are the one area where linear SVMs are a natural fit. All their attributes complement the domain, and linear SVMs also have desirable performance characteristics.

But at that point, they also have a lot in common with plain linear models, which also seem practical in that domain (though I have less experience here, tbh) -- and performant, when trained with SGD + feature hashing, as in e.g. Vowpal Wabbit.
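A minimal scikit-learn sketch of that SGD + feature hashing combo (Vowpal Wabbit itself is a separate tool with its own CLI and input format; the tiny corpus here is made up):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, just to show the shape of the approach.
texts = ["great product, works well", "terrible, broke after a day",
         "love it", "waste of money"]
labels = [1, 0, 1, 0]

model = make_pipeline(
    # Feature hashing: fixed-size sparse features, no vocabulary to store.
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    # Linear model trained with SGD; loss="hinge" gives a linear SVM,
    # loss="log_loss" gives logistic regression.
    SGDClassifier(loss="hinge"),
)
model.fit(texts, labels)
print(model.predict(["works great", "total waste"]))
```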

My beef with non-linear kernels and structured data is a longer discussion, but I find kernel methods problematic for structured data, which is usually high-dimensional but low-rank -- lots of shared structure between features, and between the missingness patterns of features.

smbrian · 5y · 0 points · 0 replies

(Replying to PARENT post)

Stay away, in my opinion. I spent a year supporting an SVM in a production machine learning application, and it made me wish the ML research community hadn't been so in love with them for so long.

They're the perfect blend of theoretically elegant and practically impractical. Training scales as O(n^3), serialized models are heavyweight, prediction is slow. They're like Gaussian Processes, except warped and without any principled way of choosing the kernel function. Applying them to structured data (mix of categorical & continuous features, missing values) is difficult. The hyperparameters are non-intuitive and tuning them is a black art.

GBMs/Random Forests are a better default choice, and far more performant. Even simpler than that, linear models & generalized linear models are my go-to most of the time. And if you genuinely need the extra predictive power, deep learning seems like a better bang for your buck right now. Fast.ai is a good resource if that's interesting to you.
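For contrast, a minimal scikit-learn sketch of those suggested defaults -- a random forest next to a plain logistic-regression baseline, on sklearn's built-in breast cancer data just for illustration (swap in HistGradientBoostingClassifier if you want a GBM):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# A sensible default: tree ensembles need little tuning and cope well with
# features on very different scales.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# An even simpler baseline: a (generalized) linear model.
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

for name, model in [("random forest", forest), ("logistic regression", linear)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```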

smbrian · 5y · 0 points · 0 replies