Shoestring Budget? Starting to feel growth issues on your back-end? Embrace unix and C
👤scumola 🕑17y 🔼140 🗨️120

(Replying to PARENT post)

This may have been a fun and interesting project for you (and a great way to learn the unix utils), but I wouldn't recommend that other startups follow this path.

Rewriting something in C should be the last thing you do, not the first. The first thing you should do is find out why it's slow. In your case, it sounds as though you were fetching one URL at a time (blocking). Switching to async I/O alone could have fixed this.
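To make the non-blocking idea concrete, here is a rough sketch of a single-threaded fetcher driving several URLs at once with libcurl's multi interface. This is only an illustration of the approach, not what scumola actually ran; libcurl and the placeholder URLs are my own assumptions.

    /* Sketch: fetch many URLs concurrently from one thread with libcurl's
     * multi interface instead of blocking on one URL at a time.
     * Build with: cc fetch.c -lcurl */
    #include <stdio.h>
    #include <curl/curl.h>

    /* Discard the body; a real crawler would buffer or parse it here. */
    static size_t sink(char *data, size_t size, size_t nmemb, void *userp)
    {
        (void)data; (void)userp;
        return size * nmemb;
    }

    int main(void)
    {
        const char *urls[] = { "http://example.com/", "http://example.org/" };
        int i, n = (int)(sizeof urls / sizeof urls[0]);
        int still_running = 0;
        CURLM *multi;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        multi = curl_multi_init();

        for (i = 0; i < n; i++) {
            CURL *eh = curl_easy_init();
            curl_easy_setopt(eh, CURLOPT_URL, urls[i]);
            curl_easy_setopt(eh, CURLOPT_WRITEFUNCTION, sink);
            curl_multi_add_handle(multi, eh);  /* queued; nothing blocks yet */
        }

        do {
            curl_multi_perform(multi, &still_running);   /* advance every transfer */
            curl_multi_wait(multi, NULL, 0, 1000, NULL); /* sleep until a socket is ready */
        } while (still_running);

        /* Per-handle cleanup omitted for brevity. */
        curl_multi_cleanup(multi);
        curl_global_cleanup();
        return 0;
    }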

BTW, if you're using the GNU utilities, it's unlikely that they were written in the '60s and '70s (also, people were processing much smaller amounts of data back then).

👤paul 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Great post!

Please, HN, change the color of text in posts like the above. There is very little contrast between #828282 (copy) and #F6F6EF (background), and I for one am sick of having to fix this with Firebug.

👤thomasmallen 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

scumola, this should have been a blog post that you linked to. I'm not saying this because I don't think this is HN-quality material; it's actually an awesome little story to read. But if this were on a company blog somewhere, it would be generating juice for your company in addition to HN.

This is great news and a great way to spread the word about your service. Don't let that opportunity go to waste!

👤DanHulton 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

>The unix text utilities were written in the 60's and 70's when computers were 33mhz and had 5MB of ram.

I'm not positive (I was born in the '70s), but I'm pretty sure they had even less RAM and speed than that.

👤ConradHex 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Okay, there is a lot of noise in this thread, so this needs to be clearly stated:

Good job, scumola.

It's impressive that you diagnosed the architectural bottleneck of your design and solved it with the least amount of effort (from your standpoint), achieving a 48x speedup. There are many developers who simply can't do that; they have tunnel vision and waste a ton of time improving portions of their systems without first thinking deeply about the problem they're trying to solve (and about simple ways to sidestep that problem).

In your case, you identified the correct problem, which was "How can I maximize the number of sites crawled per day?" and not "How can I optimize [the database, the perl scripts, etc.]?" And then you did the most straightforward optimization you could think of, accomplishing your goal in one or two nights. Your solution is valid, maintainable, and, most importantly, works, so I personally don't see anything wrong with it. Again, nice hack.

👤palish 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

This is one reason I like Python so much. Writing hooks into C/C++ is pretty easy, and often they are already there for you. For example, OpenCV is an excellent image-processing library in C++ that already has Python bindings.

Lots of companies take this approach of Python + C/C++. A few that come to mind are Google, Weta, ILM, iRobot, and Tipjoy.

👤ivankirigin 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Congrats!

If you haven't yet, check out _The Unix Programming Environment_ and _The Practice of Programming_, both by Rob Pike and Brian Kernighan (the K of K&R). They're concise, highly informative books about using the Unix toolset to its maximum potential. The former was written back when computers were slow and had little memory; the latter is from 1999 but very much in the same spirit. (It seems to include a lot of insights from developing Plan 9.)

Also, a dissenting opinion here: C's performance vs. higher level languages' development speed is not necessarily an either/or choice. Some languages (OCaml, the Chicken Scheme compiler, implementations of Common Lisp with type annotations or inference for optimizing, Haskell (under certain conditions...), others) can perform very favorably compared to C, but tend to be much, much easier to maintain and debug.

As a generalization, languages that let you pin down types are faster because type decisions are resolved once, at compile time; but if those decisions can be postponed until your program is already mostly worked out (or, better still, automatically inferred and checked for internal consistency), you can keep the overall design flexible while you're experimenting with it. Win/win.

Also (as I note in a comment below), Python can perform quite well when the program is small amounts of Python tying together calls to its standard library, much of which (e.g. string processing) is written in heavily optimized C.

Alternatively, you could embed Lua in your C and write the parts that don't need to be tuned (or the first draft of everything) in that.
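The host program can stay a small C shim around the Lua interpreter. A minimal sketch of that setup (the script name is made up, and how you link against Lua varies by system):

    /* Minimal sketch of embedding Lua: the C side owns the interpreter and
     * the hot code paths; the high-level logic lives in a Lua script.
     * Build with something like: cc host.c -llua -lm */
    #include <stdio.h>
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    int main(void)
    {
        lua_State *L = luaL_newstate();  /* create a fresh interpreter */
        luaL_openlibs(L);                /* load Lua's standard libraries */

        /* Run the high-level logic; "logic.lua" is a hypothetical name. */
        if (luaL_dofile(L, "logic.lua") != 0)
            fprintf(stderr, "lua error: %s\n", lua_tostring(L, -1));

        lua_close(L);
        return 0;
    }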

👤silentbicycle 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

> multi-threadded crawler in C

If your crawler is I/O-bound, then just wait till you discover epoll :)

Or, on a more general note, have a look at http://www.kegel.com/c10k.html
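For reference, the shape of an epoll-driven crawler loop is roughly this (a skeleton only; socket setup and HTTP handling are left out):

    /* Skeleton of an epoll loop: one thread multiplexing many
     * non-blocking sockets (Linux-specific). */
    #include <stdio.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    int main(void)
    {
        struct epoll_event events[MAX_EVENTS];
        int epfd = epoll_create(MAX_EVENTS);  /* one instance for all crawler sockets */
        if (epfd < 0) {
            perror("epoll_create");
            return 1;
        }

        /* For each connected, non-blocking socket fd:
         *     struct epoll_event ev;
         *     ev.events = EPOLLIN | EPOLLOUT;
         *     ev.data.fd = fd;
         *     epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
         */

        for (;;) {
            int i, n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* block until some fd is ready */
            for (i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                (void)fd;  /* read() the next chunk of a response, or write() a request */
            }
        }
    }

kqueue is the rough equivalent on the BSDs.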

👤huhtenberg 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Congratulations. That is an impressive speed up.

I agree about the built-in Unix utils. You just do not see people taking advantage of these powerful and extremely optimized programs any more. I wonder how many times grep or sort have been unwittingly rewritten in Perl or Ruby because the programmer lacked familiarity with basic Unix tools?

As for your crawler, I think the significant thing here is that you rewrote something in C after you already had it working in another language. Not to bag on C, but writing the original in a higher level language first gives you a better shot at correcting any bugs in the actual solution domain. Then if you move to C you're only fighting against C, not against C and bugs in your solution at the same time.

👤softbuilder 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Congrats on your big speedup; successful optimization like that is always a rush.

I wonder what the result would be if you did everything you describe, but wrote the code that's now in C, in Python instead. I suspect the speed would be very similar. (I like C, for what it's worth.)

👤ConradHex 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

"Starting to feel growth issues on your back-end?"

I think you'd better get a doctor to look at that.

👤hopeless 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Key to innovation: no funds, no Internet, a laptop and free time.
👤pjf 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

We had some views in our Rails app that could be hit several times per second by our users, and they were uncacheable, so we implemented the views in C++ using libpq and FastCGI. So much awesomely faster.

If you've got some code that works well and that you can solidify into C++ (i.e., you're no longer tweaking it three times a day), it's totally worth spending the time to rewrite.

I think C++ on Rails would be a great idea, i.e., some code generators to help you port your most-used .rhtml views over to C++.
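For what it's worth, the core of such a native view is small. Here's a sketch in plain C with the FastCGI dev kit (libfcgi); the path handling and HTML are made up, and the database call is only hinted at:

    /* Minimal FastCGI responder: the web server hands requests to this
     * persistent process instead of spawning an interpreter each time.
     * Build with: cc view.c -lfcgi */
    #include "fcgi_stdio.h"  /* wraps stdio so printf writes back to the web server */
    #include <stdlib.h>

    int main(void)
    {
        while (FCGI_Accept() >= 0) {              /* block until the next request arrives */
            char *path = getenv("PATH_INFO");     /* CGI-style variables are still available */

            printf("Content-Type: text/html\r\n\r\n");
            printf("<html><body><p>Rendered %s natively.</p></body></html>\n",
                   path ? path : "/");

            /* A real view would query the database here (e.g. via libpq)
             * and interpolate the results into the markup. */
        }
        return 0;
    }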

👤bnolan 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Very cool. I have had a similar experience with Mibbit...

Next step, get rid of those threads and use non blocking IO ;)

👤axod 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

If you enjoyed learning/using sed, awk, grep and other Unix text processing utils, you'll love this: http://borel.slu.edu/obair/ufp.pdf

It's called Unix for Poets, and it'll show you just how far you can really push these tools.

👤youngnh 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

What's your average page size?

Does your crawler support gzip compressed transfers?

What speed is your Comcast link?

👤natch 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

First, I hope you weren't using the obscenely slow version of Perl that ships with RedHat.

Second, for more performance analysis, there are some very good Unix tools for profiling and optimizing C code, many of them free.

👤lallysingh 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

What's interesting about this posting is that, unbeknownst to my co-founder, yesterday I was discussing some of the issues we were facing with a "retired" Unix programmer, and he talked about early sorting and searching methods. It was quite remarkable what they achieved with very little in the way of hardware and memory.
👤tyohn 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Perl threading was (and I'm sure still is) absolutely horrid.

I'd be interested in seeing the results of a rewrite not to C but to Python or Ruby, where the threading support is much, much better. Then you could rewrite functions one at a time in C as needed, but not have the extra burden of rewriting the whole thing.
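To illustrate that "one function at a time" path: the hot spot gets compiled into a small shared library and called from the scripting side (via ctypes, FFI, XS, or whatever the language offers). Everything here, down to the function name, is a hypothetical example rather than anything from the original post.

    /* hotspot.c -- one profiled hot spot moved to C, kept callable from a
     * higher-level language.  Build as a shared library, e.g.:
     *     cc -O2 -shared -fPIC -o libhotspot.so hotspot.c */
    #include <stddef.h>

    /* Count "href=" occurrences in a page buffer -- a stand-in for
     * whatever per-page work profiling shows to be the bottleneck. */
    int count_links(const char *html, size_t len)
    {
        int count = 0;
        size_t i;
        for (i = 0; i + 5 <= len; i++) {
            if (html[i] == 'h' && html[i + 1] == 'r' && html[i + 2] == 'e' &&
                html[i + 3] == 'f' && html[i + 4] == '=')
                count++;
        }
        return count;
    }

From Python, for instance, ctypes can load libhotspot.so and call count_links directly, so the rest of the crawler never notices the swap.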

I totally agree with the rest of the approach. Going low-tech and using Unix tools is a very good way to reduce overhead, increase parallelism, and delay calculations. One of the nice things about this approach is that you can cobble together another $50 Unix box to do some of the bulk processing via NFS or other means.

Congrats... It sounds like a very interesting project.

👤zenspider 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

> I'm still looking for ways to optimize things

Re-implement your crawler on an FPGA.

Just kidding :) It's great that you got the result you were expecting, but as Paul said, you have to profile before you optimize (otherwise you'll "optimize" useless stuff, and not only waste your time but also likely introduce counter-optimizations).

Anyway, your post was very interesting, because a lot of people assume they have to use layers on top of the OS, while modern Unix systems have good file systems and memory managers (maybe have a look at DragonFly BSD; they are going in an interesting direction).

👤corentin 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

I used an implementation of Cilk when I crawled ~1,000,000 HTML pages from a social network in 2 hours (can't disclose which -- the site went into maintenance during/after my experiments ... this was just to exercise my curiosity).

I had to be root to raise the user's maxproc limit (previous experiments locked me out --- "sh can't fork" messages ... I couldn't even ssh in).

It was done on a 100 Mbps, AMD 2000+, 256 MB, 40 GB IDE OpenBSD colo box ... but I don't think the hardware matters that much (Cilk is really the key).

👤hs 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Yes, I love it when things can be sped up with old-school techniques. Although I'm not sure C was really a great choice in case you need to scale up later.
👤pwoods 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Most startups that have scaling problems have them due to the number of customers, in which case it's trivial to either monetize or raise some money.
👤mattmaroon 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Congrats. I'm sure that's a good feeling.
👤donniefitz2 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

I once had a boss who refused to buy the programmers Pentium machines and made us use 486s instead. He said it wasn't because he was cheap, but so that if we wrote code that ran really fast on a 486, just think how fast it would run on a Pentium. Never mind that none of the new Pentium features, like SIMD, could be taken advantage of.

That company is no longer in business.

👤thwarted 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

It's true that utilities like grep, sort, and join are highly optimized. Many people have had wins replacing scripting logic with calls to these utilities. But I'm having some difficulty understanding the merits of putting the top-level logic in C and using shell scripts, or the alleged advantages of flat text files.

Shell scripts are hard to make robust. I don't understand why you wouldn't just drive the Unix utilities from Perl.

Why use Perl threads on Unix? Everyone knows they suck. Why not fork and use a transactional database for IPC?

I remain unconvinced by the alleged wins people have with flat-file solutions. It's usually about replacing a toy like MySQL or some convoluted BerkeleyDB mess. You can use DB2 for free with up to 2 GB of memory. Why waste even a minute replicating transactional RDBMS functionality by hand? As soon as you're dealing with flock and company when you COULD be using a DB, you should be, as far as I'm concerned.
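For context, this is roughly what "flock and company" means in practice: every writer has to cooperate on an advisory lock around each flat-file update, and you still get none of a database's atomicity or crash recovery. A hypothetical sketch:

    /* Appending one record to a shared flat file under an advisory lock. */
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/file.h>

    int append_record(const char *path, const char *line)
    {
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        if (flock(fd, LOCK_EX) != 0) {   /* exclusive lock; only helps if every writer asks */
            close(fd);
            return -1;
        }
        ssize_t n = write(fd, line, strlen(line));
        flock(fd, LOCK_UN);
        close(fd);
        return n < 0 ? -1 : 0;
    }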

👤kingkongrevenge 🕑17y 🔼0 🗨️0

(Replying to PARENT post)

Nice, but are your growth issues so bad that you couldn't make this a blog entry on mediawombat.com instead?
👤cdr 🕑17y 🔼0 🗨️0