(Replying to PARENT post)
Please, HN, change the color of text in posts like the above. There is very little contrast between #828282 (copy) and #F6F6EF (background), and I for one am sick of having to fix this with Firebug.
(Replying to PARENT post)
This is great news and a great way to spread the word about your service. Don't let that opportunity go to waste!
(Replying to PARENT post)
I'm not positive (I was born in the 70s), but I'm pretty sure they had less RAM and speed than this.
(Replying to PARENT post)
Good job, scumola.
It's impressive that you diagnosed the architectural bottleneck of your design and solved it with the least amount of effort (from your standpoint), achieving a 48x speedup. There are many developers who simply can't do that; they get tunnel vision and waste a ton of time improving parts of their systems without first thinking deeply about the problem they're trying to solve (and about simple ways to sidestep that problem).
In your case, you identified the correct problem, which was "How can I maximize the number of sites crawled per day?" and not "How can I optimize [the database, the perl scripts, etc.]?" Then you did the most straightforward optimization you could think of, accomplishing your goal in one or two nights. Your solution is valid, maintainable, and, most importantly, works, so I personally don't see anything wrong with it. Again, nice hack.
(Replying to PARENT post)
Lots of companies take this approach, of Python + C/C++. A few that come to mind are Google, Weta, ILM, iRobot, and Tipjoy.
(Replying to PARENT post)
If you haven't yet, check out _The Unix Programming Environment_ and _The Practice of Programming_, both by Rob Pike and Brian Kernighan (the K of K&R). They're concise, highly informative books about using the Unix toolset to its full potential. The former was written back when computers were slow and had little memory; the latter is from 1999 but very much in the same spirit. (It seems to include a lot of insights from developing Plan 9.)
Also, a dissenting opinion here: C's performance vs. higher level languages' development speed is not necessarily an either/or choice. Some languages (OCaml, the Chicken Scheme compiler, implementations of Common Lisp with type annotations or inference for optimizing, Haskell (under certain conditions...), others) can perform very favorably compared to C, but tend to be much, much easier to maintain and debug.
As a generalization, languages that let you pin down types are faster because the compiler can resolve those decisions once, at compile time, instead of checking them repeatedly at run time. But if those decisions can be postponed until your program is already mostly worked out (or, better still, automatically inferred and checked for internal consistency), you can keep the overall design flexible while you're experimenting with it. Win/win.
Also (as I note in a comment below), Python can perform quite well when the program is a small amount of Python tying together calls to its standard library, much of which (e.g. string processing) is written in heavily optimized C.
Alternately, you could embed Lua in your C and write the parts that don't need to be tuned (or the first draft of everything) in that.
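A minimal sketch of the embedding idea above, assuming Lua 5.x and a made-up script name; the frequently tweaked logic lives in the script while the hot paths stay in C:

    /* host.c: run the high-level crawler logic from a Lua script. */
    #include <stdio.h>
    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    int main(void) {
        lua_State *L = luaL_newstate();   /* fresh Lua interpreter */
        luaL_openlibs(L);                 /* load the standard Lua libraries */

        /* "crawl_logic.lua" is a hypothetical script holding the parts
           that don't need to be tuned (or the first draft of everything). */
        if (luaL_dofile(L, "crawl_logic.lua") != 0)
            fprintf(stderr, "lua error: %s\n", lua_tostring(L, -1));

        lua_close(L);
        return 0;
    }

Build with something like cc host.c -llua -lm (the exact link flags vary by distro).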
(Replying to PARENT post)
If your crawler is I/O-bound, then just wait till you discover epoll :)
Or, on a more general note, have a look at http://www.kegel.com/c10k.html
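A minimal, Linux-only epoll sketch in the event-loop style the C10K page describes; the non-blocking socket is assumed to be set up elsewhere:

    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    void event_loop(int sock_fd) {
        int epfd = epoll_create1(0);        /* one epoll instance for all sockets */
        struct epoll_event ev = {0}, events[MAX_EVENTS];

        ev.events = EPOLLIN;                /* wake up when the socket is readable */
        ev.data.fd = sock_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

        for (;;) {
            int n = epoll_wait(epfd, events, MAX_EVENTS, -1);  /* sleep until I/O is ready */
            for (int i = 0; i < n; i++) {
                /* events[i].data.fd is readable: consume the response,
                   then register the next URL's socket the same way. */
            }
        }
    }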
(Replying to PARENT post)
I agree about the built-in Unix utils. You just don't see people taking advantage of these powerful and extremely optimized programs any more. I wonder how many times grep or sort have been unwittingly rewritten in Perl or Ruby because the programmer lacked familiarity with basic Unix tools? (A small sketch of the alternative is below.)
As for your crawler, I think the significant thing here is that you rewrote something in C after you already had it working in another language. Not to bag on C, but writing the original in a higher level language first gives you a better shot at correcting any bugs in the actual solution domain. Then if you move to C you're only fighting against C, not against C and bugs in your solution at the same time.
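To the grep/sort point above, a tiny sketch of driving sort(1) from C via popen() instead of re-implementing it; the file name is made up:

    #include <stdio.h>

    int main(void) {
        FILE *p = popen("sort -u urls.txt", "r");   /* let sort(1) do the heavy lifting */
        if (!p) return 1;

        char line[4096];
        while (fgets(line, sizeof line, p))
            fputs(line, stdout);                    /* consume the de-duplicated, sorted output */

        return pclose(p);
    }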
(Replying to PARENT post)
I wonder what the result would be if you did everything you describe, but wrote the code that's now in C, in Python instead. I suspect the speed would be very similar. (I like C, for what it's worth.)
(Replying to PARENT post)
I think you'd better get a doctor to look at that.
(Replying to PARENT post)
If you've got some code that works well and that you can solidify into C++ (i.e., you're no longer tweaking it three times a day), it's totally worth spending the time to rewrite.
I think C++ on Rails would be a great idea, i.e., some code generators to help you port your most-used .rhtml views over to C++.
(Replying to PARENT post)
Next step: get rid of those threads and use non-blocking I/O ;)
(Replying to PARENT post)
It's called _Unix for Poets_, and it'll show you just how far you can really push these tools.
(Replying to PARENT post)
Does your crawler support gzip compressed transfers?
What speed is your Comcast link?
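On the gzip question, a sketch of how a fetcher might request compressed transfers, assuming libcurl (the post doesn't say what the crawler actually uses):

    #include <curl/curl.h>

    int fetch(const char *url) {
        CURL *curl = curl_easy_init();
        if (!curl) return -1;

        curl_easy_setopt(curl, CURLOPT_URL, url);
        /* Send "Accept-Encoding: gzip" and let libcurl decompress the body. */
        curl_easy_setopt(curl, CURLOPT_ACCEPT_ENCODING, "gzip");

        CURLcode rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        return rc == CURLE_OK ? 0 : -1;
    }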
(Replying to PARENT post)
Second, for more perf analysis, there are some very good Unix tools for profiling and optimizing C code, many of them free.
(Replying to PARENT post)
I'd be interested in seeing the results of a rewrite not to C, but to Python or Ruby, where the threading support is much, much better. Then you could rewrite functions in C one at a time as needed (a rough sketch of that idea is below), without the extra burden of rewriting the whole thing.
I totally agree with the rest of the approach. Going low-tech and using Unix tools is a very good way to reduce overhead, increase parallelism, and delay calculations. One of the nice things about this approach is that you can cobble together another $50 Unix box to do some of the bulk processing via NFS or other means.
Congrats... It sounds like a very interesting project.
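A rough sketch of the "rewrite one function at a time in C" idea, as a CPython extension module; the module and function names are made up:

    #include <Python.h>

    /* Hot-path function moved to C; the rest of the crawler stays in Python. */
    static PyObject *normalize_url(PyObject *self, PyObject *args) {
        const char *url;
        if (!PyArg_ParseTuple(args, "s", &url))
            return NULL;
        /* ... do the expensive work in C ... */
        return PyUnicode_FromString(url);
    }

    static PyMethodDef methods[] = {
        {"normalize_url", normalize_url, METH_VARARGS, "Normalize a URL."},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef moduledef = {
        PyModuleDef_HEAD_INIT, "fasturl", NULL, -1, methods
    };

    PyMODINIT_FUNC PyInit_fasturl(void) {
        return PyModule_Create(&moduledef);
    }

Build it as a shared library and the Python side just does "import fasturl".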
(Replying to PARENT post)
Re-implement your crawler on a FPGA.
Just kidding :) It's great that you got what you were expecting, but as Paul said, you have to profile before you optimize (otherwise you will "optimize" useless stuff, and not only waste your time but also likely introduce counter-optimizations).
Anyway, your post was very interesting, because a lot of people assume they have to use layers on top of the OS, while modern Unix systems have good file systems and memory managers. (Maybe have a look at DragonFly BSD; they're heading in an interesting direction.)
(Replying to PARENT post)
Had to be root to raise the user's maxproc-max (previous experiments locked me out: "sh can't fork" messages ... couldn't even ssh in).
It's done on a 100 Mbps AMD 2000+ / 256 MB / 40 GB IDE OpenBSD colo ... but I don't think the hardware matters that much (the Cilk is really the key).
(Replying to PARENT post)
That company is no longer in business.
(Replying to PARENT post)
Shell scripts are hard to make robust. I don't understand why you wouldn't just drive the unix utilities from perl.
Why use Perl threads on Unix? Everyone knows they suck. Why not fork and use a transactional database for IPC?
I remain unconvinced of the alleged wins people have with flat-file solutions. It's usually about replacing a toy like MySQL or some convoluted BerkeleyDB mess. You can use DB2 for free up to 2 gigs of memory. Why waste even a minute replicating transactional RDBMS functionality by hand? As soon as you're dealing with flock and company when you COULD be using a DB, you should use the DB, as far as I'm concerned.
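For reference, a tiny sketch of the kind of hand-rolled locking being argued against here: serializing appends to a shared flat file with flock(2):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>

    int append_record(const char *path, const char *line) {
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0) return -1;

        flock(fd, LOCK_EX);          /* exclusive lock: one writer at a time */
        dprintf(fd, "%s\n", line);   /* the whole "transaction" is a single append */
        flock(fd, LOCK_UN);

        return close(fd);
    }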
(Replying to PARENT post)
Rewriting something in C should be the last thing you do, not the first. The first thing you should do is find out why it's slow. In your case, it sounds as though you were fetching one URL at a time (blocking). Switching to async I/O could have fixed this (a sketch is below).
Btw, if you're using the GNU utilities, it's unlikely that they were written in the '60s and '70s (also, people were processing much smaller amounts of data back then).
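A sketch of the async-I/O suggestion, assuming libcurl's multi interface; per-handle cleanup and error checks are omitted for brevity:

    #include <curl/curl.h>

    void fetch_all(const char **urls, int n) {
        CURLM *multi = curl_multi_init();

        for (int i = 0; i < n; i++) {
            CURL *easy = curl_easy_init();
            curl_easy_setopt(easy, CURLOPT_URL, urls[i]);
            curl_multi_add_handle(multi, easy);   /* queue it; nothing blocks yet */
        }

        int running = 1;
        while (running) {
            curl_multi_perform(multi, &running);          /* drive every transfer a bit further */
            curl_multi_wait(multi, NULL, 0, 1000, NULL);  /* sleep until some socket is ready */
        }

        /* in real code, remove and clean up each easy handle here */
        curl_multi_cleanup(multi);
    }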