Infinite Loops Reading From /dev/null Considered Harmful

In case you haven’t worked it out by now, this blog is about me. Particularly, my need for the cathartic release of my immensely humourous programming anecdotes, minor tales of woe in which I nonetheless triumph, and cautionary tales for the unwary who would follow in my pioneering footsteps. This is… either the woe or the footsteps one. I’m not quite sure. Let me know in the comments, just after you write the bit about how this post totally changed your life.

So, this is a really boring story. About something that happened to me. And, there will be some C code. Just a little, but I feel fair warning is in order. If you are a regular reader of this blog, there’s a 0.43% chance that the information here will actually prove useful to you. I have a Ph.D., and it’s in economics, so I get to just make up numbers like that. It’s one of the perks. Of course, if you’ve reached this page via a search engine then that number might be much higher, depending on your search terms. I could embed an R script in this page that would calculate the likely utility of this article to you based upon your search terms, but that would just be silly since you have gotten yourself here so between you and Google you’ve implicitly calculated that there might be something here of value, or that your time isn’t worth very much. Who am I to argue with that? Also, as we shall see, there are perils for those who would embed R scripts in web pages.

So, it all started on a very busy morning a few days ago. I was rushing out the door to attend a conference when it occurred to me that maybe, instead of just restarting my web server, I should actually try to see what was causing it to hang. For a few days it had been hanging about once a day and causing my users to wail and gnash their teeth and walk over to my desk from 20 feet away to ask me, very sweetly, if I was perhaps doing something incredibly important so that the system was down and, well, if whatever that was could possibly be suspended for just a few minutes so that my system could actually be.. umm.. used or something like that. I would always graciously accede that I would be more than happy to pause whatever experiment I was doing and put the system back online. Integrity is of utmost importance to a programmer. Even if I wasn’t technically doing maintenance on the system, then perhaps I was doing a social experiment to see how long it took people to complain about the system being down. See? Ph.D.s are awesome. You get away with all sorts of stuff.

I had on prior occasions gone as far as typing

ps aux | grep mongrel

to see what information I could glean from mongrel_proctitle, which typically gave a curious result along the lines that the system had spent the last 5 hours trying to load welcome/index. Which, being very keen to maintain good user experience, I usually try to have load slightly more quickly than that. But, knowing that there are some awfully nasty memory-eating SQL queries in my system which need to be redesigned, and being in the middle of lots of other things, I chalked the issue up to my poor overworked dev machine cum server being out of memory, killed the mongrels and restarted them. And, indeed, that usually seemed to do the trick. Although the mongrels were a bit tenacious and sometimes required kill -9ing. Sometimes more than once.

So, this particular morning, I decided to look a little closer in the hopes that my system might stay up for the duration of the conference. I opened production.log and saw nothing amiss. Except that welcome/index had, indeed, never finished loading. So, I opened mongrel.log with OSX Console. Then I looked at the file size of mongrel.log. mongrel.log, for those of you who aren’t rails developers, typically has about … 20 lines of text in it. It logs the initialization of the mongrel servers and that’s pretty much it unless something goes horribly wrong. My mongrel.log was 145 MB, and it was made up of pretty much just one thing. The word Selection, followed by a colon and a space, was repeated until it filled up 145 MB of a log file. As I sat and watched, more “Selection: “s appeared in the file before my very eyes.

I managed to open the file in TextWrangler, which is an OSX text editor that I use for just 1 thing: opening text files in which you might go several MB without a line break. TextWrangler doesn’t crash when you have one of those 10MB SQL dump files in which the entire insert statement appears on a single line. So, with TextWrangler, I was able to see what was in the log file prior to “Selection: “, and it was:


Possible actions:
1: abort (with core dump)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Or rather:

4: exit R saving workspace
Selection: Selection: Selection: Selection: Selection: Selection: Selection: 

This is, for those of you who don’t recognize it, the list of options you get when R encounters a segfault. So, something was segfaulting. That’s fine, almost certainly me doing something very silly, and trackable and fixable pretty easily. But, R was interrogating mongrel.log to ask what it should do. And, mongrel is a pretty amazing piece of software but, sadly, the log files aren’t endowed with artificial intelligence and so mongrel.log was just sitting there, dumbly not answering what is a pretty simple question (you want to choose 1). Indeed, according to the R documentation: “Since the R process may be damaged, the only really safe option is the first.” At this point you might be asking yourself why this menu even exists. If it really bothers you, you can ask the R mailing list, after you’ve read the posting guide.

So, what’s happening in R? It didn’t take that long to find this C code (you were warned) in the R source:


    if(R_Interactive) {
	REprintf("\nPossible actions:\n1: %s\n2: %s\n3: %s\n4: %s\n",
		 "abort (with core dump, if enabled)",
		 "normal R exit",
		 "exit R without saving workspace",
		 "exit R saving workspace");
	while(1) {
	    if(R_ReadConsole("Selection: ", ConsoleBuf, CONSOLE_BUFFER_SIZE,
			     0) > 0) {
		if(ConsoleBuf[0] == ‘1′) break;
		if(ConsoleBuf[0] == ‘2′) R_CleanUp(SA_DEFAULT, 0, 1);
		if(ConsoleBuf[0] == ‘3′) R_CleanUp(SA_NOSAVE, 70, 0);
		if(ConsoleBuf[0] == ‘4′) R_CleanUp(SA_SAVE, 71, 0);
	    }

Yay! An infinite while loop which will only be broken when it gets 1, 2, 3 or 4 from the console buffer. Fortunately, there did seem to be a way to disable this menu. Something called R_Interactive. I eventually found this very helpful snippet on the mailing list which makes me suspect that I’m not the first person to have had this same issue.

> It seems that Rf_initEmbeddedR() sets R_Interactive to TRUE and depends
> on the output of isatty() to change it to FALSE in Rf_initialize_R() if
> there is no tty. Unfortunately, the most common methods for starting
> Postgres leave the tty attached (stdout and stderr are directed to the
> log file). I ended up explicitly writing “R_Interactive = FALSE” just
> after running Rf_initEmbeddedR() — is this a safe and reasonable thing
> to do?

I am using a really sweet rubygem called rsruby to embed R code in ruby and hence rails applications. The author, Alex Gutteridge, is extremely helpful and responsive and he quickly added R_Interactive = FALSE to rsruby’s initialization of R. This instantly solved the problem, since in the event of a segfault R just aborts. You crash. Like you’re supposed to. You get error messages and downtime notifications and helpful stuff like that. Not 145 MB log files and processes that appear to be running.

So, I did a little more digging in the interest of learning. When you just run ruby script/server and get a segfault with R_Interactive = TRUE, the segfault menu appears, just once, in the terminal. You can choose an option from the menu. Life is, not good, but not disastrous. The 145 MB log files come when you daemonize the server which you will almost always do in a real deployment. mongrel_cluster automatically chooses the -d mongrel switch. Daemonize sets STDIN to /dev/null and STDOUT to mongrel.log or server.log or whatever mongrel’s log file is called. R then infinitely tries to suck down /dev/null in the hopes that eventually it will spit out 1,2,3 or 4, but when it gets EOF instead it tries again, prompting with “Selection: “. (About 10 seconds of FWSE research leads me to believe that if STDIN is set to /dev/null it just spits out EOFs.)

Okay, so what’s important about all of this for me is that everything was working pretty much as it was designed to do and as it was supposed to. There is a slight design issue with R’s behaviour as an embedded library, and the fact that R_Interactive isn’t made a lot more visible in the documentation. I would have assumed, with a name like Rf_initEmbeddedR(), that I was initializing something that was safe to embed. It’s poor developer interface design in my opinion, but it’s technically behaving as per the documentation. And, the fix was basically 1 line of code added to rsruby’s initialization routine. (I think there is a good case to be made for breaking the loop above either on “1″ or “EOF”, however, which would also resolve this particular issue.)

I obviously need to rethink my monit config to check for more than just a live PID. I know this, it just hasn’t been an urgent issue since my server is very local and if I have a performance/memory problem I am usually the first to know about it. Actually, the process wasn’t taking up much memory in this case, it just took 5 hours to (not) serve up welcome/index. So, that’s a change I need to make in my deployment recipe.

So, this wasn’t a bug in anything other than my code. And, other than the minor bug which caused the actual segfault, it’s more of an unexpected interaction between a bunch of components that individually were working as per their documentation. So, for the record, it’s potentially dangerous to embed something which behaves interactively in a daemonized system. That’s probably old news to you. But, it’s not always obvious when a library is going to attempt to behave interactively.

Before this gets taken out of context, I want to say that I certainly don’t plan to deploy code which I know will cause a segfault. I haven’t talked about how I will fix the segfault because that’s not really the point of this post. I don’t recall R ever segfaulting in normal operation, i.e. when I wasn’t playing around with a new library or doing something naughty and experimental. But, I want to have R functionality in some internal web applications which are written in Rails. And, I can’t guarantee that there will never be a segfault in any library I might use. What I want is for any future segfaults to blow up noisily and quickly so that I know all about them and can address them. As to the safety of embedding R in a Rails application, I don’t know. There are, of course, risks with any 3rd party library. I know that R is very well established and well respected, but it was not written with this sort of deployment in mind. So, for now, any R integration I do will be strictly for personal or internal use.


Comments

Andrew 07 Mar 2008

Awesome post - that update to RSRuby has helped me no end.
With regard to the segfaults, I think it’s to do with threading in the web server messing up the R library - I’m trying to do a similar thing (have RSRuby running in a GServer which produces answers to each query that a client connects with) - and it’s segfaulting regularly, each time at a different point in the code.
When I run the same code in a loop (single thread) pretending to be the clients, the calculations are all performed correctly - every time.
If you can find out a way to get around this, please do let me know!
Andrew

ana 07 Mar 2008

Hi, Andrew,

Thanks for that information, it’s good to know. I’m suspecting it’s something like that since it doesn’t seem to be a piece of culprit code, everything behaves perfectly well in testing. I am running mongrel from GDB in the hopes of capturing a segfault, but of course it hasn’t thrown one since I doing started that. :-) I’ve tried gctorture in R, and calling GC.start periodically in ruby, but that’s not triggering anything.

The next thing I’m going to do is try running RSRuby in a DRb server, at least that way the web server should be more isolated from segfaults.

I’ll post updates if I have any luck. :-)

-Ana

Ana Nelson · R on Rails with RSRuby 24 Apr 2008

[...] ‹ Infinite Loops Reading From /dev/null Considered Harmful [...]