Sunday, December 5, 2004

Problems with JVM crashing


I suddenly seem to have all kinds of problems with the JVM crashing when I'm creating it in our monitor code. The way things work is that I have an executable that links java.so instead of using the shipped java executable. I call this the "driver." Here's what I've found:
The driver will often (but not most of the time) crash, only when -Xdebug is given, with the following stack trace:
gdb build/debug.linux.x86.rhel3/bin/scdriver_debug core.28224
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Core was generated by `/home/jared.oberhaus/jared.oberhaus-linux3-all/shared/1.2/build/debug.linux.x86'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjava.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjava.so
Reading symbols from /lib/tls/libpthread.so.0...done.
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libverify.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libverify.so
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/native_threads/libhpi.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/native_threads/libhpi.so
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_ldap.so.2...done.
Loaded symbols for /lib/libnss_ldap.so.2
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
Reading symbols from /usr/lib/sasl/libanonymous.so...done.
Loaded symbols for /usr/lib/sasl/libanonymous.so
Reading symbols from /usr/lib/sasl/libcrammd5.so...done.
Loaded symbols for /usr/lib/sasl/libcrammd5.so
Reading symbols from /usr/lib/sasl/libdigestmd5.so...done.
Loaded symbols for /usr/lib/sasl/libdigestmd5.so
Reading symbols from /usr/kerberos/lib/libdes425.so.3...done.
Loaded symbols for /usr/kerberos/lib/libdes425.so.3
Reading symbols from /usr/kerberos/lib/libkrb5.so.3...done.
Loaded symbols for /usr/kerberos/lib/libkrb5.so.3
Reading symbols from /usr/kerberos/lib/libcom_err.so.3...done.
Loaded symbols for /usr/kerberos/lib/libcom_err.so.3
Reading symbols from /usr/kerberos/lib/libk5crypto.so.3...done.
Loaded symbols for /usr/kerberos/lib/libk5crypto.so.3
Reading symbols from /usr/lib/sasl/libgssapiv2.so...done.
Loaded symbols for /usr/lib/sasl/libgssapiv2.so
Reading symbols from /usr/kerberos/lib/libgssapi_krb5.so.2...done.
Loaded symbols for /usr/kerberos/lib/libgssapi_krb5.so.2
Reading symbols from /usr/lib/sasl/liblogin.so...done.
Loaded symbols for /usr/lib/sasl/liblogin.so
Reading symbols from /lib/libcrypt.so.1...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libpam.so.0...done.
Loaded symbols for /lib/libpam.so.0
Reading symbols from /usr/lib/sasl/libplain.so...done.
Loaded symbols for /usr/lib/sasl/libplain.so
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libzip.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libzip.so
Reading symbols from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjdwp.so...done.
Loaded symbols for /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjdwp.so
#0  0x0066e6c1 in pthread_mutex_init () from /lib/tls/libpthread.so.0
(gdb) where
#0  0x0066e6c1 in pthread_mutex_init () from /lib/tls/libpthread.so.0
#1  0x01070e3c in ObjectMonitor::ObjectMonitor ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#2  0x01000517 in CreateRawMonitor ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#3  0x0039a872 in JVM_OnLoad ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/libjdwp.so
#4  0x00ff8a2e in JvmdiInternal::post_event ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#5  0x01002a0e in jvmdi::post_vm_initialized_event ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#6  0x010f109c in Threads::create_vm ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#7  0x00fb4388 in JNI_CreateJavaVM ()
   from /home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre/lib/i386/server/libjvm.so
#8  0x08048e6b in exec_java (java_library_path=0x0, 
    jre_home=0xbfffcdda "/home/jared.oberhaus/jared.oberhaus-linux3-all/tools/linux/j2sdk1.4.2_06/jre", 
    java_class=0xbfffce27 "com/scalent/shared/tools/test/MonitorTest3", 
    classpath=0x0) at driver.c:300
#9  0x0804895d in main (argc=5, argv=0xbfffb054) at driver.c:81
  • I thought it was something I did, because in the stack trace I can see that classpath and java_library_path, parameters to exec_java, are null, and sometimes contain other bad values. Examining this with the debugger I've determined that this is just the optimizer. The compiler is passing in the right values for these when they're needed, but otherwise they reflect the value of the register $esi, which can vary.
  • I tried to use Purify on this, but there is something seriously broken with Purify on the machine that I'm running on right now. It seems to work better with root, but when I try it as my own user, I get an MSE on almost every malloc and pthread operation, whether my code does it or not. Another red/green/blue herring.
  • I tried Valgrind on it to try to find something, but that didn't seem to discover anything either. Of course, Valgrind can't really execute the whole JVM, but that's not what I was looking for; I was just trying to get it to execute my non-JVM code and find some sort of memory corruption.
  • I also tried j2sdk1.4.2_06, which is newer than our j2sdk1.4.2_03. That didn't help at all. It still crashes in at least one run out of three.
  • Finally, I went into our code and turned off all the options. After 34 runs of com.scalent.shared.tools.test.TestMonitor it did not fail once. I believe the whole thing has something to do with the -Xdebug and related options, as I've never seen a crash in the non-debug version of the driver.
  • I think I really proved that it has something devious to do with -Xdebug and friends. I commented out just the -Xdebug and -Xrunjdwp:transport=dt_socket,address=9300,server=y,suspend=n options and ran the test com.scalent.shared.tools.test.TestMonitor 160 times and it didn't fail once.
  • I tried putting a 5 second delay between tests in com.scalent.shared.tools.test.TestMonitor, but that didn't help. It still failed on the third test.
  • I tried again with strict=y on the -Xrunjdwp:transport line, but that didn't help.
    I also tried using the dt_shmem transport for -Xrunjdwp, but that didn't help either.
  • I have resigned myself to the fact that this is a bug in the JVM, at least with the way that I'm calling it. Fortunately it only happens while we have -Xdebug turned on.
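Counting crashes over a few hundred runs by hand gets old quickly. Here's a minimal sketch of a small harness that runs a command repeatedly and counts the runs that end with a non-zero exit code. The class name CrashRateProbe and the command line in the usage comment are made up for this example; it is not what I actually used.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntSupplier;

/** Runs something repeatedly and counts how many runs end abnormally. */
final class CrashRateProbe {

    /** Calls runOnce the given number of times; any non-zero exit code counts as a failure. */
    static int countFailures(int runs, IntSupplier runOnce) {
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            if (runOnce.getAsInt() != 0) {
                failures++;
            }
        }
        return failures;
    }

    /** Starts the command, waits for it, and returns its exit code (-1 if it could not be run). */
    static int runCommand(List<String> command) {
        try {
            Process p = new ProcessBuilder(command).inheritIO().start();
            return p.waitFor();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        // Hypothetical usage: java CrashRateProbe 160 ./scdriver_debug <driver args...>
        if (args.length < 2) {
            System.err.println("usage: CrashRateProbe <runs> <command> [args...]");
            return;
        }
        int runs = Integer.parseInt(args[0]);
        List<String> command = Arrays.asList(args).subList(1, args.length);
        int failures = countFailures(runs, () -> runCommand(command));
        System.out.println(failures + " of " + runs + " runs failed");
    }
}
```

Running it once with the debug options and once without gives two failure counts that can be compared directly.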

Friday, December 3, 2004

Find your .jar file

This should help you find the jarfile a class comes from...
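One way to do this from inside Java, as a minimal sketch: ask the class for its ProtectionDomain and print the CodeSource location. The class name WhereIsClass is made up for this example. For classes loaded by the bootstrap loader (java.lang.String, for instance) the location typically comes back null, so this mostly helps for classes on the classpath.

```java
import java.net.URL;
import java.security.CodeSource;

/** Prints the jar (or directory) a class was loaded from. */
final class WhereIsClass {

    /** Returns the location the class was loaded from, or null if the JVM does not record one. */
    static URL locationOf(Class<?> c) {
        CodeSource source = c.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to this class itself.
        String name = args.length > 0 ? args[0] : WhereIsClass.class.getName();
        URL where = locationOf(Class.forName(name));
        System.out.println(name + " -> " + (where == null ? "(bootstrap or unknown)" : where.toString()));
    }
}
```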

Wednesday, November 3, 2004

RedHat Enterprise Linux and iowait

We use RedHat Enterprise Linux 3.1 at work for a developer box, and I was seeing problems where the machine would get into 100% iowait lockups; it wouldn't completely lock, but it would get REALLY slow. The answer is here, and for me it worked instantly:

Thursday, October 21, 2004

Getting Purify to work with Java


Let's say you wrote some code using Java JNI and you wanted to Purify that code so that you could find memory leaks and other bugs.
Short answer: you can't.
Here's the long description about what I went through to get to there.
These are the software versions I'm working with:
RedHat Enterprise Linux 3.1
Linux kernel 2.4.21-9EL
PurifyPlus.2003a.06.13.FixPack.0155
Java Runtime 1.4.2_03-b02
One of the most important steps is the .purify file that I had constructed, which suppresses hundreds of thousands of warnings and allowed me to run things in a reasonable amount of time--but apparently I forgot to save that in a safe place and it's been destroyed. But it's easy to recreate if you follow these steps.
Anyhow, where I'm stuck is that when an attempt is made by Java to bind to a socket and start listening, it just sits there. There's no activity that I can see via strace, no CPU taken up by the process. But the rtslave is still responsive. It never goes past that step.
I can see this in two different ways; if I turn on Java debugging for my process using the appropriate flags, as soon as the JVM starts up it attempts to bind to that socket. The result is that it just hangs there before executing any Java code. However, if I turn off the Java debugging flag, much Java code is executed up to the point where my Java code attempts to bind to a socket and listen. Then it just sits there again.
In a previous exercise trying to debug Java and listening on a socket, I found that when Java opens a socket it apparently uses rtnetlink to turn off the multicast flag for that socket. I don't know if that has anything to do with it, but it might be interesting...
However, to get this far, here are the steps:
  • You generally have to build a purified executable on the same machine that you're executing on. If anything is different it will crash instantly.
  • The Purify rtslave process just eats tons of memory when it stores errors. If you suppress those errors, it will use much less (or no) memory for those suppressions. The reporting of those errors also takes a huge amount of time, so the purify process ran for a very long time, getting nowhere.
  • The JVM has lots and lots of things that look like MSE's and UMR's. Once you suppress those, the JVM can get somewhere under Purify.
  • You have to set DISPLAY, otherwise Purify will dump everything to stdout, which usually isn't very helpful.
  • I modified our startup environment to pass the environment variables DISPLAY, PUREOPTIONS and PURIFYOPTIONS so that they can affect the operation of Purify.
  • I'm running the JVM with -Xint so that the HotSpot compiler is not invoked, which probably would introduce lots and lots of interesting challenges to get things to work. Update: I got stuck and tried my luck with the HotSpot compiler, and now I'm getting farther. So you should not use -Xint.
  • I found out that IBM has a newer version of Purify that seems to work much better than the previous version against the JVM. It's PurifyPlus.2003a.06.13.FixPack.0155.
    There is an undocumented parameter when building with purify, called -handle-calls-to-java. I added this to my PUREOPTIONS environment variable.
  • Because of -handle-calls-to-java, Purify goes into its cache and sets up symbolic links to "help" the JVM find stuff. For instance, I have -cache-dir set to /var/purify/cache. In /var/purify/cache/opt/scalent/jre/lib/ there are lots of symbolic links back to /opt/scalent/jre/lib/. That is where our JRE is stored in the file system.
  • The JVM still needs at least one more (that I know about so far) symbolic link to find stuff. First you have to run the JVM and have it fail with the message: "Error occurred during initialization of VM java.lang.UnsatisfiedLinkError: no zip on java.library.path". This is because when java looks for a library to open called "zip", on Linux it's going to look for libzip.so on its java.library.path. But since the name has been Purify-mangled, it can't find it. Therefore, do the following:
cd /var/purify/cache/opt/scalent/jre/lib/i386/
ln -s /opt/scalent/jre/lib/i386/libawt.so
ln -s /opt/scalent/jre/lib/i386/libcmm.so
ln -s /opt/scalent/jre/lib/i386/libdcpr.so
ln -s /opt/scalent/jre/lib/i386/libdt_socket.so
ln -s /opt/scalent/jre/lib/i386/libfontmanager.so
ln -s /opt/scalent/jre/lib/i386/libhprof.so
ln -s /opt/scalent/jre/lib/i386/libioser12.so
ln -s /opt/scalent/jre/lib/i386/libjaas_unix.so
ln -s /opt/scalent/jre/lib/i386/libjavaplugin_jni.so
ln -s /opt/scalent/jre/lib/i386/libjawt.so
ln -s /opt/scalent/jre/lib/i386/libjcov.so
ln -s /opt/scalent/jre/lib/i386/libJdbc0dc.so
ln -s /opt/scalent/jre/lib/i386/libjdwp.so
ln -s /opt/scalent/jre/lib/i386/libjpeg.so
ln -s /opt/scalent/jre/lib/i386/libsig.so
ln -s /opt/scalent/jre/lib/i386/libjsoundalso.so
ln -s /opt/scalent/jre/lib/i386/libjsound.so
ln -s /opt/scalent/jre/lib/i386/libmlib_image.so
ln -s /opt/scalent/jre/lib/i386/libnative_chmod.so
ln -s /opt/scalent/jre/lib/i386/libnet.so
ln -s /opt/scalent/jre/lib/i386/libnio.so
ln -s /opt/scalent/jre/lib/i386/librmi.so
ln -s /opt/scalent/jre/lib/i386/libverify.so
ln -s /opt/scalent/jre/lib/i386/libzip.so
  • I found another directory that needs to be linked. I got the error "ZoneInfo: /var/purify/cache/opt/scalent/jre/lib/zi/ZoneInfoMappings (No such file or directory)". I also found lots of other directories in a similar state:
cd /var/purify/cache/opt/scalent/jre/lib
ln -s /opt/scalent/jre/lib/zi
ln -s /opt/scalent/jre/lib/locale
ln -s /opt/scalent/jre/lib/images
ln -s /opt/scalent/jre/lib/im
ln -s /opt/scalent/jre/lib/fonts
ln -s /opt/scalent/jre/lib/ext
ln -s /opt/scalent/jre/lib/cmm
ln -s /opt/scalent/jre/lib/audio
  • When the Java code starts up, it forks off processes that are written in C. The result is that Purify follows the fork with another Purify rtslave that immediately does an exec. Purify takes this as a process exit, and so immediately starts looking for leaks in that process. We don't care about leaks at this point; we'll find the leaks in the original JVM process when we want by clicking on the leak button. So until I fix process forking, I'm adding the options -inuse-at-exit=no -leaks-at-exit=no to my PURIFYOPTIONS environment variable.
In case you're wondering, Valgrind won't work either.
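The long runs of ln -s above could be scripted rather than typed. Here's a minimal sketch that links every entry of a real JRE directory into the matching Purify cache directory when no entry of that name exists there yet. The class name CacheLinker is made up, and "link whatever is missing" is my assumption about what is needed; I did the links by hand.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Creates symbolic links in a cache directory for entries that exist only in the real directory. */
final class CacheLinker {

    /** Links each entry of realDir into cacheDir unless that name already exists; returns the number of links created. */
    static int linkMissing(Path realDir, Path cacheDir) {
        int created = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(realDir)) {
            Files.createDirectories(cacheDir);
            for (Path target : entries) {
                Path link = cacheDir.resolve(target.getFileName());
                // NOFOLLOW_LINKS: an existing link counts as present even if it dangles.
                if (!Files.exists(link, LinkOption.NOFOLLOW_LINKS)) {
                    Files.createSymbolicLink(link, target.toAbsolutePath());
                    created++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return created;
    }

    public static void main(String[] args) {
        // Example from this post:
        //   CacheLinker /opt/scalent/jre/lib/i386 /var/purify/cache/opt/scalent/jre/lib/i386
        if (args.length != 2) {
            System.err.println("usage: CacheLinker <real-dir> <cache-dir>");
            return;
        }
        int n = linkMissing(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("created " + n + " links");
    }
}
```

Running it a second time creates nothing, so it is safe to rerun after Purify rebuilds its cache.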

Wednesday, June 30, 2004

Java and MT


Java's memory model is very aggressive, and you have to be very careful when accessing memory from multiple threads. You of course have to synchronize access to memory locations, but you have to synchronize them even when it looks like you don't have to. There are several cases where you must use synchronized:
  • To provide a mutual exclusion barrier to prevent one thread from modifying a data structure while the other is reading it.
  • To provide a memory barrier to prevent memory operation reordering from doing something you didn't want to have happen.
  • To make the memory you're accessing volatile so that the runtime optimizer doesn't throw away your request to read a memory location.
Here's a good web page that discusses this.
A good rule to use is that when in doubt, synchronize.
Reordering can only hit you with a multiple-cpu machine, but the problems that I've been running into recently happen on my single CPU machine, with something like this:
(Note that everything after this is speculation based on behavior I've seen):
int m_y = 0;
final Object m_x = new Object();  // lock object
void Thread1() {
    synchronized (m_x) {
        m_y = 1;
    }
}
void Thread2() {
    while (true)
        System.out.println(m_y);  // unsynchronized read
}
Even after the code in Thread1 has executed in its thread, the code in Thread2 will print 0; I believe this is because the runtime optimizer doesn't bother to look at the value of m_y after the first access. This is similar to a compile-time optimizer, which you'd fix with volatile. But a compile-time optimizer couldn't do anything in this situation.
But in Java the runtime optimizer will make it so that the first access gets the value, but it won't bother reading the value from memory anymore after that.
This strange behavior goes away by putting synchronized(m_x) around the access to m_y. I believe this tells the runtime optimizer that something is likely to have been changed by another thread.
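Here's a minimal, runnable sketch of the fixed version, where both the write and the read go through the same lock. The class and method names (VisibleFlag, readerSeesWrite) are made up for this example. Declaring the field volatile is the other standard way to get the same visibility guarantee.

```java
/** A value that is written and read under the same lock, so writes are visible to readers. */
final class VisibleFlag {
    private final Object lock = new Object();
    private int y = 0; // guarded by lock

    void set(int value) {
        synchronized (lock) {
            y = value;
        }
    }

    int get() {
        synchronized (lock) { // same lock as the writer
            return y;
        }
    }

    /** Starts a reader that spins until it sees 1, then writes 1; returns true if the reader finished in time. */
    static boolean readerSeesWrite(long timeoutMillis) {
        VisibleFlag flag = new VisibleFlag();
        Thread reader = new Thread(() -> {
            while (flag.get() != 1) {
                Thread.yield();
            }
        });
        reader.setDaemon(true); // don't keep the JVM alive if the reader never finishes
        reader.start();
        flag.set(1);
        try {
            reader.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !reader.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("reader saw the write: " + readerSeesWrite(5000));
    }
}
```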

Tuesday, June 8, 2004

Java uses /dev/random; may block forever creating SSL connections

The software that we're developing creates SSL connections when it starts up, and it does so at S13 (has to be after network, but before other services start). The result is that on an NFS booted Linux machine, it sits there forever, and never completes the connection.

Clue #1: if you move the mouse or type on the machine's keyboard, eventually the connection will complete.

Of course the reason for it hanging is that Java is using /dev/random to generate the keys for the SSL connection. And /dev/random gets all of its entropy from the physical environment, and refuses to return random values until it gets some input from the outside world.

We don't see this on a machine that boots from disk; I assume that /dev/random gets entropy from the interaction with the drive, via interrupts and so forth. For some reason the network activity doesn't yield the same entropy data, or at least not enough.

I found this article that discusses the usefulness of /dev/random given its current design.

In order to work around this, we decided to use /dev/urandom. We could do this by a link in the file system, but a much superior solution is to set the following system property in Java:

-Djava.security.egd=file:/dev/urandom

Now all you have to worry about is attacks against your SSL connection from those who know that you are using the pseudo-random number generator...
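
If you can't change the command line, the same property can be set from code. This is only a sketch (the class name UrandomSeed is made up), and it rests on an assumption: the property has to be set before anything in the process seeds a SecureRandom, so the -D flag remains the more dependable route.

```java
import java.security.SecureRandom;

/** Points the JVM's seed source at /dev/urandom before creating a SecureRandom. */
final class UrandomSeed {

    /** Sets java.security.egd if it is not already set, then returns n random bytes. */
    static byte[] randomBytes(int n) {
        if (System.getProperty("java.security.egd") == null) {
            // Same value as the -D flag; only effective if no seeding has happened yet.
            System.setProperty("java.security.egd", "file:/dev/urandom");
        }
        SecureRandom rng = new SecureRandom();
        byte[] out = new byte[n];
        rng.nextBytes(out);
        return out;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        byte[] bytes = randomBytes(32);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("got " + bytes.length + " random bytes in " + millis + " ms");
    }
}
```

Timing the first call is a quick way to see whether the process is still blocking on entropy.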

Thursday, May 6, 2004

Linux and Java and Threads and setuid


Today I learned something very interesting. I learned that you can't setuid on a process in Linux; not when you have multiple threads. Please see #8 in this list.
What they refer to as interesting times probably includes the following:
  • When calling setuid, only the caller thread will actually get its uid changed. All other existing threads in the "process" retain their original uid.
  • I believe any sane person should recognize this as meaning Linux is broken when using threads and setuid.
  • This is a security hole, because root threads still exist in the process. If the non-root threads are hijacked by an attacker, they can stack stomp on the root threads and execute arbitrary code as root.
  • Because synchronization depends on the ability to deliver signals, and delivering signals depends on privileges, it's easy to see how synchronization between a thread running as root and another running as non-root can wedge the process.
  • Even if I did call setuid in the first bytecode instruction in a Java process, it's too late; Java has already forked threads to do things like garbage collection, and those threads present the security hole described above, and the synchronization problem described above.
  • I'm sure there's a long list of other reasons why this is bad, but I can't think of them now, and the above is sufficient.
In our project we have a Java process that uses forked processes written in C; the purpose of these forked processes is to run as root, or at least elevated privileges, while the Java process runs as some sort of nobody user. Unfortunately this doesn't work very well at all on Linux because we cannot downgrade the uid of the Java process after it starts.
This also means that if we want to listen on a port under 1024, we'll have to do that some other way; there's no way we could get the Java process to bind to that port as root and then downgrade to a nobody uid.
Also the processes I refer to have to be forked before the JVM starts. This means that we have to rendezvous with them in some manner that either means some sort of JNI code to hook up the file descriptors in the pipes, or use some other form of IPC.

Monday, April 26, 2004

Analysis of why creating socket in Java takes 3 minutes


Here's what we think is happening, thanks to Evan's suggestion to use gdb and Carol's assistance in recreating the loopback's IP address and route:
The first attempt by Java to open a socket is preceded by an initialization of its socket code.
The socket initialization code calls java.net.PlainDatagramSocketImpl.leave, as is indicated in this stack trace from gdb:
#0  0xb75ebc32 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0xb75e3bf8 in connect () from /lib/tls/libpthread.so.0
#2  0xaa4b7c24 in Java_java_net_PlainDatagramSocketImpl_leave ()
   from /opt/scalent/jre/lib/i386/libnet.so
#3  0xaa4b8029 in Java_java_net_PlainSocketImpl_initProto ()
   from /opt/scalent/jre/lib/i386/libnet.so
#4  0xb2fa6bf2 in ?? ()
#5  0xb2fa0ddb in ?? ()
#6  0xb2f9e104 in ?? ()
#7  0xb721bb44 in JavaCalls::call_helper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#8  0xb72cfa6d in os::os_exception_wrapper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#9  0xb721bd96 in JavaCalls::call ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#10 0xb7200f6f in instanceKlass::call_class_initializer_impl ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#11 0xb720569c in instanceKlass::call_class_initializer ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#12 0xb72001cb in instanceKlass::initialize_impl ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#13 0xb72059af in instanceKlass::initialize ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#14 0xb720d6d4 in InterpreterRuntime::_new ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#15 0xb2fad510 in ?? ()
#16 0xb2fa0ddb in ?? ()
#17 0xb2fa0ddb in ?? ()
#18 0xb2fa0ddb in ?? ()
#19 0xb2fa0ddb in ?? ()
#20 0xb2fa0ddb in ?? ()
#21 0xb2fa0d04 in ?? ()
#22 0xb2fa0ddb in ?? ()
#23 0xb2fa10e1 in ?? ()
#24 0xb2f9e104 in ?? ()
#25 0xb721bb44 in JavaCalls::call_helper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#26 0xb72cfa6d in os::os_exception_wrapper ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#27 0xb721bd96 in JavaCalls::call ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#28 0xb721b666 in JavaCalls::call_virtual ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#29 0xb721c1df in JavaCalls::call_virtual ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#30 0xb7274f25 in thread_entry ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#31 0xb7319caa in JavaThread::thread_main_inner ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#32 0xb7315674 in JavaThread::run ()
   from /opt/scalent/jre/lib/i386/client/libjvm.so
#33 0xb72d1083 in _start () from /opt/scalent/jre/lib/i386/client/libjvm.so
#34 0xb75dedac in start_thread () from /lib/tls/libpthread.so.0
  • The call in question seems to be to make the machine leave the multicast group. See here.
  • Leaving the multicast group must involve connecting to loopback on a random port (the port it chooses changes and is always above 32768), and then shoving some random bytes through. That's my theory. Update: This is almost certainly rtnetlink.
  • It just sits there trying to communicate with itself, and times out after ~3.5 minutes.
  • It appears that it has its problem because the loopback device is not configured with an IP address and is not in the route table.
  • After we issued the following commands, everything works just fine and the 3.5 minute delay turns into 18ms delay:
    • ip addr add 127.0.0.1/8 dev lo
    • /sbin/route add -net 127.0.0.0/8 dev lo
  • We found a most interesting thread on the Java forum that seems to mirror our problem. But the guy apparently never figured it out. Maybe I should post a solution there.
  • We went through the following other possible problems:
    • I always thought it was some kind of nfs file locking problem, but that's not the case at all.
    • We redirected the logging output to a local disk on the machine. That didn't help at all.
    • We saw the process was blocked reading /dev/random. /dev/random must use loopback to generate random numbers. To solve this we used /dev/urandom, which is not as random, but removed the block. But the connection delay persisted. This might explain why it took 50 minutes to send the first message on the SSL connection. Once we removed the block on /dev/random the 50 minute message send delay seemed to go away.
    • We wrote some code to try to connect without SSL, but I'm not convinced that ever worked. It's still in the code and can be activated with a configuration setting, and I tested that configuration setting in my client.
    • We then thought it was a delay caused by doing a reverse DNS lookup on the peer's IP address--likely so it could do certificate validation/throw nice exceptions. We saw in gdb that the stack trace was deep in some Java code that was trying to do some kind of DNS operation. /etc/resolv.conf was empty, so we added our name server to it and rebooted the machine. That didn't help, but the stack trace changed.
    • Then the stack trace was stuck in Java_java_net_PlainDatagramSocketImpl_leave; I thought that might have still been some DNS hosage, so I changed /etc/hosts to include the addresses of the peers. That didn't help and didn't change the stack trace.
    • Finally we typed /sbin/ifconfig. That showed us that lo did not have an IP address.
    • Carol told us the correct magic commands to type.

Creating a socket in Java takes 3 minutes

Sometimes we would see on some of our Linux boxes a 3 minute delay between an attempt to open a socket and a successful connection. This did not make sense... but I eventually determined that this was caused by the loopback device not having an address, or not having a route in the route table.

You can probably fix this by typing the following:
/sbin/ip addr add 127.0.0.1/8 dev lo
/sbin/route add -net 127.0.0.0/8 dev lo
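
To check a box for this from Java, here's a minimal sketch that times opening a listener on the loopback address and connecting to it. The class name LoopbackProbe is made up. I'm assuming that this is the first socket the JVM creates, so that whatever one-time socket initialization it does happens inside the timed region. On a healthy box the number should be a few milliseconds; on the broken one I'd expect it to show the multi-minute delay, or fail outright.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Times a connect-to-self over the loopback address. */
final class LoopbackProbe {

    /** Opens a listener on loopback, connects to it, accepts, and returns the elapsed milliseconds. */
    static long timeLoopbackConnectMillis() {
        long start = System.nanoTime();
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port; the backlog holds the connection until accept().
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("loopback connect took " + timeLoopbackConnectMillis() + " ms");
    }
}
```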

Also, we had these other symptoms:
  • /dev/random would block for about 3 minutes, probably because it depends on loopback to get its results.
  • Java trying to do a reverse DNS lookup would block for about 3 minutes, probably because it was trying to get results from 0.0.0.0, because /etc/resolv.conf was empty, and 0.0.0.0 was being interpreted as 127.0.0.1... Update: see this post about a related fix...

Here is a thread where I replied with this information; the thread also suggests other solutions, perhaps to the same or similar problems.

Now if I could just figure out how that Linux box got into that situation...

Monday, March 29, 2004

libstdc++ and compatibility


I've been studying the C/C++ build and I think I learned some things:
  • glibc is inextricably linked with the Linux operating system. You can't run with a new glibc.
  • LD_LIBRARY_PATH can affect libc, but cannot affect ld-linux.so.2 (ld-2.3.2.so). It seems you can get around this with chroot, but then you have other problems.
  • glibc 2.3.2 has the symbol GLIBC_PRIVATE which is in ld-2.3.2.so, but not in ld-2.2.x.
  • libstdc++ 3.2 (it comes with g++ 3.2) requires glibc 2.3. Redhat 7 ships with 2.2 or earlier. See previous point. You cannot take a libstdc++ from Redhat 9 and run it on Redhat 7 unless you upgrade glibc and just about everything else in the OS, at which point it's not really Redhat 7.
  • libstdc++ is more than STL. It's the C++ runtime and STL. Therefore STLport can never replace libstdc++.
  • I can chroot with Redhat 7 (actually Mandrake 8) and get my Redhat 9 compiled binary and libstdc++ 3.2 shared object. However, once I do that I can't do things like read /proc or modify /etc which is something we need to do.
  • Starting with g++ 3.2, libstdc++ is attempting to be forward/backwards compatible in its ABI where possible. At that point (3.2), compatibility with earlier versions was completely broken.
  • Redhat 9 ships with compat-libstdc++ which contains the C++ runtime libraries for gcc 2.96 as used in Redhat 7.3. This means C++ stuff compiled on Redhat 7 will work on Redhat 9, but only when this package is installed.
  • glibc works very well forward/backwards compatibility-wise, with the GLIBC_2.0, GLIBC_2.1, etc. symbols. If you build a binary that is C only, it's probably going to run anywhere, as long as it's glibc v2 or better, preferably glibc v2.1.
  • It is impossible to statically link libstdc++ into an executable when exceptions are thrown/caught. This is because symbols such as _Unwind_DeleteException exist in libgcc.so but do not exist in libgcc.a.

Sunday, March 14, 2004

perl and relocating its installation


While setting up our development system and source control, I'm taking the philosophy that all tools are to be checked into source control, not installed on individual machines; in that way a developer's tools are never out of date. Unfortunately some tools don't like this approach; they like to hard-code or "relocate" their position during installation.
One of those is perl.
This link explains a bit how ActiveState relocates perl on install.
What happens is that the @INC path must be embedded in the perl executable on
Unix platforms, or so they claim. When install.sh is run, it calls reloc_perl,
which uses an ActiveState perl module Relocate which then uses this trick to
replace things like
/tmp/.TheInstallScriptWasNotRunTheInstallScriptWasNotRunTheInstallScriptWasNotRun-perl/lib/5.8.0
with the appropriate path. Unfortunately, when I first tried this, the path just happens to be my home directory where I downloaded it.
By the way, there are only 0x80 (128) bytes of space to put the path in, so there is a limit to what location it can be relocated into.
So, the procedure I used to get an ActivePerl that works on anyone's machine no matter where their source directory is mapped to their file system:
  • Installed ActiveState Perl normally, into a place such as your home directory: in my case this was /home/jared.oberhaus/p4/tools/linux/ActivePerl-5.8.3.809
  • Found all instances of text and binary files under the installation directory that contain /home/jared.oberhaus and replaced them with the original files from the install tar. The original files still have encoded strings such as /tmp/.TheInstallScriptWasNotRunTheInstallScriptWasNotRunTheInstallScriptWasNotRun-perl/lib/5.8.0 inside them.
  • Submitted these files to source control as-is.
  • Modified ActiveState's install.sh by adding to it (not removing the original install procedures). First it links the magic /tmp path to the file location where the source control version is mapped. This is controlled by detecting where the install script exists and processing that. When reloc_perl executes it will copy everything into /home/user/p4/tools/linux/perl-5.8.3 and at the same time replace the magic /tmp string with the correct location.
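To make the relocation trick concrete, here's a sketch of how I understand it to work on a binary file: the long placeholder prefix is overwritten with the real path, the rest of the string (up to its NUL terminator) is shifted left, and the freed space is filled with NULs so the file keeps its size. This is my own illustration, not ActiveState's code; the class name PlaceholderPatch and the short placeholder in the usage line are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Replaces a fixed placeholder path inside binary data without changing the data's length. */
final class PlaceholderPatch {

    /** Returns a patched copy of data; the replacement may not be longer than the placeholder. */
    static byte[] patch(byte[] data, String placeholder, String replacement) {
        byte[] from = placeholder.getBytes(StandardCharsets.ISO_8859_1);
        byte[] to = replacement.getBytes(StandardCharsets.ISO_8859_1);
        if (to.length > from.length) {
            throw new IllegalArgumentException("replacement longer than placeholder");
        }
        byte[] out = data.clone();
        for (int i = 0; i + from.length <= out.length; i++) {
            if (!matchesAt(out, i, from)) {
                continue;
            }
            // Find the end of the C string that starts with the placeholder.
            int end = i + from.length;
            while (end < out.length && out[end] != 0) {
                end++;
            }
            int tailLen = end - (i + from.length);
            System.arraycopy(to, 0, out, i, to.length);                                // new prefix
            System.arraycopy(out, i + from.length, out, i + to.length, tailLen);       // shift the tail left
            Arrays.fill(out, i + to.length + tailLen, end, (byte) 0);                  // pad with NULs
            i = end - 1; // continue scanning after this string
        }
        return out;
    }

    private static boolean matchesAt(byte[] data, int offset, byte[] needle) {
        for (int j = 0; j < needle.length; j++) {
            if (data[offset + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] in = "X/tmp/PLACEHOLDER/lib\0Y".getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = patch(in, "/tmp/PLACEHOLDER", "/opt/p");
        System.out.println(new String(out, 1, 10, StandardCharsets.ISO_8859_1));
    }
}
```

The length check is the same limit described above: a target path longer than the placeholder simply does not fit.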

Wednesday, March 3, 2004

Preventing System.exit()


You can prevent System.exit() by installing a SecurityManager whose checkExit() throws. Try something like this in your JUnit test:
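// At the top of the test file:
import java.security.Permission;

// Inside the JUnit test class:
/** Exit status passed to the most recent System.exit() attempt. */
private static int m_exitCode;

public void setUp() {
    System.setSecurityManager(new CatchSystemExit());
}
public void tearDown() {
    System.setSecurityManager(null);
}
private static class CatchSystemExit extends SecurityManager {
    /** @see SecurityManager */
    public void checkExit(int status) {
        m_exitCode = status;
        throw new SecurityException("System.exit() attempt caught");
    }
    /** Allow everything else, including replacing this manager in tearDown(). */
    public void checkPermission(Permission perm, Object context) {
    }
    /** Allow everything else, including replacing this manager in tearDown(). */
    public void checkPermission(Permission perm) {
    }
}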