[Wed Nov 23 17:46:07 CST 2005]

SearchOpenSource.com publishes an interview with Mark Wilding and Dan Behman discussing Linux system crashes. They are the authors of Self-Service Linux: Mastering the Art of Problem Determination, which has just been released. Behman clarifies the difference between a crash and a hang:

The properties of a crash and a hang at either level [the application or the kernel level] are basically the same. A hang occurs when a process or thread gets stuck waiting for something (usually a lock of some kind or some hardware resource) to become free. Waiting for a lock or a resource is not uncommon, but it is when that lock or resource never becomes available that a hang ensues.

(...)

A crash is very different from a hang, and occurs when an unexpected hardware or software error occurs. When these errors occur, special error handling is hopefully invoked to dump out diagnostic information and reports that can be used to track down the cause of the error.

Crashes can be thought of as point-in-time problems that require post-mortem analysis, and hangs can be thought of as real-time problems that one can analyze live.

It is nothing earth-shattering, to be sure, but it is a good entry-level definition of what a hang and a crash are all about. I should point out that although interviewer and interviewees praise the virtues of the open source movement and the Linux kernel for making it easier to debug these crashes, especially when compared to closed source kernels and applications, they seem to miss the fact that other Unices tend to have far more developed tools for kernel debugging and crash analysis. True, in the case of AIX, HP-UX or IRIX one does not have access to a vital piece of information, the source code itself, but they have well developed and mature tools that make it quite easy to analyze a core dump, at least for their support personnel. In the case of Linux, on the other hand, such tools are still extremely immature, no matter how much access one may have to the source code itself. In other words, things are not as simple as they seem to imply, although I can see their point and agree with it in general terms.

Also interesting is Wilding's summary of the main causes of hangs and crashes:

For crashes, we can split the common causes into either panics or traps/exceptions. A panic (or abort of some kind) is a crash where the kernel or application decides to crash because of a severe situation. The software itself realized that there was a problem and literally panics and "commits suicide" in a way to prevent further errors, which could get more serious. A trap/exception means that memory was accessed in an invalid way and is almost always a programming error. In this case, the hardware actually detects the invalid memory access and raises an exception, which results in the application getting sent a signal to terminate the process.

There are generally two causes of hangs. One is a process or thread waiting on a resource, which may or may not become available. Other processes or threads can block on resources (e.g., locks) that this process/thread is holding while it is hung. An example would be a process that is holding a critical lock while waiting indefinitely to receive information from the network. The second general cause is a dependency loop, where two or more processes or threads are waiting for each other to perform some action. Examples of such an action could be releasing a lock, writing something to an area of shared memory, etc.
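
Both kinds of problem are easy to reproduce on purpose. The first C sketch below is my own example, not something from the book, of the trap/exception kind of crash: dereferencing a NULL pointer makes the hardware raise a fault, the kernel delivers SIGSEGV to the process, and a handler gets one last chance to print the faulting address before the default action (termination, possibly with a core dump) takes over. Note that fprintf() is not async-signal-safe; this is a demo, not production error handling.

    /* Sketch of a trap/exception crash: an invalid memory access is caught
     * by the hardware, the kernel delivers SIGSEGV, and the handler prints
     * the faulting address before letting the default action kill us. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static void segv_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)ctx;
        /* fprintf() is not async-signal-safe; fine for a demo only. */
        fprintf(stderr, "caught signal %d: invalid access at %p\n",
                sig, info->si_addr);
        signal(SIGSEGV, SIG_DFL);   /* restore the default action... */
        raise(SIGSEGV);             /* ...and die, possibly dumping core */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        volatile int *p = NULL;
        *p = 42;                    /* the programming error */
        return 0;
    }

The second sketch, also mine, shows the dependency-loop kind of hang described above: two threads acquire the same pair of mutexes in opposite order, each ends up holding one lock while waiting forever for the other, and the process hangs without consuming any CPU. Compile with -pthread.

    /* Sketch of a dependency-loop hang: each thread holds one lock and
     * waits forever for the lock the other thread holds. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static void *worker1(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock_a);
        sleep(1);                    /* widen the window so the hang is reliable */
        pthread_mutex_lock(&lock_b); /* blocks forever: worker2 holds lock_b */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
    }

    static void *worker2(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock_b);
        sleep(1);
        pthread_mutex_lock(&lock_a); /* blocks forever: worker1 holds lock_a */
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker1, NULL);
        pthread_create(&t2, NULL, worker2, NULL);
        pthread_join(t1, NULL);      /* never returns: the process is hung */
        pthread_join(t2, NULL);
        puts("never reached");
        return 0;
    }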

Finally, what to do in case of a hang? Behman gives his advice:

Two key things to gather in a hang situation are strace output and stack tracebacks. The strace output will give an indication of what the process is doing —for instance, is it still moving?— while strace is watching the process. The stack tracebacks will give an indication of where in the source code the process currently is. This is very useful for developers so they can determine why the process might be in an apparent hang situation.
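
As a small illustration of the stack-traceback part of that advice (my own sketch, not from the book): on Linux one would normally attach gdb or run a tool such as pstack against the hung process, but glibc's backtrace() facility shows the idea of dumping where in the code a process currently is. Here the dump is wired to SIGQUIT, so a seemingly hung process can be poked from the outside with kill -QUIT. Compile with -rdynamic to get symbol names; strictly speaking backtrace() is not guaranteed to be async-signal-safe, so treat this as a diagnostic sketch only.

    /* Sketch: dump the current stack traceback when SIGQUIT arrives, so a
     * seemingly hung process can be inspected from the outside.
     * glibc-specific; backtrace() is not formally async-signal-safe. */
    #include <execinfo.h>
    #include <signal.h>
    #include <unistd.h>

    static void dump_stack(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        /* backtrace_symbols_fd() writes straight to the fd, no malloc(). */
        backtrace_symbols_fd(frames, n, STDERR_FILENO);
        (void)sig;
    }

    int main(void)
    {
        signal(SIGQUIT, dump_stack);
        for (;;)
            pause();                 /* stand-in for whatever the program does */
        return 0;
    }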

Neither Behman nor Wilding explicitly refers to the difference between a deadlock and a livelock, although it is quite important in the context of computer science and operating systems in general. A deadlock occurs when two processes or threads are each waiting for the other to release a resource, so that neither can ever proceed, while a livelock occurs when two processes keep changing their state in response to each other without ever making progress: both keep running, yet no useful work gets done. Needless to say, these situations arise quite often in multiprocessing and multithreaded environments. A minimal sketch of the livelock case follows.
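
The following C fragment is my own illustration, not something from the interview: two overly polite threads each grab their own lock, try for the other one, and back off when they fail. With sufficiently symmetric timing they can keep backing off indefinitely, which is exactly what distinguishes a livelock from a deadlock. In practice scheduling jitter usually breaks the cycle eventually; the sketch is only meant to show the pattern. Compile with -pthread.

    /* Sketch of a livelock: both threads keep running and keep reacting to
     * each other, yet with symmetric timing neither ever gets both locks. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static void *polite_worker(void *arg)
    {
        pthread_mutex_t *mine   = (arg == NULL) ? &lock_a : &lock_b;
        pthread_mutex_t *theirs = (arg == NULL) ? &lock_b : &lock_a;

        for (;;) {
            pthread_mutex_lock(mine);
            if (pthread_mutex_trylock(theirs) == 0) {
                /* Got both locks: real work would happen here. */
                pthread_mutex_unlock(theirs);
                pthread_mutex_unlock(mine);
                return NULL;
            }
            /* The other thread has its lock: back off and retry. If both
             * threads do this in lockstep, they spin forever. */
            pthread_mutex_unlock(mine);
            usleep(1000);
        }
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, polite_worker, NULL);
        pthread_create(&t2, NULL, polite_worker, (void *)1);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        puts("both threads eventually got lucky");
        return 0;
    }

{link to this story}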

[Mon Nov 21 20:30:28 CST 2005]

eWeek published an interesting interview with Microsoft's Jim Allchin to commemorate the twentieth anniversary of Windows. It proves to be quite fun, mainly because he finally acknowledges, one way or another, all those things that Microsoft's critics have been repeating for years while being accused of merely being Microsoft bashers. So, when talking about his first days with the company and good old Windows NT, Allchin acknowledges:

The first thing was that, at Microsoft, networking was viewed as separate from the operating system, and that was so strange to me. It didn't make any sense at all, and I recommended that we disband the network business unit over time and move those resources into either the Windows client or a new group that was being created, the NT group.

(...)

...to be polite, let me say that it was not a system that was designed and architected the way I would necessarily have done it. You had something that had great applications, but the architecture was not one designed for robustness and security and extensibility and high performance.

In other words, it was barely a decent operating system, right? I mean, if it was not robust, extensible, secure, or capable of high performance... what was it then?

It is also quite interesting to read his comments about Microsoft's internal mess at times:

So there were overlaps between those teams, and there was a little rough period there. And then, after NT 3.1 was out, it was designed as both a client and server, and we looked at how we could get more focus on both the client and server independently... We started to focus the teams on two separate areas, trying to make the server into a true server operating system. We knew the client would take longer.

What I find particularly interesting about some of these insights is that they open a window into Microsoft's internal quagmire. It has been assumed all too often that Microsoft is a lean, mean machine, simply because it made so much money and grew so large, when in reality most of the accounts we read should lead us to believe that it offers more of an example of what not to do when it comes to managing a company. The overall strategy, of course, has proven clearly sound, but the day-to-day operations sometimes come across as chaotic and improvised in Allchin's account. Still, both the interview and the other articles published by eWeek to celebrate the anniversary are an interesting read, and they also help us realize what made such a big success of Bill Gates' little startup: it offered a clear, single standard to both hardware manufacturers and software vendors, thereby spawning what today most people refer to as the software ecosystem. That is Microsoft's main secret, more than its execution or anything else. Oh, and by the way, whoever remembers what computing was like back in the early to mid-1980s should admit that it was also a huge leap forward. We owe it to Gates and his people, no matter what we think of their dirty tactics or the quality of their products. {link to this story}

[Mon Nov 14 12:24:23 CST 2005]

Nexenta was announced to the world a couple of weeks ago. It is a Debian-based GNU/Solaris system that combines the OpenSolaris kernel with the GNU tools. Here is a snippet from the announcement:

This is to announce Nexenta: the first-ever distribution that combines GNU and OpenSolaris. As you might know, Sun Microsystems just opened Solaris kernel under CDDL license, which allows one to build custom Operating Systems. Which we did... created a new Debian based GNU/Solaris distribution with (the latest bits of) Solaris kernel & core userland inside.

{link to this story}

[Sat Nov 12 14:57:44 CST 2005]

There has been some talk in the last couple of weeks about the request by certain countries that the US relinquish its monopoly over the control of the Internet, an issue that will most likely be raised during the World Summit on the Information Society, sponsored by the United Nations, which will take place in Tunisia starting on November 16th. In spite of all the brouhaha, the much talked about "US control over the Internet" is pretty much limited to the Domain Name System (DNS) or, more properly speaking, the root servers behind the system. In this sense, it is not as if the US could all of a sudden cut the rest of the world off from the underlying network connections that are at the very core of the Internet (although one would also have to see who owns the major pipes that carry Internet traffic from place to place, but that is a completely different story), although it could conceivably choose to render a whole national domain, such as .de or .it, unusable. Certainly, there is no precedent for this type of behavior, but we should also admit that the Internet is still not as central to every country as it may become in the near future, and the US simply did not have much to gain from isolating Iraq or North Korea from the Internet during the recent conflicts. Something tells me this may change in just a few years, and the unilateralist policies of the current Bush Administration worry quite a few foreign governments, authoritarian and democratic alike, who are starting to consider the possibility that the White House might some day choose to cut the French off the Internet if, say, they do not back its interventionist policies somewhere in the world. I suppose what I mean is that, far from what officials of the US Administration have implied in their statements on the topic, this is not just a debate between "evil" authoritarian regimes that want to control their populations by enforcing some form of tight Internet censorship and the "good" democratic countries bent on promoting freedom throughout the world. As tends to be the case, things are far more complicated than that, and while I agree that it is not necessary to move the DNS root servers anywhere, nor to hand control to any international body, it seems reasonable to me that the US should allow more participation from other countries and take a more multilateral approach to the topic. But what am I thinking? After all, this is the same Administration that went to war in Iraq against the opposition of the rest of the world, using the argument that Saddam's regime was making weapons of mass destruction that were never found, and in spite of that it never felt the need to acknowledge any mistake. Spamming and phishing are not problems that can be solved with unilateralist policies, and one wonders whether an inward-looking US Administration will ever get any sort of international traction to coordinate policies in that respect. {link to this story}

[Sat Nov 12 14:21:30 CST 2005]

Unlike Cray, Control Data or Digital, Unisys has managed to survive in a semi-decent financial situation after the demise of the large mainframe dinosaurs. However, it still has to struggle with bad results every now and then. Its latest announcement is an agreement with NEC to offload its hardware manufacturing facilities, a move that should slash about US $50 million from Unisys' costs by the year 2008. As part of the deal, and here is the interesting bit, Unisys will quickly move away from its own proprietary line of processors and adopt the Intel architecture instead, something it has already been doing for a few years, but that will now be deepened. Sure, most people do not even know who Unisys is, but let us not forget that they still sell around 33% of the servers in the higher-end market (i.e., those servers that sell for US $50,000 or more), compared to HP's 25% and IBM's 18%. In other words, they are still a player, whether or not the masses have ever heard of them. To make matters more interesting, this means that Unisys, NEC, SGI and Fujitsu will be on the Intel side of the fence, while other vendors in the desktop and mid-range workstation and server market (i.e., Dell, Lenovo and, to some extent, HP and IBM) will be betting on the newest kid on the block, AMD64. Some of these (namely, HP and IBM) are large enough to hedge their bets and go with both sides at the same time, but it seems as if we are starting to see an interesting development here: Itanium2 is gaining a profile as the platform of choice for those vendors who sell in the higher-performance market and whose customers need performance at nearly any price, while AMD64 is the favorite of those who need the most bang for the buck. How this will play out in the near future (and, even more importantly, what effects it may have on the prospects of a company such as Intel, which has traditionally played in the mass market) is anybody's guess, but these are exciting times we are witnessing. We could be at the beginning of what the business types usually refer to as "a paradigm shift". {link to this story}

[Sat Nov 12 14:14:19 CST 2005]

And you thought the web was full of loonies plugging their webcams into their PCs so the rest of the world could observe their every move online, prompting quite a few people to... well, do precisely that: check those pages to see what some complete unknown is doing at any given time of the day. Now, thanks to the fellows at GhostStudy.com, we can also peep into the mysterious world of the afterlife. These folks have compiled a list of 24-hour ghostcams that allow users all over the world to monitor the supernatural activity at a wide array of haunted places, including the Queen Mary ocean liner and the Paris catacombs. Now, who said the Web could not be put to good use? {link to this story}

[Sat Nov 12 14:05:10 CST 2005]

Here is an interesting concept I just read about in the pages of Information Week. There are a few companies out there offering continuous data protection backups. The idea is to record every change to every single piece of data (MS Word documents, database entries, configuration files) on a continuous basis, as the changes happen. After all, with traditional backup methods there is always the chance that one may lose some data in between backups. Needless to say, only a handful of companies out there truly need this level of data protection, but that appears to be enough to provide a nice niche market to both large system providers such as IBM and EMC, as well as to smaller businesses such as Asempra Technologies, Revivio and TimeSpring Software. The prices, of course, are pretty high too.
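
To get a feel for the "react to every change as it happens" part of the idea, here is a naive sketch of my own (it has nothing to do with the products mentioned above) that uses Linux's inotify interface, available since kernel 2.6.13, to notice every modification in a directory the moment it occurs. A real continuous data protection product would, of course, journal or copy the changed data rather than merely print a message.

    /* Naive sketch of the core idea: notice every change in a directory the
     * moment it happens (Linux inotify, kernel 2.6.13 or later). A real CDP
     * product would journal or copy the changed data, not just print it. */
    #include <sys/inotify.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : ".";
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));

        int fd = inotify_init();
        if (fd < 0 ||
            inotify_add_watch(fd, path, IN_MODIFY | IN_CREATE | IN_DELETE) < 0) {
            perror("inotify");
            return EXIT_FAILURE;
        }

        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            for (ssize_t i = 0; i < len; ) {
                struct inotify_event *ev = (struct inotify_event *)&buf[i];
                printf("change detected: %s (mask 0x%x)\n",
                       ev->len ? ev->name : path, (unsigned)ev->mask);
                i += (ssize_t)(sizeof(*ev) + ev->len);
            }
        }
    }

{link to this story}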

[Tue Nov 1 20:06:06 CST 2005]

Open Magazine published an article titled AMD64 NUMAology about Non-Uniform Memory Access (i.e., NUMA) on the AMD64 architecture running the 2.6 Linux kernel, an interesting read for anyone who cares about Linux in the high-performance market. Anybody involved in HPC knows that the Linux kernel experienced a tremendous improvement in scalability when it moved from 2.4 to 2.6, and this article, with its benchmarks, confirms it. The good thing about it, though, is that all those running UP systems (i.e., most of us) will still benefit from important changes, such as the ones introduced in the scheduler:

The Linux scheduler in the 2.4 kernel had a single runqueue design, which limited throughput and increased lock contention on systems with multiple CPUs. To improve the scheduler and make it NUMA-aware, a new multi-queue scheduler with one runqueue per processor was developed.

The new scheduler changed the load balancing logic and facilitates dispatching processes on the same processor to take advantage of cache warmth. This makes the execution of individual processes more efficient and produces much lower CPU-to-CPU memory access traffic than what a single shared-memory bus would experience.
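
The cache-warmth argument is easy to experiment with by hand. The scheduler now tries to keep a process on the same CPU automatically, but the same effect can be forced with sched_setaffinity(); the following Linux-specific sketch (my own, not from the article) pins the calling process to CPU 0 so that its working set stays in that processor's caches across timeslices.

    /* Linux-specific sketch: pin the calling process to CPU 0 so that it
     * keeps reusing the same processor's caches across timeslices. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pid %d pinned to CPU 0\n", (int)getpid());
        /* ...memory- and cache-intensive work would go here... */
        return 0;
    }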

As for the benchmarks themselves, surely nobody who works in the HPC market will be surprised to learn that binaries compiled with the Intel compilers ended up performing much better than binaries compiled with GCC, with the additional finding that some of the executables compiled with GCC generated segmentation faults when running on AMD64. GCC still has some way to go before it reaches the same level on 64-bit platforms as it has on 32-bit, although it is true that its main claim to fame has always been good enough performance and, above all, portability. {link to this story}

[Tue Nov 1 16:10:46 CST 2005]

A geeky joke I found while reading Planet Debian:

Richard M. Stallman, Linus Torvalds and Donald E. Knuth engage in a discussion on whose impact on the computerized world was the greatest.

Stallman: God told me I have programmed the best editor in the world!
Torvalds: Well, God told *me* that I have programmed the best operating system in the world!
Knuth: Wait, wait. I never said that!

{link to this story}