Dr. Jerry Pournelle

Computing At Chaos Manor:
September 8, 2009

The User's Column, September, 2009
Column 350
Jerry Pournelle jerryp@jerrypournelle.com
Copyright 2009 Jerry E. Pournelle, Ph.D.

I don't do topical news, but it's worth noting that Microsoft will still be allowed to sell Microsoft Word. A federal court in Texas had issued an injunction forbidding Microsoft from selling Word after October 10; that has been overruled by an appeals court pending a hearing currently scheduled for late September.

All this is over a patent conflict: a Canadian company claims Word 2007 violates patents it holds concerning the reading of XML files. I haven't followed this case very closely. Apparently Microsoft lost in trial court, but believes it has valid patents of its own (which will be licensed as part of Office Open XML), that Microsoft hasn't infringed anything, and the format ought not to be restricted. More information here.

Word is one of Microsoft's biggest cash cows, so this is all a very serious matter to the companies involved. It's pretty sure to be settled. Microsoft might even buy the company that brought the suit. For the latest on all that, see this link.

Whatever the outcome, there's a much larger issue for the world. Most of the world's most important documents are created in Microsoft Word. Many of them are created in docx format (docx uses XML). Those files can't be opened by Word 97 and older versions, although there are ways around that and all versions later than Word 97 have a free download that will let the program open docx files.

As an aside, when Office 97 came out I got an early copy in late summer 1996 and noted how big it was - several hundred megabytes. I called it "bloatware" in one of my columns. A friend who happened to be a Waggener Edstrom executive saw an early draft of that column and said "O please don't call it that!" but I went on and did it. It seemed very large: in those days, hard drives were very much smaller. Of course within months - possibly in the next column - I had to eat my words. (In those days columns were filed three months before publication. Things often changed fast in those days.)

First, really large hard drives began to flood the market so that Office 97 wasn't very big compared to systems storage capability; and second, Office 97 and particularly Word 97 were well worth the upgrade. I'm no fan of adding features for their own sake, but Office 97 had added many useful ones. As I recall, my original ire was roused because I was on the road and it took a long time for Office 97 to install on my laptop (possibly from floppies, but I don't really recall) and it ate a lot of the laptop's disk space - but of course I hadn't needed all of Office 97 on the road to begin with. I should have installed Word only. In any event I soon discovered that I liked Office 97 a lot.

I've used Office ever since, and I've recently begun converting all the machines at Chaos Manor to Office 2007. I generally create documents in the older .doc format, because I have no need for .docx features; but that will change as people get accustomed to using XML features. That trend is what makes this patent lawsuit important to us all: we need standard formats. One of the big complaints often made against Microsoft has been the proprietary Word .doc format. This was made public over time, and many third party programs for opening and saving documents in Word format appeared. Microsoft then developed Office Open XML.

Microsoft submitted Office Open XML to the European standards commission as a draft standard. The commission made some changes and a new standard was published. Office 2007 conforms to that standard - almost. There are differences. The next edition of Office will be fully in compliance.

The Office/Word .docx format has become the de facto standard for many new documents; it's important that it remain open. I don't have much information about the current lawsuit, but I am quite certain that patents limiting access to common documents created by many different word processing programs are not in the public interest. If the formats are published, there are only a limited number of ways they can be accessed, and I do not think that rewarding the first one to file a method for doing something that everyone can see needs doing is in the public interest. Of course that needs judgment: some methods may be more clever than others, some may not be, and patent examiners aren't likely to be experts on this. What's obvious to some programmers may look like a brilliant innovation to others; now who do we believe? This mess needs to be settled, and it may be time for Congress to look into the relevant patent laws.

I have among my friends and advisors advocates of two schools of thought. One is that intellectual property rights are absolute and no different from any other property right. The creator of a book has the same rights to that book and all its copies as he would have to a house he has built, or a horse he has bought. The other view is that this isn't so: my making a copy of a book without permission of the creator does not deprive anyone else of a copy of that book, and isn't really stealing. Intellectual property is more of a convention, and the moral rights are different.

In general I hold with the absolute rights of creators over their intellectual property, but I also understand that the Powers that grant intellectual property rights also have the right to set conditions. It is worth noting that in the United States, the Constitution explains that Congress has the power to grant copyrights and patents - monopolies - for the specific purpose of promoting the useful arts and sciences. There is no moral justification given or attempted. It is a power granted for a specific purpose; which means that Congress can set limits to what is patentable - as indeed it already does. Patents that restrict access to de facto standards may be worth exceptions to the general patent laws. Whether the Congress has access to sufficient expertise to draft sensible patent exception rules that will help, not hinder, the development of the useful arts is debatable - but the debate ought to be held. There are many conflicts between individual rights and public interest, and they do not all get decided the same way; nor should they.

Snow Leopard

One of Aesop's fables concerns a mountain in labor: the rumbles and noises were enormous, and great things were expected. Eventually a crack opened in the mountain, and a mouse ran out.

From preliminary reports on Apple's Snow Leopard this seemed to apply; but perhaps not. Few users have had much trouble with Snow Leopard. It's certainly simpler than installing Windows 7, and it's not expensive. On the other hand, I generally accept the maxim, Be not the first by whom the new is tried, nor yet the last to cast the old aside...

Apple users certainly ought to convert to Snow Leopard at some point. The question is when. My usual advice in such matters is to wait a month or so, particularly if you have many non-Apple components in your system. Wait for drivers to be tested and refined; it won't be long now, and there's no harm in waiting. I have reports from some readers about printers and various other devices being unhappy with Snow Leopard.

On the other hand, some report really great advantages from Snow Leopard.

Managing Editor Brian Bilbrey has this view:

Snow Leopard. I installed it the day after release. It makes *everything* faster.

(Sealiesoftware link)

That's a very technical article explaining why, but essentially native software has a huge boost due to prelinked and loaded libraries. After I installed the new version of the OS, the speed bump in application launching was noticeable right off the bat. Safari, too, is much faster at rendering than previously. The only thing that the update broke was my Twitter client, Nambu. Everything else (which I've been keeping up to date) works fine, including Parallels. Now, I cannot update my Mac Mini, because it's one of the last of the PowerPC minis that was sold. That's one more stellar example of purchase mis-timing on my part.

I'll probably install Snow Leopard next week. For the moment I'm still making sure Windows 7 works properly, and for now it's one thing at a time. Certainly we'll all be using Snow Leopard in a few months. All of us who can, anyway. I'm told the installation takes less than an hour and goes without problems.

Exchange users will be pleased to know that Snow Leopard works well with Exchange. I'm not an Exchange user, but here's an account:

(Engadget link)

Windows 7 Launch

Remember the enormous Microsoft launch parties last Millennium? They held them in San Francisco, or New York City, or the Shrine Auditorium in Los Angeles, and they generally had big stars on stage and Bill Gates as host. I recall when Excel for the Mac was launched - there wasn't a version for DOS or Windows (actually there wasn't a Windows). There was a huge launch party at the Tavern on the Green in Manhattan's Central Park. I recall huge launch parties for Internet Explorer, various versions of Windows, even new versions of Office. Back in BYTE days I had to go to all of them. Not only were there the official Microsoft launch parties, but a whole host of subsidiary parties thrown by Microsoft customers and associates.

No more. I don't much go to such events now, but then there are not many to go to. If there was a big launch party for Windows 7 I must have missed it. I do note that there are to be multiple house parties for the Windows 7 launch, and you can host one of them. Perhaps. You can try, anyway. It's apparently a Big Deal, but somewhat scaled down for the worldwide recession. If you're interested you can find out more on the web: here, in the US, or in Japan at this link.

Windows 7 at Chaos Manor

We have upgraded most of our systems to Windows 7, and I can recommend that Vista users do that when they can. I have many reports from Windows 7 users, and I've seen no reasons for not upgrading Vista to Windows 7. I do have to confess that while Windows 7 networking is simpler than Vista networking, there are still anomalies that drive me wild. Some computers connect easily, others say I can't connect, and I don't know any differences between them. Very odd. I'm still learning the new network philosophy.

The upgrade installation from Vista works: there's no need to reformat the drives and do a new installation. So far I have done this with four machines, and I have seen no bad consequences, and a number of good ones.

Upgrade installation can take several hours, and you want to pay close attention to instructions. First you'll be offered a compatibility check. Then you'll be offered the opportunity to be online during the installation, thus upgrading your version of Windows 7 as Windows 7 replaces Vista (as an upgrade). I took this option, and it worked fine for me.

The compatibility check did tell me that a number of programs would no longer be compatible. The most important one was Skype, which is the communications program I use to participate in many conference calls including Leo Laporte's TWIT (This Week in Tech). The message said I'd have to reinstall Skype to use it with Windows 7, or at least that's what I thought it said. In any event it was of no importance; as described below, I have Skype working just fine in Windows 7.

Other possible problem programs were iTunes and Cyberlink's PowerDVD. I don't use either of those programs on Roxanne, the Vista machine I was converting to Windows 7, so I didn't worry about that. I used to use PowerDVD to watch DVD movies. It came with most DVD drives I bought, I got used to it, and I used it whenever I wanted to watch a movie on a Windows machine. Then I got out of the habit of watching DVD movies on Windows at all: it was much simpler to use the iMac. The iMac's screen is good, the player program is easy to use, and that's what I've been using, so I didn't anticipate any problems arising from the lack of Cyberlink's DVD player.

The next thing was to figure out why the installation didn't proceed. As I said earlier, the important thing is to pay attention: eventually I figured out that the message wasn't just about what programs wouldn't work; it was telling me that I had to reset the system before Windows 7 would install. I did that, did nothing when the system offered to boot from the DVD, and let things continue.

The installation took about four hours. Eventually it wanted me to give the software key, then came more trundling. There was a repeat of the warning about programs that wouldn't work with Windows 7, but this time it didn't demand a restart, only that I OK this fact. More trundling, and finally success.

The system already knew its name, and the workgroup it belonged to; and the first thing I found out about Windows 7 is that the networking works enormously better in Win 7 than in Vista. Roxanne had previously been unreliable in discovering Windows XP systems including those on the iMac; she could sort of be forced to find them, but only if I knew exactly how to do it, and even then it was hard to force access to them. No longer. Windows 7 networking is almost as easy as Mac networking, and that includes networking to Imogene the iMac. There are still some anomalies, mostly caused when trying to log on to a machine with a different user name and password. Macintosh OS-X handles this by allowing you to input the workgroup name and user name under "log in as"; Microsoft doesn't offer you that opportunity, which can be terribly frustrating. I'm sure I'll figure it out, but I may need some help. The good news is that Imogene, the iMac, sees all Windows 7 machines and can connect to them so I can transfer files anywhere; and of course Time Machine backs up anything I copy to the Mac even if the copy was only meant for transfer elsewhere. One day Windows may make networking as easy as it is with Mac OS-X. We can only hope.

Windows Media Player

Windows Media Player just works in Windows 7. It probably did in Vista - I just didn't use it much. Media Player has many more options than the versions of Cyberlink PowerDVD had, and once you figure out the Microsoft Way it's about as simple for playing DVDs and files on your hard disk as the Mac makes things. It also works with various programs that bring in TV. Microsoft has done a good job with Media Player.


The easiest way to update Skype on a Vista machine is to open the old installation (it's still there). The update is pretty well automatic, and the instructions aren't too confusing. It automatically knows your friends and contacts from the old installation. I wasn't sure of that, so I tried to import one of my friends; the program told me it couldn't do that because he was already in my list. After that I paid a bit more attention.

It took about five minutes to update Skype. It works just fine.

Converting to Office 2007

Office 2003 users who convert to 2007 at first find themselves at sea: the Office 2007 menus are more logically organized and eventually easier to use, but old habits die hard, and you will find yourself pushing the F1 key for help more often than you want to.

One of my readers has a suggestion:

Hi Jerry,

After upgrading to Office 2007, I spent a lot of time customizing the various quick launch bars to surface commands that I couldn't discover on the ribbon without a lot of frustration. Recently, I found a free (for private use) tool on both Lifehacker and Kim Komando's site.

While it doesn't convert the ribbon back to the Office 2003 menu structure, it adds a Menu tab to the ribbon that in turn contains the original Office 2003 menus for the application. It even helps you discover where the commands were moved in the Office 2007 applications.

If you use the 2007 versions of Word, Excel and PowerPoint, this tool may help ease the transition.

All the best,

Jim Floyd

I tried this and found it interesting, but I am already pretty well used to the new Office 2007 ribbon, and decided not to keep the addition. While I had it running I experienced no problems with it.

I suspect I would have been very glad to have it in the first weeks of using Word 2007. I mostly concentrate on Word, but the changes in Excel and OneNote can be confusing too. The difference is that I don't use Excel and OneNote as much as I use Word, so the old habits aren't so hard to extinguish.

It's Dangerous Out There

It started with a call for help. Roberta had messages that wouldn't go away, and they insisted that she download programs to fix her system. That didn't seem like a very good idea. Much of the malware running wild on the Internet invades your system by advertising itself as security software. Running any program you didn't ask for is dangerous.

Roberta's system is old enough that it has a serial port with a modem attached. The hard drive is small. It was last upgraded to Windows XP when XP was relatively new. I kept offering her an upgrade to Vista, or even a new machine, but she kept insisting that what she had was good enough. That was before I had my radiation therapy. The radiation therapy was successful, but it put upgrading her system out of mind. Her system wasn't broke so it didn't need fixing.

Only now it very much needed fixing. I went down to have a look, and the system was very sick indeed.

First there was a big red X in the right hand tray, and from it grew a warning that said the system was threatened, needed virus protection, and click on that to install the protection software. There wasn't any way to close that window: that is, there was a small x in the window, but clicking on that was the same as clicking on the window itself.

There were other warning messages popping up on screen. I suspected that Roberta had tried to close one of the messages, which meant that some of the malware had probably installed itself. That was confirmed when I tried to get to some of the security web sites: we were blocked from them. The machine was definitely infected.

I was able to get to the ESET web site and from there to the ESET online scanner. That works well if slowly, and I let it grind on Roberta's machine. It instantly found infections, and reported "exploit:JS/ Mult BB". You can look that up in the Microsoft database if you like, but you won't learn a lot. It's a Trojan downloader, and you can get it if you haven't updated things properly. I thought Roberta's system was properly updated, but one of the exploits apparently uses Acrobat Reader, and that doesn't automatically update: we suspect that may have been the source.

We think she got this in an email forwarded by her sister: Roberta followed the links, and that did it. We're not sure, though, and thus I can't tell you what she should have done; as I've said, my guess is that when the "You're in danger, click here to install safeguards" offer popped up she tried to close it by clicking on the little closing x - but that was the same as clicking on the offer window itself.

In any event, ESET went through all her files and found a number of them it wanted to delete. I told it to do so. Now it was time to reset the machine and hope all would be well. Alas, before I could do that, Microsoft insisted on some kind of update over which I had no control. Then the machine reset. It came up and for one glorious moment I thought we were in good shape: but then I made the hideous error of restoring the crashed Firefox session, and we were back where we started.

Nuke It from Orbit

The infection was not only serious, but was also sophisticated. For example, if I did control-alt-delete, the "task manager" tab on the menu that popped up was grayed out. Moreover, it wasn't possible to boot the system in safe mode. It just wouldn't boot.

Had this been one of my machines I'd have set it aside to work on at leisure just to see if I could disinfect it, but this was Roberta's main and only machine. It was well backed up except for the current Outlook files, but it had essentially everything she has ever done on it. It would be easy to panic...

Eric Pobirs offered this advice:

First, download and run the MS Windows Malicious Software Removal Tool. It covers a lot of stuff.

Make sure hidden files are turned off then run searches for the file types you want to back up. You can copy from the search results rather than manually hunting them all down. You can look in the registered file types control panel to see if there is something an application uses you didn't know about.

Whenever I get a situation where a popup makes me suspicious of any interaction at all, I invoke Task Manager and nuke the browser. It's rare but comes up once in a long while. I'd rather lose a few web pages that are hard to find again than have an infection.
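The file-gathering step in that advice (search by file type, then copy from the results) can also be scripted. What follows is only a minimal sketch in Python, not anything Eric or I actually ran: the function name collect_files, the folder paths, and the list of extensions are all hypothetical, and it assumes the backup folder lives on a separate drive rather than inside the folder being searched.

```python
import shutil
from pathlib import Path


def collect_files(source_root, backup_root, extensions):
    """Copy every file under source_root whose extension is in `extensions`
    into backup_root, keeping the same relative folder layout.

    Assumption: backup_root is NOT inside source_root (for example, it is
    a folder on an external USB drive).
    """
    source_root = Path(source_root)
    backup_root = Path(backup_root)
    wanted = {ext.lower() for ext in extensions}
    copied = []
    for path in source_root.rglob("*"):
        # Compare extensions case-insensitively: LETTER.DOC matches ".doc".
        if path.is_file() and path.suffix.lower() in wanted:
            target = backup_root / path.relative_to(source_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied


# Hypothetical usage; the paths and extensions are illustrative only:
# collect_files(r"C:\Documents and Settings\Roberta", r"E:\backup",
#               [".doc", ".xls", ".pst", ".jpg"])
```

Whatever method gathers the files, the copies should still be scanned on a clean machine before they go anywhere else.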

When I explained that I couldn't get to Task Manager, Rick Hellewell pointed out:

Rather than doing Ctrl+Alt+Del to get into task manager, try Start, Run, "taskmgr.exe" (no quotes). If you need to run as administrator, use the Start button to see the "Command prompt" choice, right-click that and select "Run as Administrator", then type in "taskmgr.exe" at the command prompt.

I was also advised to download various software fixing tools on a good machine, copy them to a big thumb drive, and bring them over to Roberta's system and run them from that. All good advice.

But everyone was also a bit concerned that no matter what I did, we'd never be very sure that the infection was gone. Better to scrub down to bare metal, and reinstall everything.

Nuke it from orbit.

Saving Everything

All things considered, this looked to be a good time to upgrade Roberta's entire establishment: get her a new system with larger hard drives, faster DVD burner, faster CPU, and Windows 7 operating system. While we were at it I'd install Office 2007.

This seemed like a good idea. My first move was to bring down a Seagate USB Book drive to transfer all of Roberta's files to. Her system was working, sort of. I disconnected it from the network so it couldn't bother anyone else, plugged in the USB drive, and began to move files. Naturally this took a while, but eventually I had them all. I then took the book drive upstairs to a Windows 7 system, plugged it in, and used both ESET and Microsoft to scan that drive for any indications of infections. None. Her data was safe and could be transferred to another machine.

The next question is what machine. One possibility was to get Roberta a Mac. Another was to set her up with a Linux system. She wasn't interested in either of those alternatives. Changing to Windows 7 would be hard enough, she thought.

My first selection was Silver. This is a Pentium 4 machine that sat at Larry Niven's work station in the Great Hall until its monitor died in that spate of monitor deaths that happened a couple of months ago. Silver is not a particularly powerful machine, but Roberta doesn't need that. She needs good Internet access and mail access. Silver's most important use was as a machine for Niven to work with when he is here, and naturally any of that work is instantly copied to other machines. The other files on Silver were old game files. Those were easily copied.

It took an hour to reformat Silver's C: drive, and less than that to install Windows 7. I installed Office 2007, then transferred all of Roberta's files. The hardest part of the job was carrying Roberta's new machine downstairs. She was able to use it immediately, and while Outlook 2007 takes a bit of getting used to, it wasn't really a problem. She got used to using Silver without problems, and she was much impressed with the new speed.

The Fans

I've been hard of hearing all my life, and Niven had never complained about Silver; but in Roberta's smaller office it was obvious to me that Silver had some very loud fans. He was just plain noisy. After a couple of days it was obvious that he was too noisy, and we decided to set Roberta up with another machine.

Faster but Slower?

Roxanne was the first machine to get Vista. I did an upgrade installation of Windows 7 onto Roxanne - the process is described in the first part of this column - and once again transferred Roberta's files. She worked perfectly, finding all the machines on the network. I did a series of tests, decided she was ready, and carried Silver back upstairs and took Roxanne down to Roberta's office.

Roxanne is the last of the Pentium systems, and one of the fastest. She's nearly two years younger than Silver. Roberta noticed the difference instantly. Roxanne is slower than Silver. Much faster than her old machine, of course, but slower than Silver. Much quieter than Silver. But definitely slower.

Roxanne is newer and ought to be faster, but she is not. The only explanation I have is that Roxanne got an early beta copy of Vista, and was upgraded each time a new version of Vista came out. Then she got an upgrade version of Windows 7. That keeps her slow. I'll test this the next time I get a chance - which is to say when I have a whole day to work with Roberta's machine. I'll peel off all her software, scrub Roxanne to bare metal, and reinstall Windows 7 from scratch. My guess is that she'll be even faster than Silver.

I do lots of silly things so you don't have to, and installing upgrade editions of Vista and then Windows 7 is one of them. Note that Roxanne is not slow. She's pretty fast, and she works just fine. It's just that Silver, an older machine with a start from scratch installation of exactly the same version of Windows 7, is noticeably faster. You may draw what conclusions you like from that.

Installing Windows 7 on a Windows XP System

My only installation of Win 7 to an XP system involved reformatting the hard drive, but that isn't really necessary. Captain Morse suggests:

The next time you get a chance to play with a Windows 7 installation to an existing machine, I recommend you give the "Windows Easy Transfer" utility a try. It's pretty slick. People migrating from XP will especially find it useful as they can't upgrade their existing installations as one can with Vista. I found it useful because it picked up on a number of system settings and user configurations that I would have overlooked had I done the move strictly on my own...like my Firefox profile.

Ron Morse

I'll do that next time. I have at least one more machine to upgrade.

Do You Need Windows 7?

Upgrading to Windows 7 is easy enough but it can be tedious. There is also the cost. Should you do it, and are you in a hurry?

I'm not at all happy with Microsoft's pricing of Windows 7. Mostly that's perception: the price of all the other components in a PC has plummeted, but the OS pricing doesn't look as if it's keeping pace. Of course as Eric points out, that's mostly true for those who buy boxed non-upgrade licenses, which are under 5% of Windows sales. You can get the Windows 7 family pack with 3 upgrades for $150; that's Home Premium, which is good enough for most. It doesn't allow you to connect to a domain, but few home users have any need for that.

I make no secret of my bewilderment - I might also say contempt - at the myriad of Windows 7 versions and the complexity of the price structure, but we've been through that before. Most users won't need the additional features in the more "advanced" and expensive versions of Windows, but those features are in the OS, just not turned on, and that's aesthetically upsetting to me. It reminds me of the old days when Wang word processors had BASIC and other capabilities, but you couldn't access them. Barry Longyear decided he wanted one of the games I wrote about in my BYTE column; he was using a Wang word processor. He got Wang to turn on BASIC for his Wang word processor. A customer service engineer in white shirt and necktie came out and removed a jumper. He now had BASIC and his bill went up a few dollars a month. I found this disturbing at the time, and Microsoft's practice of sending you features you can't use gives me the same impression. Perhaps I'm irrational about it. I still don't like it.

In any event, if you're a Vista user, upgrading to Windows 7 is probably a good investment. Windows 7 is easier to use than Vista. Its networking capabilities are better, and after you get used to some changes, it's more fun.

More importantly, it's safer. Whether Windows 7 is much safer than Vista is debatable, but it's certainly safer than XP, and XP users ought to plan on converting to Windows 7 as soon as they can. Roberta is a pretty careful user, we have firewalls and a router, and in general we practice safe computing at Chaos Manor; yet she got an infection that crippled her machine. Windows 7 like Vista has a number of safety features built in, and for XP users the safety factor alone is worth upgrading for. Of course you can get even greater safety by switching to a Mac or to Linux, but that's a different story.

Just in Case

Eric Pobirs comments on the infection:

Adobe reader does have an auto-update function built in on the last several versions but I find that users frequently dismiss the notification and the update never gets installed.

Once a machine is badly infected the only thing that really works reliably is software that runs from a boot CD or flash drive. This allows the system to be cleaned without the malware running in the background creating new hiding places in the areas already scanned. The Geek Squad, where my brother works, has a very good tool for this called MRI that combines a bunch of different malware scanners from various companies and provides a lot of tools for getting things done in a comfortable GUI. I don't know of anything comparable on the retail market but most malware companies have their own bootable scanner solution using Windows PE (the Windows Preinstallation Environment).

Security Expert Rick Hellewell adds:

Best source of info and help is the HijackThis forums on the www.bleepingcomputer.com site. Start here. Important: print out that page and follow the instructions carefully, don't skip any steps. Download the files to a thumb drive from a non-infected computer.

The folks at BleepingComputer are your best bet for removal. It may take several passes to get things done correctly, but if you carefully follow their directions, you should be able to ensure a clean computer. Even though you may want to put Windows 7 on that computer eventually, the process might be instructive and interesting to your readers.

Good luck.


I thought of doing that, but Roberta was in a hurry. I decided that recovery would be possible, but it would be tedious, and we'd always remain worried: did we get it all? In the end I decided it would be best to save the data, scrub down to metal, and reinstall. I still have the machine and I may try the recovery option another time.

Internet Explorer and Firefox

I continue to use Firefox for my primary browser, and I continue to gnash my teeth as I do it. It's not that Firefox doesn't work. It works fine, when it's not demanding that it be updated. At the moment it continues to demand that I update to Firefox 3.5. It also warns me that many of my add-ons and extensions probably won't work properly. That has frightened me off so far, but I suppose the upgrade to 3.5 is inevitable. I've been putting it off, because one reader reports that 3.5 has problems if you keep a large number of tabs open. Since it's my habit to use open Firefox tabs as a reminder of what I need to read next, this scares me.

Meanwhile, I continue to find the latest Internet Explorer quite satisfactory but deficient in add-ons and extensions. Firefox has far more of both, and they work pretty well. The list for Internet Explorer is pretty small, hard to search properly, and deficient in the kind of add-ons I like. I suppose eventually Internet Explorer will catch up, but it doesn't look as if that will be very soon.

I know I do lots of silly things so you don't have to, and I'll get to Firefox 3.5 next week.

The Natural Language Game

When I was an undergraduate I became interested in language analysis. I studied General Semantics with Wendell Johnson at the State University of Iowa and even managed some visits to Hayakawa and some of Korzybski's people. I read through Korzybski's Science and Sanity twice, and I don't regret a moment of it.

One of the things we did in those days was to investigate authorship by means of linguistic statistics. Two primary tools were type/token and verb/adjective ratios. A type/token ratio is the ratio of unique words to the total number of words; verb/adjective ratio should be obvious. Those turn out to be surprisingly invariant for a given author, and we hoped to solve some of the great authorship mysteries with those and other tools.
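For readers who want to try this today, here is a minimal sketch of a type/token calculation in Python. The tokenizer is a deliberately crude assumption (lowercase runs of letters and apostrophes), and the function names are illustrative only. The verb/adjective ratio is not shown because it needs a part-of-speech tagger.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only).
    This is a simplifying assumption; real tokenizers handle much more."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Ratio of unique words (types) to total words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(tokens)) / len(tokens)

sample = "the cat sat on the mat and the dog sat on the rug"
# 8 unique words out of 13 total
print(type_token_ratio(sample))
```

The ratio tends to fall as a text gets longer, because common words keep repeating, so it is probably safest to compare samples of roughly equal length.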

Of course the immediate problem was that extracting that information from a text was excruciatingly dull.

There wasn't any way for a computer to do it. In those days the ILLIAC at the University of Illinois was one of the world's most advanced machines. I visited it once: with offices for programmers and administrators, ILLIAC filled an old basketball stadium. It used vacuum tubes, and two undergraduates ran up and down the aisles replacing tubes as they burned out. ILLIAC had no scanning capabilities at all. It could compute stationary time series, Pi to hundreds of decimal places, missile trajectories and the like, but my iPhone has considerably more computing power than ILLIAC ever did.

Most of our literary analysis work died away for want of an easy way to accumulate statistical data and manipulate linguistic texts.

During the 1980s there was a rash of programs intended to analyze text for readability. One of the best was the Scandinavian PC Systems Readability Program for the IBM PC, XT and AT. It was a DOS program that read only ASCII files, and it didn't survive the transition to Windows. Microsoft incorporates a readability statistic generator into Word, but I don't know anyone who uses it, because to get the statistics you must allow the Word grammar checker to go through your whole document. That is so tedious that I at least give up long before the check is finished. The Microsoft readability program gives the Flesch formulas, but not the kind of detailed analysis that the old Scandinavian PC Systems program did. There was a time when I took the trouble to convert my columns to plain ASCII and ran the Scandinavian program against them. It identified long sentences, choppy sentences, foggy sections with far too large a ratio of bricks to mortar, and various other patterns that affect readability.

Bricks/mortar is the ratio of total unique words to the 450 most common words in the English language: if that gets too high, either explanations or vocabulary adjustments may be in order. Too many long sentences affect readability. So does a series of short choppy sentences. The SPC Systems program diagrammed my essays, much as screenwriters look at paragraphing from a distance to see if a scene is likely to play. (Big blocks of speech in general don't play...)
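A rough sketch of those two checks in Python follows. It rests on several assumptions: bricks/mortar is interpreted here as the count of words outside a common-word list divided by the count of words inside it, the twelve-word list is only a stand-in for the 450-word list, and the sentence-length limits are arbitrary. None of this is taken from the SPC Systems program itself.

```python
import re

# Stand-in for the 450 most common English words (assumption: the real
# list is not reproduced here, so this tiny set is for illustration only).
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
                "on", "was", "that"}

def words(text):
    """Lowercase word tokens: runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def bricks_to_mortar(text, common=COMMON_WORDS):
    """Uncommon words (bricks) divided by common words (mortar)."""
    tokens = words(text)
    if not tokens:
        return 0.0
    mortar = sum(1 for w in tokens if w in common)
    bricks = len(tokens) - mortar
    return bricks / mortar if mortar else float("inf")

def sentence_lengths(text):
    """Word count per sentence, splitting naively on . ! and ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(words(s)) for s in sentences]

def flag_sentences(text, long_limit=30, short_limit=5):
    """Return indices of overly long and very short sentences.
    Both limits are hypothetical defaults, not published thresholds."""
    lengths = sentence_lengths(text)
    long_ones = [i for i, n in enumerate(lengths) if n > long_limit]
    short_ones = [i for i, n in enumerate(lengths) if n < short_limit]
    return long_ones, short_ones

sample = "The cat sat on the mat. It was warm."
print(bricks_to_mortar(sample))   # 4 bricks over 5 mortar words
print(sentence_lengths(sample))
print(flag_sentences(sample))
```

A run of consecutive indices in the short list would suggest a choppy passage, while a high bricks/mortar figure for a paragraph would mark it as a candidate for more explanation or simpler vocabulary.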

I discussed this in the January 2009 mailbag.

All of which is an indirect introduction to the O'Reilly book Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper.

Python, for those who don't know, is an interpreted language. It's very powerful, and it's free. If you have a Mac you probably already have it.

If you're not interested in the subject of how to process and manipulate natural languages you will not want to read Natural Language Processing with Python: it begins by telling you how to download Python and the Natural Language Toolkit (NLTK), also free. All this happens on page 3. Page 4 tells you how to start processing text from Moby Dick and other sources available in the NLTK. In other words, this is a highly technical book, intended for people with some familiarity with computers. You don't have to know how to write programs in Python to do the exercises in this book, but it wouldn't hurt to have some understanding of basic programming principles, and a nodding acquaintance with Python.

For that, incidentally, I recommend Learning Python by Mark Lutz, also from O'Reilly. If you've toyed with the idea of learning some principles of programming, this is a great way to get started: Python is free, and there are a lot of free libraries and databases to give you something to program with. If you find you like playing with Python, I can recommend David Beazley's Python Essential Reference from Addison-Wesley. It's exactly what the title says it will be, a reference work for Python users, and it's quite complete.

You don't really need any Python book to learn a lot about text processing and manipulation. Natural Language Processing with Python tells you all you really need to know, and if you go through the whole book working out all the examples you will know more about the subject than many professors of linguistics: at least you'll know how to build Confusion Matrices, determine the entropy of labels, and have some understanding of Bayes Classifiers. The text is dense but informative, and the illustrations and examples illuminating. I very much wish I had this book (and a computer I could run Python on!) when I was an undergraduate. It would probably have changed my career.
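To give a flavor of two of those ideas, here is a small sketch of label entropy and a confusion matrix. The book works with NLTK; this sketch deliberately uses only the Python standard library, and the function names and sample tags are made up for illustration.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy, in bits, of a list of labels.
    Zero when every label is the same; higher when labels are mixed."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def confusion_matrix(gold, predicted):
    """Count how often each gold label was paired with each predicted label.
    Keys are (gold, predicted) tuples."""
    return Counter(zip(gold, predicted))

# Hypothetical part-of-speech tags: the correct answers and a tagger's guesses.
gold = ["N", "V", "N", "ADJ"]
guess = ["N", "N", "N", "ADJ"]

print(label_entropy(["a", "a", "b", "b"]))  # an even two-way split is 1 bit
matrix = confusion_matrix(gold, guess)
print(matrix[("V", "N")])  # times a verb was mistaken for a noun
```

The off-diagonal entries, where the gold and predicted labels differ, show which categories a classifier confuses with which.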

I don't recommend this book for casual reading, but if you want to know what's possible in language manipulation, as well as learn how to do those calculations, this work is remarkably complete. I have never seen anything quite like it. Recommended, provided you have a strong interest in the subject.

Winding Down

The book of the month may be hard to find: I happened to find a reference to Wilmar Shiras' Children of the Atom and looked up my own copy, which I discovered in a pile of books I haven't seen since the earthquake. I read the first part of Mrs. Shiras' book when I was in high school. It was called "In Hiding" and was serialized in Astounding Science Fiction. The story had a great influence on me and my reading habits; you'd have to read it to understand why. I devoured the next installments as they came out in Astounding over the next few years, then pretty well forgot the book until recently. It's technically a young adult novel, but it holds up well and I enjoyed reading it again. You might like it.

"In 2004, Dennis Charnetzky and Daelyn Charnetzky started a series of repairs to their two-bedroom home located in Valparaiso, Indiana. The Charnetzkys renovated their bathroom, refinished their hardwood floors, added a splash of paint, and put up some new wallpaper. These improvements caused the property value to shoot up nearly $400 million.

"Because the property value increased, the county's computers automatically adjusted the Charnetzky's property tax liability. With an increase in the tax base, municipal budgets were increased accordingly. By the time the typo that was entered into the county computer system was discovered, the Valparaiso school district and government agencies faced a financial shortfall and were forced to cut budgets by $3.1 million." The incident (NYT link) is used as an example in the computer book of the month, Viral Data in SOA: An Enterprise Pandemic by Neal A Fishman (IBM Press). SOA means Services Oriented Architecture for those to whom the term is as obscure as it was to me. Viral data is more difficult to define, although it's not hard to give examples, and the antonym to viral data is, according to Fishman, "trusted data."

If you Google "viral data" you'll find that the first page of references all point to this book or comments about it: there's just not that much about viral data and its effects, and even less on what you can do about it. If you're in information technology and your business depends on data accuracy, you'll want to read this book just to see what one expert thinks on the subject. You'll learn something. In a way I am reminded of a story Herman Kahn once told in a lecture to our systems analysis group at Boeing. Herman said "At Hudson Institute we have the world's first, second, and third most knowledgeable experts on ending a nuclear war." Then he laughed and said "Of course we don't know which is which. We assigned three junior analysts to think about the problem for two weeks."

I won't go so far as to say that no one else has thought about viral data and how one may defend against the harm it can do, but I will say it's hard to find much about the subject, and most of what I do find either comes from this book or was generated by discussions of the book. As to the harm viral data can do, in September 2008 a Googlebot elevated a story in a Fort Lauderdale newspaper to the Most Viewed tab. The story was six years old, but that data was not included. It said that United Airlines was filing for bankruptcy. The effect was immediate: United's stock fell precipitously, then bounced slightly before trading in United was halted. A lot of people lost a lot of money. Whose fault was this? Did Google do evil? What can be done to prevent this in future?

Fishman's book tries to address these questions. I'm not enough of an expert on the subject to say how well he does it: I do know that I now know more about the subject than I used to, and that alone made reading the book worth doing. We haven't heard the last of the effects of viral data infecting our Services Oriented Architecture.

The Movie of the Month is 500 Days of Summer. I owed my wife a chick flick, and she chose this one because we weren't able to find Julie and Julia at a theater near us. It stars Zooey Deschanel, sister of Emily Deschanel who plays Dr. Temperance Brennan in the TV series Bones. I first remember Zooey as the ingenue department store clerk in Elf, where she was very believable. She comes off extremely well in a rather difficult role in one of the summer's best movies. Just be warned that it's not really a romantic comedy.

There is no game of the month. I've been working on Mamelukes, which I hope to finish this month, but then I hoped to be done with it before August ended.