In the comments for this blog entry on NMEA 2000 we have been having an interesting debate about whether the NMEA 2000 glass is half full or half empty. I remain a proponent of N2K, but continue to run into implementation problems that are real spoilers. This stuff just doesn't appear to be tested enough against the basic standards requirements, or for practical interoperability across vendors.
I remember back when I was building networking products there used to be industry plug-fests to see how things worked together. My wife was really concerned about what a "Plug-Fest" might really be about, but she knew we geeks weren't that imaginative. So for a few days all the industry's competitors would put down their guard and meet up in a hotel somewhere, loaded with equipment. There was a schedule where the vendors would go from room to room like honey bees pollinating flowers, except they would be testing their equipment against everyone else's to be sure it worked together. Everyone knew that their product's success depended on working properly with other vendors, even though we all preferred that everyone buy everything from us. Everyone also knew that a customer's site was NOT the place to discover and debug problems. It worked, and UNH even created a networking interoperability lab that vendors still use to this day.
Dial forward to my last company, EqualLogic, where we were among the first people to use a new standard for data storage over networks. Our lives depended on that standard working, and on our products working with everyone else out in the market. And to complicate matters, we were dependent on a somewhat obscure feature in the standard that very few people other than ourselves planned to use. This feature was mandatory in the standard, kind of like Instancing in NMEA 2000, but we expected many people would not implement it, or not test it, and if it didn't work, we would be the ones to suffer. So we made a point of reaching out to every vendor we could find who was working on this stuff, educated them on the feature and why it was important, provided consulting to whatever extent they needed it, offered to test with them to be sure we worked together, and ultimately provided customers with a list of products that we knew we worked with. It turned out very well.
Now let's come back to the latest with N2K. Our electronics are all done and we left yesterday to go cruising. I powered everything up, including the autopilot, and all was well. Then I turned on the VHFs and right around then the autopilot alarmed saying that my heading source had changed. I have two sources, a sat compass and a rate compass, and the pilot is programmed to use the sat compass first, followed by the rate compass if needed. I checked which was currently in use, and it was the sat compass just as it should have been. Hummm, weirdness strikes again. I brushed it off and moved on. Well, over the course of our day it alarmed about 6 times. Each time I cleared the alarm and checked to see which compass it was using and it was correctly on the sat compass.
We made it to our destination, got settled into our slip, and went out to explore. The VHFs were off, but I left the pilot on to see if it would alarm again. It didn't. After dinner I decided to try a little experiment, and I turned the VHFs on again. Guess what? The pilot alarmed right away. WTF, right? So I broke out the analyzer and started looking at the data stream coming from the various devices. What I found was a storm of traffic every 10 seconds. The bus was going through an address claim process every 10 seconds, followed by everyone querying everyone to gather basic info. This is a normal process, but it should only happen when there is a configuration change in the network, and it should only happen once. Things should then remain stable until there is some other change in the network, like powering on the VHFs.
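The pattern is easy to spot once you sort the capture by PGN. Here's a minimal sketch of that kind of triage, assuming you've exported the analyzer capture as (timestamp, PGN) pairs; PGN 60928 is the ISO Address Claim message, and the burst-grouping gap of 2 seconds is an arbitrary choice of mine, not anything from the standard:

```python
# Sketch: spotting periodic address-claim storms in an analyzer capture.
# frames is assumed to be a list of (timestamp_seconds, pgn) tuples.
# PGN 60928 is the ISO Address Claim; a healthy bus shows one burst of
# these after a network change, then silence.

ISO_ADDRESS_CLAIM = 60928

def claim_bursts(frames, gap=2.0):
    """Group address-claim frames into bursts separated by > gap seconds."""
    claims = sorted(t for t, pgn in frames if pgn == ISO_ADDRESS_CLAIM)
    bursts = []
    for t in claims:
        if bursts and t - bursts[-1][-1] <= gap:
            bursts[-1].append(t)      # same burst, keep accumulating
        else:
            bursts.append([t])        # new burst starts here
    return bursts

# Hypothetical capture: claim bursts starting every ~10 seconds,
# which should only ever happen once after a configuration change.
frames = [(0.0, 60928), (0.1, 60928), (0.3, 126996),
          (10.0, 60928), (10.2, 60928),
          (20.1, 60928)]
bursts = claim_bursts(frames)
print(len(bursts), [b[0] for b in bursts])  # 3 [0.0, 10.0, 20.1]
```

Three bursts spaced roughly 10 seconds apart, instead of one, is exactly the signature described above.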
These storms of traffic last for a second or two then settle down, and the storm gets bigger the more devices there are on the network. My working theory is that adding the VHFs pushed the storm size to the point where the pilot was occasionally missing enough heading report messages (it expects them 10 times a second) that it briefly declared the heading sensor dead, but then healthy again before I could even ack the alarm. I've had issues just like this in the past, and as soon as the trigger for the storm is removed, all the problems go away. I had chased down one cause for such storms and gotten it fixed via a firmware update, but now here it was again.
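The theory can be sketched in a few lines. This assumes the pilot watches the Vessel Heading stream (PGN 127250, nominally 10 Hz) and declares the source lost after a gap; the 0.5 second timeout here is my own assumption for illustration, not any vendor's actual value:

```python
# Sketch of the failure theory: the pilot expects heading reports at
# 10 Hz (PGN 127250 Vessel Heading) and flags the source as lost when
# the gap between reports exceeds a timeout. Timeout value is assumed.

def heading_alarms(report_times, timeout=0.5):
    """Return the times at which the heading source would be declared lost."""
    alarms = []
    last = report_times[0]
    for t in report_times[1:]:
        if t - last > timeout:
            alarms.append(t)   # gap exceeded: source briefly "dead"
        last = t
    return alarms

# Normal 10 Hz stream, with a ~1 second hole where a claim storm
# crowded the heading messages off the bus.
times = [i / 10 for i in range(20)] + [3.0 + i / 10 for i in range(20)]
print(heading_alarms(times))   # [3.0] -- one brief dropout, then recovery
```

One short dropout per storm, healing itself before you can even acknowledge the alarm, matches the observed behavior.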
I've been digging into it today, and I think I've traced it down to an obscure and arguably unexpected request from one device, which triggers a second device to request the address claim process, and that sets off the storm. Take away either of these two elements and the storm disappears. And this may actually be brought on by yet a third device that, when first powered on, correctly responds to the obscure request, but after being on for some number of hours or days seems to stop responding, possibly causing the requester to become more insistent. The part that's clear is the ongoing address claims being requested by one device. What's a little less clear is exactly what's triggering it.
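The mechanics of the second step are worth spelling out. In NMEA 2000 (and J1939 underneath it), an ISO Request (PGN 59904) carrying PGN 60928 asks every node on the bus to re-send its Address Claim, so one device repeatedly sending that single request is enough to set off a bus-wide storm. A toy sketch, with hypothetical node addresses:

```python
# Sketch of the trigger chain: an ISO Request (PGN 59904) whose payload
# names the Address Claim PGN (60928) obliges every node to re-claim.
# One misbehaving requester therefore multiplies into N frames per cycle.

ISO_REQUEST = 59904
ISO_ADDRESS_CLAIM = 60928

def on_frame(pgn, requested_pgn, nodes):
    """Bus-wide reaction to one frame: a request for the address-claim
    PGN makes every node answer with its own claim."""
    if pgn == ISO_REQUEST and requested_pgn == ISO_ADDRESS_CLAIM:
        return [(ISO_ADDRESS_CLAIM, addr) for addr in nodes]
    return []

nodes = [2, 3, 16, 22]      # hypothetical device addresses on the bus
storm = on_frame(ISO_REQUEST, ISO_ADDRESS_CLAIM, nodes)
print(len(storm))           # 4 -- every node re-claims at once
```

And each claim is then typically followed by product-information queries, which is how a single stray request balloons into the second or two of traffic seen on the analyzer.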
The point is that it's all this tangled mumbo jumbo that leads to the simple complaint that "my autopilot keeps losing heading", with customers and installers tearing their hair out trying to figure out why the VHF is messing up the autopilot, and vendors pointing fingers at each other. It's only when you take a look with an analyzer that you see something is really amiss. Most boats would never know they had such a problem brewing, yet would experience occasional "gremlins".
My advice to vendors? Go Get a Room and plug away.