Tuesday, March 24, 2015

More on NMEA 2000

In the comments for this blog entry on NMEA 2000 we have been having an interesting debate about whether the NMEA 2000 glass is half full or half empty.  I remain a proponent of N2K, but continue to run into implementation problems that are real spoilers.  This stuff just doesn't appear to be tested enough against the basic standards requirements, or for practical interoperability across vendors.

I remember back when I was building networking products there used to be industry plug-fests to see how things worked together.  My wife was really concerned about what a "Plug-Fest" might really be about, but she knew we geeks weren't that imaginative.  So for a few days all the industry's competitors would put down their guard and all meet up in a hotel somewhere loaded with equipment.  There was a schedule where the vendors would go from room to room like honey bees pollinating flowers, except they would be trying their equipment to be sure it worked together.  Everyone knew that their product success depended on working properly with other vendors, even though we all preferred that everyone buy everything from us.  Everyone also knew that a customer's site was NOT the place to discover and debug problems.  It worked and UNH even created a networking interoperability lab that vendors still use to this day.

Dial forward to my last company, EqualLogic, where we were among the first people to use a new standard for data storage over networks.  Our lives depended on that standard working, and on our products working with everyone else out in the market.  And to complicate matters, we were dependent on a somewhat obscure feature in the standard that very few people planned to use other than ourselves.  This feature was mandatory in the standard, kind of like Instancing in NMEA 2000, but we expected many people would not implement it, or not test it, and if it didn't work, we would be the one to suffer.  So we made a point of reaching out to every single vendor who we could find who was working on this stuff, educated them on the feature and how it was important, provided consulting to whatever extent they needed it, offered to test with them to be sure we worked together, and ultimately provided customers with a list of products that we knew we worked with.  It turned out very well.

Now let's come back to the latest with N2K.  Our electronics are all done and we left yesterday to go cruising.  I powered everything up, including the autopilot, and all was well.  Then I turned on the VHFs and right around then the autopilot alarmed saying that my heading source had changed.  I have two sources, a sat compass and a rate compass, and the pilot is programmed to use the sat compass first, followed by the rate compass if needed.  I checked which was currently in use, and it was the sat compass just as it should have been.  Hummm, weirdness strikes again.  I brushed it off and moved on.  Well, over the course of our day it alarmed about 6 times.  Each time I cleared the alarm and checked to see which compass it was using and it was correctly on the sat compass.

We made it to our destination, got settled into our slip, and went out to explore.  The VHFs were off, but I left the pilot on to see if it would alarm again.  It didn't.  After dinner I decided to try a little experiment, and I turned the VHFs on again.  Guess what?  The pilot alarmed right away.  WTF, right?  So I broke out the analyzer and started looking at the data stream coming from the various devices.  What I found was a storm of traffic every 10 seconds.  The bus was going through an address claim process every 10 seconds, followed by everyone querying everyone to gather basic info.  This is a normal process, but it should only happen when there is a configuration change in the network, and it should only happen once.  Things should then remain stable until there is some other change in the network, like powering on the VHFs.
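For anyone who wants to look for the same pattern in their own capture, here is a minimal sketch, not the analyzer's actual method. It assumes the capture has been exported as (timestamp in seconds, 29-bit CAN identifier) pairs, which is a hypothetical format; the function names, the 2-second quiet gap, and the sample frames are all made up for illustration. It relies on the J1939-style identifier layout that N2K uses and on PGN 60928 being the ISO Address Claim.

```python
ADDRESS_CLAIM_PGN = 60928  # ISO Address Claim (0xEE00)

def pgn_from_can_id(can_id):
    """Extract the PGN from a 29-bit J1939/NMEA 2000 CAN identifier."""
    dp = (can_id >> 24) & 0x03   # reserved bit plus data page
    pf = (can_id >> 16) & 0xFF   # PDU format
    ps = (can_id >> 8) & 0xFF    # PDU specific
    if pf < 240:
        # Addressed message: PS is the destination, not part of the PGN.
        return (dp << 16) | (pf << 8)
    return (dp << 16) | (pf << 8) | ps

def claim_burst_starts(frames, quiet_gap=2.0):
    """Return the start time of each burst of address claims.

    A new burst begins when an address claim arrives more than
    quiet_gap seconds after the previous one (assumed threshold).
    """
    starts = []
    last = None
    for ts, can_id in frames:
        if pgn_from_can_id(can_id) != ADDRESS_CLAIM_PGN:
            continue
        if last is None or ts - last > quiet_gap:
            starts.append(ts)
        last = ts
    return starts

# Made-up capture: two devices claiming, with one heading frame mixed in.
capture = [
    (0.00, 0x18EEFF23), (0.05, 0x18EEFF10), (0.10, 0x09F11210),
    (10.00, 0x18EEFF23), (10.05, 0x18EEFF10),
    (20.00, 0x18EEFF23),
]
starts = claim_burst_starts(capture)
print(starts)
```

On a healthy network the list should hold one entry per real configuration change. Evenly spaced entries, like the 10-second spacing in the made-up data, are the signature of the storm described above.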

These storms of traffic last for a second or two then settle down, and the storm gets bigger the more devices there are on the network.  My working theory is that adding the VHFs pushed the storm size to the point where the pilot was occasionally missing enough heading report messages (it expects them 10 times a second) that it briefly declared the heading sensor dead, but then healthy again before I could even ack the alarm.  I've had issues just like this in the past, and as soon as the trigger for the storm is removed, all the problems go away.  I had chased down one cause for such storms and gotten it fixed via a firmware update, but now here it was again.
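One way to test that theory against a capture is to look for holes in the heading stream. The sketch below is only an illustration. It assumes you have already pulled out the arrival times (in seconds) of the heading messages from the sat compass, and the 0.5-second timeout is a guess on my part; I don't know what threshold the pilot really uses. The function name and sample times are made up.

```python
def heading_gaps(timestamps, timeout=0.5):
    """Return (previous, next) arrival times separated by more than timeout seconds."""
    return [(a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > timeout]

# Made-up arrival times: 10 Hz heading with one hole during a storm.
heading_times = [0.0, 0.1, 0.2, 0.3, 1.1, 1.2]
print(heading_gaps(heading_times))
```

If the gaps line up with the address claim bursts, that supports the idea that the storm is crowding out heading reports. If they don't, the alarm has some other cause.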

I've been digging into it today, and think I've traced it down to a request from one device that is obscure and arguably unexpected, that triggers a second device to request the address claim process and that sets off the storm.  Take away either of these two elements and the storm disappears.  And this may actually be brought on by yet a 3rd device that when first powered on correctly responds to the obscure request, but after being on for some number of hours or days seems to stop responding, possibly causing the requester to become more insistent.  The part that's clear is the ongoing address claims being requested by one device.  What's a little less clear is exactly what's triggering it.

The point is that it's all this tangled mumbo jumbo that leads to the simple complaint that "my auto pilot keeps losing heading", and customers and installers tearing their hair out trying to figure out why the VHF is messing up the autopilot, and vendors pointing the fingers at each other.  It's only when you take a look with an analyzer that you see something is really amiss.  Most boats would never know if they had such a problem brewing, yet would experience occasional "Gremlins".

My advice to vendors?  Go Get a Room and plug away.

5 comments:

  1. Hi Peter,

    Sorry to hear that N2K-problems still plague you; I guess the Wall of Shame can't be retired yet.

    OT, but your comments feed seems to have a problem. This URL:

    http://www.mvtanglewood.com/feeds/comments/default

    works, but instead this URL:

    http://www.mvtanglewood.com/feeds/4162505733272782290/comments/default

    is indicated as the comments' RSS feed in the page's source, which returns no comments at all?

  2. Sorry for the OT comment above; after some percolating, I stumbled on the simple explanation that the published URL confines the comment feed to comments for that particular post. I personally prefer site-wide comment feeds, which WordPress-sites tend to deliver; hence my confusion.

    Given Maretron's dependence/insistence on proper instancing, do they see a market for an N2K filter/isolator, which would pass the useful data but would shield the problematic device from instancing messages and take on that responsibility?

  3. Carl, thanks for the comments, including about the feeds. I'll check into it some more. What's there now is what Blogger puts in automatically and I frankly haven't paid much attention to it, but need to.

    As for Maretron, I'm not sure what their plans are. I've bent their ear about depending so much on instancing when it's poorly supported in the industry. I've told them that I have increasingly been running into trouble as a result. Their response has basically been "why should we have to change our product to accommodate everyone else's non-compliant products?" They have a point, but if their goal is to be the universal data display product for N2K, which I think is what N2KView is supposed to be, then I think they have taken on exactly that problem and either need to be the evangelist and catalyst for getting other products to comply, or broaden how they do device selection.

    As I've said in other comments, my original understanding of device selection was heavily (like totally) influenced by Maretron's view of the world. Only over the past 3-6 months have I come to realize they are the ones with a unique approach and that others care very little about instancing. I'm coming to like everyone else's approach better than Maretron's, not just because it works more universally in practice, but also because it's much easier to find and identify devices.

    All this is a really good example of what happens given the dearth of in-depth material describing how N2K is supposed to work. None of us know, so we can easily be led around by the nose by vendors. If this information were available, we would all know much more easily if a product is working correctly or not, and be able to hold vendors accountable for their bugs. I think we would converge much more quickly on quality, interoperable products.

    On all of the systems I've worked on that involved new protocols or even whole new networks, widespread interoperability across the industry was accomplished in 1-2 years max. N2K has been out since 2001. That's nearly 15 years. Come on guys, let's get it together here!

  4. Ok... let's see... like I pointed out a few posts ago, N2K has some issues when you try to build a complex system.
    Two things... one: critical applications (course, autopilot, GPS) need to be on their own loop, secondary ones (radar, AIS) on another, and finally monitors on another. Complex? Yes, but easy to control, with no conflicts and no problems with different sources or vendors.
    Just one reason to do that: if your N2K backbone fails, the whole system (black box type) is going to fail big time.
    Redundancy is the key in critical 24/7 applications, especially if you're going to be navigating in waters where you don't have a Starbucks close by (or a West Marine, or any store at all).

  5. I agree there are lots of ways to work around problems, and creating multiple isolated networks is certainly one way. When you have devices interfering with each other, separating them is certainly a solution.

    One thing I would comment on is that I have never seen data overload be a problem. My network has never run over about 25% load, and is typically in the mid teens. AIS can pump out a lot of data, but the max can be accommodated with a 38kb link as used in 0183. N2K is 250kb, so there is tons of available bandwidth even with AIS running at full capacity, which of course it never does. And my AIS isn't even on N2K anymore.

    You also mention radar as a big consumer, but I believe it's actually pretty modest. It listens to heading and position info just as many devices do, and the only data it puts out is ARPA target data via TTM, and those are only once every 2 seconds, so very modest. The actual radar image is local to the radar in a standalone device like the ones I have, and goes over a separate Ethernet for the MFD-based radar systems.

    There also is a practical issue creating lots of separate networks, namely data that's needed across more than one. For example, heading is needed by the chart plotter, radars, and autopilot. It's also needed by the monitoring system if you want ground wind speed and direction. Separate networks as you suggest would necessitate multiple sat compasses, one on each network. And if one fails, a spare would be needed, or there would have to be a way to join networks to regain heading.

    And I guess the biggest issue I have with all this is that it's accepting the crap that vendors are building. They are supposed to be building products that solve problems for us, not creating a ton of extra work while we sort through all their bugs for them.
