
Medical devices – the wild west for cybersecurity vulnerabilities and savvy hackers

Medical devices are incredibly vulnerable to hacking attacks. In some cases it’s because of software defects that allow for exploits, like buffer overflows, SQL injection or insecure direct object references. In other cases, you can blame misconfigurations, lack of encryption (or weak encryption), non-secure data/control networks, unfettered wireless access, and worse.

Why would hackers go after medical devices? Lots of reasons. To name but one: It’s a potential terrorist threat against real human beings. Remember that Dick Cheney famously had the wireless capabilities of his implanted heart defibrillator disabled for fear of an assassination attack.

Certainly healthcare organizations are being targeted for everything from theft of medical records to ransomware. To quote the report “Hacking Healthcare IT in 2016,” from the Institute for Critical Infrastructure Technology (ICIT):

The Healthcare sector manages very sensitive and diverse data, which ranges from personal identifiable information (PII) to financial information. Data is increasingly stored digitally as electronic Protected Health Information (ePHI). Systems belonging to the Healthcare sector and the Federal Government have recently been targeted because they contain vast amounts of PII and financial data. Both sectors collect, store, and protect data concerning United States citizens and government employees. The government systems are considered more difficult to attack because the United States Government has been investing in cybersecurity for a (slightly) longer period. Healthcare systems attract more attackers because they contain a wider variety of information. An electronic health record (EHR) contains a patient’s personal identifiable information, their private health information, and their financial information.

EHR adoption has increased over the past few years under the Health Information Technology and Economics Clinical Health (HITECH) Act. Stan Wisseman [from Hewlett-Packard] comments, “EHRs enable greater access to patient records and facilitate sharing of information among providers, payers and patients themselves. However, with extensive access, more centralized data storage, and confidential information sent over networks, there is an increased risk of privacy breach through data leakage, theft, loss, or cyber-attack. A cautious approach to IT integration is warranted to ensure that patients’ sensitive information is protected.”

Let’s talk devices. Those could be everything from emergency-room monitors to pacemakers to insulin pumps to X-ray machines whose radiation settings might be changed or overridden by malware. The ICIT report says,

Mobile devices introduce new threat vectors to the organization. Employees and patients expand the attack surface by connecting smartphones, tablets, and computers to the network. Healthcare organizations can address the pervasiveness of mobile devices through an Acceptable Use policy and a Bring-Your-Own-Device policy. Acceptable Use policies govern what data can be accessed on what devices. BYOD policies benefit healthcare organizations by decreasing the cost of infrastructure and by increasing employee productivity. Mobile devices can be corrupted, lost, or stolen. The BYOD policy should address how the information security team will mitigate the risk of compromised devices. One solution is to install software to remotely wipe devices upon command or if they do not reconnect to the network after a fixed period. Another solution is to have mobile devices connect from a secured virtual private network to a virtual environment. The virtual machine should have data loss prevention software that restricts whether data can be accessed or transferred out of the environment.

The Internet of Things – and the increased prevalence of medical devices connected to hospital or home networks – increases the risk. What can you do about it? The ICIT report says,

The best mitigation strategy to ensure trust in a network connected to the internet of things, and to mitigate future cyber events in general, begins with knowing what devices are connected to the network, why those devices are connected to the network, and how those devices are individually configured. Otherwise, attackers can conduct old and innovative attacks without the organization’s knowledge by compromising that one insecure system.
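That first step, knowing what is on the network and why, can be sketched in a few lines of Python. This is only an illustration: the `Device` record, the `audit_network` function, and the sample MAC addresses are all hypothetical, and a real hospital would feed it from its own asset database and network scans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    mac: str      # hardware address seen on the network
    purpose: str  # why the device is connected
    config: str   # label for its approved configuration


def audit_network(approved, observed_macs):
    """Split observed MAC addresses into known devices and unknown addresses."""
    by_mac = {d.mac.lower(): d for d in approved}
    seen = {m.lower() for m in observed_macs}
    known = [by_mac[m] for m in sorted(seen) if m in by_mac]
    unknown = sorted(seen - set(by_mac))
    return known, unknown


# Hypothetical inventory and scan results, for illustration only.
approved = [
    Device("00:11:22:33:44:55", "infusion pump, ward 3", "baseline-v2"),
    Device("66:77:88:99:AA:BB", "bedside monitor, ER", "baseline-v1"),
]
observed = ["00:11:22:33:44:55", "DE:AD:BE:EF:00:01"]

known, unknown = audit_network(approved, observed)
print("known:", [d.purpose for d in known])
print("unknown:", unknown)
```

Anything that lands in the unknown list is exactly the kind of device the report warns about: connected, but invisible to IT.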

Given how common these devices are, keeping IT in the loop may seem impossible — but we must rise to the challenge, ICIT says:

If a cyber network is a castle, then every insecure device with a connection to the internet is a secret passage that the adversary can exploit to infiltrate the network. Security systems are reactive. They have to know about something before they can recognize it. Modern systems already have difficulty preventing intrusion by slight variations of known malware. Most commercial security solutions such as firewalls, IDS/ IPS, and behavioral analytic systems function by monitoring where the attacker could attack the network and protecting those weakened points. The tools cannot protect systems that IT and the information security team are not aware exist.

The home environment – or any use outside the hospital setting – is another huge concern, says the report:

Remote monitoring devices could enable attackers to track the activity and health information of individuals over time. This possibility could impose a chilling effect on some patients. While the effect may lessen over time as remote monitoring technologies become normal, it could alter patient behavior enough to cause alarm and panic.

Pain medicine pumps and other devices that distribute controlled substances are likely high value targets to some attackers. If compromise of a system is as simple as downloading free malware to a USB and plugging the USB into the pump, then average drug addicts can exploit homecare and other vulnerable patients by fooling the monitors. One of the simpler mitigation strategies would be to combine remote monitoring technologies with sensors that aggregate activity data to match a profile of expected user activity.

A major responsibility falls onto the device makers – and the programmers who create the embedded software. For the most part, they are simply not up to the challenge of designing secure devices, and may not have the policies, practices and tools in place to get cybersecurity right. Regrettably, the ICIT report doesn’t go into much detail about the embedded software, but does state,

Unlike cell phones and other trendy technologies, embedded devices require years of research and development; sadly, cybersecurity is a new concept to many healthcare manufacturers and it may be years before the next generation of embedded devices incorporates security into its architecture. In other sectors, if a vulnerability is discovered, then developers rush to create and issue a patch. In the healthcare and embedded device environment, this approach is infeasible. Developers must anticipate what the cyber landscape will look like years in advance if they hope to preempt attacks on their devices. This model is unattainable.

In November 2015, Bloomberg Businessweek published a chilling story, “It’s Way Too Easy to Hack the Hospital.” The authors, Monte Reel and Jordan Robertson, wrote about one hacker, Billy Rios:

Shortly after flying home from the Mayo gig, Rios ordered his first device—a Hospira Symbiq infusion pump. He wasn’t targeting that particular manufacturer or model to investigate; he simply happened to find one posted on EBay for about $100. It was an odd feeling, putting it in his online shopping cart. Was buying one of these without some sort of license even legal? he wondered. Is it OK to crack this open?

Infusion pumps can be found in almost every hospital room, usually affixed to a metal stand next to the patient’s bed, automatically delivering intravenous drips, injectable drugs, or other fluids into a patient’s bloodstream. Hospira, a company that was bought by Pfizer this year, is a leading manufacturer of the devices, with several different models on the market. On the company’s website, an article explains that “smart pumps” are designed to improve patient safety by automating intravenous drug delivery, which it says accounts for 56 percent of all medication errors.

Rios connected his pump to a computer network, just as a hospital would, and discovered it was possible to remotely take over the machine and “press” the buttons on the device’s touchscreen, as if someone were standing right in front of it. He found that he could set the machine to dump an entire vial of medication into a patient. A doctor or nurse standing in front of the machine might be able to spot such a manipulation and stop the infusion before the entire vial empties, but a hospital staff member keeping an eye on the pump from a centralized monitoring station wouldn’t notice a thing, he says.

The 97-page ICIT report makes some recommendations, which I heartily agree with.

  • With each item connected to the internet of things there is a universe of vulnerabilities. Empirical evidence of aggressive penetration testing before and after a medical device is released to the public must be a manufacturer requirement.
  • Ongoing training must be paramount in any responsible healthcare organization. Adversarial initiatives typically start with targeting staff via spear phishing and watering hole attacks. The act of an ill-prepared executive clicking on a malicious link can trigger a hurricane of immediate and long term negative impact on the organization and innocent individuals whose records were exfiltrated or manipulated by bad actors.
  • A cybersecurity-centric culture must demand safer devices from manufacturers, privacy adherence by the healthcare sector as a whole and legislation that expedites the path to a more secure and technologically scalable future by policy makers.

This whole thing is scary. The healthcare industry needs to step up its game on cybersecurity.


Driving risks out of embedded automotive software

When it comes to cars, safety means more than strong brakes, good tires, a safety cage, and lots of airbags. It also means software that won’t betray you; software that doesn’t pose a risk to life and property; software that’s working for you, not for a hacker.

Please join me for this upcoming webinar, where I am presenting along with Arthur Hicken, the Code Curmudgeon and technology evangelist for Parasoft. It’s on Thursday, August 18. Arthur and I have been plotting and scheming, and there will be some excellent information presented. Don’t miss it! Click here to register.

Driving Risks out of Embedded Automotive Software

Automobiles are becoming the ultimate mobile computer. Popular models have as many as 100 Electronic Control Units (ECUs), while high-end models push 200 ECUs. Those processors run hundreds of millions of lines of code written by the OEMs’ teams and external contractors—often for black-box assemblies. Modern cars also have increasingly sophisticated high-bandwidth internal networks and unprecedented external connectivity. Considering that no code is 100% error-free, these factors point to an unprecedented need to manage the risks of failure—including protecting life and property, avoiding costly recalls, and reducing the risk of ruinous lawsuits.

This one-hour practical webinar will review the business risks of defective embedded software in today’s connected cars. Led by Arthur Hicken, Parasoft’s automotive technology expert and evangelist, and Alan Zeichick, an independent technology analyst and founding editor of Software Development Times, the webinar will also cover five practical techniques for driving the risks out of embedded automotive software, including:

• Policy enforcement
• Reducing defects during coding
• Effective techniques for acceptance testing
• Using metrics analytics to measure risk
• Converting SDLC analytics into specific tasks to focus on the riskiest software

You can apply the proven techniques you’ll learn to code written and tested by your teams, as well as code supplied by your vendors and contractors.


Internet over Carrier Pigeon? There’s a standard for that

There are standards for everything, it seems. And those of us who work on Internet things are often amused (or bemused) by what comes out of the Internet Engineering Task Force (IETF). An oldie but a goodie is a document from 1999, RFC-2549, “IP over Avian Carriers with Quality of Service.”

An RFC, or Request for Comments, is what the IETF calls a standards document. (And yes, I’m browsing my favorite IETF pages during a break from doing “real” work. It’s that kind of day.)

RFC-2549 updates RFC-1149, “A Standard for the Transmission of IP Datagrams on Avian Carriers.” That older standard did not address Quality of Service. I’ll leave it for you to enjoy both those documents, but let me share this part of RFC-2549:

Overview and Rational

The following quality of service levels are available: Concorde, First, Business, and Coach. Concorde class offers expedited data delivery. One major benefit to using Avian Carriers is that this is the only networking technology that earns frequent flyer miles, plus the Concorde and First classes of service earn 50% bonus miles per packet. Ostriches are an alternate carrier that have much greater bulk transfer capability but provide slower delivery, and require the use of bridges between domains.

The service level is indicated on a per-carrier basis by bar-code markings on the wing. One implementation strategy is for a bar-code reader to scan each carrier as it enters the router and then enqueue it in the proper queue, gated to prevent exit until the proper time. The carriers may sleep while enqueued.

Most years, the IETF publishes so-called April Fool’s RFCs. The best list of them I’ve seen is on Wikipedia. If you’re looking to take a work break, give ’em a read. Many of them are quite clever! However, I still like RFC-2549 the best.

A prized part of my library is “The Complete April Fools’ Day RFCs,” compiled by Thomas Limoncelli and Peter Salus. Sadly, this collection stops at 2007. Still, it’s a great coffee table book to leave lying around for when people like Bob Metcalfe, Tim Berners-Lee or Al Gore come by to visit.


Beyond the fatal Tesla crash: Security and connected autonomous cars

Was it a software failure? The recent fatal crash of a Tesla in Autopilot mode is worrisome, but it’s too soon to blame Tesla’s software. According to Tesla on June 30, here’s what happened:

What we know is that the vehicle was on a divided highway with Autopilot engaged when a tractor trailer drove across the highway perpendicular to the Model S. Neither Autopilot nor the driver noticed the white side of the tractor trailer against a brightly lit sky, so the brake was not applied. The high ride height of the trailer combined with its positioning across the road and the extremely rare circumstances of the impact caused the Model S to pass under the trailer, with the bottom of the trailer impacting the windshield of the Model S. Had the Model S impacted the front or rear of the trailer, even at high speed, its advanced crash safety system would likely have prevented serious injury as it has in numerous other similar incidents.

We shall have to await the results of the NHTSA investigation to learn more. Even if it does prove to be a software failure, at least the software can be improved to try to avoid similar incidents in the future.

By coincidence, a story that I wrote about the security issues related to advanced vehicles, “Connected and Autonomous Cars Are Wonderful and a Safety-Critical Security Nightmare,” was published today, July 1, on CIO Story. The piece was written several weeks ago, and said,

The good news is that government and industry standards are attempting to address the security issues with connected cars. The bad news is that those standards don’t address security directly; rather, they merely prescribe good software-development practices that should result in secure code. That’s not enough, because those processes don’t address security-related flaws in the design of vehicle systems. Worse, those standards are a hodge-podge of different regulations in different countries, and they don’t address the complexity of autonomous, self-driving vehicles.

Today, commercially available autonomous vehicles can parallel park by themselves. Tomorrow, they may be able to drive completely hands-free on highways, or drive themselves to parking lots without any human on board. The security issues, the hackability issues, are incredibly frightening. Meanwhile, companies as diverse as BMW, General Motors, Google, Mercedes, Tesla and Uber are investing billions of dollars into autonomous, self-driving car technologies.

Please read the whole story here.


Crash! Down goes Google Calendar — cloud services are not perfect

Cloud services crash. Of course, non-cloud services crash too: a server in your data center can go down. At least there you can do something, or if it’s a critical system you can plan with redundancies and failover.

Not so much with cloud services, as this morning’s failure of Google Calendar clearly shows. The photo shows Google’s status dashboard as of 6:53am on Thursday, June 30.

I wrote about crashes at Amazon Web Services and Apple’s MobileMe back in 2008 in “When the cloud was good, it was very good. But when it was bad it was rotten.”

More recently, in 2011, I covered another AWS failure in “Skynet didn’t take down Amazon Web Services.”

Overall, cloud services are quite reliable. But they are not perfect, and it’s a mistake to think that just because they are offered by huge corporations, they will be error-free and offer 100% uptime. Be sure to work that into your plans, especially if you and your employees rely upon public cloud services to get your job done, or if your customers interact with you through cloud services.


When do we want automated emails? Now!

I can hear the protesters. “What do we want? Faster automated emails! When do we want them? In under 20 nanoseconds!”

Some things have to be snappy. A Web page must load fast, or your customers will click away. Moving the mouse has to move the cursor without pauses or hesitations. Streaming video should buffer rarely and unobtrusively; it’s almost always better to temporarily degrade the video quality than to pause the playback. And of course, for a touch interface to work well, it must be snappy, which Apple has learned with iOS, and which Google learned with Project Butter.

The same is true with automated emails. They should be generated and transmitted immediately, that is, in under a minute.

I recently went to book a night’s stay at a Days Inn, a part of the Wyndham Hotel Group, and so I had to log into my Wyndham account. Bad news: I couldn’t remember the password. So, I used the password retrieval system, giving my account number and info. The website said to check my e-mail for the reset link. Kudos: That’s a lot better than saying “We’ll mail you your password,” and then sending it in plain text!!

So, I flipped over to my e-mail client. Checked for new mail. Nothing. Checked again. Nothing. Checked again. Nothing. Checked the spam folder. Nothing. Checked for new mail. Nothing. Checked again. Nothing.

I submitted the request for the password reset at 9:15 a.m. The link appeared in my inbox at 10:08 a.m. By that time, I had already booked the stay with Best Western. Sorry, Days Inn! You snooze, you lose.

What happened? The e-mail header didn’t show a transit delay, so we can’t blame the Internet. Rather, it took nearly an hour for the email to be uploaded from the originating server. This is terrible customer service, plain and simple.

It’s not merely Wyndham. When I purchase something from Amazon, the confirmation e-mail generally arrives in less than 30 seconds. When I purchase from Barnes & Noble, a confirmation e-mail can take an hour. The worst is Apple: Confirmations of purchases from the iTunes Store can take three days to appear. Three days!

It’s time to examine your policies for generating automated e-mails. You do have policies, right? I would suggest a delay of no more than one minute between the user action that triggers an e-mail and the delivery of that message to the SMTP server.

Set the policy. Automated emails should go out in seconds — certainly in under one minute. Design for that and test for that. More importantly, audit the policy on a regular basis, and monitor actual performance. If password resets or order confirmations are taking 53 minutes to hit the Internet, you have a problem.
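Auditing that policy doesn’t need to be elaborate. Here is a minimal Python sketch, assuming you can log two timestamps per message: when the user triggered it and when it was handed to the SMTP server. The `find_slow_emails` function, the message IDs and the log entries are all made up for illustration; the second entry simply mirrors my 9:15-to-10:08 password reset.

```python
from datetime import datetime, timedelta

POLICY_LIMIT = timedelta(minutes=1)  # the one-minute target suggested above


def find_slow_emails(events, limit=POLICY_LIMIT):
    """Return (message_id, delay) for each e-mail that exceeded the limit.

    `events` is an iterable of (message_id, requested_at, handed_to_smtp_at).
    """
    slow = []
    for message_id, requested_at, sent_at in events:
        delay = sent_at - requested_at
        if delay > limit:
            slow.append((message_id, delay))
    return slow


# Hypothetical log entries; the date is arbitrary.
events = [
    ("order-1001", datetime(2016, 6, 1, 9, 0, 0), datetime(2016, 6, 1, 9, 0, 20)),
    ("reset-2002", datetime(2016, 6, 1, 9, 15, 0), datetime(2016, 6, 1, 10, 8, 0)),
]

for message_id, delay in find_slow_emails(events):
    minutes = delay.total_seconds() / 60
    print(f"{message_id}: took {minutes:.0f} minutes to reach SMTP")
```

Run something like this against real logs on a schedule, and a 53-minute password reset shows up in a report instead of in a customer’s complaint.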


Retrospective: 2010’s ESDC, the Enterprise Software Development Conference

Today’s serendipitous discovery: A blog post about the Enterprise Software Development Conference (ESDC), produced by BZ Media in March 2010. I was the conference chair of that event; our goal was to try to replicate the wonderful SD West conference, which CMP had discontinued the year before. (I am the “Z” of BZ Media.)

Unfortunately, ESDC was not viable from a business perspective, so we only ran it one time. Even so, we had a great conference, and the attendees, presenters and exhibitors were delighted with the event’s quality and technical content.

One of our top exhibitors was OutSystems. Mike Jones, one of their executives, wrote about the conference in a thoughtful blog post, “ESDC Retrospective.” Mike started with:

Last week, the OutSystems team attended the Enterprise Software Development Conference (ESDC) in San Mateo California. This is the first year for this show and, as Alan Zeichick notes, it takes up where the old SD West conference left off. As gold sponsors of the show, we got to both attend the sessions and talk to the conference attendees at the OutSystems booth. I just wanted to share a few highlights & take-aways from the show.

One of his cited highlights was:

Another highlight: Kent Beck’s keynote on “Responsive Design: Efficiency Through Safety.” This was the first time I had heard Kent speak. He started off by referencing Ed Yourdon’s work on Systems Design and how it led him to try and distill his own working process for design. This was the premise for his presentation. My take-away was that no matter what you do, your design will change. I think we all accept this as fact – especially for application software. Kent then explained his techniques to reduce the risk when making design changes. For each of his examples I found myself thinking “This is not really a problem with the Agile Platform because the TrueChange™ engine will keep you from breaking stuff you did not intend to break, allowing you to move very fast with little risk.” If you are hand-coding, then Kent’s four techniques (as described here by Alan Zeichick) to reduce risk when making change is great advice, but why do that if you don’t have to? BTW, I think Kent would love the Agile Platform.

Thanks, Mike, for the thoughtful writeup. Hard to believe ESDC was more than six years ago. (Read the whole post here.)


Quantify the risk of automotive software failures: The SRR Warranty and Recall Report

Summary of Recall Trends. Source: SRR.


The costs of an automobile recall can be immense for an OEM automobile or light truck manufacturer – and potentially ruinous for a member of the industry’s supply chain. Think about the ongoing Takata airbag scandal, which Bloomberg says could cost US$24 billion. General Motors’ ignition locks recall may have reached $4.1 billion. In 2001, the exploding Firestone tires on the Ford Explorer cost $3 billion to recall. The list goes on and on. That’s all about hardware problems. What about bits and bytes?

Until now, it’s been difficult to quantify the impact of software defects on the automotive industry. Thanks to a new analysis from SRR called “Industry Insights for the Road Ahead: Automotive Warranty and Recall Report 2016,” we have a good handle on this elusive area.

According to the report, there were 63 software-related vehicle recalls from late 2012 to June 2015. That’s based on data from the United States’ National Highway Traffic Safety Administration (NHTSA). The SRR report derived that count of 63 software-related recalls using this methodology (p. 22),

To classify a recall as a software component recall, SRR searched the “Defect Summary” and “Corrective Action” fields of NHTSA’s Recall flat file for the term “software.” SRR’s inquiry captured descriptions of software-related defects identified specifically as such, as well as defects that were to be fixed by updating or changing a vehicle’s software.
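That methodology is simple enough to sketch in Python. To be clear, this is my illustration, not SRR’s code: the dictionary keys stand in for whatever the real column names are in NHTSA’s flat file, the sample records are invented, and the case-insensitive match is my own assumption.

```python
# Stand-ins for the two NHTSA fields the report says were searched.
SEARCH_FIELDS = ("Defect Summary", "Corrective Action")


def is_software_recall(record, term="software"):
    """True if the term appears in either searched field (case-insensitive)."""
    return any(
        term in (record.get(field) or "").lower() for field in SEARCH_FIELDS
    )


# Invented records, for illustration only.
recalls = [
    {"id": "A",
     "Defect Summary": "Air bag inflator may rupture.",
     "Corrective Action": "Dealers will replace the inflator."},
    {"id": "B",
     "Defect Summary": "Instrument cluster may go blank.",
     "Corrective Action": "Dealers will update the cluster software."},
    {"id": "C",
     "Defect Summary": "Software error may disable stability control.",
     "Corrective Action": None},
]

software_ids = [r["id"] for r in recalls if is_software_recall(r)]
print(software_ids)
```

Note that record B counts even though its defect summary never mentions software. That matches the report’s description: a defect fixed by a software update is counted as software-related.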

That led to this analysis (p. 22),

Since the end of 2012, there has been a marked increase in recall activity due to software issues. For the primary light vehicle makes and models we studied, 32 unique software-related recalls affected about 3.6 million vehicles from 2005–2012. However, in a much shorter time period from the end of 2012 to June 2015, there were 63 software-related recalls affecting 6.4 million more vehicles.

And continuing (p. 23),

From less than 5 percent of all recalls in 2011, software-related recalls have risen to almost 15 percent in 2015. Overall, the amount of unique campaigns involving software has climbed dramatically, with nine times as many in 2015 than in 2011…

No surprises there given the dramatically increased complexity of today’s connected vehicles, with sophisticated internal networks, dozens of ECUs (electronic control units with microprocessors, memory, software and network connections), and extensive remote connectivity.
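To compare the two periods in the SRR analysis on equal terms, it helps to annualize them. The Python below is back-of-the-envelope arithmetic on the report’s figures; the period lengths are my assumption (2005–2012 treated as 8 years, the end of 2012 to June 2015 as 2.5 years), not numbers from the report.

```python
def per_year(total, years):
    """Average count per year over a period."""
    return total / years


# Assumed period lengths; the recall and vehicle counts come from the report.
EARLY_YEARS = 8.0   # 2005 through 2012
LATE_YEARS = 2.5    # end of 2012 to June 2015

early_rate = per_year(32, EARLY_YEARS)          # software-related recalls per year
late_rate = per_year(63, LATE_YEARS)
early_vehicles = per_year(3.6e6, EARLY_YEARS)   # affected vehicles per year
late_vehicles = per_year(6.4e6, LATE_YEARS)

print(f"recalls/year: {early_rate:.1f} -> {late_rate:.1f} "
      f"({late_rate / early_rate:.1f}x)")
print(f"vehicles/year: {early_vehicles:,.0f} -> {late_vehicles:,.0f} "
      f"({late_vehicles / early_vehicles:.1f}x)")
```

Under those assumptions, software-related recalls went from about 4 a year to about 25 a year, roughly a sixfold jump.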

These software defects are not occurring only in systems where one expects to find sophisticated microprocessors and software, such as engine management controls and Internet-connected entertainment platforms. Microprocessors are being used to analyze everything from the driver’s position and state of alertness, to road hazards, to lane changes, and to offer advanced features such as automatic parallel parking.

Where in the car are the software-related vehicle recalls? Since 2006, says the report, recalls have been prompted by defects in areas as diverse as locks/latches, power train, fuel system, vehicle speed control, air bags, electrical systems, engine and engine cooling, exterior lighting, steering, hybrid propulsion – and even the parking brake system.

That’s not all — because not every software defect results in a public and costly recall. That’s the last resort, from the OEM’s perspective. Whenever possible, the defects are either ignored by the vehicle manufacturer, or quietly addressed by a software update next time the car visits a dealer. (If the car doesn’t visit an official dealer for service, the owner may never know that a software update is available.) Says the report (p. 25),

In addition, SRR noted an increase in software-related Technical Service Bulletins (TSB), which identify issues with specific components, yet stop short of a recall. TSBs are issued when manufacturers provide recommended procedures to dealerships’ service departments for fixing problematic components.

A major role of the NHTSA is to record and analyze vehicle failures, and attempt to determine the cause. Not all failures result in a recall, or even in a TSB. However, they are tracked by the agency via Early Warning Reporting (EWR). Explains the report (p. 26),

In 2015, three new software-related categories reported data for the first time:

• Automatic Braking, listed on 21 EWR reports, resulting in 26 injuries and 1 fatality

• Electronic Stability, listed on 6 EWR reports, resulting in 7 injuries and 1 fatality

• Forward Collision Avoidance, listed in 1 EWR report, resulting in 1 injury and no fatalities

The bottom line here, beyond protecting life and property, is the bottom line for the automobile industry and its supply chain. As the report says in its conclusion (p. 33),

Suppliers that help OEMs get the newest software-aided components to market should be prepared for the increased financial exposure they could face if these parts fail.

About the Report

“Industry Insights for the Road Ahead: Automotive Warranty and Recall Report 2016” was published by SRR (Stout Risius Ross), which offers global financial advisory services. SRR has been in the automotive industry for 25 years, and says, “SRR professionals have more automotive experience in these service areas than any other advisory firm, period.”

This brilliant report — which is free to download in its entirety — was written by Neil Steinkamp, a Managing Director at SRR. He has extensive experience in providing a broad range of business and financial advice to corporate executives, risk managers, in-house counsel and trial lawyers. Mr. Steinkamp has provided consulting services and has been engaged as an expert in numerous matters involving automotive warranty and recall costs. His practice also includes consulting services for automotive OEMs, suppliers and their advisors regarding valuation, transactions and disputes.


A Seven-Point Plan for Automotive Cybersecurity

I am hoovering directly from the blog of my friend Arthur Hicken, the Code Curmudgeon:

Last week Alan Zeichick and I did a webinar for Parasoft on automotive cybersecurity. Now Alan thinks that cybersecurity is an odd term, especially as it applies to automotive and I mostly agree with him. But appsec is also pretty poorly fitted to automotive so maybe we should be calling it AutoSec. Feel free to chime-in using the comments below or on twitter.

I guess the point is that as cars get more complicated and get more “smart” parts and get more connected (The connected car) as part of the “internet of things”, you will start to see more and more automotive security breaches occurring. From taking over the car to stealing data to triggering airbags we’ve already had several high-profile incidents which you can see in my IoT Hall-of-Shame.

To help out we’ve put together a high-level overview of a 7-point plan to get you started. In the near future we’ll be diving into detail on each of these topics, including how standards can help you not only get quality but safety and security, the role of black-box, pen-test, and DAST as well as how to get ahead of the curve and harden your vehicle software using (SAST) and hybrid testing (IAST).

The webinar was recorded for your convenience, so be sure and check it out. If you have automotive software topics that are near and dear to your heart, be sure to let me know in the comments or on Twitter or Facebook.

Okay, the webinar was back in February, but the info didn’t appear on my blog then. Here it is now. My apologies for the oversight. Watch and enjoy the webinar!


The most important plug-in for Customer Experience Management software: Humans

No smart software would make the angry customer less angry. No customer relationship management platform could understand the problem. No sophisticated HubSpot or Salesforce or Marketo algorithm would be able to comprehend that a piece of artwork, brought to a nationwide framing store location in October, wouldn’t be finished before Christmas – as promised. While an online order tracking system would keep the customer informed, it wouldn’t keep the customer satisfied.

Customer Experience Management (CEM). That’s the hot new buzzword for directly engaging the customer. Contrast that with Customer Relationship Management (CRM), which is more about the back-end tracking of customers, leads and orders.

Think about how Amazon.com or FedEx or Netflix keep you constantly informed about what’s happening with your products and services. They have realized that the key to customer success is equally product/service excellence and communications excellence. When I was a kid, you mailed a check and an order form to Sears Roebuck, and a few weeks later a box showed up in the mail. That was great customer service in the 1960s and 1970s. No more. We demand communications. Proactive communications. Effective, empathetic communications.

One of the best ways to make an unhappy customer happy is to empower a human to do whatever it takes to get things right. If possible, that should be the first person the customer talks to, so the problem gets solved as quickly as possible, and without adding “dropped calls” or “too many transfers” to the litany of complaints. A CEM platform should be designed with this in mind.

I’ve written a story about the non-software factors required for effective CEM platforms for Pipeline Magazine. Read the story: “CEM — Now with Humans!”


Bimodal IT — safety and accuracy vs. speed and agility

Las Vegas, December 2015 — Get ready for Bimodal IT. That’s the message from the Gartner Application, Architecture, Development & Integration Summit (AADI). It wasn’t a subtle message. Bimodal was a veritable drumbeat, pounded home over and over again in keynotes, classes, and one-on-one meetings with Gartner analysts. We’re going to be hearing a lot about bimodal development, from Gartner and the industry, because it’s a message that really describes what many of us are encountering today.

To quote Gartner’s official definition:

Bimodal IT is the practice of managing two separate, coherent modes of IT delivery, one focused on stability and the other on agility. Mode 1 is traditional and sequential, emphasizing safety and accuracy. Mode 2 is exploratory and nonlinear, emphasizing agility and speed.

Gartner sees that we create and manage two different types of projects. Some, Mode 1, being very serious, very methodical, bet-the-business projects that must be done right using formal processes, and others, Mode 2, being more opportunistic, quicker, more agile. That’s not to say that Mode 1 projects can’t be agile, and that Mode 2 projects can’t be big and significant. However, we all know that there’s a big difference between launching an initiative to implement a Black Friday sale on our website or designing a new store-locator mobile app, vs. rolling out a GAAP-compliant accounting system or migrating critical systems to the cloud.

You might argue that there’s nothing revolutionary here with bimodal, and if you did, you would be right. Nobody ever claimed that all IT projects, including software development, are the same, and should be managed the same way. What Gartner has done is provide a clear vocabulary for understanding, categorizing, and communicating project differences more efficiently.

Read more about this in my story “Mode 1, Mode 2: Gartner Preaches Bimodal Development at AADI,” published on the Parasoft blog.


Tomorrow’s forecast: Distributed Denial of Service

Malicious agents can crash a website by implementing a DDoS—a Distributed Denial of Service Attack—against a server. So can sloppy programmers.

Take, for example, the National Weather Service’s website, operated by the United States National Oceanic and Atmospheric Administration, or NOAA. On August 29, the service went down, hard, as a single rogue Android app overwhelmed the NOAA’s servers.

As far as anyone knows, there was nothing deliberately malicious about the Android app, and of course there is nothing specific to Android in this situation. However, the app in question was making service requests of the NOAA server’s public APIs every few milliseconds. With hundreds, thousands or tens of thousands of instances of that app running simultaneously, the NOAA system collapsed.

There is plenty of blame to go around. Let’s start with the app developer.

Certainly the app developer was sloppy, sloppy, sloppy. I can imagine that the app worked great in testing, when only one or two instances of the app were running at any one time on a simulator or on actual devices. Scale it up—boom! This is a case where manual code reviews may have found the problem. Maybe not.

Alternatively, the app developer could have checked to see if the public APIs it required (such as NOAA’s weather API) could handle the anticipated load. However, if the coders didn’t write the software correctly, load testing may not have sufficed. For example, say that the design of the app was to pull data every 10 seconds. If the programmers accidentally set up the data retrieval to pull the data every 10 milliseconds, the load would be 1,000x greater than anticipated. Every 10 seconds, no problem. Every 10 milliseconds, big problem. Boom!

This is a nasty bug, to be sure. Compilers, libraries, test systems, all would verify that the software ran correctly, because it did run correctly. In the scenario I’ve painted, it simply wasn’t coded to meet the design. The bug might have been spotted if someone noticed a very high number of external API calls, or again, perhaps during a manual code review. Otherwise, it’s not hard to see how it would slip through the cracks.
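
To put numbers on that slip, here is a minimal Python sketch. It is not the app’s actual code, which I haven’t seen; the names `requests_per_hour`, `checked_interval` and `MIN_POLL_INTERVAL`, the 10,000-instance figure and the five-second floor are all assumptions chosen for illustration. Carrying the unit in a `timedelta` rather than a bare number, and refusing intervals below a floor, is one way to make a seconds-versus-milliseconds mix-up fail loudly instead of silently.

```python
from datetime import timedelta

# Assumed floor for illustration: never poll a public API faster than this.
MIN_POLL_INTERVAL = timedelta(seconds=5)


def requests_per_hour(poll_interval: timedelta, instances: int) -> float:
    """Total requests per hour that `instances` copies of the app would send."""
    return instances * (timedelta(hours=1) / poll_interval)


def checked_interval(poll_interval: timedelta) -> timedelta:
    """Return the interval unchanged, or raise if it is below the floor."""
    if poll_interval < MIN_POLL_INTERVAL:
        raise ValueError(
            f"poll interval {poll_interval} is below {MIN_POLL_INTERVAL}"
        )
    return poll_interval


designed = timedelta(seconds=10)       # what the design called for
shipped = timedelta(milliseconds=10)   # the hypothetical slip

# Hypothetical install base of 10,000 running instances.
designed_load = requests_per_hour(designed, instances=10_000)
shipped_load = requests_per_hour(shipped, instances=10_000)

print(f"designed: {designed_load:,.0f} requests/hour")
print(f"shipped:  {shipped_load:,.0f} requests/hour")
print(f"ratio:    {shipped_load / designed_load:,.0f}x")
```

Passing `shipped` through `checked_interval` raises a `ValueError`, which is the point: the mistake surfaces on the developer’s desk, not on someone else’s server.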

Let’s talk about NOAA now. In 2004, the weather service beefed up its Internet loads in anticipation of Hurricane Charley, contracting with Akamai to host some of its busiest Web pages, using distributed edge caching to reduce the load. This worked well, and Akamai continued to work with NOAA. It’s unclear if Akamai also fronted public API calls; my guess is that those were passed straight through to the National Weather Service servers.

NOAA’s biggest problem is that it has little control over external applications that use its public APIs. Even so, Akamai was still in the circuit and, fortunately, was able to help with the response to the Aug. 29 accidental DDoS situation. At that time, the National Weather Service put out a bulletin on its NIDS messaging service that said:

TO – ALL CUSTOMERS SUBJECT – POINT FORECAST ISSUES. WE ARE PROVIDING NOTICE TO ALL THAT NIDS HAS IDENTIFIED AN ABUSING ANDROID APP THAT IS IMPACTING FORECAST.WEATHER.GOV. WE HAVE FORCED ALL SITES TO ZONES WHILE WE WORK WITH THE DEVELOPER. AKAMAI IS BEING ENGAGED TO BLOCK THE APPLICATION. WE CONTINUE TO WORK ON THIS ISSUE AND APPRECIATE YOUR PATIENCE AS WE WORK TO RESOLVE THIS ISSUE.

Kudos to NOAA for responding quickly and transparently to this issue. Still, this appalling situation—that a single DDoS attack could cripple such a vital service—is unacceptable. Imagine if this had been a malicious attack, rather than an accidental coding error, and if the attacker was able to modify the attack in real time to go around Akamai’s attempts to block the traffic.

What could NOAA have done differently? For best results, DDoS attacks must be blocked within the network before they reach (and overwhelm) the server. Therefore, DDoS detection and blocking systems should already have been in place.

For example, such a system could detect potential attacks from abnormally high volumes of requests from a specific app, raise alarms, and drop those requests (which is fast and takes few resources) instead of servicing them (which is slow and takes more resources). Perfect? No. DDoS scenarios are nasty and messy. No matter how you slice it, though, a single misbehaving app should never be able to crash your server.
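
Here is a minimal Python sketch of that drop-instead-of-service idea, using a sliding-window counter per client. It is an illustration only, not what NOAA or Akamai actually run. The class name `PerClientRateLimiter`, the client identifiers and the limits are all assumptions, and a real deployment would do this at the network edge rather than in application code.

```python
import time
from collections import defaultdict, deque


class PerClientRateLimiter:
    """Allow at most `max_requests` per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self._max_requests = max_requests
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._recent = defaultdict(deque)  # client id -> recent request times

    def allow(self, client_id):
        """Return True to service the request, False to drop it."""
        now = self._clock()
        recent = self._recent[client_id]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self._window:
            recent.popleft()
        if len(recent) >= self._max_requests:
            return False  # cheap rejection; no backend work is done
        recent.append(now)
        return True


# Hypothetical limit: 3 requests per second for each app identifier.
limiter = PerClientRateLimiter(max_requests=3, window_seconds=1.0)
for attempt in range(5):
    verdict = "serviced" if limiter.allow("weather-app") else "dropped"
    print(f"request {attempt + 1}: {verdict}")
```

Each client gets its own window, so one abusive app is throttled without penalizing well-behaved ones.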


Forget Big Data and worry about Bad Data

Two consulting projects this year have involved lots and lots of data. One was the migration of a very complex customer database and transaction logging system to a cloud-based CRM platform from a homegrown system. The other involved performing serious analytics on a non-profit’s membership system that had data spanning decades.

Both projects required incredible manual intervention in the data processing. Data came from different original sources and had wildly varying schemas. Some data was relational, and some was flat-file. Some of the data was clearly contradictory. Timestamps were missing on records. Valuable data was stored in comment fields. Documentation didn’t exist. Keys were lost. Fields were abandoned. Live data was mixed with archival data.

Both systems were a mess—but that wasn’t the problem. The issues I’m describing are the everyday result of messy software, evolving databases and the real world. Solving those challenges takes some effort, but we all know the importance of factoring data cleansing into any type of migration or analytics project, both in terms of time and of finances.

The real challenge is that most of the data was totally wrong. No resemblance to reality. That person never lived at that address. The relationships in the SQL database were not correct. Conventions were nonexistent. When data is being collected by many systems—and stored in many systems—over years and decades, this is what happens.

Yes, both projects were successful. However, we had to throw away a lot of data that would have added value to the organizations and their customers or members. Worse, we learned that both organizations had been using bad data for years, resulting in missed opportunities, less-than-ideal customer service, and flawed business planning.

Garbage in, garbage out. After all, if you are thinking about offering a new product or service, and are basing your decisions on bad data, you aren’t making a good decision. You are guessing.

What went wrong? It wasn’t in the migration and analytics projects. We went in, cleaned up the data best we could, and got out. It was a finite task and went as well as could be expected.

The root causes weren’t bad programming either, or poor database administration. In many IT shops, schemas change. Documents are lost. Corruption happens. Ideas are tried and abandoned. That’s simply what happens when data is kept past its sell-by date.

The failure is that nobody regularly (or ever) checked the data to make sure that it was still good. Nobody performed periodic data hygiene. Nobody tested addresses, or eyeballed records to see if they made sense, or validated the databases against other sources (or even against themselves).
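
A periodic hygiene check doesn’t have to be elaborate. Below is a minimal Python sketch of validating a data set against itself. It is not the tooling we used on either project; the function name `audit_records` and the field names (`id`, `updated_at`, `customer_id`) are hypothetical, and real checks would be driven by the actual schema.

```python
from collections import Counter


def audit_records(customers, orders):
    """Return a list of human-readable problems found in the two record sets."""
    problems = []

    # Duplicate or missing keys.
    id_counts = Counter(c.get("id") for c in customers)
    for customer_id, count in id_counts.items():
        if customer_id is None:
            problems.append(f"{count} customer record(s) have no id")
        elif count > 1:
            problems.append(f"customer id {customer_id} appears {count} times")

    # Missing timestamps.
    for c in customers:
        if not c.get("updated_at"):
            problems.append(f"customer {c.get('id')} has no timestamp")

    # Orders that reference a customer who does not exist.
    known_ids = {cid for cid in id_counts if cid is not None}
    for o in orders:
        if o.get("customer_id") not in known_ids:
            problems.append(
                f"order {o.get('id')} points at unknown customer {o.get('customer_id')}"
            )

    return problems


# Made-up sample data for illustration only.
customers = [
    {"id": 1, "updated_at": "2013-01-05"},
    {"id": 2, "updated_at": ""},
    {"id": 2, "updated_at": "2012-07-01"},
]
orders = [
    {"id": "A", "customer_id": 1},
    {"id": "B", "customer_id": 9},
]

for problem in audit_records(customers, orders):
    print(problem)
```

Run on a schedule, a report like this catches drift while it is still cheap to fix. It cannot tell you that a person never lived at an address, though; that takes validation against outside sources.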

Data is a valuable corporate asset. In fact, when it comes to customer data and transaction records, data may be the single biggest asset of your company. Most companies work hard to ensure that their assets are solid. A manufacturer checks its raw materials and finished goods to ensure that they are as expected. Materials in warehouses are inventoried. Random samples are pulled from time to time, tested, and examined carefully.

When it comes to data, long-term quality is rarely a consideration. Data is stored and used. Is it checked? Rarely, if ever. We all know the benefits of Big Data for our business. What about the costs of Bad Data? Unknown, but real. I’ve seen this time and again. As Bad Data is used and reused, it will only get worse.


Coping with complexity at the SDLC Acceleration Summit

South San Francisco, California — Writing software would be oh, so much simpler if we didn’t have all those darned choices. HTML5 or native apps? Windows Server in the data center or Windows Azure in the cloud? Which Linux distro? Java or C#? Continuous Integration? Continuous Delivery? Git or Subversion or both? NoSQL? Which APIs? Node.js? Follow-the-sun?

In a panel discussion on real-world software delivery bottlenecks, “complexity” was suggested as a main challenge. The panel, held here at the SDLC Acceleration Summit, pointed out that the complexity of constantly evaluating new technologies, techniques and choices can bring uncertainty and doubt and consume valuable mental bandwidth—and those might sometimes negate the benefits of staying on the cutting edge. (Pictured: My friend Arthur Hicken, aka “The Code Curmudgeon,” chief evangelist at Parasoft, which sponsored the event.)

I was the moderator. Sitting on the panel were David Intersimone from Embarcadero Technologies; Paul Dhaliwal from 383 Media; Andrew Binstock, editor of Dr. Dobb’s Journal; and Norman Buck from SQS.

Choices are not simple. Merely keeping up with the latest technologies can consume tons of time. Not only reading resources like SD Times, but also following your favorite Twitter feeds, reading blogs like Stack Overflow, meeting thought leaders at conferences, and, of course, hearing vendor pitches.

While complexity can be overwhelming, the truth is that we can’t opt out. We must keep up with the latest platforms and changes. We must have a mobile strategy. Yes, you can choose to ignore, say, the recent advances in cloud computing, Web APIs and service virtualization, but if you do so, you’re potentially missing out on huge benefits. Yes, technologies like Software Defined Networking (SDN) and OpenFlow may not seem applicable to you today, but odds are that they will be soon. Ignore them now and play catch-up later.

Complexity is not new. If you were writing FORTRAN code back in the 1970s, you had choices of libraries. Developing client/server software for NetWare or AIX? Building with Oracle? We have always had complexity and choices in platforms, tools, methodologies, databases and libraries. We always had to ensure that our code ran (and ran properly) on a variety of different targets, including a wide range of browsers, Java runtimes, rendering engines and more.

Yet today the number of combinations and permutations seems to be significantly greater than at any time in the past. Clouds, virtual machines, mobile devices, APIs, tools. Perhaps we need a new abstraction layer. In any case, though, complexity is a root cause of our challenges with software delivery. We must deal with it.


Test Early, Test Often

Quality Assurance. Testing. No matter what you call it – and of course, there are subtle distinctions between testing and QA – the discipline is essential for successfully creating professional-grade software.

Sure, a one-person shop or a small consultancy might get away without having formal test teams or serious QA policies. Most of us can’t afford to work that way. The cost of software failure, to us and to our customers, can be huge in so many ways.

SD Times and sdtimes.com recently asked readers about test and QA in a research study. Here are some of the results; how well do the answers match your organization’s profile?

Does your organization have separate development and test teams? (Please check one only)

Yes, all development teams and test/QA teams are separate 35.9%
Some development and test/QA teams are separate, some are integrated 33.4%
All test and development teams are integrated 27.4%
Don’t know 3.3%

If any of the test/QA teams in your organization are separate, where do those test teams report? (Please check all that apply)

To the development team 16.2%
To a development manager, director, or VP of development 33.8%
To an IT manager not managing development 22.2%
To a software architect or project leader on a particular project 19.7%
To the CIO/CTO 9.2%
To line of business managers 14.8%
Don’t know 8.1%

What background do your test/QA managers and directors typically have? (Please check all that apply)

Development 20.3%
Test/QA only 28.9%
Development and test/QA 48.9%
General IT background 31.7%
General management background 18.5%
No particular background – we train them from scratch 14.2%

Does your company outsource any of its software quality assurance or testing? (Please check one only)

Yes, all of it 3.7%
Yes, some of it 32.3%
No, none of it 58.1%
Don’t know 5.9%

Who is responsible for internally-developed application performance testing and monitoring in your company? (Please check all that apply)

Software/Application Developers 68.2%
Software/Application Development Management 54.2%
Testers 52.3%
Testing Management 43.9%
Systems administrators 34.9%
IT top management (development) (VP or above) 29.3%
Networking personnel 25.2%
IT top management (non-development) (VP or above) 24.6%
Line-of-business management 21.5%
Consultants 20.2%
Service providers 19.0%
Networking management 18.1%

What is the state of software security testing at your company? (Please check all that apply)

Software security is checked by the developers 41.2%
Software security is checked by the test/QA team 31.6%
Software security is tested by a separate security team 26.9%
Software security testing is done for Web applications 25.7%
Software security is checked by the IT/networking department 25.4%
Software security testing is done for in-house applications 24.1%
Software security testing is done for public-facing applications 21.7%
We don’t have a specific security testing process 20.4%
Software security is checked by contractors 9.3%
Software security testing is not our responsibility 3.1%

At what stage is your company, or companies that you consult, using the cloud for software testing? (Please check one only)

No plans to use the cloud for software testing 42.3%
We are studying the technology but have not started yet 21.2%
We are experimenting with using the cloud for software testing 16.0%
We are using the cloud for software testing on a routine basis 10.7%
Don’t know 9.8%

Lots of good data here!

Z Trek Copyright (c) Alan Zeichick

Four common mobile development mistakes

Web sites developed for desktop browsers look, quite frankly, terrible on a mobile device. The look and feel is often wrong, very wrong. Text is the wrong size. Gratuitous clip art on the home page chews up bandwidth. Features like animations won’t behave as expected. Don’t get me started on menus — or on the use-cases for how a mobile user would want to use and navigate the site.

Too often, some higher-up says, “Golly, we must make our website more friendly,” and what that results in is a half-thought-out patch job. Not good. Not the right information, not the right workflow, not the right anything.

One organization, UserTesting.com, says that there are four big pitfalls that developers (and designers) encounter when creating mobile versions of their websites. The company, which focuses on usability testing, says that the biggest issues are:

Trap #1 – Clinging to Legacy: ‘Porting’ a Computer App or Website to Mobile
Trap #2 – Creating Fear: Feeding Mobile Anxiety
Trap #3 – Creating Confusion: Cryptic Interfaces and Crooked Success Paths
Trap #4 – Creating Boredom: Failure to Quickly Engage the User

Makes sense, right? UserTesting.com offers a report, “The Four Mobile Traps,” that goes into more detail.

The report says,

Companies creating mobile apps and websites often underestimate how different the mobile world is. They assume incorrectly that they can create for mobile using the same design and business practices they learned in the computing world. As a result, they frequently struggle to succeed in mobile.

These companies can waste large amounts of time and money as they try to understand why their mobile apps and websites don’t meet expectations. What’s worse, their awkward transition to mobile leaves them vulnerable to upstart competitors who design first for mobile and don’t have the same computing baggage holding them back. From giants like Facebook to the smallest web startup, companies are learning that the transition to mobile isn’t just difficult, it’s also risky.

Look at your website. Is it mobile friendly? I mean, truly designed for the needs, devices, software and connectivity of your mobile users?

If not — do something about it.


Coping with the data

As I write this on Friday, Apr. 19, it’s been a rough week. A tragic week. Boston is on lockdown, as the hunt for the suspected Boston Marathon bombers continues. Explosion at a fertilizer plant in Texas. Killings in Syria. Suicide bombings in Iraq. And much more besides.

The Boston incident struck me hard. Not only as a native New Englander who loves that city, and not only because I have so many friends and family there, but also because I was near Copley Square only a week earlier. My heart goes out to all of the past week’s victims, in Boston and worldwide.

Changing the subject entirely: I’d like to share some data compiled by Black Duck Software and North Bridge Venture Partners. This is their seventh annual report about open source software (OSS) adoption. The notes are analysis from Black Duck and North Bridge.

How important will the following trends be for open source over the next 2-3 years?

#1 Innovation (88.6%)
#2 Knowledge and Culture in Academia (86.4%)
#3 Adoption of OSS into non-technical segments (86.3%)
#4 OSS Development methods adopted inside businesses (79.3%)
#5 Increased awareness of OSS by consumers (71.9%)
#6 Growth of industry specific communities (63.3%)

Note: Over 86% of respondents ranked Innovation and Knowledge and Culture of OSS in Academia as important/very important.

How important are the following factors to the adoption and use of open source? Ranked in response order:

#1 – Better Quality
#2 – Freedom from vendor lock-in
#3 – Flexibility, access to libraries of software, extensions, add-ons
#4 – Elasticity, ability to scale at little cost or penalty
#5 – Superior security
#6 – Pace of innovation
#7 – Lower costs
#8 – Access to source code

Note: Quality jumped to #1 this year, from third place in 2012.

How important are the following factors when choosing between using open source and proprietary alternatives? Ranked in response order:

#1 – Competitive features/technical capabilities
#2 – Security concerns
#3 – Cost of ownership
#4 – Internal technical skills
#5 – Familiarity with OSS Solutions
#6 – Deployment complexity
#7 – Legal concerns about licensing

Note: A surprising result was that “Formal Commercial Vendor Support” was ranked as the least important factor – 12% of respondents ranked it as unimportant. Support has traditionally been held as an important requirement by large IT organizations; with awareness of OSS rising, the requirement is rapidly diminishing.

When hiring new software developers, how important are the following aspects of open source experience? Ranked in response order:

2012
#1 – Variety of projects
#2 – Code contributions
#3 – Experience with major projects
#4 – Experience as a committer
#5 – Community management experience

2013
#1 – Experience with relevant/specific projects
#2 – Code contributions
#3 – Experience with a variety of projects
#4 – Experience as a committer
#5 – Community management experience

Note: The 2013 results signal a shift to “deep vs. broad experience” where respondents are most interested in specific OSS project experience vs. a variety of projects, which was #1 in 2012.

There is a lot more data in the Future of Open Source 2013 survey. Go check it out. 


Bug Invaders! Angry Code! World of Compilecraft!

Everything, it seems, is a game. When I use the Waze navigation app on my smartphone, I earn status for reporting red-light cameras. What’s next: If I check in code early to the version-control system, do I win a prize? Get points? Become a Code Warrior Level IV?

Turning software development into a game is certainly not entirely new. Some people live for “winning,” and like getting points – or status – by committing code to open-source projects or by reporting bugs as a beta tester. For the most part, however, that was minor. The main reason to commit the code or document the defect was to make the product better. Gaining status should be a secondary consideration – a reward, if you will, not a motivator.

For some enterprise workers, however, gamification of the job can be more than a perk or added bonus. It may be the primary motivator for a generation reared on computer games. Yes, you’ll get paid if you get your job done (and fired if you don’t). But you’ll work harder if you are encouraged to compete against other colleagues, against other teams, against your own previous high score.

Would gamification work with, say, me? I don’t think so. But from what I gather, it’s truly a generational divide. I’m a Baby Boomer; when I was a programmer, Back in the Day, I put in my hours for a paycheck and promotions. What I cared about most: What my boss thought about my work.

For Generation Y / Millennials (in the U.S., generally considered to be those born between 1982 and 2000), it’s a different game.

Here are some resources that I’ve found about gamification in the software development profession. What do you think about them? Do you use gamification techniques in your organization to motivate your workers?

Gamification in Software Development and Agile

Gamifying Software Engineering and Maintenance

Gamifying software still in its infancy, but useful for some

Some Thoughts on Gamification and Software

TED Talk: Gaming can make a better world 


From Apple to Microsoft to Tesla, rumors abound

If there’s no news… well, let’s make some up. That’s my thought upon reading all the stories about Apple’s forthcoming iWatch – a product that, as far as anyone knows, doesn’t exist.

That hasn’t stopped everyone from Forbes to CNN to the New York Times from jumping in with breathless analysis of the rumor.

Turn the page.

More breathless analysis focused on why Microsoft’s stores and retail partners didn’t have enough stock of the Surface Pro tablet. Was this intentional, some wondered, part of a scheme to make the device appear more popular?

My friend John P. Mello Jr. had solid analysis in his article for PC World, “Microsoft Surface Pro sell-out flap: Is the tablet really that popular?”

I think the real reason is that Microsoft isn’t very good at sales estimation or manufacturing logistics. Companies like Apple and HP have dominated, in large part, because of their mastery of the supply chain. Despite its success with the Xbox consoles, Microsoft is a hardware newbie. I think the inventory shortfall was a screw-up, but an honest one.

After all, when Apple or Samsung run out of hot items, nobody says “It’s a trick.”

Can’t leave the conversation about rumors without mentioning the kerfuffle with the New York Times’s story, “Stalled Out on Tesla’s Electric Highway.” In short: Times columnist John M. Broder claims that the Tesla Model S electric car doesn’t live up to its claimed 265-mile estimated range. Tesla founder Elon Musk tweeted “NYTimes article about Tesla range in cold is fake.”

Everyone loves a good Twitter fight. Dozens of pundits, and gazillions of clicks, are keeping this story in the news.


Preying on human weakness with well-designed faux emails

This past week, I’ve started receiving messages from eFax telling me that I’ve received a fax, and to click on a link to download my document. As a heavy eFax user, this seemed perfectly normal… until I clicked one of the links. It took me to a malware site. Fortunately, the site was designed to target Windows computers, and simply froze my Mac’s browser.
The faux eFax messages were well designed. They had clean headers and made it through my email service provider’s malware filters.
Since then, six of those malicious messages have appeared. I have to look carefully at the embedded link to distinguish those from genuine eFax messages with links to genuine faxes.
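
That careful look at the embedded link can be partly automated. Here is a minimal Python sketch, under the assumption that genuine messages link only to `efax.com` or its subdomains over HTTPS; I haven’t verified that against eFax’s actual practice, and the function name `is_trusted_link` and the sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit

# Assumption for illustration: the only domain genuine links should use.
TRUSTED_DOMAINS = {"efax.com"}


def is_trusted_link(url, trusted_domains=TRUSTED_DOMAINS):
    """Return True only if the link's real host is a trusted domain over HTTPS."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Match the domain itself or a true subdomain, not a look-alike prefix.
    return any(
        host == domain or host.endswith("." + domain)
        for domain in trusted_domains
    )


# Hypothetical links of the kind that show up in look-alike messages.
for link in [
    "https://www.efax.com/fax/123",
    "https://efax.com.evil.example/fax/123",
    "https://www.efax.com@evil.example/fax/123",
]:
    print(link, "->", "ok" if is_trusted_link(link) else "suspicious")
```

The check looks at the parsed hostname rather than searching the URL text, because the fakes work by putting the trusted name somewhere other than the real host.
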
The cybercrime wars continue unabated, with no end in sight. I’ve also received fake emails from UPS, asking me to print out a shipping label… which of course leads me to a phishing site.
Malicious email – whether it’s phishing, a “419”-style confidence scam, or an attempt to add your computers to someone’s botnet – is only one type of cybercrime. Most of the time, as software developers, we’re not focusing on bad emails, unless we’re trying to protect our own email account, or worrying about the design of emails sent into automated systems. SQL Injection delivered by email? That’s nothing I want to see.
Most of the attacks that we have to contend with are more directly against our software – or the platforms that it is built upon. Some of those attacks come from outside; some from inside.
Some attacks are successful because of our carelessness in coding, testing, installing or configuring our systems. Other attacks succeed despite everything we try to do, because there are vulnerabilities we don’t know about, or don’t know how to defend against. And sometimes we don’t even know that a successful attack occurred, and that data or intellectual property has been stolen.
We need to think longer and harder about software security. SD Times has run numerous articles about the need to train developers and testers to learn secure coding techniques. We’ve written about tools that provide automated scanning of both source code and binaries. We’ve talked about fuzz testers, penetration tests, you name it.
What we generally don’t talk about is the backstory – the who and the why. Frankly, we generally don’t care why someone is trying to hack our systems; it’s our job to protect our systems, not sleuth out perpetrators.
We are all soldiers in the cybercrime war – whether we like it or not. Please read a story by SD Times editor Suzanne Kattau, “Cybercrime: How organizations can protect themselves,” where she interviewed Steve Durbin of the Information Security Forum. It’s interesting to see this perspective on the broader problem.

Celestial navigation, driving by GPS and agile development

Going agile makes sense. Navigating with traditional methodologies doesn’t make sense. I don’t know about you, but nothing sucks the life out of a software development project faster than having to fully flesh out all the requirements before starting to build the solution.

Perhaps it’s a failure of imagination. Perhaps it’s incomplete vision. But as both a business owner and as an IT professional, it’s rare that a successfully completed application-development project comes even close to matching our original ideas.

Forget about cosmetic issues like the user interface, or unforeseen technical hurdles that must be overcome. No, I’m talking about the reality that my business – and yours, perhaps – moves fast and changes fast. We perceive the needs for new applications or for feature changes long before we understand all the details, dependencies and ramifications.

But we know enough to get started on our journey. We know enough to see whether our first steps are in the right direction. We know enough to steer us back onto the correct heading when we wander off course. Perhaps agile is the modern equivalent of celestial navigation, where we keep tacking closer and closer to our destination. In the words of John Masefield, “And all I ask is a tall ship and a star to steer her by.”

Contrast that to the classic method of determining a complete set of requirements up front. That’s when teams create project plans that are followed meticulously until someone stands up and says, “Hey, the requirements changed!” At that point, you stop, revise the requirements, update the project plan and redo work that must be redone.

Of course, if the cost of creating and revising the requirements and project plan is low, sure, go for it. My automobile GPS does exactly that. If I tell it that I want to drive from San Francisco to New York City (my requirements), it will compute the entire 2,907-mile journey (my project plan) with incredible accuracy, from highway to byway, from interchange to intersection. Of course, every time the GPS detects that I missed an exit or pulled off the highway to get fuel, the device calculates the entire journey again. But that’s okay, as the cost of having the device recreate the project plan when it detects a requirements change is trivial.

In the world of software development, the costs of determining, documenting and getting approvals for a project’s requirements and project plans are extremely high, both in terms of time and money. Worse, there are no automated ways of knowing when business needs have changed, and therefore the project plan must change also. Thus, we can spend a lot of time sailing in the wrong direction. That’s where agile makes a difference – by design, it can detect when something is going wrong faster than classic methodologies.

In a perfect world, if it were easy to create requirements and project plans, there would be no substantive difference between agile and classic methodologies. But in the messy, ever-changing real world of software development that I live in, agile is the navigation methodology for me.


When the cloud was good, it was very very good. But when it was bad, it was horrid

Cloud computing took a big hit this week amid two significant service outages.

The biggest one, at least as it affects enterprise computing, is the eight-hour failure of Amazon’s Simple Storage Service. Check out the Amazon Web Services service health dashboard, and then select Amazon S3 in the United States for July 20. You’ll see that problems began at 9:05 am Pacific Time with “elevated error rates,” and that service wasn’t reported as being fully restored until 5:00 pm.

About the error, Amazon said,

We wanted to share a brief note about what we observed during yesterday’s event and where we are at this stage. As a distributed system, the different components of Amazon S3 need to be aware of the state of each other. For example, this awareness makes it possible for the system to decide to which redundant physical storage server to route a request. In order to share this state information across the system, we use a gossip protocol. Yesterday, we experienced a problem related to gossiping our internal state information, leaving the system components unable to interact properly and causing customers’ requests to Amazon S3 to fail. After exploring several alternatives, we determined that we had to temporarily take the service offline so that we could clear all gossipped state and restart gossip to rebuild the state.

These are sophisticated systems and it generally takes a while to get to root cause in such a situation. We’re working very hard to do this and will be providing more information here when we’ve fully investigated the incident. We also wanted to let you know that for this particular event, we’ll be waiving our standard SLA process and applying the appropriate service credit to all affected customers for the July billing period. Customers will not need to send us an e-mail to request their credits, as these will be automatically applied. This transaction will be reflected in our customers’ August billing statements.
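For readers who haven’t run into the term, here is a minimal sketch of what a gossip protocol does. To be clear, this is not Amazon’s implementation, which isn’t public; the function names, the versioned “up”/“down” status values and the push-to-one-random-peer rule are all assumptions made up for illustration. Each node starts out knowing only its own status, and every round it pushes everything it knows to one randomly chosen peer, which keeps whichever copy of each entry carries the higher version number.

```python
import random

def gossip_round(states, rng):
    """One round of push gossip: every node sends its view to one random peer."""
    nodes = list(states)
    for sender in nodes:
        peer = rng.choice([n for n in nodes if n != sender])
        # Merge: for each key, the peer keeps the entry with the higher version.
        for key, (version, value) in states[sender].items():
            if key not in states[peer] or states[peer][key][0] < version:
                states[peer][key] = (version, value)

def rounds_until_converged(num_nodes=8, seed=0, max_rounds=100):
    """Count gossip rounds until every node has heard about every other node."""
    rng = random.Random(seed)
    # Hypothetical starting point: each node knows only its own status.
    states = {n: {n: (1, "up")} for n in range(num_nodes)}
    for round_no in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(len(view) == num_nodes for view in states.values()):
            return round_no
    return None  # did not converge within max_rounds

if __name__ == "__main__":
    print("rounds needed:", rounds_until_converged())
```

The appeal of the approach is that no central coordinator is needed. The flip side, as the note above suggests, is that bad state can spread the same way good state does, which is presumably why clearing everything and restarting was on the table.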

Kudos to Amazon for issuing a billing adjustment. However, as we all know, the business cost of a service failure like this vastly exceeds the cost you pay for the service. If your applications were offline for eight hours because Amazon S3 was malfunctioning, that really hurts your bottom line. This wasn’t their first service failure, either: Amazon S3 went down in February as well.

Less significant to enterprises, but just as annoying to those concerned, was an outage involving e-mail accounts hosted on Apple’s MobileMe service. MobileMe is the new name of the .Mac service, and the service was updated in mid-July along with the launch of the iPhone 3G. Unfortunately, not everything worked right. As you can see from Apple’s dashboard, some subscribers can’t access their email. Currently, this affects about 1% of their subscribers, but it’s been like that since last Friday.

According to Apple,

We understand this is a serious issue and apologize for this service interruption. We are working hard to restore your service.

This reminds me of the poem from that great Maine writer, Henry Wadsworth Longfellow:

There was a little girl
Who had a little curl
Right in the middle of her forehead;
And when she was good
She was very, very good,
But when she was bad she was horrid.