Some myths are rooted in truth -- and myths about Apache Hadoop, the open source software framework for very large data sets, are no exception. Yes, Hadoop runs on cheap commodity computer hardware, and it's easy for users to add nodes. But the devil is in the very expensive details, especially when you're running Hadoop in a production environment, warns Jean-Pierre Dijcks, Oracle master product manager for big data.

"IT departments will think 'I've got servers anyway' or 'I can buy inexpensive ones, and I've got some people, so it will cost next to nothing to build our own Hadoop cluster,'" Dijcks says. "They want to explore this technology and play with it -- and exploration is a good thing."

But IT departments can find that their Hadoop experiments head down the proverbial rabbit hole, piling up expenses they didn't anticipate as business colleagues breathe down their necks to deliver. Dijcks cites five common mistakes IT leaders make with their DIY Hadoop clusters.

1. They try to do it on the cheap

Without a clear idea of what the cluster is supposed to accomplish (other than analyzing data of some kind), many IT departments buy the cheapest servers possible (since everyone knows that Hadoop runs on commodity boxes).

"Hadoop is known to be self-healing, so if a node goes down on a server, it's not a problem," Dijcks says. "But if you buy inexpensive servers, you're more likely to have nodes down and spend more time fixing hardware. And when you have a chunk of nodes that aren't working, you've lost that capacity."

If the Hadoop cluster is just an experiment, none of this is a big deal. However, what starts out as an experimental project often ends up in a production environment, Dijcks says.

IT departments figure, "We've invested a lot of time, we've worked on this very hard, and now we need to put it into production," Dijcks says. "You can learn on throwaway servers, because if [the environment] goes down, no worries -- just restart it. But in production, the cluster needs to stay up through hardware failures, human interaction failures, and whatever can happen."

Forrester, in its Q2 2016 report "The Forrester Wave: Big Data Hadoop Optimized Systems," notes that it takes considerable time and effort to install, configure, tune, upgrade, and monitor the infrastructure for general-purpose Hadoop platforms, while preconfigured Hadoop-optimized systems offer faster time to value, lower costs, minimized administration effort, and modular expansion capabilities.

2. They introduce too many cooks

Most IT departments divide themselves into software, hardware, and network groups, and Hadoop clusters cross all those boundaries. So the DIY Hadoop cluster ends up being the product of quite a few very opinionated chefs.

"It's a situation where you have a recipe for how to do it, but people in charge of the different component areas don't follow their part of the recipe exactly because they prefer something slightly different from what the recipe calls for," Dijcks says. And at the end, the cluster doesn't quite work like it was supposed to.

After a bit of troubleshooting, the system will hopefully be up and running and ready for handoff to the IT operations people to run in production. But at that point, Dijcks says, "that's where another learning curve begins. They may not be familiar with a Hadoop cluster, so you'll see human errors and downtime and a range of issues."

3. They fail to appreciate that Hadoop DIY projects are a Trojan horse

Once Hadoop clusters shift to production, companies typically find they need to staff up to keep them running.

"Of course, people cost money-and just like the rest of IT, this staff will spend the majority of its time on maintenance and not innovation," Dijcks says. So much for a Hadoop cluster being inexpensive.

Furthermore, he points out, this staff will need to have knowledge of Hadoop systems -- and of course, Hadoop experts are highly coveted.

"You can't just repurpose people and expect them to go from zero to Hadoop experts in a short time," Dijcks warns. And even if you hire experienced people, IT environments vary wildly -- as do the components of DIY Hadoop clusters. So all those configurations, connections, and interdependencies in your specific environment will take some time to understand.

4. They underestimate the complexity and frequency of updates

New releases of Hadoop distributions, such as those from Cloudera and Hortonworks, come every three months. These typically include new features, functionality, updates, bug fixes, and more.

"In addition to all the human activity it takes to keep a Hadoop cluster running, there's a new learning curve every three months with the new upgrade," Dijcks says. "The moment you finish one upgrade, you have to start planning the next. It's fairly complicated, so some people start to skip updates."

Skip updates a few times, and you're in uncharted waters. When you eventually decide to update, you may be going from version 5.4 to 5.7 -- and perhaps no one has made that jump before, particularly with the combination of components in your DIY Hadoop cluster and the applications, configurations, and operating systems in your IT environment.

While Cloudera and Hortonworks try to test as many scenarios as possible, "They can't test your specific OS version or the impact of your specific job operation," Dijcks says. "Environments might have a Cisco router or Red Hat operating system or IBM hardware with some other OS. And if it's a cluster that's being used for a big data production project, you may have significant downtime when you try to catch up on updates."
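Before planning a jump across several releases, the unglamorous first step is knowing exactly what is running where. As a minimal sketch -- assuming the cluster's client libraries are on the classpath -- Hadoop's own VersionInfo class reports the version the client was built from; the distribution's management tooling tracks the server-side versions:

    import org.apache.hadoop.util.VersionInfo;

    // Minimal sketch: print the Hadoop version and build details of the
    // client libraries, as a starting point for planning a multi-release
    // upgrade. (Not a substitute for the distribution's own inventory.)
    public class PrintHadoopVersion {
        public static void main(String[] args) {
            System.out.println("Hadoop version: " + VersionInfo.getVersion());
            System.out.println("Built from revision: " + VersionInfo.getRevision());
            System.out.println("Compiled on: " + VersionInfo.getDate());
        }
    }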

5. They're ill-prepared for the security challenge

In the early days of Hadoop, security wasn't seen as a big issue because the cluster sat behind a big firewall. Today, Dijcks says, "security is the biggest problem child of them all."

Kerberos authentication is now built into Hadoop to address those concerns, but some IT organizations don't know how to deal with this protocol. "Integrating Kerberos into an organization's Active Directory is extremely complex," he says. "You have to do a lot of integration work between Active Directory and a range of components, as well as with the Hadoop cluster itself. There's very little documentation for this, and it involves security admins and other groups in IT that almost speak different languages."
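From the application side, the Kerberos story looks deceptively small. The sketch below -- with a hypothetical principal and keytab path -- shows roughly what a client does on a Kerberized cluster: authenticate through Hadoop's UserGroupInformation before touching HDFS. The integration work Dijcks describes, wiring Kerberos and Active Directory into every service in the cluster, all has to happen before code like this can run:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    // Minimal sketch of the client side of a Kerberized cluster. The
    // principal and keytab path are hypothetical placeholders; in practice
    // these properties are set cluster-wide in core-site.xml.
    public class KerberizedHdfsClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");

            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab(
                    "etl-app@EXAMPLE.COM",                       // hypothetical principal
                    "/etc/security/keytabs/etl-app.keytab");     // hypothetical keytab

            FileSystem fs = FileSystem.get(conf);
            System.out.println("Authenticated; root listing has "
                    + fs.listStatus(new Path("/")).length + " entries");
        }
    }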

Some IT departments end up contracting with Cloudera, Hortonworks, or other external parties to secure their DIY Hadoop clusters.

"It can take some time to get this set up and tested and certified and working," Dijcks says. "And then every three months, you do it again with the new Hadoop update, making sure the applications and configurations and everything else still work."