Big data analytics is one of the major trends every company is told it must jump on for competitive advantage, even survival. As a result, there’s a lot of mythology around big data. Those myths can lead you astray, wasting resources or putting you on dead-end paths. They can also cause you to miss opportunities where big data approaches could genuinely help.
Here are the nine biggest myths about big data and Hadoop that you should not believe.
Recently, a presales engineer at one of my company’s partners mentioned how much trouble his firm had finding data scientists. I asked about the qualifications his company was seeking. Well, they need to have a doctorate in math, a background in computer science, and what amounts to an MBA, not to mention actual work experience in all of those fields. I asked, “How old is this person, 90?”
Here’s what actually exists: because that company could not find this data-scientist unicorn, it had to create a working group with a cross-section of expertise. That is in fact what you have to do.
Technologists like to throw away the past, preferring tools that are new for what they claim is a totally new reality or problem set. That’s rarely the case.
For example, the Kafka message broker is portrayed as a big-data-needs-a-new-tool product. But compared to other message brokers, it has a pretty poor feature set and is immature. What’s actually new (meaning different): Kafka is architected for the Hadoop platform and with massive distribution in mind. That could be useful, if you can accept its flaws.
That said, sometimes you need more sophisticated routing and guarantees. Use ActiveMQ or a more robust option for those situations.
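What makes Kafka a fit for Hadoop-scale ingestion is its core abstraction: a partitioned, append-only commit log that consumers read by offset. A minimal in-memory sketch of that model (the class and method names here are illustrative, not Kafka's actual API):

```python
class MiniLog:
    """Toy partitioned commit log illustrating Kafka's core model.

    Messages with the same key always land in the same partition,
    and consumers track their own read offsets.
    """

    def __init__(self, partitions=3):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Key-based partitioning keeps related messages in order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers poll from an offset; the log never deletes on read.
        return self.partitions[partition][offset:]

log = MiniLog()
p, off = log.produce("clickstream", {"user": 1, "page": "/home"})
log.produce("clickstream", {"user": 1, "page": "/cart"})
print(log.consume(p, off))
```

Real Kafka layers replication, durable storage, and consumer groups on top of this; the point is that the log-by-offset design, not a rich broker feature set, is what it brings to the table.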
I estimate that about 85 percent of what people call machine learning is simple statistics. Most of your problems are probably simple math and analysis. Start there.
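For instance, a trend forecast that often gets labeled "machine learning" is just a least-squares fit you can compute from means and variances with the standard library (the order counts below are made up for illustration):

```python
import statistics

# Monthly order counts (invented numbers): most "predictive
# analytics" starts this simply.
months = [1, 2, 3, 4, 5, 6]
orders = [110, 125, 139, 152, 170, 181]

mx, my = statistics.mean(months), statistics.mean(orders)

# Least-squares slope = cov(x, y) / var(x), computed by hand.
cov = sum((x - mx) * (y - my) for x, y in zip(months, orders)) / (len(months) - 1)
slope = cov / statistics.variance(months)
intercept = my - slope * mx

print(f"trend: +{slope:.1f} orders/month, month-7 forecast: {slope * 7 + intercept:.0f}")
```

No training pipeline, no model serving, and the answer is just as actionable.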
Myth No. 4: You are special
As the philosopher Tyler Durden once said, “You are not special. You are not a beautiful and unique snowflake.” Guess what? About half of the industry is busy writing the same ETL scripts for many of the same data sources and custom-creating the same analysis. Hell, in any sizable company, many departments are probably duplicating this work as well.
Needless to say, it’s a good time to be a big data consultant.
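The duplicated work in question is usually the same few transforms: parse, drop bad rows, normalize a field, load. A sketch of that boilerplate (field names and data are invented):

```python
import csv
import io

# The same extract-transform-load shape, rewritten at company after company.
RAW = """date,region,revenue
2014-01-02,us-east,1200
2014-01-03,,950
2014-01-04,US-EAST,1300
"""

def etl(raw_csv):
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not row["region"]:                 # drop rows missing required fields
            continue
        row["region"] = row["region"].lower() # normalize inconsistent casing
        row["revenue"] = int(row["revenue"])  # cast strings to numbers
        rows.append(row)
    return rows

clean = etl(RAW)
print(len(clean), sum(r["revenue"] for r in clean))
```

Multiply this by every data source and every department writing its own copy, and the duplication adds up fast.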
Hive is not fast. It cannot be made fast enough to impress you. Yes, the new version is better, but it will still underwhelm you from a performance perspective. It scales well, but you may need multiple tools in your toolbox to hit Hadoop with SQL.
Myth No. 6: You can use clusters with fewer than 12 nodes
Hadoop 2+ barely fits on 12 nodes; with fewer, you will wait forever for it to even start, and anything you run will take so long you’ll hear crickets, if it completes at all. (Well, you can run “hello world.”) Hadoop 2 runs more processes, which means you need more nodes and more memory.
Spark will do better, minus the initial load time from HDFS, so long as the data set fits in memory.
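Spark's edge comes from reading a data set once and then reusing it in memory across subsequent operations, rather than re-reading it from disk on every pass. A rough standard-library analogy of that first-load-then-reuse pattern (this is not Spark's API, just the idea behind its caching):

```python
import os
import tempfile

# Write a "data set" to disk, standing in for a file on HDFS.
path = os.path.join(tempfile.mkdtemp(), "values.txt")
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(10_000)))

_cache = None

def load():
    # Pay the disk read once, then serve every later pass from memory,
    # the way Spark keeps a cached data set resident between operations.
    global _cache
    if _cache is None:
        with open(path) as f:
            _cache = [int(line) for line in f]
    return _cache

total = sum(load())                                  # first pass: disk read
count_even = sum(1 for v in load() if v % 2 == 0)    # later passes: memory only
print(total, count_even)
```

The catch in the myth is the same as in the sketch: the benefit only materializes if the working set actually fits in the memory you have.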
Your vendor told you no. Your IT team balked. No, you cannot put data nodes on your SAN. And if you put your management nodes in VMs, you can hit bottlenecks when writes to logs and journals suffer latency, or when you get low IOPS or high latency to the data nodes.
That said, Amazon Web Services and others navigate these issues and still manage reasonable performance and scalability. You can too, but you need to treat this infrastructure differently from your internal file servers and your external corporate website, and you need to manage hardware and virtualized resources effectively.
Remember: Throughput and latency are orthogonal. HDFS cares about both in different places.
This excerpt is from InfoWorld.
By: Andrew Oliver, Strategic Developer, InfoWorld
Originally published at www.infoworld.com