Big data analytics is one of the major trends every company is told it must jump on for competitive advantage, even survival. As a result, there’s a lot of mythology around big data. Those myths can lead you astray, wasting resources or putting you on dead-end paths. They can also cause you to miss opportunities where big data approaches could genuinely help.
Here are the nine biggest myths about big data and Hadoop that you should not believe.
Recently, a presales engineer at one of my company’s partners mentioned how much trouble his firm had finding data scientists. I asked about the qualifications his company was seeking. Well, they need to have a doctorate in math, a background in computer science, and what amounts to an MBA, not to mention actual work experience in all of those fields. I asked, “How old is this person, 90?”
Here’s what actually exists: because that company could not find this data-scientist unicorn, it had to create a working group with a cross-section of expertise. That is in fact what you have to do.
Technologists like to throw away the past, preferring tools that are new for what they claim is a totally new reality or problem set. That’s rarely the case.
For example, the Kafka message broker is portrayed as a big-data-needs-a-new-tool product. But compared to other message brokers, it has a pretty poor feature set and is immature. What’s actually new (meaning different): Kafka is architected for the Hadoop platform and with massive distribution in mind. That could be useful, if you can accept its flaws.
That said, sometimes you need more sophisticated routing and guarantees. Use ActiveMQ or a more robust option for those situations.
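What makes Kafka a fit for Hadoop-scale ingestion is its core abstraction: a partitioned, append-only commit log that consumers read by offset. A minimal in-memory sketch of that model (the class and method names here are illustrative, not Kafka's actual API):

```python
class MiniLog:
    """Toy partitioned commit log illustrating Kafka's core model.

    Messages with the same key always land in the same partition,
    and consumers track their own read offsets.
    """

    def __init__(self, partitions=3):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Key-based partitioning keeps related messages in order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers poll from an offset; the log never deletes on read.
        return self.partitions[partition][offset:]

log = MiniLog()
p, off = log.produce("clickstream", {"user": 1, "page": "/home"})
log.produce("clickstream", {"user": 1, "page": "/cart"})
print(log.consume(p, off))
```

Real Kafka layers replication, durable storage, and consumer groups on top of this; the point is that the log-by-offset design, not a rich broker feature set, is what it brings to the table.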
I estimate that about 85 percent of what people call machine learning is simple statistics. Most of your problems are probably simple math and analysis. Start there.
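For instance, a trend forecast that often gets labeled "machine learning" is just a least-squares fit you can compute from means and variances with the standard library (the order counts below are made up for illustration):

```python
import statistics

# Monthly order counts (invented numbers): most "predictive
# analytics" starts this simply.
months = [1, 2, 3, 4, 5, 6]
orders = [110, 125, 139, 152, 170, 181]

mx, my = statistics.mean(months), statistics.mean(orders)

# Least-squares slope = cov(x, y) / var(x), computed by hand.
cov = sum((x - mx) * (y - my) for x, y in zip(months, orders)) / (len(months) - 1)
slope = cov / statistics.variance(months)
intercept = my - slope * mx

print(f"trend: +{slope:.1f} orders/month, month-7 forecast: {slope * 7 + intercept:.0f}")
```

No training pipeline, no model serving, and the answer is just as actionable.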
Myth No. 4: You are special
As the philosopher Tyler Durden once said, “You are not special. You are not a beautiful and unique snowflake.” Guess what? About half of the industry is busy writing the same ETL scripts for many of the same data sources and custom-creating the same analysis. Hell, in any sizable company, many departments are probably duplicating this work as well.
Needless to say, it’s a good time to be a big data consultant.
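The duplicated work in question is usually the same few transforms: parse, drop bad rows, normalize a field, load. A sketch of that boilerplate (field names and data are invented):

```python
import csv
import io

# The same extract-transform-load shape, rewritten at company after company.
RAW = """date,region,revenue
2014-01-02,us-east,1200
2014-01-03,,950
2014-01-04,US-EAST,1300
"""

def etl(raw_csv):
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if not row["region"]:                 # drop rows missing required fields
            continue
        row["region"] = row["region"].lower() # normalize inconsistent casing
        row["revenue"] = int(row["revenue"])  # cast strings to numbers
        rows.append(row)
    return rows

clean = etl(RAW)
print(len(clean), sum(r["revenue"] for r in clean))
```

Multiply this by every data source and every department writing its own copy, and the duplication adds up fast.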
Hive is not fast. It cannot be made fast enough to impress you. Yes, the new version is better, but it will still underwhelm you from a performance perspective. It scales well, but you may need multiple tools in your toolbox to hit Hadoop with SQL.
Myth No. 6: You can use clusters with fewer than 12 nodes
Hadoop 2+ barely fits on 12 nodes; with fewer, you will wait forever for it to even start, and anything you run will take so long you’ll hear crickets, if it completes at all. (Well, you can run “hello world.”) Hadoop 2 runs more processes, which means you need more nodes and more memory.
Spark will do better, minus the initial load time from HDFS, so long as the data set fits in memory.
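Spark's edge comes from reading a data set once and then reusing it in memory across subsequent operations, rather than re-reading it from disk on every pass. A rough standard-library analogy of that first-load-then-reuse pattern (this is not Spark's API, just the idea behind its caching):

```python
import os
import tempfile

# Write a "data set" to disk, standing in for a file on HDFS.
path = os.path.join(tempfile.mkdtemp(), "values.txt")
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(10_000)))

_cache = None

def load():
    # Pay the disk read once, then serve every later pass from memory,
    # the way Spark keeps a cached data set resident between operations.
    global _cache
    if _cache is None:
        with open(path) as f:
            _cache = [int(line) for line in f]
    return _cache

total = sum(load())                                  # first pass: disk read
count_even = sum(1 for v in load() if v % 2 == 0)    # later passes: memory only
print(total, count_even)
```

The catch in the myth is the same as in the sketch: the benefit only materializes if the working set actually fits in the memory you have.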
Your vendor told you no. Your IT team balked. No, you cannot put data nodes on your SAN. And if you put your management nodes in VMs, you can hit bottlenecks when writes to logs and journals suffer latency, or when you get low IOPS or high latency to the data nodes.
That said, Amazon Web Services and others navigate these issues and still manage reasonable performance and scalability. You can too, but you need to treat this infrastructure differently from your internal file servers and your external corporate website, and you need to manage hardware and virtualized resources effectively.
Remember: Throughput and latency are orthogonal. HDFS cares about both in different places.
This excerpt is from InfoWorld.
By: Andrew Oliver, Strategic Developer, InfoWorld
Originally published at www.infoworld.com