Machine Learning Times

CONTINUE READING: Access the complete article in INMA, where it was originally published 

9 years ago
Choosing the Right Vendor for your Big Data Needs

 

Choosing the right vendor for managing Big Data requires each organization to consider its own needs, including preferences related to data storage, scripting language, and preferred usage tools.


Hadoop, NoSQL, and all of the others living in the cloud storage/tools world are in an explosive stage of development, with many new companies jumping into the fray with a widget that makes something possible (time series) or the large firms making something easier to do (drag and drop H-SQL).

All of this inventive effort makes selecting the right set of tools to build your technology stack very difficult. The classic cliché of 20/20 hindsight will prove yet again that your great tech decision was wrong. But you can’t stand on the sidelines; you have to pick something.

So, if you are sure you and your company are ready for the Big Data stack (versus lots of data), let’s jump down the rabbit hole.

The simplified stack:

  1. Needs assessment.
  2. People.
  3. Cloud storage.
  4. Database technology.
  5. Extraction/loading tools.
  6. Governance and data forensics tools.
  7. Query tool and language.
  8. Scripting language.
  9. Analytical (statistical) tools.
  10. Presentation.
  11. API and other connectors to enable decisions.

Each layer in the stack is a unique tool/vendor decision point. Come to think of it, the list itself is a decision point: to do the list or hire a firm to manage it for you.

There are vendors out there that provide everything on the list plus the consultants to plug it all in for you. Think shrink-wrapped service provider. It’s a fairly simple approach, with its own risks and rewards.

My preference is to study each of the layers and build a vendor/service/importance matrix. This will let you see at a glance the whole stack, who fills a stack position and who doesn’t, an evaluation of each at each X/Y coordinate of the matrix, and the relative importance of the matrix position. You then have a quick tool to begin narrowing vendors and make decisions based on what is important to your organisation.
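The matrix described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the vendor names, the 0 to 5 scores, and the importance weights are all hypothetical placeholders, and only three of the eleven layers are shown.

```python
# Hypothetical layers and importance weights (higher = more important).
LAYER_WEIGHTS = {
    "cloud storage": 5,
    "database technology": 4,
    "query tool and language": 2,
}

# Hypothetical evaluation scores (0-5) for each vendor at each layer.
# A layer missing from a vendor's row means the vendor does not fill it.
VENDOR_SCORES = {
    "Vendor A": {"cloud storage": 4, "database technology": 3},
    "Vendor B": {
        "cloud storage": 3,
        "database technology": 5,
        "query tool and language": 4,
    },
}


def coverage_gaps(scores, weights):
    """Return the stack layers that a vendor does not fill."""
    return [layer for layer in weights if layer not in scores]


def weighted_total(scores, weights):
    """Sum score * importance over all layers; unfilled layers count as 0."""
    return sum(weights[layer] * scores.get(layer, 0) for layer in weights)


def rank_vendors(vendor_scores, weights):
    """Return (vendor, total) pairs, highest weighted total first."""
    totals = {v: weighted_total(s, weights) for v, s in vendor_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for vendor, total in rank_vendors(VENDOR_SCORES, LAYER_WEIGHTS):
        gaps = coverage_gaps(VENDOR_SCORES[vendor], LAYER_WEIGHTS)
        print(f"{vendor}: total={total}, unfilled layers={gaps}")
```

A single weighted total is only a starting point for narrowing the field; the per-layer gaps are often more telling than the sum.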

Let’s look at the cloud storage level. I just Googled “Big Data cloud storage companies.” It’s no surprise there are more than 32,000,000 results. So, filtering past the companies that pay to get to the top of the search list, you see the big players: Amazon, Microsoft, IBM, EMC (Dell), Google, and Oracle. And then there is the mass of others – some are re-branding the big guys’ stuff under their own brand with some (maybe) added services.

So how do you choose?

Develop selection criteria. Use the matrix concept from above, not necessarily as fully as you would a formal RFP but close, to narrow the field. It is important to do this because most often in the Big Data space, the software tools and vendor names will be completely new to you.

Even then, a vendor name isn’t everything. In this fast-changing space, a big name might be completely wrong for your company, so the safe choice might be the best choice or the worst choice.

Note I said “might.” You need to pick on the basis of needs and fulfillment of those needs. Have your RFP go-to questions answered to support your choice: service level (uptime), support, storage location, price, company financials, security capabilities, public/private/hybrid cloud, and an exit strategy (move to another vendor) are at the top of my criteria list for a storage vendor.

Storage location? Yep, it is important in the cloud space.

For example, if you have personally identifiable information (PII) or company financial/performance information in the data you load to the cloud, and the storage is outside of the United States, what legal implications does that have? Subpoena and discovery/disclosure/retention laws are different. PII definitions and restrictions are different. So are the right to “forget me” requirements, and so on.
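Some of these criteria, storage location in particular, work better as pass/fail gates than as scores. The sketch below shows one way to shortlist storage vendors on must-have requirements before any finer evaluation. The vendor names, uptime figures, regions, and field names are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class StorageOffer:
    """One vendor's answers to the must-have questions (hypothetical fields)."""
    vendor: str
    uptime_pct: float           # committed service level
    storage_regions: frozenset  # where the data may physically reside
    has_exit_path: bool         # documented way to move to another vendor


def passes_gates(offer, min_uptime_pct, allowed_regions):
    """Return True only if every must-have criterion is met."""
    return (
        offer.uptime_pct >= min_uptime_pct
        # Every region the vendor may store data in must be allowed.
        and offer.storage_regions <= allowed_regions
        and offer.has_exit_path
    )


def shortlist(offers, min_uptime_pct, allowed_regions):
    """Return the names of vendors that pass all gates, in input order."""
    return [
        o.vendor for o in offers
        if passes_gates(o, min_uptime_pct, allowed_regions)
    ]


# Hypothetical offers for illustration only.
OFFERS = [
    StorageOffer("Vendor A", 99.9, frozenset({"US"}), True),
    StorageOffer("Vendor B", 99.99, frozenset({"US", "EU"}), True),
    StorageOffer("Vendor C", 99.5, frozenset({"US"}), True),
]

if __name__ == "__main__":
    print(shortlist(OFFERS, min_uptime_pct=99.9, allowed_regions=frozenset({"US"})))
```

Here a vendor with better uptime is still excluded if it may store data outside the allowed regions, which mirrors the legal concern raised above.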

So, while the desire to be on the very edge of technology is intriguing, and the thought of open-sourcing everything is too, remember you are putting everything into a cloud of disk storage owned by someone else. You’ve got to get this one perfect the first time!

The next decision, and it is directly linked to the storage choice, is the database technology. Relational or non-relational. NoSQL or NewSQL. Hadoop, Netezza, Vertica, HANA, Informix, HBase, Cassandra, Membrain, Riak, Couchbase, MongoDB, Google Cloud SQL, Microsoft Azure, or Postgres. In memory. Storage based. Over lots of small servers and disks. Or big, big gear. Do you have the skills to operate these databases? Which one works with which storage providers?

Then you repeat this discovery (flare) analysis and decision (focus) process for every layer in the needed technology stack. It will take you time, and there isn’t a silver bullet solution out there. The mix of data you are collecting and the way you use it in the decisions of your Web site/mobile app are factors that will move you across the spectrum of technologies until you find the right one for your particular needs.

The skill sets to navigate this space are new and, unfortunately, just making a few key hires could inadvertently steer the tool decisions toward what those hires already know rather than what your company needs.


One thought on “Choosing the Right Vendor for your Big Data Needs”

  1. There is only one vendor that could actually structure unstructured data: Oracle.
    1. Oracle obtains statistics on queries and data from the data itself, internally.
    2. Oracle gets 100% patterns from data.
    3. Oracle uses synonym searching.
    4. Oracle indexes data by common dictionary.
    5. Oracle killed SQL, where SQL either does not use statistics at all, or uses manually assigned ones, or (on the Internet) uses ‘popularity’.

    Indeed, the only difference between IBM and Google is that Google can obtain statistics, ‘popularity’ – and IBM can only assign it manually.
    Oracle is the first who can automatically calculate statistics – see ATG Search Administration Guide – https://docs (dot) oracle (dot)
    com/cd/E24152_01/Search.10-1/ATGSearchAdmin/html/s1007understandingtermweights01.html

    However, Oracle still experienced some technical difficulties. Mainly Oracle cannot create a dictionary to index data.

     
