MapR’s Google Deal Marks Second Big Data Cloud Win


June was a great month for Hadoop software distributor MapR, which landed not one but two high-profile deals to supply the software behind Hadoop services in the cloud.

MapR’s latest deal is tied to Google’s big June 28 announcement of the Google Compute Engine, a new infrastructure-as-a-service (IaaS) offering that sets up the search giant as a public-cloud rival to Amazon Web Services (AWS). MapR is one of at least six partners debuting services on the Google infrastructure, which is currently in limited beta release. MapR and Google are signing up customers to enroll in a private preview of the Hadoop services that will run on Google Compute Engine.

News of the Google partnership came just two weeks after MapR and Amazon announced that services based on MapR’s M3 and M5 Hadoop software distributions would be available on AWS. Where Amazon’s own Elastic MapReduce service runs on Apache Hadoop, the MapR-based services add high-availability features not yet supported in the standard open source software.

A key appeal of the AWS and Google services is the ability to process and analyze data that already resides in the cloud. The MapR-based services on AWS, for example, are integrated with Amazon’s Simple Storage Service (S3) and DynamoDB NoSQL database. Google AdWords and Google (Web) Analytics are both potentially rich, high-volume sources of search and clickstream data that Google Compute Engine customers could presumably tap without costly and time-consuming data-integration and data-movement steps.

“The big challenges in media are knowing who to target, when to target, appropriate price points, and appropriate keyword bids, so you could easily see related digital media and advertising analyses performed on Google’s cloud,” MapR VP of marketing Jack Norris told InformationWeek.

By tapping compute capacity on demand, customers could potentially save money if they experience peaks and valleys in capacity utilization. As a test of Google Compute Engine performance, Norris said, MapR recently tried out its beta Hadoop service by setting up a 1,256-node cluster and running a TeraSort job, an industry-standard benchmark. The cloud-based system completed the job in one minute and 20 seconds, according to Norris, whereas the world record is one minute and two seconds.

“The record was set on a system that had twice as many cores, four times the number of disks, and 200 more servers than the system we set up on the Compute Engine, and the cost of the infrastructure was in the neighborhood of $5 million,” Norris said. “For the test that we ran on the Google Compute Engine, the cost would be about $16.”

Comparable tests of MapR-based Hadoop clusters haven’t been performed on Amazon’s infrastructure, Norris said. As for AWS, companies use the S3 service for everything from Web logs and click-through data to genomics data, and they use Amazon Elastic MapReduce and MapR-based Hadoop for analytics.

“The cloud is also a great target for business continuity, so rather than having an entire second data center, you can run Hadoop clusters in the cloud, with mirroring synchronized between your on-premises and cloud-based targets,” Norris said.

Some analysts say cloud-based services would be prohibitively expensive for long-term storage at high scale, making them most attractive for pilot tests, short-term projects, and cases where the data already exists in the cloud (as with Google AdWords, Google Analytics, AWS S3, and DynamoDB). Norris took exception to that assessment.

“I think we’re going to see generations of cloud services, and [costs at scale] aren’t going to be as much of a factor in the future,” Norris said.

MapR distinguishes itself from Hadoop software distribution and support competitors Cloudera and Hortonworks by providing high-performance features not supported in the standard Apache open source Hadoop software. MapR’s M5 distribution, for example, replaces the Hadoop Distributed File System (HDFS) with a derivative of the Unix-based Network File System. M5 includes snapshotting, mirroring, and other high-availability features that are not supported in the current (1.0) Hadoop code line.

MapR describes the AWS and Google services based on its distributions as an endorsement of its architecture, but there are many options for running Cloudera and Hortonworks in the cloud. Hortonworks is the developer of the software used to run Hadoop on Microsoft’s Azure public cloud. And multiple providers run Hadoop services on AWS and other public clouds using Cloudera’s CDH Hadoop software distribution.

Responding to a request for comment on MapR’s recent deals, Cloudera VP of product Charles Zedlewski said in a statement, “Cloudera has led the industry in support for Apache Hadoop on public clouds, supporting Rackspace, AWS, and SoftLayer dating back to 2009. Each month, tens of thousands of CDH instances are created on top of various public cloud providers.”

Zedlewski also noted that Cloudera developed Apache Whirr, software now used by Cloudera and its competitors to run Hadoop distributions on public clouds.

The entire Hadoop movement was actually inspired by Google, which was a pioneer in the use of MapReduce processing and published the white paper that guided the creators of Hadoop. Google still uses MapReduce processing extensively internally, but its software is not distributed, and its approach to MapReduce is not available as a service on the Google Compute Engine.

Pricing and service details haven’t been finalized for MapR’s services on the Google Compute Engine. Basic compute pricing on the Compute Engine starts at $0.145 per hour for a single core with 3.75 gigabytes of memory. See our hands-on review of the Google Compute Engine private beta.
