The Fifth Elephant Blog — The Fifth Elephant

The Fifth Elephant is an annual conference on big data. This is the blog.

Big Data and Decision Making - A free report 31 July 2012

The Economist Intelligence Unit and Capgemini conducted a survey of executives across the globe and supplemented the data from that survey with a series of interviews of senior business executives and experts on decision making and data.

The results are available to all in a report titled The deciding factor: Big Data and decision-making.

Some of the findings of the report include:

Firms that emphasise decision-making based on data and analytics have performed 5 to 6% better—as measured by output and performance—than firms that rely on intuition and experience for decision-making.
There is a growing appetite among organisations for data and data-driven decisions, despite their struggles with the enormous volumes being generated.
43% of respondents agree that using social media to make decisions is increasingly important.

Visit the EIU's website to download the full report.

Day 2 at The Fifth Elephant 31 July 2012

We had a quieter start at the registration desk because most people had already signed in on the first day and got their badges.

The talks on day 2 began with "Your Genome in the Cloud", a keynote by Ramesh Hariharan from Strand Life Sciences, followed by a keynote on "Building Watson" by Karthik V from IBM.

Other talks covered things like the Aadhaar project, Messaging Architecture in Facebook, and Exponential Growth models, which was received very well.

There were also speakers who spoke about big data in medical imaging, mobile analytics, recommendation engines, and data in retail banking and financial markets, among others.

We also had a more active open house on day 2, with small groups forming to discuss MediaWiki efforts for Indic language support, OpenStreetMap enthusiasts gathering to talk about mapping and graphing data in general, and a couple of demos that couldn't be run during regular talks for lack of time.

We ended the day with short introductions to upcoming HasGeek events: Cartonama, JSFoo and Droidcon India.

Neo4j - The Graph Database 28 July 2012

Neo4j is a graph database that makes dealing with complex data much easier. It provides an intuitive data model (the property graph) and a high-performance, fully ACID-compliant database that can traverse billions of nodes and relationships at speeds orders of magnitude higher than relational databases.
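To make the property graph model a little more concrete, here is a minimal sketch in plain Python. It does not use Neo4j or its API; the `PropertyGraph` class, its methods, and the sample people are hypothetical names chosen purely for illustration. Nodes and relationships both carry key-value properties, and a traversal follows relationships from node to node rather than joining tables.

```python
from collections import defaultdict, deque


class PropertyGraph:
    """Toy in-memory property graph (illustrative only, not Neo4j's API)."""

    def __init__(self):
        self.nodes = {}                    # node id -> dict of properties
        self.outgoing = defaultdict(list)  # node id -> [(rel type, target id, properties)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_relationship(self, source, rel_type, target, **properties):
        self.outgoing[source].append((rel_type, target, properties))

    def traverse(self, start, rel_type, max_depth):
        """Breadth-first walk along relationships of one type, up to max_depth hops."""
        seen = {start}
        queue = deque([(start, 0)])
        reached = []
        while queue:
            node_id, depth = queue.popleft()
            if depth == max_depth:
                continue  # do not expand beyond the requested number of hops
            for kind, target, _props in self.outgoing[node_id]:
                if kind == rel_type and target not in seen:
                    seen.add(target)
                    reached.append(target)
                    queue.append((target, depth + 1))
        return reached


# Hypothetical sample data: three people connected by KNOWS relationships.
graph = PropertyGraph()
graph.add_node("asha", name="Asha", city="Bangalore")
graph.add_node("ravi", name="Ravi", city="Pune")
graph.add_node("meera", name="Meera", city="Chennai")
graph.add_relationship("asha", "KNOWS", "ravi", since=2010)
graph.add_relationship("ravi", "KNOWS", "meera", since=2011)

friends_of_friends = graph.traverse("asha", "KNOWS", max_depth=2)
```

Here `friends_of_friends` holds the ids reached within two hops of the starting node. A real graph database adds persistence, transactions and indexing on top of this basic shape.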

Graphs are also great not only for data storage but for very cool visualizations; see, for example, http://mbostock.github.com/d3/ex/force.html and others.

Screenshot from http://maxdemarzi.com/category/visualization/, built on Neo4j

If you want to know more, feel free to join the Neo4j India meetup group or drop in on the forums. Also, there will be a conference in San Francisco dedicated to graphs in November, http://graphconnect.com.

May the source be with you!

This is a sponsored blogpost from Neo4j

Day 1 at The Fifth Elephant 28 July 2012

We've had a great first day at the Fifth Elephant, with some hiccups here and there. There were a few internet connectivity issues to start with, which we managed to fix later in the day.

Kiran was also trying out his new AV rig in Auditorium 1, which allowed him not only to edit video live but also to upload it to YouTube the same day! You can find these videos on the HasGeek YouTube channel.

We also tried to livestream the videos from Auditorium 1 at http://www.ustream.tv/channel/hasgeek. With the internet issues sorted out now, you should be able to follow along on the channel today too.

We've recorded video from the other two auditoriums too, and will be editing and uploading those soon after the conference ends.

In other news, the official Twitter hashtag for the conference, #the5el, was trending along with an unofficial one, #5el, and it got so popular that spammers began to piggy-back on it!

Videos from day 1 28 July 2012

We’re experimenting with a new real-time video editing setup to improve the turnaround time for releasing videos from events. An event like The Fifth Elephant with three parallel tracks can take up to a month to process all videos. Our typical process involves cleaning up audio and watching each video to insert the speaker’s slides at the appropriate position, then exporting this for upload – a process which can take up to five hours per hour of video.

At The Fifth Elephant, we deployed a new video editing setup that allows us to process video in real-time. We put together the configuration with some trial and error over the past few weeks and ran it properly for the first time yesterday at the event. We are pleased to say it works. The videos are rough in parts, but the audio clarity is excellent, and here is the best part: the videos are now online, available for you to watch and share.

Watch all videos from the main auditorium here →

Building a Big Data platform the Red Hat way 25 July 2012

Designing a scalable big data platform is one of the key decisions organizations will face in the near future. The platform they choose should enable them to deal with a scale and growth of data that has never been seen before. Big data is not just about running map reduce applications. There are several other factors that enterprises need to consider, making this one of the most important decisions they will make in this decade.

As big data crosses the chasm and enters mainstream enterprises, Red Hat offers a suite of products for the big data platform that allows you to address the full spectrum of big data business challenges. One can easily observe that big data deployments are dominated by Linux, and the dominant Linux underneath these deployments is Red Hat Enterprise Linux. Red Hat Enterprise Virtualization is leading the way with its high performance and para-virtualized low I/O overhead, making it a good fit for I/O intensive big data workloads. Red Hat Enterprise Linux along with Red Hat Enterprise Virtualization makes for a compelling foundation in an organization's big data environment.

Adding to this mix, Red Hat Storage helps enterprises get the storage scalability they need to handle big data problems. Red Hat Storage reduces the distance between data silos and serves as the general purpose data store. Enterprises can integrate map reduce and other workloads that exploit data locality directly onto the Red Hat Storage clusters. Existing map reduce jobs can run seamlessly on these clusters without any modification.

On the compute side, Red Hat Grid, the leader in distributed computing, brings big compute to big data. As large enterprises have multiple Hadoop clusters, islands of data are being created, which reduces the returns from the IT infrastructure. There is a need to consolidate them into a super cluster or federated clusters, which allow independent pools to use each other's resources. Enterprises need a common interface layer for submission, monitoring and reporting of map reduce and other jobs. Running Hadoop on Red Hat Grid provides this powerful capability. The name node and the data node that are part of the Hadoop instance can themselves be encapsulated as jobs. When run in this fashion, all the policies, lifecycle functionality, scalability and migration capability that the grid provides are available to map reduce jobs running in the grid. Intelligence is built into the engine that can match the jobs to appropriate resources in appropriate places. It shares resources fairly, subject to quotas, limits and priorities. It can dispatch jobs and data to resources, handle errors and failures, and report results. It can store the information not just for analytics, but for other common needs such as metering and capacity planning.

Red Hat also has cloud offerings which complement the big data use case: OpenShift provides an open platform for building and deploying modern cloud applications. Infinispan and Hibernate are good examples of middleware technology which are quite relevant in this space. Red Hat JBoss Data Grid, which is a perfect fit for the real-time big data in-memory data grid market, would enable companies to scale their applications without adding to their relational database sprawl.

In a nutshell, Red Hat has an array of solution offerings which are connected to the big data movement. This, combined with the fact that big data is an open source dominated landscape, makes Red Hat the de facto choice for a big data platform.

Learn more about Red Hat's portfolio of products for the big data platform at http://www.redhat.com
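As a rough illustration of matching jobs to resources subject to quotas and priorities, here is a minimal Python sketch. It is not Red Hat Grid's actual scheduling algorithm or API; the `Job` fields, the `dispatch` function, the group names and all the numbers are assumptions made up for this example. It only shows the general idea: consider jobs in priority order and start one if both the cluster and the owning group's quota have room.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    group: str     # hypothetical: the team or pool that owns the job
    priority: int  # higher values are considered first
    slots: int     # compute slots the job needs


def dispatch(jobs, total_slots, group_quota):
    """Pick jobs to run: highest priority first, within each group's quota."""
    used_by_group = {group: 0 for group in group_quota}
    free = total_slots
    running, waiting = [], []
    # sorted() is stable, so jobs with equal priority keep their submission order.
    for job in sorted(jobs, key=lambda j: -j.priority):
        quota_left = group_quota.get(job.group, 0) - used_by_group.get(job.group, 0)
        if job.slots <= free and job.slots <= quota_left:
            used_by_group[job.group] = used_by_group.get(job.group, 0) + job.slots
            free -= job.slots
            running.append(job.name)
        else:
            waiting.append(job.name)
    return running, waiting


# Hadoop daemons modelled as ordinary jobs, alongside an unrelated reporting job.
jobs = [
    Job("namenode", "hadoop", priority=10, slots=1),
    Job("datanode-1", "hadoop", priority=9, slots=2),
    Job("datanode-2", "hadoop", priority=9, slots=2),
    Job("report", "analytics", priority=5, slots=2),
]
running, waiting = dispatch(jobs, total_slots=6, group_quota={"hadoop": 4, "analytics": 2})
```

In this run the second data node has to wait because the `hadoop` group would exceed its quota, while the lower-priority `report` job still gets its share. A production grid scheduler would also handle preemption, failures and data placement, which this sketch leaves out.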

This is a sponsored blogpost from Red Hat

Videos of the Data Hacknight 25 July 2012

As part of The Fifth Elephant, an overnight Data Hacknight was organised in Pune and Bangalore and open to attendees of the Fifth Elephant.

Check out some of the projects at the Hacknight website for Pune and Bangalore or check out the videos on the HasGeek YouTube channel.

Simplify and Unify Storage deployments 18 July 2012

In this digital universe where data is growing at a fast pace, the infrastructure to store, manage and retrieve data is of paramount importance. Just about everyone in this universe is generating data at a pace never observed before, ranging from a simple purchase in a nearby retail store to storing a high-definition video in the cloud. Data has grown tenfold in five years and the growth curve is only getting steeper. The amount of data today, if stacked on DVDs, would reach halfway to Mars.

Of this growth, unstructured data is growing at least five times faster than the structured data stored in relational databases. Today, much of this data is simply accumulated and not analyzed to drive business optimization or new business opportunities. A study by McKinsey Global Institute projects a potential 60% increase in retailers' operating margins with big data. So there is a need to efficiently store data and effectively analyze it to help make better business decisions. That said, 5% growth in IT spending has not kept pace with the 40% growth of data. Having different storage silos for different storage needs is not just inefficient, it is a nightmare to manage. This situation calls for a radically new approach to how storage deployments are managed in an enterprise.
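To see why the gap between the 5% and 40% figures matters, here is a small worked calculation in Python. It assumes both figures are annual growth rates that compound year over year, which the text above does not state explicitly; the function name `data_per_budget_unit` is a hypothetical one chosen for this sketch.

```python
def data_per_budget_unit(years, data_growth=0.40, spend_growth=0.05):
    """How much more data each unit of IT spending must cover after `years`,
    relative to today, if both rates compound annually (an assumption)."""
    return ((1 + data_growth) / (1 + spend_growth)) ** years


gap_after_five_years = data_per_budget_unit(5)
print(f"After 5 years: {gap_after_five_years:.1f}x more data per unit of spending")
```

Under these assumptions, every unit of IT spending has to cover roughly four times as much data after five years, which is the pressure behind consolidating storage silos.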

Red Hat Storage can be the answer to simplifying and unifying storage deployments in a modern enterprise. GlusterFS, a key building block of Red Hat Storage Server, is an open source, POSIX-compatible distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients. GlusterFS is based on a stackable user space design and can deliver exceptional performance for diverse workloads. The Red Hat Storage solution clusters together storage building blocks from multiple commodity systems over an InfiniBand RDMA or TCP/IP interconnect, aggregating disk and memory resources and managing data in a single global namespace.
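One way to picture a single namespace spread over many storage building blocks is to compute where each file lives from its path, with no central lookup table. The Python sketch below is a deliberately simplified illustration of that idea and not GlusterFS's actual placement algorithm; the `brick_for` function, the server names and the file paths are all hypothetical.

```python
import hashlib


def brick_for(path, bricks):
    """Map a file path to one brick by hashing the path (simplified placement)."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then pick a brick.
    index = int.from_bytes(digest[:8], "big") % len(bricks)
    return bricks[index]


# Hypothetical storage building blocks exported by three commodity servers.
bricks = ["server1:/export/brick", "server2:/export/brick", "server3:/export/brick"]
paths = ["/logs/a.log", "/logs/b.log", "/video/c.mp4"]
placement = {path: brick_for(path, bricks) for path in paths}
```

Because every client can compute the same answer from the path alone, no single server has to hold the map of where files are stored. Note that this naive modulo scheme would move most files whenever a brick is added, a problem real systems go to some length to avoid.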

With the prevalence of cloud, object-based storage solutions are gaining a stronger foothold in enterprise IT. Red Hat Storage unifies the simplicity of NAS storage with the power of object storage technology. It provides a system for data storage that enables users to access the same data both as an object and as a file, thus simplifying management and controlling storage costs, and enabling enterprises to adopt and deploy cloud storage solutions. Enterprises can use this technology to accelerate the process of preparing file-based applications for the cloud and simplify new application development for cloud computing environments.

Red Hat Storage Server also provides compatibility for Apache Hadoop, and it uses the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. As an open source, distributed storage solution designed to work on huge numbers of heterogeneous, commodity devices, Gluster is a perfect complement to the Hadoop ecosystem. Existing MapReduce-based applications can use GlusterFS seamlessly without any need for rewrites. This functionality not only opens up data within Hadoop deployments to any file-based or object-based application, but also eliminates the centralized namenode from big data deployments, driving efficiency and high availability.

By unifying the storage environments into a single scale-out pool, Red Hat Storage delivers unmatched value to enterprises. In comparison to traditional storage solutions which need to be maintained in different silos for different needs, Red Hat Storage is a new approach to solving the storage problems entirely in software. Whether deployed on-premise, or in a private, public, or hybrid cloud environment, Red Hat Storage makes it easier to access information, brings freedom of choice, and brings the power of community-driven innovation to your enterprise. For more information, please visit redhat.com/products/storage-server.

This is a sponsored blogpost from Red Hat.

Data Hacknights today 14 July 2012

As part of the larger Fifth Elephant conference, there will be a Data Hacknight in Pune and Bangalore starting at 2pm on the 14th of July and running until 10am on the 15th.

The data hacknight is open to enthusiasts, geeks, designers, mathematicians and statisticians. It is an occasion to work on a data project that you have always wanted to, or to pick a proposed project. Either way, there'll be a lot of other enthusiasts to team up with and learn new stuff from.

Your hack can be looking for patterns, making analytical models, creating a cool visualization that provides new insight, learning a new tool, or maybe, if you want to aim high, building your own cluster overnight!

The hacknight is free with a ticket to The Fifth Elephant. If not, a cover charge of Rs 500 applies. HasGeek will keep you well fed and caffeinated during the 16 hours of hacking.

Check out the projects at Pune and Bangalore.

Data is the next Intel Inside 21 June 2012

The past few years have seen an explosion in the amount of data generated. Social network interactions, scientific experiments like the Large Hadron Collider and genome sequencing, government data, data from sensors, and consumer and sales data from retail and e-commerce companies are some of the large sources of data available today. The dual problem of this “big data” is making sense of it and using it to make decisions.

Along with the data explosion, there has also been a rapid development of tools and techniques used to work with big data and the rise of the field of “data science.” Both of these bring a raft of changes to the way businesses operate, including a shift to more open, flexible cloud-based systems and commoditized data management.

For example, a recent report in the New York Times mentions that retailers like Walmart constantly analyse sales, demographic, and even weather data to tailor product selections at stores and to determine pricing markdowns. Shipping companies like UPS mine their traffic data and delivery times to improve routing, and online match-making services constantly sift through personal data to improve their algorithms.

This predictive power of data is immense, and is used in various fields ranging from economic forecasting to public health. An interesting example of the latter is Google Flu Trends where search data from Google can predict an outbreak of flu even before the hospitals or health services report it.

Big Data does come with a few caveats; the old aphorism of “lies, damned lies and statistics” still holds true, but despite that, big data is here to stay. As Hal Varian, Chief Economist at Google says, “the ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades.”

Welcome to The Fifth Elephant! 21 June 2012

Welcome to the shiny new website for The Fifth Elephant! In case you’re wondering, The Fifth Elephant is a new, annual conference on Big Data from HasGeek.

With this website and blog, we’ll keep you updated not just on events in the run-up to the conference and details of the venues and places to stay in Bangalore, but also about interesting things happening in the world of data, analytics and visualisation. We’ll be posting speaker bios, videos, a few words from sponsors and partners, and a lot of interesting news and views about the field of big data.

We’d also love to hear from you, about your experiences working with big data and analytics, or if you’d like to tell us what a fantastic job we’re doing with The Fifth Elephant (or if we’re doing something terrible too!). If following a blog is not your thing, you can keep in touch with the Fifth Elephant on Twitter or you can post at our page on Facebook. We will be posting updates there too, so you won’t miss anything.

It’s also not too late to propose a talk for the conference or to vote on the ones already proposed. We have over 50 talks already proposed covering all kinds of topics. Head over to the Fifth Elephant funnel to vote or add your own. Tickets for the conference are also on sale, so make sure to purchase yours before the price goes up in the final weeks!

One last thing: why “The Fifth Elephant”? Well, for that answer, you’ll have to wait until the day of the conference! :)