Sunday, May 8, 2011

Installing Flume in the cluster - A complete step by step tutorial

Flume Cluster Setup :




In the previous post, you used Flume on a single machine with the default configuration settings. With the default settings, nodes automatically search for a Master on localhost on a standard port. In order for the Flume nodes to find the Master in a fully distributed setup, you must specify site-specific static configuration settings.

Before we start:-
Before you start configuring Flume, you need a running Hadoop cluster, which will serve as the centralized storage for Flume. Please refer to the post "Installing Hadoop in the cluster - A complete step-by-step tutorial" before continuing.

Installation steps:-

Perform the following steps on the master machine.
1. Download flume-0.9.1.tar.gz from https://github.com/cloudera/flume/downloads and extract it to a directory on your computer. The rest of this post refers to the Flume installation root as $FLUME_INSTALL_DIR.

2. Edit the file /etc/hosts on the master machine (and also on the agent and collector machines) and add the following lines.

192.168.41.67 flume-master
192.168.41.53 flume-collector
hadoop-namenode-machine-IP hadoop-namenode
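
Since every machine needs the same entries, a quick check can save some debugging later. The following is only a sketch: the function name missing_hosts is made up for this post, and it assumes the usual hosts-file layout of an IP address followed by one or more names.

```shell
#!/bin/sh
# Print every required host name that has no entry in the given hosts file.
# Usage: missing_hosts HOSTS_FILE NAME...
# (missing_hosts is a hypothetical helper, not part of Flume.)
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Strip comments, then look for the name in any field after the IP.
        if ! sed 's/#.*//' "$hosts_file" | awk -v n="$name" '
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }'; then
            echo "$name"
        fi
    done
}
```

For example, `missing_hosts /etc/hosts flume-master flume-collector hadoop-namenode` prints nothing when all three names are present, and prints the missing names otherwise.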

3. Open the file $FLUME_INSTALL_DIR/conf/flume-site.xml and edit the following properties.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>flume.master.servers</name>
<value>flume-master</value>
</property>
<property>
<name>flume.collector.event.host</name>
<value>flume-collector</value>
<description>This is the host name of the default "remote" collector.
</description>
</property>
<property>
<name>flume.collector.port</name>
<value>35853</value>
<description>The default TCP port on which the collector listens for the events it is collecting.
</description>
</property>
</configuration>
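
Because the same file has to exist on every machine, it may be easier to generate it than to edit it by hand. This is a rough sketch under a few assumptions: write_flume_site is a hypothetical helper name, the target directory already exists, and the three properties shown above are the only ones you need.

```shell
#!/bin/sh
# Write a minimal flume-site.xml into CONF_DIR (overwrites an existing file).
# Usage: write_flume_site CONF_DIR MASTER_HOST COLLECTOR_HOST [COLLECTOR_PORT]
# (write_flume_site is a hypothetical helper, not part of Flume.)
write_flume_site() {
    conf_dir=$1 master=$2 collector=$3 port=${4:-35853}
    cat > "$conf_dir/flume-site.xml" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>flume.master.servers</name>
<value>$master</value>
</property>
<property>
<name>flume.collector.event.host</name>
<value>$collector</value>
</property>
<property>
<name>flume.collector.port</name>
<value>$port</value>
</property>
</configuration>
EOF
}
```

On each machine you could then run something like `write_flume_site "$FLUME_INSTALL_DIR/conf" flume-master flume-collector`. Back up any existing flume-site.xml first, since the function replaces it.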

4. Repeat steps 1 to 3 on the collector and agent machines.
Note: - The agent Flume nodes are co-located on the machines that run the service producing the logs.

Start flume processes:-

1. Start Flume master:- The Master can be started manually by executing the following command on the master machine.
1.1 $FLUME_INSTALL_DIR/bin/flume master
1.2 After the Master has started, you can access it by pointing a web browser to http://flume-master:35871/. This web page displays the status of all Flume nodes that have contacted the Master, and shows each node's currently assigned configuration. If you open it before any Flume nodes are running, the status and configuration tables will be empty.

2. Start Flume collector:- The Collector can be started manually by executing the following command on the collector machine.
2.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-collector

2.2 To check whether a Flume node (collector) is up, point your browser to the Flume Node status page at http://flume-collector:35862/. Each node displays its own data in a single table that includes diagnostics and metrics about the node, its data flows, and the machine it is running on. If several instances of the Flume node program run on one machine, each one increments the port number and tries to bind to the next port (35863, 35864, and so on), then logs the port it finally selected.

2.3 If the node is up, you should also refresh the Master's status page (http://flume-master:35871) to make sure that the node has contacted the Master. You brought up one node named flume-collector, so you should see one node listed in the Master's node status table.
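
The port increment described in step 2.2 means that the second and third node on one machine end up on 35863 and 35864. As a small illustration, here is a sketch that lists the status URLs to try. The helper name node_status_urls is made up for this post, and it assumes the nodes bind to consecutive ports with none of them taken by another program.

```shell
#!/bin/sh
# Print the status-page URLs that COUNT node processes on HOST would use,
# starting at the default node status port 35862.
# Usage: node_status_urls HOST COUNT
# (node_status_urls is a hypothetical helper, not part of Flume.)
node_status_urls() {
    host=$1 count=$2 port=35862
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "http://$host:$((port + i))/"
        i=$((i + 1))
    done
}
```

You can open each printed URL in a browser, or, if curl is installed, pass it to `curl -sf URL > /dev/null` and look at the exit status to see whether a node answers there.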

3. Start Flume agent:- The Agent can be started manually by executing the following command on the agent machine (agent Flume nodes are co-located on the machines that run the service producing the logs).
3.1 $FLUME_INSTALL_DIR/bin/flume node -n flume-agent

3.2 Perform step 2.3 again.

Note: - Similarly, you can start other Flume agents by executing the following commands:-
Start second agent:- $FLUME_INSTALL_DIR/bin/flume node -n flume-agent1
Start third agent:- $FLUME_INSTALL_DIR/bin/flume node -n flume-agent2
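
With more than a couple of agents, typing the start commands by hand gets tedious. The sketch below only prints the commands, one per agent name, so you can review them first. The helper name agent_start_cmds is an invention for this post, and the agent names are whatever you choose.

```shell
#!/bin/sh
# Print (do not run) one "flume node" start command per agent name.
# Usage: agent_start_cmds INSTALL_DIR NAME...
# (agent_start_cmds is a hypothetical helper, not part of Flume.)
agent_start_cmds() {
    install_dir=$1
    shift
    for name in "$@"; do
        echo "$install_dir/bin/flume node -n $name"
    done
}
```

For example, `agent_start_cmds "$FLUME_INSTALL_DIR" flume-agent flume-agent1 flume-agent2` prints three lines. Each node runs in the foreground, so run each printed line in its own terminal on the machine where that agent belongs.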

Configuring Flume nodes via master:-

1. Configuration of Flume Collector: - On the Master’s web page click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.
Node name: flume-collector
Source: collectorSource(35853)
Sink: collectorSink("hdfs://hadoop-namenode:9000/user/flume/logs/%H00","%{host}-")
Note: - The collector writes to an HDFS cluster (assuming the HDFS namenode machine is called hadoop-namenode)

2. Configuration of Flume Agent:- On the Master’s web page, click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.
Node name: flume-agent
Source: tail("path/to/logfile")
Ex:- tail("/home/$USER/logAnalytics/dot.log")
Sink: agentSink("flume-collector",35853)

Note: - Use the same configuration for each Flume agent.
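
Since every agent gets the same source and sink, the per-agent settings can be generated instead of typed into the form one by one. The sketch below prints one line per agent in the "node : source | sink ;" form used by Flume's data flow specifications. Treat the details as assumptions: agent_specs is a made-up helper name, and you should confirm in your Flume version that the Master's config page accepts this form for configuring several nodes at once.

```shell
#!/bin/sh
# Print one "node : source | sink ;" line per agent name, all with the
# same tail source and agentSink.
# Usage: agent_specs LOGFILE COLLECTOR_HOST COLLECTOR_PORT NAME...
# (agent_specs is a hypothetical helper, not part of Flume.)
agent_specs() {
    logfile=$1 collector=$2 port=$3
    shift 3
    for name in "$@"; do
        printf '%s : tail("%s") | agentSink("%s",%s) ;\n' \
            "$name" "$logfile" "$collector" "$port"
    done
}
```

For example, `agent_specs /path/to/logfile flume-collector 35853 flume-agent flume-agent1` prints two lines that differ only in the node name.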

Friday, May 6, 2011

Installing Flume in the pseudo mode - A complete step by step tutorial




Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.

The primary use case for Flume is as a logging system that gathers a set of log files on every machine in a cluster and aggregates them to a centralized persistent store such as the Hadoop Distributed File System (HDFS).

Installation in pseudo-distributed mode:-


In pseudo-distributed mode, several Flume processes run on a single machine. There are two kinds of processes in the system:
1. Flume Master: - The Flume Master is the central management point; it controls the Flume node data flows and monitors the Flume nodes.
2. Flume Node: - The Flume nodes are divided into two categories:-
2.1 Flume Agent: - The agent Flume nodes are co-located on the machines that run the service producing the logs.
2.2 Flume Collector: - The collector listens for data from multiple agents, aggregates the logs, and then eventually writes the data to HDFS.

Fig: - Flume processes and their configuration.

Before we start:-

Before you start configuring Flume, you need a running Hadoop cluster, which will serve as the centralized storage for Flume. Please refer to the post "Installing Hadoop in the cluster - A complete step by step tutorial" before continuing.


Installation steps:-

1. Download flume-0.9.1.tar.gz from https://github.com/cloudera/flume/downloads and extract it to a directory on your computer. The rest of this post refers to the Flume installation root as $Flume_INSTALL_DIR.



2. The Master can be manually started by executing the following command:


2.1 $Flume_INSTALL_DIR/bin/flume master


2.2 After the Master has started, you can access it by pointing a web browser to http://localhost:35871/. This web page displays the status of all Flume nodes that have contacted the Master, and shows each node's currently assigned configuration. If you open it before any Flume nodes are running, the status and configuration tables will be empty.


3. The Flume collector can be started manually by executing the following command in another terminal.


3.1 $Flume_INSTALL_DIR/bin/flume node -n flume-collector


3.2 To check whether a Flume node is up, point your browser to the Flume Node status page at http://localhost:35862/. Each node displays its own data in a single table that includes diagnostics and metrics about the node, its data flows, and the machine it is running on. If several instances of the Flume node program run on one machine, each one increments the port number and tries to bind to the next port (35863, 35864, and so on), then logs the port it finally selected.


3.3 If the node is up, you should also refresh the Master's status page (http://localhost:35871) to make sure that the node has contacted the Master. You brought up one node named flume-collector, so you should see one node listed in the Master's node status table.


4. Configuring a collector via master:-

4.1 On the Master’s web page click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.

Node name: flume-collector

Source: collectorSource(35853)

Sink: collectorSink("hdfs://hadoop-namenode:9000/user/flume/logs/%H00","%{host}-")

Note: - The collector writes to an HDFS cluster (assuming the HDFS namenode machine is called hadoop-namenode).


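5. The Flume agent can be started manually by executing the following command in another terminal.

5.1 $Flume_INSTALL_DIR/bin/flume node -n flume-agent

5.2 Perform steps 3.2 and 3.3 again. Because the collector already uses port 35862 on this machine, the agent's status page will be on the next port (35863).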


6. Configuring an agent via master:-

6.1 On the Master’s web page, click on the config link. Enter the following values into the "Configure a node" form, and then click Submit.

Node name: flume-agent

Source: tail("path/to/logfile")
Ex:- tail("/home/impetus/logAnalytics/dot.log")

Sink: agentSink("localhost",35853)


7. To check whether the data has been stored in HDFS, point your browser to http://localhost:50070/.
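
If you prefer the command line to the browser, the same check can be done with the Hadoop file system shell. The sketch below turns an HDFS URL into the matching "hadoop fs -ls" command and prints it rather than running it. It assumes the hadoop command is on your PATH when you run the printed line, and hdfs_list_cmd is a made-up helper name.

```shell
#!/bin/sh
# Print the "hadoop fs -ls" command for the path part of an HDFS URL.
# Usage: hdfs_list_cmd hdfs://HOST:PORT/PATH
# (hdfs_list_cmd is a hypothetical helper, not part of Flume or Hadoop.)
hdfs_list_cmd() {
    # Drop the scheme and the host:port part, keeping only the path.
    path=$(printf '%s\n' "$1" | sed 's|^hdfs://[^/]*||')
    echo "hadoop fs -ls $path"
}
```

For example, `hdfs_list_cmd hdfs://hadoop-namenode:9000/user/flume` prints `hadoop fs -ls /user/flume`. Running that command should list whatever the collector has written under that directory so far.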