Finally started to play with big data and Pentaho, in my specific case on Cloudera CDH3u4. At Mozilla we have a few clusters of over 80 machines that we're using to back up a bunch of services.
Debugging MapReduce tasks
It took me a while to get my head around how Kettle integrates with MapReduce tasks. Once I did, the first thing I noticed is how hard it is to know what's actually happening. Until Matt Casters and friends get the chance to implement PDI-9148, we need to do things manually, as in inspecting logs, etc.
My first approach was writing to text files. I tested direct output to HDFS, but for some reason it didn't work. Writing directly to the local file system means the output gets spread across all the cluster nodes. This approach generally sucks.
I also thought about adding some hand-made logic in a JavaScript step, but then looked at the Write To Log step. This step works, but it has a major flaw: there is no way to limit its output. If we have millions of rows, we'll end up with a huge log, and that's not good.
An improved Write To Log step
If it's not there, just do it yourself; the code is open. So I did. I added the ability to specify a limit on the step's output, which is very useful for inspecting what the dataset looks like inside a map or reduce task. Once I deployed this change to my cluster, this is what my TaskTracker log looks like (running this with the previous Write To Log version ended up crashing my browser and generating almost half a gigabyte of log files). It shows the first 5 rows of our dataset, with their keys and values:
I'll work with the Kettle team to get this into the main code line; hopefully it will make it into 4.4.1 and 5.0. This is PDI-9195.
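To make the idea concrete, here's a minimal, hypothetical sketch of the row-limit logic. The class and field names (RowLimitedLogger, logLimit, maybeLog) are made up for illustration and are not the actual Kettle Write To Log code; the behaviour is the point: log rows until the configured limit is hit, then stay quiet while rows keep flowing downstream.

    // Hypothetical, self-contained sketch of the row-limit idea.
    // The real change lives in Kettle's Write To Log step; names here are illustrative.
    public class RowLimitedLogger {

        private final long logLimit;   // 0 means "no limit", mirroring the new step option
        private long rowsLogged = 0;

        public RowLimitedLogger(long logLimit) {
            this.logLimit = logLimit;
        }

        // Write the row to the log only while we are under the configured limit;
        // rows above the limit are not logged, but nothing stops them downstream.
        public void maybeLog(Object[] row) {
            if (logLimit > 0 && rowsLogged >= logLimit) {
                return;
            }
            StringBuilder line = new StringBuilder();
            for (Object field : row) {
                line.append(field).append('\t');
            }
            System.out.println(line.toString().trim());
            rowsLogged++;
        }

        public static void main(String[] args) {
            RowLimitedLogger logger = new RowLimitedLogger(5); // keep only the first 5 rows
            for (int i = 0; i < 1_000_000; i++) {
                logger.maybeLog(new Object[] { "key-" + i, "value-" + i });
            }
        }
    }

With the limit set to 5 this prints five lines instead of a million, which is the difference between a readable TaskTracker log and the half a gigabyte of log files I mentioned above.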