I have written code that outputs each word and its frequency of occurrence in task1-input1.txt, excluding the stop words listed in stopwords.txt.
public class TopKCommonWords {
public static class TokenizerMapper
extends Mapper
}
These are my arguments.
I understand that by changing
FileInputFormat.addInputPath(job, new Path(args[0]));
from args[0] to args[1], I can get the words and their frequencies of occurrence in task1-input2.txt.
For example, my outputs contain these occurrence counts:
task1-input1:    task1-input2:
coffee 3         coffee 2
happy 10         good 3
good 6           sweet 5
How can I compare these two outputs and return only the words common to both, each with the lower of its two frequencies?
The expected result should be:
If you want to sum word counts across all files, you don't need to combine the output files. Instead, you can call FileInputFormat.addInputPath once per input file, or use the MultipleInputs class if each file needs its own mapper.
Alternatively, you should be able to pass the input folder as an argument to read all files within it.
If you want to find the word with the minimum count per file, you'll need a second reducer.
You already have the output location as a variable:
Path output1 = new Path(args[3]);
FileOutputFormat.setOutputPath(job, output1);
So create another job that reads from that location.
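To make the second pass concrete, here is a minimal plain-Java sketch of the logic it would need, assuming the goal is the one from the question: keep only the words found in both outputs, each with the smaller of its two counts. It uses no Hadoop types. The class name CommonWordsSketch and the method commonWithMinCount are made up for illustration, and in a real job this logic would sit in the second job's reducer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the second-pass logic (hypothetical names, no Hadoop types).
class CommonWordsSketch {

    // Keeps only words present in both count maps, paired with the smaller count.
    static Map<String, Integer> commonWithMinCount(Map<String, Integer> first,
                                                   Map<String, Integer> second) {
        Map<String, Integer> result = new TreeMap<>(); // sorted by word for stable output
        for (Map.Entry<String, Integer> entry : first.entrySet()) {
            Integer other = second.get(entry.getKey());
            if (other != null) { // the word occurs in both inputs
                result.put(entry.getKey(), Math.min(entry.getValue(), other));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample counts taken from the question.
        Map<String, Integer> input1 = new HashMap<>();
        input1.put("coffee", 3);
        input1.put("happy", 10);
        input1.put("good", 6);

        Map<String, Integer> input2 = new HashMap<>();
        input2.put("coffee", 2);
        input2.put("good", 3);
        input2.put("sweet", 5);

        System.out.println(commonWithMinCount(input1, input2));
    }
}
```

With the sample counts from the question, this prints {coffee=2, good=3}.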
However, you might be able to use only one job if you use a Combiner to do the word count and use the filename as your key.
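One possible reading of the single-job idea, sketched in plain Java rather than Hadoop code: tag every word with the file it came from, count per (word, file) pair, then keep the words seen in every file along with their smallest count. The class SingleJobSketch, the method commonMinCounts, and the {fileName, word} record shape are all assumptions for illustration. In a real mapper the file name would have to be taken from the input split.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java simulation of a single pass over file-tagged words (hypothetical names).
class SingleJobSketch {

    // Each record is {fileName, word}, i.e. what a mapper could emit per token.
    static Map<String, Integer> commonMinCounts(List<String[]> records) {
        // word -> (fileName -> count); stands in for the map + combine output
        Map<String, Map<String, Integer>> countsByWord = new TreeMap<>();
        Set<String> files = new HashSet<>();
        for (String[] record : records) {
            String fileName = record[0];
            String word = record[1];
            files.add(fileName);
            countsByWord.computeIfAbsent(word, w -> new HashMap<>())
                        .merge(fileName, 1, Integer::sum);
        }

        // "Reduce" step: keep words seen in every file, with their smallest count.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : countsByWord.entrySet()) {
            Map<String, Integer> perFile = entry.getValue();
            if (perFile.size() == files.size()) {
                result.put(entry.getKey(), Collections.min(perFile.values()));
            }
        }
        return result;
    }
}
```

The trade-off is that the reduce step has to know how many input files there are, which the sketch infers from the records themselves.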