k-most frequent lines in a log file

I was looking for small problems to solve to get some programming practice and gsathya suggested that I try solving this.

I immediately started wondering about a good way to solve this. Something I recently read about UNIX tools came to mind. I conjectured that I could compose a solution using some arcane permutation of uniq, sort and head to achieve this. It turned out that I was right: this excellent answer has all the details of the UNIX approach of solving this problem.

I thought this may be an educational problem to solve by writing a (non-shell) program to do the same thing.

As soon as I see the word “frequency”, I start thinking whether using a hash-table is a viable option.

A hash-table like { line : freq_of_occurence , ... } can be used to contain all the strings of the log file mapped to their frequency.

for line in file:
    if line exists in hash_table:
        increment its frequency
    else
        set the its frequency to 1

finally return the hash_table

The above pseudo-Python describes what we need to do.

Before we go any further, I’d like to mention that we are using the Core library. This is different from OCaml’s standard library. This page has some installation instructions.

You should be able to do the following in utop:

utop[85]> #require "core";;
utop[86]> open Core.Std;;

Please refer to Real World OCaml if you run into any trouble, and surely let me know if any part of the tutorial is not as helpful as it could be.

Let’s get started coding!

We need a way of getting the lines of the log file. The function In_channel.read_lines takes a filepath as an argument and returns a list of strings.

The idiomatic way of iterating over collections in OCaml is by using its iter function. We can use this to iterate over the list of strings returned by In_channel.read_lines.

Let’s inspect the type of List.iter in the REPL.

utop[24]> List.iter ;;
- : 'a list -> f:('a -> unit) -> unit = <fun>
(*   ^-- (1)     ^---- (2)        ^--- (3) *)

a list containing elements of type 'a. In other words, a list containing elements of any type.
a function that takes an argument of type 'a
unit is represented as () in code and is like void.
-> means “returns”.
f: is just the label of the second argument.

The above type signature means that List.iter is a function that takes two arguments: arg 1 is a list containing elements of type 'a and arg 2 is a function that takes an argument of type 'a returning unit. The unit on the extreme right is the return type of List.iter.

Let us iteratively figure out how to iterate over a list.

utop[30]> List.iter ;;
- : 'a list -> f:('a -> unit) -> unit = <fun>

We know what that type signature means. Let’s see what happens when we give it a list as an argument. This function normally expects a List and a function as an argument, but we’re only giving it one.

Supplying a function that takes n arguments with a different number of arguments is usually an error in other languages. Let’s see how OCaml behaves:

utop[29]> List.iter [1;2;3;4;5] ;;
- : f:(int -> unit) -> unit = <fun>
(*   ^----(1)           ^-- (2) *)

a function from int to unit.
The final return value.

Currying is what is happening here.

Let’s write a simple function that from int to unit that we could possibly use as the second argument.

utop[34]> let print_integer x = printf "%d\n" x;;
val print_integer : int -> unit = <fun>

utop[36]> print_integer 1337;;
1337
- : unit = ()

We can use this function to print out all the elements of the list. We use ~f: as the name of the label.

utop[38]> List.iter [1;2;3;4;5] ~f:print_integer ;;
1
2
3
4
5
- : unit = ()

It turns out that for one-off use, it’s more convenient to use an anonymous function. Anonymous functions (or lambdas) are defined using fun.

utop[40]> (fun x -> x * x);;
- : int -> int = <fun>
(* type signature of a function from int to int *)
utop[41]> (fun x -> x * x) 16;;
- : int = 256
(* anonymous function that squares its argument is passed 16 *)

Therefore, this:

utop[39]> List.iter [1;2;3;4;5] ~f:(fun x -> printf "%d\n" x) ;;

should work just as nicely.

We know how to generate a list containing the lines of the log-file, and how to iterate over a list. We need to figure out what to do with each line of the list. That’s what should be in the body of the function argument to List.iter.

In the skeleton below, we create a Hashtbl, fill it up and then return it.

let generate_frequency_table file_path =
  let frequency_table = Hashtbl.Poly.create () in
  List.iter 
    ~f:(fun line ->
            (* code to handle each line *)
       )
    (In_channel.read_lines file_path);
  frequency_table (* this hash-table is returned by the function *)
;;

Let’s digress and discuss how hash-tables work in a language like Python.

In [1]: table = {1 : "one", 2 : "two", 3 : "three" }

In [2]: table[1]
Out[2]: 'one'

In [5]: table.get(1)
Out[5]: 'one'

In [4]: table.get(4)
# this returns None, because 4 isn't a key in the hash-table

So, for a hash-table {KeyType : ValueType}, get(KeyType) returns the KeyType’s associated ValueType or None if it doesn’t exist.

You just can’t have a function in OCaml that returns type A in some cases, and another type B in other cases. How OCaml gets around this is by using Option types. For an enlightening discussion of user defined types please read this.

The option type in OCaml is predefined like this:

type 'a option = Some of 'a | None

option is a polymorphic type, which means that 'a could be any type, just like how the list type can hold elements of type 'a. This means that a list can contain elements of any type, as long as they are all of the same type.

We can think of the type int option as: “it is either an int or None”.

We concern ourself with understanding option types because a function we use, Hashtbl.find returns an option type, as we can see below.

utop[42]> Hashtbl.find ;;
- : ('a, 'b) Hashtbl.t -> 'a -> 'b option = <fun>

(* - : ('KeyType, ValueType) Hashtbl.t -> 'KeyType -> 'ValueType option = fun *)

Hashtbl.find takes a Hashtbl.t and a 'KeyType and returns a 'ValueType option.

utop[75]> let table = Hashtbl.Poly.create () ;;
val table : ('_a, '_b) Hashtbl.t = <abstr>

utop[82]> Hashtbl.add table ~key:"two" ~data:2 ;;
- : [ `Duplicate | `Ok ] = `Ok

utop[83]> Hashtbl.add table ~key:"one" ~data:1 ;;
- : [ `Duplicate | `Ok ] = `Ok

utop[84]> Hashtbl.find table "two" ;;
- : int option = Some 2

Now that we understand the basics of Hashtbl, please read this so that the following code is trivial to understand.

let current_count = 
   match Hashtbl.find frequency_table line with
     | None -> 0
     | Some freq -> freq
   in Hashtbl.replace frequency_table ~key:line ~data:(current_count + 1)

Whew! After a lot of digressions, we finally arrive at the full function. I hope that its decomposition was easy to understand.

let generate_frequency_table file_path =
  let frequency_table = Hashtbl.Poly.create () in
  List.iter 
    ~f:(fun line ->
      let current_count = 
        match Hashtbl.find frequency_table line with
          | None -> 0
          | Some freq -> freq
      in Hashtbl.replace frequency_table ~key:line ~data:(current_count + 1))
    (In_channel.read_lines file_path);
  frequency_table
;;

Okay. Now we’ve got a hash-table containing the lines and the frequencies. We need to find a way of getting the most frequent lines from this map. Turns out that maps aren’t that great for k-th maximum calculations.

We have two choices before us:

We can load all the (Line, Freq) tuples into an array, sort it in descending order and then take the first k elements. Sorting is O(n log n), but it does more work than necessary, as we only need k most-frequent elements.
We can use heaps.

Heaps are better with the assumption that k is much smaller than the number of lines n in the log file. This is because the maximum (or minimum, if we are using a min-heap) element is always guaranteed to be at the top of the queue. Once this maximum is removed, the heap is readjusted so that the next maximum element becomes the top most element. Readjusting a heap k times might be better than having to sort a possibly very large vector of (Line, Freq) tuples.

Normally tuple comparison happens lexicographically, i.e. like a dictionary.

Lexicographically speaking: "be" > "abracadabra", because we start by comparing characters in corresponding places. We compare the corresponding characters in the next position, only if there is a tie.

We need a function that compares the second (integral) element of two 2-tuples.

let tuple_greater_than a b =
  let a_snd = Tuple.T2.get2 a in
  let b_snd = Tuple.T2.get2 b in
  a_snd - b_snd
;;

You might be wondering why we don’t do a_snd > b_snd instead of finding the difference and returning a numeric value. This is because the function passed as an argument to order a heap has the type 'a -> 'a -> int.

utop[62]> Hash_heap.Heap.create ;;
- : ?min_size:int -> cmp:('a -> 'a -> int) -> unit -> 'a Heap.Removable.t = <fun>
(*                      ^--- This is the comparison function *)

The function labeled with cmp takes two elements of a (polymorphic) heap and returns an integer. As we are storing 2-tuples in the heap, tuple_greater_than is defined so that it can operate on that type.

A naïve implementation that I first came up with loaded all the (Line, Freq) tuples into a max-heap, and then removed the top k elements. This approach is bound to spend up a lot of time in readjusting a large heap; a O(log n) operation, where n is the number of unique lines. Surely not optimal.

While discussing this problem, gsathya whipped up a quick implementation in Python, the distinctive feature of which is that he uses a min-heap that only holds k elements. We can restrict the size of the min-heap to k elements, and then only insert elements into it if the current element is greater than the minimum.

This min-heap of size k will eventually hold the k largest elements, i.e. at the end of the operation it will hold the kth largest element as the top most element and all the other elements greater it.

Let’s see how we generate this k sized min-heap. The iter functions defined for the various collection types, iterate over all the elements. We need to find a way to take k elements from the frequency_table and then insert them into the min-heap. Let’s discuss the code we use to achieve this:

  let k_heap = Hash_heap.Heap.create ~cmp:tuple_greater_than () in (* 1 *)
  Hashtbl.keys frequency_table                                     (* 2 *)
  |> Sequence.of_list                                              (* 3 *)
  |> (fun seq -> Sequence.take seq k)                              (* 4 *)
  |> Sequence.iter                                                 (* 5 *)
      ~f:(fun line ->
        let freq = Hashtbl.find_exn frequency_table line in
        Hash_heap.Heap.add k_heap (line, freq);                    (* 6 *)
        Hashtbl.remove frequency_table line);

Creates a Heap.
Generates a list of keys
Converts the list to a Sequence. Using sequences should hopefully be better than passing massive lists from one function to another.
Ideally, we’d be able to write |> Sequence.take k, but because that function hasn’t been defined with labeled arguments, we have to use a trivial lambda to explicitly specify the order.
We iter over every line in this k sized sequence.
We know that Hashtbl.find returns an option, but as we are sure that the associated key exists, we use Hashtbl.find_exn to find and “unwrap” the option value returned. Hashtbl.find_exn raises an exception if the key is not found, but we don’t have to worry about that here. We add the (Line, Freq) tuple to the heap, and finally remove the line from the hash-table.

Instead of using pipes, we could rewrite the above chunk of code like this:

(Sequence.iter
  (Sequence.take
    (Sequence.of_list
      (Hashtbl.keys frequency_table))
    k)
  ~f:(fun line -> (* ... *)))

This looks Lispier, but I prefer the version with pipes as that is more idiomatic OCaml. In the code shown above we have to understand what’s happening from the inside out, i.e. from the innermost function call. Using the pipe operator we can comprehend the code more naturally from the outside in, actually following the sequence of the flow of data.

We have to iterate over the remaining lines in the hash-table and if we find a line that has a frequency greater than frequency of the top element in the min-heap, we remove the top element, and add a new tuple.

We end up with a min-heap containing the k-most frequent lines. However, we want to print the lines in descending order, so we create a new max-heap from which we can remove the elements one by one. Notice that we use tuple_less_than instead of tuple_greater_than.

let reversed_heap = Hash_heap.Heap.create ~cmp:tuple_less_than ()
in Hash_heap.Heap.iter k_heap
~f:(fun tuple -> Hash_heap.Heap.add reversed_heap tuple);

The full code for generate_heap is given below:

let generate_heap frequency_table k =
  let k_heap = Hash_heap.Heap.create ~cmp:tuple_greater_than () in

  Hashtbl.keys frequency_table
  |> (fun seq -> Sequence.take seq k)
  |> Sequence.iter
      ~f:(fun line ->
        let freq = Hashtbl.find_exn frequency_table line in
        Hash_heap.Heap.add k_heap (line, freq);
        Hashtbl.remove frequency_table line);

  Hashtbl.iter frequency_table
    ~f:(fun ~key:line ~data:freq ->
      if freq > Tuple2.get2 (Option.value_exn (Hash_heap.Heap.top k_heap))
      then Hash_heap.Heap.add k_heap (line, freq)
      else ());

  let reversed_heap = Hash_heap.Heap.create ~cmp:tuple_less_than ()
  in Hash_heap.Heap.iter k_heap
  ~f:(fun tuple -> Hash_heap.Heap.add reversed_heap tuple);

  reversed_heap
;;

The last bit of code to discuss, is the code that deals with removing the top element from the max heap.

for _i = 1 to num_pops do
  let top = Option.value_exn (Hash_heap.Heap.pop heap) in
  printf "%d : %s\n" (Tuple2.get2 top) (Tuple2.get1 top);
done

We remove the top element num_pops times and because we are sure that the element exists we can Option.value_exn to unwrap the return value of Hash_heap.Heap.pop.

That’s about it, really. You should be able to understand the the complete implementation of the problem here. There’s also an earlier C++ version which makes heavy use of the standard libary.

Thanks for your patience in reading this rather long walkthrough. I hope you were able to gain something from it.

k-most frequent lines in a log file

- Samuel Chase