(blog 'lucindo)

one day I'll learn to program

Indexing with CL

Inspired by Gleicon's post, which uses Ferret (Ruby and C) to index the Linux kernel, I decided to do the same in CL.

This code uses Montezuma, a port of Ferret to Common Lisp (it is 100% CL):

(eval-when (:compile-toplevel :load-toplevel :execute)
  (require :montezuma)
  ;; cl-fad provides walk-directory, which add-dir-to-index calls directly
  (require :cl-fad))

(defpackage :montezuma-test
  (:use :cl)
  (:export #:add-dir-to-index #:search-index))

(in-package :montezuma-test)

;; reads the whole file into a single string; this may not be the fastest way
(defun slurp-file (filename)
  (with-open-file (stream filename :direction :input)
    (let ((seq (make-string (file-length stream))))
      (read-sequence seq stream)
      seq)))

(defparameter *index* (make-instance 'montezuma:index
                                     :path "/tmp/montezuma-test"))

(defun add-dir-to-index (dir-name)
  (cl-fad:walk-directory
   dir-name
   #'(lambda (file)
       (ignore-errors
         (montezuma:add-document-to-index
          *index* `(("file" . ,(princ-to-string file))
                    ("content" . ,(slurp-file file))))))))

(defun search-index (keyword)
  (montezuma:search-each
   *index*
   (concatenate 'string "content:" keyword)
   #'(lambda (doc score)
       ;; ~% emits the newline; ~n is not a FORMAT directive
       (format t "~a score: ~a~%"
               (montezuma:document-value
                (montezuma:get-document *index* doc)
                "file")
               score))))
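
A side note on slurp-file: it sizes the string with file-length, and the string can end up longer than the text actually read when the file contains multi-byte characters (assuming, as in SBCL, that file-length on a character stream counts bytes). The sketch below is one possible variant that keeps only the characters read. The name slurp-file-trimmed and the external-format keyword are illustrative and not part of the original code.

```lisp
;; Sketch only: SLURP-FILE-TRIMMED is a hypothetical name, not from the post.
(defun slurp-file-trimmed (filename &key (external-format :default))
  "Return the contents of FILENAME as a string holding only the characters read."
  (with-open-file (stream filename :direction :input
                                   :external-format external-format)
    ;; Assumption: FILE-LENGTH is an upper bound on the number of characters,
    ;; so the buffer may be larger than the decoded text.
    (let* ((buffer (make-string (file-length stream)))
           ;; READ-SEQUENCE returns the index of the first element it did not fill.
           (end (read-sequence buffer stream)))
      (subseq buffer 0 end))))
```

It could be dropped into add-dir-to-index in place of slurp-file. Note that ignore-errors there would still hide any decoding error, so files that fail to decode are skipped without any message.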


A small test (using only the .c and .h files):

$ wget http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.23.1.tar.bz2
$ tar jxf linux-2.6.23.1.tar.bz2
$ find linux-2.6.23.1 -type f -not \( -name "*.c" -o -name "*.h" \) -exec rm {} \;
$ find linux-2.6.23.1 -type f | wc -l
18438
$ du -hs linux-2.6.23.1
257M linux-2.6.23.1
$ sbcl --noinform --no-linedit
* (load (compile-file "montezuma-test.lisp"))
....
T
* (time (montezuma-test:add-dir-to-index "/home/lucindo/linux-2.6.23.1"))
Heap exhausted during garbage collection: 264 bytes available, 520 requested.
Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB LUB !move Alloc Waste Trig WP GCs Mem-age
0: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
1: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
2: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
3: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
4: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
5: 73866 73908 0 0 63308 1533 87 186 0 266268432 438512 232094792 0 3 0.8641
6: 0 0 0 0 5781 0 0 0 0 23678976 0 2000000 5628 0 0.0000
Total bytes allocated=536069368
fatal error encountered in SBCL pid 6239(tid 3085203120):
Heap exhausted, game over.
LDB monitor
ldb> quit

The machine has 512 MB of RAM. SBCL couldn't handle it :(
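
The crash output reports "Total bytes allocated=536069368", which is close to 512 MB, so one plausible reading (a guess from the output, not something verified here) is that SBCL reached its heap limit, the dynamic space. As a rough sketch, assuming an SBCL recent enough to export sb-ext:dynamic-space-size, the configured limit can be checked from the REPL. The helper name report-heap-limit is hypothetical.

```lisp
;; Sketch only: REPORT-HEAP-LIMIT is a hypothetical helper, not from the post.
(defun report-heap-limit ()
  "Print and return the heap limit in megabytes on SBCL, or NIL elsewhere."
  (let ((megabytes #+sbcl (floor (sb-ext:dynamic-space-size) (* 1024 1024))
                   #-sbcl nil))
    (if megabytes
        (format t "heap limit: ~d MB~%" megabytes)
        (format t "heap limit: unknown on this implementation~%"))
    megabytes))
```

The limit itself is set when SBCL starts, through the runtime option --dynamic-space-size, which takes a value in megabytes. On a machine with 512 MB of RAM, a larger heap would probably just swap the crash for heavy paging.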

Source: montezuma-test.lisp


