-
Notifications
You must be signed in to change notification settings - Fork 0
alamar/clojure-xml-stream
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Hi *! I've tried a few searches on parsing XML files larger than memory, didn't find anything and wrote a simple framework for parsing XML via STAX to lazy sequence of defrecords. It is therefore capable of reading several GB of xml without much problems. It is quite declarative but also quite ugly. Take a peek: (technical babble after the fold) $ git clone git://github.com/alamar/clojure-xml-stream.git $ cd clojure-xml-stream $ ant It turns this completely-invented XML: <ground> <tree-species> <tree id="1"><name>Pine</name></tree> <tree id="2"><name>Birch</name></tree> <tree id="4"><name>Palmtree</name></tree> </tree-species> <forests> <forest id="1"> <name>Red Forest</name> <trees> <tree refid="1"><branch direction="left"/><branch direction="south"/></tree> <tree refid="2"><branch direction="right"/><branch direction="south"/><branch direction="west"/></tree> <tree refid="1"><branch direction="southwest"/></tree> </trees> </forest> <forest id="2"> <name>Dark Wood</name> <trees> <tree refid="2"><branch direction="right"/><branch direction="south"/><branch direction="west"/></tree> <tree refid="4"><branch direction="northwest"/></tree> </trees> </forest> </forests> </ground> into a lazy sequence of: #:example.TreeSpecies{:id 1, :name Pine} #:example.TreeSpecies{:id 2, :name Birch} #:example.TreeSpecies{:id 4, :name Palmtree} #:example.Forest{:id 1, :trees [#:example.Tree{:species-id 1, :branches (:left :south)} #:example.Tree{:species-id 2, :branches (:right :south :west)} #:example.Tree{:species-id 1, :branches (:southwest)}], :name Red Forest} #:example.Forest{:id 2, :trees [#:example.Tree{:species-id 2, :branches (:right :south :west)} #:example.Tree{:species-id 4, :branches (:northwest)}], :name Dark Wood} using this code: (defrecord TreeSpecies [id name]) (defrecord Forest [id trees name]) (defrecord Tree [species-id branches]) (defmulti ground-element class-element) (defmulti tree-element class-element) (defmethod ground-element :tree [stream-reader _] (TreeSpecies. (attribute-value stream-reader "id") nil)) (defmethod ground-element [:TreeSpecies :name] [stream-reader tree] (assoc tree :name (element-text stream-reader))) (defmethod ground-element :forest [stream-reader _] (Forest. (attribute-value stream-reader "id") [] nil)) (defmethod ground-element [:Forest :name] [stream-reader forest] (assoc forest :name (element-text stream-reader))) (defmethod ground-element [:Forest :tree] [stream-reader forest] (assoc forest :trees (conj (:trees forest) (Tree. (attribute-value stream-reader "refid") (dispatch-partial stream-reader tree-element))))) (defmethod tree-element :branch [stream-reader _] (keyword (attribute-value stream-reader "direction"))) (defmethod ground-element :default [stream-reader item] nil) (defmethod tree-element :default [stream-reader item] nil) (defn run [path] (with-open [input-stream (FileInputStream. path)] (let [objects (parse-dispatch input-stream ground-element)] (doseq [object objects] (println object))))) How it works: it reads elements and calls a method with the :ElementName If the method returns a record, it stuffs anything found in that element into this record. It can handle nested structures because it can parse subtrees (there is an example in code). The handler have to read events from stax (to get text nodes, for example), the only limitation is that handler should never iterate past END_ELEMENT of the element it was called on (or the parser would become confused). The syntax and the way I call assoc seem ugly to me, so all suggestions are welcome. Suggestions on naming and general architecture are welcome too. Maybe this can grow into something generally usable. Feel free to fork, use and complain.
About
Stream XML parsing API for clojure
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published