Commit

simpler way of adding a list of terms
fixes #25
jtauber committed Feb 7, 2024
1 parent 38fce72 commit c67b783
Showing 3 changed files with 31 additions and 0 deletions.
17 changes: 17 additions & 0 deletions README.md
@@ -98,6 +98,23 @@ And here is an example with a two-level hierarchy:

Note that if the `count` is `1` you can omit it.

Entire lists of tokens can be added for a particular address in one go using `add(address, term_list)`:

```python
>>> import termdoc
>>> c = termdoc.HTDM()
>>> c.add("1.1", ["foo", "bar", "bar", "baz"])
>>> c.add("1.2", ["foo", "foo"])
>>> c.get_counts()["bar"]
2
>>> c.get_counts()["foo"]
3
>>> c.get_counts("1.2")["foo"]
2

```

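To make the roll-up behaviour in the example above concrete, here is a minimal, self-contained sketch. `TinyHTDM` is a hypothetical stand-in written only for illustration; it is not termdoc's implementation, and it assumes that a count recorded at an address is also credited to every ancestor prefix and to the root (`""`), which is what the outputs above suggest.

```python
from collections import Counter, defaultdict


class TinyHTDM:
    """Hypothetical stand-in for illustration; NOT termdoc's actual code."""

    def __init__(self, address_sep="."):
        self.address_sep = address_sep
        # Maps each address prefix ("" is the root) to its term counts.
        self.counts = defaultdict(Counter)

    def increment_count(self, address, term, count=1):
        parts = address.split(self.address_sep) if address else []
        # Credit the root (depth 0), each ancestor, and the address itself.
        for depth in range(len(parts) + 1):
            prefix = self.address_sep.join(parts[:depth])
            self.counts[prefix][term] += count

    def add(self, address, term_list):
        # Same shape as the method in this commit: one increment per term.
        for term in term_list:
            self.increment_count(address, term)

    def get_counts(self, prefix=""):
        return self.counts[prefix]


c = TinyHTDM()
c.add("1.1", ["foo", "bar", "bar", "baz"])
c.add("1.2", ["foo", "foo"])
```

With this sketch, `c.get_counts("1")["foo"]` aggregates both leaves under `1`, while `c.get_counts("1.2")["foo"]` only sees the terms added at `1.2`.
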

You can **prune** a HTDM to just `n` levels with the method `prune(n)`.

You can iterate over the document-term counts at the leaves of the HTDM with the method `leaf_entries()` (this returns a generator yielding `(document_address, term, count)` tuples). This is effectively a traditional TDM (the document IDs will still reflect the hierarchy but the aggregate counts aren't present).
4 changes: 4 additions & 0 deletions termdoc/htdm.py
@@ -42,6 +42,10 @@ def increment_count(self, address, term, count=1):
address = self.address_sep.join(address.split(self.address_sep)[:-1])
first = False

    def add(self, address, term_list):
        for term in term_list:
            self.increment_count(address, term)

    def load(self, filename, field_sep="\t", address_sep=None, prefix=None):
        address_sep = address_sep or self.address_sep
        with open(filename) as f:
10 changes: 10 additions & 0 deletions tests.py
@@ -336,6 +336,16 @@ def test_two_arg_increment_count(self):
        self.assertEqual(c.get_counts()["foo"], 3)
        self.assertEqual(c.get_counts()["bar"], 3)

    def test_add(self):
        import termdoc

        c = termdoc.HTDM()
        c.add("1", ["foo", "bar", "bar", "baz"])
        c.add("2", ["foo", "foo", "bar"])
        self.assertEqual(c.get_counts()["foo"], 3)
        self.assertEqual(c.get_counts("2")["foo"], 2)
        self.assertEqual(c.get_counts("1")["bar"], 2)


if __name__ == "__main__":
    unittest.main()
