Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

take/takeWhile iterators for Counttables #13428

Closed
wants to merge 4 commits into from
Closed

take/takeWhile iterators for Counttables #13428

wants to merge 4 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Feb 18, 2020

I was trying to write an implementation for word frequency but didn't come up with an elegant solution to print the 10 most frequent words. See also discussions in the Nim forum Idiomatic sequence functions. I thought it would be best to extend the tables library itself as in there we have direct access to the internal data structure of CountTables.

The word frequency code could be more succinctly written as:

import tables, strutils, sequtils, httpclient, sugar

var client = newHttpClient()
var text = client.getContent("https://www.gutenberg.org/files/135/135-0.txt")
 
var wordFrequencies = text.toLowerAscii.splitWhitespace.toCountTable
wordFrequencies.sort
for (word, count) in wordFrequencies.take(10):
  echo alignLeft($count, 8), word
#[
echo "--------------------"
for (word, count) in wordFrequencies.takeWhile((key, value) => value > 1000):
  echo alignLeft($count, 8), word
echo "--------------------"
for (word, count) in wordFrequencies.takeWhile((key, value) => key != "will"):
  echo alignLeft($count, 8), word
]#

The essential part being wordFrequencies.take(10). The resulting binary would be more efficient, too.

I think the PR is not complete. There should be the same functions for CountTableRefs and maybe other table types. Although I think take/takeWhile make most sense for CountTables. Maybe also take and takeWhile procs that return sequences had to be added. I am not sure about that though. Just let me know what you think of it and if need be how I should adapt the PR.

@andreaferretti
Copy link
Collaborator

Not that take and friends for tables aren't useful, but I think this is the same as

count = 0
for (word, count) in wordFrequencies:
  echo alignLeft($count, 8), word
  inc(count)
  if count == 10:
    break

It does not seem to need access the internal data structures of tables

@andreaferretti
Copy link
Collaborator

What would actually need access to the internal structure would be sort + take together. There is no need to sort the whole table just to figure out the top 10 entries. Python Counter has .most_frequent(k), which I guess is what we would need here

@disruptek
Copy link
Contributor

Seems pretty confusing to name this take when we have take procs for other tables.

If you really want to champion this PR, you will need to add a changelog entry, a {.since.} annotation, and tests -- which is probably where you'll discover the reason for the proc's omission from CountTable.

@disruptek
Copy link
Contributor

I think you have to start with the test, but consider that you can just sort the CountTable and then take largest or smallest ten times, assuming a take that matches that of other tables works...

@timotheecour
Copy link
Member

timotheecour commented Mar 7, 2020

take needs to be generic (work with any iterator), or not exist at all, otherwise it's just cruft. There's no reason to have a take that works with CountTable but not other types eg OrderedTable, TableRef, CountTableRef + countless other types in other modules, and adding all those procs is a bad idea for obvious reasons.

this works generically:

import macros
# `macro takeN(x: ForLoopStmt, n: int): untyped =` is currently not allowed but could be
macro takeN(x: ForLoopStmt): untyped =
  var body = x[^1]
  let inputs = x[^2][1]
  doAssert inputs.len == 2
  let elems = inputs[0]
  let maxIter = inputs[1]
  let count = genSym(nskVar, "count")
  expectKind x, nnkForStmt
  body.add quote do:
    `count`.dec
    if `count` == 0: break
  var newFor = newTree(nnkForStmt)
  for i in 0..x.len-3: newFor.add x[i]
  newFor.add elems
  newFor.add body
  result = quote do:
    var `count` = `maxIter`
    if `count` > 0: `newFor`
  echo result.repr

when isMainModule:
  {.push experimental: "forLoopMacros".}
  import tables, sequtils
  proc main =
    var t = toTable({1:1, 2:2, 3:3, 4:4})
    echo t
    for b in takeN (pairs(t), 1+1): echo b
    for a,b in takeN (pairs(t), 1+1): echo (k: a, v:b)
    # echo toSeq(takeN (pairs(t), 2)) # limitation: we can't compose
  main()

allowing for b in takeN (pairs(myTable), 3): echo b
but doesn't "compose" (see toSeq(takeN (pairs(t), 2)) comment); for that we'll need #11992

@ghost ghost closed this Mar 13, 2020
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants