
How Should We Implement Set and a first look at how potential API changes impact code. #84

Open
milroc opened this issue Jan 4, 2013 · 18 comments


@milroc
Contributor

milroc commented Jan 4, 2013

Hey all,

I was trying to test out some stuff from #80, and rather than changing up everyone's work on the Matrix data structure, I thought I'd give Set a try. The reason this is not a PR is that there is still work to be done (namely testing and potential problems below).

problems with Set currently:

  1. Should you store an array of elements in the set?
    (rather than creating it from Object.keys() each time the values are necessary).
  2. Can a set contain any arbitrary shallow js object?
    (If not this should be redeveloped to work more efficiently with numbers only).
    My bias is for Set to work with JSON (potentially too large of an overhead)
    and objects developed within the library. Set will be primarily useful for
    working with numbers, but has a lot of strength as a basic data structure.
    (I use it all the time in python)
  3. Should this extend to deep objects (arrays), and should deep equality checks occur?
    This becomes problematic when working with the methods that accept input of multiple
    types (basically, a developer would have to wrap any array they'd like to add to a set
    in another array, and the functions below would need to be changed).
  4. Should remove be part of the method chaining and not return true | false?
    My recommendation is to have it be part of method chaining and then have the static method return T/F.
    There's some code repetition but it helps keep the API following a similar structure.
    And if someone needs to determine whether all of the elements were in the set, they could do:
    if (numbers.set.remove(A, [1, 2, 3]).reduce(function(a, b) { return !!a && !!b; })) { ... }
    while the chaining version would simply be:
    A.remove([1, 2, 3]);
  5. Should Set be an extension of the Array object? If so, we will likely run into insertion
    complications (a self-balancing binary search tree or a trie, handled in a JavaScript array,
    could cause a lot of array recreation). It might be possible to extend the Array object and
    contain similar logic to that laid out below (see 1). This is also useful because we could
    implement map, reduce, splice, join, etc. on the set with minimal effort.
  6. Any //TODO seen in the commit linked below.
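For reference, a minimal sketch of the object-keyed approach from point 1 (names here are hypothetical, not the actual implementation): members stored as object keys make `has` an O(1) property lookup, at the cost of rebuilding the array in `toArray`.

```javascript
// Hypothetical sketch: members stored as object keys,
// with the array rebuilt from Object.keys() on demand (point 1).
function SimpleSet() {
  this._data = {}; // member -> true; keys are coerced to strings
}

SimpleSet.prototype.add = function (value) {
  this._data[value] = true;
  return this; // chainable, per point 4
};

SimpleSet.prototype.has = function (value) {
  return Object.prototype.hasOwnProperty.call(this._data, value);
};

SimpleSet.prototype.toArray = function () {
  // rebuilt each time the values are needed; fine for numbers,
  // lossy for anything whose toString() doesn't round-trip
  return Object.keys(this._data).map(Number);
};

var s = new SimpleSet().add(1).add(2).add(2);
// s.has(2) -> true; s.toArray() -> [1, 2]
```

This is the version that is fast for membership checks but pays in `toArray` and restricts members to string-keyable values, which is exactly the tension in points 1 and 2.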
@milroc
Contributor Author

milroc commented Jan 4, 2013

milroc@bdc926d

@ethanresnick
Contributor

Thanks for putting together a real preliminary implementation—I think it'll help clarify a lot of the things we were discussing in the abstract in #80. I'll go through this later today or tomorrow and also catch up on the other issue.

@milroc
Contributor Author

milroc commented Jan 5, 2013

It should be noted that not all of them are implemented (refactoring, for example). We also might consider collapsing everything we put into core into numbers itself: numbers.core.sum([0,1,2,3]) becomes numbers.sum([0,1,2,3])
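A hypothetical sketch of that flattening (the `sum` body and names are illustrative, not the library's actual code): hoist every `numbers.core` helper onto the top-level `numbers` namespace.

```javascript
// Hypothetical sketch of collapsing numbers.core into numbers itself.
var numbers = { core: {} };

numbers.core.sum = function (arr) {
  var total = 0;
  for (var i = 0; i < arr.length; i++) total += arr[i];
  return total;
};

// Hoist every core helper onto the top-level namespace.
Object.keys(numbers.core).forEach(function (name) {
  numbers[name] = numbers.core[name];
});

// numbers.sum([0, 1, 2, 3]) -> 6, same function as numbers.core.sum
```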

@ethanresnick
Contributor

Took a stab at implementing Set: https://github.com/ethanresnick/numeric.ly/blob/master/Set.js

Few thoughts:

  • I stored the members as values rather than keys on the Set so that we have the ability to have Sets as members, whereas keys can only be strings/integers. I don't mind that we now have to enforce uniqueness manually, but I'm not sure about the performance implications, which you were mentioning in bullet 5. Is there a faster way to do this? Also, I implemented Set as an extension of Array rather than as Object, in order to get the magic of the length property.

  • Right now arrays being added as members get converted to Sets and primitives get cast to numbers. But this doesn't happen recursively.

  • Can you elaborate on what you had in mind re Sets working with JSON?

  • I changed add/remove to only accept individual members, whereas union and difference expect a Set/array. I think this addresses what you were saying in point 3, if I'm understanding you correctly. It also makes the role of the functions a little clearer.

  • I agree that remove should be part of the method chaining, and it's now implemented like that. As far as checking whether all the elements to be removed (now done through difference) were actually in the Set, we could do that in a static method I guess, but I want to think about it a little more (exhausted now).

@milroc
Contributor Author

milroc commented Jan 7, 2013

Cool,

  • The performance implications I raised in bullet 5 were based on my theoretical understanding of implementing a set on top of an array. If you keep the array sorted on insert (making it behave like a binary search tree), insert stays around O(n log(n)) and search (and delete, which depends on search) is O(log(n)); with something like a hashmap (a dictionary/object plus a hashing function), you get a theoretical O(1) for insert, search, and delete. Those are the theoretically optimal figures, but when you run basic performance comparisons the picture can differ (see jsperf below).
  • I like the change to add/remove taking one element and union/difference taking collections. The point I was making in 3 was more along the lines of developing a hashing function that can support any type of value. For example, hashing an array of 100 numbers would give "Array(0,1,...,99)", and rehashing it would mean complex string parsing, so the question is whether it is worth supporting what Python calls "unhashable types". If we don't support them, we might want to consider overloading add/remove to work like union and difference; that would make it easier for the novice mathematician to work with.
  • Sets working with JSON would be part of that whole hashing support. Since you're not actually hashing the values this makes it easier.
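A rough, hypothetical timing harness for the two lookup strategies (an in-browser jsperf run is more rigorous, but the shape of the comparison is the same; all names here are illustrative):

```javascript
// Compare O(n) array scan vs O(1) object-key lookup for membership tests.
var N = 100000;
var arr = [];
var obj = {};
for (var i = 0; i < N; i++) { arr.push(i); obj[i] = true; }

function time(fn) {
  var start = Date.now();
  fn();
  return Date.now() - start;
}

var hits = 0;
var arrayMs = time(function () {
  // worst case for indexOf: the target is the last element
  for (var k = 0; k < 1000; k++) { if (arr.indexOf(N - 1) !== -1) hits++; }
});
var objectMs = time(function () {
  // hash lookup per check
  for (var k = 0; k < 1000; k++) { if (obj.hasOwnProperty(N - 1)) hits++; }
});
// hits === 2000; objectMs is typically far smaller than arrayMs
```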

It's great that we have the basics of both types of implementation to look at and compare. When we design this, we need to keep three things in mind.

  1. What will our developers' use cases for Sets be? Slightly faster insert/delete, or primarily comparing a value to the set? The main use case, I imagine, will be comparison. Initially they will be moving data in and out of sets, where comparison performance won't be an issue, but it will become one in live code, where a set will most likely be used as a comparison tool.
  2. How will our hashing function affect the performance numbers seen in the tests?
    This should be constant among all implementations, though long strings ("Matrix([[0,1,...n]...m]") may have performance issues for dictionaries/objects that we don't know about (could someone test this? I might in a day or so if no one does).
  3. What will choosing one of the implementations do to the API? Will our developers all assume that any object we have in the API can be written as an array, and if so, will there be confusion if we choose lookup performance over perhaps a more ideal API? I think other people should chime in on this. Going with the API is my initial gut reaction, but there are things that are possible on an array that I'm not sure we should let our developers do with a Set, which we could either override or remove support for.

Based off of the +/- to both I really don't know which one we should go with.

Notes on ECMAScript 6 Set support:
(1) http://www.nczonline.net/blog/2012/09/25/ecmascript-6-collections-part-1-sets/
(2) https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/Set

@ethanresnick
Contributor

Totally agree that we have to optimize for search/comparison, which using an Array doesn't do. At the same time, I don't think trying to hash objects is necessarily the way to go—it introduces too much extra complexity and even the best hashing function will require sacrifices (e.g. no way to make a Set of functions, because hashing the function will destroy the scope).

I wonder if instead the Set could have two internal data stores, an object for primitives and an array for objects. Something like:

/**
 * Returns whether arg is a string or a number, which are the only things that
 * can be preserved losslessly as object keys. -0 is excluded because
 * String(-0) === "0", so it would collide with 0.
 */
function _isPrimitiveData(arg) {
  var type = typeof arg;
  return type === "string" || (type === "number" && 1 / arg !== -Infinity);
}

function has(value) {
  if(_isPrimitiveData(value) ) {
    return this.__primitiveData.hasOwnProperty(value);
  }

  return numbers.util.inArray(value, this.__objectData) !== -1;
}

function add(value) {
  if(_isPrimitiveData(value)) {
    this.__primitiveData[value] = true;
  }
  else {
    this.__objectData[this.__objectData.length] = value;
  }

  //put the value on the array for interoperability
  this[this.length] = value;

  return this;
}

function remove(value) {
  var found = false, internalIndex, publicIndex;

  //remove the value from the appropriate internal collection,
  //noting whether it actually existed there.
  if(_isPrimitiveData(value)) {
    if(this.has(value)) {
      found = true;
      delete this.__primitiveData[value];
    }
  }
  else if((internalIndex = numbers.util.inArray(value, this.__objectData)) !== -1) {
    found = true;
    this.__objectData.splice(internalIndex, 1);
  }

  //remove value from the public array if we found it earlier
  if(found) {
    if((publicIndex = numbers.util.inArray(value, this)) !== -1) {
      this.splice(publicIndex, 1);
    }
  }

  return this;
}

Thoughts?

Obviously some extra overhead in keeping the public array up to date, but that's heaviest in remove, which I think is fine given the added convenience.
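To experiment with the two-store shape standalone, here is a self-contained sketch of the idea above (`inArray` stands in for `numbers.util.inArray`; the class name and wiring are illustrative, not the real code). It also makes visible that object membership is by reference, which comes up below.

```javascript
// Stand-in for numbers.util.inArray: strict-equality linear search.
function inArray(value, arr) {
  for (var i = 0; i < arr.length; i++) if (arr[i] === value) return i;
  return -1;
}

function _isPrimitiveData(arg) {
  var type = typeof arg;
  return type === "string" || (type === "number" && 1 / arg !== -Infinity);
}

function TwoStoreSet() {
  this.__primitiveData = {}; // strings/numbers as keys: O(1) lookup
  this.__objectData = [];    // everything else: O(n) lookup by reference
}

TwoStoreSet.prototype.has = function (value) {
  if (_isPrimitiveData(value)) {
    return this.__primitiveData.hasOwnProperty(value);
  }
  return inArray(value, this.__objectData) !== -1;
};

TwoStoreSet.prototype.add = function (value) {
  if (this.has(value)) return this;
  if (_isPrimitiveData(value)) this.__primitiveData[value] = true;
  else this.__objectData.push(value);
  return this;
};

var s = new TwoStoreSet();
var m = [1, 2];
s.add(3).add(3).add(m);
// s.has(3) -> true; s.has(m) -> true
// s.has([1, 2]) -> false: a different reference, even with equal contents
```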

@milroc
Contributor Author

milroc commented Jan 7, 2013

> x = [0, 1, 2];
[ 0, 1, 2 ]
> y = [0, 1, 2];
[ 0, 1, 2 ]
> z = [2, 3, 4];
[ 2, 3, 4 ]
> a = x;
[ 0, 1, 2 ]
> x === y
false
> y === z
false
> x === a
true

Above shows that comparing two objects with === only works when both references point to the same object in memory; with most primitive values that's fine, but Objects get tricky (and Objects built on top of arrays get even trickier).

We would either need to do:

Run A.isEqual(B)

for any internal type we develop, every time we check for a value (significantly increasing the this.has runtime, and therefore this.add and this.remove as well).
An isEqual for Matrix would be:

//check if basic values (length etc.) are equal, then:
var i, j, l = this.length, m = this[0].length;
for (i = 0; i < l; i++) {
  for (j = 0; j < m; j++) {
    if (typeof this[i][j] === "number") {
      if (this[i][j] !== that[i][j]) return false;
    }
    else if (typeof this[i][j].isEqual === "function") { //e.g. a complex value
      if (!this[i][j].isEqual(that[i][j])) return false;
    }
  }
}
return true;

A benefit of this technique is that it could work with every type of Object, though with some performance loss.

Create a hash

(I don't mean this in the exact theoretical sense; toString() would be fine) of all objects that come in (A.toString() will be part of the add function). This means that any time you compare values you would only do: this.__primitiveData.hasOwnProperty(A.toString()); an example toString of a matrix of complex numbers would be:
"Matrix([[0 + 1i, 1 + 0i], [1 + 1i, 1 + 2i]])"

The reason toString is usually frowned upon is that with non-structured data (e.g. JSON) it is hard to enforce. The benefit of creating Set in this library is that we have very structured types (currently, and I assume for the foreseeable future) where order matters, so a strict toString method would work for the majority of our library's "primitives".

The problem here is that when we convert to an array, the "unhash" function will be crazy stupid. Another possibility is that we could, instead of having the hash point to a true value, point to the value that is being added:
this._data[d.toString()] = d;
that way you create an Array by:

var ret = [];
for (var key in this._data) {
  if (this._data.hasOwnProperty(key)) {
    ret.push(this._data[key]);
  }
}
return ret;

Another problem here is that we would have to have something listening to our this._data (or assume that the developer won't abuse its 'private' nature, which is my preference), so that when an item gets updated, the key associated with it gets deleted and re-added.

Another possibility

If we prefer to keep this an extension of the Array class, we can do all the this.has logic within the dictionary this._mappedData and, when we push and pop, remove from this._data as well (the problem is keeping track of where things are, but based on the jsPerfs it really shouldn't matter). This also means toArray() would literally return this._data.
This is basically your recommendation above, with the addition of working with our primitives and anything that a developer would like to extend themselves: if you have an order-matters object whose toString represents its data, it should work in the set.

@ethanresnick
Contributor

Ok, ok, I think I understand the problem now. Thanks for walking me through it.

Just to paraphrase what you said to make sure we're on the same page:

We need a check more sophisticated than === to decide whether an object member exists, which is a problem totally ignored by my above approach (which was just trying to move some lookups from O(n)/O(log(n)) to O(1) without resorting to any hashing).

So, to check object equality, we could run A.isEqual(B), but the problem is that we'd have to run that check against every existing object whenever we try to add a new member, which is much slower (on reasonably sized sets, anyway) than just hashing the new potential member and checking whether an object with that hash already exists. So we should probably use some sort of hashing solution.

If I'm understanding you right, then I agree with all of that. In addition, I think:

  • Set should continue to extend/wrap Array. In other words, I should be able to build a Set and then pass it directly to d3 or jQuery without having to do any toArray. So every add/remove should do a this.push/this.splice.

  • When we hash objects as they're going in, we should still store the raw object in the dictionary, because many objects will be too hard, if not impossible, to recreate from their string representations, as we've both said.

Details of the hashing function

Here's a first draft, with comments on the issues I see:


function isPrimitive(arg) {
  var type = typeof arg;
  return arg === null || type !== 'object' && type !== 'function';
}

function hash(arg) {
  var type = typeof arg;

  if(isPrimitive(arg)) {
    //use typeof to differentiate between false, "false", 4, "4", etc.
    return type + arg;
  }
  else if(type==="function") {
    //this is pretty good but definitely not perfect 
    //(e.g. Array.prototype.x would get the same hash as Object.prototype.x)
    return type + arg.toString();
  }
  else if(typeof arg.uid === "function") {
    //if our data types are going to go directly on top of arrays, as should be the case
    //for most of them (e.g. Matrix), then we need to look for a method other than
    //toString (here, I'm calling it uid), because [].toString is not an overwriteable prop
    return arg.uid();
  }
  else if(arg.toString === Object.prototype.toString || arg.toString === Array.prototype.toString) {
    //if we're dealing with an object with a default toString, we have to use JSON.stringify instead,
    //because Object.prototype.toString is worthless ({"bob":true}.toString() === "[object Object]")
    //and Array.prototype.toString is a little too lenient ([10].toString() === [[10]].toString() === "10")
    return JSON.stringify(arg);
  }
  }
  else {
    return arg.toString();
  }
}

Some tests:


> hash(NaN)
> "numberNaN"
> hash("NaN")
> "stringNaN"
> hash(null)
> "objectnull"
> hash("null")
> "stringnull"
> hash(-0)
> "number0"
> hash(0)
> "number0" //same as above. Who cares.
> 
> hash({"bob": true});
> "{"bob":true}"
> hash({"jim": true});
> "{"jim":true}"

//assuming Set defines a uid() that does reordering

> hash(Set([4,5,6]))
> "Set(4, 5, 6)"
> hash(Set([6,5,4]))
> "Set(4, 5, 6)"

One last note, if it's a function, we should require `===` equality, because two functions with the same source but that were defined in different scopes may not behave the same. (Also fixes the imperfect hash issue above.)

function has(arg) {
  //the local is named h so it doesn't shadow the hash() function
  var h = hash(arg), type = typeof arg;

  return Object.prototype.hasOwnProperty.call(this._data, h) &&
    (type !== "function" || this._data[h] === arg);
}

@milroc
Contributor Author

milroc commented Jan 8, 2013

Sorry if I wasn't getting my point across properly; I'll work on being more concise and on talking about the first principles of the problem.

I like that we've agreed on the need to hash, but some things to note:

1.

We should consider a way to flag a set so that it is optimized purely for fast addition/deletion.

var S = numbers.createSet().fast();

This creates a set that stores everything in the object and does not extend an array. While this is not ideal from an API standpoint, it will be a nice alternative for cases where an internal array is not necessary.

2.

When we hash objects as they're going in, we should still store the raw object in the dictionary, because many objects will be too hard, if not impossible, to recreate from their string representations, as we've both said.

This is not necessary, since the values are going to be in the array we are extending. Also the only time obtaining the value is useful is when we are converting the Set to another type (say object).

The time complexity of your current implementation (limited by the fact that we are extending Array) is:

add: O(1), since the array is unsorted

this._arr.push(d);
this._dict[hash(d)] = d;

remove: O(n) worst case, since the array is unsorted

if (this.has(d)) {
  var h = hash(d);
  delete this._dict[h];
  var i, idx, l = this.length;
  for (i = 0; i < l; i++) {
    if (hash(this._arr[i]) === h) {
      idx = i;
      break;
    }
  }
  this._arr.splice(idx, 1);
}

has: O(1) (really fast)

return this._dict.hasOwnProperty(hash(d));

toArray: O(1) (it's fast, since the array is already in memory)

return this._arr;

Based on the above, you can see that having the dictionary store the values too has no benefit; if this were written by extending an object, it would. It might be best if the keys just map to true instead.

If you are not happy with the performance of the remove function, we could keep the internal array sorted. This would only affect the basic add and remove functions (see below for the complexity changes). The biggest problem there would be the kind of sort function we use (which is also a big problem for Set.uid()).

  • add: O(log(n)) average case
  • remove: O(log(n)) average case

3.

A JSON object is an unordered collection of values, so each JavaScript engine may order keys differently, and JSON.stringify(arg) can produce different strings for the same set in different browsers. All this means is that we will need to ensure that Set creation and insertions all occur where they are meant to occur, which, I believe, is already handled thanks to your implementation.
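To make the key-order concern concrete (canonicalStringify is a hypothetical helper, not part of the library): JSON.stringify follows the object's own key order, so two objects with the same contents built in different orders can stringify differently unless the keys are sorted first.

```javascript
// Hypothetical helper: sort keys so equal objects always hash to the same
// string. (Arrays are passed through untouched in this shallow sketch.)
function canonicalStringify(obj) {
  if (obj === null || typeof obj !== "object" || Array.isArray(obj)) {
    return JSON.stringify(obj);
  }
  return "{" + Object.keys(obj).sort().map(function (k) {
    return JSON.stringify(k) + ":" + canonicalStringify(obj[k]);
  }).join(",") + "}";
}

var a = { x: 1, y: 2 };
var b = { y: 2, x: 1 };
// JSON.stringify(a) !== JSON.stringify(b): key order follows insertion order
// canonicalStringify(a) === canonicalStringify(b) === '{"x":1,"y":2}'
```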

4.

I'm confused as to what is meant by "function." Functions such as function() { return 2; } should be handled in the manner you recommended. However, if we end up creating a "Function" object that represents types of functions (linear, quadratic, etc.), then we will need to treat those independently of scope.

@ethanresnick
Contributor

I'm open to also having a Set that extends Object, which would look very similar to our current one. API for creating it might be something like: numbers.createSet(arr, true). And it'd have a forEach method to do easy iterating without exposing the hashes.

Having add be as fast as possible (unsorted) at the cost of more searching in remove doesn't necessarily seem like a bad tradeoff to me, as it's plausible that users will do much more adding than removing. But I'm not sure. Also, that shows one benefit to storing the object in the dictionary: it would make remove somewhat faster because it wouldn't have to hash every value. But I'm not sure if it has other (memory?) costs.

As for sorting in Set.uid, can we simply do this.sort()?
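For reference, a plain Array.prototype.sort() compares elements as strings by default, which misorders numbers, so a sorted uid() would need a numeric comparator:

```javascript
var members = [10, 9, 2];

// Default sort is lexicographic: "10" < "2" < "9"
var lexicographic = members.slice().sort();
// -> [10, 2, 9]

// Numeric comparator gives the expected order
var numeric = members.slice().sort(function (a, b) { return a - b; });
// -> [2, 9, 10]
```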

@milroc
Contributor Author

milroc commented Jan 9, 2013

remove is limited by the fact that we are storing an array of the objects; if you instead push the logic of deleting into remove, you essentially run into a complexity limit in toArray (which is roughly equivalent to my first pass at the code, albeit without extending Array).

this.sort() is viable; I'd just want to write a more advanced sort function, this.sort(this.sortFunc), that accounts for certain things. Though I might be overengineering at this point.

@ethanresnick
Contributor

I don't understand what you mean by "if you push the logic of deleting into remove then you would essentially run into a complexity limit in toArray". Isn't the logic for deleting already in remove, and won't toArray always be O(1)?

@milroc
Contributor Author

milroc commented Jan 9, 2013

function toArray() {
  var ret = [];
  for (var key in this._obj) {
    ret.push(this._obj[key]);
  }
  return ret;
}

If you add elements to an array, you must also remove them from it, so searching (see the remove function in an earlier comment) is a necessity. If you wish to remove that necessity, don't store any values in an array at all, and convert from the object to an Array only when one is needed. I'm not sure that's possible if we are extending the Array class.

@ethanresnick
Contributor

Hey,

Just getting back to this; needed to take a break for a few days.

I see what you're saying about toArray, but that only applies if the basic object doesn't act as an array, which it should. For the fast version we were talking about earlier, though, that code is relevant.

@ethanresnick
Contributor

Also, there's an issue that's been nagging at me throughout this conversation (see code):

var x = [1,2];
var y = [1,2];
var s = numbers.createSet();

s.add(x); //all good

x.push(3); //but now we modify x, and things go wrong

s.add(x); //x (a duplicate) is added again because its new hash doesn't match the old one
s.add(y); //y is rejected even though it's not a duplicate of x anymore

Am I right in thinking that's how the code would work, or am I missing something?

If this is what would happen, you could say that the x.push step should be invalid, but we're talking about javascript here, so things are going to be mutable, and it seems like we should embrace that.

One solution, then, might be to require a re-hash, i.e.:

x.push(3);
s.rehash(x); //will search for x in the array and recalculate its hash

s.add(x); //rejected as expected
s.add(y); //accepted as expected

But I really don't like the API of that; too inconvenient, and error-prone. So I'd be tempted to give up a fair amount of performance (not the strong suit of this library anyway) for convenience. So has might look like:

function has(arg) {
  //local named h so it doesn't shadow the hash() function
  var h = hash(arg), type = typeof arg;

  for(var i=0, len=this.length; i<len; i++) {
    if(hash(this[i]) === h) {
      return (type !== "function" || this[i] === arg);
    }
  }
  return false;
}

I think this'll be faster than the isEqual solution outlined earlier—the arg being tested is only hashed once, and we could cache each member's hash in its object (assuming the user opts into only doing modifications through the API) to get the complexity of subsequent hash()es down to O(1), i.e.:

AnyType.prototype.uid = function() {
  if(this._uid && this.useCache) { return this._uid; }

  var uid; //...build uid here, then cache and return it:
  this._uid = uid;
  return this._uid;
};

AnyType.prototype.modify = function() {
  //....
  this._uid = null; //invalidate the cached hash
};

Further, we could go back to an _isPrimitiveData like solution from the above comments, so that members whose hash can't change are still checked in O(1).

But, of course, has will still end up significantly slower.

Any better ideas?

@milroc
Contributor Author

milroc commented Jan 15, 2013

So Ethan,

If you're checking whether something exists in the set, then a change in the state of x is not a bad thing: has should return false if x has been changed. This should not be a concern of the Set object. A Set is not meant to contain references to where data lives in memory; it should merely contain values.

var x = 3;
var s = numbers.createSet();
s.add(x); // [3]
s.add(x); // [3]
x = 4;
s.has(x); // false yay, we know that x has been modified
s.add(x); // [3, 4]

This is fine, if not ideal behavior.

The problem is if the data in the array is not congruent to the hash due to referencing issues:

var x = [3];
var s = numbers.createSet();
s.add(x); // s = { _obj: {'[3]': true}, _data: [[3]] }
x = [4]; // s = { _obj: {'[3]': true}, _data: [[4]] }

If this is the case, we need to find a way to ensure that the expected { _obj: {'[3]': true}, _data: [[3]] } behavior occurs when x is modified (perhaps a copy on add).

If a developer wants to remove x from the set if they modify it, then:

function modify(x, s, newX) {
  if (s.has(x)) {
    s.remove(x);
    s.add(newX);
  }
  x = newX;
}

Can you rephrase the benefits of AnyType.uid? I think it's covered by the ideas above, but I'm not sure.

@ethanresnick
Contributor

Yes, obviously this only applies to references (which is what I was alluding to with the mention of the _isPrimitiveData check).

Copy on add is an option. It's a little unexpected maybe (has to be clearly documented), but it's probably the best option.
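A minimal copy-on-add sketch (Set_, deepCopy, and the JSON.stringify stand-in hash are all illustrative, not the proposed API): the set stores its own snapshot of the member, so later mutation of the caller's reference can't desynchronize the hash and the data.

```javascript
// Hypothetical deep-copy helper for plain objects/arrays of primitives.
function deepCopy(value) {
  if (value === null || typeof value !== "object") return value;
  var copy = Array.isArray(value) ? [] : {};
  for (var key in value) {
    if (value.hasOwnProperty(key)) copy[key] = deepCopy(value[key]);
  }
  return copy;
}

function Set_() {
  this._obj = {};  // hash -> true
  this._data = []; // snapshots of members
}

Set_.prototype.add = function (value) {
  var h = JSON.stringify(value); // stand-in for the real hash()
  if (!this._obj.hasOwnProperty(h)) {
    this._obj[h] = true;
    this._data.push(deepCopy(value)); // snapshot, not the live reference
  }
  return this;
};

var x = [3];
var s = new Set_().add(x);
x.push(4); // mutating x no longer corrupts the set
// s._data[0] is still [3]; s._obj still has the "[3]" key
```

The cost is one copy per add and the documentation burden Ethan mentions; the benefit is that the { _obj: {'[3]': true}, _data: [[3]] } invariant from the earlier comment holds regardless of what the caller does afterwards.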

@sjkaliski
Member

Hey all, sorry I've been MIA for a bit, busy with the holidays and now work. Finally free to get on this for a bit. I'll be reading this over and getting back shortly.
