Time series support in Base? #3524

Closed
johnmyleswhite opened this issue Jun 24, 2013 · 42 comments · Fixed by #7285 or #7654
Labels
needs decision A decision on this change is needed

Comments

@johnmyleswhite
Member

Is it possible to merge Calendar into Base, possibly after some bikeshedding? We had so much discussion about time series support last year, then Calendar became the de facto time series tool without ever entering Base.

@quinnj
Member

quinnj commented Jun 24, 2013

+1

If it's ironed out enough, I think time series support is a very natural expectation of Base material.

@StefanKarpinski
Member

I really wish that this didn't entail depending on ICU, but we probably should.

@staticfloat
Member

Please ping me when this lands, as I should probably update build recipes and debian package requirements to include libicu.

@johnmyleswhite
Member Author

How much functionality can we retain without ICU?

@quinnj
Member

quinnj commented Jun 24, 2013

I've started digging into it a little bit, and @nolta can speak to it more, but currently Calendar is very dependent on ICU. Excluding ICU would basically require rewriting everything, but I'm also inclined to think we could get the majority of the functionality in a very Julian way without ICU and also without it being too difficult to implement (ICU has pretty good documentation of everything). Would definitely be a fun project if we decided to go that way.

@nolta
Member

nolta commented Jun 25, 2013

@karbarcca It depends on what you mean by "majority of the functionality". Implementing zulu time in pure julia would be easy. Adding timezone support, however, would be quite difficult. As i see it, proper timezone handling is Calendar's raison d'etre.

@aviks
Member

aviks commented Jun 25, 2013

@karbarcca agree with @nolta. Zulu time is implemented in pure Julia in https://github.com/aviks/SimpleDate.jl (the api is different from Calendar, but that would be a trivial change)

Adding timezone support is the missing piece. I had some julia code to parse the olson database, but never got to implementing the conversions. On top of that, one needs leap second support.

Another large piece of functionality that ICU provides is date formatting/parsing.

All of which is certainly doable, but is a large chunk of work.

@ViralBShah
Member

ICU is big, but it is easy to get on Linux and is included in OS X. I believe it also works on Windows. Even if we can reimplement much of what we need in Windows, it will be quite a chore to support. I am in favour of using ICU and bringing Calendar into Base. Over time, we could reduce our dependence on ICU, if circumstances demand.

@quinnj
Member

quinnj commented Jun 25, 2013

My vote would be to start out with a simpler featureset in pure Julia (a la SimpleDate.jl) with a clean API that can be expanded over time and exclude the ICU dependence. Starting out with a more SQL-like support with Date, Time, DateTime types, lubridate-style arithmetic, duration, period, and interval support, and IO support with parsing/formatting is a solid foundation that provides a lot of basic functionality. As for leap seconds, we could take ICU's approach and ignore them :) (basically leave it to the operating system to figure out). Timezone support is definitely a bigger chunk of work to do manually, but there are simple ways to include basic functionality (as @aviks mentioned) and marking it as a future feature for full support I think is reasonable.

@aviks
Member

aviks commented Jun 25, 2013

@karbarcca happy to give you commit access to SimpleDate if you want to run with that codebase.

@johnmyleswhite
Member Author

In general, I like @karbarcca's gung ho attitude. My major worry is that we can't start using times without leap seconds and then introduce them later unless we tell people that the date time support is a draft at best.

@quinnj
Member

quinnj commented Jun 25, 2013

Thanks @aviks, I've enjoyed going over your code today and I'm taking a stab at adding features while trying to follow Calendar.jl's conceptual framework to get a working julia Calendar/Timezone implementation.

The thing with leap seconds is there really isn't a lot of consensus on how to handle them at all. Most languages/applications (including ICU) have taken the approach that they're not going to worry about it and let it be an OS problem. Implementation can be tricky, but we can take a stab at it if we deem it important enough. I found a blog post here with a good walkthrough of considerations, and also came across a slightly hacky way that Google deals with leap seconds. The problem with a lot of "solutions" (hacks mainly) is that they're all pretty much clever ways to trick servers into dealing with an extra second every once in a while and not really a formal API or anything. One thing I thought of was doing a simple cache of year seconds/milliseconds and basing our date parsing off of them. That would allow us to easily manually add seconds as needed, calculate accurate durations/intervals, and also has the added benefit of giving us much faster date parsing.

Anyway, I'll plug away a little more and see if I can't push something for review.

@simonbyrne
Contributor

I suspect that ignoring leap seconds might be best since:

  1. everyone else does it, making compatibility difficult if julia doesn't
  2. if people do care about leap seconds, they would probably just use TAI (or have their own special method for dealing with it all)

I think the easiest and least confusing solution would be to define a TAI timezone (or a separate TAI datetime type), and define a method for converting between that and UTC by adding/subtracting the appropriate number of seconds: that way, if you want the length of the interval between t1 and t2 (stored in UTC), you could do something like

t2 - t1 # duration ignoring leap seconds
TAItime(t2) - TAItime(t1) # duration accounting for leap seconds
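
A minimal Julia sketch of the idea above, assuming timestamps are stored as integer Unix-style UTC seconds. The names `leap_starts`, `tai_minus_utc`, `tai_offset`, and `to_tai` are hypothetical, and the table holds only the first three TAI−UTC steps for illustration; a real table would need every announced leap second, and treating the pre-1972 offset as zero is a simplification.

```julia
# Illustrative, deliberately incomplete table (an assumption of this sketch):
# UTC instants, as Unix seconds, from which a new TAI-UTC offset applies.
leap_starts   = [63072000, 78796800, 94694400]  # 1972-01-01, 1972-07-01, 1973-01-01
tai_minus_utc = [10, 11, 12]                    # TAI-UTC in seconds from each instant on

# Offset in force at a given UTC instant (0 before the table starts, a simplification).
function tai_offset(utc_seconds)
    i = searchsortedlast(leap_starts, utc_seconds)
    return i == 0 ? 0 : tai_minus_utc[i]
end

# Convert a UTC second count to a TAI second count.
to_tai(utc_seconds) = utc_seconds + tai_offset(utc_seconds)

t1 = 78796799   # 1972-06-30T23:59:59 UTC
t2 = 78796800   # 1972-07-01T00:00:00 UTC

t2 - t1                  # 1: duration ignoring the leap second
to_tai(t2) - to_tai(t1)  # 2: duration accounting for the leap second
```

Keeping the conversion in a separate function leaves plain UTC arithmetic untouched, so only callers who care about leap seconds pay for the lookup.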

@quinnj
Member

quinnj commented Jul 5, 2013

Ok, I just created a new repo with a bunch of stuff I've been working on the last 2 weeks. Basically it ended up being a much larger beast than I anticipated, which most of you were probably already aware of. :) I think it's a really good start though for Date, Periods, TimeZone, and DateTime support in pure Julia.
Here's the repo: https://github.com/karbarcca/Datetime.jl

The main influences for the code are as follows:

  • @aviks SimpleDate.jl package
  • @nolta Calendar.jl package
  • the R lubridate package
  • Java's Joda-time (which is considered a gold standard across languages even and will soon be merged into core Java)

High-level framework/concepts:

  • a Calendar abstract type to represent a certain calendar's way of date calculations (== chronologies in Joda-time)
  • a TimeZone abstract type which is subtyped by all identified zones in the Olson tz database (see timezone.jl)
  • Period bits types that represent certain relative/absolute durations of time, including Year, Month, Day, Hour, Minute, and Second
  • a Date immutable with fields for year, month, and day and is parameterized by a certain Calendar (the ISOCalendar is used by default). This is our "low-precision" type that doesn't have to worry about timezones and leap seconds and is easier to reason about in terms of period arithmetic. Similar to a partial datetime or LocalDate in Joda, or a simple Date type in many DBMS. This is very fast to work with doing ranges and arithmetic.
  • a DateRange1 type that can be used to create frequencies given start/period/stop inputs. I think a pure DateRange type could represent Intervals in lubridate/Joda where just a start/stop are given (used with arithmetic, not generating frequencies at all)
  • a 64-bit DateTime bits type that is parameterized by a certain Calendar as well as a TimeZone. Its value represents the number of Rata Die seconds (seconds since 0000-01-01T00:00:00), similar in concept to a Rata Die day number (see code for more date algorithm comments). Since Unix time's epoch is 62135596860 Rata Die seconds, it's trivial to convert Unix timestamps to DateTime types. Using the current default ISOCalendar though, leap seconds are included (try @assert datetime(1972,7,1,0,0,0) - datetime(1972,6,30,23,59,59) == 2), so wrapping time() in datetime() will give you a DateTime 25 seconds or so in the future. (We could correct this with the now() function though, to give the true current time that someone sees on the clock.)
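
To make the "seconds since an epoch" representation concrete, here is a small Julia sketch of a Rata Die style conversion. It is not the package's code: the function names are hypothetical, the sketch counts from 0001-01-01 (the conventional Rata Die day 1) and ignores leap seconds, so its constants come from its own epoch choice and are not meant to reproduce the numbers quoted above.

```julia
# Rata Die day number: day 1 is 0001-01-01 in the proleptic Gregorian calendar.
function rata_die_day(y::Integer, m::Integer, d::Integer)
    # Treat January and February as months 13 and 14 of the previous year,
    # so the leap day falls at the end of the shifted year.
    if m < 3
        y -= 1
        m += 12
    end
    return d + div(153 * m - 457, 5) + 365 * y + fld(y, 4) - fld(y, 100) + fld(y, 400) - 306
end

# Seconds elapsed since 0001-01-01T00:00:00, ignoring leap seconds.
function rata_die_seconds(y, m, d, h=0, mi=0, s=0)
    return (rata_die_day(y, m, d) - 1) * 86400 + 3600 * h + 60 * mi + s
end

# With a fixed epoch, converting a Unix timestamp is a single addition.
unix_epoch = rata_die_seconds(1970, 1, 1)
from_unix(t::Integer) = unix_epoch + t
```

Storing one integer count keeps comparisons and subtraction cheap; the calendar fields are only computed when a value is displayed or decomposed.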

Potentially useful additions:

  • I mentioned a DateRange type; having a DateTimeRange type would map Intervals in lubridate/Joda; this also opens up sub-Second possibilities (having a start DateTime instant and noting the attoseconds from that instant); this stems from a lengthy discussion I found in the forums last year
  • I think a Time immutable as Date's analogue (field-based with Hour, Minute, Second) would be potentially useful as well
  • I would really love to support Temporal Expressions a la runt in Ruby: link

This is definitely a first draft, and I'm positive there are holes to be plugged and lots of refinement needed. I would really appreciate any questions, critiques, discussion to push this forward.

Sources:
SimpleDate.jl - https://github.com/aviks/SimpleDate.jl/blob/master/src/SimpleDate.jl
Calendar.jl - https://github.com/nolta/Calendar.jl/blob/master/src/Calendar.jl
SO - http://stackoverflow.com/questions/2532729/daylight-saving-time-and-timezone-best-practices
W3 - http://www.w3.org/TR/timezone/#floating
Joda - http://joda-time.sourceforge.net/userguide.html
Date Algorithms - http://mysite.verizon.net/aesir_research/date/date0.htm
Google Group Discussion - https://groups.google.com/forum/#!searchin/julia-dev/parametric$20bitstype/julia-dev/YlUS7899gro/iLtq8WnATrkJ
Insane convo pull request - #698 (comment)

@aviks
Member

aviks commented Jul 6, 2013

Thanks @karbarcca , this is great. A couple of quick comments while I have more of a play with this

  • The loading of timezone.csv seems dependent on the cwd of the Julia process. Maybe this should be developed as a package initially, with loads made relative to Pkg.dir()
  • I assume you have a script to generate the timezone.csv when the Olson db is updated?
  • This currently seems about two orders of magnitude slower than Calendar.jl for simple date arithmetic.
  • What is the precision of the DateTime? I personally think it is fine if very high precision time usage needs specialised libraries, but others may disagree.
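
On the first point, a hedged sketch of loading a data file independently of the process's working directory. It uses the `@__DIR__` macro from current Julia rather than the Pkg.dir() call mentioned above; the assumption that timezone.csv sits next to the source file, and the helper name `load_tz_lines`, are hypothetical.

```julia
# Resolve the data file relative to this source file, not the current working directory.
# The sibling-file layout is an assumption of this sketch.
tz_path = joinpath(@__DIR__, "timezone.csv")

# Hypothetical loader: returns the raw lines, or an empty vector when the file is absent.
load_tz_lines(path) = isfile(path) ? readlines(path) : String[]

tz_lines = load_tz_lines(tz_path)
```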

@quinnj
Member

quinnj commented Jul 6, 2013

@aviks

  • I've actually pushed a fix for loading timezone.csv, so let me know if you're still seeing a problem there. I'm actually unsatisfied with how the timezone data is handled in general and this was really just a "get it working first" kind of solution. I'd love to brainstorm some more ideas of how to be efficient here.
  • Yeah, there's a script to generate the dataset
  • All of the DateTime stuff is definitely slower right now, particularly if you're using non-UTC timezones. Unacceptably slow really. If you're using the lower precision Date type though, it should be faster from my initial benchmarks.
  • The DateTime is a 64-bit value with time measured in seconds, so the max date we can show is 1585318-12-06T15:29:42 UTC. This seems well beyond anything anyone should need for timestamps. I agree that for greater precision, an add-on package could provide even more functionality. The discussion here is what spurred my comment about having a possible TimeDelta{p} 64-bit type that could represent 10^-p seconds for higher-precision intervals. From the discussion, it seems this was following NumPy's approach, but simpler because we could provide one parameterized type as opposed to the 26 or so new types NumPy introduced, one for each power of 10. I guess I am just not familiar enough with actual use cases to know how best to implement something like this or how best to work with this kind of type. Maybe someone who has had experience or is familiar with sub-second timing needs can comment a little more on whether this is something Julia should provide out of the box or is something better left to a package. I know @StefanKarpinski was involved a lot in that conversation.
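
One way the TimeDelta{p} idea could look, sketched in current Julia syntax (`struct` rather than the 2013-era `immutable`). Everything beyond the type name and its parameter is an assumption for illustration: the field name, `rescale`, and `seconds` are hypothetical helpers.

```julia
# A duration stored as a 64-bit count of 10^-P second ticks; P is the type parameter.
struct TimeDelta{P}
    value::Int64
end

# Same-precision deltas add directly.
Base.:+(a::TimeDelta{P}, b::TimeDelta{P}) where {P} = TimeDelta{P}(a.value + b.value)

# Convert to an equal or finer precision (Q >= P) without losing information.
function rescale(::Type{TimeDelta{Q}}, x::TimeDelta{P}) where {Q,P}
    Q >= P || throw(ArgumentError("cannot rescale to a coarser precision without rounding"))
    return TimeDelta{Q}(x.value * Int64(10)^(Q - P))
end

# Value in seconds as a Float64 (may round for very large counts).
seconds(x::TimeDelta{P}) where {P} = x.value / 10.0^P

ms = TimeDelta{3}(1500)          # 1500 ticks of 10^-3 s
us = rescale(TimeDelta{6}, ms)   # the same duration in ticks of 10^-6 s
```

Because the precision lives in the type, mixing precisions by accident is a method error rather than a silent unit bug, which is the appeal of one parameterized type over many separate ones.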

@quinnj
Member

quinnj commented Jul 9, 2013

Ok, I added Datetime to METADATA and a README to the package repo, so hopefully more can try it out, kick tires, give it a whirl. I added a perf.jl file that runs common operations 1:1000000 times and returns the results. I also included a calendarperf.jl file that was my baseline for comparing with the Calendar.jl package.

Overall, I'm really pleased with how far the performance has come. @timholy's profiler was a great help (and hopefully on windows soon?). Performance-wise, Datetime is either on-par or faster on almost every benchmark.

The remaining performance issues are when timezones are specified. I'd say it's at an acceptable/working level (compared to the first draft), but still 2x-4x slower than Calendar.jl. Right now, the timezone data is serialized in matrices for each timezone in the tzdata/ folder and loaded as the timezone is called for (I like this approach because it manages memory better than slurping the entire db into memory to hold while the user works with timezones). The problem is that with certain timezones, the matrix is 100+ rows and currently, a simple linear search is used to do the lookup. I'm positive some kind of binary/trie/radix/indexed implementation would really help, but I'm actually not very experienced with some of these advanced data structures/algorithms to try it out. I'd appreciate anyone's input here.
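
For what it's worth, a binary search over a sorted transition column is a one-liner with Base's `searchsortedlast`. The sketch below assumes each zone's data can be reduced to two parallel vectors; the variable names, the toy values, and the `initial_offset` fallback are all hypothetical.

```julia
# Hypothetical transition table for one zone: sorted UTC instants (seconds) at which
# the offset changes, and the offset in force from each instant on.
transitions = Int64[100, 200, 300, 400]
offsets     = Int64[-18000, -14400, -18000, -14400]

# O(log n) lookup: index of the last transition at or before t.
function offset_at(t, transitions, offsets, initial_offset=0)
    i = searchsortedlast(transitions, t)
    return i == 0 ? initial_offset : offsets[i]
end

offset_at(250, transitions, offsets)   # -14400
```

With a table of 100+ rows this replaces roughly a hundred comparisons with about seven, and it needs no extra data structure as long as the rows stay sorted.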

Anyway, it's been a ton of fun working on this stuff and I've really enjoyed how much I've learned about bitstypes and type parameters through the process; it's definitely expanded my understanding on how Julia works and the potential there really is thru the type system.

Feedback welcome!

@ViralBShah
Member

This is really quite amazing. I am a bit taken for the moment, but will certainly jump into this in the next few days.

@quinnj
Member

quinnj commented Jul 15, 2013

Vacations are always nice to mull things over.

I've thought a lot about the Datetime stuff and particularly about timezone/leap second support and how to do it in a way that's both efficient and maintainable. Here's what I'd like to propose:

  • Including something along the lines of the new Datetime2 package; it includes a simple Date and Zulu/UTC-based DateTime implementations (plus support for Period types, DateRange, DateTimeRange, etc.). The code is fast, efficient (~400 lines of code), and provides a lot of date/time functionality without having to deal with any timezone/leap second business. The other main factor here is that this code isn't likely to change (other than normal optimizations, tweaks, etc.).
  • The creation of a DateExtensions package (that could possibly live in the JuliaLang organization). This package would include timezone and leap second support similar to what I've pushed in the original Datetime package mentioned earlier in this thread. It would be fully compatible with Base.Datetime code and really just extend it for the additional functionality. The main driving force for splitting this functionality into a separate package that a user would add thru the package manager is maintainability. Leap seconds are announced no more than 6 months before one occurs (either the end of June or end of December). This would fly while Julia is pre-1.0, but imagine if companies are eventually anchoring to v1.0 or v2.0 and new leap seconds are added while a new Julia version is more than 6 months away. It's the same issue with timezone information. The Olson tz database is updated regularly, much more frequently than Julia language releases. Splitting timezone data/functionality into its own package allows the package to maintain a similar release cadence with Julia, but also include its own "updates" to its release branches when they happen, without the fear of breaking users' code.

Feel free to check out the new Datetime2 package (it's really fast!), and I'd love to hear everybody's thoughts on the proposal.

@StefanKarpinski
Member

I really appreciate how much thought you've put into this. I think that @nolta is the other person here with the most expertise in date and time stuff, and @aviks has also done a lot of work on this stuff, so I defer to your collective judgements, but that sounds like a sound plan to me.

@johnmyleswhite
Member Author

I've started using this and think it would be great to have in Base. Getting this finalized will still take some work, but this is very close to the kind of design I'd like to see (as a person without any detailed expertise in time representations).

@ViralBShah
Member

Also cc: @milktrader

@nolta
Member

nolta commented Jul 17, 2013

Neat stuff! I have some reservations about the API, however, in particular the way periods are handled.

Eager down-conversion, e.g., years(1) + days(1) == days(366) feels like a mistake. It's an approximation, and it makes period arithmetic non-associative:

julia> using Datetime2

julia> d = date(2013,7,17)
2013-07-17

julia> (d + years(4)) + days(1)
2017-07-18

julia> d + (years(4) + days(1))
2017-07-17
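
The arithmetic behind the two results can be reproduced with the Dates standard library in current Julia, which postdates this thread; it is used here only as a calculator, and the explicit `Day(4 * 365 + 1)` is an assumed stand-in for what the eager down-conversion in Datetime2 computes.

```julia
using Dates

d = Date(2013, 7, 17)

# Calendar-aware: add four calendar years, then one day.
stepwise = (d + Year(4)) + Day(1)

# Eager down-conversion at 365 days per year: years(4) + days(1) becomes 1461 days.
# The span from 2013-07-17 to 2017-07-17 contains 2016-02-29, so it is itself
# 1461 days long, and the extra day is absorbed by the leap day.
converted = d + Day(4 * 365 + 1)

(stepwise, converted)   # (2017-07-18, 2017-07-17)
```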

I also don't think splitting the package in two is a great idea.

@quinnj
Member

quinnj commented Jul 17, 2013

@nolta, can you elaborate more on why you think splitting the package in two is a bad idea?

I agree that at first glance, it seems unintuitive and a little weird, but I think the advantage I mentioned of having always-up-to-date timezone/leap second information is a major win. W.r.t. timezones, every other major datetime package (Joda, Noda, etc.) ships with a static repo of the timezone data and details a long, complicated download-reformat-recompilation process for manually updating. And for leap seconds, I would argue that we shouldn't support leap seconds in Base under any circumstance. The fact that a new leap second can occur within 6 months would quickly render a static release useless (imagine running a server logging timestamps, expecting leap second support). We'd put ourselves in the same camp as Joda/Noda detailing a manual update process that would surely turn off users. With the package system stabilizing, I think it provides an excellent--and simple--way to provide updates of timezone/leap second data.

As for the period arithmetic, I agree that there's a possible gotcha, but there's also not a clear solution without losing some expected behavior or have inconsistencies (e.g. not allow years/months, but allow days). I see a few options:

  1. Keep it as is, with default conversions (365 days in a year, etc.) and be upfront/explicit about it and possible gotchas associated with leap years and year + date arithmetic. Note that any period + month operation already results in an error.
  2. Get rid of inter-period conversions. Then your first example above would calculate the same, but the 2nd, with years(4) + days(1) in parenthesis, would result in an error.
  3. Provide a warning/message any time a conversion happens; I initially did this and included months, but it was pretty annoying and probably not a good way to go.
  4. Do the same as 2, but also provide a CompoundPeriod type (basically a Dict with Period=>Value); when arithmetic is done, it follows a specific order (greatest to least, first to last, etc.). This is what Noda has done.
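
As an illustration of option 4, the Dates standard library in current Julia (which postdates this thread) keeps mixed periods unconverted in a `Dates.CompoundPeriod` and applies the parts one at a time. This is shown as one possible shape of the idea, not as what Datetime2 did at the time of writing.

```julia
using Dates

d = Date(2013, 7, 17)

# Adding unlike periods builds a compound value instead of converting years to days.
p = Year(4) + Day(1)

left  = (d + Year(4)) + Day(1)
right = d + p            # the year part is applied first, then the day part

left == right            # true: both are 2017-07-18
```

Deferring the conversion until a concrete date is available is what restores the grouping-independence that the eager 365-day rule loses.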

@nolta
Member

nolta commented Jul 17, 2013

@nolta, can you elaborate more on why you think splitting the package in two is a bad idea? ... With the package system stabilizing, I think it provides an excellent--and simple--way to provide updates of timezone/leap second data.

I agree with all of this. But these arguments work equally well with the proposition "we shouldn't merge Datetime into Base".

Splitting Datetime and merging part of it into Base is liable to create two tightly-coupled modules with different release schedules. So instead, let's not split Datetime up, and leave it as a package. As you say, the package system works great.

4. Do the same as 2, but also provide a CompoundPeriod type (basically a Dict with Period=>Value); when arithmetic is done, it follows a specific order (greatest to least, first to last, etc.). This is what Noda has done.

This is also what Calendar does, so it's the solution i prefer. But if Datetime remains a package, then you're free to implement the solution you prefer.

@aviks
Member

aviks commented Jul 17, 2013

But these arguments work equally well with the proposition "we shouldn't merge Datetime into Base".

I think DateTime is important enough that there should be a single canonical implementation. Imagine a system where DataFrames depends on one kind of datetime object, and @milktrader's TimeSeries package uses a different type of Date. One would be converting between different types of dates all over one's codebase. This arises quite often in Java projects, where one dependent library uses Joda-Time and another uses java.util.Date ... (at least there are converters available in this case)

While one may make such an argument about any facility, I believe datetimes are fundamental enough that this matters a lot.

The best (only?) way of ensuring this is to have a solid date time implementation in Base.

@quinnj
Member

quinnj commented Jul 18, 2013

Splitting Datetime and merging part of it into Base is liable to create
two tightly-coupled modules with different release schedules.

I'm not sure I follow/understand you here. It would probably be clearer if
the DateExtensions package existed already, but what I meant to convey
earlier and what I foresee is a tightly controlled interaction between the
DateTime module in base and the DateExtensions package. The base module
would release along with the rest of Julia base, and the DateExtensions
would follow the same schedule for any major updates (though I don't foresee
many major updates as it will mainly be a data repo). In between releases,
the only updates to the DateExtensions package would be timezone and leap
second data additions, which would be entirely non-breaking.

Is there some concern or potential misstep I'm missing with this kind of
setup?

Jacob

@staticfloat
Member

It seems like the main concern about putting everything in Base is the lack
of a good way to update things in a timely manner. Perhaps it's time to
think outside of the box; is there any reason we couldn't use the machinery
inside of Pkg to provide updates to packages that we could ship with Julia?
E.g. something in between Base and ~/.julia? Either something that you can
load with "using" without having to Pkg.add() (e.g. a simple package that just
comes preinstalled), or something that is loaded automatically a la Base.

@milktrader
Contributor

Have some time this week to jump into this conversation. Great stuff.

@nolta
Member

nolta commented Jul 18, 2013

@karbarcca Thanks for the details. But i read your plan and think, "This sounds like a hassle. Why bother?"

@aviks I don't really buy this argument. There are lots of important packages not in base. If Datetime is high quality, people will use it. If interop becomes a problem, we can ask maintainers to switch.

Maybe i'm wrong, but my gut feeling is that the benefit of including this in base is modest at best, and not worth the cost of splitting up the package.

@StefanKarpinski
Member

I agree with most of @nolta's points. Preventing fracturing of date/time representations is largely a social issue and partly a technical issue of having an official date/time package that's good enough that everyone wants to use it instead of rolling their own.

The biggest argument to me for having a time representation in base is that we might want to have functions in base return objects of that type. The time() function, for example. But I'm not sure if those should just use float seconds since epoch or whatever.

@milktrader
Contributor

R has some experience iterating through ways of dealing with time and there is a good man page at ?.leap.seconds which is pretty long to post here (I'll do it if someone wants it).

I think base would be well-served to have at least a foundational time-based type. It can have a timezone field that defaults to UTC so those data objects that don't need low latency (ie, daily, monthly, yearly) can use it. This time type can then be tagged as an IndexVector in a DataFrame and will be a path to plotting basic time series.

Being able to plot seasonal monthly birth rates from a DataFrame should be available out of the box (base) while those who want more precision and ability to aggregate across specific time periods can access a package.

@quinnj
Member

quinnj commented Jul 19, 2013

So after having a go at splitting the Datetime package into two, I have to admit that @nolta was right in that it would probably be too much of a hassle. Turns out the actual implementation of having two modules try and be tightly linked is pretty tricky without straight up redefining/overriding exact methods, which doesn't seem like a great user experience (if a workaround to #265 comes about, this could probably be managed).

So while splitting the package in two conceptually seems like a great idea for maintainability, practically it doesn't seem to be the best solution.

I've merged the revised/enhanced codebase of Datetime2 into Datetime now and will plan on deleting the Datetime2 repository. My plan is to keep testing/enhancing Datetime (I actually just pushed temporal expression support which ends up being very natural through using anonymous functions as the "step" in a DateRange; see the bottom of the README).
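
To give a flavour of temporal expressions, here is a sketch using the Dates standard library in current Julia rather than Datetime's function-valued step: a plain predicate is filtered over a daily range. The predicate name and the chosen rule are hypothetical examples.

```julia
using Dates

# Temporal expression as an ordinary predicate on a date.
is_friday_13th(d) = dayofweek(d) == Dates.Friday && day(d) == 13

# Every day of 2013, one day at a time.
days_2013 = Date(2013, 1, 1):Day(1):Date(2013, 12, 31)

hits = filter(is_friday_13th, days_2013)   # 2013-09-13 and 2013-12-13
```

Because the rule is just a function, expressions can be combined with `&&` and `||` without any dedicated expression type.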

I'm happy to help support anything that would like to be included in Base, otherwise, Datetime can continue to be its own package.

@aviks
Member

aviks commented Jul 19, 2013

@nolta @StefanKarpinski Well, to me the notion of a programming language without date/time support in its standard library seems incomplete. I suppose in the same way a language without BLAS/LAPACK support will seem incomplete to most Julia users. So I would want some kind of date time module in base.

But, at the end of day, that's a pretty subjective opinion.

@ViralBShah
Member

Yes, but things can develop outside and be brought into Base later.

@milktrader
Contributor

Good point. We haven't gotten beyond 0.2 yet. How about a milestone, say by 0.5?

@ggggggggg
Contributor

Re @staticfloat thinking outside the box. If DateTime moves into base, the issues regarding keeping the leap seconds and timezone information up to date come to the forefront. What if there was a package like OutsideData, that contained timezone.csv and a leap_second.csv file, which is installed by default, and possibly auto-updated on julia start (in juliarc.jl by default, so it is easy to turn off?). That could provide a nice way to keep this information, and possibly other information on a similar release schedule, up to date, and separate from stable code.
Or the first call to a function that uses OutsideData prints a line warning that it is out of date, sort of like deprecated functions.

@quinnj
Member

quinnj commented Aug 29, 2013

@ggggggggg, that's a great idea! And we've actually been discussing doing just that over here. I just pushed some changes the other day that should allow us to do this.

@StefanKarpinski
Member

Auto-updating anything is not ok. You can't start julia and have it make network connections you didn't ask for. In general anything like that should be opt-in not opt-out.

@StefanKarpinski
Member

I think including the data updates in point releases is probably fine.

@quinnj
Member

quinnj commented Aug 29, 2013

Good clarification @StefanKarpinski. Yes, we wouldn't be pushing anything automatically, but it could still be as simple as installing an OutsideData package that could be manually updated with Pkg.update() and the timezone/leap second data would refresh.

@pao
Member

pao commented Jul 18, 2014

Reopening; this was closed by the accidental bizarro-merge in #7825.
