Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Master List of data.table Issues for GSoC '24 (Josh) #1

Open
8 of 11 tasks
joshhwuu opened this issue May 27, 2024 · 27 comments
Open
8 of 11 tasks

Master List of data.table Issues for GSoC '24 (Josh) #1

joshhwuu opened this issue May 27, 2024 · 27 comments

Comments

@joshhwuu
Copy link
Owner

joshhwuu commented May 27, 2024

Blog:

https://joshhwuu.github.io/

Issues:

A list of pull requests can be found here: https://github.com/Rdatatable/data.table/pulls/joshhwuu

@joshhwuu joshhwuu changed the title Master List of data.table Issues for GSoC '24 Master List of data.table Issues for GSoC '24 (Josh) May 27, 2024
@joshhwuu
Copy link
Owner Author

@tdhock @Anirban166 Hey mentors, we should use this thread as a way to discuss things that don't directly have to do with the other members of data.table, feel free to comments if you have any questions or concerns.

@joshhwuu
Copy link
Owner Author

joshhwuu commented May 29, 2024

Going to start updating this thread to keep mentors updated with progress.
@tdhock @Anirban166

May 28, 2024 (Week 1)

I submitted/merged PR #6150 Fix Windows Parsing Issue which aims to fix an encoding issue related with the base locale on certain Windows builds, as pointed out by Michael and can be found here: https://github.com/Rdatatable/data.table/actions/runs/9140204969/job/25133295034
For this issue I tried to recreate the problem by setting locales in my Linux system, was able to reproduce with LC_CTYPE being the latin-1/Windows 1252 equivalent for Linux, and then set the tests in a local to use a utf-8 locale, which should work for modern systems, although I'm not sure if it'll work on all Windows builds, even those that might not have utf-8 available as a locale.

PR #6151 Update fifelse documentation has also been submitted, awaiting review, this PR updates fifelse documentation to warn users not to use it in cases such as recursive functions because of its evaluation pattern, and hints to fcase, need advice on whether to update fcase documentation as well.

Draft #6158 fread blank.lines.skip gains new value has also been submitted, current LF help on a question I'm having with the C code, if anyone has any suggestions that would be great.

@tdhock
Copy link

tdhock commented Jun 1, 2024

looks great, thanks for sharing. please also consider making a blog, posting weekly

@joshhwuu
Copy link
Owner Author

joshhwuu commented Jun 1, 2024

Yes, I have set one up but forgot to link it, I'll link it here and at the top of the page here, I'll look to post an update on the blog every saturday/sunday: https://joshhwuu.github.io/

@Anirban166
Copy link

Looks great to me as well!
Keep up the good work and don't forget to include links to your blog posts as and when you get done writing/publishing them (along with reproducible examples and everything else we have been discussing during our weekly calls apart from sharing links to your blog)

@joshhwuu
Copy link
Owner Author

joshhwuu commented Jun 3, 2024

Hey mentors! As promised, this is the newest entry on my blog linked here

@tdhock
Copy link

tdhock commented Jun 3, 2024

Hi Josh your blog looks great thanks
also please add links on https://github.com/rdatatable/data.table/wiki/Articles
and you may consider making an RSS feed and submitting your blog to https://www.r-bloggers.com/add-your-blog/

@joshhwuu
Copy link
Owner Author

joshhwuu commented Jun 4, 2024

@tdhock @Anirban166

June 3, 2024 (Week 2)

This week, I am planning to work on #5611, #5558, #5411.

#5611 is a WIP, I submitted a draft PR. I will do some digging to see what exactly the two assertions are testing for, as they are nocov. If I can figure that out, then we can confirm how well the change works.

I just submitted a PR for #5558, awaiting review here: #6167. I thought the changes were quite straightforward but if anyone sees any potential issues, LMK.

#5411, like some of the old issues should be a documentation change, so I expect most of the time to be on reviewing, as documentation needs to be perfect.

Currently, I have PR #6165 open as well, just waiting on a review from Jan, so we'll see how that goes.

Of course, if I manage to finish most of my tasks, I'll be adding in more work from my proposal/data.table issues. Cheers!

@joshhwuu
Copy link
Owner Author

joshhwuu commented Jun 10, 2024

@tdhock @Anirban166

June 10, 2024 (Week 3)

Now that we're coming down to the last week of phase 1 of my proposed project, I have a few more issues to submit PRs for: #981 and #5411. I also recently added #5409 to work on as well, so we'll see how well the prior two issues go before I start on that one.

I have a few open PRs still, needing some reviews but those can be worked on as the summer progresses as well, just hopefully we'll merge some in the coming days so there's less merge conflicts with new PRs.

@joshhwuu
Copy link
Owner Author

Update on blog! @tdhock @Anirban166

https://joshhwuu.github.io/2024-06-16-GSoC-Wk3/

@joshhwuu
Copy link
Owner Author

joshhwuu commented Jun 17, 2024

@tdhock @Anirban166

June 17, 2024 (Week 4)

Today I opened a new PR for #4280, and will be looking to wrap that up with pending reviews. Additionally, I'm going to be looking at merging open PRs, most notably #6167 and #6165, whenever Michael and Jan get a chance to take a look. As for other open PRs, it seems that everyone agrees #6175 is unnecessary so I'll potentially be adding some documentation to point users in the right direction if they ever need something similar. #6158 has a big issue currently, which is that fread simply shouldn't (IMO) be able to skip beginning empty lines, as it is important that the first line read determines how columns are made. If we skip the beginning lines, the resulting data.table is always single-columned. Will follow up with Ben sometime to see if he agrees, and if my assumption is correct we can probably close that issue as well.

Something bigger I'll be working on:

Although this wasn't in my initial proposal, Issue #5409 is a very good example of why assign.c's internal SEXP assign should be refactored, as Ben mentioned here. IMO, the behavior of set and := should be the same, but currently both of them call very different parts of SEXP assign, causing slight differences such as the one outlined in Issue #5409. It would help to refactor this file, but seeing that it's a very long file I'll be working on this slowly as I proceed with other issues as well. Although I think assign is very well tested so refactoring should be pretty safe.

@joshhwuu
Copy link
Owner Author

Hey @Anirban166 @tdhock

Just a slight heads up that I'll be taking a final exam for my summer course on the next coming Monday, so in case there's not too much progress on current tasks that is why. However I'm feeling quite prepared for the exam so I don't think I'll be spending a whole lot of time on studying. Just a note that this is my only course this summer as well so I'll be done with everything after Monday, thanks!

@Anirban166
Copy link

Hey @Anirban166 @tdhock

Just a slight heads up that I'll be taking a final exam for my summer course on the next coming Monday, so in case there's not too much progress on current tasks that is why. However I'm feeling quite prepared for the exam so I don't think I'll be spending a whole lot of time on studying. Just a note that this is my only course this summer as well so I'll be done with everything after Monday, thanks!

All the best!

@joshhwuu
Copy link
Owner Author

@tdhock @Anirban166

June 25 (Week 5)

This week I'm going to be working on prototypes for adding a progress bar for large by operations, I'm expecting this to be on the heavier end. Jan said that we should test revdeps for PR 6165, but I noticed in .dev/revdep.R that this script is meant to be run by the package maintainer, and there aren't instructions to run the script so I am looking for advice on how to check those for this change. I added a new PR 6204, pending review. As of right now, PR #6175 doesn't seem like a good idea anymore, so I'm leaning towards closing and following up in the original issue to see how we should proceed. Hoping to make good progress on the bigger issues this week!

@tdhock
Copy link

tdhock commented Jun 26, 2024

Hi Josh, since last year we are no longer using .dev/revdep.R and instead using daily checks of master on the NAU Monsoon super-computer, https://github.com/Rdatatable/data.table/wiki/Release-management-and-revdep-checks which takes ~10 hours walltime instead of about ~20 days if you were to run .dev/revdep.R on your laptop, sequentially checking all 1505+ revdeps.
so we will merge your PR into master first, and then we will see the revdep results the next day on a page like this https://rcdata.nau.edu/genomic-ml/data.table-revdeps/analyze/2024-06-25/

@joshhwuu
Copy link
Owner Author

joshhwuu commented Jul 1, 2024

@tdhock @Anirban166

July 1 (Week 6)

Happy Canada Day! 🍁🎉 This week I'm going to be working on some of the larger issues I mentioned before - these have turned out to be quite difficult so I'm going to be spending a while on them. Other than that, I'm going to be browsing other open issues that I can potentially take on whenever I get the chance.

Also a quick note: I'll be visiting SF with a couple friends during the weekend, and as per GSoC rules, I shouldn't contribute code while in a different country, so I'll be unavailable during the weekend in case of review suggestions. I'll be back in Canada by next Monday however.

@tdhock
Copy link

tdhock commented Jul 3, 2024

great thanks please submit comments and partial PRs so you can get our feedback, even before the larger issues are solved

@joshhwuu

This comment was marked as off-topic.

@joshhwuu

This comment was marked as off-topic.

@tdhock

This comment was marked as off-topic.

@joshhwuu

This comment was marked as off-topic.

@joshhwuu

This comment was marked as off-topic.

@tdhock

This comment was marked as off-topic.

@joshhwuu
Copy link
Owner Author

@Anirban166 @tdhock

July 15 (Week 8)

Hey mentors, I apologize for the slow progress last week, I caught a nasty flu of some sort and had to take a few days to properly rest (Good thing I allotted a lot of time for these larger issues, we are still on track to finish 😆). Now that I'm better, I'm happy to say that I've been making good progress on the progress indicator for by operations, #6228. Alongside this, #6204 looks good to go as well. #6165 was merged so I'll have to take a look at revdeps this week. Still waiting on Michael's review of #6167, but with the new release on the horizon and some R CMD check occasional issues, I imagine this quite hefty PR is going to take a while before going to merge.

@joshhwuu
Copy link
Owner Author

joshhwuu commented Jul 23, 2024

@tdhock @Anirban166

July 23 (Week 9)

This week I'm looking forward to apply reviews and merge some of the open PRs that I have. Other than that, I'm actively seeking some larger issues to work on within the issue tracker, notably either from the 1.16.0 milestone or the most-requested issues list. As per my proposal, I'm looking to completing two more larger issues and I will have completed my goal. Obviously this doesn't mean I have to stop there, as I realize now that I've given myself quite a generous amount of time, so I'm going to do my best to do as much as I can!

@joshhwuu
Copy link
Owner Author

@tdhock @Anirban166

Hi mentors! For the past few weeks, I've been looking for more issues to close, and I've been able to write out some prototypes/proof-of-concepts for quite a few of them. As we are coming close to the release date for 1.16.0, I am a little hesitant to open new PRs, as it seems that it will be a while before they can get reviews and go towards merge. I find that I am more productive when my PRs are reviewed faster, as with GSoC I often work on several PRs at once, and when reviews are (understandably) slower, I tend to sometimes forget things about the changes, causing me to miss things.

Recently, I've been engaging in discussion and investigation regarding changes to test.data.table and its double printing problem and lintr's inefficiency with .Rraw files. I hope to open PRs for these as I have local branches with changes ready. These will likely come out after 1.16.0 release on CRAN (even if that comes past the GSoC deadline). I have also been looking into refactoring some calls that can be simplified via helpers within our R code (centralizing handling of index retrieval, etc). On top of this, I've been also looking at some of the mutation testing results and will help to investigate.

My plan for the last two weeks of GSoC is to submit my remaining changes as PRs after 1.16.0 is live on CRAN and more of the team is available, try to actively participate in discussions and follow up with some old issues that can potentially be closed.

Overall, I've been very happy with the contributions I made, and would like to think that my work has been impactful (awesome to see my name on NEWS.md many times). If any of you have feedback/suggestion for me, please let me know.

Question:

At the end of GSoC, I'd like to formally thank the data.table team and anyone else who has helped out by commenting and reviewing. What would you recommend be the best medium to use to do this? Maybe some sort of email group or potentially a community blog post? It would be a short message though, so I'm not sure what would be best, let me know if you have any ideas. Thanks!

@tdhock
Copy link

tdhock commented Aug 15, 2024

best way to contact the team is via issues, which are typically not used for "thanks" but would be great to see you keep helping on issues/PRs after GSOC is done.
If you write a blog, please post a link on https://github.com/Rdatatable/data.table/wiki/Articles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants