= scRUBYt! Changelog
== 0.4.3
=== 23rd April
- [NEW] option to close the firefox window after the scraping is finished (thanks to Mikkel Garcia and Damien Garros)
- [NEW] Added the ability to scrape, click_link, then scrape again. (Only for firefox agent) (thanks to Mikkel Garcia)
- [FIX] scRUBYt! now works with latest version of mechanize (thanks to nesquena, austinmoore and Leandro Nunes)
- [NEW] added a wrapper around the firewatir requirement to make firewatir optional
- [FIX] added test to prohibit traverse_from_match from attempting to traverse nil children (thanks to Dennis Sutch)
== 0.4.05
=== 20th October
=<tt>changes:</tt>
- [NEW] possibility to use FireWatir as the agent for scraping (credit: Glenn Gillen, Glen Gillen and... did I mention Glenn already?)
- [FIX] navigation doesn't crash if a 404/500 is returned (credit: Glenn Gillen)
- [NEW] navigation action: click_by_xpath to click arbitrary elements
- [MOD] dropped dependencies: RubyInline, ParseTree, Ruby2Ruby (hooray for win32 users)
- [NEW] scraping through frames (e.g. google analytics)
- [MOD] exporting temporarily doesn't work - for now, generated XPaths are printed to the screen
- [MOD] possibility to wait after clicking link/filling textfield (to be able to scrape inserted AJAX stuff)
- [NEW] possibility to fetch from a string, by specifying nil as the url and the html string with the :html option
- [FIX] firewatir slowness (credit: jak4)
- [FIX] lots of bugfixes and stability fixes
== 0.4.0 (unofficial)
=== 31st October, 2007
=<tt>changes:</tt>
- [NEW] possibility to define a default value for patterns
- [MOD] rewrite of to_flat_xml to a more robust algorithm
- [NEW] find_string method in text pattern; return the string if it's present in the input
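To illustrate the last two ideas (a default value, and returning a string only when it occurs in the input), here is a minimal plain-Ruby sketch. It does not use scRUBYt! itself: only the name find_string comes from the entry above, while the signature and the default argument are assumptions made for this example.

```ruby
# Hypothetical helper, not scRUBYt!'s actual implementation.
# Returns the needle itself when it occurs in the input, otherwise
# the supplied default (nil unless one is given).
def find_string(input, needle, default = nil)
  input.include?(needle) ? needle : default
end

puts find_string('Price: 42 USD, in stock', 'in stock')   # prints "in stock"
puts find_string('Price: 42 USD', 'in stock', 'unknown')  # prints "unknown"
```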
== 0.3.4
=== 26th September, 2007
=<tt>changes:</tt>
It seems I have been too busy to update the changelog ;)
== 0.3.1
=== 29th May, 2007
=<tt>changes:</tt>
[NEW] complete rewrite of the output system, creating
a solid foundation for more robust output functions
(credit: Neelance)
[NEW] logging - no annoying puts messages anymore! (credit: Tim Fletcher)
[NEW] can index an example - e.g.
  link 'more[5]'
semantics: give me the 6th element with the text 'more'
[NEW] can use XPath checking an attribute value, like "//div[@id='content']"
[NEW] default values for missing elements (first version was done in 0.2.8
but it did not work for all cases)
[NEW] possibility to click a button by its text (instead of its index)
(credit: Nick Merwin)
[NEW] clicking radio buttons
[NEW] can click on image buttons (by specifying the name of the button)
[NEW] possibility to extract a URL in one step, like so:
  link 'The Difference/@href'
i.e. give me the href attribute of the element matched by the example 'The Difference'
[NEW] new way to match an element of the page:
div 'div[The Difference]'
means 'return the div which contains the string "The Difference"'. This is
useful if the XPath of the element is non-constant across the same site
(e.g. sometimes a banner or ad is added, sometimes not)
[NEW] Clicking image maps; At the moment this is achieved by specifying an
index, like
click_image_map 3
which means click the 4th link in the image map
[FIX] Replacing \240 (&nbsp;) with space in the preprocessing phase
automatically
[FIX] Fixed: correctly downloading image if the src
attribute had a leading space, as in
<img src=' /files/downloads/images/image.jpg'/>
[FIX] Other misc fixes - a ton of them!
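The three example notations introduced in this release ('more[5]', 'The Difference/@href' and 'div[The Difference]') can be told apart with a few regular expressions. The following is only a sketch of that idea in plain Ruby: parse_example and the hash layout it returns are hypothetical names chosen for illustration, not part of scRUBYt!.

```ruby
# Illustrative parser for the example notations listed above.
# The method name and the returned hash layout are assumptions.
def parse_example(example)
  case example
  when %r{\A(.+)/@(\w+)\z}   # 'The Difference/@href': text plus attribute
    { text: $1, attribute: $2 }
  when /\A(\w+)\[(\d+)\]\z/  # 'more[5]': text plus zero-based index
    { text: $1, index: $2.to_i }
  when /\A(\w+)\[(.+)\]\z/   # 'div[The Difference]': tag containing a string
    { tag: $1, contains: $2 }
  else                       # anything else is a plain text example
    { text: example }
  end
end

p parse_example('more[5]')               # text "more", index 5 (the 6th match)
p parse_example('The Difference/@href')  # text "The Difference", attribute "href"
p parse_example('div[The Difference]')   # tag "div", contains "The Difference"
```

Note that in this sketch a purely numeric bracket such as 'div[5]' is always read as an index, which is a simplification.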
== 0.2.7
=== 12th April, 2007
=<tt>changes:</tt>
[NEW] download pattern: download the file pointed to by the
parent pattern
[NEW] checking checkboxes
[NEW] basic authentication support
[NEW] possibility to resolve relative paths against a custom url
[NEW] first simple version of to_csv and to_hash
[NEW] complete rewrite of the exporting system (Credit: Neelance)
[NEW] first version of smart regular expressions: they are constructed
from examples, just as regular expressions (Credit: Neelance)
[NEW] Possibility to click the n-th link
[FIX] Clicking on links using scRUBYt's advanced example lookup
[NEW] Forcing writing text of non-leaf nodes with :write_text => true
[NEW] Possibility to set custom user-agent; Specified default user agent
as Microsoft IE6
[FIX] Fixed crawling to detail pages in case of leaving the
original site (Credit: Michael Mazour)
[FIX] fixing the '//' problem - if the relative url contained two
slashes, the fetching failed
[FIX] scrubyt assumed that documents have a list of nested elements
(Credit: Rick Bradley)
[FIX] crawling to detail pages works also if the parent pattern is
a string pattern
[FIX] shortcut url fixed again
[FIX] regexp pattern fixed in case its parent was a string
[FIX] refactoring the core classes, lots of bugfixes and stabilization
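Resolving relative paths against a custom url boils down to joining a base URL with the scraped path. A minimal sketch using Ruby's standard uri library follows; the resolve helper and the example.com addresses are made up for illustration and say nothing about how scRUBYt! implements this internally.

```ruby
require 'uri'

# Hypothetical helper: resolve a scraped href against a caller-supplied
# base URL. strip tolerates stray whitespace around the href.
def resolve(href, base_url)
  URI.join(base_url, href.strip).to_s
end

puts resolve('detail/1.html', 'http://example.com/books/')  # prints "http://example.com/books/detail/1.html"
puts resolve(' /img/a.jpg', 'http://example.com/books/')    # prints "http://example.com/img/a.jpg"
```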
== 0.2.6
=== 22nd March, 2007
The mission of this release was to add even more powerful features,
like crawling to detail pages or compound example specification,
as well as fixing the most frequently popping-up bugs. Scraping
of concrete sites is more and more frequently the cause for new
features and bugfixes, which in my opinion means that the
framework is beginning to make sense: from a shiny toy which
looks cool and everybody wants to play with, it is moving
towards a tool which you reach for if you seriously want
to scrape a site.
The new stuff in this release is 99% scraping related - if
you are looking for new features in the navigation part,
probably the next version will be for you, where I will
concentrate more on adding new widgets and possibilities
to the navigation process. Firewatir integration is very
close, too - perhaps already the next release will
support FireWatir navigation!
=<tt>changes:</tt>
* [NEW] Automatically crawling to and extracting from detail pages
* [NEW] Compound example specification: So far the example of a pattern had to be a string.
Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
* [NEW] More sophisticated example specification: Possible to use regexp as well, and need not
(but still possible of course) to specify the whole content of the node - nodes that
contain the string/match the regexp will be returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before it did not make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
some_url 'href', :type => :attribute
* [FIX] Crawling to the next page (the broken google example): if the next
link text is not an <a>, traverse down until the <a> is found; if it is
still not found, traverse up until it is found
* [FIX] Crawling to next pages does not break if the next link is greyed out
(or otherwise present but has no href attribute) (Credit: Robert Au)
* [FIX] DRY-ed next link lookup - it should be much more robust now as it uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new features
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and stabilization
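The compound example specification above can be pictured as a predicate over a node's text. The sketch below is plain Ruby and independent of scRUBYt!: matches_example? is a hypothetical name, and only the :contains and :begins_with keys are taken from the changelog entry.

```ruby
# Hypothetical matcher, not scRUBYt!'s actual code.
# A String matches when the text contains it, a Regexp when it matches
# anywhere in the text, and a Hash when every constraint holds.
def matches_example?(text, example)
  case example
  when String then text.include?(example)
  when Regexp then !!(text =~ example)
  when Hash
    example.all? do |constraint, value|
      case constraint
      when :contains    then matches_example?(text, value)
      when :begins_with then text.start_with?(value)
      else raise ArgumentError, "unknown constraint: #{constraint}"
      end
    end
  else false
  end
end

example = { :contains => /\d\d-\d/, :begins_with => 'Telephone' }
puts matches_example?('Telephone: 55-1234', example)  # prints "true"
puts matches_example?('Fax: 55-1234', example)        # prints "false"
```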
== 0.2.3
=== 20th February, 2007
Thanks to the feedback from all of you, I managed to find a lot of bugs as well as write up a nice feature request list. The bugs are mostly fixed and also some shiny new features have been added. Stability was also improved by adding new tests and totally refactoring the whole code.
The new features make this release much more powerful than the previous one. Sites requiring login, submitting forms with button click, filling text areas, dealing with variable-size results, smart handling of attribute lookup, https, custom proxy setting and tons of bugfixes make this release capable of doing much more than was possible in 0.2.0.
I have also added some shiny new examples - scraping reddit, del.icio.us, rubyforge login and wordpress automatic commenting, for example.
=<tt>changes:</tt>
* [FIX] Cookies (and other stuff) are now taken into consideration
* [NEW] select_indices feature. Example:
    table do
      (row '1').select_indices(:last)
    end
this will select only the last row;
possibility to specify a Range, or an array of indices, or other
constants like :first, :every_odd etc. More to come in the future!
* [FIX] digg.com next page problem fixed
* [FIX] Fetching of https sites
* [FIX] Next page works incorrectly when given an absolute path
* [FIX] Fixing exporting if the pattern parameters are parenthesized
* [NEW] Possibility to submit forms by clicking a button
* [NEW] Added new unit test suite: pattern_test
* [NEW] Possibility to set a proxy for fetching the input document
* [NEW] Added possibility to choose an option from a selection list (Credit: Zaheed Haque)
* [FIX] Image pattern example lookup fix
* [NEW] Possibility to prefilter the document before passing it to Hpricot (Credit: Demitrious Kelly)
* [FIX] corrected gem dependencies (Credit: Tim Fletcher)
* [FIX] remove duplicates only if there are more examples present
* [NEW] new examples: wordpress comment (Credit: Zaheed Haque), rubyforge login, del.icio.us, reddit and more
* [FIX] if there is no scraper defined, exit with a message rather than raise an exception
* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)
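To make the select_indices selectors more concrete, here is a rough stand-in that works on a plain Array rather than on scRUBYt! patterns. It is a sketch under assumptions: the free-standing method signature is invented for this example, and :every_odd is interpreted here as "elements at odd zero-based indices", which may differ from the library's meaning.

```ruby
# Hypothetical stand-in for select_indices, operating on a plain Array.
# ASSUMPTION: :every_odd keeps elements at odd zero-based indices.
def select_indices(items, selector)
  case selector
  when :first     then items.first(1)
  when :last      then items.last(1)
  when :every_odd then items.select.with_index { |_, i| i.odd? }
  when Range      then items[selector] || []
  when Array      then items.values_at(*selector).compact
  else raise ArgumentError, "unsupported selector: #{selector.inspect}"
  end
end

rows = %w[row0 row1 row2 row3 row4]
puts select_indices(rows, :last).join(', ')       # prints "row4"
puts select_indices(rows, 1..2).join(', ')        # prints "row1, row2"
puts select_indices(rows, [0, 4]).join(', ')      # prints "row0, row4"
puts select_indices(rows, :every_odd).join(', ')  # prints "row1, row3"
```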
== 0.2.0
=== 30th January, 2007
The first ever public release, 0.2.0 is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet the release which you just pull out of the box and it works under any circumstances - however, the major bugs are fixed and the whole stuff is in a good-enough(TM) state, I guess.
=<tt>changes:</tt>
* better form detection heuristics
* report message if there are absolutely no results
* lots of bugfixes
* fixed amazon_data.books[0].item[0].title[0] style output access
and implemented it correctly in case of crawling as well
* /body/div/h3 not detected as XPath
* crawling problem (improved heuristics of url joining)
* fixed blackbox test runner - no more platform dependent code
* fixed exporting bug: swapped exported XPaths in the case of no example present
* fixed exporting bug: capturing \W (non-word character) after the pattern name; this way we can distinguish pattern names where one
name is substring of the other
* Evaluation stops if the example was not found - but not in the case
of next page link lookup
* google_data[0].link[0].url[0] style result lookup now works in the
case of more documents, too
* tons of other bugfixes
* overall stability fixes
* more blackbox tests
* more examples
* overall stability fixes
== 0.1.9
=== 28th January, 2007
This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in, now a testing phase (with light bugfixing :-) will follow, then 0.2.0 will be released.
=<tt>Changes</tt>:
* Possibility to specify multiple examples (hence a pattern can have more filters)
* Enhanced heuristics for example text detection
* First version of algorithm to remove dupes resulting from multiple examples
* empty XML leaf nodes are not written
* new examples
* TONS of bugfixes
== 0.1
=== 15th January, 2007
First pre-alpha (non-public) release
This release was made more for myself (to try and test rubyforge, gems, etc) rather than for the community at this time.
Fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.
* Navigation:
* fetching pages
* clicking links
* filling input fields
* submitting forms
* automatically passing the document to the scraping
* both files and http:// support
* automatic crawling
* Scraping:
* Fairly powerful DSL to describe the full scraping process
* Automatic navigation with WWW::Mechanize
* Automatic scraping through examples with Hpricot
* automatic recursive scraping through the next button
=<tt>changes:</tt>
* [FIX] cookies (and other stuff) are now taken into consideration
* [FIX] digg.com next page problem fixed
* [FIX] fetching of https sites
* [FIX] Next page works incorrectly when given an absolute path
* [FIX] Fixing exporting if the pattern parameters are parenthesized
* [NEW] Possibility to submit forms by clicking a button
* [NEW] Added new unit test suite: pattern_test
* [NEW] Possibility to set a proxy for fetching the input document
* [NEW] Added possibility to choose an option from a selection list
* [NEW] select_indices feature. Example:
table do
(row '1').select_indices(:last)
end
this will select only the last row;
possibility to specify a Range, or an array of indices, or other
constants like :first, :every_odd etc. More to come in the future!
* [FIX] Image pattern example lookup fix
* [FIX] corrected gem dependencies (thanks to Tim Fletcher)
* [FIX] remove duplicates only if there are more examples present
* [NEW] new examples: gmail login, wordpress comment, del.icio.us, grab_rows (showcasing select_indices)
* [FIX] if there is no scraper defined, exit with a message rather than
raise an exception
* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)