
New ignoreset for LiveJournal crud #198

Open
Asparagirl opened this issue Dec 8, 2015 · 1 comment

@Asparagirl
Contributor

This is a partial (!) alphabetical list of the kinds of ad and tracking domains one might find on even a simple, not-very-highly-trafficked LiveJournal community. It would be great to have a new ArchiveBot ignoreset created for all this kind of crud.

acuityplatform.com
ad.rambler.ru
ad.turn.com
api.plus1.wapstart.ru
autocontext.begun.ru
awaps.yandex.ru
begun-sync.rutarget.ru
c.betrad.com
casalemedia.com
choices-or.truste.com
counter.rambler.ru
data.repaynik.com
doubleclick.net
doubleverify.com
dsp.adviator.com
dsum.casalemedia.com
exch.quantserve.com
googletagservices.com
gum.criteo.com
i.ctnsnet.com
imrk.net
mc.yandex.ru
montblanc.rambler.ru
muser.r24-tech.com
optimized-by.rubiconproject.com
ox-d.ad.net
pix04.revsci.net
pixel.quantcount.com
pixel.yabidos.com
pr-bh.ybp.yahoo.com
profile.begun.ru
rtax.criteo.com
s.uuidksinc.net
simage2.pubmatic.com
ssp.adriver.ru
st.top100.ru
static.doubleclick.net
sync.madnetx.com
sync.rambler.ru
tap.rubiconproject.com
tns-counter.ru
tpc.googlesyndication.com
ums.adtechus.com
uptolike.com
us-u.openx.net
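For reference, ArchiveBot ignore sets are JSON files containing a list of URL regular expressions; a set covering a few of the domains above might look like the following sketch (the file name "livejournal-ads" and the exact pattern style are illustrative assumptions, not an existing set):

```json
{
  "name": "livejournal-ads",
  "patterns": [
    "^https?://([^/]+\\.)?doubleclick\\.net/",
    "^https?://counter\\.rambler\\.ru/",
    "^https?://mc\\.yandex\\.ru/",
    "^https?://tap\\.rubiconproject\\.com/"
  ]
}
```

Dots are escaped so each pattern matches the literal hostname, and the optional `([^/]+\.)?` prefix also catches subdomains such as static.doubleclick.net.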

@hannahwhy
Member

Hmm.

I agree that these sorts of sites slow crawling, but on the other hand they are part of the page as the crawler saw it, and there's an argument that they should remain in the page.

(On the other other hand, there's the possibility of stuff like malvertising.)

I don't think it'd hurt to make an ignore set that contains these domains, but I feel like it's something that should be applied as a last resort. (The extra load these domains put on the grab could also be addressed by improving grab speed.)

4 participants