07-23-2016, 09:03 PM
Post: #1
Webharvy and Regex: How to scrape .jpg links only
Hi everybody, thanks in advance for your support.

I'm trying to scrape images from banggood.com, and I'd like to know the regex to scrape only .jpg links from the HTML code (I don't want to download the images, just collect their links).

Thank you very much!
07-24-2016, 02:39 AM
Post: #2
RE: Webharvy and Regex: How to scrape .jpg links only
I haven't used WebHarvy much (or in several years, for that matter), but I can point you in the right direction in terms of regex.

Here is a PHP-style regex. You can see the structure if you look closely:

/^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:\/[^\/#?]+)+\.(?:jpg|gif|png)$/

Reading left to right: https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6} matches the scheme and whatever domain you are on, (?:\/[^\/#?]+)+ matches the path following the domain, and \.(?:jpg|gif|png) matches the file extension. I left all the image types in there so you can get an idea of what the regex is doing (just remove the "|gif|png" and you'll be good).
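
To sanity-check the pattern outside the scraper, here is a minimal Python sketch. It assumes the .jpg-only variant described above, drops the PHP delimiters and slash escapes (Python doesn't need them), and adds case-insensitive matching, which is not in the original string. The function name is_jpg_url and the example.com URLs are made up for illustration.

```python
import re

# .jpg-only variant of the anchored pattern from this post.
# re.IGNORECASE is an added assumption so that ".JPG" also matches.
JPG_URL = re.compile(
    r"^https?://(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^/#?]+)+\.jpg$",
    re.IGNORECASE,
)


def is_jpg_url(url: str) -> bool:
    """Return True when the whole string is an absolute http(s) URL ending in .jpg."""
    return JPG_URL.match(url) is not None


# Hypothetical test URLs, not taken from any real site.
candidates = [
    "http://img.example.com/thumb/large/item_01.jpg",
    "https://example.com/photo.png",
    "https://example.com/photo.jpg?size=2",
    "/images/photo.jpg",
]

for url in candidates:
    print(url, is_jpg_url(url))
```

Because of the ^ and $ anchors, this only accepts a string that is nothing but the URL. A link with a query string, a relative path, or a URL buried inside a longer chunk of HTML will be rejected.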

If WebHarvy doesn't like that string (regex variations don't always work in scrapers), try this one. It's a stripped-down version, allowing any URL/domain and any of the three image extensions.

https?:[/.\w\s-]*\.(?:jpg|gif|png)

The lack of structure casts a broader net: you could point this thing at yourwebsite.somehugefakefuckingdomainname and it will run it through. It's not technically the best for scraping (which really does require consistency), but it'll grab your images anywhere you let it roam, so make sure to set your external depth so that it stays within your domain. This one could feasibly jump off your domain if you have images hosted elsewhere (like on a CDN) and just keep on moving lol.
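
As a rough illustration of the "broad net" approach and of keeping it on your own domain, here is a small Python sketch. It uses a tidied, unanchored .jpg pattern rather than the exact string above, and the helper jpg_links, its allowed_host parameter, and the sample HTML are all hypothetical. The host filter is a stand-in for the scraper's depth setting, not how WebHarvy itself does it.

```python
import re
from urllib.parse import urlparse

# Loose, unanchored pattern: any http(s) URL ending in .jpg.
# A tidied variant for illustration, not the exact string from the post.
LOOSE_JPG = re.compile(r"https?://[\w./%-]+\.jpg", re.IGNORECASE)


def jpg_links(html, allowed_host=None):
    """Collect unique .jpg URLs from html, optionally limited to one domain."""
    links = []
    for url in LOOSE_JPG.findall(html):
        host = urlparse(url).hostname or ""
        on_domain = (
            allowed_host is None
            or host == allowed_host
            or host.endswith("." + allowed_host)
        )
        if on_domain and url not in links:
            links.append(url)
    return links


# Made-up HTML: one on-domain image (linked twice), one CDN image, one .png.
sample_html = """
<img src="https://img.example.com/thumb/item_01.jpg">
<img src="https://cdn.example.net/large/item_02.jpg">
<img src="https://img.example.com/icons/logo.png">
<a href="https://img.example.com/thumb/item_01.jpg">zoom</a>
"""

print(jpg_links(sample_html))                 # every .jpg link found
print(jpg_links(sample_html, "example.com"))  # only links on example.com
```

Without the host filter the CDN image comes along too, which is the "jumping off your domain" behavior described above.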

If NEITHER of those works, hit me up and I'll reinstall WebHarvy and take a look for/with you.
07-25-2016, 01:12 AM
Post: #3
RE: Webharvy and Regex: How to scrape .jpg links only
Thanks a lot, man! Repped!



