Would you like to exclude or include URLs in your crawl?

If you need to take a closer look at a specific part of your domain, or if you want to exclude certain areas, you can use the following methods:

 

Open the "project settings" in the top right corner, move to tab "Advanced Crawl"

 

1. Subfolder only

cs1.png

With this setting your crawl will only contain data from a certain subfolder (e.g. "/wiki/").
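As a rough illustration of what "subfolder only" means, the following Python sketch restricts URLs to a path prefix. This is only a simplified model of the behaviour (the function name and the prefix check are our own assumptions), not Ryte's implementation:

from urllib.parse import urlparse

def in_subfolder(url, subfolder="/wiki/"):
    # Keep only URLs whose path starts with the chosen subfolder,
    # e.g. https://domain.com/wiki/Some_Page
    return urlparse(url).path.startswith(subfolder)

print(in_subfolder("https://domain.com/wiki/Some_Page"))    # True
print(in_subfolder("https://domain.com/magazine/article"))  # False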

 

2. Blacklist/Whitelist

cs2.png

You can exclude URLs from your crawl by adding blacklist rules. In this example we want to exclude the Magazine and our Wiki, which we do by blacklisting each subfolder. The rule for the Wiki looks like this:

regex:/wiki/

This rule will exclude all URLs containing /wiki/. Please note that it does not take the folder hierarchy into account: for instance, domain.com/wiki/ will be excluded, and so will domain.com/subfolder/wiki/.
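If you are unsure how such an unanchored pattern behaves, the following Python sketch mimics the matching (a hand-rolled illustration only, not Ryte's crawler code):

import re

blacklist = [r"/wiki/"]  # same idea as the rule regex:/wiki/

def is_blacklisted(url):
    # An unanchored pattern matches anywhere in the URL,
    # no matter how deep in the folder hierarchy it appears.
    return any(re.search(pattern, url) for pattern in blacklist)

print(is_blacklisted("https://domain.com/wiki/"))            # True
print(is_blacklisted("https://domain.com/subfolder/wiki/"))  # True
print(is_blacklisted("https://domain.com/magazine/"))        # False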

You can also apply rules at any depth. If you want a specific page to be excluded, a rule can look like this:

regex:https://en.ryte.com/magazine/onpage-becomes-ryte

 

You can add as many rules as you like.

 

Let's move on to the whitelist. It has the same functionality as the blacklist but works the other way around: if you want to apply "include only" rules, you can use the whitelist feature.

cs3.png

In this example we want to crawl ONLY our Magazine and Wiki. We do that by whitelisting each subfolder:

regex:/wiki/

regex:/magazine/

Please note that this will also whitelist domain.com/subfolder/wiki/; you may need to make the rule more specific (regex:https://en.ryte.com/wiki/).
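The difference between the broad rule and the more specific one can again be illustrated with a small Python sketch (our own example, assuming plain regular-expression matching against the full URL):

import re

broad = r"/wiki/"                          # matches anywhere in the URL
specific = r"https://en\.ryte\.com/wiki/"  # only matches the Wiki directly below the domain

for url in ["https://en.ryte.com/wiki/Some_Page",
            "https://en.ryte.com/subfolder/wiki/Some_Page"]:
    print(url)
    print("  broad rule matches:   ", bool(re.search(broad, url)))
    print("  specific rule matches:", bool(re.search(specific, url)))
# The broad rule matches both URLs, the specific rule only the first one.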

 

 

3. Test your settings

It can be very time-consuming to run your rules without testing them first, only to find out once the crawl has finished that they did not work. Please test your settings first in order to get familiar with the syntax.

cs4.png

Here we are testing with our whitelist rules from above, so only URLs from either the Wiki or the Magazine should be crawled.

If we enter a URL that should be excluded and the response status is 9xx (in this example 950), our settings are fine. If the status is still 200, the rules did not work.

We can also test in the other direction by entering a URL that should be included according to the whitelist rules:

cs5.png

If we applied the rules successfully, the test will respond with a status of 200. This only makes sense if we tested excluded URLs first!

Test settings Status Codes:

200 - OK

950 - blocked by whitelist

951 - blocked by blacklist

 

Important: If your test was not successful and you run it again with the same test URL, the result might be cached. Please enter a different URL each time you run a test to avoid cache discrepancies.

You can apply all rules at once (whitelist/blacklist, subfolder, subdomain, etc.). Please make sure that they don't cancel each other out!

 

 

 
