How to Quickly Check the Backlink Profile of Over 5,000 URLs
Migrating to a new Content Management System (CMS) can be difficult, especially if the site being migrated has thousands of duplicated URLs with high-quality backlinks pointing at them. If it's your job to sort it out, you'd better come up with a solution, and hopefully this article will provide a few. I was recently tasked with analysing, or at least finding a way to segment, over 5,000 of these URLs to make the job more manageable. I'll admit I don't have much experience migrating very large websites, so it took me a couple of hours to come up with an idea for how we could go forward. Rather than sit like a lemon all day and ignore the soft 404s that were popping up left, right and centre in Google Webmaster Tools, I came up with a few ideas and leaned on a few industry tools to help.
The first idea I came up with was to use Regular Expressions in Excel to categorise each URL based on the keywords in the URL itself. The idea was to categorise the URLs and write a rule saying that if a URL contains a certain keyword, it should redirect to the most appropriate page, which likely contains the same keyword. I then realised I was doing this the wrong way round: the approach isn't accurate, and you could easily end up redirecting URLs to a page that isn't actually that relevant.
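To make the categorisation idea concrete, here's a minimal Python sketch of it. The category names and keyword patterns are purely hypothetical examples; a real list would come from the client's site structure.

```python
import re

# Hypothetical keyword-to-category rules; in practice these would be
# built from the client's actual site taxonomy.
CATEGORY_PATTERNS = {
    "shoes": re.compile(r"shoe|trainer|sneaker", re.I),
    "shirts": re.compile(r"shirt|tee", re.I),
}

def categorise(url):
    """Return the first category whose keyword pattern matches the URL."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(url):
            return category
    return "uncategorised"

urls = [
    "https://example.com/mens-running-shoes-blue",
    "https://example.com/casual-shirt-sale",
    "https://example.com/about-us",
]
for url in urls:
    print(url, "->", categorise(url))
```

This is exactly the weakness described above: `/mens-running-shoes-blue` matches "shoes", but nothing guarantees the shoes category page is actually the most relevant target for that specific URL.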
So what should I do instead?
Seeing as this client's site is quite large, it was worth checking the backlink profile of each individual URL we grabbed from Webmaster Tools (over 5,000 of them). That's easier said than done if you don't have the right tools. Luckily for me, the agency I work at has all the tools you could want for this.
At first I tried using ScrapeBox on one of our VPSs to scour the backlink profiles of all 5,000 URLs via the Mozscape API. I left this running overnight, but when I got home I looked for other solutions, as I knew Moz's index isn't that large and I might not get all the backlink data as a result. I found that MajesticSEO has a tool called the "Bulk Backlink Checker". At this point I celebrated, as I've found MajesticSEO to be arguably one of the best backlink-checking tools in the SEO industry. I simply pasted all 5,000 URLs into a .txt file, uploaded it to the Bulk Backlink Checker, and it completed the search for all 5,000 URLs in around 30 seconds using Majestic's Fresh Index. I later ran the same URLs through Majestic's Historic Index, which also took well under 30 seconds. I was quite impressed with that, as ScrapeBox on the VPS was still chewing through those 5,000 URLs from the day before. You can't really blame ScrapeBox, however, as the Mozscape API limits it to one request every 10 seconds.
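The rate limit alone explains the overnight run. Back-of-the-envelope, at one request every 10 seconds, 5,000 URLs take the best part of 14 hours:

```python
urls = 5000
seconds_per_request = 10  # the Mozscape API rate limit mentioned above

total_seconds = urls * seconds_per_request
print(total_seconds)           # 50,000 seconds in total
print(total_seconds / 3600)    # roughly 14 hours of wall-clock time
```

Against that, Majestic's 30-second bulk lookup is a different class of tool for this job.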
The results of using MajesticSEO's Bulk Backlink Checker
The results came back, and of those 5,000 URLs, 250 actually had backlinks in Majestic's Fresh Index. After deleting all the duplicate URLs I was left with only 97, which made my job much more manageable. I did, however, have to manually check each of those 97 URLs to decide where a redirect would benefit the client the most (i.e. the most relevant pages). I also found what I like to call a "golden nugget": one of the 97 URLs had links from The Guardian, 11 links from a university website, and a few other popular industry-related websites. As noted above, I also ran the URLs through Majestic's Historic Index, which surfaced an additional 142 URLs with backlinks.
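I did the deduplication in a spreadsheet, but the same step is a few lines of Python. This sketch preserves the original order and treats trivial variants (trailing slashes, letter case) as duplicates; adjust the normalisation to taste:

```python
def dedupe(urls):
    """Remove duplicate URLs while preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        # Light normalisation so trivial variants count as duplicates.
        key = url.strip().rstrip("/").lower()
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

sample = [
    "https://example.com/page",
    "https://example.com/page/",   # trailing-slash duplicate
    "https://example.com/other",
]
print(dedupe(sample))
```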
There's still more work to be done, as we'd like to reduce the number of soft 404s further. There's also the possibility that MajesticSEO hasn't found backlinks for some URLs that do in fact have them.
This brings me back to the second paragraph: we'll segment the rest of the URLs by categorising them and possibly writing blanket 301 rules pointing to what we think are the most relevant pages.
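As a rough sketch of what those blanket rules might look like, here's a snippet that generates Apache-style `RedirectMatch` lines from a keyword-to-destination mapping. The keywords and destination paths are hypothetical placeholders, not the client's actual categories:

```python
# Hypothetical keyword -> destination mapping for blanket 301 rules.
RULES = {
    "shoes": "/category/shoes/",
    "shirts": "/category/shirts/",
}

def redirect_line(keyword, destination):
    """Build one Apache RedirectMatch rule: any old URL containing
    the keyword is 301-redirected to the chosen category page."""
    return f"RedirectMatch 301 ^/.*{keyword}.* {destination}"

for keyword, destination in RULES.items():
    print(redirect_line(keyword, destination))
```

The usual caveat from earlier applies: blanket rules trade accuracy for speed, which is why the 97 URLs with real backlinks were mapped by hand first.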
If anyone has better and easier solutions, I'm all ears. Also, Cyrus Shepard wrote quite the article on this very subject that helped me a lot, so I have to link to it. The comments section is doubly useful.