- 4-5 brain cells
- Ready in 10 min
- 1 (domain) Serving
- 220 cals
At some point in your QA life, you will have to search for old and forgotten pages with errors. If you have experience with this and you know your website really well, it’s going to be the most boring thing in your life. On the other hand, if you are new to this… you will not know where to start. When I faced that problem for the first time, I didn’t have the time to build a script, so… go figure…
Let’s jump in, straight away.
A sitemap is an XML file that contains all the URLs that search bots are going to visit. A sitemap’s location is (almost) always this: http(s)://www.mydomain.com/sitemap.xml
Because I don’t want to “abuse” any of the available sitemaps out there, we are going to create our own. So… open a “text editor” aka notepad and paste the following:
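For example (these are the URLs whose statuses show up later in this post; any list of pages works just as well):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.google.com/</loc></url>
  <url><loc>http://google.com/abc</loc></url>
  <url><loc>https://www.amazon.com</loc></url>
  <url><loc>https://www.etsy.com/</loc></url>
  <url><loc>https://www.ebay.com/</loc></url>
  <url><loc>https://github.com/</loc></url>
  <url><loc>https://www.youtube.com/watch?v=aaaaaaaaaaa</loc></url>
</urlset>
```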
Save it as sitemap.xml
HTTP statuses (basics of basics)
200 – OK
301 – Moved permanently
400 – Bad request
401 – Unauthorized
403 – Forbidden
404 – Not found
500 – Internal server error
502 – Bad gateway
503 – Service unavailable
You can read about the HTTP statuses here: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
In this tutorial I’m going to create a script that handles the basic statuses. Obviously a proper script should handle all the available statuses, but I’m not going to do that. YOU can do it! Be my hero!
Create a PHP file and paste the following:
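Something along these lines (a sketch using DOMDocument; the setup block at the top just makes the snippet runnable on its own):

```php
<?php
// (Setup so this snippet runs standalone: write the sitemap from the
//  previous step to disk, first few URLs only. Skip this if you already
//  saved sitemap.xml next to this script.)
file_put_contents('sitemap.xml', <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.google.com/</loc></url>
  <url><loc>http://google.com/abc</loc></url>
  <url><loc>https://www.amazon.com</loc></url>
</urlset>
XML);

// The script itself: load the sitemap into a DOM document
// and print every URL it contains.
$doc = new DOMDocument();
$doc->load('sitemap.xml');

$urls = [];
foreach ($doc->getElementsByTagName('loc') as $loc) {
    $urls[] = $loc->nodeValue;   // each <loc> element holds one URL
    echo $loc->nodeValue . "\n";
}
```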
Run it! And boom! All the URLs from the XML file are on your screen. So what did we do? We created a new DOM document and loaded the XML file into it. “DOMDocument” gives us some methods that make our life easier, for example “getElementsByTagName”, which finds all elements with the given tag name, in our case “loc” (location).
Yes, we could use regular expressions, and yes, I like them more, but the problem is that you would have to write separate patterns for regular and minified XML.
So let’s move on!
We have the URLs and now we have to check their status. To do so we have to add the following lines:
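Roughly these (the exact lines are my sketch; get_headers() is the real PHP function that fetches a URL’s response headers):

```php
<?php
// The added lines, inside the loop over the <loc> elements:
//
//     $headers = get_headers($url);     // makes an HTTP request
//     echo $url . $headers[0] . "\n";   // cell 0 is the status line
//
// get_headers() needs a live network, so here the same "keep the first
// cell" step is demonstrated on a canned array shaped like its result:
$url = 'https://www.example.com';
$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html', 'Server: nginx'];
echo $url . $headers[0] . "\n";
```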
We call “get_headers” on the given URL, keep the first cell and then print it on the screen. Why do we print the first cell? Because “get_headers” returns this:
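An array of all the response headers; the exact entries vary by server, but it always looks something like this, with the status line in cell 0:

```
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Content-Type: text/html; charset=UTF-8
    [2] => Server: nginx
    [3] => Content-Length: 1256
)
```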
So now if you run the script, you will get this:
https://www.google.com/HTTP/1.0 302 Found
http://google.com/abcHTTP/1.0 404 Not Found
https://www.amazon.comHTTP/1.1 200 OK
https://www.etsy.com/HTTP/1.0 200 OK
https://www.ebay.com/HTTP/1.0 302 Moved Temporarily
https://github.com/HTTP/1.1 200 OK
https://www.youtube.com/watch?v=aaaaaaaaaaaHTTP/1.0 301 Moved Permanently
So far so good. Not user-friendly, but it works. Developers don’t need to print/echo every result; they trust their code and can use counters. Let’s use counters.
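Something like this (my sketch; the canned status lines stand in for the real get_headers() results so the example runs offline):

```php
<?php
// Count how many URLs returned each status instead of echoing them all.
$counts = [];

// In the real script each status line is get_headers($url)[0];
// these canned ones keep the example offline.
$statusLines = [
    'HTTP/1.0 302 Found',
    'HTTP/1.0 404 Not Found',
    'HTTP/1.1 200 OK',
    'HTTP/1.0 200 OK',
    'HTTP/1.0 301 Moved Permanently',
];

foreach ($statusLines as $line) {
    $code = explode(' ', $line)[1];          // second word is the code
    $counts[$code] = ($counts[$code] ?? 0) + 1;
}

foreach ($counts as $code => $count) {
    echo $code . ': ' . $count . "\n";
}
```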
So now you have the numbers and you have the statuses. It’s still not user-friendly, so all that’s left is to add some CSS or at least put the results in an HTML table.
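For example, a bare-bones table built from the counters (again a sketch, with made-up counts):

```php
<?php
// Hypothetical counts from the previous step: status code => occurrences.
$counts = [200 => 2, 302 => 1, 404 => 1];

// Build a minimal HTML table instead of raw echo lines.
$html = "<table>\n  <tr><th>Status</th><th>Count</th></tr>\n";
foreach ($counts as $code => $count) {
    $html .= "  <tr><td>$code</td><td>$count</td></tr>\n";
}
$html .= "</table>\n";
echo $html;
```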