Can Google crawl PDF files?

Hands-on tips: Use PDFs optimally for your own SEO strategy

Once a PDF file has been created, every element (headline, images, text) always stays in the same position, regardless of the format the PDF was generated from. This article gives you tips on handling PDFs and shows how to make the best use of them in your SEO strategy.

How does Google deal with PDFs?

For highly competitive keywords, PDFs rarely appear in the top 10 search results. Technically, however, Google does not differentiate between an HTML page and a PDF document. Instead, the search engine focuses on presenting the user with the best search result.

Texts: Google can index PDFs in any language and character encoding, provided the document is not password-protected or encrypted. Text embedded as images is partially processed with OCR algorithms and "read" that way. A simple test shows with little effort whether Google can read the text in a PDF: if the text can be copied out of the PDF with copy and paste, Google will also be able to read and understand it.

Images: Images from PDFs are not suitable for the classic Google image search. If you want to be found with the images from the PDF, it is advisable to create a classic HTML page.

Links: Like HTML documents, PDFs can also contain links. As in HTML, links in PDFs can pass on link power. This was recently confirmed in a statement by Gary Illyes:

Figure 1: Links in PDFs pass on link power

Caution: When dealing with PDFs, always keep in mind that PDF views are not recorded by tracking solutions such as Google Analytics, because a PDF cannot execute the JavaScript tracking snippet. A PDF may therefore attract a large number of visitors without this traffic ever being noticed or put to use.

To identify potential and weak points, a log file analysis is recommended, since it also covers requests for non-HTML files. Incidentally, log file analyses are also well suited to examining crawler activity via the user agent.
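
A log file analysis of this kind can be sketched in a few lines of Python. The sketch below assumes the server writes the common "combined" access log format. The function name count_pdf_requests and the sample log lines are made up for illustration, and matching on the string "googlebot" is only a first filter, because user agents can be spoofed.

```python
import re
from collections import Counter

# Assumed format (combined log):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_pdf_requests(lines):
    """Return two Counters keyed by PDF path: all hits and Googlebot hits."""
    all_hits, bot_hits = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        path = match.group("path").split("?", 1)[0]  # drop query string
        if not path.lower().endswith(".pdf"):
            continue  # only PDF requests are of interest here
        all_hits[path] += 1
        # The user agent can be spoofed; treat this as a rough filter only.
        if "googlebot" in match.group("agent").lower():
            bot_hits[path] += 1
    return all_hits, bot_hits


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2016:13:55:36 +0200] "GET /docs/plan.pdf HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '66.249.66.1 - - [10/Oct/2016:13:56:01 +0200] "GET /docs/plan.pdf HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Oct/2016:13:57:12 +0200] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

all_hits, bot_hits = count_pdf_requests(SAMPLE_LOG)
print(all_hits.most_common())
print(bot_hits.most_common())
```

Comparing the two counters shows which PDFs are requested by visitors and which are (apparently) fetched by Google's crawler.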

Use PDFs correctly

From a search engine point of view, PDFs are a double-edged sword. On the one hand, PDFs, like other document types, can be listed in the search results. On the other hand, they offer the user no navigation or other ways of interacting with the site.

That is why it is important to think about the role PDFs play in your own SEO strategy. The most important question should be: "Can a PDF meet the expectations of a search engine visitor?"

Option 1:
Exclude PDFs that do not serve as landing pages from the index

If you assume that an indexable PDF cannot meet a user's information needs, you should make sure that the PDF is excluded from the search engine index.

The easiest way to keep PDFs out of the index is to use the HTTP response header, since a PDF has no HTML head for meta tags. A noindex can be sent in the X-Robots-Tag header, and a canonical can be sent in a Link header with rel="canonical". While the noindex only tells the search engine that the content does not belong in the index, the canonical can be used to point to an HTML version of the PDF.
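
As a minimal sketch, the two header variants could be generated like this. The function name pdf_response_headers and the example URL are hypothetical. How the headers are actually attached to the response depends on your web server or framework, which is not covered here.

```python
def pdf_response_headers(path, canonical_url=None):
    """Build extra HTTP response headers for the file at `path`.

    Hypothetical helper: returns a canonical Link header if an HTML
    equivalent is known, otherwise a noindex directive. Non-PDF paths
    get no extra headers.
    """
    headers = {}
    if not path.lower().endswith(".pdf"):
        return headers
    if canonical_url:
        # Point search engines to the HTML version of the document.
        headers["Link"] = f'<{canonical_url}>; rel="canonical"'
    else:
        # No HTML equivalent: just keep the PDF out of the index.
        headers["X-Robots-Tag"] = "noindex"
    return headers


print(pdf_response_headers("/files/brochure.pdf"))
print(pdf_response_headers("/files/brochure.pdf",
                           "https://www.example.com/brochure.html"))
```

The sketch deliberately sends only one of the two signals per PDF, so that the noindex and the canonical do not contradict each other.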

Use case: Which solution is right for me?

If you use a noindex in the HTTP header for these PDFs, you waste link power, and only the URLs linked from the PDF benefit from it. The canonical tag is particularly useful for PDFs that have attracted many backlinks in the past. It forwards the entire link power to the landing page it refers to. The PDF disappears from the index and the matching landing page appears in its place in the search results.

Figure 2: Example of a landing page instead of a PDF

Don’t:

  • Block PDFs via robots.txt - the PDFs can still end up in the index and the incoming link power is wasted.
  • Offer a PDF version of every page - some CMSs offer a PDF version of all HTML pages by default. Using a canonical tag would solve the indexing problem, but search engines would still have to crawl the PDFs over and over again, wasting valuable crawler resources.
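
The first point is easier to see with a small sketch. A robots.txt rule only stops the crawler from fetching the file, so a noindex or canonical header on the PDF is never seen. The example uses Python's standard-library robots.txt parser as a stand-in for a crawler. The robots.txt content and the URLs are made up, and Python's parser only does simple prefix matching, which is why a directory rule is used here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the directory holding the PDFs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

pdf_url = "https://www.example.com/downloads/brochure.pdf"
html_url = "https://www.example.com/brochure.html"

# A compliant crawler may not request the PDF at all, so it never
# receives the response headers (noindex, canonical) for that URL.
pdf_crawlable = parser.can_fetch("Googlebot", pdf_url)
html_crawlable = parser.can_fetch("Googlebot", html_url)

print("PDF crawlable:", pdf_crawlable)
print("HTML crawlable:", html_crawlable)
```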

Identify indexable PDFs

A few clicks are enough to identify indexable PDFs with OnPage.org Zoom. In the report "Indexability" → "What can be indexed?", activate the filter "Indexable" (1) and click on the MIME type (2) "PDF".

Figure 3: Show only indexable PDFs

Once the filters are activated, all PDFs found during the crawl are listed in the table below.

Figure 4: List of all indexable PDFs

A list of all PDF documents that are already in the Google index can be obtained by combining the search operators filetype:pdf and site:domain.tld (written without a space after the colon):

Figure 5: List of all PDFs that are already in the Google index

Option 2:
Ensure indexability of index-relevant PDFs

In some cases, making PDFs available to the Google index can certainly offer added value for users. This is particularly true for PDFs that satisfy a specific information need without the user having to interact with the website itself.

A good example is public transport network maps, such as the map of the Munich S-Bahn and U-Bahn network. The user's goal is to get information quickly, download the PDF and save it on a mobile device without interacting with the website.

Figure 6: Example of a PDF in the search engine index that is suitable as a landing page

Figure 7: Network map of inner Munich as PDF

For a PDF to appear in the search engine index, the most important requirement is that the document is indexable.

Indexability criteria

  • HTTP status code is 200 OK
  • Robots directive must not be noindex (for PDFs it is sent in the X-Robots-Tag HTTP header)
  • Canonical, if present, must not point to a different URL

If at least one of these criteria is not met, the document cannot be indexed.
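
The three criteria can be expressed as a small check. This is a simplified sketch, not how any particular crawler or tool implements it. The function name is_indexable and the URLs are hypothetical, and the parsing only handles a plain "noindex" value and a single rel="canonical" entry in the Link header.

```python
import re


def is_indexable(url, status, headers):
    """Apply the three indexability criteria to one HTTP response.

    Returns a tuple (indexable, reason). `headers` is a dict of
    response headers; header names are compared case-insensitively.
    """
    normalized = {name.lower(): value for name, value in headers.items()}

    # Criterion 1: the status code must be 200.
    if status != 200:
        return False, f"status code is {status}"

    # Criterion 2: the robots directive must not be noindex.
    if "noindex" in normalized.get("x-robots-tag", "").lower():
        return False, "X-Robots-Tag contains noindex"

    # Criterion 3: a canonical, if present, must point to the URL itself.
    match = re.search(r'<([^>]+)>\s*;\s*rel="?canonical"?',
                      normalized.get("link", ""), re.IGNORECASE)
    if match and match.group(1) != url:
        return False, f"canonical points to {match.group(1)}"

    return True, "all criteria met"


pdf = "https://www.example.com/plan.pdf"
print(is_indexable(pdf, 200, {}))
print(is_indexable(pdf, 200, {"X-Robots-Tag": "noindex"}))
print(is_indexable(pdf, 200, {
    "Link": '<https://www.example.com/plan.html>; rel="canonical"'}))
```

The response status and headers would come from an HTTP client or a crawl export. Fetching them is left out to keep the sketch self-contained.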

Non-indexable PDFs can be identified in just a few steps with OnPage.org Zoom. Simply select the document type "PDF" in the report "Indexability" → "What can be indexed". You can then use the graph to display a list of the non-indexable PDFs and the reasons why they cannot be indexed (for example, all PDFs that are marked with the Meta Robots tag "noindex").

Tip: Indexable PDFs should always contain a link to the corresponding landing page on the website. This gives the user the opportunity to quickly navigate to the website.

Conclusion

Just like HTML pages, PDFs can be listed in the search results. But not all PDF documents are suitable as landing pages. You should therefore carefully consider which role PDFs play in your own SEO strategy and how they bring the maximum benefit. For PDFs that are not suitable as landing pages and have a lot of incoming link power, it is advisable to send a canonical in the HTTP header that points to the corresponding landing page. For index-relevant PDFs, you should ensure that they meet all indexability criteria.