Protection from robots

Search queries can be submitted by robots as well as by users. A flood of queries from robots can cause you to exceed the usage limits of Yandex.XML.

Yandex.XML uses a security algorithm to prevent robots from gaining unauthorized access to the search. If a query appears to have been submitted by a robot, a CAPTCHA is returned instead of search results (see the Wikipedia article about CAPTCHA).

To use the algorithm for protection from robots, the partner must pass the IP address and the "spravka" cookie of the request's author. The "spravka" cookie is generated on the Yandex.XML side and is returned the first time the user accesses search results. In the value that is received, the partner must replace the domain with their own, and then add the following header to the search response:

Set-Cookie: spravka=...
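The domain replacement described above might look like the following minimal sketch. It assumes that the cookie arrives as a standard Set-Cookie value with a "Domain=" attribute; the exact format of the value issued by Yandex.XML is not shown in this document, and the function name rewrite_cookie_domain is hypothetical.

```python
import re


def rewrite_cookie_domain(set_cookie_value: str, partner_domain: str) -> str:
    """Return the Set-Cookie value with its Domain attribute replaced.

    Assumption: the domain is carried in a standard "; Domain=<host>"
    attribute. If no such attribute is present, the value is returned as is.
    """
    return re.sub(
        r"(;\s*domain=)[^;]*",
        # Keep the original "; domain=" prefix and swap only the host part.
        lambda match: match.group(1) + partner_domain,
        set_cookie_value,
        flags=re.IGNORECASE,
    )
```

For example, rewrite_cookie_domain("spravka=abc; domain=.yandex.ru; path=/", ".example.com") keeps the cookie value and path and changes only the domain.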

The IP address and the "spravka" cookie are passed in the request headers in the following format:

X-Real-Ip: 99.999.999.99
Cookie: spravka=<value passed from Yandex>
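As an illustration, a partner application could assemble these two headers as in the sketch below. The helper name build_antirobot_headers is hypothetical, and the sketch assumes that the Cookie header is simply omitted when no "spravka" cookie has been issued yet.

```python
from typing import Dict, Optional


def build_antirobot_headers(user_ip: str, spravka: Optional[str] = None) -> Dict[str, str]:
    """Build the headers that carry the end user's IP address and cookie."""
    # X-Real-Ip must hold the IP address of the user, not of the partner server.
    headers = {"X-Real-Ip": user_ip}
    # Assumption: the cookie is sent only if Yandex.XML has already issued one.
    if spravka:
        headers["Cookie"] = f"spravka={spravka}"
    return headers
```

The resulting dictionary can be passed as the headers of whatever HTTP client the partner uses for requests to Yandex.XML.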

The diagram below illustrates the steps performed for protection from robots.

  1. The user sends a query to the Yandex.XML partner.

  2. The search query is sent to the Yandex.XML service. The request must match the specified format.

  3. Yandex.XML initiates the algorithm for protection from robots. The values of the IP address and "spravka" cookie (if previously issued) are used for verification.

    Possible results of verification:

    • The request was probably not sent by a robot. The process continues to step 13.

    • The request was probably sent by a robot. The decision is made to display a CAPTCHA.

  4. Yandex.XML returns an XML file to the partner in the following format:

    <?xml version="1.0" encoding="utf-8"?>
    <yandexsearch version="1.0">
    <response>
    <error code="100">Robot request</error>
    </response>
    <captcha-img-url>http://captcha.image.gif</captcha-img-url>
    <captcha-key>CAPTCHA ID</captcha-key>
    <captcha-status>Status</captcha-status>
    </yandexsearch>
    
  5. The user is returned a page containing a CAPTCHA.

  6. The user sends the CAPTCHA value to the partner.

  7. The partner sends the CAPTCHA value obtained from the user to Yandex.XML via a GET request in the following format:

    https://yandex.ru/xcheckcaptcha?key=<CAPTCHA number>&rep=<CAPTCHA value entered by user>
    
  8. The value received is checked by the Yandex.XML service. If the CAPTCHA value was entered incorrectly, the process returns to step 4, and the captcha-status parameter is passed with the value “failed”.

  9. If the CAPTCHA value was entered correctly, Yandex.XML issues the user a "spravka" cookie and passes it to the partner in the header with the following format:

    HTTP/1.1 200 OK
    Set-Cookie: spravka=<cookie value>
    

    If the request passed to Yandex.XML in step 1 was saved successfully, the process continues to step 12.

  10. The partner lets the user enter a query.

  11. The user sends a query to the Yandex.XML partner.

  12. The search query is sent to the Yandex.XML service. Along with the request, the user's IP address and "spravka" cookie are passed.

  13. Yandex.XML processes the search query and generates results.

  14. An XML file with search results is returned to the partner.

  15. The partner returns the processed response to the user. If Yandex.XML issued a "spravka" cookie in step 9, it is saved on the user's computer.
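Steps 4 and 7 of the flow above can be sketched on the partner side as follows. This is a minimal sketch rather than a reference implementation: the function names parse_captcha and build_check_url are hypothetical, and it assumes the response has exactly the structure shown in step 4.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Union
from urllib.parse import urlencode

# Endpoint taken from step 7 of the flow.
CHECK_URL = "https://yandex.ru/xcheckcaptcha"


def parse_captcha(xml_text: Union[str, bytes]) -> Optional[Dict[str, Optional[str]]]:
    """Return the CAPTCHA fields if the response is a robot error, else None."""
    root = ET.fromstring(xml_text)
    error = root.find("response/error")
    # Error code 100 ("Robot request") is the code shown in step 4.
    if error is None or error.get("code") != "100":
        return None
    return {
        "img_url": root.findtext("captcha-img-url"),
        "key": root.findtext("captcha-key"),
        "status": root.findtext("captcha-status"),
    }


def build_check_url(key: str, answer: str) -> str:
    """Build the GET URL that submits the user's CAPTCHA answer (step 7)."""
    return CHECK_URL + "?" + urlencode({"key": key, "rep": answer})
```

The partner would show the image from img_url to the user, then request the URL returned by build_check_url with the key from the response and the text the user typed. A status of “failed” indicates that the previous answer was rejected.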

Tip

To try out how this flow works, use this script.

Verifying correct CAPTCHA display

To see the response format that Yandex.XML returns when a CAPTCHA is displayed, send a search request with the query parameter set to the following string: “e48a2b93de1740f48f6de0d45dc4192a”.

The user “xml-search-user” can send the following GET request to review the response format returned when a CAPTCHA is displayed:

wget -q --header="X-Real-Ip: 127.0.0.1" -SO- 'https://yandex.ru/search/xml?user=xml-search-user&key=03.44583456:c876e1b098gh65khg834ggg1jk4ll9j8&query=e48a2b93de1740f48f6de0d45dc4192a&showmecaptcha=yes'
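The same test request can be assembled in Python, as in the sketch below. It only builds the URL and headers and does not send anything. The function name build_captcha_test_request is hypothetical; the parameter names and the test query string are taken from the wget command above.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode

SEARCH_URL = "https://yandex.ru/search/xml"
# Special query string that triggers a CAPTCHA response for testing.
CAPTCHA_TEST_QUERY = "e48a2b93de1740f48f6de0d45dc4192a"


def build_captcha_test_request(
    user: str, key: str, user_ip: str = "127.0.0.1"
) -> Tuple[str, Dict[str, str]]:
    """Return the (url, headers) pair for the CAPTCHA test request."""
    params = {
        "user": user,
        "key": key,
        "query": CAPTCHA_TEST_QUERY,
        "showmecaptcha": "yes",
    }
    # urlencode percent-encodes reserved characters such as ":" in the key.
    url = SEARCH_URL + "?" + urlencode(params)
    return url, {"X-Real-Ip": user_ip}
```

The returned pair can be handed to any HTTP client, for example urllib.request.Request(url, headers=headers) from the standard library.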