We are experiencing temporary DNS problems with MDS and API
Incident Report for DataCite
Postmortem

On Sunday we upgraded our infrastructure with a load balancer (Amazon Classic Elastic Load Balancer) to better handle service upgrades without interruptions, and to make it easier to manage HTTPS/SSL connections.

After the system upgrade, we noticed one unexpected issue: the homepage at https://www.datacite.org wasn't resolving properly and led to an error page. There was also an issue with the board page. All other pages on the website were working as expected.

When we still had not resolved this issue by Monday evening CET, we decided to update our internal DNS (Domain Name System), which is partly responsible for routing traffic from the load balancer to the homepage pages stored in Amazon S3. Unfortunately this led to multiple service interruptions, including the MDS, from the early hours of Tuesday morning CET until Tuesday afternoon. The problem was exacerbated by the delayed nature of DNS updates, even though we had reduced the time-to-live (TTL) for DNS records from the default 24 hours to one hour.
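To illustrate why DNS changes take effect with a delay, here is a minimal sketch of the worst case: a resolver that cached the old record just before the change may keep serving it until the TTL expires. The function name and the change timestamp are hypothetical and chosen only for illustration; they are not taken from our actual configuration.

```python
from datetime import datetime, timedelta, timezone


def latest_stale_time(changed_at: datetime, ttl_seconds: int) -> datetime:
    """Latest moment at which a resolver that cached the old record just
    before the change could still serve it, assuming it honours the TTL."""
    return changed_at + timedelta(seconds=ttl_seconds)


# Hypothetical change time, for illustration only.
changed_at = datetime(2017, 3, 7, 0, 0, tzinfo=timezone.utc)

# Compare a one-hour TTL with a 24-hour TTL.
with_one_hour_ttl = latest_stale_time(changed_at, 60 * 60)
with_one_day_ttl = latest_stale_time(changed_at, 24 * 60 * 60)

print("1 hour TTL: old record may be served until", with_one_hour_ttl.isoformat())
print("24 hour TTL: old record may be served until", with_one_day_ttl.isoformat())
```

The sketch assumes well-behaved caches. Records cached while a longer TTL was still in force keep that longer lifetime, which is one reason a recently lowered TTL does not help immediately.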

Further investigation on Tuesday afternoon resolved the homepage issue: it was caused by mixed content (HTTPS/HTTP) on some pages, including the homepage landing page, where an image was loaded over an insecure connection. The Chrome browser in particular has become more stringent about mixed content, leading to the error we first saw on Sunday. As part of this work all DataCite pages are now HTTPS only, with the sole exception of the Schema, where requiring HTTPS could break XML validation.
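A check for this kind of mixed content can be sketched with the Python standard library. This is a minimal, non-exhaustive example and not the tool we used: the class and function names are hypothetical, the sample page uses placeholder example.org URLs, and only the `src` attribute of `img`, `script` and `iframe` tags is inspected.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collects subresources that a page would load over plain HTTP."""

    # Tags whose `src` attribute makes the browser fetch a subresource.
    # Deliberately incomplete; stylesheets, media and fonts are ignored.
    RESOURCE_TAGS = {"img", "script", "iframe"}

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.RESOURCE_TAGS:
            return
        for name, value in attrs:
            if name == "src" and value and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_mixed_content(html: str) -> list:
    """Return (tag, url) pairs for insecure subresources found in `html`."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure


# Placeholder page served over HTTPS that embeds one insecure image.
sample = """
<html><body>
  <a href="http://example.org/about">About</a>
  <img src="http://example.org/logo.png" alt="logo">
  <script src="https://example.org/app.js"></script>
</body></html>
"""

for tag, url in find_mixed_content(sample):
    print(f"insecure <{tag}>: {url}")
```

Plain links (`<a href>`) are skipped on purpose, since navigating to an HTTP page does not count as mixed content; only resources embedded in the HTTPS page do.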

We deeply apologize for the inconvenience that this service outage has caused you, and we have started work to prevent similar situations from happening again:

* We have started to clean up our internal DNS, which was too complex (using a mix of public DNS and two private DNS zones), and configured in too many places.
* As part of this work we have started to update our reverse proxy that is used to connect the load balancer to the respective DataCite service - using an API and Web UI for configuration instead of many configuration files. This process should be completed in the coming weeks. The MDS and blog are already using this updated service.
* We have started to better separate out the various components of the search service (Frontend, Solr index, Sitemaps index) that are all running behind the search.datacite.org domain name.

Please let us know at support@datacite.org if you experience further issues related to this outage.

Posted Mar 08, 2017 - 11:24 UTC

Resolved
We have resolved the DNS issues that were affecting several services. We will continue monitoring our services for any issues.
Posted Mar 08, 2017 - 10:50 UTC
Update
The DNS issues with the MDS are resolved. We are still experiencing issues with the REST API for works, which also affects Search. The homepage at https://www.datacite.org is still not displaying correctly; the rest of the site works as expected. We will give a detailed report once the remaining issues have been resolved, hopefully later today.
Posted Mar 07, 2017 - 14:15 UTC
Identified
We are still experiencing issues with our internal DNS, affecting multiple services. We are working on resolving this.
Posted Mar 07, 2017 - 10:33 UTC
Monitoring
The DNS problems with MDS and API should now be resolved. Please wait another 12 hours if that is not the case. There are still some minor issues with reaching a few pages on the website, in particular https://www.datacite.org.
Posted Mar 05, 2017 - 11:15 UTC
Identified
After an upgrade of two of our load balancers we are experiencing temporary problems with domain name resolution (DNS). This should resolve within the next 24 hours and currently affects MDS and API.
Posted Mar 05, 2017 - 00:31 UTC