On Sunday we upgraded our infrastructure with a load balancer (Amazon Classic Elastic Load Balancer) to better handle service upgrades without interruptions, and to make it easier to manage HTTPS/SSL connections.
After the system upgrade, we noticed one unexpected issue: the homepage at https://www.datacite.org wasn't resolving properly and led to an error page. There was also an issue with the board page. All other homepage pages were working as expected.
When we still had not resolved this issue by Monday evening CET, we decided to update our internal DNS (Domain Name System), which is partly responsible for routing traffic from the load balancer to the homepage pages stored in Amazon S3. Unfortunately, this led to multiple service interruptions, including the MDS, from the early hours of Tuesday morning CET until Tuesday afternoon. The problem was exacerbated by the delayed nature of DNS updates, even though we had lowered the time to live (TTL) for DNS records from the default 24 hours to one hour.
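To illustrate why DNS changes take effect with a delay, here is a minimal sketch of how a caching resolver keeps returning an old record until its TTL expires. The function names and timestamps are hypothetical and chosen only for illustration; they are not taken from our actual configuration or logs.

```python
from datetime import datetime, timedelta


def cache_expiry(cached_at, ttl_seconds):
    """Time at which a resolver discards a cached record and asks again."""
    return cached_at + timedelta(seconds=ttl_seconds)


def serves_stale_record(cached_at, ttl_seconds, changed_at, now):
    """True if a resolver that cached the record at `cached_at` (before the
    change at `changed_at`) still answers with the old value at `now`."""
    return cached_at <= changed_at <= now < cache_expiry(cached_at, ttl_seconds)


# Hypothetical timeline: record cached at 00:00, DNS changed at 00:10.
cached = datetime(2000, 1, 1, 0, 0)
changed = cached + timedelta(minutes=10)

# With a one-hour TTL the old answer can be served until 01:00.
print(serves_stale_record(cached, 3600, changed, cached + timedelta(minutes=30)))
print(serves_stale_record(cached, 3600, changed, cached + timedelta(minutes=61)))
```

In this simplified model, a resolver that cached the record just before a change keeps serving the old value for up to one full TTL, so a one-hour TTL still means up to an hour of inconsistent answers after every change.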
Further investigation on Tuesday afternoon resolved the homepage issue: it was caused by mixed content (HTTPS/HTTP) on some pages, including the homepage landing page, where an image was loaded over an insecure connection. The Chrome browser in particular has become more stringent about mixed content, leading to the error we first saw on Sunday. As part of this work all DataCite pages are now HTTPS only, with the sole exception of the Schema, where requiring HTTPS could break XML validation.
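A check for this kind of mixed content can be sketched in a few lines of Python. This is a minimal illustration, not the tool we used: the list of tags is an assumption and not exhaustive, the example URLs are placeholders, and the function name `find_mixed_content` is hypothetical.

```python
from html.parser import HTMLParser

# Tags whose attribute makes the browser fetch a subresource.
# Assumption for this sketch: the list is not exhaustive, and <link href>
# may also flag non-fetched links such as rel="canonical".
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "audio": "src",
    "video": "src",
    "source": "src",
    "link": "href",
}


class MixedContentFinder(HTMLParser):
    """Collects (tag, url) pairs for subresources referenced over plain HTTP."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_mixed_content(html):
    """Return insecure subresource references found in an HTML string."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure


sample = (
    '<img src="http://example.org/logo.png">'
    '<img src="https://example.org/ok.png">'
    '<a href="http://example.org/">plain links are not subresources</a>'
)
print(find_mixed_content(sample))  # [('img', 'http://example.org/logo.png')]
```

Ordinary `<a>` links are deliberately ignored, because following a link to an HTTP page does not make the current page mixed content; only resources the page itself loads do.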
We deeply apologize for the inconvenience that this service outage has caused you, and we have started work to prevent similar situations from happening again:

* We have started to clean up our internal DNS, which was too complex (using a mix of public DNS and two private DNS zones) and configured in too many places.
* As part of this work, we have started to update the reverse proxy that connects the load balancer to the respective DataCite service, using an API and web UI for configuration instead of many configuration files. This process should be completed in the coming weeks. The MDS and blog are already using this updated service.
* We have started to better separate the various components of the search service (frontend, Solr index, sitemaps index) that all run behind the search.datacite.org domain name.
Please let us know at firstname.lastname@example.org if you experience further issues related to this outage.