Position: Operations Manager
Where: Seattle, WA
Amazon Web Services is a dynamic and rapidly growing business within Amazon.com. We are building some of the largest and most complex distributed systems in the world, and we need world-class people to help us implement and operate them.
We provide organizations with building block web services that allow them to innovate faster and operate their software more cost-effectively. These services-in-the-cloud include on-demand compute capacity, storage, content delivery, querying of structured data, message queuing, and more.
The AWS Identity & Access team is building and delivering the next generation of cloud computing security that supports the public AWS offerings like S3, EC2, and CloudFront. We are innovating new ways of building massively scalable distributed security systems involving identity management, federation, web services security, single sign-on, and much more. We enable our customers to control some of the most sensitive secrets on the Internet.
We have high standards for our computer systems as well as our employees: our systems are highly secure, highly reliable, highly available, and must function at massive scale; our employees are super smart, driven to serve customers, and fun to work with. The successful Manager does more than manage infrastructure, tools, and people. They will be the focal point for owning the end-to-end customer experience and will channel requirements to development teams to improve service performance, scalability, and reliability. They will also be in charge of forecasting and managing the end-to-end capacity and operating costs of large-scale services inside AWS Directory Service.
You should have or be most of the following:
- Experience running and maintaining a 24×7 Internet-oriented production environment, preferably spanning multiple data centers and hundreds of machines
- Demonstrable expertise in specifying, designing, and/or implementing system health, performance monitoring, and software management tools for 24×7 environments
- Hire talent whose experience is well ahead of the current size of the business, so that the right people are thinking at the right scale given the service's high growth rate.
- Ensure fleet and capacity management processes are manageable and scalable given growth rates, driving towards fully “lights out” management with appropriate dead man switches in place.
- A solid grasp of networking fundamentals, preferably including hands-on experience with load balancers, switches, routers, etc.
- Audit existing system metrics and alarms and address any gaps. In the event of an outage or degradation in service, ensure appropriate changes are put in place to prevent recurrence.
- Establish and maintain on-call procedures and knowledge base.
- Create a strategic roadmap including an evaluation of options for remote sites to provide better 24/7 support.
- Enhance analysis tools and reporting for key system metrics such as availability, latency, utilization and durability.
You will be expected to deliver on these kinds of things in the first six to twelve months on the job:
- Define and/or refine hardware requirements and selected designs, balancing raw up-front dollar cost with operability and TCO, from the data center infrastructure up
- Specify and participate in the development and delivery of operability-related features such as system health monitoring, diagnostics, repair, and other self-healing automation
- Develop or further existing application and system management tools and processes that reduce manual efforts and increase overall efficiency
- Adapt and improve operations management systems and processes to accommodate rapid and increasing growth in systems and traffic
- Maintain fleet inventory management, including producing, maintaining, and evolving capacity plans for various components
- Monitor the health of the fleet, automating system health, maintenance tasks, and reporting systems as needed
- Perform various system maintenance tasks (your hands get dirty here), including configuration of new machines
- Manage directly assigned tasks and on-call duties gracefully
Qualifications:
- BS Computer Science or other technical degree and related experience
- 5+ years of experience running and maintaining a 24×7 production environment
- 10+ years of technical experience, with at least 5 years of management experience
- 4+ years of experience with support procedures and methodologies for production computing environments
- 4+ years of experience with service-oriented architecture and web services
- Experience with very large, high-throughput distributed systems
- Experience with some aspect(s) of computer security: network security, application security, security protocols, cryptography, etc.
- Excellent troubleshooting skills
- Excellent documentation skills
- Familiar with the challenges surrounding efficient operations and failure mode analysis in large complex distributed systems
- Able to show good judgment and instincts in decision making
- Able to prioritize and perform in complex, fast-paced situations
- Experience with agile software development practices