About the Team:
The mission of our Digital Operations team is to operate a fault-resilient, customer-centered, proactive DevOps team. The team is responsible for supporting systems that deliver AT&T’s customer experience across multiple internet-facing eCommerce applications, databases, platforms and technology stacks. Our customer-journey-centric Ops team is made up of Ops Engineers and Site Reliability Engineers (SREs), all focused on ensuring a highly available, resilient, performant and secure customer experience.
Job Summary:
Our Digital Operations team is looking for a Site Reliability Engineer (SRE) who is passionate about the customer experience and has the analytical and multitasking abilities to thrive in a fast-paced environment. The SRE is responsible for ensuring that, as new features and applications are introduced to production, essential aspects of reliability such as availability, resiliency, latency, efficiency, change management, monitoring, emergency response, and capacity planning are addressed alongside development of the new features and applications. The SRE will develop automation code and scripts to proactively address customer issues, reduce mean time to repair and improve application availability. The position also includes collaborating closely with feature delivery teams as a bridge between development and operations by applying a software engineering mindset to system administration. This position will split time between operations/on-call duties and guiding the development of systems and software that help increase site reliability and performance to deliver business value. The SRE will need intimate knowledge of the current state of datacenter and cloud infrastructure, CI/CD pipeline tools, Kubernetes, and Site Reliability Engineering practices, and the ability to implement the plan for the desired future state. Attention to detail and strong analytical skills are required, along with a “Customer-First” attitude!
Responsibilities and Day-to-Day View:
Build software to help operations and support teams – Proactively build and implement services to make operations more effective and reduce toil. This ranges from adjusting monitoring and alerting to automating scripts and code in production. The candidate may be tasked with building a homegrown tool from scratch to help with software delivery issues or to resolve the impact of outages and incidents.
Fix support escalation issues: Optimize on-call rotations and processes – Improve system reliability through the optimization of on-call processes. Add automation and context to alerts – leading to better real-time collaborative response from on-call responders. Additionally, update runbooks, tools and documentation to help prepare on-call teams for future incidents.
Document “tribal” knowledge – Gain exposure to systems in both staging and production, and take part in work with software development, support, IT operations and on-call duties to build up historical knowledge over time. Instead of siloing this knowledge, keep documentation and runbooks up to date so that teams get the information they need right when they need it.
Conduct post-incident reviews – Run thorough and transparent post-incident reviews to keep teams honest and ensure that everyone documents their findings and takes action on their learnings. Take on action items for building or optimizing parts of the SDLC or incident lifecycle to bolster the reliability of the service.
Develop automation for mission-critical applications using scripts and programs
Provide customer impact analysis and troubleshoot complex issues using domain knowledge of AT&T Sales & Ordering flows, applications, and downstream interfaces
Support APIs in K8s environment
Contribute to design and implementation of new system layers utilizing principles of high-complexity compute environments.
Provide on-call support for customer-facing Production issues
Work with developers and environment teams to identify necessary resources and remove constraints to increase application availability.
Roles and Responsibilities:
24 x 7 Production support and second-level troubleshooting of incidents for mission critical high-performance applications
24 x 7 second level outage response for mission critical high-performance applications
24 x 7 Application performance monitoring, troubleshooting and corrective actions for mission critical high-performance applications
Shift timing (if any):
Primary / Mandatory skills:
Overall Experience: 7+ years of experience performing Production Support for Mission Critical, high performance applications
4+ years of experience using Docker, Kubernetes and Cloud environments
Strong Unix, Networking and troubleshooting knowledge, plus experience with Docker, Kubernetes and Cloud environments
Experience in Java, Python, Shell Scripts
Experience in building and leveraging automated CI and CD pipelines using technologies such as Azure DevOps Server, Jenkins, Maven, Ansible, Chef, SonarQube, Puppet, etc.
Experience in Relational & NoSQL databases like Oracle & Cassandra. Excellent knowledge of SQL
Excellent written and verbal English communication skills to work in a Global team
Secondary / Desired skills:
Agile, Lean Agile and/or Scaled Agile methodologies
Knowledge of Java, ReactJS, Spring & Spring Boot framework, microservices & RESTful API architecture
Familiarity with version control systems (Git, Bitbucket) and modern version control for use in continuous deployments
Experience with visualization tools like Kibana and Grafana (EFK stack experience preferred)
Additional information (if any): Willing to work shift duties. Willingness to learn is very important, as AT&T offers an excellent environment to learn Digital Transformation skills such as cloud, Big Data, AI, Full stack etc.
Education Qualification: Bachelor’s/ Master’s degree in computer science or related field
Certifications (if any specific): Any Certification related to Primary / Mandatory Skills
• Kubernetes Certified Engineer or equivalent certification
• Azure / AWS certification
Experience:
7+ years of experience performing Production Support for Mission Critical, high performance applications (Telecom and eCommerce experience preferred)
4+ years of experience using Docker, Kubernetes, and Cloud environments
Solid understanding of and experience with Application Performance Monitoring tools like Dynatrace, AppDynamics, Introscope, etc.
4+ years of strong Unix, Networking and troubleshooting knowledge
4+ years of experience in Customer Experience Analytics tools like Quantum Metric or TeaLeaf
4+ years of experience in Relational & NoSQL databases like Oracle & Cassandra. Excellent knowledge of SQL.
4+ years of experience with J2EE applications and an application server like WebLogic, WebSphere or JBoss
2+ years of experience in Java, Python, Shell scripting
Experience with visualization tools like Kibana and Grafana (EFK stack experience preferred)
Experience mentoring & training others
Experience with Site Reliability Engineering preferred
Experience working in a large-scale, technically diverse organization
Experience with web-based applications, http, https, SSL/TLS
Should have strong understanding of security principles
AT&T is leading the way to the future – for customers, businesses, and the industry. We’re developing new technologies to make it easier for our customers to stay connected to their world. Together, we’ve built a premier integrated communications and entertainment company and an amazing place to work and grow. Team up with industry innovators every time you walk into work, creating the world you always imagined. Ready to #transformdigital with us? Apply now!
AT&T will consider for employment qualified applicants in a manner consistent with the requirements of federal, state and local laws.
We expect employees to be honest, trustworthy, and operate with integrity. Discrimination and all unlawful harassment (including sexual harassment) in employment are not tolerated. We encourage success based on our individual merits and abilities without regard to race, color, religion, national origin, gender, sexual orientation, gender identity, age, disability, marital status, citizenship status, military status, protected veteran status or employment status.