I am a product developer working on insurance products. We mostly use Splunk Cloud Platform to check logs. Whenever a check goes missing or a status is incorrect, the logs are the first place we look. Splunk Cloud Platform helps us identify where the error is. We can search on various fields, and writing a precise search query is important because it saves us a lot of time.
Recently, one of our branch networks, where all the checks are stored, had an issue. IP whitelisting had been applied, and some of the IP addresses were left out of the whitelist. This caused a global outage, and all the claims or checks that were being processed failed. When we checked the logs, we found that this was the cause. We had to reach out to the team that manages the middleware, the environment where the IP whitelisting had been applied. When we contacted them, they reverted most of the changes and we generated new payloads. We did not know in advance which error was affecting us, but searching through Splunk Cloud Platform revealed that the IP whitelisting had been done.
Generally, in our daily scrum calls, we go through our incidents in ServiceNow, and if we find anything stuck or any mismatch, the first thing we do is check the logs directly on the call. This gives the team a proper understanding of what is happening. For a newcomer, it is not beginner-friendly at first because it is difficult to understand. Over time, however, it proves to be the best tool we will ever use.
I believe Splunk Cloud Platform's ability to show everything right from the payload is one of its best features. When a payload is generated, each log entry indicates what the user has done, including specific actions. If the user has missed some logic or an exception has occurred, we can see it. For example, we are currently seeing an illegal state change exception, which is thrown when the user does not follow the check lifecycle. Our check lifecycle goes from awaiting submission to requesting, requested, issued, and then cleared. If the user does not follow this lifecycle, for example by trying to move a check from awaiting submission directly to issued instead of going through requesting and requested first, the exception is thrown and we see it in the logs. Splunk Cloud Platform helps us check the logs and identify any errors the user might have made, or any bad job or job failure that has occurred. For any troubleshooting, the first thing we do is go through the logs. That is the feature that stands out for me.
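The lifecycle rule described above can be sketched as a small state machine. This is only an illustration, not our actual implementation: the names `CheckStatus`, `IllegalStateChangeError`, and `transition` are hypothetical, and the sketch assumes a check may only move to the very next status in the lifecycle.

```python
from enum import Enum


class CheckStatus(Enum):
    # Statuses in lifecycle order, as listed in the review.
    AWAITING_SUBMISSION = "awaiting submission"
    REQUESTING = "requesting"
    REQUESTED = "requested"
    ISSUED = "issued"
    CLEARED = "cleared"


# Enum iteration preserves definition order, so this is the lifecycle sequence.
LIFECYCLE = list(CheckStatus)


class IllegalStateChangeError(Exception):
    """Hypothetical stand-in for the illegal state change exception."""


def transition(current, target):
    """Return target if it is the next lifecycle step after current, else raise.

    Assumption: only a move to the immediately following status is legal.
    """
    index = LIFECYCLE.index(current)
    is_last = index == len(LIFECYCLE) - 1
    if is_last or LIFECYCLE[index + 1] is not target:
        raise IllegalStateChangeError(
            f"cannot move check from '{current.value}' to '{target.value}'"
        )
    return target


# Allowed: awaiting submission -> requesting.
status = transition(CheckStatus.AWAITING_SUBMISSION, CheckStatus.REQUESTING)

# Not allowed: awaiting submission -> issued skips requesting and requested.
try:
    transition(CheckStatus.AWAITING_SUBMISSION, CheckStatus.ISSUED)
except IllegalStateChangeError as err:
    print(err)
```

An error message like the one printed here is the kind of log entry we would then look for in Splunk Cloud Platform.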
We have a customized search: when you go to Splunk prod, you can search by a particular primary key. In my case, that would be a public ID, a claim number, or a check number. Searching with it takes us right to the payload, where we can see the operations and more. We also create customized dashboards so that any alerts that pop up are displayed right there, and any team member can pick up and solve the issue. We occasionally do manual searches as well, but in lower environments. Splunk Cloud Platform also supports our INT and DEV environments. If we are trying to recreate a scenario in DEV or INT, we can check the logs and see where the issue recurs.
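A search by primary key like the one described above could be composed as follows. This is a rough sketch under assumptions: the index name `claims_prod`, the sample check number, and the helper `build_search` are all hypothetical, and the resulting string simply looks up the identifier as a quoted term within a time range.

```python
def build_search(identifier, index="claims_prod", earliest="-24h"):
    """Compose a Splunk search string that looks up one identifier as a quoted term.

    The index name and default time range are placeholders, not our real settings.
    """
    # Reject characters that would break out of the quoted term.
    if not identifier or '"' in identifier or "\\" in identifier:
        raise ValueError("identifier must be non-empty and free of quotes and backslashes")
    return f'index={index} "{identifier}" earliest={earliest}'


# Hypothetical check number used only for illustration.
query = build_search("CHK-1001")
print(query)
```

The same helper works for a public ID or a claim number, since the identifier is matched as plain text rather than against a named field.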