My main use case for Apache SkyWalking goes beyond monitoring microservices and APIs: we use it to manage the overall health of the application. I will explain the domains and backgrounds where we currently use it. It does more than check microservices or heavy queries. The IT team can integrate it on the DevOps side, wherever applications are being deployed. We use it to check the current status and health of the application and to see how our services are running. By services, I mean database queries and the backend logic that syncs data up to or down from the database, including retrieval, update, and insert queries. It also shows how the APIs are performing: out of many API calls, how many succeed and how many fail. If a timeout occurs, I can see how often it recurs and how long the timeout lasted or how long the network was unreachable.
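As a rough illustration of the success-versus-failure and timeout view described above, here is a minimal Python sketch. The `ApiCall` record, its status labels, and the sample data are hypothetical and are not SkyWalking's API; the sketch only shows the kind of summary I look at.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ApiCall:
    endpoint: str
    status: str          # hypothetical labels: "ok", "error", or "timeout"
    duration_ms: float


def summarize(calls):
    """Count call outcomes and total the time spent in timed-out calls."""
    counts = Counter(call.status for call in calls)
    timeout_ms = sum(c.duration_ms for c in calls if c.status == "timeout")
    total = len(calls)
    return {
        "total": total,
        "succeeded": counts["ok"],
        "failed": counts["error"] + counts["timeout"],
        "timeouts": counts["timeout"],
        "timeout_ms": timeout_ms,
        "success_rate": counts["ok"] / total if total else 0.0,
    }


# Made-up sample data for illustration only.
calls = [
    ApiCall("/payments", "ok", 120.0),
    ApiCall("/payments", "timeout", 5000.0),
    ApiCall("/orders", "ok", 80.0),
    ApiCall("/orders", "error", 45.0),
]
summary = summarize(calls)
print(summary)
```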
Take payment systems, where we run multiple applications: some payment processing calls may be failing, or insert queries may not be working. The backend architecture can have multiple layers, and during issue resolution it is very difficult to work out where the actual pain point is. Apache SkyWalking really helps identify that a particular API call failed on the payment side, perhaps three layers deep in the architecture, and shows the specific reason, such as a network timeout, an unreachable network, the bank server being down, or a third-party payment server not responding under heavy traffic.
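To make the "three layers deep" idea concrete, here is a small Python sketch that walks a call tree and returns the path to the deepest failing call. The `Span` class, the service names, and the error text are all hypothetical; this illustrates the idea and is not how SkyWalking stores or queries traces.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    error: str = ""                      # empty string means the call succeeded
    children: list = field(default_factory=list)


def deepest_failure(span, path=()):
    """Return the path (tuple of names) to the deepest failing span, or None."""
    here = path + (span.name,)
    best = here if span.error else None
    for child in span.children:
        found = deepest_failure(child, here)
        if found and (best is None or len(found) > len(best)):
            best = found
    return best


# Hypothetical payment trace: the failure originates at the bank gateway.
trace = Span("checkout-api", "upstream failed", [
    Span("inventory-service"),
    Span("payment-service", "upstream failed", [
        Span("bank-gateway", "network timeout"),
    ]),
])
root_cause = deepest_failure(trace)
print(" -> ".join(root_cause))
```

The deepest failing call is usually the best starting point for the RCA, since the errors above it are often just the same failure bubbling up.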
In other domains, beyond checking application health, if your applications or servers run in pods or Kubernetes containers, you can check the health of your pods as well. We have moved from OutSystems to Mendix and to other Java-hosted and .NET applications, all of which run on Kubernetes and nodes. You can easily check which node is working fine and which is not in a good state, how much traffic is currently passing through those containers or nodes, how they are connected, and which ones are responding well and which are not. These are the many areas where Apache SkyWalking makes issues easy to identify. Because it is an open-source platform, it helps from the licensing point of view and covers a wide range of use cases.
In terms of projects, I would like to share a couple of examples. One of our patient services applications was facing API failures. The initial suspects were a Java database upgrade, the fact service going down, or a global outage of a database server that was affecting all the API services. Then some fact line services started getting impacted, and because of that, a few of our Mule APIs stopped working properly. Since the project depended on cross-functional teams, each team was trying to identify where the actual cause lay. At a high level, we thought that the Java API might not be connecting properly with the fact API, or that the Mule API's internal call to the fact API was not going through. Someone was reaching out to the Mendix team to see if they could find the logs, and the cause could equally have been in .NET or another application, depending on what each team was working on. With Apache SkyWalking in place, you can easily identify that, for a particular time window, this was the API call that failed, this was the feature that stopped, and these are the documents that did not arrive. You can then pinpoint the area and tell the responsible team which APIs to check, and you can reach out to the support team or the vendor if needed so that the issue is prioritized within the SLA.
There is one more use case I would like to share, regarding one of the applications we built on the local platform. Apache SkyWalking can be integrated there as well. When a lot of traffic arrives within a particular second, there is sometimes a huge spike in Grafana or the logs, and it is very hard to see that this much traffic came in at that instant while your CPU or memory is quite low. You need to increase your space, but the logging cannot keep up, or pods crash and new pods get recreated. It is very hard to work out from the logs what is happening. Even in that area, you can integrate Apache SkyWalking and easily check the health of your Kubernetes containers and nodes.
Apache SkyWalking's best features, for an IT department, are checking microservices, the end-to-end health of the application, the nodes and Kubernetes, and which queries are running fine and which are running slow. On the SLA side, most of our queries should complete within 200 milliseconds. If a query takes longer than that, someone has to take the initiative to see where there is room for improvement.
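The 200 millisecond check can be sketched in a few lines of Python. The query names and timings below are made up, and `slow_queries` is a hypothetical helper, not part of SkyWalking; only the 200 ms threshold comes from our SLA.

```python
SLA_MS = 200  # our target: most queries should complete within 200 ms


def slow_queries(timings, sla_ms=SLA_MS):
    """Return (name, duration_ms) pairs that exceed the SLA, slowest first."""
    over = [(name, ms) for name, ms in timings.items() if ms > sla_ms]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical query durations in milliseconds.
timings = {
    "select_patient": 95.0,
    "insert_payment": 340.0,
    "update_order": 210.0,
}
offenders = slow_queries(timings)
for name, ms in offenders:
    print(f"{name}: {ms:.0f} ms exceeds the {SLA_MS} ms SLA")
```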
I started using Apache SkyWalking after running into a couple of scenarios in the IT department and on a couple of projects we were working on. I was exploring on my own how to get to root causes faster: how to identify the RCA, why so many microservices and APIs were failing, and where the bottleneck was. Because that project depended on members of several teams, that is how I came across Apache SkyWalking and explored it. It is a really wonderful tool for an IT team.
Apache SkyWalking helps me visualize data and performance by showing how the entire ecosystem is currently working. For example, if we have many Kubernetes containers and nodes connected to multiple projects or products inside the organization, it is very hard to manually export the health of the containers, see how traffic flows through them, tell which container is fine and which is bearing a lot of load, and decide how to shift that load. Done manually, it takes a lot of time. Visualizing it with Apache SkyWalking is a game-changer because it cuts that time down. You can see where the health of the system is good and where it is degrading, so that attention can go to that area as soon as possible.
Apache SkyWalking has positively impacted my organization by saving the team a lot of time that they can put into other tasks, improving our SLA for resolving issues, providing good RCA to the leadership team, and helping us monitor overall health in a shorter time span.
Apache SkyWalking can be improved in a few areas. First, there is a definite learning curve. Your engineers or experts need to be technically sound to use this platform. The normal configuration is easy to do, but heavy customization needs a deep understanding and takes time to learn.
Secondly, on storage management: because you are continuously health-checking and monitoring the entire ecosystem, it occupies a lot of space. You need to either purge that storage on a recurring basis or move older data to additional space in an archive, from which you can retrieve it after a certain period of time. Once there are a million records or traces, this becomes a heavy task. Storage management is an area that could be improved or explored further to make things easier and more convenient for the users who opt for it.
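The purge-or-archive decision can be sketched as a simple age-based policy. This is a hypothetical illustration in Python: the 7-day and 30-day windows, the trace IDs, and the `plan_retention` helper are assumptions of mine and not SkyWalking's storage settings or API.

```python
from datetime import datetime, timedelta


def plan_retention(records, now, keep_days=7, archive_days=30):
    """Sort trace IDs into keep / archive / purge buckets by age.

    keep_days and archive_days are assumed example windows.
    """
    plan = {"keep": [], "archive": [], "purge": []}
    for trace_id, created in records:
        age = now - created
        if age <= timedelta(days=keep_days):
            plan["keep"].append(trace_id)
        elif age <= timedelta(days=archive_days):
            plan["archive"].append(trace_id)
        else:
            plan["purge"].append(trace_id)
    return plan


# Made-up records: (trace ID, creation time).
now = datetime(2024, 1, 31)
records = [
    ("trace-a", datetime(2024, 1, 30)),   # 1 day old
    ("trace-b", datetime(2024, 1, 10)),   # 21 days old
    ("trace-c", datetime(2023, 11, 1)),   # 91 days old
]
plan = plan_retention(records, now)
print(plan)
```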
Thirdly, the UI could be modified to make it more polished. To sum up, Apache SkyWalking should reduce the complexity of managing storage, shorten the learning curve for heavy customization, and make the UI more convenient and more attractive for the end user.
Apache SkyWalking performs well in terms of reliability and uptime. The only issue is with managing its storage complexity, which is the major one. Apart from that, it is pretty good.
Apache SkyWalking integrates easily with other tools and platforms in my environment, provided you have worked through the learning curve and understand the technicalities. A DevOps team will be able to do that easily.
Apache SkyWalking handles security and compliance requirements in my environment effectively, and I have found it helpful in identifying where the pain point area is while helping us in the RCA.
My advice for others looking into Apache SkyWalking: if you want to see the overall health of the application and how the ecosystem is working without asking developers to write code just to check health statuses, you can definitely go with Apache SkyWalking. It covers the whole application ecosystem, including Kubernetes, cloud services, and nodes, and it monitors microservice calls, API calls, and how they are currently functioning. Deep inside the architecture, it helps in identifying, monitoring, and managing the entire application, which reduces the time you need to resolve issues, whether they are hot-potato incidents, hotfixes, or impediments blocking the team. It also helps a lot in identifying areas for improvement and in sharing reports with senior leadership. I would rate this solution an 8.5 out of 10.