Developing Smartphone Games

 

Andy Sjostrom
businessanyplace

January 2003

Applies to:
   Microsoft® Windows® Powered Smartphone 2002

Summary

Get an overview and some tips on developing games for Microsoft Smartphone 2002 Software.

Contents

Games Market Overview
The Development of Games
Games Being Ported to Smartphone
Writing Efficient Games
Memory Access and Bandwidth
Memory Manager
Message System
Resource Manager
Do Not Use Floating Point Math
Do Not Use Division
Use Data Types that Provide Just Enough Resolution to Handle Your Problem Domain
Use Lookup Tables for Everything
Do All Your Math in Assembler
Turn Off Hardware Button Click Sound
Game Screenshots
Conclusion

Games Market Overview

The new Microsoft® Windows® Powered Smartphones arrive with some very attractive games, with more to come in the future. Read on to see why these phones attract serious gamers and what you need to be aware of when developing successful games.

The computer games market has come a long way since the first simple character-based games back in the late 1970s. From a financial standpoint, some analysts expect the U.S. games market alone, combining console and PC software sales, to generate $18 billion by 2004. The total value of worldwide sales has already surpassed the value of the entire Hollywood movie industry. From a technical perspective, the games market drives the development of consumer hardware and, in the last few years, even the growth of Internet adoption and connectivity. It certainly is not word processors, spreadsheets, or line-of-business applications that demand stronger 3D graphics chips, more mass storage, and faster processors.

The value of the new wireless gaming industry is forecast to reach between $4 billion and $15 billion by 2006, depending on whom you ask. Regardless of where exactly the games industry is headed, there is nothing but good times ahead.

The Development of Games

Game development for mobile phones is on a path similar to its PC-based counterpart. The advances in PC gaming include moving from non-graphical to graphical, from single player to multiplayer, and from disconnected to Internet-connected. For years, the games found in mobile phones have resembled the early PC games that had little or no graphics. While some mobile phone makers have started including support for better graphics, even today there are mobile phones sold with built-in games that have almost no graphics at all.

With the Smartphone, the mobile phone industry will be able to take giant leaps similar to those made in the PC games market, only much faster. The first games expected to be released for the Smartphone are existing Windows and Pocket PC games. Because game developers can use the same development tools, programming languages, and operating system APIs (Application Programming Interfaces), the effort needed to port those games to the Smartphone is minimal.

Games Being Ported to Smartphone

The table below shows some of the companies involved in developing and porting games for the Smartphone 2002.

Table 1. Companies involved in developing and porting games for the Smartphone 2002.

Company | Games
Ideaworks3D | "Rebound!," plus a number of additional games, in some cases with publishing partners such as Eidos Interactive.
Hexacto | "Tennis Addict," "Full Hand Casino" and "Slurp." More games will follow shortly including "Lemonade Inc.," "Baseball Addict" and "Bob The Pipe Fitter." Several of these titles will feature multiplayer gaming capabilities, capitalizing on the wireless abilities of the Smartphone.
Incagold | "Slamtilt," a 3D pinball game
Pixel Technologies | "MobilePlay™ 1," an online games pack which includes multiplayer Chess, Checkers, Poker, Spades, Backgammon and Reversi, etc.
Xen Games | "Interstellar Flames," an action game where you fly a fighter to defend the earth
Terra Mobile-iobox | "Defender," and additional titles later on with wireless functionality

Game development, and the quality of the resulting games, depends heavily on the target platform's capabilities and the available game engines. In addition to the APIs in the Smartphone Software Development Kit, the following game engines are expected to be available:

Follow these links to see the rich capabilities that allow for a compelling gaming experience for Smartphone users.

Writing Efficient Games

We have turned to Sven Myhre, CEO and developer at Amazing Games, to get some inside knowledge about writing efficient games.

One common misconception among handheld users and developers is that a modern ARM processor is comparable to a Pentium processor running at a similar clock speed. The comparison does not reflect well on the ARM processor: an older Pentium will run circles around all current ARM-based Smartphones and Pocket PCs. This is due both to the processor itself and to the support systems around it.

The Pentium was superscalar (it could execute more than one instruction per clock cycle), with five parallel execution units and an integrated floating point unit. Normally you would find both an internal L1 cache as well as a generous external L2 cache in most PCs built around it.

Current ARM-based Windows Powered Smartphones and Pocket PCs are scalar at best (they can perform one instruction per cycle). In addition, the instruction set is quite limited and contains only the most basic instructions. More advanced instructions, for instance a division, do not exist and must be emulated in software.

Another issue is the ability to feed the processor with instructions and data to keep it running at full speed. Most designs use a single 16-bit bus for fetching both code instructions and data. Since all instructions are 32 bits wide, the bus would have to run at twice the speed of the processor just to feed the code instruction pipeline. It does not. It is actually the other way around: the bus is slower than the processor, by a factor that varies from 2x to 4x. A 66-MHz, 16-bit bus can only deliver enough code instructions to keep a 33-MHz ARM processor busy, and that does not include the data you want to process.

To overcome this problem, most ARM processors include an instruction cache and a data cache. Typical sizes are 8 KB each for code instructions and data, although some are as large as 32 KB. As long as the requested code instruction or data is present in the cache, the CPU can fetch it at full speed without going through the slow memory bus. But as soon as you start accessing code and data that is not already in the cache, you are effectively back to less than 33 MHz (given a 66-MHz, 16-bit bus). In fact, it is quite easy to reduce a 132-MHz ARM processor to 2 MHz just by organizing your data so inefficiently that the cache starts working against you.

But even that is fast compared to what you can do with the wrong data types and inefficient coding. Perform a lot of divisions, or start abusing floating point data types, and the CPU will struggle to reach 0.2 million code instructions per second.

Memory Access and Bandwidth

Smartphones might be equipped with a range of ARM-based processors, each with different memory access costs, but you can pretty much be sure that you will have too little cache and too slow a memory bus to allow yourself to ignore this aspect.

For instance, consider an example where your processor is running at 132 MHz and your memory bus is 16 bits wide and running at 66 MHz. Every time you read a byte that is not already present in the processor's cache, the processor first has to fill an entire cache line. A cache line might be 16 words (a word equals 32 bits or 4 bytes in the realm of ARM architecture), which means a cache line is 16 * 4 = 64 bytes. Since your memory bus is 16 bits wide, it will be occupied for 32 bus cycles before it is done. To make matters worse, the bus is running at half the speed of your processor, so your processor will stall for 64 cycles before you get the single byte you requested. So make sure the cache line fill is worth the wait. Also, ensure that your memory is packed as tightly as possible and check your memory access patterns to see if you can rearrange structures to make things run more efficiently. If you regularly access just one data member within a structure and process a lot of these structures, see if you can move this specific data member into its own array.

For the same reason, use bytes (8 bits) or double-bytes (16 bits) whenever possible, since the ARM processor can expand both unsigned and signed bytes and double-bytes to words during a memory-to-register load. However, when storing signed values from registers to a byte or double-byte memory location, the compiler will generate two extra shift instructions to ensure that the value keeps its sign bit even if the value in your register is too large to fit the designated memory location. On ARMs, one of the shifts might come for free, depending on how efficient the compiler is, because almost every "regular" instruction can be coupled with a shift. This is information you should at least be aware of for your inner loops. Every time you use a byte or double-byte variable, use an unsigned data type whenever you can.
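The advice about moving a frequently accessed member into its own array can be sketched as follows. The `EnemyAoS` and `EnemiesSplit` types, their fields, and `count_alive` are hypothetical names chosen for illustration; they do not come from any Smartphone API. Assuming the 28-byte structure and the 64-byte cache line described above, scanning `health` for 256 enemies touches about 112 cache lines in the first layout but only 4 in the split layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENEMIES 256

/* Array-of-structures layout: scanning only `health` still drags every
   other field of each enemy through the cache. */
typedef struct {
    int32_t  x, y, z;        /* position, 16:16 fixed point */
    int32_t  vx, vy, vz;     /* velocity, 16:16 fixed point */
    uint16_t sprite;         /* handle of the sprite to draw */
    uint8_t  health;         /* the only field the scan below needs */
    uint8_t  flags;
} EnemyAoS;

/* Split layout: the frequently scanned member lives in its own tightly
   packed array, so one 64-byte cache line holds 64 health values. */
typedef struct {
    uint8_t health[MAX_ENEMIES];
    struct {
        int32_t  x, y, z;
        int32_t  vx, vy, vz;
        uint16_t sprite;
        uint8_t  flags;
    } cold[MAX_ENEMIES];
} EnemiesSplit;

/* Count living enemies by touching only the packed health array. */
static size_t count_alive(const EnemiesSplit *e, size_t count)
{
    size_t i, alive = 0;
    assert(count <= MAX_ENEMIES);
    for (i = 0; i < count; ++i) {
        if (e->health[i] != 0)
            ++alive;
    }
    return alive;
}
```

Whether the split pays off depends on the access pattern: if most code paths read all fields of one enemy at a time, the original layout may be the better choice.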

Memory Manager

You probably knew this already, but general memory managers and functions like malloc, realloc and new (usually just a wrapper for malloc) are very slow. You are often better off allocating all the memory you need up front and using your own memory manager. This is the single most important subsystem in your game project. During development, you can easily incorporate consistency checks, test for bad pointers, and make sure all your free and delete calls match your alloc and new calls.

It is also much faster to allocate arrays of equal-sized structures and just use a bitmask (or other means) to handle allocation and deallocation.
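A minimal sketch of such a bitmask-managed array, assuming a pool of at most 32 equal-sized items so that a single 32-bit mask is enough. `Particle`, `ParticlePool`, and the `pool_` functions are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS 32           /* one bit per slot in a 32-bit mask */

typedef struct {
    int32_t x, y;               /* hypothetical payload */
    int32_t ttl;
} Particle;

typedef struct {
    uint32_t used;              /* bit i set => slots[i] is allocated */
    Particle slots[POOL_SLOTS]; /* storage reserved once, up front */
} ParticlePool;

static void pool_init(ParticlePool *p)
{
    p->used = 0;
}

/* Return a free slot, or NULL when the pool is exhausted. */
static Particle *pool_alloc(ParticlePool *p)
{
    uint32_t i;
    for (i = 0; i < POOL_SLOTS; ++i) {
        uint32_t bit = (uint32_t)1 << i;
        if ((p->used & bit) == 0) {
            p->used |= bit;
            return &p->slots[i];
        }
    }
    return NULL;
}

/* Release a slot; the asserts catch bad pointers and double frees
   in debug builds. */
static void pool_free(ParticlePool *p, Particle *item)
{
    ptrdiff_t i = item - p->slots;
    assert(i >= 0 && i < POOL_SLOTS);
    assert((p->used & ((uint32_t)1 << i)) != 0);
    p->used &= ~((uint32_t)1 << i);
}
```

Allocation and deallocation never touch the general-purpose heap, and the debug asserts are one way to get the consistency checks mentioned above. Larger pools would need an array of masks.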

Message System

The second most important subsystem is a rock-solid messaging system. This does not refer to the traditional Windows message pump. All interaction between game objects should go through your own private message system. This includes the option to set a time of delivery for messages, so you can send messages to yourself or to other objects to be delivered at some time in the future. When the game character picks up a power-up item, for instance, you can simply post a "de-activate powerup" message to yourself for delivery in 5 seconds. Since objects are not guaranteed to still be alive when the message is delivered, you should never use pointers. Handles (static IDs) are the way to go.

With such a system, it is easy to add replay functionality, since your central message system can simply save all messages to a file during game play. The replay system then reads the messages back from the save file and dispatches them sequentially. This is also great for debugging: add a console or a log file where you can watch all messages in real time, or replay the exact same message stream until you reach the point where your game crashes.

Even object creation and deletion should be done with messages, and the same goes for player input. The only thing missing is knowledge of object ownership; add that, and you can extend the message system to bridge to another computer, which is the basis for a multiplayer system. Since all interaction goes through the message system, your game logic will not distinguish between messages sent by a human player (hardware buttons), AI players, or remote network players.
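The timed, handle-based delivery described in this section might look roughly like the sketch below. All names (`MessageSystem`, `msg_post`, `msg_dispatch`, `record_handler`) and the fixed queue sizes are assumptions made for illustration. Due messages are delivered in posting order, and messages addressed to an object that has died are dropped.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MESSAGES 64
#define MAX_OBJECTS  16

typedef uint16_t Handle;                 /* object id, never a pointer */

typedef struct {
    uint32_t deliver_at;                 /* game tick for delivery */
    Handle   target;
    uint16_t type;                       /* game-defined message type */
    int32_t  param;
} Message;

typedef void (*MessageHandler)(const Message *msg);

typedef struct {
    Message queue[MAX_MESSAGES];         /* pending, in posting order */
    int     count;
    uint8_t alive[MAX_OBJECTS];          /* alive[h] != 0 => object exists */
} MessageSystem;

/* Queue a message for delivery at tick deliver_at; returns 0 if full. */
static int msg_post(MessageSystem *ms, Handle target, uint16_t type,
                    int32_t param, uint32_t deliver_at)
{
    Message *m;
    assert(target < MAX_OBJECTS);
    if (ms->count == MAX_MESSAGES)
        return 0;
    m = &ms->queue[ms->count++];
    m->deliver_at = deliver_at;
    m->target = target;
    m->type = type;
    m->param = param;
    return 1;
}

/* Deliver every message that is due at tick now and keep the rest.
   Returns the number of messages delivered. */
static int msg_dispatch(MessageSystem *ms, uint32_t now, MessageHandler handler)
{
    int i, kept = 0, delivered = 0;
    for (i = 0; i < ms->count; ++i) {
        Message m = ms->queue[i];
        if (m.deliver_at > now) {
            ms->queue[kept++] = m;       /* not due yet: keep it */
        } else if (ms->alive[m.target]) {
            handler(&m);                 /* a replay log could record m here */
            ++delivered;
        }                                /* dead target: message is dropped */
    }
    ms->count = kept;
    return delivered;
}

/* Example handler: remembers the last message type seen per object. */
static uint16_t last_type[MAX_OBJECTS];
static void record_handler(const Message *msg)
{
    last_type[msg->target] = msg->type;
}
```

With a hypothetical rate of 60 ticks per second, the power-up example from the text would be posted as a message with `deliver_at` set to the current tick plus 300. The `MessageSystem` is assumed to start zero-initialized.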

Resource Manager

To reiterate, never access your objects with pointers. All your resources should be referenced with handles. For efficiency, you might allow locking and unlocking of such resources within a short code segment, but resources should never be held locked over a period of time (never keep a lock over several frames). By adding a use counter, you can reuse read-only resources and save memory. Also, you do not have to actually free resources when the use counter goes back to zero. By keeping the resources in memory, you can achieve almost zero loading time the next time you need the resource. A good practice might be to allocate enough memory to hold all the resources you need simultaneously, or 75 percent of free system memory, whichever is greater. This way, you can take advantage of more powerful devices by using the extra memory to cache resources.

Your resource manager should know how to reload each resource item. Armed with this knowledge, it is able to free memory occupied by resource items and reload them without involving the rest of your game code. Freeing all resource memory would be a natural response to losing focus to other applications, as is reloading them when you get focus again. This should happen automatically, and be totally transparent to the rest of your code.
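The ideas in this section (handles, a use counter, short-lived locks, and transparent reloading) can be combined in a small sketch. Everything here is hypothetical: the `ResManager` layout, the `res_` function names, and `demo_loader`, which fakes an asset with `malloc` only to keep the example short. A real implementation would more likely draw from the preallocated memory discussed earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_RESOURCES 32

typedef int ResHandle;                        /* table index, never a pointer */
typedef void *(*ResLoader)(int resource_id);  /* returns malloc'd data */

typedef struct {
    int       registered;   /* slot describes a known resource */
    int       resource_id;
    int       use_count;    /* current number of users */
    int       locks;        /* outstanding res_lock calls */
    void     *data;         /* NULL while purged from memory */
    ResLoader loader;       /* how to (re)load this item */
} ResSlot;

/* Must start zero-initialized (for example, as a static object). */
typedef struct { ResSlot slots[MAX_RESOURCES]; } ResManager;

static ResSlot *res_slot(ResManager *rm, ResHandle h)
{
    assert(h >= 0 && h < MAX_RESOURCES && rm->slots[h].registered);
    return &rm->slots[h];
}

/* Get a handle for resource_id; users of the same id share one slot. */
static ResHandle res_acquire(ResManager *rm, int resource_id, ResLoader loader)
{
    int i, free_slot = -1;
    for (i = 0; i < MAX_RESOURCES; ++i) {
        ResSlot *s = &rm->slots[i];
        if (s->registered && s->resource_id == resource_id) {
            ++s->use_count;
            return i;
        }
        if (!s->registered && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        ResSlot *s = &rm->slots[free_slot];
        s->registered = 1;
        s->resource_id = resource_id;
        s->use_count = 1;
        s->locks = 0;
        s->data = NULL;          /* loaded lazily on first lock */
        s->loader = loader;
    }
    return free_slot;            /* -1 when the table is full */
}

/* Drop one use. The data stays cached even when the count reaches zero. */
static void res_release(ResManager *rm, ResHandle h)
{
    ResSlot *s = res_slot(rm, h);
    assert(s->use_count > 0);
    --s->use_count;
}

/* Short-lived access to the data; reloads it transparently if purged. */
static void *res_lock(ResManager *rm, ResHandle h)
{
    ResSlot *s = res_slot(rm, h);
    if (s->data == NULL)
        s->data = s->loader(s->resource_id);
    ++s->locks;
    return s->data;
}

static void res_unlock(ResManager *rm, ResHandle h)
{
    ResSlot *s = res_slot(rm, h);
    assert(s->locks > 0);
    --s->locks;
}

/* Free all unlocked cached data, for example when the game loses focus. */
static void res_purge_all(ResManager *rm)
{
    int i;
    for (i = 0; i < MAX_RESOURCES; ++i) {
        ResSlot *s = &rm->slots[i];
        if (s->registered && s->locks == 0) {
            free(s->data);
            s->data = NULL;
        }
    }
}

/* Hypothetical loader for illustration: fakes a 16-byte asset. */
static int demo_load_calls;
static void *demo_loader(int resource_id)
{
    unsigned char *p = (unsigned char *)malloc(16);
    ++demo_load_calls;
    if (p != NULL)
        p[0] = (unsigned char)resource_id;
    return p;
}
```

Because the rest of the game only ever holds a `ResHandle`, purging and reloading never invalidates anything the game code has stored.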

Do Not Use Floating Point Math

Your ARM processor does not have native support for floating point math. All floating point math runs in a special floating point emulator, and it is very slow. It is not uncommon to see floating point functions that need thousands of cycles to complete. This is why a game project would normally use fixed point formats instead.

A fixed point value is actually just an integer where you assign an imaginary (but fixed) number of bits as the fraction part of the value. In decimal terms, it is like saying that the last three digits of a number are its fractional part. To express the number 0.500, you simply multiply by 1000 and end up with the number 500. The tricky part is to mentally envision this invisible decimal point at all times.

Addition and subtraction work fine: 500 + 500 = 1000 (or mentally: 0.500 + 0.500 = 1.000). Multiplication and division are another matter: 500 * 500 = 250000, which read as a fixed point value would mean 0.500 * 0.500 = 250.000, and that is incorrect. After a multiplication of two fixed point values, you need to divide the result. If you divide the result by 1000, you are fine (250000 / 1000 = 250, or mentally 0.250, which is correct). So to multiply, you just perform the usual multiplication and then divide the result to normalize it.

This brings us to another interesting subject. What about data range in the intermediate result, before you normalize it? In the example above, you might exceed the available number of bits when you perform the multiplication, meaning that you get an overflow and lose the most significant part of the result. The trick is to make sure you use a data format in the intermediate result that can hold the largest possible result. When you multiply two 32-bit values, your intermediate value must be 64 bit. After normalizing (and cropping), the number of bits will again be 32.

   int Multiply16_16_by_16_16( int a16_16, int b16_16 )
   {
     __int64 tmp32_32;
     int result16_16;
     tmp32_32 = a16_16;   // widen to 64 bits before multiplying
     tmp32_32 *= b16_16;
     // intermediate result is now 32:32
     tmp32_32 >>= 16; // chop off the lower 16 fraction bits
     result16_16 = ( int ) tmp32_32; // chop off the upper 16 integer bits
     // result is now back at 16:16
     return result16_16;
   }

To divide, you do the opposite: multiply first, then perform the divide. In other words, scale the dividend up by the fixed-point factor (1000 in the decimal example, or a left shift by 16 for 16:16 values) in a sufficiently wide intermediate value, and then divide by the divisor.

A common fixed-point format is 16:16, where the upper 16 bits are the integer part and the lower 16 bits are the fractional part. For my current game project, I use a wide range of different formats in order to cover the different ranges of values used in different parts of the game engine. To sum it up, I use 2:30, 8:24, 16:16, 24:8, 28:4, 2:14, 8:8, 11:5, 2:8 and 4:4. Most of them are 32-bit values, but some are 16 bits, some 10 bits, and some even just 8 bits.
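As a sketch of the division case for 16:16 values: the dividend is widened to 64 bits and scaled by 2^16 before a single integer division. The function name `Divide16_16_by_16_16` and the `FIX_ONE` constant are assumptions that mirror the naming of the multiply routine, and the portable `int64_t` type stands in for the compiler-specific `__int64`.

```c
#include <assert.h>
#include <stdint.h>

#define FIX_ONE 65536            /* 1.0 in 16:16 format */

/* Divide two 16:16 values. The caller must make sure the quotient
   fits in 16:16, otherwise the final cast loses the upper bits. */
static int32_t Divide16_16_by_16_16(int32_t a16_16, int32_t b16_16)
{
    int64_t tmp;
    assert(b16_16 != 0);
    /* "Multiply first": scale the dividend to 32:32. Multiplying is
       used instead of shifting so negative values stay well defined. */
    tmp = (int64_t)a16_16 * FIX_ONE;
    /* Then divide; the result is back at 16:16. */
    return (int32_t)(tmp / b16_16);
}
```

For example, 3.0 divided by 0.5 is `Divide16_16_by_16_16(3 * FIX_ONE, FIX_ONE / 2)`, which yields `6 * FIX_ONE`. Note that this routine still pays for one full division.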

Do Not Use Division

Your game project should not perform a single division. The ARM processor does not have native support for division, so every division is emulated in software and costs you on the order of two thousand cycles. A 132-MHz ARM is theoretically capable of performing 132 million instructions per second (or 264 million if 50 percent of them are shifts). But with divisions, you max the CPU out at about 70,000 divisions per second. If your game runs at 70 frames per second, that means you are using the processor to its maximum if you perform just 1,000 divisions for every frame you draw.

Try instead to replace all divisions with shifts and/or multiplications. Division by 16 can be written as a shift-right by 4. More complex divisions might be performed as a combination of shifts and/or multiplications.
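Two small illustrations of this idea, restricted to unsigned values to keep the rounding simple. The function names are made up for this sketch. The constant 52429 is 2^19 / 10 rounded up, which makes the multiply-and-shift result match integer division by 10 for every 16-bit input.

```c
#include <assert.h>
#include <stdint.h>

/* x / 2^shift for unsigned x: a single shift.
   div_pow2(x, 4) is the "division by 16" case from the text. */
static uint32_t div_pow2(uint32_t x, unsigned shift)
{
    assert(shift < 32);
    return x >> shift;
}

/* x / 10 for a 16-bit unsigned x: multiply by the scaled reciprocal
   52429 (about 2^19 / 10) and shift right by 19. The largest product,
   65535 * 52429, still fits in 32 bits. */
static uint32_t div10_u16(uint16_t x)
{
    return ((uint32_t)x * 52429u) >> 19;
}
```

Other constant divisors need their own multiplier and shift, chosen so that the product cannot overflow for the input range in question.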

Divisions can also be performed with lookup tables. However, with both a 32-bit numerator and denominator, you need a two-dimensional lookup table which far exceeds the available amount of memory. One solution could be to reduce the problem domain.

In the expression a / b, you can insert a multiply by one wherever you want without changing the final result. So a * 1 / b yields the exact same result. In addition, you can rearrange the order of your multiplications and divisions, and write it as a * ( 1 / b ). Now, the division is reduced to 1 / b, which only needs a one-dimensional lookup table. You can also shrink the lookup table further by reducing the precision. Let's say that 16 bits of precision is sufficient in most cases; then you only need 64K entries in your lookup table. By finding the MSB (most significant bit) in b, you can use this information to shift both b and the result up or down to fit your 32-bit value and keep the highest possible resolution. Even if you include the time it takes to fill a cache line after a random access to the division lookup table, you will end up with a worst case of less than 100 cycles. That is a 20-fold improvement over the standard division provided by the compiler, in exchange for a small loss of precision.
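A scaled-down sketch of the reciprocal table technique. To keep the example short it assumes only 8 bits of precision (256 entries) rather than the 16 bits and 64K entries mentioned above, so the result can be off by roughly 0.4 percent. The names `recip_table`, `recip_init`, and `div_approx` are hypothetical. The table itself is built with ordinary divisions, but only once at startup (it could equally be generated offline).

```c
#include <assert.h>
#include <stdint.h>

#define RECIP_BITS 8                       /* mantissa bits used as index */
#define RECIP_SIZE (1 << RECIP_BITS)

/* recip_table[i] = 2^16 / (1 + i/256), rounded to nearest. */
static uint32_t recip_table[RECIP_SIZE];

/* Fill the table once at startup. */
static void recip_init(void)
{
    uint32_t i;
    for (i = 0; i < RECIP_SIZE; ++i) {
        uint32_t m = RECIP_SIZE + i;       /* mantissa in [256, 512) */
        recip_table[i] = (65536u * RECIP_SIZE + m / 2) / m;
    }
}

/* Approximate a / b as a * (1 / b) with one table lookup. */
static uint32_t div_approx(uint32_t a, uint32_t b)
{
    int msb = 0;
    uint32_t t = b, index;
    assert(b != 0);
    while ((t >>= 1) != 0)                 /* position of b's highest set bit */
        ++msb;
    /* Normalize b so its leading 1 sits just above the index bits. */
    if (msb >= RECIP_BITS)
        index = (b >> (msb - RECIP_BITS)) & (RECIP_SIZE - 1);
    else
        index = (b << (RECIP_BITS - msb)) & (RECIP_SIZE - 1);
    /* 1/b is about recip_table[index] / 2^(16 + msb). */
    return (uint32_t)(((uint64_t)a * recip_table[index]) >> (16 + msb));
}
```

The loop that finds the MSB is written for clarity; a production version would use a faster method, and a larger table would trade memory for precision.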

Use Data Types that Provide Just Enough Resolution to Handle Your Problem Domain

Memory access is one of the most important aspects of achieving high performance code. This means you should try to use the smallest data types that cover your problem domain. Do you really need 16-bit indexes in your meshes? Is it possible to reduce the number of bits to 8? With a maximum of 256 edges, 256 vertices, 256 light/shading values, 256 polygons, 256 normals and 256 texture coordinates, you might have to split some meshes into multiple smaller meshes. The benefit could be much faster memory access.

How do you use your normal vectors? Are they used primarily for light calculations or visibility testing? I use a 2:8 fixed point format for my normal vector components, meaning I have a full -1.99609375 to +1.99609375 range available, with a 0.00390625 resolution. In other words, I have 8 bits of fractional resolution to cover a full 360 degrees with 1.4 degrees resolution. On a tiny screen like 176x220, no one can tell the difference if a point is lit with a (worst case) +/- 0.7 degrees wrong direction to the light. The benefit is that I can store the x, y, and z components in a single word.
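One possible way to pack three 2:8 components into a 32-bit word is sketched below. The bit layout (x in the lowest 10 bits, then y, then z, with the top 2 bits unused) and the function names are assumptions; the text does not say how the author arranges the components. Each component is a 10-bit two's complement integer, so the raw values run from -512 to 511, where 256 represents 1.0.

```c
#include <assert.h>
#include <stdint.h>

#define N_BITS 10                     /* 2:8 => 2 integer + 8 fraction bits */
#define N_MASK 0x3FFu

/* Pack three 2:8 components, each given as a raw value in -512..511. */
static uint32_t pack_normal(int x, int y, int z)
{
    assert(x >= -512 && x <= 511);
    assert(y >= -512 && y <= 511);
    assert(z >= -512 && z <= 511);
    return ((uint32_t)x & N_MASK)
         | (((uint32_t)y & N_MASK) << N_BITS)
         | (((uint32_t)z & N_MASK) << (2 * N_BITS));
}

/* Extract component 0 (x), 1 (y) or 2 (z) and restore its sign. */
static int unpack_component(uint32_t packed, int index)
{
    int v;
    assert(index >= 0 && index <= 2);
    v = (int)((packed >> (index * N_BITS)) & N_MASK);
    return (v & 0x200) ? v - 0x400 : v;   /* bit 9 is the sign bit */
}
```

A unit normal pointing along +x would be stored as `pack_normal(256, 0, 0)`.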

Use Lookup Tables for Everything

Building lookup tables for items that can be pre-computed will only cost you a single memory access. Compared to complex mathematical functions that might take several thousand instructions to perform, this is a pretty good trade, even though you will have to give up a little chunk of memory to store the lookup table.

Typical candidates for lookup tables are inverse division (1 / x), sine computation, color mixing and lighting. In my current game project, almost the entire environment mapping and lighting/shading pipeline in the renderer is implemented with a couple of pre-computed lookup tables.
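As an illustration of the sine case, the sketch below stores one quarter of a sine wave and derives the other three quadrants by symmetry. The choices made here are assumptions for the example: a full circle is divided into 64 angle steps, results carry 8 fraction bits (256 represents 1.0), and the names `quarter_sine`, `sin_lut`, and `cos_lut` are made up. A real game would probably use a finer table.

```c
#include <assert.h>
#include <stdint.h>

/* sin(i * 5.625 degrees) * 256, rounded, for i = 0..16. */
static const int16_t quarter_sine[17] = {
      0,  25,  50,  74,  98, 121, 142, 162, 181,
    198, 213, 226, 237, 245, 251, 255, 256
};

/* Sine of an angle given in 1/64ths of a full circle.
   Returns the sine scaled by 256. */
static int sin_lut(int angle)
{
    int a = angle & 63;          /* wrap into one full circle */
    int i = a & 15;              /* position within the quadrant */
    switch (a >> 4) {            /* quadrant 0..3 */
    case 0:  return  quarter_sine[i];
    case 1:  return  quarter_sine[16 - i];
    case 2:  return -quarter_sine[i];
    default: return -quarter_sine[16 - i];
    }
}

/* Cosine is the sine shifted by a quarter circle. */
static int cos_lut(int angle)
{
    return sin_lut(angle + 16);
}
```

The whole table occupies 34 bytes, so it fits in a single cache line, and no floating point code is involved at run time.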

The cool thing about games is that we are just trying to convince people that they are part of a living, breathing world. As soon as it looks good and feels right, we have reached our goal, no matter what method we used to make it happen. Good enough is our mantra for optimizing.

Do All Your Math in Assembler

If you check the compiler output in the previous code example about multiplication of fixed point numbers, you will discover that even the optimized release code output is not very efficient:

   stmdb     sp!, {r11, lr}  ; stmfd
   mov       r11, r0
   mov       r2, r1, asr #31
   mov       r3, r0, asr #31
   mul       r2, r11, r2
   mul       r11, r3, r1
   add       r3, r2, r11
   umull     r11, r2, r0, r1
   mul       r1, r0, r1
   add       r0, r3, r2
   mov       r3, r0, lsl #16
   orr       r0, r3, r1, lsr #16
   ldmia     sp!, {r11, pc}  ; ldmfd

Since C and C++ code has no way to express the precise goal of our code, the compiler has to take what we write and translate those statements into assembler code that is correct for every possible input.

As programmers, we know exactly what we want to achieve and how the microprocessor can best reach our goal, and the result is often much more compact:

   smull   r2, r3, r0, r1        ; r3:r2 = r0 * r1 (signed 64-bit product)
   mov     r0, r3, lsl #16       ; upper half supplies result bits 31-16
   orr     r0, r0, r2, lsr #16   ; lower half supplies result bits 15-0
   mov     pc, lr                ; return with the result in r0

Four multiplications are replaced with one. Much of the moving back and forth between registers is eliminated, and since we end up using only the four volatile registers (r0-r3), we do not have to set up and restore a stack frame.

Turn Off Hardware Button Click Sound

You might have noticed that many games on Smartphones suffer from a "jagged" frame rate, but if you turn off the hardware button click sound, the games appear to run much smoother.

For some reason, the device seems to freeze for a fraction of a second as the operating system plays the button click sound every time you push a button.

Fortunately, almost every aspect of the user interface can be configured with XML. By setting up a small configuration script, you can tell the Configuration Manager to change almost anything you want.

<wap-provisioningdoc>
  <characteristic type="Sounds">
    <characteristic type="ControlPanel\Sounds\KeyPress">
      <parm name="Mode" value="0"/> <!-- 0=none, 1=tone, 2=click -->
    </characteristic>
  </characteristic>
</wap-provisioningdoc>

Use the DMProcessConfigXML() function to push the above XML configuration data through the Configuration Manager.

Remember that you might lose focus during execution, so be sure to restore the original configuration before giving up control to another application. And since the user might have changed the settings while your application was inactive, read the settings back when you regain focus.

Game Screenshots

Here are some nice screenshots from the Smartphone game development labs. Figure 1 contains some screenshots from Hexacto.

Figure 1. Hexacto games in development.

Figure 2 contains screenshots from Ideaworks3D's Rebound!.

Figure 2. Ideaworks3D games in development.

Figure 3 contains screenshots from games using Fathammer's X-Forge 3D Game Engine. The screenshots are from both Pocket PC and Smartphone versions of the same game.

Figure 3. Fathammer's X-Forge 3D Game Engine games in development

Conclusion

The Smartphone is the first mobile phone with enough processing and graphics power, as well as connectivity, to bring the rich gaming experience we expect on a PC platform to the world of mobile phones. Game on!