Cruz Izu's List of Publications

Journal List

Recent Conferences


Paper by topics:
<-------------- BUBBLE ROUTERS --------------->

On the Design of a High-Performance Adaptive Router for CC-NUMA Multiprocessors

V.Puente, J.A. Gregorio, R. Beivide. and C. Izu,

This work presents the design and evaluation of an adaptive packet router aimed at supporting CC-NUMA traffic. We explot a simple and efficient packet injection mechanism to avoid deadlock,which leads to a fully adaptive routing by employing only three virtual channels. In addition, we selectively use output buffers for implementing the most utilized virtual paths in order to reduce head-of-line blocking. The careful implementation of these features has resulted in a good trade-off between network performance and hardware cost. The outcome of this research is a High-Performance Adaptive Router (HPAR) which adequately balances the needs of parallel applications: minimal network latency at low loads and high throughput at heavy loads. The paper includes an evaluation process in which HPAR is compared with other adaptive routers using FIFO input buffering, with or without additional virtual channels to reduce head-of-line blocking. The evaluation contemplates both the VLSI costs of each router and their performance under synthetic and real application workloads. To make the comparison fair, all the routers use the same efficient deadlock avoidance mechanism. In all the experiments, HPAR exhibited the best response among all the routers tested. The throughput gains ranged from 10% to 40% in respect to its most direct rival which employs more hardware resources. Other results shown that HPAR achieves up to 83% of its theoretical maximum throughput under random traffic and up to 70% when running real applications. Moreover, the observed packet latencies were comparable to those exhibited by simpler routers. Therefore, HPAR can be considered as a suitable candidate to implement packet interchange in next generations of CC-NUMA multiprocessors.

in IEEE Trans. on Parallel and Distributed Systems, Vol. 14, NO. 5, May 2003

The paper is available as PDF 3.6 M

The Adaptative Bubble Router

V. Puente, C. Izu, R. Beivide, J.A. Gregorio, F. Vallejo and J.M. Prellezo

This paper presents The design of a new adaptive cut-through router for torus networks is presented in this paper. With much lower VLSI costs than adaptive wormhole routers, the adaptive Bubble router is even faster than deterministic wormhole routers based on virtual channels. This has been achieved by combining a low-cost deadlock avoidance mechanism for cut-through networks called Bubble flow control with an adequate selection of router's arbiter. A thorough methodology has been employed to quantify the impact that this router design has at all levels, from its hardware cost to the system performance when running parallel applications. At the VLSI level, our proposal is the adaptive router with smallest clock cycle and node delay when compared with other state-of-the-art alternatives. This translates into the lowest latency and highest throughput under standard synthetic loads. At system level, these gains reduce the execution time of the benchmarks considered. Compared with current adaptive wormhole router, the execution time is reduced up to 27%. Furthermore, this is the only router that improves system performance when compared with the simpler static designs.

In Journal of Parallel and Distributed Computing. Vol 61 - n. 9, September 2001 pp. 1180-1208

The paper is available as PDF 744 K

Improving Parallel System Performance by Changing the Arrangement of the Networks Links

V. Puente, C. Izu, J.A. Gregorio, R. Beivide, J.M. Prellezo and F. Vallejo

The Midimiew network is an excellent contender for implementing the communication subsystem of a multiprocessor computer. This network is an optimal 2D topology in the sense there are no other symmetric direct networks of degree 4 with a lower average distance or diameter. In fact, it reduces the diamter of the well known torus by the square root of 2. Although the topology was proposed and analyzed a decade ago the lack of simple deadlock avoidance mechanism prevented its utilization up to date. This study solved this drawback by applying the Bubble switching mechanism, a low cost deadlock avoidance strategy developed by the authors. Moreover, by using routing tables we can configure our Virtual Cut-through adaptive router to implement either a torus or a Midimew network. Thus, we can exploit the topological advantages of Midimew networks by simply changing the disposition of the wrap-around connections of its torus counterpart, without increasing the network implementation cost. To prove this assertion we have carried out a thorough evaluation from the hardware cost of the router to the parallel systyem performance under real loads.

In Proceedings of the 2000 International Conference on Supercomputing. Santa Fe, N ew Mexico. May 2000 pp. 44-53

The paper is available as PDF 228 K

Adaptive Bubble Router: a Design to Improve Performance in Torus Networks

V. Puente, J.A. Gregorio, J. M. Prellezo, R.Beivide,J. Duato, and C. Izu

A router design for torus networks that significantly reduces message latency over traditional wormhole routers is presented in this paper. This new router implements virtual cut-through switching and fully-adaptive minimal routing. Packet deadlock is avoided by providing escape ways governed by Bubble flow control, a mechanism that guarantees enough free buffer space in the network to allow continuous packet movement. Both deterministic and adaptive Bubble routers have been designed in VLSI using VHDL synthesis tools. Adopting a fair quantitative comparison, we demonstrate that Bubble routers exhibit a reduction in base latency values over 40% with respect to the corresponding wormhole routers, without any penalty on network throughput. With much lower VLSI costs than adaptive wormhole routers, the adaptive Bubble router is even faster than deterministic wormhole routers based on virtual channels.

In Proceedings of the 1999 International Conference on Parallel processing. Aizu-Wakamatsu City, Japan. September 1999

The paper is available as PDF 348 K

Impact of the Head-of-Line Blocking on Parallel Computer Networks: Hardware to Applications

V. Puente, J.A. Gregorio, C. Izu and R. Beivide

A fully adaptive router with hybrid buffer at the input and output channels was designed, which improves the throughput of its input buffer counterpart by up to 40% and has only 10% higher base latency. An in-depth analysis of different router buffer organization was carried out for a toroidal network which uses either a deterministic (DOR) or a fully adaptive routing scheme. Each proposal is described in VHDL and evaluated with the Synopsys synthesis tool. Technological restrictions obtained were used to evaluate network performance under both synthetic loads and real applications.

In Proceedings of the EURO-PAR'99 Parallel Processing. Toulouse, France. August 1999 pp. 1222-1230

The paper is available as PDF 364 K

Low-level Router Design and its Impact on Supercomputer System Performance

V. Puente, J.A. Gregorio, C. Izu, R. Beivide and F. Vallejo

Supercomputer performance is highly dependent on its interconnection subsystem design. In this paper we study how different architectural approaches for router design impact into system performance for real parallel applications. A thorough methodology has been employed to quantify this impact. Architectural router decisions have been chosen taken into account the constraints of the underlying VLSI technology. After that, an exhaustive evaluation of the interconnection network under standard synthetic traffic has been carried out. Finally, an execution-driven simulation environment has been used to asses the consequences of low-level design on the performance of the entire machine. We will show that low-level decisions as the adequate selection of router's arbiter significantly reduces the execution time of parallel applications. To illustrate the effects of the router architecture in sytem performance two benchmarks were selected: Radix and MP3D.

In Proceedings of the 1999 International Conference on Supercomputing . Rhodes, Greece. June 1999

The paper is available as PDF 792 K


<------------ CONGESTION CONTROL ------------->

A throughput fairness injection protocol for mesh and torus networks

C. Izu

Direct networks suffer significant network unfairness under non-uniform heavy loads; nodes near high traffic areas are hardly able to inject new packets while nodes at low traffic areas may inject packets at high rates. The reported average throughput does not reflect the large differences in node throughput amongst the network. Age-based arbitration has been proposed to achieve latency fairness and in doing so it will indirectly improve throughput fairness as well. However, as throughput fairness is only an issue at heavy loads, we would like a less intrusive method that can be applied to a range of network designs without limiting their routing or arbitration policies. This paper presents a simple injection fairness protocol that guarantees all nodes can inject at a similar rate at high loads regardless of their location. Tests with a variety of non-uniform loads will prove the success of the protocol for a range of network sizes.

In 16th International Conference on High Performance Computing, HiPC 2009, December 16-19, Kochi, India. IEEE 2009

The paper is available as PDF 856 KB

Improving the performance of large interconnection networks using congestion-control mechanisms

J. Miguel-Alonso, C. Izu, J.A. Gregorio

Interconnection networks in current parallel systems do not only increase in size; their buffer capacity and number of source ports have increased as well. All these factors result in a significant rise of network congestion compared with their predecessors. Consequently, packet injection must be restricted in order to prevent throughput degradation at high loads. This work evaluates, via simulation, three congestion control mechanisms on adaptive cut-through torus networks, using two different deadlock-avoidance methods, under various synthetic traffic patterns. Workload is generated using bursts of data exchanges (instead of a Bernoulli process) to reflect the synchronized nature of data interchanges in parallel applications. Results show that large networks perform their best when most network resources are dedicated to in-transit traffic. Besides, local congestion control mechanisms are nearly as effective as the more costly global ones for both uniform and non-uniform traffic patterns.

Performance Evaluation Volume 65 Issue 3-4, March, 2008.

Effects of Injection Pressure on Network Throughput

C. Izu J. Miguel-Alonso, J.A. Gregorio

Recent parallel systems use multiple injection ports and various injection policies, but little is known about their impact on network performance. This paper evaluates the influence that these injection interfaces have on maximum sustained throughput in adaptive cut-through torus networks by modeling the number of injection queues (1 or 4), and the allocation of new packets to those queues. Network evaluations for medium to large size 2D tori show that designs with multiple injection ports do not improve performance under uniform traffic. On the contrary, they result in more pressure from the injection interface to acquire the scarce network resources of an already clogged system. Interestingly, for small networks, a single injection FIFO queue, with the HOLB it entails, indirectly provides the much needed injection control. For networks with thousands of nodes and multiple injection channels, as those being implemented in current massively parallel processors, this implicit form of congestion control is not enough. In such systems, restrictive injection policies are required to prevent routers from being flooded with new packets for loads beyond saturation.

In 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP'06), pp. 91-98, 2006.

The paper is available as PDF 792 K

Understanding Buffer Management for Cut-Through 1D Rings

C. Izu and R. Beivide

This paper describes the impact that buffer management has on network performance for a cut-through 1D ring. Such network provides only one routing alternative between each pair of nodes. The network is kept simple to illustrate the significance of buffer management for all virtual cut-through networks. Besides, 2D and 3D torus, among other networks, can be seen as a collection of interlinked 1D rings. The simulation results will show that the key to maximum sustained throughput is to reserve half of our buffer capacity for transit packets. This increases link utilization by maximising packets at the sending node and empty buffers at the receiving node.

In Proceedings of the EURO-PAR'04 Parallel Processing. Pisa, Italy. August 2004 pp. 908-915

The paper is available as PDF 792 K

Restrictive Turning: Deadlock freedom and Congestion Control for Oblivious Cut-through Networks

C. Izu

Congestion control techniques which restrict message injection into the network in the presence of network congestion were initially proposed in the context of computer networks and they have been recently applied with success to interconnection networks. Cut-through networks are more prone to congestion than their wormhole counterparts due to their increase buffer population. This paper compares diverse mechanisms that limit packet population in cut-through 2D tori. Congestion control mechanisms avoid deadlock for oblivious routing. Results showed that a buffer occupancy around 50\% results on high peak throughput, and sustains that peak value for loads above saturation.

In 4th Australasian Computer Architecture Conference (ACAC'99) Auckland, January 1999. pp. 251-262.

The paper is available as PDF 200 K

Congestion Control for a k-ary n-cube Static Network with Bimodal Traffic

C. Izu, A. Arruabarrena

This paper studies the impact that congestion control has in network performance under bimodal traffic. Congestion control techniques which restrict the injection of packets into the network in the presence of network congestion were initially proposed in the context of computer networks and they have been recently applied with success to interconnection networks. Two static networks are studied: a cut-through network which packetizes traffic, and a segment network which splits bimodal traffic into two virtual networks (VNs), one for short messages and one for long messages. Both networks exhibited some throughput degradation for loads above their saturation points. Different congestion control policies are investigated and both networks show that the more restrictive the policy the higher the sustained throughput.

In 5th Australasian Conference on Parallel and Real-time Systems (PART'98) Adelaide, September 1998. pp. 84-95.

The paper is available as PDF 180 K

"Restricted Injection Flow Control for k-ary n-cube Networks

C. Izu, C. Carrion, J.A. Gregorio and R. Beivide

The interconnection network is a critical component of a large parallel computer. Network throughput above saturation is an important parameter, because many parallel programs are either bandwidth-limited or bursty and work above the saturation point for significant periods of their execution time. Limiting the injection of new packets when saturation is reached has been proposed in the context of computer networks as an effective policy for reducing throughput degradation at high loads. In this paper we define two new flow control mechanisms based on restricting packets from filling up network resources. When applying these new mechanisms to a low dimensional $k$-ary $n $-cube network, such as the 2D torus, we obtain two families of routers which are evaluated by means of simulation. Varying the restriction level of the flow control policy has an immediate effect on the maximum network throughput. The more restrictions to avoid oversaturation, the faster the transit packets move within the network. On the other hand, a too restrictive policy can reduce the offered load to levels below saturation, limiting the network throughput. We also compare the restricted injection routers with its wormhole and cut-through counterparts.

In Proc. of the ISCA 10th Int. Conf. on Parallel and Distribu ted Computing Systems (PDCS'97), New Orleans, October 1997, pp. 511-518.

The paper is available as PDF 180 K


<--------- OTHER ROUTER DESIGN ISSUES ---------->

High-performance adaptive routing for networks with arbitrary topology

V. Puente, , J.A. Gregorio, F. Vallejo, R. Beivide, and C. Izu

A strategy to implement adaptive routing in irregular networks is presented and analyzed in this work. A simple and widely applicable deadlock avoidance method, applied to a ring embedded in the network topology, constitutes the basis of this high-performance packet switching. This adaptive router improves the network capabilities by allocating more resources to the fastest and most used virtual network, thus narrowing the performance gap with regular topologies. A thorough simulation process, which obtains statistically reliable measurements of irregular network behavior, has been carried out to evaluate it and compare with other state-of-the-art techniques. In all the experiments, our router exhibited the best behavior in terms of maximum/sustained performance and sensitivity to the network topology.

Journal of Systems Architecture, Volume 52, Issue 6, June 2006, Pages 345-358.

Applying Segment to k-ary n-cube Networks

C. Izu, A. Arruabarrena

Communications in a multicomputer system consist of heterogeneous traffic in which messages exhibit a variety of sizes. Network response is highly dependent on message length distribution, as reported in most network evaluation studies. Hence, router design should be optimized for dealing with heterogeneous traffic. This study analyzes the interaction amongst short and long messages in networks with bimodal length traffic: both single packet traffic (20 flits) and multipacket messages (200 flits). Router design determines the access of both traffic classes to the network resources so we have considered multiple alternatives, from the generic CT or WH static router to variations of the segment router proposed in \cite{Kon94}, which maps the two traffic classes into two separate virtual networks. Having independent injection queues for each virtual network and adjusting the channel multiplexing policy to favour short or long messages provides good performance and added flexibility when compared to its cut-through counterpart.

In Proc. of the 12th ACM International Conference on Supercomputing, ICS'98, Melbourne July 1998, pp. 409-416.

The paper is available as PDF 208 K

Ghost Packets: A Deadlock-free Solution for k-ary n-cube Network

C. Carrion, C. Izu, J.A. Gregorio, F. Vallejo and R. Beivide

In Proceedings or the Sixth Euromicro Workshop on Parallel and Distributed Processing. Madrid, Spain. January 1998 pp.133-139

The paper is available as PDF 404 K

"A Fault-tolerant Static Cut-through Torus Network

C. Izu, A. Arruabarrena

Fault tolerance is an important topic in massive parallel systems as the probability of one failure increases with the number of components. The routing scheme plays an important role in network reliability as it determines the resources to be used i.e. the network links, and in doing so it should avoid the possibility of selecting a faulty one. This work introduces a fault-tolerant routing scheme for torus networks based on oblivious routing which support 1-link failures and multiple link failures as far as they occur on different rows or columns. The implementation of this scheme only requires a minor change to the oblivious router. In a non faulty network, XY routing is applied so the network response is not altered. In the presence of a faulty link, the network remains functional with an expected reduction in network throughput.

In 2nd Australasian Computer Architecture Conference (ACAC'97) Sydney, February 1997. pp. 171-184.

The paper is available as PDF 212 K

Impact of message router structure on mesh interconnection networks

J.A. Gregorio, C. Izu, R. Beivide

In Procs of the Australasian Conference on Parallel and Real-time Systems (PART'95) Fremantle, Sept 1995 pp. 29-36.

Packet Multiplexing: an Efficient Router implementation for Adaptive Mesh Networks

C. Izu, R. Beivide, A. Arruabarrena and J.A. Gregorio

In Procs IEEE First Int. Conference on Algorithms and Architectures for Parallel Processing, Brisbane, April 95 pp. 603-608

A Performance Evaluation of Adaptive Routing in Bidimensional Cut-Through Networks

A. Arruabarrena, R. Beivide C. Izu and J. Miguel

In Parallel Processing Letters, vol 3, n. 4, 1993, pp. 469-484.

Analysis and Evaluation of Message Management Functions for Bidimensional Networks

C. Izu, A. Arruabarrena and R. Beivide

In Microprocessors and Microsystems. v.16, n.6, 1992, pp. 301-309.


<---- NETWORK TOPOLOGY AND GRAPH THEORY ---->

Dense Gaussian Networks: Suitable Topologies for On-Chip Multiprocessors

C. Martinez, E. Vallejo, R. Beivide, C. Izu and M. Moreto

This paper explores the suitability of dense circulant graphs of degree four for the design of on-chip interconnection networks. Networks based on these graphs reduce the Torus diameter in a factor of 1/ sqrt(2), which translates into significant performance gains for unicast traffic. In addition, they are clearly superior to Tori when managing collective communications.
This paper introduces a new two-dimensional node's labeling of the networks explored which simplifies their analysis and exploitation. In particular, it provides simple and optimal solutions to two important architectural issues: routing and broadcasting. Other implementation issues such as network folding and scalability by using hierarchical networks are also explored in this work.

To appear in International Journal of Parallel Programming, Vol. 33, No. 3, June 2006.

The paper is available as PDF 860 K

Chordal Topologies for Interconnection Networks

Ramon Beivide , Carmen Martinez, Cruz Izu, Jaime Gutierrez, Jose Angel Gregorio and Jose Miguel-Alonso

The class of dense circulant graphs of degree four with optimal distance-related properties is analyzed in this paper. An algebraic study of this class is done. Two geometric characterizations are given, one in the plane and other in the space. Both characterizations facilitate the analysis of their topological properties and corroborate their suitability for implementing interconnection networks for distributed and parallel computers. Also a distance-hereditary non-disjoint decomposition of these graphs into rings is computed. Besides its practical consequences, this decomposition allows us the presentation of these optimal circulant graphs as a particular evolution of the traditional ring topology.

In High Performance Computing: 5th International Symposium, ISHPC 2003. Tokyo-Odaiba, pp. 385-392, Japan, October 2003.

Distance-Hereditary Embeddings of Circulant Graphs

Carmen Martinez, Ramon Beivide, Jaime Gutierrez and Cruz Izu

In this paper we present a distance-hereditary decomposition of optimal chordal rings of $2k^{2}$ nodes into a set of rings of $2k$ nodes, where $k$ is the diameter. All the rings belonging to this set have the same length and their diameter corresponds to the diameter of the chordal ring in which they are embedded. The members of this embedded set of rings are non-disjoint and preserve the minimal routing of the original circulant graph. Besides its practical consequences, our research allows the presentation of these optimal circulant graphs as a particular evolution of the traditional ring topology.

in IEEE 4th International Conference on Information Technology - Coding and Computing, Las Vegas, April 2003

The paper is available as PDF 70 K


<---- SIMULATION OF INTERCONNECTION NETWORKS ---->

Realistic Evaluation of Interconnection Network Performance at High Loads

F. J. Ridruejo Perez, J. Navaridas, J. Miguel-Alonso and C. Izu

Any simulation-based evaluation of an interconnection network proposal requires a good characterization of the workload. Synthetic traffic patterns based on independent traffic sources are commonly used to measure performance in terms of average latency and peak throughput. As they do not capture the level of self-throttling that occurs in most parallel applications, they can produce inaccurate throughput estimates at high loads. Thus, workloads that resemble the varying levels of synchronization of actual applications are needed to study the performance of interconnection networks. One approach is to use simple, burst-synchronized synthetic workloads that emulate the self-throttling of many parallel applications. To validate this approach, we compare the gains achieved by a restrictive injection mechanism under this workload with those obtained using traces from the NAS Parallel Benchmarks. This study confirms that the burst-synchronized traffic model provides reasonable performance estimates, which could be improved by taking into account dependency chains between messages.

in Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT 2007) pp. 97-104.

Throughput fairness in k-ary n-cube networks

C. Izu

The performance of an interconnection network is measured by two metrics: average latency and peak network throughput. Network throughput is the total number of packets delivered per unit of time. Most synthetic network loads consist of sources injecting at the same given rate, using traffic patterns such as random, permutations or hot spot, which reflect the distribution of packet destinations in many parallel applications. The network is assumed to be fair: all source nodes are able to inject at the same rate. This work will show such assumption is unfounded for most router proposals. All router designs exhibited significant network unfairness under non-uniform loads. Some routers are also unfair under random traffic patterns. At loads above saturation, if the channel utilization is uneven, the injection matrix will become uneven: packet at low used areas will be accepted at a higher rate that those at the busy areas. As synthetic traffic does not reflect the coupled nature of the traffic generated by parallel applications, the impact of this unfairness on application performance could not be measured. New synthetic loads need to be developed to better evaluate network response beyond saturation.

in proceedings of the 29th Australasian Computer Science Conference (ACSC2006), CRPIT volume 48, pp. 137-145.

The paper is available as PDF 2 M

Evaluation of Interconnection Network Performance Under Heavy Non-uniform Loads

C. Izu, Jose Miguel and J.A. Gregorio

Many simulation-based performance studies of interconnection networks are carried out using synthetic workloads under the assumption of independent traffic sources. We show that this assumption, although useful for some traffic patterns, can lead to deceptive performance results for loads beyond saturation. Network throughput varies so much amongst the network nodes that average throughput does not reflect anymore network performance as a whole. We propose the utilization of burst synchronized traffic sources that better reflect the coupled behavior of parallel applications at high loads. A performance study of a restrictive injection mechanism is used to illustrate the different results obtained using independent and non-independent traffic sources.

In Proceedings of the 6th ICA3PP, Melbourne, October 2005, LNCS 3719 pp 396-405

The paper is available as PDF 426 K

A Case Study of Trace-driven Simulation for Analyzing Interconnection Networks: cc-NUMAs with ILP Processors

V. Puente, J.M. Prellezo, C. Izu, J.A. Gregorio and R. Beivide

In Proceedings of the IEEE 8th Euromicro Workshop on Parallel and Distributed Processing. Rhodes, Greece. January 2000

The paper is available as PDF 240 K

Parallel Simulation of Message Routing Networks

J. Miguel, A. Arruabarrena, C. Izu and R. Beivide

In Procs 3d Euromicro Workshop on Parallel and Distributed Processing (PDP'95) Sanremo, Italy Jan 1995. 138-145


<------------ MAD POSTMAN ROUTER ------------->

Mad-Postman: a Look-ahead Message Propagation Method for Static Bidimensional Meshes

C. Izu, R. Beivide and C. Jesshope

In Proc. 2nd Euromicro Workshop on Parallel and Distributed Processing. Malaga, Jan 1994. pp. 117-124.

The MP1 Network Chip and its application to parallel computers

C. R. Jesshope and C. Izu

In BCS Computer Journal, vol 36, n. 8, 1993, pp. 763-777.

"Experimental Evaluation of Mad-postman Bidimensional Routing Networks

C. Izu, R. Beivide, C. Jesshope and A. Arruabarrena

In 19th Euromicro Conference, Open Systems Design, 1993, pp. 33-41.

"The MP1 Network Chip"

C. R. Jesshope and C. Izu

Proc. Euromicro Workshop on Parallel and Distributed Processing. Las Palmas, January 93, pp. 338-348.

<-------------- ENGINEERING EDUCATION --------------->

Engaging weak programmers in problem solving

B. Alexander. and C. Izu,

As with many schools attracting international students, our postgraduate degrees must cater for students with diverse backgrounds and skill-levels. It is not practical to accurately assess students' domain-skills prior to enrollment. Thus, we found a need for a compulsory bridging course with the dual objectives of improving the problem solving and programming skills of the weakest computer graduates whilst still challenging and improving the skills of experienced programmers. This paper describes our experience in refining such a course over six semesters, in order to meet these goals.

Education Engineering (EDUCON), 2010 IEEE , pp.997-1005, 14-16 April 2010.